Why is HBase a column-oriented database?

Before getting to HBase, a few words first. Anyone who builds Internet applications knows that you cannot predict when traffic will spike or how many users you will face. You may have few users today and many tomorrow, and then the system cannot cope and goes down. Isn't that a sad story for all of us, a "tragedy" in the fashionable phrase?

To put it bluntly, this happens because the most important requirements were not recognized in advance. From an architecture point of view, Internet applications care most about performance and scalability, while traditional enterprise applications care most about data integrity and data security. So let's talk about the scalability of Internet applications. I have written several blog posts on scalability, and anyone interested can refer to those. I won't cover the scalability of the web server and app server here, because that part is relatively easy. Instead I will mainly review how a gradually growing Internet application deals with scalability at the database layer.

At the very beginning there are few users and little load, so a single server is enough: the web server, app server and DB server are all stuffed onto one machine. As the number of users grows, so does the load, and at that point you can separate the web server, application server and database server onto their own machines. That holds for a while, but as users keep coming you find the database struggling: it is always slow and sometimes it falls over, so it is time to find the database some partners.

This is where master-slave replication comes in. One master server takes the write operations, and several slave servers are dedicated to reads. Reads and writes are finally separated, the master stops complaining, and the pressure eases. This step mainly scales reads horizontally, overcoming the CPU bottleneck by adding more slaves. The system can now cope with a fair amount of load, but as users keep growing you find that the write load on the master is still too high.

What then? You have to split; as the saying goes, "only splitting makes you scalable". First comes what we usually call "vertical partitioning" of the database: unrelated data is stored in different databases that are deployed separately, which takes away part of the read and write load and lets the master relax a little. But as the data keeps growing, individual tables become very large and queries on them become slow. Then "horizontal partitioning" is needed, for example splitting the user table every 100,000 (10W) rows so that no single table holds more than 100,000 rows.
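The two manual steps above, read/write splitting and horizontal partitioning, can be sketched roughly as follows. This is only an illustration: the names `ReadWriteRouter` and `user_table_for`, the `user_N` table naming, and the round-robin choice of slave are all assumptions made for the example, not something the setup above prescribes.

```python
import itertools

ROWS_PER_TABLE = 100_000  # the "10W" threshold used in the text


class ReadWriteRouter:
    """Send writes to the master and spread reads over the slaves (hypothetical helper)."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; fall back to the master if there are none.
        self._slaves = itertools.cycle(slaves) if slaves else None

    def route(self, is_write):
        if is_write or self._slaves is None:
            return self.master
        return next(self._slaves)


def user_table_for(user_id, rows_per_table=ROWS_PER_TABLE):
    """Return the name of the table that holds this user id (assumed naming scheme)."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return f"user_{user_id // rows_per_table}"


router = ReadWriteRouter("master", ["slave1", "slave2"])
print(router.route(is_write=True))   # writes always go to the master
print(router.route(is_write=False))  # reads rotate over the slaves
print(user_table_for(250_000))       # table chosen for user id 250000
```

Every query in the application has to pass through logic like this, which is a large part of why hand-rolled partitioning hurts.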

To sum up, a popular website goes through a painful journey from a single DB to master-slave replication, then to vertical partitioning, and then to horizontal partitioning. The principle of splitting a database sounds simple, but actually doing it is another matter; I think everyone who has sharded a database has suffered for it. For articles on database scaling, see the resources listed at the end.

From all the rambling above, we can see how painful it is to scale at the database storage layer. Fortunately technology keeps improving, and others in the industry are working hard too. In 2009 a whole batch of NoSQL databases appeared, or more accurately, non-relational databases. Most of them provide transparent horizontal scaling for unstructured data, which greatly reduces the design burden on all of us. HBase, the subject of this post, is one such distributed column-oriented storage system.

One. What is HBase?

Before explaining what HBase is, let's look at two concepts: row-oriented storage and column-oriented storage. Row-oriented storage should be familiar to everyone, since the RDBMSs we know are of this type. Row-oriented databases are mainly suited to workloads with strict transactional requirements; in other words, row-oriented storage systems suit OLTP. According to the CAP theorem, however, a traditional RDBMS achieves strong consistency by synchronizing through strict ACID transactions, which greatly reduces the availability and scalability of the system. Many NoSQL products today are eventually consistent systems that sacrifice some consistency for high availability. (HBase is often lumped into this group, although it actually provides strongly consistent reads and writes on a single row.) That was row-oriented storage; so what about column-oriented storage? HBase, Cassandra and Bigtable are all distributed storage systems of the column-oriented kind. If you still don't quite see what HBase is, never mind; let me summarize:

HBase is a distributed, column-oriented storage system. Its strengths are high-performance concurrent reads and writes, and transparent partitioning of data, which makes the storage itself horizontally scalable.

Two. HBase data model

The data models of HBase and Cassandra are very similar, because the ideas behind both come from Google's Bigtable. The only difference is that Cassandra has the concept of a super column family, which, as far as I have found, HBase does not. Without further ado, let's look at what the HBase data model is.

There are two main concepts in HBase: the row key and the column family. Let's look at the column family first. Column families are predefined before the system starts, and each column family can hold many columns, distinguished by a "qualifier". An example will make this clear.

Suppose the system has a user table. In a traditional RDBMS the columns of the user table are fixed: the schema defines attributes such as name, age and gender, and user attributes cannot be added dynamically. With a column storage system such as HBase, we define the user table and then define an info column family. User data can then be stored as info:name = Zhang San, info:age = 30, info:gender = male, and so on. If you want to add another property later, it is easy: you just write info:newProperty.
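One way to picture this model is as nested maps: row key, then column family, then qualifier. The sketch below is a toy in-memory stand-in, not the HBase client API; the class name `SketchTable` and its `put`/`get` methods are made up for illustration. The point it tries to show is that the family must be declared up front while qualifiers can be invented at write time.

```python
class SketchTable:
    """Toy model of a table: rows[row_key][family][qualifier] = value."""

    def __init__(self, name, families):
        self.name = name
        self.families = frozenset(families)  # fixed when the table is defined
        self.rows = {}

    def put(self, row_key, column, value):
        family, _, qualifier = column.partition(":")
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are not declared anywhere; any new one is accepted.
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, column):
        family, _, qualifier = column.partition(":")
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)


users = SketchTable("user", families=["info"])
users.put("u001", "info:name", "Zhang San")
users.put("u001", "info:age", "30")
users.put("u001", "info:newProperty", "anything")  # a column nobody planned for
print(users.get("u001", "info:name"))
```

Writing to a family that was never declared, for example `other:name`, raises an error in this sketch, which mirrors the rule that only the column family is predefined.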

Perhaps that example is not clear enough, so here is another. Anyone familiar with SNS sites knows the friends' feed. We usually model a feed as "someone did something, titled XXX, at a certain time", but we also reserve a few extra fields: sometimes a feed needs a url, sometimes an image attribute. The attributes of a feed are therefore not fixed, which is very awkward to handle in a traditional relational database. Moreover, a relational database wastes space on empty cells, while column storage does not have this problem: in HBase, a cell that has no value takes up no space. Two pictures show the difference vividly:

The picture above shows the Feed table as designed in a traditional RDBMS. The number of columns is fixed and cannot be extended, and empty columns waste space. Now look at the following figure, which shows the data model of HBase, Cassandra and Bigtable. Here the columns of the Feed table can be added dynamically, and empty columns are not stored, which saves a great deal of space. The key point is that as the system runs, more and more kinds of feeds will appear, and we cannot predict in advance how many kinds there will be. So there is no way to fix the number of columns in the Feed table, and the column-oriented data model of HBase, Cassandra and Bigtable suits this scenario very well. Another important advantage of HBase is that the Feed table is split automatically: when the data in the table exceeds a certain threshold, HBase splits it for us. This makes queries scalable, and combined with HBase's weak transaction guarantees, it makes writes to HBase very fast.
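The automatic splitting can be pictured with a small toy model. This is a heavy simplification under stated assumptions: the class `SketchRegions` is invented for the example, it counts rows rather than stored bytes, and it splits at the middle key, whereas the real threshold and split policy are configuration details not covered here.

```python
import bisect


class SketchRegions:
    """Keep sorted row keys in contiguous regions; split a region when it grows too big."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.regions = [[]]  # regions are ordered by key range; each is a sorted list

    def _region_index(self, row_key):
        # The first key of every region after the first acts as its lower bound.
        starts = [region[0] for region in self.regions[1:]]
        return bisect.bisect_right(starts, row_key)

    def put(self, row_key):
        i = self._region_index(row_key)
        region = self.regions[i]
        pos = bisect.bisect_left(region, row_key)
        if pos < len(region) and region[pos] == row_key:
            return  # key already present
        region.insert(pos, row_key)
        if len(region) > self.max_rows:
            mid = len(region) // 2
            # Replace the oversized region with its two halves.
            self.regions[i:i + 1] = [region[:mid], region[mid:]]


table = SketchRegions(max_rows=4)
for n in range(10):
    table.put(f"feed{n:02d}")
for region in table.regions:
    print(region[0], "...", region[-1], f"({len(region)} rows)")
```

The caller only ever calls `put`; the splitting happens behind its back, which is the contrast with the manual horizontal partitioning described earlier.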

The column family has been covered above, so what about the row key mentioned earlier? You can think of the row key as the primary key of a row in an RDBMS. But since HBase does not support conditional queries, ORDER BY and the like, the row key must be designed around the query requirements of your system. Take the feed example again. Usually we query someone's latest feeds, so the row key of the Feed table can be composed of three parts: <userId><timestamp><feedId>. That way, when we want to query someone's latest feeds, we can specify Start Rowkey as
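As a rough illustration of why a <userId><timestamp><feedId> key suits this query: if row keys are kept in sorted order, all feeds of one user sit next to each other, ordered by time, so a range scan bounded by the user prefix finds them. The sketch below works on a plain sorted Python list rather than on HBase, and the zero-padding widths, the "~" upper bound and the function names are assumptions chosen for the example.

```python
import bisect


def make_row_key(user_id, timestamp, feed_id):
    # Zero-padding makes lexicographic order match numeric order (widths are arbitrary).
    return f"{user_id:010d}{timestamp:013d}{feed_id:010d}"


def latest_feeds(sorted_keys, user_id, limit):
    """Return up to `limit` row keys of one user, newest first."""
    start = f"{user_id:010d}"  # smallest possible key for this user
    stop = start + "~"         # "~" sorts after every digit, so it bounds the user's range
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi][::-1][:limit]


keys = sorted([
    make_row_key(1, 1000, 1),
    make_row_key(1, 2000, 2),
    make_row_key(2, 1500, 3),
    make_row_key(1, 3000, 4),
])
for key in latest_feeds(keys, user_id=1, limit=2):
    print(key)
```

No condition on the columns is evaluated anywhere; the ordering of the keys does all the work, which is why the key has to be designed around the query.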

Three. Advantages and disadvantages of HBase

1 Columns can be added dynamically, and an empty column stores no data, which saves storage space.

2 HBase partitions data automatically, so storage scales out horizontally on its own.

3 HBase supports highly concurrent reads and writes.

Disadvantages of HBase:

1 Conditional queries are not supported; you can only query by row key.

2 Failover of the Master is not yet supported. When the Master goes down, the entire storage system hangs.

Four. Supplement

1. Data types: HBase has only a simple string type (strictly speaking, uninterpreted byte arrays). Handling any other type is left to the user, and HBase itself just stores the strings. Relational databases offer rich data types and storage options.

2. Data operations: HBase has only simple operations such as insert, query, delete and truncate. Tables are independent of each other and there are no complex relationships between them, whereas traditional databases usually offer all kinds of functions and join operations.

3. Storage model: HBase stores data by column. Each column family is kept in several files, and the files of different column families are separate. Traditional relational databases store data by table structure, row by row.

4. Data maintenance: an update in HBase should not really be called an update. It actually inserts new data, whereas a traditional database replaces and modifies data in place.

5. Scalability: distributed databases such as HBase were developed for exactly this purpose, so hardware can be added or removed easily and fault tolerance is high. Traditional databases usually need an extra middle layer to achieve something similar.
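Point 4 above, that an update is really an insert, can be sketched with an append-only cell that keeps every written value together with a timestamp. The class `VersionedCell` is a made-up illustration, not HBase's storage format, and it ignores details such as how many old versions are retained.

```python
class VersionedCell:
    """Append-only cell: every put adds a (timestamp, value) pair, nothing is overwritten."""

    def __init__(self):
        self._versions = []

    def put(self, timestamp, value):
        self._versions.append((timestamp, value))

    def get(self, as_of=None):
        """Return the newest value, or the newest one written at or before `as_of`."""
        candidates = [v for v in self._versions if as_of is None or v[0] <= as_of]
        if not candidates:
            return None
        return max(candidates, key=lambda tv: tv[0])[1]


age = VersionedCell()
age.put(100, "30")
age.put(200, "31")         # the "update" is just another insert
print(age.get())           # newest value
print(age.get(as_of=150))  # the older value is still there
```

Because a write only appends, it never has to find and rewrite an existing record first.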