QingCloud's Li Wei: What are the unique challenges of building a big data platform on the cloud?
Li Wei: Hello, I am Li Wei, a system engineer at QingCloud. The topic today may be a bit technical and may require some thought. It is divided into a few parts. First, the relationship between cloud computing and big data. Second, the unique challenges of building a big data platform on the cloud. Third, what a fairly general system architecture for a big data platform looks like. Finally, I will share some of our own best practices around big data, drawn from our customers and from running on the cloud.
Rather than going through many generic big data examples, I will talk about some of our enterprise customers. The first is a very large multinational social networking company. They use our big data platform on the cloud, together with some specific technologies, to do things like user profiling. That is why, on a social network, the friends recommended to you happen to be people you may know, and the content recommended to you happens to be what you are interested in. All of that is user profiling done with big data.
Second, a very large Internet finance company uses big data for risk-control analysis. Internet finance can compete with traditional finance precisely because it can use big data technology to keep risk very low. Think about it: on a P2P lending platform there is none of the traditional bank process where all kinds of people investigate you, and there is no collateral, yet you can still borrow money. Another example is massive information retrieval for government departments: they need to combine data from departments across the country, so that when a suspect may have left traces in data held in various places, those traces can be searched, mined, and analyzed.
Big data is very hot, but what exactly is its relationship with cloud computing? When we hear "big data" today, it likely means something different to everyone: some people mean a big data platform, some mean a big data product, and some mean a particular big data application, such as AlphaGo.
Especially in the enterprise, when we talk to customers, what they first want to understand is the big data products. What they find difficult is that there are too many big data products and technologies, and the differences between scenarios are not so obvious. So the first thing to solve is how to choose a big data solution and how to build one for the enterprise. On top of that, each enterprise's needs vary a great deal: many traditional enterprises are not very clear about their big data needs, while Internet companies' needs change very quickly. Building a big data platform the traditional way can cost a lot in time, labor, and money. On a cloud platform, with IaaS, PaaS, and SaaS, everything eventually becomes a service. Building a very complex solution costs less because you only have to assemble services, which is very flexible; if you find a problem with one part of the solution, you can replace it very quickly, because much of it is services on the platform. So it fits the uncertainty and the elasticity of your business, because we all know how fast things change now.
Second, what other benefits does cloud computing bring to big data? For example, it can automate operations and maintenance: complex installation, deployment, and monitoring no longer have to be done by hand; they can be done very quickly and simply through the interface. There are also stability and performance, which I will not dwell on; the benefits of cloud computing are well known, so let me mention a few interesting ones.
For example, switching the network, storage, or compute engine is quite interesting. When your platform is complex and large enough, and each component is a service, you can replace any component very flexibly with another product or another technology. The other is service orchestration. You may have interfaces or tools for drawing all kinds of architecture diagrams, but they have a fatal shortcoming: the diagram you draw is not executable; it cannot be deployed or run. Service orchestration gives you a large topology that can actually be executed. This is also a product QingCloud released earlier this year, called resource orchestration: you can deploy a whole architecture onto the cloud platform from it. These are some of the benefits of being on the cloud.
Now, the challenges of big data platforms on the cloud. Many organizations build big data platforms on physical machines; why not on the cloud? Because there are real challenges. First, stability: high availability and disaster recovery. Second, performance, which has long been criticized, because a virtual machine's network and disks are presumably not as fast as a physical machine's. On stability, QingCloud's IaaS layer has been running for several years, so there is not much to say. On the network side, last year we released software-defined networking 2.0, a set of SDN developed specifically for a large IaaS platform; point-to-point network transmission can reach the speed of the physical network card. On the disk side, which has long been a sore point, we use container technology to push disk overhead very low. The third piece is migration; migration technology is now quite mature, for example for relational and non-relational databases.
After solving these challenges, we arrive at a big data platform system architecture, and this architecture is actually very common. You may see basically the same thing at many companies, whether JD.com, Meituan, or Amazon. Reading from the left, it is really a data lifecycle: where the data is collected from, perhaps logs, perhaps sensors, then into the core platform in the middle. The bottom layer is IaaS; all of QingCloud's PaaS services are built on top of the IaaS layer, all on the cloud. The first thing is storage. In the middle are three blocks: the first is real-time computing, namely Storm (although what Twitter has since released claims to be stronger than Storm); the second is batch processing; the third is Big SQL, including things like Kylin. On the right is everything used across all the platforms, including data management, monitoring, security, and a component used as a distributed configuration center.
After all the data is stored and computed, you want to use it in a genuinely user-friendly way, so we generally hand the data to some more interactive technical components, so that at the top layer, whether it is reporting or visualization, things become a little easier, like the parts of the Hadoop ecosystem that are popular for visualization.
The diagram I have just described basically covers the core of the big data lifecycle, and the most mainstream products and technologies are all in it; QingCloud's own big data platform is also based on this architecture.
Next, I will walk through this architecture piece by piece. First, computing. The most classic computing framework is Hadoop; this figure does not need much explanation. If you study big data, one point worth mentioning is that from 2.0 onwards HDFS has high availability, and resource management was handed over to YARN, which improves performance. The second computing framework is Spark, with what are now mainstream features. If you do real-time computing, Storm is definitely the first choice. MapReduce has very high latency but also very high throughput, and it leans heavily on disk; Spark Streaming computes in memory, so its latency is acceptable. If you already have some Hadoop ecosystem experience and do not require strict real-time processing, Spark may be the better choice, because Spark is itself a strong platform and is developing very quickly; we do run into customers with very demanding requirements. Second, in Big SQL, a couple of things to mention: one is Phoenix, which wraps HBase behind a SQL interface; the other is the MPP family.
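As a small illustration of the batch side just mentioned, here is a minimal word-count sketch in PySpark; the input path and the Spark setup are assumptions for illustration, not something from the talk.

```python
# Minimal PySpark word count, a sketch assuming a local Spark installation.
# The input path "hdfs:///logs/app.log" is a placeholder, not a real dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///logs/app.log")          # read lines from HDFS
counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # pair each word with 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word

for word, count in counts.take(10):                  # print a sample of the result
    print(word, count)

spark.stop()
```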
Storage. The first is HDFS. To begin with, it is designed for large files, not for massive numbers of small files. If you want to handle massive numbers of small files, the QingCloud platform has object storage, which we designed to store files of any type and any size. Why can HDFS not store massive numbers of small files? The reason is very simple: like inodes in Linux, every file has an index (metadata) entry, and that entry is roughly the same size no matter how large or small the data file is. When you store a huge number of small files, the files themselves are not very large, but the metadata overhead severely affects performance; the result is that the storage space is far from used up, yet performance has already become unusable.
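A rough back-of-the-envelope sketch of that small-file problem follows; the roughly 150 bytes of NameNode memory per namespace object is the commonly cited rule of thumb, and the file counts are invented for illustration.

```python
# Rough estimate of HDFS NameNode memory pressure from small files.
# ~150 bytes per file/block object is a commonly cited rule of thumb;
# the two scenarios below (same ~1 TB of data) are invented examples.
BYTES_PER_OBJECT = 150  # approx. NameNode heap per file or block object

def namenode_overhead(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)   # one file object plus its blocks
    return objects * BYTES_PER_OBJECT

few_large = namenode_overhead(num_files=8_000)          # ~128 MB per file
many_small = namenode_overhead(num_files=100_000_000)   # ~10 KB per file

print(f"8k large files:    ~{few_large / 1e6:.1f} MB of NameNode heap")
print(f"100M small files:  ~{many_small / 1e9:.1f} GB of NameNode heap")
```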
The second mainstream storage system is HBase. HBase is architected on top of HDFS; it can store very wide tables and also very tall tables, with all the table data distributed across the nodes (in reality it is much more complex than this architecture diagram). You can roughly think of it as corresponding to the concept of a table. I do not know whether any of you have looked at HBase; it can be puzzling at first, because it is columnar storage, unlike the database solutions you have seen before. Its definition is actually very simple, the sentence in the second line at the top: a sparse, distributed, multi-dimensional, persistent sorted map. Sparse means many cells are empty; HBase's storage format solves this, so you can store a sparse table. Second, distributed needs no explanation. In this figure you can also see the concept of a timestamp: for example, a record first has a Row Key, then Column Families, and then a version number.
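To make the row key / column family / version idea concrete, here is a small sketch using the happybase Python client; the host, table name, and column family "cf" are assumptions for illustration.

```python
# Sketch of HBase's data model via the happybase client (Thrift gateway assumed).
# Host, table name, and column family "cf" are placeholders for illustration.
import happybase

connection = happybase.Connection("hbase-thrift-host")   # placeholder host
table = connection.table("user_profile")                 # placeholder table

# A cell is addressed by (row key, columnfamily:qualifier) and carries a timestamp.
table.put(b"user#1001", {b"cf:city": b"Beijing", b"cf:age": b"29"})

row = table.row(b"user#1001")                # latest version of each cell
print(row[b"cf:city"], row[b"cf:age"])

# Ask for older versions of one cell to see the timestamp (version) dimension.
versions = table.cells(b"user#1001", b"cf:city", versions=3, include_timestamp=True)
print(versions)                              # [(value, timestamp), ...]

connection.close()
```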
On storage selection, which I just touched on: how do you choose? At the beginning you will certainly hear many people say HBase must be faster than HDFS; statements like that are irresponsible, because it all depends on the scenario. For example, the Hadoop/HDFS path is fast for full file scans, whereas HBase is fast for random reads and writes, so again it is divided by scenario. In the middle there is Kudu; yesterday a customer said they are using Kudu, which is an intermediate option, a storage engine between HDFS and HBase, though we have not yet seen it in large-scale production. At the beginning of this year we also built a data warehouse product on Greenplum Database, which was open-sourced last year. Greenplum's core had already been productized long before that, and we think its biggest benefits are a few. The first is standard SQL: you may see many products on the market claiming to support SQL that are in fact not standard. What does non-standard mean? For example, a lot of the syntax is different, so the more advanced usage that data engineers and data analysts are used to cannot be used. Greenplum Database is different, because it has a core compute engine that we think is stronger than MySQL's, plus many other features.
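Since Greenplum speaks the PostgreSQL protocol, a sketch of the kind of "standard SQL" an analyst might expect could look like this; the connection details, the "orders" table, and the query are all assumptions for illustration.

```python
# Sketch: connecting to Greenplum (PostgreSQL wire protocol) and running standard SQL,
# including a window function that non-standard SQL layers often lack.
# Host, database, credentials, and the "orders" table are placeholders.
import psycopg2

conn = psycopg2.connect(host="gp-master", dbname="analytics",
                        user="analyst", password="secret")
cur = conn.cursor()

cur.execute("""
    SELECT customer_id,
           order_date,
           amount,
           SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date)
               AS running_total
    FROM orders
    ORDER BY customer_id, order_date
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```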
We have talked about compute products and storage products; now data transfer. The most classic is Kafka: distributed, partitioned, multi-replica, low latency. What does low latency mean? These two charts on the left and right look very similar; they essentially show data going into and coming out of Kafka. Kafka is open source, and because our platform provides a Kafka service, this customer is using it now; the charts come from a product they built. The point is that Kafka's latency is very low: data basically does not have to settle anywhere before going straight out.
Why is this possible? There are two very essential reasons: first, when writing data it goes directly into the PageCache; second, data is sent out directly through the Linux kernel, so its latency is very low and its throughput very high. Those are the two core reasons. Kafka's architecture is very simple; it is three loosely coupled parts: the top layer is the producers, in the middle is a cluster of Kafka servers (brokers), and below are the consumers. A producer cluster sends data to the brokers; one piece of data goes into the first Partition, another into the second Partition, and so on. The main concept of a Partition is that when you publish a message to a Topic, inside Kafka that Topic corresponds to several queues (Partitions), and each message lands in one of those queues.
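A minimal producer sketch with the kafka-python client follows; the broker address, topic name, and keys are placeholders for illustration.

```python
# Minimal Kafka producer sketch using the kafka-python client.
# Broker address, topic name, and keys are placeholders for illustration.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker:9092")

# Messages with the same key always land in the same Partition of the Topic.
for i in range(10):
    producer.send("user-events",
                  key=str(i % 4).encode(),         # 4 keys spread over the Partitions
                  value=f"event-{i}".encode())

producer.flush()   # make sure everything has been handed to the brokers
producer.close()
```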
The second cluster is the consumers. The more important point about consumers is the concept of a consumer group, and this concept matters a lot. When you want to multicast a Topic's messages, that is, have them handled by many consumers, you need to create multiple consumer groups; then the same message can be consumed by each of them. If you create only one consumer group, then even if there are several consumers in that group, each message will only be processed by one of them. The second issue is the number of consumers in a group relative to the number of Partitions. Say a Topic has four Partitions: if there are four consumers, it is exactly one-to-one, each consumer consuming one Partition; if there are only two consumers, each will consume two Partitions; both situations are fine. The situation to avoid is, for example, five consumers when the Topic only has four queues, because one consumer will be wasted. That is something to be aware of. A matching consumer sketch appears below.
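The consumer sketch below shows the group concept with kafka-python; the broker address, topic, and group id are placeholders. Running two copies of this script with the same group_id splits the Partitions between them, while a copy with a different group_id receives every message again.

```python
# Minimal Kafka consumer sketch using kafka-python, illustrating consumer groups.
# Broker address, topic, and group id are placeholders for illustration.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="broker:9092",
    group_id="profiling-service",     # consumers sharing this id split the Partitions
    auto_offset_reset="earliest",
)

for message in consumer:
    print(message.partition, message.offset, message.value)
```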
After computation, storage, and data transfer, here are some of the problems we ran into. The first big issue is the replication factor. Why do you not have to think about it on physical machines, yet must consider it specifically on the cloud? The reason is very simple: on the cloud all services are built on IaaS, and the IaaS layer itself provides high availability, meaning the data already has replicas. If you keep the physical-machine practice of three replicas on top of that, think about it: 2 × 3 is 6 copies. So the first idea was to cut the replicas and run with two; that was the solution we thought of at the beginning. But then we realized two replicas is still 2 × 2 = 4, which is still somewhat wasteful in terms of space.
Then we thought about a more advanced solution: we provide a capability at the IaaS layer so the PaaS layer can choose how many replicas it wants; it becomes an option. So for something like big data, or any application that already has its own replica strategy and does not need the IaaS layer's replicas, the IaaS replica policy can be configured according to your own setup, or according to your product's needs, making it behave the same as on physical machines.
Then parameter tuning. A typical big data product or platform has two to three hundred parameters each; that is completely normal. The first important step in tuning is to understand, as far as possible, the relationships between these parameters, not just what each parameter does on its own. Otherwise you tune one and it affects another, or you turn a knob and nothing responds, because you have not figured out the relationships. Like in this figure: the memory allocated inside YARN must stay smaller than what the NodeManager itself offers. Understanding relationships like that is very important when doing performance tuning, as the sketch below illustrates.
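As an illustration of "parameters constrain each other", here is a small sketch that checks a few commonly related YARN/MapReduce memory settings against one another; the property names are real YARN/MapReduce settings, but the values are invented examples.

```python
# Sketch: sanity-checking how several YARN/MapReduce memory parameters relate.
# Property names are real YARN/MapReduce settings; the values are invented examples.
settings = {
    "yarn.nodemanager.resource.memory-mb": 24576,   # memory a NodeManager offers to containers
    "yarn.scheduler.maximum-allocation-mb": 8192,   # largest single container YARN will grant
    "yarn.scheduler.minimum-allocation-mb": 1024,   # allocation granularity
    "mapreduce.map.memory.mb": 4096,                # container size requested per map task
    "mapreduce.map.java.opts.xmx.mb": 3277,         # JVM heap inside that container (~80%)
}

def check(cond, message):
    print(("OK   " if cond else "BAD  ") + message)

check(settings["yarn.scheduler.maximum-allocation-mb"]
      <= settings["yarn.nodemanager.resource.memory-mb"],
      "max container <= NodeManager memory")
check(settings["mapreduce.map.memory.mb"]
      <= settings["yarn.scheduler.maximum-allocation-mb"],
      "map container <= max container")
check(settings["mapreduce.map.java.opts.xmx.mb"]
      < settings["mapreduce.map.memory.mb"],
      "JVM heap < its container, leaving headroom for off-heap memory")
```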
The last important best practice concerns data formats, which many people ignore. But it is very important in big data. Why? Because when the data gets really big, not paying attention to the format leads to problems: performance is likely to drop, and the space you waste can rise dramatically.
There is actually a lot to pay attention to in data formats, so let me pick out two of the more important guidelines. The first is that the data format should be splittable. Formats that support splitting include Avro, Parquet, Lzop + index, and SequenceFile; XML and JSON files do not.
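As a small illustration of these two guidelines, here is a sketch that writes a splittable, block-compressed columnar file (Parquet) from Python; the pyarrow library, the column names, and the output path are assumptions, not something from the talk.

```python
# Sketch: writing a splittable, block-compressed columnar file (Parquet) with pyarrow.
# Column names, values, and the output path are invented for illustration.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1001, 1002, 1003],
    "city":    ["Beijing", "Shanghai", "Shenzhen"],
    "amount":  [12.5, 80.0, 3.2],
})

# Snappy block compression keeps the file splittable for parallel processing.
pq.write_table(table, "events.parquet", compression="snappy")

print(pq.read_table("events.parquet"))   # read it back to confirm the round trip
```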
The second is block compression, which is supported by Avro, Parquet, Lzop + index, and SequenceFile, and not by CSV or JSON records. Think about it: computation on a big data platform is parallel; all the data is split up and each slice is computed separately, which is why block compression is the second guideline. There are actually many more points, such as whether the format supports schema evolution; Avro does, meaning data written with an old version of the schema remains compatible with the new one. SequenceFile is likewise splittable and compressible, but it lives only in the Hadoop ecosystem, unlike Avro and Parquet. Finally, QingCloud is holding its own user conference on July 28th at the Beijing Hotel; we are only responsible for organizing it, and the content is all elites from various industries sharing their own technical know-how and products; that is the form we are taking.