How to choose a data platform construction scheme
First, the business runs well and the systems run stably, so why should an enterprise build a data platform at all?
Keep that question in mind rather than asking it out loud. Let me answer it directly: under what circumstances does a company generally need to build a data platform to reorganize all kinds of data?
From a business perspective:
1. There are too many business systems and their data is not connected. Data analysis then becomes troublesome: analysts may need to extract data from several systems and integrate it by hand before any analysis can start. Doing that once or twice is bearable, but doing it every day is not. Manual integration also has a high error rate, and the analysis ends up slow and inefficient. Shouldn't that be dealt with?
From the system point of view:
2. The business system is under heavy pressure, and data analysis is unfortunately a resource-hungry job. The natural idea is to extract the data so that an independent server handles the query and analysis workload and relieves the pressure on the business system.
3. Performance problems. As the company grows, so does its data, whether through the accumulation of history or the addition of new content. When the original platform can no longer handle the volume, or has become too slow, it is time to build a new big data processing platform.
I have listed three situations, but they are not independent; two or even all three often appear at the same time. A data platform can take over the pressure of data analysis, integrate business data, and improve data processing performance to varying degrees, and richer functional requirements can then be built on top of it.
Second, what are the options for building a data platform?
The advantages and disadvantages below are given from the perspective of an enterprise making a choice, not from the technical perspective of the schemes themselves.
If I had to answer in one word, it would be: many (admittedly that is not much of an answer). There really are a lot of options, and I cannot introduce them all one by one, so I have grouped them into the following categories, which I believe cover the needs of most enterprises to some extent.
1. Conventional data warehouse:
I won't dwell on the concept itself. Since you are in the data business, I believe you know it better than I do; if not, Baidu it. Its focus is on integrating data and sorting out business logic. Although it can also be packaged into cubes, as with SSAS, to improve read performance, the role of a data warehouse is more about solving the company's business problems than just performance problems. I will come back to this point later.
To put the advantages and disadvantages of this scheme bluntly:
Advantages:
The scheme is relatively mature. Both of the classic warehouse architectures, Inmon and Kimball, have a very wide range of applications, and plenty of practitioners can implement either of them.
Implementation is straightforward: the technical work is mainly warehouse modeling and ETL processing, which many software companies are able to deliver. The difficulty of implementation depends more on the complexity of the business logic than on the technology.
Flexibility, which needs a concrete scene: the construction of a data warehouse is transparent, so the warehouse model and the ETL logic can be modified when requirements change (although it is of course best to think things through at design time). For the analysis layer on top, processing warehouse data through SQL or MDX is also very flexible, as sketched after this list.
Disadvantages:
"Long implementation period", please note that I put quotation marks corresponding to the following agile data mart, and this is relative. The length of the implementation cycle depends on the complexity of the business logic, and the time is spent sorting out the business logic, not the technical bottleneck. On this point, it will be introduced in detail later.
Limited data processing capability, and again this is relative. It cannot handle massive or non-relational data, but data below the TB level is still manageable (depending on the database system adopted), and the data of a considerable number of enterprises never exceeds that level.
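To make the flexibility point above concrete, here is a minimal sketch of the kind of star schema a conventional warehouse exposes once the business logic has been sorted out. All table and column names are invented for illustration; the point is only that analysts can slice the integrated data with plain SQL.

```sql
-- Hypothetical star schema: one fact table plus a dimension table.
CREATE TABLE dim_customer (
    customer_key   INT PRIMARY KEY,
    customer_name  VARCHAR(100),
    region         VARCHAR(50)
);

CREATE TABLE fact_sales (
    sale_id        BIGINT PRIMARY KEY,
    customer_key   INT REFERENCES dim_customer(customer_key),
    sale_date      DATE,
    amount         DECIMAL(12, 2)
);

-- A typical flexible analysis query: monthly sales by region.
SELECT d.region,
       DATE_TRUNC('month', f.sale_date) AS sale_month,
       SUM(f.amount)                    AS total_amount
FROM fact_sales f
JOIN dim_customer d ON d.customer_key = f.customer_key
GROUP BY d.region, DATE_TRUNC('month', f.sale_date)
ORDER BY sale_month, d.region;
```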
2. Business-oriented agile data marts:
These products bind the underlying data layer to the analysis layer, so the application layer can directly drag and drop data from the underlying store. The original intention of this kind of product is to integrate business data simply and quickly, model in an agile way, and greatly improve data processing speed, and current products have largely achieved those goals. Their advantages and disadvantages are equally obvious.
Advantages:
Simple deployment and agile development are the biggest advantages of this kind of product. Compared with a data warehouse, the implementation cycle is much shorter; in fact there is no strict concept of implementation at all, because such products only touch the data that needs to be analyzed, only the immediate problem has to be considered, and the iteration capability is stronger.
They integrate well with the analysis tools on top. Once an analysis tool connects to this kind of data product, graphical display and OLAP analysis of the data are directly available. To improve processing performance, these products all optimize data analysis in some way, whether through memory-mapped file storage, distributed architecture, or columnar storage, and there is no doubt that they improve data processing performance to some degree.
Disadvantages:
First of all, they are not free.
They cannot handle complex business logic; this is just a tool, and a tool cannot solve business problems by itself. Such tools have simple built-in ETL functions for basic data processing and integration, but once you have to consider the logic and relationships between historical data and the overall data, they fall short. A simple example: a table has two fields, one of which must preserve its history while the other must be overwritten with the latest value; how do you automate that? (A rough sketch of how a warehouse ETL might handle this appears after this list.) One concept needs to be clear: you cannot expect a single tool to solve business problems. This kind of data product is just a simple integration of current business data: first, the data is local, and second, it is limited to the present (its built-in incremental or full refresh cannot cope with complex logic, and anyone familiar with ETL knows how complicated that process is). Of course, for some companies, integrating and analyzing current business data may be enough; frankly, many companies are too busy to think about longer-term problems, and who can say for sure what tomorrow brings?
Low flexibility, which is inevitable: the simpler the tool, the more limited its flexibility. Because the product is sealed and opaque, it is very convenient for conventional requirements, but when things get complicated you cannot modify what you cannot see inside, and there is nothing you can do about it.
In my opinion, such a product can hardly become the company's data center.
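To illustrate the historical-data example above, here is a minimal sketch of how a warehouse ETL might keep one attribute's history (slowly changing dimension type 2) while simply overwriting another (type 1). All table and column names are invented for illustration, and staging_customer stands for the data freshly extracted from the business system.

```sql
-- Hypothetical customer dimension: 'credit_level' keeps its history,
-- 'phone' is always overwritten with the latest value.
CREATE TABLE dim_customer_scd (
    customer_id   INT,
    credit_level  VARCHAR(20),   -- history must be preserved (SCD type 2)
    phone         VARCHAR(20),   -- only the latest value matters (SCD type 1)
    valid_from    DATE,
    valid_to      DATE,
    is_current    BOOLEAN
);

-- Type 1: overwrite the phone number on the current row.
UPDATE dim_customer_scd d
SET    phone = s.phone
FROM   staging_customer s
WHERE  d.customer_id = s.customer_id
  AND  d.is_current;

-- Type 2: close the old row when the credit level changes ...
UPDATE dim_customer_scd d
SET    valid_to = CURRENT_DATE, is_current = FALSE
FROM   staging_customer s
WHERE  d.customer_id = s.customer_id
  AND  d.is_current
  AND  d.credit_level <> s.credit_level;

-- ... and open a new current row for customers without one.
INSERT INTO dim_customer_scd
SELECT s.customer_id, s.credit_level, s.phone,
       CURRENT_DATE, DATE '9999-12-31', TRUE
FROM   staging_customer s
LEFT JOIN dim_customer_scd d
       ON d.customer_id = s.customer_id AND d.is_current
WHERE  d.customer_id IS NULL;
```

This is exactly the kind of logic an agile data mart's built-in refresh cannot express, but a warehouse ETL handles routinely.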
3. Data products with an MPP (massively parallel processing) architecture, taking the recently open-sourced Greenplum as an example:
The traditional mainframe computing model is weak in the face of massive data: the cost is very high, and it still cannot meet the needs of high-performance computing. An SMP architecture is hard to scale out, and a single host's CPU power and I/O throughput cannot keep up with massive data computation. Distributed storage and distributed computing are the key to solving this problem, and both the MapReduce computing framework and the MPP computing framework emerged against this background.
Greenplum's database engine is based on PostgreSQL; through its interconnect component it makes many PostgreSQL instances in the same cluster cooperate efficiently and compute in parallel.
A data platform built on Greenplum addresses two levels at once. The obvious one is data processing performance: Greenplum's own material claims support for processing 50 PB of data, and even allowing for exaggeration, real-world deployments at around 100 TB are common today. The other is that a data warehouse can be built inside Greenplum, which also sorts out business logic and integrates the company's business data.
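As a rough illustration (the table below is invented), building a warehouse table in Greenplum looks almost like ordinary PostgreSQL DDL; the main additions are a distribution key that spreads rows across the segments and, optionally, append-optimized columnar storage:

```sql
-- Hypothetical fact table: rows are hash-distributed across all segments
-- by customer_id, and stored column-oriented with compression.
CREATE TABLE fact_orders (
    order_id     BIGINT,
    customer_id  INT,
    order_date   DATE,
    amount       NUMERIC(12, 2)
)
WITH (appendonly = true, orientation = column, compresstype = zlib)
DISTRIBUTED BY (customer_id);

-- Ordinary SQL; Greenplum parallelizes the scan and aggregation
-- across segments automatically.
SELECT customer_id, SUM(amount) AS total_amount
FROM   fact_orders
WHERE  order_date >= DATE '2016-01-01'
GROUP BY customer_id;
```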
Advantages:
Support for massive data, backed by a large number of mature application cases; I think this is beyond doubt.
Scalability: it is said to scale linearly to 10,000 nodes, with query and loading performance improving linearly as nodes are added.
Ease of use: there are no complex tuning requirements, and parallel processing is handled automatically by the system. SQL as the interface language remains simple, flexible, and powerful.
Advanced features: Greenplum has developed many advanced data analysis and management capabilities, such as external tables, the primary/mirror protection mechanism, and mixed row/column storage.
Stability: Greenplum has a long history as a pure commercial data product, so its stability is better assured than that of agile data marts and other newer products. It has many application cases: Nasdaq, NYSE, Ping An Bank, China Construction Bank, and Huawei have all built data analysis platforms on Greenplum, which speaks indirectly for its stability. After it was open-sourced in 2015, the major Internet companies were also enthusiastic, and the customers I have been in contact with who use Greenplum all speak highly of it.
Disadvantages:
Positioned in the OLAP field, it is not good at OLTP transaction processing. Of course, a company's data center would not be used as a transaction system anyway.
Cost, in two senses. One is hardware cost: Greenplum has recommended hardware specifications, with requirements on memory and network cards. Hardware choice is always a balance among performance, capacity, and cost; you cannot blindly pursue performance and scare the purchasing department. The other is implementation cost, mainly people: installing and configuring Greenplum and then building the warehouse inside it take both people and time. (That said, the software itself is open source, which does save some money.)
Technical threshold: compared with the agile data marts above, Greenplum's threshold is definitely higher.
4. Hadoop distributed system architecture:
Hadoop hardly needs an introduction; it is extremely popular, and even Greenplum's open-sourcing is related to it. It offers high reliability, high scalability, high efficiency, and high fault tolerance, and it is widely used in the Internet industry by companies such as Yahoo, Facebook, Baidu, and Taobao. The Hadoop ecosystem is huge, and what companies build on it is not limited to data analysis; it also includes machine learning, data mining, real-time systems, and so on.
When a company's data reaches a certain order of magnitude, I think Hadoop is the first choice for most large enterprises. At that scale, the company is solving not only performance problems but also timeliness problems and the need for more complex analysis and mining functions, and typical real-time computing systems are closely tied to the Hadoop ecosystem.
In recent years the usability of Hadoop has improved greatly, and a large number of SQL-on-Hadoop technologies have emerged, including Hive, Impala, and Spark SQL. Although they work in different ways, they are usually better than raw file-based MapReduce in both performance and ease of use, which in turn puts pressure on the market for MPP products.
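As a small illustration of SQL-on-Hadoop (the table and path below are invented), Hive lets you lay a schema over files already sitting in HDFS and query them with ordinary SQL, with the engine turning the query into distributed jobs behind the scenes:

```sql
-- Hypothetical external table over log files already stored in HDFS.
CREATE EXTERNAL TABLE access_logs (
    log_time   STRING,
    user_id    BIGINT,
    url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/access_logs/';

-- Ordinary SQL; Hive compiles it into MapReduce, Tez, or Spark jobs.
SELECT url, COUNT(*) AS pv
FROM   access_logs
GROUP BY url
ORDER BY pv DESC
LIMIT 10;
```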
For enterprises building a data platform, Hadoop's advantages and disadvantages are both obvious. The advantages are its big data processing capability, high reliability, high fault tolerance, open source, and low cost (low because you only have to imagine what any other scheme would cost to handle data of the same scale). The disadvantages are that the system is complex and the technical threshold is high (companies that can operate Hadoop well are generally large).
In practice, Hadoop's pros and cons have little influence on the platform choice: when you need Hadoop, there is usually no other option (everything else is either too expensive or simply cannot cope), and until you reach that amount of data nobody wants to touch it. In short, don't do big data for the sake of big data.
Third, with so many schemes, how should an enterprise choose?
Every environment is different and complicated, but I think at least the following aspects should be considered.
1. Purpose:
What is the purpose? Is it one of the three situations at the beginning of the article (there must be other situations, and additions are welcome), or a combination of several of them?
It is the same as anything else: even going out for lunch, you have a purpose in mind. Do you want to eat well, or to please someone else? Only then do you choose what to eat.
Of course, clarifying the purpose of building a data platform is not so easy, and the original intention may not match the goal settled on after discussion.
The company's original intention may be very simple: relieve the pressure on the business system, pull the data out, and then analyze it. If the purpose really is that simple, there is no need for a grand project. For a single independent system, just replicate the business database directly; for multiple systems, an agile business data product such as finecube is enough: build a model quickly, connect to it directly with finebi or finereport, and get data visualization and OLAP analysis.
But since you have decided to split off a data platform, shouldn't you think further? Why not take the opportunity to sort out and integrate the data of the multiple systems? You only need to analyze current business data today, but will historical data matter later? Can the agile solution still support the needs of next year and the year after?
Building a data platform is not a trivial matter for any company. Spending an extra month or two on implementation may feel exhausting, but spending an extra week or two thinking it through is always possible. As Lei Jun put it, don't use tactical diligence to cover up strategic laziness.
2. Data volume:
Choose a scheme that matches the company's data scale; there is little more to say here.
3. Cost:
This includes both time and money, which needs no elaboration. One thing is worth mentioning, though: I find that many companies either never get around to a data platform, or, once they have a plan, cannot wait to stand the platform up immediately and begrudge any time spent on it. That situation is easy to underestimate and makes it easy to be misled by implementers.
My scheme selection suggestions cover the following scenarios.
Scenario a:
The goal is rapid extraction and analysis of business data from multiple business systems; the data volume is nowhere near massive, historical data is not a concern, and there is no need to systematically model the data according to business logic. In this case, the underlying data layer of an agile BI tool is worth considering.
Simply put, this scenario only completes data integration and acceleration at the technical level; it does not model the data at the business level. It can satisfy certain analysis needs, but it cannot serve as the company's data center.
Scenario b:
A company-level data center is needed, with the data of the various systems connected. Obviously a data warehouse has to be built, and the next consideration is the company's data volume. If the volume is small, below the TB level, building the warehouse in a traditional database is enough. If it reaches tens or hundreds of TB, or will within the next few years, build the warehouse in Greenplum.
This scenario should fit most companies: for most of them the data volume will never reach the PB level and more often stays below the TB level.
Scenario c:
The company's data is growing explosively and the original data platform can no longer handle the volume, so Hadoop is the suggested big data platform. Since it must serve as the company's data center, a warehouse is still needed, and the original warehouse can be migrated directly to Hive. Because Hive's interactive query performance is poor at this volume, ad hoc queries can be served by Impala or by Greenplum: Impala's concurrency is limited, while Greenplum happens to have external tables (a table created in Greenplum whose contents are actually read from Hive on Hadoop), which integrate well with Hadoop (though external tables are of course not mandatory). A sketch of such an external table follows.
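A hedged sketch of what such an external table can look like; the names are invented, and the exact protocol depends on the Greenplum version (the PXF extension in newer releases, gphdfs in older ones):

```sql
-- Hypothetical Greenplum external table whose data actually lives in a
-- Hive table on Hadoop, accessed here through the PXF Hive profile.
CREATE EXTERNAL TABLE ext_hive_orders (
    order_id     BIGINT,
    customer_id  INT,
    amount       NUMERIC(12, 2)
)
LOCATION ('pxf://default.orders?PROFILE=Hive')
FORMAT 'CUSTOM' (FORMATTER = 'pxfwritable_import');

-- Ad hoc queries in Greenplum can now join warehouse tables with
-- data that still resides in Hive.
SELECT COUNT(*) FROM ext_hive_orders;
```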
Scenario d:
This one was added later. The company already has a data warehouse, but too much historical data has accumulated and analysis performance has dropped. What to do? Two schemes are worth considering. For the long term, migrate the warehouse and its data to Greenplum to form a new, independent data platform, which creates more possibilities. Alternatively, connect a faster, more agile data product such as finecube to the existing warehouse to improve data processing performance and meet the analysis requirements.
Fourth, possible misunderstandings in scheme selection.
Ignoring the complexity of the business, and trying to solve or bypass business logic with a tool.
This is a case I ran into recently. The customer wanted a reporting platform that integrates data from three business systems but, eager for quick results, did not want to build a traditional data warehouse and chose an agile BI tool instead. Vendors' descriptions of such data products generally emphasize rapid implementation, performance optimization, and basic ETL functions, which easily leads customers to the misconception that the product can quickly build a company-level data center and satisfy top-level data needs.
Only later did it become clear that the tool merely removes complexity at the technical level: it packages ETL and the data mart together and improves data performance, but it does not model the data at the business level, and many details simply cannot be handled.
Agile development is attractive, and if the business systems are simple, or you only need to analyze current business data without a company-level data center, it is a very good solution. But if those questions have not been thought through, expecting too much from an agile product will cause trouble later.
There is also the trap of doing big data for the sake of big data, though I have not actually encountered it in my own work.
Finally, to sum up: enterprises choose data platforms for different reasons, and a reasonable choice requires fully considering the purpose of building the platform and fully understanding the various schemes.
Personally, at the data level I still prefer flexible solutions, because the data center is too important to the company. I want it to be transparent and fully under my own control, so that I can make full use of it, because I do not know what role it will need to play in the future.
I hope this helps.