
Characteristics and technical route analysis of big data storage and application

In the era of big data, data volumes are exploding. From the perspective of storage services, the demand for data storage capacity keeps growing on the one hand, while higher requirements are placed on effective data management on the other. Big data raises the bar for the capacity, read/write performance, reliability and scalability of storage devices, and requires full consideration of functional integration, data security, data stability, system scalability, performance and cost.

Analysis of the Characteristics of Big Data Storage and Application

"Big Data" is a data set composed of massive data with complex structure and many types. It is an intellectual resource and knowledge service ability formed by data integration, sharing and cross-multiplexing based on cloud computing. Its common characteristics can be summarized as 3V: quantity, speed and change (large-scale, high speed and diversity).

(1) Big data is characterized by huge volume and rapid growth. Data scale has risen from the PB level to the EB level, continues to expand with practical applications and secondary development by enterprises, and is rapidly heading toward the ZB (zettabyte) scale. Take Taobao, the largest e-commerce company in China, as an example. According to Taobao's own figures, by the end of 2011 its peak single-day independent user visits exceeded 120 million, up 120% over the same period of 2010, with more than 400 million registered users. Taobao generates 400 million pieces of commodity information every day, and its daily volume of active data has exceeded 50 TB. A big data storage or processing system must therefore not only meet the current data-scale demand but also offer strong scalability to keep up with rapid growth.
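As a quick back-of-envelope check on what such growth implies for capacity planning, here is a minimal Python sketch. The 50 TB/day figure echoes the Taobao example above; the annual doubling rate is a hypothetical assumption for illustration, not a figure from the article:

```python
# Back-of-envelope: how fast 50 TB/day of active data accumulates.
TB = 10**12  # bytes (decimal terabyte)

daily_active = 50 * TB
per_year = daily_active * 365
print(f"One year of active data: {per_year / 10**15:.1f} PB")  # ~18.3 PB

# If daily volume were to double each year (hypothetical growth rate),
# storage capacity planning would have to double with it:
volume = daily_active
for year in range(1, 6):
    volume *= 2
    print(f"Year {year}: {volume / TB:.0f} TB/day")
```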

(2) Storing and processing big data is not only a matter of scale; it also demands fast response in transmission and processing.

Compared with the small-scale data processing of the past, processing large-scale data in a data center requires high throughput from the service cluster, so that massive data can complete its tasks within a time that is "acceptable" to application developers. This is a requirement not only on computing performance at each application level, but also on the read/write throughput of the big data storage management system. For example, when a user buys products of interest on a website and the site recommends relevant advertisements based on the user's purchase or browsing behavior, the application must give real-time feedback. Likewise, when a data analyst at an e-commerce site provides merchants with recommended product keywords based on the popular terms shoppers search for in the current season, the machine learning algorithm must produce accurate recommendations within a few days despite hundreds of millions of daily visit records, or the recommendations lose their value. Or consider taxis on urban roads: the big data processing system must continually suggest better routes based on GPS feedback and real-time road information from monitoring equipment. All of these require the big data application layer to fetch massive data from storage media at the highest possible speed and bandwidth. On the other hand, data is also exchanged between the mass-data storage management system and traditional database management systems or tape-based backup systems. Although this exchange can be done offline, the sheer data scale means that low transmission bandwidth reduces transfer efficiency and creates a data migration bottleneck. The storage and processing speed, or bandwidth, of big data is therefore a key performance indicator.
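To see why bandwidth becomes the bottleneck during such offline migration, here is a rough calculation in Python. The 50 TB dataset size echoes the example above; the link speeds are hypothetical round numbers:

```python
# Illustrative: migration time for a 50 TB dataset at various link speeds.
TB = 10**12
data_size = 50 * TB

for gbps in (1, 10, 40, 100):
    bytes_per_sec = gbps * 10**9 / 8  # convert bits/s to bytes/s
    hours = data_size / bytes_per_sec / 3600
    print(f"{gbps:>3} Gb/s link: {hours:6.1f} hours")
# 1 Gb/s -> ~111 hours; 100 Gb/s -> ~1.1 hours
```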

(3) Because it comes from many different sources, big data is characterized by diversity.

Diversity here refers to the variety of data structures, storage formats and storage media. Traditional databases store structured data with a regular format. Big data, by contrast, comes from logs, historical data, user behavior records and so on; some of it is structured, but more is semi-structured or unstructured, which is one important reason traditional database storage technology cannot cope with big data. The diversity of storage formats follows directly from the diversity of sources and application algorithms: data structures differ and formats vary. For example, some data is stored as text files, some as web files, and some as serialized bit-stream files. Diversity of storage media refers to hardware compatibility. Big data applications must meet different response-time requirements, so data management favors a tiered mechanism: real-time or streaming data can be served directly from memory or flash (SSD); offline batch processing can be built on storage servers with many disks; some data can live on traditional SAN or NAS network storage devices; and backups can even go to tape drives, as sketched below. A big data storage or processing system must therefore be compatible with a variety of data types and software and hardware platforms, to support diverse application algorithms and extract-transform-load (ETL) processes.
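Here is a minimal Python sketch of that tiered-routing idea: pick the cheapest storage medium that still meets an access-latency requirement. The tier names, latency thresholds and function are hypothetical illustrations, not any real product's API:

```python
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    max_latency_ms: float  # worst access latency this tier can serve

# Ordered fastest to slowest, mirroring memory/SSD -> disk -> SAN/NAS -> tape.
TIERS = [
    Tier("memory/SSD", 1),
    Tier("local disk cluster", 100),
    Tier("SAN/NAS", 1_000),
    Tier("tape archive", 3_600_000),  # offline access, hours acceptable
]

def choose_tier(required_latency_ms: float) -> str:
    """Pick the slowest (cheapest) tier that still meets the latency need."""
    for tier in reversed(TIERS):
        if tier.max_latency_ms <= required_latency_ms:
            return tier.name
    return TIERS[0].name  # stricter than any threshold: use the fastest tier

print(choose_tier(0.5))    # memory/SSD   (real-time / streaming)
print(choose_tier(500))    # local disk cluster (offline batch)
print(choose_tier(10**7))  # tape archive (backup)
```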

Big data storage currently follows three typical technical routes:

The first is new database clusters based on MPP (massively parallel processing) architecture. Focused on industry big data, these adopt a shared-nothing architecture and support analytical applications through data processing techniques such as columnar storage and coarse-grained indexing, combined with the efficient distributed computing of the MPP architecture. They mostly run on low-cost PC servers, offer high performance and high scalability, and are widely used in enterprise analytics.

MPP products can effectively support the analysis of PB-scale structured data, something beyond the reach of traditional database technology. For the new generation of data warehouses and structured data analysis, an MPP database is the best choice.
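A toy Python example of why the columnar storage mentioned above helps analytical queries: an aggregate over one column only has to touch that column's data. The table and field names are illustrative; real MPP databases implement this layout natively on disk:

```python
# Row store: each record is kept whole, so an aggregate over `amount`
# still scans every field of every row.
rows = [
    {"user_id": 1, "region": "east", "amount": 120.0},
    {"user_id": 2, "region": "west", "amount": 75.5},
    {"user_id": 3, "region": "east", "amount": 240.0},
]
row_total = sum(r["amount"] for r in rows)

# Column store: the same table held as one array per column; the query
# reads only the `amount` array, which is also easy to compress.
columns = {
    "user_id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [120.0, 75.5, 240.0],
}
col_total = sum(columns["amount"])

assert row_total == col_total == 435.5
```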

The second is expansion and encapsulation based on Hadoop, from which related big data technologies are derived. This route targets data and scenarios that traditional relational databases handle poorly, such as the storage and computation of unstructured data, and takes full advantage of Hadoop's open-source ecosystem. As the related technology matures, its application scenarios will keep expanding. The most typical current scenario is to support the storage and analysis of Internet big data by extending and encapsulating Hadoop. Dozens of NoSQL technologies are involved here, and they continue to subdivide. The Hadoop platform excels at processing unstructured and semi-structured data, complex ETL workflows, and complex data mining and computational models.
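As a sketch of the programming model behind this route, here is a word count written in plain Python following the MapReduce map-shuffle-reduce flow that Hadoop popularized. A real Hadoop job runs distributed over HDFS; this single-process version only illustrates the phases, and all names are illustrative:

```python
from collections import defaultdict
from itertools import chain

def mapper(line: str):
    """Map: emit (word, 1) for each word in a line of unstructured text."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

lines = ["big data storage", "big data processing"]
pairs = chain.from_iterable(mapper(line) for line in lines)
result = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(result)  # {'big': 2, 'data': 2, 'storage': 1, 'processing': 1}
```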

The third is the big data appliance: a combination of software and hardware designed specifically for analyzing and processing big data. It consists of an integrated set of servers, storage devices, an operating system, a database management system, and pre-installed, pre-optimized software for data query, processing and analysis. A high-performance big data appliance offers good stability and vertical scalability.
