Explain in detail how the idea of big data was formed and its value dimension.

Take economics, for example: the historian Huang Renyu (Ray Huang) found "mathematical management" (that is, quantitative analysis) already widely applied to economic administration in the Song Dynasty (unfortunately, Wang Anshi's reforms came to nothing). Another example is the military. Whether or not the story is literally true, the popular anecdote "learning data mining from Lin Biao" rests on a real foundation of quantitative thinking, one that can be traced back more than 2,000 years: by fabricating the data of his own cooking stoves, cutting them from 100,000 to 50,000 and then to 30,000, Sun Bin exploited Pang Juan's habit of quantitative analysis to lure him into an ambush and kill him.

In the 1950s and 1960s, magnetic tape replaced the punched card and triggered a revolution in data storage. With the disk drive, people soon found that its biggest opening of the imagination was not capacity but random read/write, which freed data workers from linear thinking and began the non-linear expression and management of data. Databases came into being: first hierarchical databases (IMS, designed by IBM for the Apollo moon landing, and still in use at China Construction Bank today), then network databases, and finally the general-purpose relational databases of today. Decision support systems (DSS) grew out of data management and evolved in the 1980s into business intelligence (BI) and the data warehouse, which opened the road to data analysis, that is, to giving data meaning.

At that time the most powerful applications of data management and analysis were in business. The first data warehouse was built for Procter & Gamble, and the first terabyte-scale data warehouse was at Wal-Mart. Wal-Mart had two classic applications. One was supply-chain optimization based on Retail Link, which shared data with suppliers to guide the whole process of product design, production, pricing, distribution and marketing, so that suppliers could optimize inventory and replenish in time. The other was shopping-basket analysis, the oft-repeated "beer and diapers" story. Almost every marketing book tells the beer-and-diapers story as if it were gospel; in fact it was made up by a Teradata manager and never actually happened. Still, as a case used to educate the market first and harvest it later, its effect was positive.
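
To make the basket-analysis idea concrete, here is a minimal sketch of the kind of pair counting behind it (not Wal-Mart's or Teradata's actual method; the receipts and item names are invented):

```python
from itertools import combinations
from collections import Counter

# Toy receipts; in practice these would come from point-of-sale data.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

n = len(baskets)
item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(pair for b in baskets for pair in combinations(sorted(b), 2))

for (a, b), count in pair_counts.items():
    support = count / n                       # P(a and b)
    confidence = count / item_counts[a]       # P(b | a)
    lift = confidence / (item_counts[b] / n)  # how much buying a boosts buying b
    if lift > 1:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

Rules with lift above 1 (such as beer paired with diapers in this toy data) are the kind of co-occurrence the story is about.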

Tesco, second only to Wal-Mart, focused on customer relationship management (CRM): segmenting customer groups, analyzing their behavior and intent, and doing precision marketing.

All of this happened in the 1990s. In the 2000s, scientific research began to generate huge volumes of data, for example from astronomical observation and particle collisions. The "fourth paradigm" was proposed by the database master Jim Gray as a refinement of scientific methodology. The first three paradigms are experiment (Galileo dropping balls from the Leaning Tower), theory (Newton inspired by the apple, forming the classical laws of physics) and simulation (particle acceleration is too expensive and nuclear tests are too dirty, so computation stands in for them). The fourth paradigm is data exploration. This is not actually new: Kepler fitting elliptical orbits to his predecessor's observations of planetary positions was already the data method. But by the 1990s there was so much scientific data that data exploration became a prominent mode of research. Many of today's disciplines therefore come in twin flavors, computational X and X-informatics: the former is the simulation/computation paradigm, the latter the data paradigm, as in computational biology versus bioinformatics. Sometimes "computational X" includes the data paradigm, as in computational sociology and computational advertising.

In 2008, Chris Anderson (the author of the long-tail theory) wrote an article in Wired, "The End of Theory," that caused an uproar. His main point: with enough data there is no need for models, or rather interpretable models become hard to obtain, so theory, as represented by models, is meaningless. Let me say a little about data, models and theory; start with a rough picture.

First, observing the objective world, we collect three data points. From them we can form a theoretical hypothesis about that world, represented by a simplified model, say a triangle. There could be other models too, such as a quadrilateral or a pentagon. As observation deepens, two more points are collected; now the triangle and quadrilateral models are seen to be wrong, so we decide the model is a pentagon, and the world reflected by this model is that pentagon, without our knowing that the real world is in fact a circle.

The problem in the big-data era is that the data are so many and so messy that they can no longer be expressed with simple, clear models; the data themselves become the model. Strictly speaking, data plus applied mathematics (especially statistics) replace theory. Anderson took Google Translate as his example: a unified statistical model replaces the theories/models of individual languages (such as grammar). If it can translate English into French, it can translate Swedish into Chinese, so long as there is corpus data; Google can even translate Klingon (the invented language from Star Trek). So the point about correlation rather than causation was raised by Anderson; Schönberger (hereinafter "Lao She") merely picked up what others had already chewed over.

Of course, the scientific community does not accept the end of theory; it holds that scientists' intuition, causality and interpretability are still essential to human breakthroughs. With data alone, a machine can find the unknown parts hidden within the current map of knowledge, but without models the ceiling of that map is the linearly growing computing power of machines, which cannot extend into genuinely new space. In human history, every leap in the territory of knowledge has been heralded first by geniuses and their theories.

Around 2010 the big-data wave rolled in and these debates were quickly drowned out. Look at Google Trends and you can see the term "big data" jump at about that time. There were several trumpeters. One was IDC, which produces the Digital Universe report for EMC every year and raised the talk to the zettabyte scale (for a sense of scale: today's hard disks are terabytes; 1,000 TB = 1 PB; Alibaba's and Facebook's data run to several hundred PB; 1,000 PB = 1 EB; 1,000 EB = 1 ZB). One was McKinsey, which published "Big Data: The Next Frontier for Innovation, Competition, and Productivity." One was The Economist, one of whose key writers was Kenneth Cukier, co-author with Lao She of The Age of Big Data. Another was Gartner, which pushed the 3Vs (Volume, Variety, Velocity: big, messy, fast); in fact the 3Vs had been coined back in 2001, but they took on a new interpretation in the context of big data.

In China, General Manager Huang and others also began paying attention to big data around 2011.

In 2012, Tu Zipei's book Big Data did a great deal to educate government officials. Lao She and Cukier's The Age of Big Data put forward three big ideas that are now treated as canon, but don't take them as universal truths.

Take the first: "use the full data set, don't sample." Realistically: 1. A complete data set often simply doesn't exist; data sit in silos. 2. The full set is expensive, and given the low information density of big data (a lean ore), the return on investment is not necessarily good. 3. Sampling is still useful for macro-level analysis; Gallup beating million-respondent surveys with 5,000-person samples still has practical significance. 4. Sampling must be random and representative: interviewing passengers on a train and concluding that everyone bought a ticket is not good sampling; landline-only phone surveys no longer work (mobile phones are the majority now); and samples drawn from Twitter abroad are not fully representative either (the elderly are excluded). The drawbacks of sampling are a deviation of a few percentage points and, worse, the loss of black-swan signals. So when a complete data set exists and can be analyzed, the full set is the first choice: full set > good sample > a large but uneven collection.
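
A toy illustration of that trade-off (mine, not from the talk): on a heavy-tailed variable, a Gallup-sized random sample gets the average roughly right but can miss the rare "black swan" records entirely. The data here are simulated under assumed parameters.

```python
import random

random.seed(0)

# Hypothetical "full" data set: mostly small values plus a rare heavy tail.
population = [random.expovariate(1.0) for _ in range(1_000_000)]
population[::200_000] = [5_000.0] * 5        # a handful of "black swan" records

sample = random.sample(population, 5_000)    # a Gallup-sized random sample

def mean(xs):
    return sum(xs) / len(xs)

print("full mean   :", round(mean(population), 3))
print("sample mean :", round(mean(sample), 3))
print("black swans in full data :", sum(x > 1_000 for x in population))
print("black swans in sample    :", sum(x > 1_000 for x in sample))
```

The sample mean is close to the truth, but the five extreme records almost never show up in the sample, which is exactly the "lost black swan signal" point above.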

Next, "embrace messiness, don't insist on exactness." Accepting messiness (an objective fact) is the right attitude, but that doesn't mean you should like it; data cleaning matters more than ever, and data that have lost identifiability and validity should be thrown away. Lao She cites Google's conclusion that a small amount of high-quality data plus a complex algorithm was beaten by a large amount of low-quality data plus a simple algorithm as proof of this thinking. That was indeed Peter Norvig's work, but it concerned web text analysis; Google's deep-learning results have since shown it is not entirely right. For data with rich information dimensions, such as speech and images, you need both a lot of data and complex models.

Finally, "correlation, not causation." Correlation is useful for large numbers of small decisions, such as Amazon's personalized recommendations; for small numbers of major decisions, causation still matters greatly. Traditional Chinese medicine is a case in point: it reached only the correlation stage, with no explanation, and could not establish that a particular bark or insect shell was what caused the cure. Western medicine, once a correlation is found, runs a randomized controlled trial, removing every interfering factor that might produce the "curative effect," and thereby obtains causality and interpretability. The same goes for business decisions: correlation is only the start; it replaces guesswork and gut-feel hypotheses, and the work of verifying causation remains important.
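
A toy simulation of why this matters (my own construction, with an assumed "ground truth"): in observational data where sicker patients are more likely to receive a treatment, the naive correlation understates or even hides a real benefit, while a randomized trial recovers it.

```python
import random

random.seed(1)

def outcome(treated, severity):
    # Assumed ground truth: the treatment helps a little; severity hurts a lot.
    return 1.0 * treated - 3.0 * severity + random.gauss(0, 0.5)

# Observational data: sicker patients are more likely to be treated (confounding).
obs = []
for _ in range(10_000):
    severity = random.random()
    treated = 1 if random.random() < severity else 0
    obs.append((treated, outcome(treated, severity)))

# Randomized controlled trial: treatment assigned by coin flip, independent of severity.
rct = []
for _ in range(10_000):
    severity = random.random()
    treated = 1 if random.random() < 0.5 else 0
    rct.append((treated, outcome(treated, severity)))

def naive_effect(data):
    treated = [y for t, y in data if t == 1]
    control = [y for t, y in data if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

print("observational estimate:", round(naive_effect(obs), 2))  # badly biased
print("randomized estimate   :", round(naive_effect(rct), 2))  # near the true +1.0
```

The observational estimate comes out close to zero even though the true effect is +1.0, which is the kind of error a randomized controlled trial is designed to eliminate.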

Keeping some big-data results at the level of correlation is also an ethical necessity: motive is not behavior. The same goes for predictive analytics; otherwise the police would arrest people predicted to commit crimes and insurers would turn away people predicted to fall ill, and society would be in real trouble. Big-data algorithms already shape our lives deeply, and sometimes that is rather bleak: whether or not you get a loan is decided by an algorithm, and every time Google adjusts its ranking algorithm, many online businesses whose rankings drop suffer for it.

Time is short, so let me post a few things about the value dimensions. One important point in big-data thinking is that besides making decisions smarter, the data themselves have value. I won't dwell on it; to quote Jack Ma: "The starting point of information is that I think I am smarter than others; the starting point of data is that others are smarter than me. Information is what you give others after you have edited the data; data is what you collect and hand to people smarter than you." So what can big data do? How does the value V map onto the other 3Vs and the space-time quadrants? I drew a picture:

Let me post an explanation as well. In the space dimension of Volume: "seeing the subtle" and "knowing the manifest." Small data sees the subtle, the individual; I once likened it to "seeing oneself" in The Grandmaster. Big data knows the manifest, reflecting the characteristics and trends of nature and of groups; I liken it to "seeing heaven and earth, seeing all beings." Knowing the manifest both pushes seeing the subtle (for example, dividing the crowd into buckets) and pulls it (for example, recommending the preferences of similar people to an individual). "Subtle" and "manifest" also play out along the time dimension: data's value to the individual is greatest when freshest, decays with time, and eventually degenerates into collective value.

"Now" and "Clear All" in the time dimension of speed. At the origin of time, the present is the real-time wisdom between flashes of light. Combining the past (negative axis) and predicting the future (positive axis), we can all understand that we can gain eternal wisdom. The descriptions of the true and false Monkey King in the Journey to the West, one is "knowing the heaven and the earth and knowing the changes", and the other is "knowing the heaven and the earth and knowing before and after", just correspond. In order to realize universal knowledge, we need overall analysis, regulation analysis and disposal analysis (what actions are needed to make the set future happen).

"Error Discrimination" and "Meaning Understanding" in the Spatial Dimension of Variants. Based on massive multi-source heterogeneous data, we can identify and filter noise, check leaks and fill gaps, and eliminate the false and retain the true. Understanding has reached a higher level, extracting semantics from unstructured data, enabling machines to spy on people's ideological realm, reaching a height that structured data analysis could not reach in the past.

Look at "knowing the manifest" first. Studying the laws of macro phenomena is nothing new; what big data adds are two features. The first is moving from samples to the full set. CCTV's "Are you happy?" survey last year was street-corner sampling, and the recent happiest-cities ranking from the China Economic Life Survey was based on a sample of roughly 100,000 questionnaires. The happiness index built by the Tsinghua Behavior and Big Data Lab (with the participation of Xiong Ting, myself and many friends in this group) is based on the full data of Sina Weibo (thanks to Lao Wang). These data are people's natural expressions rather than passive answers to a questionnaire, and they come with context, so they are more genuine and more explanatory. Is it the air, housing prices or education that makes people in Beijing, Shanghai and Guangzhou unhappy? Do positive or negative emotions spread more easily on Weibo? The data tell you the answer. When the China Economic Life Survey claims that "even the smallest voice can be heard," that is an exaggeration: sampling and traditional statistical analysis fit the data distribution with simplified models and ignore outliers and the long tail. Full-set analysis can see the black swans and hear the sound of the long tail.
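
As a toy sketch of the "which emotion spreads more" question (the posts, sentiment labels and repost counts below are invented; real work on the Weibo corpus is of course far more involved):

```python
# Each record: (sentiment label from some upstream classifier, number of reposts).
posts = [
    ("positive", 12), ("positive", 3), ("positive", 40),
    ("negative", 85), ("negative", 7), ("negative", 230),
    ("neutral", 2), ("neutral", 5),
]

totals, counts = {}, {}
for sentiment, reposts in posts:
    totals[sentiment] = totals.get(sentiment, 0) + reposts
    counts[sentiment] = counts.get(sentiment, 0) + 1

for sentiment in totals:
    print(sentiment, "average reposts:", round(totals[sentiment] / counts[sentiment], 1))
```

On the full data set, the same group-and-average over every post answers the spread question directly instead of guessing from a sample.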

The second feature is moving from the qualitative to the quantitative. Computational sociology is quantitative analysis applied to sociology: a generation of mathematicians and physicists became economists and quants, and now they can also choose to become sociologists. The Guotai Junan 3I Index is another example: based on data from hundreds of thousands of users, chiefly reflecting trading activity and investment returns, a quantitative model is built to infer the overall investment climate.
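
A rough sketch of how such a composite index might be assembled: standardize each sub-indicator and take a weighted average. The indicator series, names and weights here are my own assumptions, not Guotai Junan's actual model.

```python
import statistics

# Hypothetical monthly sub-indicators aggregated from user accounts.
activity = [0.8, 1.1, 1.3, 0.9, 1.6, 2.0]          # e.g. trades per active account
returns = [0.01, 0.03, -0.02, 0.02, 0.05, 0.07]    # e.g. average account return

def zscores(xs):
    mu, sd = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

weights = {"activity": 0.6, "returns": 0.4}        # assumed weights
index = [
    weights["activity"] * a + weights["returns"] * r
    for a, r in zip(zscores(activity), zscores(returns))
]

print([round(v, 2) for v in index])  # higher values = a warmer investment climate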

Now look at "seeing the subtle." I think the real differentiating advantage of big data lies at the micro level. Natural science is concrete at the macro scale and abstract at the micro scale; social science is the reverse, concrete at the micro and abstract at the macro, and it is there that big data's micro-level power matters most. Xu Xiaonian goes as far as calling macroeconomics a pseudoscience. If the market is just the sum of individual behaviors, what we see in aggregate is an abstract painting we cannot read. Through customer segmentation we gradually form a roughly intelligible realistic picture, though still a mosaic; by differentiating further and even pinpointing individuals, we get a high-definition photograph. Each of us now lives in some retailer's buckets (Tesco, as mentioned earlier, invented the concept), which at first reflected only background, such as high or low income, and later reflected behavior and lifestyle, such as the penny-pinchers and the "right-click tribe" (those who right-click to compare prices). Conversely, we as consumers also hope to receive personalized respect.

Knowing and grasping the customer matters more than ever. Obama won with big data partly because his team knew that George Clooney was the heartthrob of women aged 40-49 on the West Coast, while Sarah Jessica Parker (the lead of Sex and the City) was the idol of women of the same age on the East Coast. He went finer still: what was being watched on TV, hour by hour and age group by age group, in every county of a swing state; how the voting inclination of 1% of voters in the swing state of Ohio shifted over time; and who the swing voters on Reddit were.

For an enterprise this means shifting from being product-centered to being customer- (buyer-) or even user-centered; from caring about users' backgrounds to caring about their behavior and intent; from caring only about the completed transaction to caring about every interaction point and touchpoint. By what path did users find my product, what did they do before that, what feedback did they give after buying, and did it all happen on the web page, on QQ, on Weibo or on WeChat?

Now the third, "the present moment." Time is money: in stock trading the fast fish eat the slow. Free trading software shows quotes with a delay of several seconds, while high-frequency programmed trading, which accounts for 60-70% of US trading volume, hunts millisecond opportunities worth as little as one cent. Time is also life: the supercomputers of the US National Oceanic and Atmospheric Administration issued a tsunami warning nine minutes after Japan's 3.11 earthquake, already too late. And time is opportunity: today's so-called shopping-basket analysis is really the analysis of settled receipts, not of baskets; what is truly valuable is influencing a customer's choices at every point of contact while he or she is still carrying the basket, browsing, trying things on and choosing. The value of data has a half-life: when freshest, its personalized value is greatest, and it gradually degenerates until only collective value remains. The wisdom of the present moment is the move from "carving a mark on the boat to find the dropped sword" to reading the situation as it unfolds. The once-a-decade census is boat-carving; the Baidu migration map, by contrast, reflected in real time what was happening when the Dongguan incident occurred. Of course, such real-time readings may not be entirely accurate; without more data over longer periods, hastily interpreting the Baidu migration map can lead you astray.
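
The half-life idea can be written as a tiny formula; the numbers below (a seven-day half-life, the value levels) are purely illustrative assumptions of mine, not figures from the talk.

```python
def data_value(age_days, individual_value=100.0, collective_value=10.0, half_life_days=7.0):
    """Illustrative decay: the personalized value halves every `half_life_days`,
    eventually leaving only the collective (aggregate) value."""
    return collective_value + individual_value * 0.5 ** (age_days / half_life_days)

for age in (0, 7, 30, 365):
    print(f"age {age:>3} days -> value {data_value(age):.1f}")
```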

The fourth, "knowing all." Time is short, so briefly: knowing in advance that the east wind will come is predictive analytics; settling on the goal of borrowing arrows with straw boats, and writing the prescription for doing it, is prescriptive analytics. Beyond prediction, we need prescriptive analysis to raise response rates, reduce churn and attract new customers.
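
One way to read the distinction in code (my own toy sketch): a predictive score says who is likely to leave; the prescriptive step picks, per customer, the action with the best expected payoff. The churn scores, actions, costs and effect sizes are all invented.

```python
# Predictive analytics: an (assumed) model has scored each customer's churn risk.
customers = {"A": 0.9, "B": 0.6, "C": 0.1}

# Prescriptive analytics: choose the action with the best expected value.
# action: (cost, assumed reduction in churn probability)
actions = {"do nothing": (0, 0.0), "coupon": (5, 0.2), "account manager call": (20, 0.5)}
customer_lifetime_value = 100

for name, churn_p in customers.items():
    def expected_value(action):
        cost, reduction = actions[action]
        kept_p = 1 - max(churn_p - reduction, 0.0)
        return kept_p * customer_lifetime_value - cost
    best = max(actions, key=expected_value)
    print(name, "->", best)
```

High-risk customers get the expensive intervention, low-risk ones the cheap or null one, which is the "prescription" part.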

"Telling truth from error" means using multi-source data to filter noise, fill gaps, cross-check, and discard the false while keeping the true. One example: the GDP figures of more than 20 provinces and cities add up to more than the national GDP. Another: GPS alone has errors of tens of meters, but combined with map data it becomes very accurate; and where GPS has no signal among urban high-rises, it can be combined with inertial navigation.
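
A toy one-dimensional illustration of the fusion idea (a sketch only, not a real navigation filter): dead-reckoning from inertial sensors drifts, GPS is noisy and sometimes absent, and a simple complementary filter blends the two. All the motion and noise parameters are assumptions.

```python
import random

random.seed(2)

true_pos, ins_pos, fused = 0.0, 0.0, 0.0
alpha = 0.2   # how strongly each GPS fix pulls the estimate back

for step in range(20):
    true_pos += 1.0                      # vehicle moves 1 m per step
    ins_pos += 1.0 + 0.05                # inertial estimate drifts slightly
    fused += 1.0 + 0.05                  # propagate the fused estimate with inertial motion

    gps_available = step % 5 != 4        # pretend GPS drops out now and then
    if gps_available:
        gps_fix = true_pos + random.gauss(0, 2.0)       # noisy GPS measurement
        fused = (1 - alpha) * fused + alpha * gps_fix   # blend the two sources

print("true:", true_pos, "inertial only:", round(ins_pos, 1), "fused:", round(fused, 1))
```

The inertial-only track keeps drifting, while the fused track is repeatedly corrected whenever a GPS fix arrives, which is the eliminate-the-false, keep-the-true idea in miniature.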

"Grasping meaning" involves machine intelligence under big data, a big topic I won't expand on here. Let me just paste a passage from one of my own articles. Some say that in the realm of "grasping meaning" humans are irreplaceable; that was true in the pre-big-data era. Moneyball is about using quantitative analysis to predict players' contributions in baseball, and it is misread in the big-data context in two ways: first, it is not really big data but established data thinking and methods; second, it ignores, intentionally or not, the role of the scouts. Readers take away that Billy Beane, general manager of the Oakland A's, replaced the scouts with quantitative analysis; in fact, while adopting quantitative tools, Beane also increased spending on scouting. The medal belongs half to the machine and half to the people, because scouts assess the qualitative attributes of athletes (competitiveness, resilience under pressure, willpower, and so on) that cannot be captured in a handful of structured quantitative indicators. Big data is changing this: digital footprints recorded without people noticing, and machine learning's (especially deep learning's) growing ability to understand meaning, may gradually make up for the machine's weak side. This year we have seen sentiment analysis, values analysis and individual profiling based on big data; applied to human resources, they take on, to a degree, what the scouts used to do.