The Ten Classic Algorithms of Data Mining, Finally Explained Clearly: Bookmark This and Study It Later

An excellent data analyst should master not only basic statistics, data analysis thinking, and data analysis tools, but also the fundamentals of data mining, which help us extract value from data. This is also what separates data analysis experts from ordinary data analysts.

The IEEE International Conference on Data Mining (ICDM), an authoritative international academic body, has selected the ten classic algorithms in the field of data mining: C4.5, k-means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.

Beyond the top ten, any of the 18 algorithms that entered the selection could fairly be called a classic, each having had a far-reaching impact on the field of data mining. Today I walk through these 10 classics. The material is dense, so it is worth bookmarking and studying later.

1. C4.5

C4.5 is a classification decision tree algorithm in machine learning; at its core it builds on the ID3 algorithm. C4.5 inherits the advantages of ID3 and improves on it in the following ways:

1) It selects attributes by information gain ratio, which overcomes the bias of plain information gain toward attributes with many values (a small sketch follows after the pros and cons below);

2) It prunes during tree construction;

3) It can discretize continuous attributes;

4) It can handle incomplete data.

The C4.5 algorithm has the following advantages: the classification rules it generates are easy to understand and reasonably accurate. Its disadvantage is that the data set must be scanned and sorted repeatedly while the tree is built, which makes the algorithm inefficient (CART, by contrast, needs to scan the data set only twice; note, too, that these are really the strengths and weaknesses of decision trees in general).
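
As a minimal sketch of point 1), here is one way the information gain ratio could be computed in Python; the `entropy` and `gain_ratio` helpers and the toy weather data are illustrative assumptions, not part of any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(samples, labels, attribute):
    """Gain ratio of splitting on `attribute`; `samples` is a list of dicts."""
    n = len(samples)
    partitions = {}
    for sample, label in zip(samples, labels):
        partitions.setdefault(sample[attribute], []).append(label)
    # Information gain, as ID3 uses it ...
    gain = entropy(labels) - sum(len(p) / n * entropy(p)
                                 for p in partitions.values())
    # ... normalized by split information, which penalizes many-valued attributes.
    split_info = -sum((len(p) / n) * math.log2(len(p) / n)
                      for p in partitions.values())
    return gain / split_info if split_info > 0 else 0.0

# Toy example: does "outlook" help predict whether we play?
weather = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}]
play = ["no", "yes", "yes"]
print(gain_ratio(weather, play, "outlook"))
```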

2. k-means (K-Means)

The k-means algorithm is a clustering algorithm: it divides N objects into K partitions (K < N) according to their attributes, so that objects within the same partition are as similar as possible.
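
A minimal sketch of the classic Lloyd's-algorithm version of k-means, assuming NumPy; the function name, random initialization, and convergence test are illustrative choices:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array, k < n."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, assign

# Two well-separated blobs; k-means should recover both centers.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
print(kmeans(X, k=2)[0])
```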

3. Support vector machine

Support vector machine (SVM, sometimes abbreviated SV machine; this article uses SVM) is a supervised learning method widely applied to statistical classification and regression analysis. An SVM maps input vectors into a higher-dimensional space and constructs a maximum-margin hyperplane there: two parallel hyperplanes sit on either side of the hyperplane that separates the data, and the separating hyperplane is chosen to maximize the distance between those two parallel hyperplanes. The assumption is that the larger the distance, or gap, between the parallel hyperplanes, the smaller the classifier's total error. An excellent guide is C. J. C. Burges's tutorial on support vector machines for pattern recognition. Van der Walt and Barnard have compared support vector machines with other classifiers.
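
A short, hedged example of fitting a kernel SVM, assuming scikit-learn is available; the dataset and hyperparameters are arbitrary illustrations:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An RBF-kernel SVM: the kernel implicitly maps inputs into a
# higher-dimensional space where a maximum-margin hyperplane is fitted.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```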

4. Apriori algorithm

The Apriori algorithm is the most influential algorithm for mining the frequent itemsets of Boolean association rules. Its core is a recursive, level-wise method built on the two-phase frequent itemset idea. In the usual taxonomy, the rules it mines are single-dimensional, single-level, Boolean association rules. All itemsets whose support is at least the minimum support are called frequent itemsets (frequent sets for short).
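
A compact, pure-Python sketch of that level-wise frequent itemset search; the `apriori` function and its absolute-count `min_support` are illustrative assumptions, not a production implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Frequent itemsets of a list of set-valued transactions."""
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result, k = dict(frequent), 2
    while frequent:
        # Candidate k-itemsets, pruned by the Apriori property:
        # every (k-1)-subset of a frequent itemset must itself be frequent.
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

baskets = [{"milk", "bread"}, {"bread", "butter"},
           {"milk", "bread", "butter"}, {"bread"}]
print(apriori(baskets, min_support=2))
```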

5. Maximum expectation algorithm

In statistical computation, the expectation-maximization (EM) algorithm finds maximum likelihood estimates of the parameters of a probabilistic model, where the model depends on unobservable latent variables. EM is frequently used for data clustering in machine learning and computer vision.
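
A minimal EM sketch for a two-component one-dimensional Gaussian mixture, assuming NumPy and SciPy; the crude initialization and fixed iteration count are illustrative simplifications:

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=50):
    """EM for a 2-component 1D Gaussian mixture; component labels are latent."""
    mu = np.array([x.min(), x.max()], dtype=float)  # crude initialization
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])                       # mixing weights
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = pi * norm.pdf(x[:, None], mu, sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
print(em_two_gaussians(x))
```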

6. PageRank

PageRank is an important part of Google's algorithm. It was granted a US patent in September 2001, with Google co-founder Larry Page named as inventor. The "Page" in PageRank therefore refers not to a web page but to Larry Page: the ranking method is named after him.

PageRank measures a website's value according to the quantity and quality of its external and internal links. The idea behind PageRank is that every link to a page counts as a vote for that page: the more links a page receives, the more votes other websites have cast for it. This is so-called "link popularity", a measure of how many people are willing to link their own sites to yours. The concept comes from citation frequency in academia: the more often a paper is cited by others, the more authoritative it is generally judged to be.
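
A small power-iteration sketch of the "links as votes" idea; the `pagerank` function, the damping factor of 0.85, and the toy three-page graph are illustrative assumptions:

```python
import numpy as np

def pagerank(links, d=0.85, n_iter=100):
    """`links` maps each page to the list of pages it links to."""
    pages = sorted(links)
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    rank = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        new = np.full(n, (1.0 - d) / n)        # random-jump (teleport) term
        for page, outs in links.items():
            if outs:
                share = rank[idx[page]] / len(outs)
                for out in outs:               # each link "votes" a rank share
                    new[idx[out]] += d * share
            else:                              # dangling page: spread evenly
                new += d * rank[idx[page]] / n
        rank = new
    return dict(zip(pages, rank))

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```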

7. AdaBoost

AdaBoost is an iterative algorithm. Its core idea is to train a series of different classifiers (weak classifiers) on the same training set and then combine them into a stronger final classifier (a strong classifier). The algorithm works by changing the data distribution: it sets the weight of each sample according to whether the sample was classified correctly in the previous round and according to the accuracy of the previous overall classification. The re-weighted data set is then handed to the next weak classifier for training, and the classifiers from all rounds are finally fused into the final decision classifier.
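
A hedged scikit-learn example with decision stumps as the weak classifiers; the synthetic dataset and hyperparameters are arbitrary (note that `estimator` was named `base_estimator` before scikit-learn 1.2):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each boosting round re-weights the training samples so the next stump
# concentrates on the examples misclassified so far.
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stump
    n_estimators=50,
    random_state=0,
)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```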

8. kNN: k-Nearest Neighbor Classification

The k-nearest neighbor (kNN) classification algorithm is theoretically mature and one of the simplest machine learning algorithms. Its idea: if the majority of a sample's k most similar neighbors in feature space (that is, the k nearest samples) belong to a certain category, then the sample belongs to that category as well.
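
A minimal majority-vote kNN sketch; Euclidean distance and the toy two-class data are illustrative choices:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify `x` by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([4.8, 5.1])))  # -> "blue"
```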

9. Naive Bayes

Among the many classification models, the two most widely used are the decision tree model and the naive Bayes classifier (NBC). The naive Bayes model is grounded in classical mathematical theory, so it has a solid mathematical foundation and stable classification performance.

At the same time, the NBC model requires few estimated parameters, is insensitive to missing data, and is algorithmically simple. In theory, the NBC model has the smallest error rate among classification methods. In practice this is not always so, because NBC assumes that attributes are independent of one another given the class, an assumption that often fails in real applications and hurts its classification accuracy. When the number of attributes is large or the correlations between attributes are strong, NBC classifies less well than a decision tree model; when attribute correlations are small, NBC performs at its best.
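
A short example using scikit-learn's GaussianNB, which adds a per-class Gaussian assumption on top of the conditional-independence assumption discussed above; the dataset choice is arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Few parameters to estimate (per-class feature means and variances),
# so training is fast and fairly stable.
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```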

10. CART: Classification and Regression Trees

CART stands for classification and regression trees. Two key ideas underlie its classification trees: the first is recursively partitioning the space of the independent variables (binary splitting); the second is using validation data for pruning (pre-pruning and post-pruning). Building a model tree on top of a regression tree may be harder, but it can also improve the classification results.
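
A hedged example using scikit-learn's DecisionTreeClassifier, whose trees use CART-style recursive binary splitting; `ccp_alpha` enables cost-complexity post-pruning, and the dataset and value are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Recursive binary splitting; ccp_alpha > 0 post-prunes the grown tree.
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```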

Reference book: Machine Learning in Action