Ten Algorithms for Novices in Machine Learning
This article introduces the 10 algorithms that beginners in machine learning need to know, including linear regression, logistic regression, naive Bayes, the K-nearest neighbors algorithm, and more.
In machine learning, there is a theorem called "No Free Lunch". In short, it states that no single algorithm works best for every problem, especially in supervised learning (that is, predictive modeling).
For example, you can't say that neural networks are always better than decision trees, and vice versa. There are many factors at work, such as the size and structure of the data set.
Therefore, you should try many different algorithms for your specific problem and set aside a "test set" of data to evaluate performance and pick the winner.
Of course, the algorithm you try must be suitable for your problem, that is, choose the appropriate machine learning task. For example, if you need to clean the house, you may use a vacuum cleaner, broom or mop, but you won't take out a shovel and start digging.
A general principle
However, there is a common principle that underlies predictive modeling in all supervised machine learning algorithms.
A machine learning algorithm can be described as learning a target function f that best maps input variables X to an output variable Y: Y = f(X).
This is a general learning task where we would like to predict Y given new examples of the input variables X. We do not know what the function f looks like or what form it takes. If we did, we would use it directly, and we would not need to learn it from data using machine learning algorithms.
The most common type of machine learning is learning the mapping Y = f(X) to make predictions of Y for new X. This is called predictive modeling or predictive analytics, and our goal is to make the most accurate predictions possible.
For beginners who want to understand the basics of machine learning, this article surveys the top 10 machine learning algorithms used by data scientists.
1. Linear regression
Linear regression is probably one of the most well-known and understandable algorithms in statistics and machine learning.
Predictive modeling is primarily concerned with minimizing model error, or in other words, making the most accurate predictions possible at the expense of interpretability. We borrow and reuse algorithms from many different fields, including statistics, toward this end.
The representation of linear regression is an equation that describes the straight line best capturing the relationship between the input variable X and the output variable Y, found by learning specific weights for the input variables, called coefficients (B).
For example: y = B0 + B1 * x
We will predict y given the input x, and the goal of the linear regression learning algorithm is to find the values of the coefficients B0 and B1.
Different techniques can be used to learn the linear regression model from data, such as a linear algebra solution for ordinary least squares and gradient descent optimization.
Linear regression has been around for more than 200 years and has been extensively studied. Some good rules of thumb when using this technique are to remove variables that are very similar (correlated) and to remove as much noise from your data as possible. It is a fast, simple technique and a good first algorithm to try.
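To make this concrete, here is a minimal sketch of fitting y = B0 + B1 * x with ordinary least squares, assuming Python with numpy; the data values are illustrative, not from the article:

```python
import numpy as np

# Toy data: y is roughly 2*x + 1 plus noise (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

# Ordinary least squares for y = B0 + B1 * x.
# np.polyfit returns coefficients highest degree first.
b1, b0 = np.polyfit(x, y, deg=1)
print(f"y = {b0:.2f} + {b1:.2f} * x")

# Predict y for a new input x.
x_new = 6.0
print(b0 + b1 * x_new)
```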
2. Logistic regression
Logistic regression is another technique machine learning borrowed from statistics. It is the go-to method for binary (two-class) classification problems.
Logistic regression is like linear regression in that the goal is to find the weight for each input variable, that is, the coefficient values. Unlike linear regression, the prediction for the output is transformed using a nonlinear function called the logistic function.
The logistic function looks like a big S and transforms any value into the range 0 to 1. This is useful because we can apply a rule to snap the output of the logistic function to 0 or 1 (for example, if the output is less than 0.5, predict 0) and thereby predict a class value.
Because of the way the model is learned, the predictions made by logistic regression can also be used as the probability that a given data instance belongs to class 0 or class 1. This is useful for problems where you need to give more rationale for a prediction.
Like linear regression, logistic regression works better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other. It is a fast model to learn and is effective on binary classification problems.
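As an illustrative sketch, assuming Python with scikit-learn, the following shows the logistic function and a logistic regression model that returns both class predictions and class probabilities; the toy data is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The logistic (sigmoid) function: squashes any value into (0, 1).
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-classification data (illustrative values only).
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print(model.predict([[2.0]]))        # predicted class (0 or 1)
print(model.predict_proba([[2.0]]))  # class-membership probabilities
```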
3. Linear Discriminant Analysis (LDA)
Logistic regression is a classification algorithm traditionally limited to two-class problems. If you have more than two classes, Linear Discriminant Analysis (LDA) is the preferred linear classification technique.
The representation of LDA is quite simple and direct. It consists of statistical properties of your data, calculated for each class. For a single input variable, this includes:
The mean value for each class;
The variance calculated across all classes.
Predictions are made by computing a discriminant value for each class and predicting the class with the largest value. The technique assumes that the data follows a Gaussian distribution (bell curve), so it is a good idea to remove outliers from your data beforehand. It is a simple and powerful method for classification predictive modeling problems.
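A minimal sketch of LDA on a single input variable, assuming Python with scikit-learn; the toy data is hypothetical:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: one input variable, two classes (illustrative values only).
X = np.array([[1.0], [1.2], [0.8], [3.0], [3.2], [2.9]])
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, y)
# Prediction computes a discriminant value per class and picks the largest.
print(lda.predict([[2.0]]))
```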
4. Classification and regression tree
Decision trees are an important type of algorithm for predictive modeling in machine learning.
The representation of the decision tree model is a binary tree. This is the same binary tree from algorithms and data structures, nothing special. Each node represents a single input variable X and a split point on that variable (assuming the variable is numeric).
The leaf nodes of the tree contain the output variable y used to make a prediction. Predictions are made by walking the splits of the tree until arriving at a leaf node and outputting the class value at that node.
Decision trees are fast to learn and fast at making predictions. They can also solve a large number of problems without requiring special preparation of the data.
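A minimal sketch of learning and using a decision tree, assuming Python with scikit-learn; the iris data set and the max_depth value are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a CART-style decision tree on the classic iris data set.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Prediction walks the splits from the root down to a leaf node.
print(tree.predict(X[:5]))
```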
5. Naive Bayes
Naive Bayes is a simple but powerful predictive modeling algorithm.
The model consists of two types of probabilities that can be calculated directly from your training data: 1) the probability of each class; and 2) the conditional probability of each class given each value of x. Once calculated, the probability model can be used to make predictions for new data using Bayes' theorem. When your data is real-valued, it is common to assume a Gaussian distribution (bell curve) so that these probabilities are easy to estimate.
Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption that is unrealistic for real data; nevertheless, the technique is very effective on a large range of complex problems.
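A minimal sketch of Gaussian naive Bayes on real-valued data, assuming Python with scikit-learn; the toy values are hypothetical:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy real-valued data, so a Gaussian distribution is assumed per feature.
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 4.2], [3.1, 4.0]])
y = np.array([0, 0, 1, 1])

nb = GaussianNB().fit(X, y)
print(nb.predict([[1.1, 2.0]]))        # most probable class
print(nb.predict_proba([[1.1, 2.0]]))  # per-class probabilities via Bayes' theorem
```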
6. K-nearest neighbors (KNN)
The KNN algorithm is very simple and very effective. The model representation for KNN is the entire training data set. Simple, right?
KNN makes predictions for a new data point by searching the entire training set for the K most similar instances (the neighbors) and summarizing the output variable of those K instances. For regression problems, this might be the mean output variable; for classification problems, this might be the mode (most common) class value.
The trick is in how to determine the similarity between data instances. The simplest technique, if your attributes are all on the same scale (for example, all in inches), is the Euclidean distance, a number you can calculate directly from the differences between each input variable.
KNN can require a lot of memory or space to store all of the data, but it only performs a calculation (or learns) when a prediction is needed. You can also update and curate your training instances over time to keep predictions accurate.
The idea of distance or closeness can break down in very high dimensions (many input variables), which can negatively affect the performance of the algorithm on your problem. This is called the curse of dimensionality. It suggests you only use those input variables that are most relevant to predicting the output variable.
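To illustrate the idea, here is a minimal from-scratch KNN classifier using Euclidean distance, assuming Python with numpy; the function name and toy data are hypothetical:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training instance.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Mode (most common class value) among the k neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data (illustrative values only).
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.5])))  # -> 0
```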
7. Learning Vector Quantization (LVQ)
A downside of K-nearest neighbors is that you need to hang on to the entire training data set. Learning Vector Quantization (LVQ) is an artificial neural network algorithm that allows you to choose how many training instances to keep and learns exactly what those instances should look like.
The representation for LVQ is a collection of codebook vectors. These are selected randomly at the beginning and adapted over a number of iterations of the learning algorithm to best summarize the training data set. After learning, the codebook vectors can be used to make predictions just like K-nearest neighbors: the most similar neighbor (best matching codebook vector) is found by calculating the distance between each codebook vector and the new data instance. The class value (or, in regression, the real value) of the best matching unit is then returned as the prediction. Best results are achieved if you rescale your data to have the same range (for example, between 0 and 1).
If you discover that KNN gives good results on your data set, try using LVQ to reduce the memory requirement of storing the entire training data set.
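A minimal sketch of the LVQ1 update described above, assuming Python with numpy; the function name, learning-rate schedule, and toy data are all illustrative choices, not from the article:

```python
import numpy as np

def train_lvq(X, y, codebooks, labels, lr=0.3, epochs=10):
    """Minimal LVQ1 sketch: pull the best matching codebook vector toward
    a same-class instance, push it away from a different-class instance."""
    for epoch in range(epochs):
        rate = lr * (1.0 - epoch / epochs)  # linearly decaying learning rate
        for xi, yi in zip(X, y):
            # Best matching unit: the codebook vector closest to the instance.
            bmu = np.argmin(((codebooks - xi) ** 2).sum(axis=1))
            sign = 1.0 if labels[bmu] == yi else -1.0
            codebooks[bmu] += sign * rate * (xi - codebooks[bmu])
    return codebooks

# Toy data and two randomly chosen initial codebook vectors (illustrative).
X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [4.2, 3.8]])
y = np.array([0, 0, 1, 1])
codebooks = np.array([[0.5, 0.5], [3.5, 3.5]])
labels = np.array([0, 1])
train_lvq(X, y, codebooks, labels)

# Predict like KNN with k=1, but over the codebook vectors only.
x_new = np.array([1.1, 1.0])
print(labels[np.argmin(((codebooks - x_new) ** 2).sum(axis=1))])  # -> 0
```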
8. Support Vector Machine (SVM)
Support vector machine is probably one of the most popular and widely discussed machine learning algorithms.
A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class (class 0 or class 1). In two dimensions, you can visualize this as a line, and we assume that all of the input points can be completely separated by this line. The SVM learning algorithm finds the coefficients that result in the best separation of the classes by the hyperplane.
The distance between the hyperplane and the closest data points is called the margin. The best or optimal hyperplane separating the two classes is the one with the largest margin. Only these points are relevant to defining the hyperplane and constructing the classifier. These points are called the support vectors; they support or define the hyperplane. In practice, an optimization algorithm is used to find the values of the coefficients that maximize the margin.
SVM is probably one of the most powerful out-of-the-box classifiers and is worth trying on your data set.
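A minimal sketch of a linear-kernel SVM, assuming Python with scikit-learn; the synthetic data and parameter values are illustrative:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable blobs of points (synthetic data).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# A linear-kernel SVM finds the maximum-margin hyperplane.
svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_.shape)  # the points that define the hyperplane
print(svm.predict(X[:5]))
```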
9. Bagging and random forests
Random forest is one of the most popular and most powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation, or bagging.
The bootstrap is a powerful statistical method for estimating a quantity from a data sample, such as a mean. You take lots of samples of your data, calculate the mean of each, and then average all of your mean values to get a better estimate of the true mean.
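A minimal sketch of bootstrapping a mean estimate, assuming Python with numpy; the sample size and distribution are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=5.0, size=100)  # synthetic sample

# Bootstrap: resample with replacement many times, then average the means.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]
print(np.mean(boot_means))  # estimate of the true mean (~50)
```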
Bagging uses the same approach, but for estimating entire statistical models, most commonly decision trees. Multiple samples of your training data are taken, and a model is built for each sample. When you need to make a prediction for new data, each model makes a prediction, and the predictions are averaged to give a better estimate of the true output value.
Random forest is a tweak on this approach. In the random forest method, the decision trees are created such that, rather than selecting the optimal split point at each node, suboptimal splits are made by introducing randomness.
As a result, the models created for each sample of the data are more different from one another than they would otherwise be, yet still accurate in their own unique ways. Combining their predictions gives a better estimate of the true underlying output value.
If you get good results with an algorithm that has high variance (such as decision trees), you can often get better results by bagging that algorithm.
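A minimal sketch comparing bagged decision trees with a random forest, assuming Python with scikit-learn; the synthetic data and parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # synthetic data

# Bagging: many trees, each fit on a bootstrap sample of the training data.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50)
# Random forest: bagging plus randomized (suboptimal) split selection.
forest = RandomForestClassifier(n_estimators=50)

print(cross_val_score(bagging, X, y).mean())
print(cross_val_score(forest, X, y).mean())
```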
10. Boosting and AdaBoost
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models has been added.
AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting.
AdaBoost is used with short decision trees. After the first tree is created, the tree's performance on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas data that is easy to predict is given less weight. Models are created sequentially, one after another, each updating the weights on the training instances, which affects the learning performed by the next tree in the sequence. After all the trees are built, predictions are made for new data, and the performance of each tree is weighted by how accurate it was on the training data.
Because so much attention is paid by the algorithm to correcting errors, it is important to have clean data with outliers removed.
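A minimal sketch of AdaBoost over short decision trees (depth-1 stumps), assuming Python with scikit-learn; the synthetic data and parameter values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)  # synthetic data

# AdaBoost over decision stumps: each new stump pays more attention to the
# training instances the previous stumps predicted incorrectly.
stump = DecisionTreeClassifier(max_depth=1)
boost = AdaBoostClassifier(stump, n_estimators=50).fit(X, y)
print(boost.score(X, y))
```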
Summary
When faced with the wide variety of machine learning algorithms, beginners often ask, "Which algorithm should I use?" The answer depends on many factors, including: (1) the size, quality, and nature of the data; (2) the available computation time; (3) the urgency of the task; and (4) what you want to do with the data.
Even an experienced data scientist cannot tell which algorithm will perform best before trying different ones. Although there are many other machine learning algorithms, this article covers the most popular ones. If you are new to machine learning, they are a good starting point.