7 Regression Techniques Every Data Science Person Should Know

Introduction

Linear regression and logistic regression are usually the first algorithms people learn in data science. Because of their popularity, many analysts even consider them to be the only forms of regression, and anyone with a little work experience would consider them the most important forms of regression analysis.

The truth is that there are countless forms of regression that can be used. Each form has its own importance and is best suited to specific scenarios. In this article, I will explain, in simple terms, the seven forms of regression most commonly used in data science. I also hope this article gives people a sense of the breadth of regression, so that instead of reaching for linear/logistic regression for every problem they encounter, they realize just how many regression techniques are available to them!

If you're new to data science and looking for a place to start, the "Data Science" course is a great starting point! It covers the core topics of Python, statistics, and predictive modeling, and it's a good way to take your first steps into data science.

What is regression analysis?

Regression analysis is a predictive modeling technique that examines the relationship between a dependent (target) variable and independent (predictor) variables. The technique is used for forecasting, time series modeling, and finding cause-and-effect relationships between variables. For example, the relationship between reckless driving and the number of road traffic accidents a driver has is best studied through regression.

Regression analysis is an important tool for modeling and analyzing data. Here, we fit a curve or straight line to the data points so that the distances between the data points and the curve or line are minimized. I will explain this in detail in the sections that follow.

Why do we use regression analysis?

As mentioned above, regression analysis is the estimation of the relationship between two or more variables. Let's understand this through a simple example:

Let's say you want to estimate your company's sales growth rate based on current economic conditions. You have recent company data that suggests sales growth is about 2.5 times the economic growth. Using this insight, we can predict the company's future sales based on current and past information.

There are many benefits to using regression analysis, including the following:

- It shows the significant relationships between the dependent variable and the independent variables.
- It indicates the strength of the effect of multiple independent variables on a dependent variable.

Regression analysis also allows us to compare the effects of variables measured on different scales, such as the effect of price changes versus the number of promotional activities. These benefits help market researchers, data analysts, and data scientists evaluate and eliminate variables in order to arrive at the best set of variables for building predictive models.

How many regression techniques do we have?

We have a wide variety of regression techniques available for forecasting. These techniques are driven by three main metrics (the number of independent variables, the type of dependent variable, and the shape of the regression line). We will discuss them in more detail in the following sections.

For the creative ones, you can even cook up new regressions that people haven't used before by combining the parameters above, if you feel the need. But before we start, let's understand the most commonly used regressions:

1. Linear regression

It is one of the most widely known modeling techniques. Linear regression is usually one of the first few methods people choose when learning predictive modeling. In this method, the dependent variable is continuous, the independent variables can be continuous or discrete, and the regression line is linear in nature.

Linear regression uses a best-fit straight line (also known as a regression line) to establish a relationship between the dependent variable (Y) and one or more independent variables (X).

It is represented by the equation Y = a + b * X + e, where a is the intercept, b is the slope of the line, and e is the error term. This equation predicts the value of the target variable based on the given predictor variables.

The difference between simple linear regression and multiple linear regression is that multiple linear regression has more than one independent variable, while simple linear regression has only one. Now the question is, "How do we obtain the line of best fit?"

How do we get the line of best fit (the values of a and b)?

This task can be accomplished easily with least squares, the most common method used to fit regression lines. It calculates the line of best fit for the observed data by minimizing the sum of the squared vertical deviations from each data point to the line. Because the deviations are squared before they are summed, positive and negative deviations do not cancel each other out.

We can evaluate model performance using the metric R-squared.
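To make this concrete, here is a minimal sketch (not from the original article) of fitting a simple linear regression and reading off a, b, and R-squared. It assumes scikit-learn and NumPy, which the article itself does not prescribe.

```python
# A hedged sketch: fit Y = a + b*X + e by least squares and report R-squared.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))             # one independent variable
y = 3.0 + 2.5 * X[:, 0] + rng.normal(0, 1, 100)   # Y = a + b*X + e (synthetic)

model = LinearRegression().fit(X, y)
print("intercept a:", model.intercept_)
print("slope b:", model.coef_[0])
print("R-squared:", model.score(X, y))            # proportion of variance explained
```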

Key points:

- There must be a linear relationship between the independent and dependent variables.
- Multiple regression suffers from multicollinearity, autocorrelation, and heteroskedasticity.
- Linear regression is very sensitive to outliers, which can greatly affect the regression line and ultimately the predicted values.
- Multicollinearity can increase the variance of the coefficient estimates and make the estimates very sensitive to small changes in the model, resulting in unstable coefficient estimates.
- When there are multiple independent variables, we can use forward selection, backward elimination, and stepwise methods to select the most important independent variables.

2. Logistic regression

Logistic regression methods are used to find the probability of success and the probability of failure of an event. We should use logistic regression when the dependent variable is essentially binary (0/1, true/false, yes/no). Here the Y value ranges from 0 to 1 and it can be represented by the following equation.

odds = p / (1 - p) = probability of event / probability of non-event
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk

Above, p is the probability of the presence of the characteristic of interest. A question you should be asking at this point is, "Why have we used log in the equation?"

Since the dependent variable follows a binomial distribution here, we need to choose the link function best suited to that distribution, and that is the logit function. In the equation above, the parameters are chosen to maximize the likelihood of observing the sample values rather than to minimize the sum of squared errors (as in ordinary regression).
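As a hedged illustration of this idea, the sketch below fits a binary logistic regression by maximum likelihood on synthetic data; scikit-learn and the simulated dataset are my own assumptions, not part of the article.

```python
# Hedged sketch: binary logistic regression on a simulated 0/1 outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                     # two predictors X1, X2
logit = 0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]       # b0 + b1*X1 + b2*X2
p = 1 / (1 + np.exp(-logit))                      # inverse logit gives probabilities
y = rng.binomial(1, p)                            # binary outcome (0/1)

clf = LogisticRegression().fit(X, y)              # fitted by maximum likelihood
print("coefficients:", clf.coef_[0], "intercept:", clf.intercept_[0])
print("predicted P(y=1) for first row:", clf.predict_proba(X[:1])[0, 1])
```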

Highlights:

- It is widely used for classification problems.
- Logistic regression does not require a linear relationship between the dependent and independent variables. It can handle various types of relationships because it applies a nonlinear log transformation to the predicted odds ratio.
- To avoid overfitting and underfitting, we should include all the important variables. A good way to ensure this is to use a stepwise approach to estimate the logistic regression.
- It requires large sample sizes, because maximum likelihood estimation is less efficient than ordinary least squares when sample sizes are small.
- The independent variables should not be correlated with each other, i.e., there should be no multicollinearity. However, we can choose to include interactions of categorical variables in the analysis and the model.
- If the values of the dependent variable are ordinal, it is called ordinal logistic regression.
- If the dependent variable is multi-class, it is called multinomial logistic regression.

3. Polynomial regression

If the power of the independent variable is greater than 1, then the regression equation is a polynomial regression equation. The following equation represents the polynomial equation:

Y = a + b * X^2

In this regression technique, the line of best fit is not a straight line. It is a curve that coincides with the data points.
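A small sketch of this, under the assumption that NumPy is available (the article does not prescribe a library): fit a degree-2 polynomial of the form Y = a + b*X^2 to noisy data and inspect the fitted curve.

```python
# Hedged sketch: polynomial regression via numpy's least-squares polynomial fit.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 60)
y = 1.0 + 0.5 * x**2 + rng.normal(0, 0.3, x.size)   # quadratic relationship with noise

coeffs = np.polyfit(x, y, deg=2)                     # coefficients, highest power first
y_hat = np.polyval(coeffs, x)                        # fitted curve at the observed x
print("fitted coefficients (x^2, x, constant):", coeffs)
```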

Key points:

- While there may be a temptation to fit a higher-degree polynomial to obtain a lower error, this can lead to overfitting. Always plot the fitted relationship against the data to check that it makes sense, and focus on making sure the curve matches the nature of the problem.
- Pay particular attention to the curve at the ends to see whether the shapes and trends make sense; higher-degree polynomials can end up producing strange results on extrapolation.

4. Stepwise regression

This form of regression is used when we are dealing with multiple independent variables. In this technique, the independent variables are selected by an automatic process, without human intervention.

This feat is achieved by observing statistical values such as R-squared, t-statistics, and the AIC metric to identify significant variables. Stepwise regression basically fits the regression model by adding or dropping covariates one at a time based on a specified criterion. Some of the most commonly used stepwise regression methods are listed below:

- Standard stepwise regression does two things: it adds and removes predictors as needed at each step.
- Forward selection starts with the most significant predictor in the model and adds a variable at each step.
- Backward elimination starts with all predictors in the model and removes the least significant variable at each step.

The goal of this modeling technique is to maximize predictive power with the fewest number of predictor variables. It is one of the ways to deal with the higher dimensions of the dataset.
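As one possible illustration (not the article's own code), scikit-learn's SequentialFeatureSelector performs an automated forward selection in this spirit, adding one covariate at a time according to a cross-validated criterion; the dataset and the choice of four features are arbitrary assumptions.

```python
# Hedged sketch: forward selection of predictors with a cross-validated criterion.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward"
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```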

5. Ridge regression

Ridge regression is a technique used when the data suffers from multicollinearity (highly correlated independent variables). Under multicollinearity, even though the least squares (OLS) estimates are unbiased, their variances are large, which causes the observed values to deviate far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.

Above, we saw the equation for linear regression. Remember it? It can be expressed as:

y = a + b * x

This equation also has an error term. The full equation becomes:

y = a + b * x + e (error term)

[The error term is the value needed to correct for the prediction error between the observed and predicted values.]

With multiple independent variables, this becomes:

y = a + b1*x1 + b2*x2 + ... + e

In the linear equation, the prediction error can be decomposed into two subcomponents. The first is due to bias and the second is due to variance. Prediction error can occur due to either or both of these components. Here, we will discuss errors due to variance.

Ridge regression solves the multicollinearity problem by shrinking the coefficients with a parameter λ (lambda). Look at the objective below:

minimize: Σ(y − ŷ)² + λ Σβ²

In this objective, we have two components. The first is the least squares term, and the second is λ times the sum of β² (the squared coefficients), where β is a coefficient. The second term is added to the least squares term in order to shrink the coefficients so that they have very low variance.
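A minimal sketch of this, assuming scikit-learn (not specified by the article): the alpha parameter in Ridge plays the role of λ, and increasing it shrinks the coefficients of nearly collinear predictors.

```python
# Hedged sketch: ridge regression on two nearly collinear predictors.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)        # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 + 3 * x2 + rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)                # larger alpha -> more shrinkage
print("ridge coefficients:", ridge.coef_)
```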

Highlights:

- The assumptions of this regression are the same as those of least squares regression, except that normality is not assumed.
- It shrinks the values of the coefficients but never exactly to zero, so it does not perform feature selection.
- It is a regularization method and uses L2 regularization.

6. Lasso regression

Similar to ridge regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression coefficients. In addition, it reduces variability and improves the accuracy of linear regression models. Consider the following objective:

minimize: Σ(y − ŷ)² + λ Σ|β|

Lasso regression differs from ridge regression in that its penalty function uses absolute values rather than squares. This amounts to constraining the sum of the absolute values of the estimates, which causes some of the parameter estimates to be exactly zero. The larger the penalty applied, the further the estimates shrink toward zero. This provides variable selection from a given set of n variables.
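For illustration only, and again assuming scikit-learn, the sketch below shows how a sufficiently large penalty drives some Lasso coefficients exactly to zero; the synthetic data and alpha value are my own choices.

```python
# Hedged sketch: Lasso shrinks the coefficients of irrelevant predictors to zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=200)  # only two true signals

lasso = Lasso(alpha=0.5).fit(X, y)
print("lasso coefficients:", lasso.coef_)   # expect zeros on the irrelevant columns
```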

Highlights:

- The assumptions of this regression are the same as those of least squares regression, except that normality is not assumed.
- It shrinks coefficients to exactly zero, which certainly helps with feature selection.
- It is a regularization method and uses L1 regularization.
- If a group of predictors is highly correlated, Lasso selects only one of them and shrinks the others to zero.

7. Elastic Net regression

Elastic Net regression is a hybrid of the Lasso and ridge regression techniques. It uses both the L1 and L2 priors as regularizers during training. Elastic Net is useful when there are multiple correlated features: while Lasso is likely to pick one of them at random, Elastic Net is likely to pick both.

A practical advantage of trading off between Lasso and ridge regression is that it allows Elastic Net to inherit some of ridge regression's stability under rotation.
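A hedged sketch along these lines, assuming scikit-learn: ElasticNet mixes the L1 and L2 penalties through l1_ratio, and with correlated predictors it tends to keep both rather than arbitrarily dropping one. The data and parameter values here are illustrative assumptions.

```python
# Hedged sketch: Elastic Net keeps a pair of correlated predictors together.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(5)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.05, size=300)        # correlated pair of features
x3 = rng.normal(size=300)                         # irrelevant feature
X = np.column_stack([x1, x2, x3])
y = 2 * x1 + 2 * x2 + rng.normal(size=300)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # l1_ratio balances L1 vs L2
print("elastic net coefficients:", enet.coef_)          # both x1 and x2 tend to stay in
```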

Key points:

- It encourages a grouping effect when variables are highly correlated.
- There is no limit on the number of variables selected.
- It can suffer from double shrinkage.

How to choose the right regression model?

Life is usually simple when you know only one or two techniques. One training institute I know of tells its students: if the outcome is continuous, use linear regression; if it is binary, use logistic regression! However, the greater the number of options available to us, the more difficult it becomes to choose the right one. A similar thing happens with regression models.

With many types of regression models, it is important to choose the most appropriate regression method based on the type of independent and dependent variables, the dimensions in the data, and other basic characteristics of the data. The following are key factors that should be considered in selecting the right regression model:

- Data mining is an inevitable part of building predictive models. Before choosing the right model, you should first identify the correlations and effects between the variables.
- To compare the goodness of fit of different models, we can analyze different metrics such as the statistical significance of the parameters, R-squared, adjusted R-squared, the AIC and BIC metrics, and the error terms. Another is Mallow's Cp criterion, which essentially checks for possible bias in the model by comparing it with all possible (carefully chosen) sub-models.
- Cross-validation is the best way to evaluate models used for prediction: divide the dataset into two groups (training and validation), and use the simple mean squared deviation between the observed and predicted values to measure prediction accuracy (see the sketch after this list).
- If your dataset has multiple confounding variables, you should not use an automatic model selection method, because you will not want to put them all into the model at the same time.
- It also depends on your objective. A less powerful model can be easier to implement than one with high statistical significance.
- Regression regularization methods (Lasso, ridge, and Elastic Net regression) work well when there is high dimensionality and multicollinearity among the variables in the dataset.
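As a rough illustration of the cross-validation point above (not part of the original article), the sketch below compares several candidate regression models by cross-validated mean squared error; the dataset, models, and hyperparameters are placeholder assumptions.

```python
# Hedged sketch: compare candidate regression models by cross-validated MSE.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: mean CV MSE = {-scores.mean():.1f}")
```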

Conclusion

By now, I hope you have gained some understanding of regression. These regression techniques should be applied with the conditions of the data in mind. One of the best tricks for figuring out which technique to use is to check the family of variables, i.e., whether they are discrete or continuous.

In this article, I discussed seven types of regression and some key facts associated with each technique. If you are a newcomer to this industry, I recommend that you learn these techniques and then implement them in your own models.

- These are the seven types of regression models the author recommends every data science practitioner should know. If you are interested in these seven models, experiment with them yourself: it's not enough to know the theory, you need to practice with them to really master them.
