What is the process of data preprocessing?

Common data preprocessing steps include removing unique attributes, handling missing values, attribute encoding, data standardization and regularization, feature selection, and principal component analysis.

Remove unique attributes

Unique attributes are usually ID-like attributes. They do not describe the distribution of the samples themselves, so they can simply be deleted.
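As a minimal sketch of this step, the following drops every column whose values are all distinct across the samples. The function name `drop_unique_attributes` and the sample data are illustrative assumptions, and the "all values distinct" test is only a heuristic: a continuous numeric attribute can also be all-distinct, so in practice it may be safer to list known ID columns explicitly.

```python
def drop_unique_attributes(rows):
    """Remove columns whose values are all distinct (heuristic for ID columns)."""
    if not rows:
        return rows
    n = len(rows)
    # A column is treated as "unique" when it has as many distinct values as rows.
    unique_cols = {
        col for col in rows[0]
        if len({row[col] for row in rows}) == n
    }
    return [
        {col: val for col, val in row.items() if col not in unique_cols}
        for row in rows
    ]


# Hypothetical sample data: "id" is unique per row, the other columns are not.
samples = [
    {"id": 101, "color": "red", "size": 3},
    {"id": 102, "color": "red", "size": 5},
    {"id": 103, "color": "blue", "size": 3},
]
cleaned = drop_unique_attributes(samples)
```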

Handling missing values

There are three ways to deal with missing values: use the features that contain missing values as they are; delete the features that contain missing values (this is effective only when the attribute consists mostly of missing values and has just a few valid values); or complete (impute) the missing values.
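The second option, deleting features that are mostly missing, could be sketched as follows. Missing values are represented as `None`, and the threshold `max_missing_ratio=0.5` is an arbitrary assumption chosen for illustration, not a recommended value.

```python
def drop_sparse_attributes(rows, max_missing_ratio=0.5):
    """Drop columns whose fraction of missing (None) values exceeds the threshold."""
    n = len(rows)
    columns = list(rows[0]) if rows else []
    keep = [
        col for col in columns
        if sum(row[col] is None for row in rows) / n <= max_missing_ratio
    ]
    return [{col: row[col] for col in keep} for row in rows]


# Hypothetical sample data: "income" is missing in 3 of 4 rows.
samples = [
    {"age": 31, "income": None, "city": "A"},
    {"age": None, "income": None, "city": "B"},
    {"age": 45, "income": 5200, "city": "A"},
    {"age": 28, "income": None, "city": None},
]
reduced = drop_sparse_attributes(samples)
```

Here "income" is removed, while "age" and "city" are kept with their remaining gaps for a later completion step.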

Common missing value completion methods include mean imputation, same-class mean imputation, model-based prediction, high-dimensional mapping, multiple imputation, maximum likelihood estimation, compressed sensing, and matrix completion.

(1) Mean imputation

If distances between values of the attribute are measurable (a numeric attribute), the missing value is imputed with the mean of the attribute's valid values;

If distances are not measurable (a categorical attribute), the missing value is imputed with the mode of the attribute's valid values. If mode imputation is used, what effect does it have on data skew?
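A minimal sketch of both variants, assuming missing values are represented as `None`; the function names and the sample values are illustrative only.

```python
from statistics import mean, mode


def impute_mean(values):
    """Fill None entries with the mean of the observed (non-missing) values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Fill None entries with the most frequent observed value."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


heights = impute_mean([150, None, 170, 190])         # numeric attribute
colors = impute_mode(["red", None, "blue", "red"])   # categorical attribute
```

Note that mode imputation always adds to the count of the value that is already the most frequent, which is worth keeping in mind when the data is skewed.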

(2) Same-class mean imputation

First the samples are grouped into classes, and then each missing value is imputed with the mean of that attribute over the samples in the same class.
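The sketch below assumes the class of each sample is already available as a field (here the hypothetical key "species"); how the samples are classified in the first place is outside its scope.

```python
from collections import defaultdict
from statistics import mean


def impute_class_mean(rows, class_key, attr):
    """Fill missing attr values with the mean of attr within the same class."""
    observed = defaultdict(list)
    for row in rows:
        if row[attr] is not None:
            observed[row[class_key]].append(row[attr])
    class_means = {label: mean(vals) for label, vals in observed.items()}
    # Assumption: every class has at least one observed value; otherwise the
    # lookup below raises KeyError and a fallback would be needed.
    return [
        {**row, attr: class_means[row[class_key]]} if row[attr] is None else dict(row)
        for row in rows
    ]


# Hypothetical sample data.
animals = [
    {"species": "cat", "weight": 4},
    {"species": "cat", "weight": 6},
    {"species": "cat", "weight": None},
    {"species": "dog", "weight": 20},
    {"species": "dog", "weight": None},
]
filled = impute_class_mean(animals, "species", "weight")
```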

(3) Model-based prediction

The attribute with missing values is taken as the prediction target. The dataset is split into two parts according to whether a sample is missing that attribute: the samples where it is present serve as training data, and an existing machine learning algorithm is used to predict the missing values in the remaining samples.

The fundamental flaw of this method is that if the other attributes are unrelated to the missing attribute, the prediction is meaningless; conversely, if the prediction is very accurate, the missing attribute is largely redundant and need not be included in the dataset. In practice, the situation usually lies somewhere in between.
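To illustrate the idea, the sketch below uses a one-variable least-squares line as a stand-in for "an existing machine learning algorithm". The choice of model, the function names, and the sample data are all assumptions made for the example; any regressor or classifier could take its place.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_by_regression(rows, predictor, target):
    """Train on rows where target is present, predict it where it is missing."""
    train = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line(
        [r[predictor] for r in train], [r[target] for r in train]
    )
    return [
        {**r, target: slope * r[predictor] + intercept} if r[target] is None else dict(r)
        for r in rows
    ]


# Hypothetical sample data: "price" is missing for the last sample.
homes = [
    {"area": 50, "price": 100},
    {"area": 70, "price": 140},
    {"area": 90, "price": 180},
    {"area": 60, "price": None},
]
completed = impute_by_regression(homes, "area", "price")
```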

(4) High-dimensional mapping

The attribute is mapped into a higher-dimensional space using one-hot encoding. An attribute with K discrete values is expanded into K+1 attributes; if the original value is missing, the (K+1)-th attribute is set to 1.

This method is the most faithful to the data: it keeps all of the original information, including the fact that a value is missing, without adding any made-up information. The disadvantage is that if all variables are treated this way during preprocessing, the dimensionality of the data grows greatly, the amount of computation increases substantially, and the results are good only when the sample size is large.
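A small sketch of the K+1 encoding, assuming missing values are represented as `None` and the list of K known categories is supplied by the caller; the function name is illustrative.

```python
def one_hot_with_missing(values, categories):
    """Encode each value as a K+1 vector; the last position flags a missing value."""
    index = {category: i for i, category in enumerate(categories)}
    k = len(categories)
    encoded = []
    for value in values:
        vector = [0] * (k + 1)
        vector[k if value is None else index[value]] = 1
        encoded.append(vector)
    return encoded


# K = 3 known categories, so each sample becomes a vector of length 4.
encoded = one_hot_with_missing(["red", None, "blue"], ["red", "green", "blue"])
```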

(5) Multiple imputation

Multiple imputation treats the value to be imputed as random. In practice, the value is usually estimated first and then different noise is added to form several candidate sets of imputed values. The most suitable set is then chosen according to some selection criterion.
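The following is a rough sketch of that procedure under strong simplifying assumptions: the estimate is the mean of the observed values, the noise is Gaussian with the observed standard deviation, and the selection criterion (pick the candidate whose spread is closest to the observed spread) is a hypothetical choice made only for illustration.

```python
import random
from statistics import mean, stdev


def multiple_imputations(values, m=5, seed=0):
    """Build m candidate completions: estimate (mean) plus random Gaussian noise."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    center, spread = mean(observed), stdev(observed)
    return [
        [rng.gauss(center, spread) if v is None else v for v in values]
        for _ in range(m)
    ]


def pick_by_spread(candidates, values):
    """Hypothetical criterion: keep the candidate whose stdev best matches the observed one."""
    target = stdev([v for v in values if v is not None])
    return min(candidates, key=lambda cand: abs(stdev(cand) - target))


# Hypothetical sample data with two missing entries.
readings = [10.0, None, 12.0, 14.0, None, 11.0]
candidates = multiple_imputations(readings, m=3)
chosen = pick_by_spread(candidates, readings)
```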

(6) Compressed sensing and matrix completion

(7) Manual imputation

Imputation only fills in unknown values with our subjective estimates, which may not fully match the objective facts. In many cases, manually imputing missing values based on an understanding of the domain works better.