What is the process of data preprocessing?
Remove unique attributes
Unique attributes are usually identifier (ID) fields. They say nothing about the distribution of the samples themselves, so they can simply be deleted.
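As a minimal sketch of this step, assuming the data set is held as a list of dictionaries, the hypothetical helper `drop_unique_columns` below removes every column whose values are all distinct. The column names are invented for illustration. A continuous numeric feature can also have all-distinct values, so in practice the flagged columns should be reviewed before deletion.

```python
def drop_unique_columns(rows):
    """Remove columns whose values are all distinct (likely identifiers)."""
    if len(rows) < 2:
        # With fewer than two rows, uniqueness tells us nothing.
        return [dict(row) for row in rows]
    n = len(rows)
    unique_cols = {
        col for col in rows[0]
        if len({row[col] for row in rows}) == n
    }
    return [
        {key: value for key, value in row.items() if key not in unique_cols}
        for row in rows
    ]


# Hypothetical sample data: "user_id" is unique per row, the rest are not.
rows = [
    {"user_id": 101, "age": 34, "city": "Paris"},
    {"user_id": 102, "age": 34, "city": "Lyon"},
    {"user_id": 103, "age": 51, "city": "Paris"},
]
cleaned = drop_unique_columns(rows)
print(cleaned)
```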
Handling missing values
There are three ways to deal with missing values: use the features that contain missing values as they are; delete the features that contain missing values (effective when an attribute is mostly missing and has only a few valid values); or fill in the missing values (imputation).
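The second option, deleting features that are mostly missing, might look like the sketch below. It assumes missing values are stored as `None`. The helper name `drop_sparse_columns` and the threshold `max_missing_ratio` are illustrative choices, not values given in the text.

```python
def drop_sparse_columns(rows, max_missing_ratio=0.8):
    """Drop columns whose share of missing (None) values exceeds the threshold."""
    if not rows:
        return []
    n = len(rows)
    keep = [
        col for col in rows[0]
        if sum(row[col] is None for row in rows) / n <= max_missing_ratio
    ]
    return [{col: row[col] for col in keep} for row in rows]


# Hypothetical sample data: "fax" is missing in 3 of 4 rows.
rows = [
    {"age": 34, "fax": None},
    {"age": None, "fax": None},
    {"age": 51, "fax": None},
    {"age": 29, "fax": "555-0100"},
]
reduced = drop_sparse_columns(rows, max_missing_ratio=0.5)
print(reduced)
```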
Common imputation methods include mean imputation, similar-mean (within-class mean) imputation, modeling and prediction, high-dimensional mapping, multiple imputation, maximum likelihood estimation, compressed sensing and matrix completion.
(1) Mean imputation
If distances between values of the attribute are meaningful (a numeric attribute), the missing value is filled with the mean of the attribute's valid values;
If distances are not meaningful (a categorical attribute), the missing value is filled with the mode of the attribute's valid values. Note that mode imputation can worsen skew in the data, because every missing value is assigned to the most frequent category.
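A minimal sketch of both cases, assuming missing values are stored as `None`. The function names `impute_mean` and `impute_mode` are illustrative.

```python
from statistics import mean, mode


def impute_mean(values):
    """Fill missing numeric values with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Fill missing categorical values with the most frequent observed one."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]


ages = [20, None, 40, 30]
colors = ["red", "blue", None, "red"]

print(impute_mean(ages))    # [20, 30, 40, 30]
print(impute_mode(colors))  # ['red', 'blue', 'red', 'red']
```

In the second call the share of "red" rises from two of three observed values to three of four, which is the skew effect of mode imputation in miniature.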
(2) Similar-mean imputation
The samples are first grouped into classes, and each missing value is then filled with the mean of the valid values in the class that its sample belongs to.
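The sketch below illustrates this, assuming the class of each sample is already known and stored in a column. The text does not say how the classes are obtained, so the `species` grouping column here is purely hypothetical.

```python
from collections import defaultdict
from statistics import mean


def impute_group_mean(rows, group_key, target_key):
    """Fill missing target values with the mean of the same group."""
    observed = defaultdict(list)
    for row in rows:
        if row[target_key] is not None:
            observed[row[group_key]].append(row[target_key])
    # A group with no observed values would raise KeyError below;
    # a real pipeline would need a fallback such as the overall mean.
    group_means = {group: mean(vals) for group, vals in observed.items()}
    return [
        {**row, target_key: group_means[row[group_key]]}
        if row[target_key] is None else dict(row)
        for row in rows
    ]


# Hypothetical sample data.
rows = [
    {"species": "cat", "weight": 4.0},
    {"species": "cat", "weight": None},
    {"species": "cat", "weight": 6.0},
    {"species": "dog", "weight": 20.0},
    {"species": "dog", "weight": None},
]
filled = impute_group_mean(rows, "species", "weight")
print([row["weight"] for row in filled])  # [4.0, 5.0, 6.0, 20.0, 20.0]
```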
(3) Modeling and prediction
The attribute with missing values is treated as the prediction target. The data set is split into two parts according to whether that attribute is missing: the samples where it is present are used to train a machine learning model, and the model then predicts the attribute for the samples where it is missing.
The fundamental weakness of this method is that if the other attributes are unrelated to the missing attribute, the prediction is meaningless; conversely, if the prediction is very accurate, the missing attribute is largely redundant and need not be included in the data set. In practice the situation usually lies somewhere in between.
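As a rough illustration, the sketch below uses ordinary least squares on a single predictor as a stand-in for "the existing machine learning algorithm". The model choice, the helper names and the sample columns are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def impute_by_regression(rows, predictor, target):
    """Train on rows where target is present, predict it where missing."""
    train = [row for row in rows if row[target] is not None]
    slope, intercept = fit_line(
        [row[predictor] for row in train],
        [row[target] for row in train],
    )
    return [
        {**row, target: slope * row[predictor] + intercept}
        if row[target] is None else dict(row)
        for row in rows
    ]


# Hypothetical sample data with a perfectly linear relationship.
rows = [
    {"height": 150, "weight": 50.0},
    {"height": 160, "weight": 60.0},
    {"height": 170, "weight": 70.0},
    {"height": 180, "weight": None},
]
filled = impute_by_regression(rows, "height", "weight")
print(filled[3]["weight"])  # 80.0
```

Because the toy data is exactly linear, the prediction is exact here, which is the "attribute is redundant" extreme described above.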
(4) High-dimensional mapping
The attribute is mapped to a higher-dimensional space using one-hot encoding. An attribute with K discrete values is expanded into K+1 binary attributes; if the original value is missing, the (K+1)-th attribute is set to 1.
This method is the most faithful to the data: it keeps all the original information, including the fact that a value is missing, without adding any invented information. The disadvantage is that if all variables are treated this way during preprocessing, the dimensionality of the data increases greatly, so the amount of computation grows substantially, and the method only works well when the sample size is large.
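A small sketch of this encoding, assuming missing values are stored as `None` and the K categories are known in advance. The function name `one_hot_with_missing` is illustrative.

```python
def one_hot_with_missing(values, categories):
    """Encode each value as K+1 binary flags; the last flag marks 'missing'."""
    encoded = []
    for value in values:
        vector = [0] * (len(categories) + 1)
        if value is None:
            vector[-1] = 1                        # (K+1)-th attribute: missing
        else:
            vector[categories.index(value)] = 1   # one of the K categories
        encoded.append(vector)
    return encoded


colors = ["red", None, "blue"]
encoded = one_hot_with_missing(colors, ["red", "green", "blue"])
print(encoded)  # [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
```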
(5) Multiple imputation
Multiple imputation treats the values to be filled in as random. In practice, the missing values are usually estimated and then perturbed with different noise to form several candidate sets of imputed values, and the most suitable set is chosen according to some selection criterion.
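The generation of several candidate sets can be sketched as follows. The noise model here (Gaussian draws centered on the observed mean, with the observed standard deviation) is an assumption made for the example, and the final step averages an estimate over all sets, a common alternative to selecting a single set.

```python
import random
from statistics import mean, stdev


def multiple_imputations(values, m=5, seed=0):
    """Build m completed copies, each filling gaps with a different noisy draw."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = [v for v in values if v is not None]
    center, spread = mean(observed), stdev(observed)
    return [
        [rng.gauss(center, spread) if v is None else v for v in values]
        for _ in range(m)
    ]


values = [2.0, None, 4.0, 6.0, None]
completed = multiple_imputations(values, m=5)

# Estimate the mean on each completed set, then average the estimates.
pooled_mean = mean(mean(dataset) for dataset in completed)
print(len(completed), pooled_mean)
```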
(6) Compressed sensing and matrix completion
(7) Manual imputation
Imputation only fills in unknown values with a subjective estimate, which may not match the objective facts. In many cases, filling in missing values manually, based on an understanding of the domain, works better.