
Anomaly Detection (Ⅱ) —— Traditional Statistical Methods

The effectiveness of statistical methods depends heavily on whether the statistical model assumptions made about the given data actually hold.

The general idea of statistical methods for anomaly detection is to learn a generative model that fits the given data set, and then identify the objects in low-probability regions of the model as outliers.

For example, points lying more than 3σ from the mean of a normal distribution can be treated as outliers, as can points falling beyond the whiskers (1.5 IQR past the quartiles) of a box plot.

Depending on how the model is specified and learned, statistical methods for anomaly detection can be divided into two categories: parametric methods and nonparametric methods.

The parametric method assumes that normal data objects are generated by a parametric distribution with parameter $\Theta$. The probability density function $f(x; \Theta)$ of the parametric distribution gives the probability that the distribution generates object $x$: the smaller this value, the more likely $x$ is an outlier.

The nonparametric method does not assume an a priori statistical model, but instead tries to determine the model from the input data. Nonparametric methods usually assume that the number and nature of the parameters are flexible rather than fixed in advance (so "nonparametric" does not mean the model is completely parameter-free; learning a model from data with no parameters at all would be impossible).

Data containing only one attribute or variable is called univariate data. If we assume that the data are generated by a normal distribution, we can learn the parameters of that normal distribution from the input data and identify low-probability points as outliers.

Suppose the input data set is $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ and the samples in the data set obey a normal distribution $N(\mu, \sigma^2)$; then the parameters $\mu$ and $\sigma$ can be estimated from the samples:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)} - \mu\right)^2$$

After estimating the parameters, we can use the probability density function to compute the probability that a data point was generated by this distribution. The probability density function of the normal distribution is

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

If the probability computed for a data point is lower than some threshold $\varepsilon$, the point can be regarded as an outlier.

The threshold is an empirical value; one can choose as the final threshold the value that maximizes the evaluation metric on a validation set (i.e., gives the best detection performance).
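A minimal sketch of this threshold search, assuming a labeled validation set and F1 as the evaluation metric (the helper names and the metric choice are illustrative assumptions, not from the original):

```python
import numpy as np
from sklearn.metrics import f1_score

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at the points x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def select_threshold(p_val, y_val, n_steps=1000):
    """Sweep candidate thresholds over the validation densities and keep the best F1.

    p_val: densities of the validation points; y_val: labels (1 = outlier).
    """
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), n_steps):
        f1 = f1_score(y_val, (p_val < eps).astype(int), zero_division=0)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```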

For example, by the commonly used 3σ rule, if a data point lies outside the range $(\mu - 3\sigma,\ \mu + 3\sigma)$, it is very likely an outlier.
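A minimal sketch of the 3σ rule on a 1-D NumPy array (the function name is illustrative):

```python
import numpy as np

def three_sigma_outliers(x):
    """Return the points lying outside (mu - 3*sigma, mu + 3*sigma)."""
    mu, sigma = x.mean(), x.std()
    return x[np.abs(x - mu) > 3 * sigma]
```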

This method can also be paired with visualization. The box plot gives a simple statistical visualization of the data distribution; it is drawn from the lower and upper quartiles (Q1 and Q3) and the median of the data set. Outliers are usually defined as data less than Q1 - 1.5 IQR or greater than Q3 + 1.5 IQR.

Drawing a simple box plot in Python:
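A minimal example with matplotlib (the sample data here is invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(0, 1, 200), [5.0, -4.5]])  # two injected outliers

plt.boxplot(data, whis=1.5)  # whiskers at Q1 - 1.5*IQR and Q3 + 1.5*IQR
plt.title("Box plot; outliers drawn as individual points")
plt.show()
```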

Data involving two or more attributes or variables is called multivariate data. Many univariate anomaly detection methods can be extended to handle multivariate data; the core idea is to transform the multivariate anomaly detection task into a univariate one. For example, when the normal-distribution-based univariate outlier detection above is extended to the multivariate case, the mean and standard deviation of each dimension can be estimated. For the $j$-th dimension:

$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)} - \mu_j\right)^2$$

The probability density function used when computing the probability is then the product of the per-dimension densities:

$$p(x) = \prod_{j=1}^{n} p(x_j;\, \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\!\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$$
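A direct NumPy sketch of this product density (the names are illustrative):

```python
import numpy as np

def independent_gaussian_pdf(X, mu, sigma):
    """Density of the rows of X under independent per-dimension Gaussians.

    X: array of shape (m, d); mu, sigma: per-dimension parameters of shape (d,).
    """
    per_dim = np.exp(-(X - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return per_dim.prod(axis=1)  # product over the d independent dimensions
```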

This holds only when the features of the individual dimensions are independent of one another. If there are correlations between features, a multivariate Gaussian distribution should be used instead.
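For reference, the density of a multivariate Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$ (a standard formula) is

$$p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$$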

In many cases it is assumed that the data are generated by a single normal distribution. When the actual data are complex, this assumption is too simple, and the data can instead be assumed to be generated by a mixture of parametric distributions.
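A minimal sketch of this idea using scikit-learn's GaussianMixture (the two components and the 1% likelihood cutoff are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_outliers(X, n_components=2, quantile=0.01):
    """Fit a Gaussian mixture and flag the lowest-likelihood points as outliers."""
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    log_probs = gmm.score_samples(X)            # per-sample log-likelihood
    threshold = np.quantile(log_probs, quantile)
    return log_probs < threshold                # True = outlier
```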

In nonparametric methods for anomaly detection, the model of the "normal data" is learned from the input data rather than assumed a priori. Generally speaking, nonparametric methods make fewer assumptions about the data and can therefore be applied in more settings.

Example: using a histogram to detect outliers.

The histogram is a commonly used nonparametric statistical model that can be used to detect outliers. The process involves the following two steps:

Step 1: Construct the histogram. Use the input data (training data) to construct a histogram. The histogram can be univariate or multivariate (if the input data is multidimensional).

Although nonparametric methods do not assume any prior statistical model, they usually do require the user to provide parameters in order to learn from the data. For example, the user must specify the type of histogram (equal width or equal depth) and other parameters (the number of bins in the histogram or the size of each bin, etc.). Unlike in parametric methods, these parameters do not specify the type of the data distribution.

Step 2: Detect outliers. To determine whether an object is an outlier, check it against the histogram. In the simplest approach, if the object falls into one of the histogram's bins, it is regarded as normal; otherwise it is regarded as an outlier.

In more refined approaches, the histogram can be used to assign each object an outlier score, for example, the reciprocal of the volume of the bin the object falls into.
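A minimal univariate sketch of both variants, scoring by inverse bin height (the bin count and helper names are illustrative):

```python
import numpy as np

def histogram_scores(train, test, bins=10):
    """Score test points by the inverse height of the training-histogram bin they fall into."""
    heights, edges = np.histogram(train, bins=bins, density=True)
    # Map each test point to a bin; points outside the training range land in the edge bins.
    idx = np.clip(np.searchsorted(edges, test, side="right") - 1, 0, len(heights) - 1)
    h = heights[idx]
    scores = np.full(test.shape, np.inf)  # empty bin => infinite score (clear outlier)
    scores[h > 0] = 1.0 / h[h > 0]
    return scores
```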

One drawback of using histograms as a nonparametric model for outlier detection is that it is hard to choose an appropriate bin size. On the one hand, if the bins are too narrow, many normal objects fall into empty or sparse bins and are mistaken for outliers. On the other hand, if the bins are too wide, outliers may slip into some frequent bins and thereby "masquerade" as normal.

The full name of HBOS is Histogram-Based Outlier Score. It is a combination of univariate methods: it cannot model dependencies between features, but it is fast and friendly to large data sets. Its basic assumption is that the dimensions of the data set are mutually independent. Each dimension is divided into bins, and the higher a bin's density, the lower the anomaly score of the points in it.

HBOS algorithm flow:

1. Build a histogram for each data dimension. For categorical data, count the frequency of each value and compute the relative frequency. For numerical data, depending on how the values are distributed, one of the following two methods is used:

Static bin width histogram: the standard histogram construction method, using k equal-width bins over the value range. The frequency (relative count) of the samples falling into each bin serves as an estimate of the density (bin height). Time complexity: $O(n)$.

Dynamic bin width histogram: first sort all values, then place a fixed number of $N/k$ consecutive values into each bin, where $N$ is the total number of instances and $k$ is the number of bins. The area of a bin in the histogram represents the number of instances; because the width of a bin is determined by its first and last values and all bins have the same area, the height of each bin can be computed. This means that bins with a large span have low height, i.e., low density, with one exception: when more than $N/k$ values are equal, they are allowed to fall into the same bin, which may then hold more than $N/k$ values.

Time complexity: $O(n \log n)$ (dominated by the sort).

2. Compute an independent histogram for each dimension, in which the height of each bin represents a density estimate. The histograms are then normalized so that their maximum height is 1 (ensuring that each feature is weighted equally in the outlier score). Finally, the HBOS value of each instance $p$ is computed by the following formula:

$$HBOS(p) = \sum_{i=1}^{d} \log\!\left(\frac{1}{\mathrm{hist}_i(p)}\right)$$

Derivation:

Assume the probability density of the $i$-th feature of sample $p$ is $P_i(p)$. Under the independence assumption, the probability density of $p$ is

$$P(p) = P_1(p)\,P_2(p)\cdots P_d(p)$$

Taking the logarithm of both sides:

$$\log P(p) = \sum_{i=1}^{d}\log P_i(p)$$

The larger the probability density, the smaller the anomaly score should be, so to ease scoring we multiply both sides by $-1$:

$$-\log P(p) = \sum_{i=1}^{d}-\log P_i(p)$$

Finally:

$$HBOS(p) = -\log P(p) = \sum_{i=1}^{d}\log\!\left(\frac{1}{P_i(p)}\right)$$
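A minimal static-bin-width HBOS sketch following the formula above (the bin count, the normalization to a maximum height of 1, and the guard against empty bins are as described; the function name is illustrative):

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Static-width HBOS: sum over dimensions of log(1 / normalized bin height)."""
    n, d = X.shape
    scores = np.zeros(n)
    for i in range(d):
        heights, edges = np.histogram(X[:, i], bins=n_bins)
        heights = heights / heights.max()  # normalize: the tallest bin gets height 1
        idx = np.clip(np.searchsorted(edges, X[:, i], side="right") - 1, 0, n_bins - 1)
        h = np.maximum(heights[idx], 1e-12)  # guard against zero-height bins
        scores += np.log(1.0 / h)
    return scores  # higher score = more anomalous
```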

1. Statistical methods for anomaly detection learn a model from the data in order to distinguish normal data objects from outliers. An advantage of statistical methods is that the resulting anomaly detection can be statistically justifiable. Of course, this holds only if the statistical assumptions made about the data match reality.

2. HBOS performs well at global anomaly detection, but it cannot detect local outliers. However, HBOS is much faster than standard algorithms, especially on large data sets.
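In practice, rather than the sketch above, one would typically use an off-the-shelf implementation such as the one in PyOD (the usage below assumes PyOD is installed; the placeholder data and `n_bins=10` are illustrative):

```python
import numpy as np
from pyod.models.hbos import HBOS

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3))  # placeholder data for illustration

clf = HBOS(n_bins=10)
clf.fit(X_train)
scores = clf.decision_scores_  # higher = more anomalous
labels = clf.labels_           # 0 = inlier, 1 = outlier (per the contamination rate)
```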