
Comparison of common activation functions


In a neuron, the inputs are weighted and summed, and the result is then passed through a function; this function is the activation function.

Without an activation function, the output of each layer is just a linear function of its inputs, so no matter how many layers the neural network has, the output is simply a linear combination of the inputs.

If an activation function is used, a nonlinear element is introduced into the neuron, which allows the neural network to approximate arbitrary nonlinear functions, so that it can be applied to the many nonlinear models encountered in practice.

Sigmoid

Formula: σ(x) = 1 / (1 + e^(-x))

Curve: an S-shaped curve that rises from 0 to 1, passing through 0.5 at x = 0.

Also called the logistic function, it is used for the output of hidden-layer neurons.

It takes values in the range (0, 1).

It maps a real number to the interval (0,1), and can be used for binary classification.

It works better when the differences between features are complicated or not particularly large.
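To make the mapping concrete, here is a minimal NumPy sketch of the sigmoid (the helper name and example values are illustrative, not from the original article):

```python
import numpy as np

def sigmoid(x):
    """Map any real-valued input to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Squash a few raw scores into (0, 1); large negative scores go toward 0,
# large positive scores toward 1.
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(sigmoid(scores))  # approx [0.018 0.269 0.5 0.731 0.982]
```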

Disadvantages of sigmoid:

The activation function is computationally expensive; when backpropagating the error gradient, the derivative involves division.

During backpropagation the gradient can easily vanish, which makes it impossible to train deeper networks.

The following explains why the gradient vanishes:

The backpropagation algorithm has to evaluate the gradient of the activation function, and sigmoid is not a good choice here. In backpropagation, the derivative of the sigmoid is:

σ'(x) = σ(x) * (1 - σ(x))

(Graph of the sigmoid function and its derivative: the derivative peaks at 0.25 at x = 0 and falls toward 0 on both sides.)

From the graph we can see that the derivative is at most 0.25 and quickly approaches 0 as the input moves away from 0; multiplying many such small factors together across layers produces the "vanishing gradient" phenomenon.
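As a rough numerical sketch of this effect (the layer count of 10 is just an illustrative assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))  # 0.25, the largest value the derivative can take
# Each sigmoid layer scales the gradient by at most 0.25 (ignoring the weights),
# so after 10 layers that contribution is at most 0.25 ** 10.
print(0.25 ** 10)         # about 9.5e-07
```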

Tanh

Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Curve: an S-shaped curve similar to the sigmoid, but centered at 0 and ranging over (-1, 1).

Also known as the hyperbolic tangent function.

Its values lie in the range (-1, 1).

tanh works well when the differences between features are significant, and it keeps amplifying the feature effect over the course of iteration.

The difference from sigmoid is that tanh is zero-mean, so in practice tanh usually works better than sigmoid.
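A small sketch of the zero-mean property on symmetric inputs (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)        # symmetric sample points around 0
print(np.tanh(x))                    # outputs in (-1, 1), symmetric around 0
print(sigmoid(x))                    # outputs in (0, 1), all positive

# tanh outputs are zero-centered while sigmoid outputs average around 0.5,
# which is one reason tanh often behaves better in practice.
print(np.tanh(x).mean(), sigmoid(x).mean())  # ~0.0 vs ~0.5
```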

Rectified Linear Unit (ReLU) - used for hidden layer neuron outputs

Formula: ReLU(x) = max(0, x)

Curve: zero for all negative inputs, and the identity line y = x for positive inputs.

When the input signal is < 0, the output is 0; when it is > 0, the output equals the input.
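A minimal NumPy sketch of this rule (the helper name is illustrative):

```python
import numpy as np

def relu(x):
    """Output 0 for negative inputs and pass positive inputs through unchanged."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))  # [0.  0.  0.  0.5 2. ]
```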

Advantages of ReLU:

Krizhevsky et al. found that SGD converges much faster with ReLU than with sigmoid/tanh.

Disadvantages of ReLU:

Training can be quite "fragile", and neurons can easily "die". For example, if a very large gradient flows through a ReLU neuron, then after the parameter update the neuron may never activate on any data again, so its gradient will be zero from then on.

If the learning rate is set too high, it is quite possible that as much as 40% of the neurons in the network will be "dead".

Softmax - used for multi-class neural network output

Formula: σ(z)_j = e^(z_j) / Σ_k e^(z_k), for j = 1, ..., K

An example of what the formula means:

It means that if one z_j is much larger than the others, then its mapped component is close to 1 and the others are close to 0. The main application is multi-class classification.

The first reason for taking the exponent is to imitate the behavior of max: the larger values are made even larger.

The second reason is that we need a function that is differentiable.
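Here is a small NumPy sketch of softmax, including the usual max-subtraction trick for numerical stability (this trick is a common implementation detail, not something stated in the original article):

```python
import numpy as np

def softmax(z):
    """Map a score vector to probabilities that sum to 1."""
    z = z - np.max(z)   # subtracting the max avoids overflow in exp and does not change the result
    e = np.exp(z)
    return e / e.sum()

z = np.array([3.0, 1.0, 0.2])
p = softmax(z)
print(p)         # approx [0.836 0.113 0.051]; the largest score dominates
print(p.sum())   # 1.0
```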

Comparison of Sigmoid and ReLU:

Sigmoid suffers from the vanishing gradient problem; the derivative of ReLU does not have this problem, and its expression is:

ReLU'(x) = 1 if x > 0, and 0 if x < 0

(Curve: a step function, equal to 0 for negative inputs and 1 for positive inputs.)

Compared to the sigmoid family of functions, the main changes are:

1) unilateral inhibition

2) relatively wide excitatory boundaries

3) sparse activation.

Distinction between Sigmoid and Softmax:

Softmax is a generalization of the logistic function that "squashes" (maps) a K-dimensional vector z of arbitrary real values to a K-dimensional vector σ(z) of real values in the range (0, 1) that add up to 1.

sigmoid maps a real value to the interval (0,1), which is used for binary classification.

Softmax maps a k-dimensional vector of real values (a1, a2, a3, a4, ...) to a vector (b1, b2, b3, b4, ...), where each bi lies between 0 and 1 and the outputs sum to 1.0. Each bi can therefore be interpreted as a probability, and multi-class classification can be performed by comparing the magnitudes of the bi.

For binary classification, sigmoid and softmax are equivalent, both minimizing the cross-entropy loss, while softmax can also be used for multi-class problems.

Softmax is an extension of the sigmoid: when the number of categories k = 2, softmax regression degenerates into logistic regression. Specifically, when k = 2, the hypothesis function of softmax regression is:

h_θ(x) = 1 / (e^(θ1ᵀx) + e^(θ2ᵀx)) * [ e^(θ1ᵀx), e^(θ2ᵀx) ]

By taking advantage of the redundancy in the softmax regression parameters and subtracting the vector θ1 from both parameter vectors, we get:

h_θ(x) = 1 / (1 + e^((θ2 - θ1)ᵀx)) * [ 1, e^((θ2 - θ1)ᵀx) ]

Finally, writing θ′ for θ2 - θ1, the probability of one of the categories predicted by the softmax regressor is

1 / (1 + e^(θ′ᵀx))

and the probability of the other category is

e^(θ′ᵀx) / (1 + e^(θ′ᵀx)) = 1 - 1 / (1 + e^(θ′ᵀx))

This is consistent with logistic regression.
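A quick numerical check of this equivalence (the scores z1 and z2 are arbitrary example values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z1, z2 = 2.0, -0.5                           # scores playing the role of theta1.T @ x and theta2.T @ x
p_softmax = softmax(np.array([z1, z2]))[0]   # probability of the first class under softmax
p_sigmoid = sigmoid(z1 - z2)                 # logistic regression on the score difference
print(p_softmax, p_sigmoid)                  # both approx 0.924
```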

Softmax models a multinomial distribution, whereas logistic regression is based on the Bernoulli distribution.

Multiple logistic regressions can be combined to achieve the same multi-class effect, but with softmax regression the classes are mutually exclusive, i.e. an input can be assigned to only one class; with multiple logistic regressions the output categories are not mutually exclusive, e.g. the word "apple" can belong to both the "fruit" and the "3C" categories, as sketched below.
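A minimal sketch of that distinction, with made-up scores and label names:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical raw scores for the labels ["fruit", "3C", "vehicle"].
scores = np.array([2.0, 1.8, -1.0])

# Softmax: mutually exclusive classes, probabilities compete and sum to 1.
print(softmax(scores))     # pick the single argmax class

# Independent sigmoids (multiple logistic regressions): each label is decided
# on its own, so "apple" can score high for both "fruit" and "3C".
print(sigmoid(scores))     # threshold each value independently, e.g. at 0.5
```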

Which activation function to choose depends on the advantages and disadvantages of each; for example:

If you use ReLU, set the learning rate carefully and make sure the network does not end up with many "dead" neurons; if that is hard to avoid, you can try Leaky ReLU, PReLU, or Maxout, as sketched below.
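A small sketch of Leaky ReLU, one common remedy (the slope alpha = 0.01 is a typical but assumed default; PReLU would learn this slope instead of fixing it):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but keep a small slope alpha for negative inputs so the
    neuron still receives a gradient there and is less likely to "die"."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x))  # [-0.03  -0.005  0.  0.5  3. ]
```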

