Comparison of common activation functions
As shown in the figure below, a neuron's inputs are weighted and summed, and the result is then passed through a function: the Activation Function.
Without an activation function, each layer's output is simply a linear function of the previous layer's input, so no matter how many layers the neural network has, the output is just a linear combination of the inputs.
With an activation function, a nonlinear element is introduced into the neurons, which allows the neural network to approximate arbitrary nonlinear functions and therefore to be applied to a wide range of nonlinear models.
Sigmoid
Formula: sigmoid(x) = 1 / (1 + e^(-x))
Curve: (the sigmoid curve figure is omitted here)
Also called the Logistic function, it is used in the output of the hidden layer neurons
Taking values in the range of (0,1)
It maps a real number to the interval (0,1), and can be used for binary classification.
It works well when the differences between features are complicated or not particularly large.
Disadvantages of sigmoid:
The activation function is computationally expensive, and when backpropagating the error gradient, computing the derivative involves division.
During backpropagation the gradient can easily vanish, which makes it impossible to finish training deeper networks.
The following is an explanation for the disappearance of the gradient:
The backpropagation algorithm has to evaluate the gradient of the activation function, and here sigmoid is not a good choice. The derivative of the sigmoid is:
sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
The graph of the sigmoid's original function and derivative is as follows:
From the graph, we can see that the derivative peaks at only 0.25 (at x = 0) and decays toward 0 very quickly as |x| grows; multiplying many such small factors across layers produces the "vanishing gradient" phenomenon.
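The following is a minimal NumPy sketch (an illustration added here, not part of the original post) of the sigmoid and its derivative; it shows that the derivative never exceeds 0.25 and is effectively zero for large |x|, which is exactly what feeds the vanishing gradient.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
print(sigmoid(xs))       # values squashed into (0, 1)
print(sigmoid_grad(xs))  # peak of 0.25 at x = 0, nearly 0 for large |x|
```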
Tanh
Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Curve: (the tanh curve figure is omitted here)
Also known as the hyperbolic tangent function
The range is (-1, 1).
tanh works well when the differences between features are significant, and it keeps enlarging the feature effect as training iterates.
The difference from sigmoid is that tanh is zero-mean, so in practice tanh usually works better than sigmoid.
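A small sketch (illustrative, not from the original post) of the zero-mean point above: for zero-centered inputs, tanh activations stay roughly zero-centered, while sigmoid activations are all positive with a mean near 0.5.

```python
import numpy as np

x = np.random.randn(100_000)          # roughly zero-mean inputs

tanh_out = np.tanh(x)                 # outputs in (-1, 1)
sig_out = 1.0 / (1.0 + np.exp(-x))    # outputs in (0, 1)

print(tanh_out.mean())  # close to 0: activations stay roughly zero-centered
print(sig_out.mean())   # close to 0.5: activations are always positive
```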
Rectified Linear Unit (ReLU) - used for hidden layer neuron outputs
Formula: ReLU(x) = max(0, x)
Curve: (the ReLU curve figure is omitted here)
When the input signal is < 0 the output is 0; when it is > 0 the output equals the input.
Advantages of ReLU:
Krizhevsky et al. found that SGD training converges much faster with ReLU than with sigmoid/tanh.
Disadvantages of ReLU:
Training can be fragile: for example, if a very large gradient flows through a ReLU neuron, the parameter update may push it into a state where it never activates on any data again, so its gradient is zero from then on and the neuron is effectively "dead".
If the learning rate is set too high, you may find that as much as 40% of the neurons in the network are "dead".
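A minimal sketch (added for illustration, not from the original post) of ReLU, its derivative, and the "dead neuron" situation: once the pre-activation is negative for every input, both the output and the gradient are zero, so nothing can update the neuron anymore.

```python
import numpy as np

def relu(x):
    """ReLU: 0 for x < 0, identity for x > 0."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 where x > 0, 0 elsewhere."""
    return (x > 0).astype(float)

# A "dead" neuron: its pre-activation is negative for every input,
# so its output and its gradient are both zero and it can never recover.
pre_activations = np.array([-3.2, -0.7, -1.5, -4.1])
print(relu(pre_activations))       # [0. 0. 0. 0.]
print(relu_grad(pre_activations))  # [0. 0. 0. 0.] -> no gradient flows back
```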
Softmax - used for multiclassification neural network output
Formula: softmax(z)_j = e^(z_j) / (sum over k of e^(z_k)), for j = 1, ..., K
An example of what the formula means:
It means that if one z_j is much larger than the others, its mapped component is close to 1 and the others are close to 0. The main application is multi-class classification.
The first reason for taking the exponential is to imitate the behavior of max: it makes the larger values relatively even larger.
The second reason is that we need a function that is differentiable.
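A short softmax sketch (added here as an illustration; the max-subtraction trick for numerical stability is my own addition, not something stated in the original post):

```python
import numpy as np

def softmax(z):
    """Softmax: maps a K-dim vector of reals to a probability vector."""
    z = z - np.max(z)      # subtract the max for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / e.sum()

z = np.array([3.0, 1.0, 0.2])
p = softmax(z)
print(p)          # approx [0.836, 0.113, 0.051]: the largest logit dominates
print(p.sum())    # 1.0
```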
Comparison of Sigmoid and ReLU:
Sigmoid suffers from the vanishing gradient problem; the derivative of ReLU does not have this problem. Its expression is:
ReLU'(x) = 1 for x > 0, and 0 for x < 0
The curve is shown in the figure (omitted here).
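A quick numerical illustration (not from the original post) of this difference: on the positive side the sigmoid gradient shrinks rapidly toward zero while the ReLU gradient stays at 1.

```python
import numpy as np

x = np.array([2.0, 5.0, 10.0])

sig = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sig * (1.0 - sig)    # shrinks toward 0 as x grows
relu_grad = (x > 0).astype(float)   # stays exactly 1 for all positive x

print(sigmoid_grad)  # approx [0.105, 0.0066, 0.000045]
print(relu_grad)     # [1. 1. 1.]
```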
Compared to the sigmoid family of functions, the main changes are:
1) unilateral inhibition
2) relatively wide excitatory boundaries
3) sparse activation.
Distinction between Sigmoid and Softmax:
Softmax is a generalization of the logistic function that "squashes" (maps) a K-dimensional vector z of arbitrary real values to a K-dimensional vector σ(z) of real values in the range (0, 1) that add up to 1.
sigmoid maps a real value to the interval (0,1), which is used for binary classification.
Softmax maps a K-dimensional vector of real values (a1, a2, a3, a4, ...) to a vector (b1, b2, b3, b4, ...) where each bi lies between 0 and 1 and the outputs sum to 1, so the bi can be interpreted as probabilities; the multi-class decision can then be made from the magnitude of the bi.
For a binary classification problem with a cross-entropy loss, sigmoid and softmax are equivalent, while softmax also handles multi-class problems.
Softmax is an extension of the sigmoid because, when the number of categories k = 2, softmax regression degenerates into logistic regression. Specifically, when k = 2, the hypothesis function of softmax regression is:
h(x) = 1 / (e^(θ1·x) + e^(θ2·x)) * [e^(θ1·x), e^(θ2·x)]
Taking advantage of the redundancy in the softmax regression parameters and subtracting the vector θ1 from both parameter vectors, we get:
h(x) = [1 / (1 + e^((θ2-θ1)·x)), e^((θ2-θ1)·x) / (1 + e^((θ2-θ1)·x))]
Finally, writing θ′ for θ2 - θ1, the probability that the softmax regressor assigns to one of the categories is 1 / (1 + e^(θ′·x)), and the probability of the other category is 1 - 1 / (1 + e^(θ′·x)) = e^(θ′·x) / (1 + e^(θ′·x)), which is consistent with logistic regression.
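A tiny numerical check (added for illustration; the logit values are made up) that a 2-class softmax gives the same probability as a sigmoid applied to the difference of the two logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical 2-class logits for one sample, e.g. z1 = θ1·x and z2 = θ2·x.
z1, z2 = 0.7, 2.3

p_softmax = softmax(np.array([z1, z2]))[1]      # probability of class 2
p_sigmoid = 1.0 / (1.0 + np.exp(-(z2 - z1)))    # sigmoid of the logit difference

print(p_softmax, p_sigmoid)  # identical: 2-class softmax is logistic regression
```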
Softmax is modeled with a multinomial distribution, whereas logistic regression is based on the Bernoulli distribution.
Multiple binary logistic regressions can also be combined to achieve multi-class classification, but with softmax regression the classes are mutually exclusive, i.e., an input can be assigned to only one class; with multiple logistic regressions the output categories are not mutually exclusive, e.g., the word "apple" can belong to both the "fruit" and "3C" categories.
The choice of activation function depends on the advantages and disadvantages of each, for example:
If you use ReLU, be careful with the learning rate so that the network does not end up with many "dead" neurons; if that is hard to avoid, try Leaky ReLU, PReLU, or Maxout (a simple Leaky ReLU sketch follows below).
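A minimal Leaky ReLU sketch (added as an illustration; the slope alpha = 0.01 is a common default, not a value from the original post): negative inputs keep a small slope instead of a hard zero, so a little gradient still flows and neurons are less likely to die.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for x > 0, a small slope alpha for x <= 0."""
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-3.0, -0.5, 0.0, 2.0])))  # [-0.03  -0.005  0.  2.]
```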