Several common recurrent neural network structures RNN, LSTM, GRU

Traditional approaches to text processing typically use TF-IDF vectors as input features. Such a representation discards the order of the words in the input text. Among neural models, an ordinary feed-forward network, such as a convolutional neural network, usually accepts a fixed-length vector as input. When a convolutional neural network models text, it takes a variable-length sequence of characters or words and converts it into a fixed-length vector representation by means of sliding windows plus pooling. This captures some local features of the original text, but long-distance dependencies between two words remain difficult to learn.
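The loss of word order can be illustrated with a minimal sketch. TF-IDF reweights per-word counts, so plain word counts are used here as a simplified stand-in; the helper `bag_of_words` and the two sentences are made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring where each word appears."""
    return Counter(text.lower().split())

a = "the dog bit the man"
b = "the man bit the dog"

# The two sentences mean different things but contain the same words.
same_bag = bag_of_words(a) == bag_of_words(b)
same_sequence = a.split() == b.split()
print(same_bag, same_sequence)
```

Any weighting applied on top of these counts, TF-IDF included, assigns both sentences the same vector.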

Recurrent neural networks, on the other hand, handle variable-length, ordered input sequences well. They simulate the way a person reads a text: reading each word from front to back and encoding the useful information read so far into state variables, which gives the network a certain memory capacity and helps it understand the text that follows.

The structure of the network is shown in the following figure:

In the figure, t is the time step, x is the input layer, s is the hidden layer, and o is the output layer. The matrix W is the weight applied to the hidden layer's value from the previous time step when it is fed in as part of the input at the current time step.

Repeatedly substituting Eq. 2 into Eq. 1 gives:

Here f and g are activation functions, U is the weight matrix from the input layer to the hidden layer, and W is the weight matrix for the hidden-state transition from one time step to the next. In a text classification task, f can be the tanh or ReLU function, and g can be the softmax function.
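For reference, the standard formulation of these equations is sketched below. It assumes that Eq. 1 is the output equation and Eq. 2 is the hidden-state recurrence; V, the hidden-to-output weight matrix, is a symbol assumed here and not named in the text.

```latex
\begin{aligned}
o_t &= g(V s_t) \\
s_t &= f(U x_t + W s_{t-1}) \\
o_t &= g\bigl(V f(U x_t + W f(U x_{t-1} + W f(U x_{t-2} + \cdots)))\bigr)
\end{aligned}
```

The nested form shows that the output at step t depends on every earlier input x_t, x_{t-1}, x_{t-2}, and so on.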

By minimizing the loss (i.e., the distance between the output y and the true category), we can train the network so that the resulting recurrent neural network accurately predicts the category to which a text belongs. Compared with feed-forward networks such as convolutional neural networks, recurrent neural networks often achieve more accurate results because they can capture sequence order information.
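A minimal sketch of such a classifier's forward pass, in plain Python, may make this concrete. It assumes f = tanh, g = softmax, a zero initial hidden state, an output taken only at the last step, and cross-entropy as the loss; the matrix V, the toy weights, and all function names are illustrative assumptions.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_classify(xs, U, W, V):
    """Run s_t = tanh(U x_t + W s_{t-1}) over xs, then return softmax(V s_T)."""
    s = [0.0] * len(W)  # initial hidden state s_0 = 0 (an assumption)
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]
    return softmax(matvec(V, s))

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])

# Toy sizes: 2-dim inputs, 3 hidden units, 2 classes; weights are arbitrary.
U = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
V = [[0.7, -0.4, 0.3], [-0.5, 0.6, 0.1]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

probs = rnn_classify(xs, U, W, V)
loss = cross_entropy(probs, label=0)
reversed_probs = rnn_classify(xs[::-1], U, W, V)
```

Feeding the same inputs in reverse order yields different class probabilities, which is the order sensitivity that a bag-of-words representation lacks.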

RNNs are trained with the BPTT (backpropagation through time) algorithm.

The basic principle of BPTT is the same as that of the BP (backpropagation) algorithm, and it likewise consists of three steps:

1. Calculate the output value of each neuron in the forward direction;

2. Calculate the value of the error term of each neuron in the backward direction, which is the partial derivative of the error function E with respect to the weighted input of neuron j;

3. Calculate the gradient for each weight.

Finally, the weights are updated with stochastic gradient descent.
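The three steps and the weight update can be sketched for the simplest possible case. This is an illustrative toy under stated assumptions, not the derivation from the referenced article: a scalar RNN s_t = tanh(u*x_t + w*s_{t-1}) with s_0 = 0, a squared-error loss on the final state only, and made-up data and learning rate.

```python
import math

def forward(xs, u, w):
    """Return hidden states s_0..s_T for s_t = tanh(u*x_t + w*s_{t-1}), s_0 = 0."""
    states = [0.0]
    for x in xs:
        states.append(math.tanh(u * x + w * states[-1]))
    return states

def loss(xs, y, u, w):
    """Squared error on the final hidden state."""
    return 0.5 * (forward(xs, u, w)[-1] - y) ** 2

def bptt(xs, y, u, w):
    """Return (dE/du, dE/dw) computed by backpropagation through time."""
    states = forward(xs, u, w)            # step 1: forward pass
    grad_u = grad_w = 0.0
    d_state = states[-1] - y              # dE/ds_T
    for t in range(len(xs), 0, -1):
        # step 2: error term, i.e. dE/d(weighted input at step t)
        delta = d_state * (1.0 - states[t] ** 2)
        # step 3: accumulate gradients of the weights shared across steps
        grad_u += delta * xs[t - 1]
        grad_w += delta * states[t - 1]
        d_state = delta * w               # propagate the error to s_{t-1}
    return grad_u, grad_w

xs, y = [0.5, -1.0, 0.8, 0.3], 0.2
u, w, lr = 0.7, -0.4, 0.1
grad_u, grad_w = bptt(xs, y, u, w)
u_new, w_new = u - lr * grad_u, w - lr * grad_w   # one gradient descent update
```

Because the same u and w are used at every time step, their gradients are sums over all steps, which is the main difference from backpropagation in a feed-forward network.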

Refer to: /p/39a99c88a565

Finally, applying the chain rule gives the gradient of each weight, expressed in terms of the Jacobian matrix below:

Since the prediction error propagates backwards through each layer of the neural network, when the largest eigenvalue of the Jacobian matrix is greater than 1, the size of the gradient grows exponentially with distance from the output, resulting in exploding gradients; conversely, if the largest eigenvalue is less than 1, the size of the gradient shrinks exponentially, producing vanishing gradients. For an ordinary feed-forward network, vanishing gradients mean that deepening the network does not improve its predictions, because no matter how deep the network is, only the few layers closest to the output actually learn. In a recurrent neural network, each time step plays the role of a layer, so the same effect makes it difficult for the model to learn long-distance dependencies in the input sequence.
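The exponential behavior is easiest to see in a scalar toy case. The sketch below assumes a linear recurrence s_t = w * s_{t-1}, where the Jacobian at each step is just the number w; the function name and the chosen values are illustrative.

```python
def backprop_factor(w, steps):
    """Magnitude of d s_T / d s_0 for the linear recurrence s_t = w * s_{t-1}."""
    factor = 1.0
    for _ in range(steps):
        factor *= abs(w)   # each step multiplies by the 1x1 Jacobian
    return factor

for w in (0.9, 1.0, 1.1):
    print(w, backprop_factor(w, 50))
```

Over 50 steps, a factor slightly below 1 shrinks the gradient by more than two orders of magnitude, and a factor slightly above 1 grows it by about the same amount.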

A detailed derivation of RNN gradient descent can be found at: /p/44163528

Exploding gradients can be mitigated by gradient clipping: when the norm of the gradient exceeds a given threshold, the gradient is scaled down proportionally. Vanishing gradients are trickier and require changes to the model itself. Deep residual networks, an improvement on feed-forward networks, mitigate vanishing gradients through residual learning and thus make it possible to learn deeper network representations. For recurrent neural networks, models such as long short-term memory (LSTM) and its variant, the gated recurrent unit (GRU), largely compensate for the loss caused by vanishing gradients by adding a gating mechanism.
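Clipping by norm can be sketched in a few lines. This version assumes the L2 norm and a flat list of gradient values; the function name `clip_by_norm` and the threshold are illustrative.

```python
import math

def clip_by_norm(grad, max_norm):
    """Scale grad down proportionally if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5, rescaled to norm 1
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5, left as is
```

Scaling every component by the same factor keeps the direction of the gradient and only limits its length.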

The network architecture of LSTM is shown below:

Compared with a traditional recurrent neural network, the LSTM still computes h_t from x_t and h_{t-1}, but its internal structure is more carefully designed: it adds three gates, the input gate i_t, the forget gate f_t, and the output gate o_t, as well as an internal memory cell c_t. The input gate controls how much of the newly computed state is written into the memory cell; the forget gate controls how much of the information in the previous memory cell is forgotten; and the output gate controls how much the current output depends on the current memory cell.

In the classical LSTM model, the update formulas at step t are

where i_t is obtained by applying a linear transformation to the input x_t and the hidden-layer output h_{t-1} from the previous step, followed by the activation function σ. The input gate i_t is a vector whose elements are real numbers between 0 and 1, controlling how much information flows through the gate in each dimension; the two matrices W_i and U_i and the vector b_i are the parameters of the input gate and are learned during training. The forget gate f_t and the output gate o_t are computed in the same way as the input gate, each with its own parameters W, U, and b. Unlike in a traditional recurrent neural network, the transition from the previous memory cell state c_{t-1} to the current state c_t does not depend solely on the state computed by the activation function, but is jointly controlled by the input gate and the forget gate.
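The description above matches the standard LSTM formulation, written out here for reference. The candidate state \tilde{c}_t and its parameters W_c, U_c, b_c are symbols assumed for this sketch and are not named in the text; \odot denotes element-wise multiplication.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The fifth line is the memory cell update: the forget gate scales the old memory and the input gate scales the newly computed candidate.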

In a trained network, when the input sequence contains no important information, the LSTM's forget gate is close to 1 and its input gate is close to 0, so past memory is preserved, which realizes long-term memory. When important information appears in the input sequence, the LSTM should store it, and the input gate is close to 1. When important information appears and also means that the previous memory is no longer important, the input gate is close to 1 and the forget gate is close to 0, so that the old memory is forgotten and the new information is memorized. With this design, it is easier for the whole network to learn long-term dependencies between sequence elements.
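The two extreme gate settings can be tried out with a scalar LSTM step. This is a sketch under the standard formulation; the parameter sets `keep` and `overwrite` use hand-picked biases to force the gates toward 0 or 1 and are hypothetical, not trained values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps a gate name to its (W, U, b) parameters."""
    def pre(name):
        W, U, b = p[name]
        return W * x + U * h_prev + b
    i = sigmoid(pre("i"))              # input gate
    f = sigmoid(pre("f"))              # forget gate
    o = sigmoid(pre("o"))              # output gate
    c_tilde = math.tanh(pre("c"))      # candidate memory
    c = f * c_prev + i * c_tilde       # memory cell update
    h = o * math.tanh(c)               # exposed hidden state
    return h, c

# Forget gate near 1, input gate near 0: the old memory is preserved.
keep = {"i": (0.0, 0.0, -10.0), "f": (0.0, 0.0, 10.0),
        "o": (0.0, 0.0, 10.0), "c": (1.0, 0.0, 0.0)}
# Input gate near 1, forget gate near 0: the old memory is replaced.
overwrite = {"i": (0.0, 0.0, 10.0), "f": (0.0, 0.0, -10.0),
             "o": (0.0, 0.0, 10.0), "c": (1.0, 0.0, 0.0)}

_, c_keep = lstm_step(x=2.0, h_prev=0.0, c_prev=0.5, p=keep)
_, c_new = lstm_step(x=2.0, h_prev=0.0, c_prev=0.5, p=overwrite)
```

With `keep`, the cell state stays near its previous value of 0.5; with `overwrite`, it moves to roughly tanh(2.0), the candidate computed from the new input.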

The GRU is obtained by simplifying the LSTM. Its network structure is shown below:

z_t is the update gate. It acts similarly to the forget and input gates in the LSTM, deciding what information to discard and what new information to add.

r_t is the reset gate, which decides how much of the previous information to discard.

Note that h is a single variable: at every step, including the final linear combination, h updates itself from its previous value and the current candidate value. As an analogy, think of this variable as a glass of wine. At each step we pour some of the wine out, mix it with new ingredients, and pour it back in: the reset gate controls the proportion of wine that is poured out to be mixed, and the update gate controls the ratio in which the new ingredients are mixed with the previously prepared wine. The LSTM can be understood in the same way. Its forget gate is functionally similar to the reset gate, and its input gate is similar to the update gate; the difference is that the LSTM also controls how much of the current state is exposed, through its output gate, which the GRU does not have.
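A scalar GRU step can be sketched as follows. It assumes the convention h_t = (1 - z_t) * h_{t-1} + z_t * h~_t (some references swap the roles of z_t and 1 - z_t); the parameter names and toy values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One scalar GRU step; p maps "z", "r", "h" to (W, U, b) parameters."""
    Wz, Uz, bz = p["z"]
    Wr, Ur, br = p["r"]
    Wh, Uh, bh = p["h"]
    z = sigmoid(Wz * x + Uz * h_prev + bz)                # update gate
    r = sigmoid(Wr * x + Ur * h_prev + br)                # reset gate
    h_tilde = math.tanh(Wh * x + Uh * (r * h_prev) + bh)  # candidate state
    # Final linear combination of the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_tilde

params = {"z": (0.5, 0.1, 0.0), "r": (0.3, -0.2, 0.0), "h": (1.0, 0.8, 0.0)}
h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_step(x, h, params)
```

There is no separate memory cell and no output gate here: the single variable h is both the memory and the exposed state.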

1. Hundred-sided machine learning

2. /p/45649187

3. /p/39a99c88a565