
Title: Overview of Natural Language Processing

Date: 2021-1181:03:11

Natural language refers to the languages people use every day, such as Chinese, English, and Japanese. It is flexible and highly variable, and an important part of human society, but it cannot be readily understood by computers. Natural language processing (NLP) was born to let people and computers communicate in natural language. NLP is a field that combines linguistics, computer science, and mathematics: it studies not only language itself, but also how to make computers process it. It is mainly divided into two directions, natural language understanding (NLU) and natural language generation (NLG); roughly speaking, the former corresponds to listening and reading, and the latter to speaking and writing.

This article starts with the history and development of natural language processing, then reviews the current research progress of deep learning in the field, and finally discusses future directions for natural language processing.

In 1950, Alan Turing, the father of computer science, proposed the "Turing test", which marked the beginning of the field of artificial intelligence. At the time, during the Cold War between the United States and the Soviet Union, the American government invested heavily in machine translation in order to decipher Soviet documents more conveniently, and natural language processing took off from there. In this early period, natural language processing mainly used rule-based methods rooted in linguistics: by analyzing lexical and grammatical information and summarizing the regularities among them, translation was (in principle) achieved. This approach, similar to an expert system, is not general and is hard to optimize; in the end progress was slow and the expected results were never achieved.

In the 1980s and 1990s, with the rapid development of the Internet, computer hardware also improved significantly. At the same time, statistical machine learning algorithms were introduced into natural language processing, and rule-based methods were gradually replaced by statistical ones. At this stage, natural language processing made substantial breakthroughs and moved toward practical application.

Since 2008, as deep neural networks achieved remarkable results in image processing and speech recognition, they have also been applied to natural language processing: from the early word embeddings and word2vec, to neural network models such as RNNs, GRUs, and LSTMs, and more recently to attention mechanisms and pre-trained language models. With deep learning behind it, natural language processing has seen rapid progress.

Next, I will introduce the progress that has come from combining natural language processing with deep learning.

In natural language, words are the most basic units. For computers to understand and process natural language, words must first be encoded. Because the number of words in a natural language is finite, each word can be assigned a unique index, for example numbering the words of English from 1 up to the vocabulary size. For convenience of computation, the index is usually converted into a vector of fixed form. The simplest method is to one-hot encode the word indices: each word corresponds to a vector (a one-dimensional array) of length n (the total number of words), in which only the element at the word's index is 1 and all the others are 0.
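As a concrete illustration, here is a minimal sketch of one-hot encoding; the tiny vocabulary and the words in it are made up purely for the example:

```python
# Minimal one-hot encoding sketch; the vocabulary is illustrative only.
import numpy as np

vocab = ["I", "drink", "apple", "juice", "orange"]          # n = 5 words
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a length-n vector with a 1 at the word's index, 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("apple"))    # [0. 0. 1. 0. 0.]
print(one_hot("orange"))   # [0. 0. 0. 0. 1.]
# Note: the dot product of any two different one-hot vectors is 0,
# so the encoding itself carries no notion of similarity.
```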

Although one-hot encoding makes it very easy to construct word vectors, it is not a good representation, mainly because it cannot express the semantics of words. For example, apple and orange are similar words (both are fruits), but their one-hot vectors cannot reflect this similarity.

To solve the above problem, Mikolov and colleagues at Google published two seminal papers on word2vec in 2013 [1][2]. Word2vec represents each word as a fixed-length vector and learns the semantic information of words from their context, so that the vectors express properties of words and relationships between words. Word2vec includes two models: the skip-gram model [1] and the continuous bag-of-words (CBOW) model [2]; the former predicts the context from the center word, and the latter predicts the center word from the context. For example, given the sentence "I drink apple juice", the skip-gram model uses "apple" to predict the other words, while the CBOW model uses the other words to predict "apple".

First, the CBOW model: it is a three-layer neural network that predicts the center word from its context. Taking the training sentence "I drink apple juice" as an example, "apple" is removed and used as the label, "I drink juice" is used as the input, and the model is trained to predict "apple" as the center word.

The skip-gram model is similar to CBOW: it is also a three-layer neural network (input, projection, and output layers), but it predicts the context from the center word, that is, it predicts "I drink juice" from "apple".
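As a practical sketch (not the implementation from the original papers), both models can be trained with the gensim library; the code below assumes gensim 4.x is installed, and the toy corpus and hyperparameters are purely illustrative:

```python
# Train skip-gram and CBOW word vectors with gensim on a toy corpus.
from gensim.models import Word2Vec

corpus = [
    ["i", "drink", "apple", "juice"],
    ["i", "drink", "orange", "juice"],
    ["i", "eat", "apple", "pie"],
]

# sg=1 selects skip-gram (center word -> context);
# sg=0 selects CBOW (context -> center word).
skipgram = Word2Vec(corpus, vector_size=50, window=2, sg=1, negative=5, min_count=1)
cbow = Word2Vec(corpus, vector_size=50, window=2, sg=0, min_count=1)

print(skipgram.wv["apple"].shape)                 # (50,) fixed-length word vector
print(skipgram.wv.similarity("apple", "orange"))  # cosine similarity of two words
```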

After either model is trained, its weight matrix is used as the word vector matrix, in which the i-th row is the word vector of the i-th word in the vocabulary. Word vectors can be used to compute the similarity between words (for example, via the dot product of two word vectors). If you enter the context "I drink _ juice", the predicted probabilities for "apple" and "orange" as the center word may both be high, because the word vectors of apple and orange are very similar, that is, their similarity is high. Word vectors can also be used in machine translation, named entity recognition, relation extraction, and so on.
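For instance, here is a minimal sketch of comparing two word vectors with the dot product and its normalized form, cosine similarity; the vector values are made up for illustration:

```python
# Compare two (made-up) word vectors by dot product and cosine similarity.
import numpy as np

apple = np.array([0.8, 0.1, 0.6])
orange = np.array([0.7, 0.2, 0.5])

dot = np.dot(apple, orange)
cosine = dot / (np.linalg.norm(apple) * np.linalg.norm(orange))
print(dot, cosine)   # larger values indicate more similar words
```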

In fact, prototypes of these two models had already appeared in 2003 [3]. Mikolov's 2013 papers mainly simplified the models and proposed negative sampling and hierarchical softmax to make training more efficient.

While word vectors were being proposed, the RNN family of deep learning models was also applied to NLP and, combined with word vectors, achieved great results. However, RNNs have problems of their own: they are difficult to parallelize and struggle to model long-distance and hierarchical dependencies. These problems were effectively addressed by the paper "Attention Is All You Need" published in 2017 [4], which proposed the Transformer model. The Transformer abandons the traditional, complex CNN and RNN entirely; the whole network structure is composed of attention mechanisms.

The core of the Transformer is the self-attention mechanism, a variant of the attention mechanism. The function of attention is to select a small amount of important information from a large amount of information and focus on it. For example, when people look at an image, they focus on the most salient parts and ignore the rest; this is attention at work. The general attention mechanism attends to global information, that is, the correlations between the input data, the output data, and intermediate results. The self-attention mechanism reduces the reliance on external data, attends only to the input data itself, and is better at capturing the internal correlations within the data.

The algorithm flow of the self-attention mechanism is as follows:
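A rough numpy sketch of this flow is given below; the projection matrices WQ, WK, and WV stand in for learned parameters and are randomly initialized here purely for illustration:

```python
# Scaled dot-product self-attention, sketched in numpy.
import numpy as np

def self_attention(X, WQ, WK, WV):
    """X: (seq_len, d_model) matrix of input word vectors."""
    Q, K, V = X @ WQ, X @ WK, X @ WV            # queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # relevance of every word to every other word
    scores = scores - scores.max(axis=-1, keepdims=True)                   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                          # weighted sum of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                     # 4 words, d_model = 8
WQ, WK, WV = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, WQ, WK, WV).shape)      # (4, 8): one output per input word
```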

The self-attention mechanism not only establishes the relationships between the words in the input data, but also computes the output for each word in parallel and efficiently.

The overall structure of the Transformer is as follows:

It is divided into two parts: encoder and decoder.

The input of the encoder is the word vectors plus positional encodings (indicating each word's position); the output is then obtained through multi-head self-attention and a feed-forward layer. In multi-head self-attention, each input word corresponds to several groups of Q, K, and V, and the groups do not affect each other; each word therefore produces several outputs, which are concatenated into a single vector. The encoder is the core of the Transformer and usually has multiple layers: the output of one layer serves as the input of the next, and the output of the last layer is used as part of the decoder's input.
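The following is a hedged sketch of multi-head self-attention using PyTorch's built-in nn.MultiheadAttention module (not the original paper's code); the sizes are illustrative:

```python
# Multi-head self-attention over a toy sequence with PyTorch.
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 512, 8, 10
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)   # word vectors plus positional encoding
# Self-attention: queries, keys, and values all come from the same input.
out, weights = attn(x, x, x)
print(out.shape)      # (1, 10, 512): one output vector per input word
print(weights.shape)  # (1, 10, 10): attention weights averaged over heads
```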

The decoder consists of two different multi-head self-attention operations (masked multi-head attention and multi-head attention) plus a feed-forward layer. The decoder runs several times, outputting one word at a time until the complete target text has been produced, and the words output so far are combined as the input of the next decoding step. In masked multi-head attention, the positions that have not yet been obtained are masked out before the multi-head self-attention operation is performed. For example, if there are five positions but only two inputs so far, then q1 and q2 are only multiplied with k1 and k2.
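A minimal sketch of this masking, again in PyTorch, is shown below; the sequence length and model sizes are illustrative:

```python
# Causal (look-ahead) mask for the decoder's masked multi-head attention.
import torch
import torch.nn as nn

seq_len, d_model, num_heads = 5, 512, 8
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

# Upper-triangular mask: True marks positions that may NOT be attended to,
# so each position only sees itself and earlier positions.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
print(causal_mask[1])   # tensor([False, False,  True,  True,  True])
# e.g. when only the first two words are available, q1 and q2 only see k1 and k2.

x = torch.randn(1, seq_len, d_model)
out, _ = attn(x, x, x, attn_mask=causal_mask)
print(out.shape)        # (1, 5, 512)
```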

If the application of deep learning gave NLP its first leap, the appearance of pre-trained models gave it a second. Pre-training learns a powerful language model from large-scale corpus data through self-supervised learning (no labels required), and the model is then transferred to specific tasks through fine-tuning, ultimately achieving remarkable results.

The advantages of pre-trained models are as follows:

Pre-trained models rely on three key technologies:

Regarding the structure of pre-trained models, take BERT as an example: the input is the one-hot vector of each word, which is multiplied by the word-embedding matrix and then passed through the encoder modules of a multi-layer Transformer to produce the final output.
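As a hedged sketch, the following runs a sentence through a pre-trained BERT encoder using the Hugging Face transformers library, assuming it is installed and using the publicly released bert-base-uncased checkpoint:

```python
# Feed text through a pre-trained BERT encoder (Hugging Face transformers).
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I drink apple juice", return_tensors="pt")
outputs = model(**inputs)               # stacked Transformer encoder layers

print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768): one vector per token
# Fine-tuning adds a small task-specific head on top of these outputs.
```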

This article has introduced the hot research directions in the NLP field, among which the appearance of the Transformer and of pre-trained models is of epoch-making significance. However, as pre-trained models grow ever larger, they also run into hardware bottlenecks. In addition, NLP's performance on tasks such as reading comprehension and text reasoning is still not satisfactory. In short, NLP still holds great promise as well as great challenges, and it will require long-term effort.

[1] Mikolov T, Sutskever I, Chen K, Corrado G S, Dean J. Distributed representations of words and phrases and their compositionality[C]// Advances in Neural Information Processing Systems. 2013: 3111-3119.

[2] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space[J]. arXiv preprint arXiv:1301.3781, 2013.

[3] Bengio Y, Ducharme R, Vincent P, Jauvin C. A neural probabilistic language model[J]. Journal of Machine Learning Research, 2003, 3: 1137-1155.

[4] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]// Advances in Neural Information Processing Systems. 2017: 5998-6008.

[5] Peters M E, Neumann M, Iyyer M, et al. Deep contextualized word representations[J]. arXiv preprint arXiv:1802.05365, 2018.

[6] Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[J]. 2018.

[7] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.

[8] Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-efficient transfer learning for NLP[C]// International Conference on Machine Learning. PMLR, 2019: 2790-2799.