NLP Basics and Overview
A popular natural language processing library that ships with its own corpora and provides classification, tokenization, and many other functions; it is widely used abroad and is comparable to the Chinese jieba library.
The model that assigns probabilities to sequences of words is called a language model.
In layman's terms, a language model calculates, for any sequence of words, the probability that the sequence forms a sentence; equivalently, a language model can predict what the next word in a sequence will be.
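Formally, the probability of a sentence is usually decomposed with the chain rule, so that a language model only needs to estimate the probability of each word given its history:

$$P(S) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$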
**n-gram Language Models**
An N-gram model is a typical statistical Language Model (LM). A statistical language model treats a language (a sequence of words) as a random event and assigns it a probability that describes how likely it is to belong to that language. Given a vocabulary V, for a sequence S = ⟨w1, …, wT⟩ ∈ V^T, the statistical language model assigns the sequence a probability P(S) that measures the confidence that S conforms to the syntactic and semantic rules of the natural language. Put simply, a statistical language model is a model that calculates the probability of a sentence.
By conditioning each word only on a short history, the n-gram model mitigates the data-sparsity problem that arises when full word sequences never appear in the training set.
The n-gram model is based on the assumption that the occurrence of the current word depends only on the previous N-1 words and not on any other words, and that the probability of the whole sentence is the product of the probabilities of the individual words. These probabilities can be obtained by counting how often N words occur together in the corpus. The most commonly used variants are the binary Bi-Gram (N=2) and the ternary Tri-Gram (N=3). The assumption the Bi-Gram satisfies is the Markov assumption.
The commonly used N-Gram models are the Bi-Gram and the Tri-Gram. Their formulas are as follows:
Bi-Gram: P(T) = p(w1|begin) p(w2|w1) p(w3|w2) … p(wn|wn-1)
Tri-Gram: P(T) = p(w1|begin1,begin2) p(w2|w1,begin1) p(w3|w2,w1) … p(wn|wn-1,wn-2)
Note how the probabilities above are calculated: p(w1|begin) = (number of sentences starting with w1) / (total number of sentences); p(w2|w1) = (number of times w1 and w2 occur together) / (number of times w1 occurs); and so on.
As a concrete illustration of how each of these terms is computed, see the sketch below. Note that the begin marker in the Bi-Gram formula is generally written as <s>.
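A minimal counting sketch of these maximum-likelihood estimates (the toy corpus and all identifiers are invented for illustration, assuming each sentence is padded with an <s> begin marker):

```python
from collections import Counter

# Toy corpus; each sentence starts with the <s> (begin) marker.
corpus = [
    "<s> i love rats",
    "<s> i love my wife",
    "<s> rats are ugly",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: p(word | prev) = count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "i"))   # p(i | begin)  -> 2/3
print(bigram_prob("i", "love"))  # p(love | i)   -> 1.0
```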
Problems with N-grams:
To give a small example as illustration, suppose we have the following corpus (note that this is originally a Chinese character-level example, in which 'rat' 老鼠 and 'wife' 老婆 both begin with the character 老, 'old'):
Rats are nasty, rats are ugly, you love your wife, I hate rats.
We want to predict the next word after the fragment "I love old". We do this first with a bigram and then with a trigram.
1) With a bigram, we compute P(w | old). Counting the corpus, "rat" follows "old" 3 times and "wife" follows "old" once, so by maximum likelihood estimation P(rat | old) = 0.75 and P(wife | old) = 0.25; the bigram prediction is therefore: I love rats.
2) With a trigram, we compute P(w | love old). Only "love wife" appears, and it appears once, so by maximum likelihood estimation P(wife | love old) = 1; the trigram prediction for the whole sentence is therefore: I love my wife. This is clearly the more reasonable prediction.
Problem 1: As n grows, we have more prior context and can predict the next word more accurately. But this also brings a problem: when n is too large, many n-grams never occur in the training data, so many predicted probabilities are 0. This is the sparsity problem. In practice, only bigrams or trigrams are used. (This problem can be mitigated by smoothing; see /s/NvwB9H71JUivFyL_Or_ENA and the sketch after this list.)
Problem 2: Because the sparsity problem limits n to small values, the n-gram model cannot capture long-range dependencies in the context.
Problem 3: The n-gram model is based purely on frequency counts and does not have enough generalization ability.
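A minimal sketch of the smoothing idea mentioned in Problem 1, using add-one (Laplace) smoothing; the counts below are illustrative, chosen to match the toy corpus used earlier:

```python
from collections import Counter

# Illustrative counts (see the earlier bigram sketch).
unigram_counts = Counter({"<s>": 3, "i": 2, "love": 2, "rats": 2, "my": 1, "wife": 1})
bigram_counts = Counter({("<s>", "i"): 2, ("i", "love"): 2, ("love", "rats"): 1})

def smoothed_bigram_prob(prev, word, vocab_size):
    """Add-one (Laplace) smoothing: every bigram, even an unseen one, gets a non-zero probability."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

V = len(unigram_counts)
print(smoothed_bigram_prob("i", "hate", V))  # > 0 even though ("i", "hate") was never observed
```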
n-gram summary: A statistical language model is a way of calculating the probability of a sentence, where the probability of the whole sentence is the product of the probabilities of the individual words; the larger the probability, the more plausible the sentence. The n-gram is a typical statistical language model: it assumes that the occurrence of the current word is related only to the previous N-1 words and not to any other words. It has several problems. As N increases, more preceding context is available and the prediction of the current word becomes more accurate, but when N is too large the sparsity problem appears and many probabilities become 0; this is why bigrams or trigrams are commonly used, which in turn means the n-gram cannot capture long-range dependencies. In addition, the n-gram is based only on frequency counts and does not generalize well.
Neural network language model
In 2003, Bengio et al. proposed the neural network language model (NNLM). Its core idea is the word vector: instead of the discrete (high-dimensional) variables used in the n-gram, words get a distributed representation as continuous real-valued vectors of a fixed dimension. This alleviates the dimensionality explosion, and at the same time the similarity between words can be obtained from their word vectors.
The task of the language model it builds is to predict the next word from the preceding words within a window of fixed size, so from another point of view it is an n-gram model encoded with a neural network.
It is one of the simplest neural networks, consisting of only four layers: an input layer, an embedding layer, a hidden layer, and an output layer.
The input is the sequence of indices of the words in the window. For example, if the word "this" has index 10 in the dictionary (of size |V|), "is" has index 23, and "test" has index 65, then for the fragment "this is test" the index sequence within the window is 10, 23, 65. The Embedding layer is a matrix of size |V| × K (K is set by us; this matrix is a randomly initialized word-vector table that is updated during backpropagation, and after training it is exactly the word vectors). Rows 10, 23, and 65 are taken from it and concatenated into a 3 × K matrix, which is the output of the Embedding layer. The hidden layer takes the concatenated Embedding output as input and uses tanh as the activation function; the result is fed to the output layer, where a softmax produces the probabilities. The optimization goal is to make the softmax value of the word to be predicted as large as possible.
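A minimal PyTorch sketch of this architecture (a rough illustration, not the original 2003 formulation; all layer sizes, names, and the toy indices are assumptions):

```python
import torch
import torch.nn as nn

class NNLM(nn.Module):
    """Feed-forward NNLM sketch: embedding -> concatenate window -> tanh hidden -> softmax output."""
    def __init__(self, vocab_size, embed_dim=64, window=3, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)     # the |V| x K word-vector matrix
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)  # tanh hidden layer
        self.output = nn.Linear(hidden_dim, vocab_size)          # output layer (softmax in the loss)

    def forward(self, context_ids):               # context_ids: (batch, window)
        embeds = self.embedding(context_ids)      # (batch, window, K)
        flat = embeds.view(embeds.size(0), -1)    # concatenate the window embeddings
        h = torch.tanh(self.hidden(flat))
        return self.output(h)                     # logits over the vocabulary

model = NNLM(vocab_size=100)
logits = model(torch.tensor([[10, 23, 65]]))      # indices from the "this is test" example above
next_word_id = logits.argmax(dim=-1)              # index of the most probable next word
```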
Disadvantages: Because this is a feed-forward neural network trained as a language model, the obvious drawbacks are that it has many parameters and the softmax computation is expensive. Moreover, since the NNLM is essentially an n-gram model encoded with a neural network, it still cannot solve the problem of long-term dependency.
RNNLM
RNNLM uses an RNN (or one of its variants) to train a language model, with the task of predicting the next word from the preceding text. Its advantage over the NNLM is that RNNs have a natural advantage in processing sequential data: the RNN breaks the limitation of the context window and uses the hidden state to summarize all of the preceding context, so it can capture longer dependencies than the NNLM and achieves better experimental results. The RNNLM also has fewer hyperparameters, which makes it more versatile; however, the vanishing-gradient problem of RNNs makes it very difficult to capture dependencies over longer distances.
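A minimal PyTorch sketch of an RNN language model using an LSTM variant (sizes and names are illustrative assumptions, not a specific published configuration):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """RNN language model sketch: the hidden state summarizes all previous context."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        embeds = self.embedding(token_ids)
        hidden_states, _ = self.rnn(embeds)      # (batch, seq_len, hidden_dim)
        return self.output(hidden_states)        # next-word logits at every position

model = RNNLM(vocab_size=100)
logits = model(torch.tensor([[10, 23, 65]]))     # predict the next word at each time step
```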
Word2vec has two variants, CBOW and skip-gram: CBOW predicts the center word from the context within the window, and skip-gram does the opposite, predicting the context within the window from the center word.
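A brief usage sketch with the gensim library (the toy sentences are invented for illustration; parameter names assume gensim 4.x, where the vector dimension is `vector_size`):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus; in practice a large tokenized corpus is needed.
sentences = [["i", "love", "natural", "language", "processing"],
             ["word", "vectors", "capture", "similarity"]]

# sg=0 trains CBOW (predict the center word from its context);
# sg=1 trains skip-gram (predict the context from the center word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv.most_similar("love", topn=3))
```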
GloVe is a count-based model: its word vectors are trained from global word co-occurrence statistics.
ELMo trains a language model with a multi-layer bidirectional LSTM (generally two layers are used); the task is to predict the current word from its context. The preceding context is captured by the forward LSTM and the following context by the backward LSTM. This bidirectionality is weak, so the resulting information is not truly contextual in both directions at once.
GPT trains a language model with the Transformer; it is unidirectional and predicts the next word from the preceding words.
BERT trains a truly bidirectional language model with the Transformer via the masked language model (MLM) task, predicting the masked word from its context.
The details of the above are covered in NLP Pre-training
Metrics for judging language models
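The standard metric here is perplexity; for a test sentence S = w1 … wN it can be written as:

$$PP(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \cdots w_{i-1})}}$$

The lower the perplexity on held-out text, the better the language model.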
Latent Semantic Analysis (LSA) model
The LSA model represents a text collection as a word-document matrix and uses the Singular Value Decomposition (SVD) method to find a lower-order approximation of that matrix.
Probability Latent Semantic Analysis (PLSA) model
The Probability Latent Semantic Analysis (PLSA) model was proposed to overcome some of the shortcomings of the LSA model. In LSA, each column of U_k and V_k can be thought of as a topic, but since the values in each column are essentially unbounded real numbers, we cannot explain what these values actually mean, let alone understand the model from a probabilistic point of view.
The PLSA model, on the other hand, gives a probabilistic interpretation to LSA through a generative model. The model assumes that every document contains a set of possible potential topics, and that every word in the document is not generated out of thin air, but is generated with a certain probability guided by these potential topics.
Inside the PLSA model, a topic is actually a probability distribution over words, and each topic represents a different probability distribution over words, while each document can be seen as a probability distribution over topics. Each document is generated by such a two-layer probability distribution, which is the core idea of the generative model proposed by PLSA.
PLSA models the joint distribution of d and w by the following equation:
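The standard PLSA factorization, consistent with the symmetric and asymmetric forms discussed below, is:

$$P(w, d) = \sum_{z} P(z)\,P(d \mid z)\,P(w \mid z) = P(d) \sum_{z} P(z \mid d)\,P(w \mid z)$$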
The number of topics *z* in this model is a hyperparameter that must be given in advance. Note that the equation gives two ways of expressing P(w, d). In the first form, both *d* and *w* are generated by conditional probabilities given *z*, so their roles are similar and the form is "symmetric". In the second form, *d* is given first, then a possible topic *z* is generated according to P(z | d), and then a possible word *w* is generated according to P(w | z); since words and documents are not generated in the same way here, this form is "asymmetric".
In the plate-notation representation of the asymmetric form of the PLSA model, d denotes a document, z denotes a topic generated from the document, and w denotes a word generated from the topic. In this model, d and w are observed variables, while z is an unknown (latent) variable representing a potential topic.
It is easy to see that there is no way to know P(d) for a new document, so even though the PLSA model is a generative model for the documents it was trained on, it cannot generate new, unseen documents. Another problem is that the number of parameters in P(z | d) grows linearly with the number of documents, which leads to overfitting no matter how much training data is available. These two points are the major drawbacks that have limited wider use of the PLSA model.
Latent Dirichlet Allocation (LDA) model
To solve the overfitting problem of PLSA, Blei et al. proposed the Latent Dirichlet Allocation (LDA) model, which has become the most widely used model in topic-modeling research. LDA places PLSA in a Bayesian framework, i.e., LDA is a Bayesian version of PLSA (because LDA is Bayesianized, it takes prior knowledge into account by adding two prior parameters).
As we saw in the previous section, in PLSA we know nothing about P(d) for an unknown new document d, which is not consistent with human experience. In other words, PLSA does not use information that could have been used, and that information is exactly the prior information in LDA.
Specifically, in LDA, each document is first considered to be more or less relevant to each of a finite number of given topics, and this relevance is modeled by a probability distribution over the topics, which is consistent with PLSA.
But in an LDA model, each document's probability distribution on a topic is given a prior, which is typically represented by a sparse form of the Dirichlet distribution. This sparse Dirichlet prior can be thought of as encoding the human prior knowledge that, in general, the topics of an article are more likely to be focused on a small number of topics, and rarely on many topics at the same time within a single article with no apparent focus.
In addition, the LDA model also assigns a sparse form of the Dirichlet prior to the probability distribution of a topic over all words, which is similarly intuitively interpreted: in most cases, a small number of words (that are highly relevant to the topic) will occur very often in a single topic, while other words will occur significantly less often. These two a priori allow the LDA model to portray the document-topic-word relationship better than PLSA.
In fact, PLSA is equivalent to an LDA model in which the prior distribution is uniform and the parameters are obtained by maximizing the posterior estimate (which, given the uniform prior, is the same as the maximum likelihood estimate). This reflects the fact that a more reasonable prior is very important for modeling.
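A brief usage sketch with gensim's LdaModel (the toy documents and all parameter values are invented for illustration; alpha and eta are the document-topic and topic-word Dirichlet priors discussed above):

```python
from gensim import corpora
from gensim.models import LdaModel

# Tiny illustrative set of tokenized documents.
docs = [["rat", "ugly", "rat"], ["love", "wife"], ["rat", "hate"]]

dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

# num_topics plays the role of the topic-count hyperparameter.
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, alpha="auto", passes=10)
print(lda.print_topics())                       # each topic as a distribution over words
print(lda.get_document_topics(bow_corpus[0]))   # the document's distribution over topics
```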
Word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to certain specifications.
Existing segmentation algorithms fall into three main groups: methods based on string matching, methods based on understanding, and methods based on statistics.
Depending on whether segmentation is combined with part-of-speech tagging, methods can also be divided into pure segmentation methods and integrated methods that combine segmentation and tagging.
According to their underlying principles and characteristics, Chinese word segmentation methods mainly fall into the following two categories:
(1) Dictionary-based segmentation algorithms
Also known as string-matching segmentation algorithms. Following some strategy, the algorithm matches the string to be segmented against the entries of a previously built, "sufficiently large" dictionary; if a word in the dictionary is found, the match succeeds and the word is recognized. Common dictionary-based segmentation algorithms include forward maximum matching, reverse maximum matching, and bidirectional maximum matching.
Dictionary-based segmentation algorithms are the most widely used and the fastest. Researchers have long been optimizing the string-matching approach, for example the choice of the maximum match length, how strings are stored and looked up, and how the word list is organized, e.g. with TRIE index trees or hash indexes.
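A minimal sketch of forward maximum matching (the toy dictionary is invented for illustration; note how greedy left-to-right matching commits to the longest word at each position):

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching: greedily take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary and sentence.
word_dict = {"研究", "研究生", "生命", "的", "起源"}
print(forward_max_match("研究生命的起源", word_dict))   # ['研究生', '命', '的', '起源']
```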
(2) Statistics-based machine learning algorithms
Commonly used algorithms of this type include HMM, CRF (Conditional Random Fields), SVM, and deep learning methods; the Stanford and HanLP segmentation tools, for example, are based on CRF. Taking CRF as an example, the basic idea is sequence labeling over Chinese characters: it considers not only the frequency of words but also the context, so it has better learning ability and handles ambiguous words and out-of-vocabulary words well.
Common segmenters combine machine learning algorithms with dictionaries, which on the one hand improves segmentation accuracy and on the other hand improves domain adaptability.
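A brief usage sketch with the jieba library mentioned at the beginning (the example sentence and the added dictionary entry are illustrative):

```python
import jieba

text = "结合机器学习算法和词典可以提高分词准确率"  # "combining ML algorithms with dictionaries improves segmentation accuracy"
print(jieba.lcut(text))             # default (precise) mode

jieba.add_word("分词准确率")         # a user dictionary entry adapts the segmenter to a domain
print(jieba.lcut(text))
```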
With the rise of deep learning, neural segmenters have also appeared; for example, there are attempts to implement segmentation with a bidirectional LSTM + CRF. These are essentially sequence-labeling models, so they generalize to tasks such as named entity recognition. Character-level accuracy can reach 97.5%. The framework follows an idea similar to the paper "Neural Architectures for Named Entity Recognition" and can be applied to Chinese word segmentation:
First, character embeddings are computed for the corpus; the resulting features are fed into a bidirectional LSTM, and a CRF layer on top produces the final label sequence.
At present, there are three main difficulties in Chinese word segmentation:
1. Segmentation standard: for example, in the HIT standard a person's surname and given name are separate words, while HanLP merges them. Different segmentation standards are needed for different requirements.
2. Ambiguity: the same string can have multiple valid segmentation results.
Ambiguity is divided into three types: combinatorial ambiguity, intersection ambiguity, and true ambiguity.
Generally, search engines use different segmentation algorithms for indexing and for querying. A common scheme is fine-grained segmentation at indexing time to ensure recall, and coarse-grained segmentation at query time to ensure precision.
3. Neologisms: also known as out-of-vocabulary words, i.e. words not included in the dictionary. Solving this problem relies on better segmentation technology and a better understanding of the structure of the Chinese language.
A typical text categorization process can be divided into three steps:
1. Text Representation
The purpose of this process is to represent the text into a form that the classifier can handle. The most common approach is the vector space model, where the text set is represented as a word-document matrix, where each element of the matrix represents the weight of a word in the corresponding document. The process of selecting which words to represent a text is called feature selection. Common feature selection methods are document frequency, information gain, mutual information, expected cross entropy, etc. In order to reduce the amount of computation in the classification process, it is often necessary to perform dimensionality reduction, such as LSI.
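A minimal sketch of the vector space (word-document matrix) representation with tf-idf weights, using scikit-learn (the toy documents are invented for illustration; `get_feature_names_out` assumes scikit-learn 1.x):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the rat is ugly", "i love my wife", "i hate the rat"]   # toy documents

# Build the word-document matrix; each cell is the tf-idf weight of a word in a document.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # shape: (n_documents, n_features)
print(vectorizer.get_feature_names_out())     # the selected feature words
print(X.toarray())                            # the weighted word-document matrix
```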
2. Classifier Construction
The purpose of this step is to select or design a method for constructing a classifier. Different methods have their own advantages, disadvantages, and conditions of applicability, so the classifier should be chosen based on the characteristics of the problem; commonly used methods are described below. After selecting a method, a classifier is constructed for each category on the training set and then applied to the test set to obtain the classification results.
3. Classifier Evaluation
After the classification process is complete, the classification results need to be evaluated. Evaluation is applied to the classification results on the test set (not the training set); the common evaluation criteria are inherited from the IR field and include precision, recall, and the F1 value.
1. Rocchio method
Each class is represented by a centroid; the distance between the document to be classified and each class centroid is computed and used as the criterion for deciding whether the document belongs to that class. The Rocchio method is easy to implement and efficient. Its disadvantage is that it is affected by the distribution of the text collection; for example, a computed centroid may fall outside its own category.
2. Naïve Bayes method
Applying a probabilistic model to automatic document categorization is a simple and effective classification method. Using Bayes' formula, the prior probability of each category and the conditional probabilities of the document given the category are used to estimate the posterior probability of the category given the document, and the document is assigned to the most probable category.
3. K-Nearest Neighbors (KNN) method
Find the k nearest neighbors (documents) of the document to be classified in the training set, and determine its category from the categories of those k neighbors. The kNN method needs neither feature selection nor training; one of its disadvantages is its high space complexity. The classifier obtained with kNN is a nonlinear classifier.
4. Support Vector Machine (SVM) method
For a given category, find a separating surface such that the positive and negative examples of the category fall on opposite sides, the surface is equally distant from the nearest positive and negative examples, and that distance is the largest among all separating surfaces. An advantage of the SVM method is that it relies on only a small number of training examples near the boundary and is computationally light; a disadvantage is that it depends heavily on the positions of the positive and negative examples near the separating surface, so it can be strongly biased. A small comparison sketch of these classifiers follows below.
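A minimal scikit-learn sketch comparing several of the classifiers above on tf-idf features (the toy training data, labels, and test sentence are all invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Toy training data.
train_docs = ["the rat is ugly", "rats are nasty", "i love my wife", "what a lovely day"]
train_labels = ["negative", "negative", "positive", "positive"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(["i hate rats"])

# Naive Bayes, kNN, and a linear SVM, trained and applied the same way.
for clf in (MultinomialNB(), KNeighborsClassifier(n_neighbors=1), LinearSVC()):
    clf.fit(X_train, train_labels)
    print(type(clf).__name__, clf.predict(X_test))
```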
The text clustering process can be divided into 3 steps:
1. Text Representation
Represent the documents in a form that the clustering algorithm can process. See the section on text categorization for the techniques used.
2. Clustering Algorithms
The choice of algorithms is often accompanied by the choice of methods for calculating similarity. In text mining, the most commonly used similarity calculation method is cosine similarity. There are many kinds of clustering algorithms, but there is no general algorithm that can solve all clustering problems. Therefore, the characteristics of the problem to be solved need to be carefully studied in order to select the appropriate algorithm. An introduction to various text clustering algorithms will follow.
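A minimal sketch of the cosine-similarity computation on tf-idf vectors with scikit-learn (the toy documents are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the rat is ugly", "rats are nasty", "i love my wife"]   # toy documents
X = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarity between the document vectors (1.0 on the diagonal).
print(cosine_similarity(X))
```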
3. Clustering Evaluation
Select a collection of documents that have already been manually classified or labeled as the test collection; after clustering, compare the clustering results with the existing manual classification. The common evaluation indexes are again precision, recall, and the F1 value.
1. Hierarchical clustering methods
Hierarchical clustering can be divided into two types: agglomerative hierarchical clustering and divisive hierarchical clustering. Agglomerative methods treat each text as an initial cluster and, through a continuous merging process, eventually arrive at a single cluster; divisive methods work in the opposite direction. Hierarchical clustering yields a hierarchy of clustering results, but its computational complexity is relatively high and it cannot handle large document collections.
2. Partitioning methods
The k-means algorithm is the most common partitioning method. Given the number of clusters k, k texts are selected as the k initial clusters, every other text is added to the nearest cluster, the cluster centroids are updated, and the texts are then re-assigned according to the new centroids; the algorithm stops when the clusters no longer change or after a fixed number of iterations. The k-means algorithm has low complexity and is easy to implement, but it is sensitive to outliers and noisy texts. Another problem is that there is no good way to determine the value of k.
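A minimal scikit-learn sketch of k-means on tf-idf document vectors (toy documents and the choice k=2 are assumptions made for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the rat is ugly", "rats are nasty", "i love my wife",
        "my wife is lovely", "i hate rats"]                      # toy documents

X = TfidfVectorizer().fit_transform(docs)

# k must be chosen in advance; here k=2 is assumed.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(X))                                     # cluster label for each document
```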
3. Density-based methods
In order to discover clustering results with arbitrary shapes, density-based methods have been proposed. Such methods view clusters as high-density regions in the data space separated by low-density regions. Some of the common density-based methods are DBSCAN, OPTICS, DENCLUE, and so on.
4. Neural network methods
Neural network methods describe each cluster by an exemplar, which serves as a "prototype" of the cluster and does not necessarily correspond to a particular data point; new objects are assigned to the cluster whose prototype is most similar to them according to some distance measure. Well-known neural clustering algorithms include competitive learning and the self-organizing map (SOM) [Kohonen, 1990]. Neural network clustering requires long processing times and scales poorly, so it is not suitable for clustering large data sets.