
Simple data augmentation methods in NLP

When training a machine learning or deep learning model, good data is often one of the most important factors in how well the model performs. Data augmentation is a common remedy when data is insufficient.

Data augmentation is a quick way to mitigate class imbalance or missing data when training an NLP model. It serves two purposes:

1. Increase the amount of training data, which improves the generalization ability of the model.

2. Add noisy data, which improves the robustness of the model.

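There are roughly two approaches to data augmentation in NLP: adding noise to the original text, and back-translation. Noise can be added with four simple word-level operations: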

(1) Synonym replacement (SR): randomly select n words from the sentence, ignoring stopwords, and replace each of them with a synonym chosen at random from a synonym dictionary.

Eg: "I like this movie very much" -> "I like this film very much". The sentence still has the same meaning, and probably has the same label.

(2) Random insertion (RI): randomly select a word from the sentence, ignoring stopwords, then choose one of its synonyms at random and insert it at a random position in the original sentence. This process can be repeated n times.

Eg: "I like this movie very much" -> "Love I like this film very much". The result does not have to be grammatical.

(3) Random swap (RS): randomly select two words in the sentence and swap their positions. This process can be repeated n times.

Eg: "How to evaluate the 2017 Zhihu Kanshan Cup Machine Learning Competition?" -> "2017 Machine Learning? How to Competition Zhihu evaluate Kanshan Cup".

(4) Random deletion (RD): each word in the sentence is deleted independently with probability p.

Eg: "How to evaluate the 2017 Zhihu Kanshan Cup Machine Learning Competition?" -> "How to the 2017 Kanshan Cup Machine Learning".
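The four operations above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the `SYNONYMS` and `STOPWORDS` tables are tiny hypothetical stand-ins for a real synonym dictionary and stopword list, sentences are assumed to be already tokenized into word lists, and the fallback in `random_deletion` (keep one word rather than return an empty sentence) is an assumption the text does not specify.

```python
import random

# Hypothetical toy resources; a real setup would load a proper synonym
# dictionary and stopword list for the target language.
SYNONYMS = {"like": ["love", "enjoy"], "movie": ["film"]}
STOPWORDS = {"i", "this", "very", "much"}


def _candidates(words):
    """Indices of words that are not stopwords and have a known synonym."""
    return [i for i, w in enumerate(words)
            if w.lower() not in STOPWORDS and w.lower() in SYNONYMS]


def synonym_replacement(words, n, rng=random):
    """SR: replace up to n non-stopword words with a random synonym."""
    words = list(words)
    idxs = _candidates(words)
    for i in rng.sample(idxs, min(n, len(idxs))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return words


def random_insertion(words, n, rng=random):
    """RI: n times, insert a synonym of a random non-stopword word
    at a random position."""
    words = list(words)
    for _ in range(n):
        idxs = _candidates(words)
        if not idxs:
            break
        source = words[rng.choice(idxs)].lower()
        # randint is inclusive, so position len(words) appends at the end.
        words.insert(rng.randint(0, len(words)), rng.choice(SYNONYMS[source]))
    return words


def random_swap(words, n, rng=random):
    """RS: n times, pick two distinct positions and swap their words."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng=random):
    """RD: drop each word independently with probability p."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Assumption: never return an empty sentence; keep one random word.
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # seeded only to make runs repeatable
sentence = "I like this movie very much".split()
print(" ".join(synonym_replacement(sentence, 2, rng)))
print(" ".join(random_insertion(sentence, 1, rng)))
print(" ".join(random_swap(sentence, 1, rng)))
print(" ".join(random_deletion(sentence, 0.2, rng)))
```

Each function copies its input rather than modifying it, so several augmented variants can be generated from the same original sentence.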

In back-translation, we use machine translation to translate a passage of Chinese into another language and then back into Chinese.

Eg: "Jay Chou is a powerful singer in Chinese music, and his albums have been sold all over the world." -> "Jay Chou is a strength singer in the Chinese music scene, his albums are sold all over the world." -> "Jay Chou is an excellent singer in China's music industry, and his albums sell well all over the world."

This method was used successfully in the Kaggle Toxic Comment Classification Challenge. Back-translation is also a data augmentation method commonly used in machine translation. In essence, it quickly produces additional translated text to enlarge the dataset.

Back-translation often increases the diversity of text data. Compared with word substitution, it can sometimes change the syntactic structure while preserving the semantic information. However, the quality of the generated data depends on the quality of the translation, and the translation results are often not very accurate.
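The round trip can be sketched as follows. The translator is passed in as a plain callable with an assumed signature `translate(text, src_lang, tgt_lang) -> str`; this is not the API of any particular translation service, and `toy_translate` with its lookup table is a hypothetical stand-in that exists only so the sketch runs. The example sentence is English for readability, whereas the text above describes Chinese as the source language.

```python
from typing import Callable, Iterable, List

# Assumed signature for illustration: translate(text, src_lang, tgt_lang) -> str
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator, src: str,
                   pivots: Iterable[str]) -> List[str]:
    """Return the distinct back-translations of text, one attempt per pivot language."""
    results: List[str] = []
    for pivot in pivots:
        intermediate = translate(text, src, pivot)
        restored = translate(intermediate, pivot, src)
        # Keep only outputs that differ from the input and from each other,
        # since identical copies add no diversity.
        if restored != text and restored not in results:
            results.append(restored)
    return results


# Toy stand-in for a real machine translation system (hypothetical data).
_TOY_TABLE = {
    ("en", "fr", "I like this movie very much"): "J'aime beaucoup ce film",
    ("fr", "en", "J'aime beaucoup ce film"): "I really like this film",
}


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Look the sentence up in the toy table; return it unchanged if unknown."""
    return _TOY_TABLE.get((src, tgt, text), text)


augmented = back_translate("I like this movie very much", toy_translate,
                           src="en", pivots=("fr",))
print(augmented)
```

Using several pivot languages is one way to obtain more than one variant per sentence. Because quality depends on the translator, the outputs are usually worth checking before they are added to the training set.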

Methods based on deep learning models mainly work by generating new data similar to the original data. Whichever method is used, the augmented data should meet two requirements:

(1) The semantic information of the augmented data should be consistent with the original data.

(2) The augmented data should be diverse.
