Mining synonyms, near-synonyms, and hypernyms for search engines

Search engines have a constant need for synonyms, because users describe the same thing in many different ways.

In e-commerce search, synonyms fall into several categories:

1. Brand synonyms: the same brand written in different scripts or casings, e.g. nokia = Nokia, adidas = Adidas.

2. Product synonyms: projector ≈ projector (two Chinese words for the same device), phone ≈ cell phone, automobile ≈ car.

3. Old and new words: an older word for bicycle -> the current word for bicycle.

4. Southern and northern words: two regional words for tomato.

5. Traditional synonyms: locker and organizer.

6. Misspelling synonyms: yoga and a common misspelling of yoga (written with the wrong character).

English adds its own cases. One is stemming: singular and plural forms, or the base form of a verb and its -ing form. Another is compounds that can be written either as two words or merged into one, for example keychain and key chain.
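The split-or-merged compound case can be handled by also generating the joined form of every adjacent token pair. The following is a minimal sketch, assuming simple whitespace tokenization; the helper name `compound_variants` is hypothetical and not part of any library.

```python
def compound_variants(query):
    """Return the normalized query plus variants with one adjacent token pair merged."""
    tokens = query.lower().split()
    variants = {" ".join(tokens)}
    for i in range(len(tokens) - 1):
        # Merge tokens i and i+1, e.g. "key" + "chain" -> "keychain"
        merged = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]
        variants.add(" ".join(merged))
    return variants


variants = compound_variants("leather key chain")
print(sorted(variants))
```

Each variant can then be looked up against the index vocabulary, keeping only the merged forms that actually occur there.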

Near-synonyms are even more numerous, for example plus-size and its alternative wordings, shorts and hot pants, border and borderline.

Hypernyms (superordinate words): the hypernym of "Apple phone" is "cell phone".

Antonyms: loose and slim. A query rewrite must never replace a word with its antonym.

On closer inspection, some words can replace each other in both directions, while others can be replaced in one direction only. For example, Jay Chou can be replaced by Zhou Dong, but Zhou Dong can be replaced by Jay Chou only under certain circumstances.
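One possible way to encode direction is to store each rule as an ordered (source, target) pair and to keep antonym pairs in a separate blocklist. This is only a sketch: the names `RULES`, `ANTONYMS`, and `can_rewrite` are illustrative assumptions, not something the article prescribes.

```python
# Directed rules: (source, target) means source may be rewritten to target.
# A two-way synonym needs both directions listed explicitly.
RULES = {
    ("jay chou", "zhou dong"),
    ("automobile", "car"),
    ("car", "automobile"),
}

# Antonym pairs are unordered and must never be rewritten into each other.
ANTONYMS = {frozenset(("loose", "slim"))}


def can_rewrite(source, target):
    """Return True only if a directed rule exists and the pair is not an antonym pair."""
    if frozenset((source, target)) in ANTONYMS:
        return False
    return (source, target) in RULES


print(can_rewrite("jay chou", "zhou dong"))
print(can_rewrite("zhou dong", "jay chou"))
```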

Synonym candidates can be mined from user search terms, item titles, and search-and-click logs. The most fundamental source is still the merchants' own optimization of item titles: smart merchants stack synonyms in their titles in the hope of getting more traffic.

Looking at the click logs, if w1 and w2 are synonyms, then searches for w1 and searches for w2 should in theory produce a large number of clicks on the same items x1, x2, x3 and so on.
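The click-overlap idea can be sketched as a set comparison. The article does not name a specific measure, so the Jaccard overlap used here is an assumption, as are the log format of (query, item_id) tuples and the function names.

```python
from collections import defaultdict


def clicked_items(click_log):
    """Group clicked item ids by query; click_log is an iterable of (query, item_id)."""
    items = defaultdict(set)
    for query, item_id in click_log:
        items[query].add(item_id)
    return items


def click_jaccard(items, w1, w2):
    """Jaccard overlap of the item sets clicked after searching w1 and w2."""
    a, b = items.get(w1, set()), items.get(w2, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy log, made up for illustration only
log = [
    ("case", "x1"), ("case", "x2"), ("case", "x3"),
    ("cover", "x1"), ("cover", "x2"), ("cover", "x4"),
    ("charger", "x9"),
]
items = clicked_items(log)
print(click_jaccard(items, "case", "cover"))    # 2 shared items out of 4 distinct
print(click_jaccard(items, "case", "charger"))  # no shared items
```

On real logs a minimum click count per query would probably be needed, since rare queries produce unstable overlaps.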

Item titles provide a large corpus, with pairs such as projector and projector (two Chinese words for the same device), or trolley case (literally "draw-bar box") and suitcase (luggage).

Find highly related words by measuring word relatedness, either with co-occurrence statistics or with word2vec. Then count how many times these words appear together in a title, i.e. the co-occurrence count of w1 and w2.

from gensim.models import Word2Vec

# Load a previously trained 50-dimensional word2vec model
model_path = "./data/word2vec_en_50d.model"

model = Word2Vec.load(model_path)

# Look up the embedding vector of a single word
model.wv['computer']

Out[6]:

array([-0.48867282, -0.10507897, -0.23138586, -0.10871041,  0.1514824 ,
       -0.01487145, -0.385491  ,  0.01792672, -0.32512784, -0.9063424 ,
       -0.5428677 ,  0.6565156 ,  0.02183418,  0.07939139,  0.03485253,
        0.319492  , -0.27633888,  0.52685845, -0.0582791 , -0.4844649 ,
        0.249212  ,  0.8144138 , -0.03233343, -0.36086813,  0.34835583,
       -0.07177112,  0.0828275 ,  0.6612073 ,  0.74526566, -0.12676844,
       -0.08891173, -0.08520225, -0.04619604,  0.13580324,  0.183159  ,
        0.15528682,  0.01727525, -0.43599448, -0.2579532 , -0.23192754,
       -0.32965428,  0.09547858,  0.00419413, -0.06285212,  0.18150753,
       -0.21699691,  0.60977536, -0.06555454,  0.35746607, -0.06610812],
      dtype=float32)

In[13]:

model.wv.similarity('case','cover') # case and cover are basically synonyms when describing phone cases

Out[13]:

0.8538678

In[22]:

def get_top_sim(word):
    # Print the 10 words most similar to `word`, with their similarity scores
    similar_words = model.wv.most_similar(word, topn=10)
    for w, s in similar_words:
        print(word, "=", w, s)

get_top_sim('case')

case = holder 0.8879926800727844

case = clamshell 0.887456476688385

case = tablet 0.8748524188995361

case = storage 0.8703626990318298

case = carrying 0.8672872185707092

case = hardcase 0.8580055236816406

case = carring 0.8558304309844971

case = seal 0.8552369475364685

case = cover 0.8538679480552673

case = stand 0.8476276993751526

With word2vec we can take each original word and its 10 most similar words, then count how often ORIGIN and SUBSTITUTE (the original word and its candidate replacement) co-occur in item titles. This mining yields a large number of candidate word pairs, which become synonym candidates after manual review.
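The counting step can be sketched without the model itself. In practice the candidate pairs would come from `model.wv.most_similar`; here they are hard-coded, and the titles, the whitespace tokenization, and the function name `cooccurrence_counts` are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(titles, candidate_pairs):
    """Count in how many titles both words of each (origin, substitute) pair appear."""
    counts = Counter()
    for title in titles:
        tokens = set(title.lower().split())
        for origin, substitute in candidate_pairs:
            if origin in tokens and substitute in tokens:
                counts[(origin, substitute)] += 1
    return counts


# Toy titles, made up for illustration only
titles = [
    "leather phone case cover for tablet",
    "slim case cover with stand",
    "hard case holder",
]
pairs = [("case", "cover"), ("case", "holder"), ("case", "seal")]
counts = cooccurrence_counts(titles, pairs)

# Pairs sorted by co-occurrence count are the ones to send to manual review first
for pair, n in counts.most_common():
    print(pair, n)
```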

Extending this from single words to whole queries gives a mapping from a query to its synonymous queries.

Hypernyms can be mined statistically: count the product words that appear under each product category and take the top n most frequent product words w for the category word c. Each pair w -> c is then likely to be a hypernym relationship, with c the hypernym of w.
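The statistic described above can be sketched as a per-category frequency count. The input format of (category word, product word) tuples, the default `top_n`, and the function name `hypernym_candidates` are assumptions for illustration.

```python
from collections import Counter, defaultdict


def hypernym_candidates(items, top_n=2):
    """Return (product_word, category_word) pairs for the top_n words of each category."""
    per_category = defaultdict(Counter)
    for category, product_word in items:
        per_category[category][product_word] += 1

    pairs = []
    for category, counter in per_category.items():
        for word, _count in counter.most_common(top_n):
            if word != category:  # a word is not its own hypernym
                pairs.append((word, category))
    return pairs


# Toy catalogue rows, made up for illustration only
items = [
    ("cell phone", "apple phone"), ("cell phone", "apple phone"),
    ("cell phone", "apple phone"), ("cell phone", "android phone"),
    ("cell phone", "android phone"), ("cell phone", "phone strap"),
]
candidates = hypernym_candidates(items, top_n=2)
print(candidates)
```

Low-frequency words such as accessories listed in the wrong category fall outside the top n, which is why the cutoff matters; the resulting pairs are still only candidates.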

When maintaining the mined word list, do not neglect the manually curated word list. It needs back-end tools so that it can be maintained by hand.

The synonyms are then applied as follows:

1. At indexing time, expand the index terms of each item title with their synonyms, so that the item can be found no matter which of the synonyms the user searches for.

2. In the QueryProcess module, expand query words with their synonyms and rewrite them with near-synonyms; a rewritten near-synonym gets a smaller weight than the original word. Rewriting raises a further problem: when a query Q (split into w1, w2, w3) is rewritten into q1 (w1, w2) and q2 (w2, w3), how should the relevance of q1 and of q2 to Q be calculated?

3. Synonym rewriting of a query needs some words as context. For example, "Zhou Dong's new song" can be rewritten as "Jay Chou's new song", but "Zhou Dong's company" may not be Jay Chou's company.
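Points 2 and 3 can be combined in a small sketch: each rewrite rule carries a weight below 1.0 and an optional set of context words that must appear in the query. The `RewriteRule` class, the `expand_query` function, the weight 0.8, and the context words are all hypothetical; the article does not specify how QueryProcess represents its rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewriteRule:
    source: str
    target: str
    weight: float                     # below 1.0, so rewrites rank under the original
    context: frozenset = frozenset()  # rule fires only if one of these words is present; empty means always


def expand_query(query, rules):
    """Return [(query_text, weight)], with the original query first at weight 1.0."""
    results = [(query, 1.0)]
    tokens = set(query.split())
    for rule in rules:
        if rule.source not in query:
            continue
        if rule.context and not (rule.context & tokens):
            continue  # required context word is missing, so do not rewrite
        results.append((query.replace(rule.source, rule.target), rule.weight))
    return results


rules = [RewriteRule("zhou dong", "jay chou", 0.8, frozenset({"song", "album"}))]
print(expand_query("zhou dong new song", rules))
print(expand_query("zhou dong company", rules))
```

The first query is rewritten because "song" supplies the required context; the second is left alone.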
