Understanding the 2020 Recommender System Technology Evolution Trends

Notes on the Zhihu article "Recommender System Technology Evolution Trends: From Recall to Ranking to Re-ranking":

The article mainly covers the more obvious technology trends in recommender systems over the last two years.

Overall architecture of recommender systems

Recall technology evolution

Ranking model technology evolution

Re-ranking technology evolution

Macro architecture of recommender systems:

Subdivision into four phases (recall, coarse ranking, fine ranking, re-ranking):

1. Traditional: multi-channel recall (each recall channel is equivalent to ranking the results by a single feature).

2. Future: model recall (introduce multiple features, expanding single-feature ranking into a multi-feature ranking model).

(1) Model recall

Compute user and item Embeddings, then use an efficient Embedding retrieval tool such as Faiss to quickly find the items that match the user's interests. This is equivalent to building a recall model that fuses multiple features.

Theoretically, any supervised model can be used as the recall model, such as FM, FFM, or DNN. The so-called "two-tower" model that is often mentioned refers to a structure in which user-side and item-side features are kept separate and each side produces its own Embedding; it is not a specific model.
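The two-tower structure described above can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions: the "towers" are single random linear layers standing in for trained FM or DNN towers, the feature sizes are made up, and a brute-force inner-product search stands in for an approximate index such as Faiss. The names `user_tower`, `item_tower`, and `recall_top_k` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # embedding size (assumed)

# Hypothetical towers: one linear layer each, standing in for trained models.
W_user = rng.normal(size=(5, DIM))  # 5 user-side features (assumed)
W_item = rng.normal(size=(4, DIM))  # 4 item-side features (assumed)

def user_tower(user_features):
    """Map user-side features to a user embedding; no item features are used."""
    return user_features @ W_user

def item_tower(item_features):
    """Map item-side features to item embeddings; no user features are used."""
    return item_features @ W_item

def recall_top_k(user_emb, item_embs, k):
    """Brute-force inner-product search; an ANN index would replace this online."""
    scores = item_embs @ user_emb
    top = np.argsort(-scores)[:k]
    return top, scores[top]

item_embs = item_tower(rng.normal(size=(100, 4)))  # can be computed offline
user_emb = user_tower(rng.normal(size=5))          # computed at request time
ids, scores = recall_top_k(user_emb, item_embs, k=10)
```

Because the two sides only meet in the final inner product, the item embeddings can be precomputed and indexed, which is what makes this structure practical for recall.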

It is worth noting that if model recall is used in the recall phase, it should in theory share the optimization objective of the ranking model. In particular, if the ranking phase uses multi-objective optimization, the recall model should use the same multi-objective optimization. Similarly, if the pipeline contains a coarse ranking module, coarse ranking should adopt the same multi-objective optimization as fine ranking, so that the objectives of all stages are consistent. Recall and coarse ranking are the stages that feed fine ranking; if their objectives are inconsistent, items that fine ranking would score highly are likely to be filtered out upstream, which hurts the overall result.

(2) User behavior sequence recall

The core question is how to define the item aggregation function Fun. One thing to note is that the items in a user behavior sequence are ordered in time. In theory, any model that can capture temporal order or local feature correlations is suitable here; typical choices such as CNN, RNN, Transformer, and GRU (an RNN variant) are all well suited to integrating user behavior sequence information.

In the recall phase, the embedding can be produced from the user behavior sequence with a supervised model, for example by Next Item Prediction. An unsupervised approach also works: as long as the items have embeddings, the content of the user behavior sequence can be integrated without supervision, for example by Sum Pooling.
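As an illustration of the unsupervised route, here is a minimal sketch of Sum Pooling as the aggregation function, assuming item embeddings already exist as rows of a NumPy array. The decayed variant is an added assumption (not from the article) that shows one simple way to let the temporal order matter; both function names are hypothetical.

```python
import numpy as np

def sum_pooling_user_embedding(item_embeddings, behavior_sequence):
    """Unsupervised aggregation: add up the embeddings of the items the user acted on."""
    return item_embeddings[behavior_sequence].sum(axis=0)

def decayed_pooling_user_embedding(item_embeddings, behavior_sequence, decay=0.8):
    """Order-aware variant (assumption): more recent behaviors get larger weights."""
    n = len(behavior_sequence)
    weights = decay ** np.arange(n - 1, -1, -1)  # the last item has weight 1
    return (weights[:, None] * item_embeddings[behavior_sequence]).sum(axis=0)

item_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sequence = [0, 2, 2]  # item ids, oldest first
user_emb = sum_pooling_user_embedding(item_embeddings, sequence)
user_emb_decayed = decayed_pooling_user_embedding(item_embeddings, sequence, decay=0.5)
```

Plain Sum Pooling ignores the order of the sequence entirely, which is why the supervised sequence models listed above are usually preferred when order carries signal.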

(3) User Multi-Interest Splitting (producing the user's interest Embeddings from the user's behavior item sequence)

(4) Knowledge Graph Fusion Recall

Starting from the entities the user is interested in, the user's interests are expressed through knowledge graph entity Embeddings (or by expanding outward directly along the knowledge graph nodes). Related entities can then be expanded either through this outward expansion over the graph or by Embedding similarity.

(5) Graph Neural Network Model Recall

The ultimate goal of a graph neural network is to obtain an embedding encoding for each node in the graph. The most commonly used embedding aggregation tool is CNN. For a given graph node, the input can carry two types of information: one is the node's own attribute information (the original article uses a microblog post as its example); the other is graph structure information, that is, information about the other nodes that share a direct edge with the current node. A CNN can encode and aggregate the two types of information to form the node's embedding. By running an information aggregator such as a CNN over the graph nodes and iteratively updating their embeddings, reliable node embeddings can eventually be obtained. This iterative process in fact lets distant nodes pass information step by step through the graph structure, so the graph structure enables knowledge transfer and supplementation.

Thinking further: because a graph node can carry attribute information, such as an item's content information, it clearly helps with the item-side cold-start problem. Because the graph also lets knowledge travel long distances, it can transfer and supplement knowledge in scenarios where users have relatively little behavior, which makes it well suited to data-sparse recommendation scenarios. In addition, the edges of the graph are usually built from user behavior, and user behavior at the statistical level is essentially collaborative information. For example, when we say "item A is collaborative with item B", we essentially mean that many users who acted on item A are also likely to act on item B. So the graph has a clear advantage: it makes it easy to integrate heterogeneous information such as collaborative information, user behavior information, and content attribute information in a unified framework and to represent all of it uniformly as embeddings. This comes naturally to graphs and is a unique advantage. Another unique advantage is the propagation of information through the graph, which should be particularly useful for cold-start and data-sparse recommendation scenarios.

Early graph neural networks for recommendation needed global information, so computation speed was a problem; the graphs they could handle were often very small and had no practical value. GraphSAGE reduces the amount of computation and speeds it up through means such as sampling from neighboring nodes, and many later methods for improving computational efficiency derive from this work. PinSage, built on GraphSAGE (by the same group of people), further adopts large-scale distributed computation and extends the practicality of graph computation: it can handle Pinterest's giant graph of 3 billion nodes and 18 billion edges, and it produced good results in production. These two works are worth studying closely.
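The neighbor-sampling idea can be sketched as follows. This is a simplified, GraphSAGE-style mean aggregation written for illustration, not the published implementation: the weights are random rather than trained, the toy graph and all function names are assumptions, and each round mixes a node's own vector with the mean of a fixed number of sampled neighbors.

```python
import numpy as np

def sample_neighbors(adj, node, num_samples, rng):
    """Sample a fixed number of neighbors so the cost does not grow with node degree."""
    neighbors = adj[node]
    if not neighbors:
        return []
    replace = len(neighbors) < num_samples
    return list(rng.choice(neighbors, size=num_samples, replace=replace))

def aggregate_once(features, adj, W_self, W_neigh, num_samples, rng):
    """One round: combine each node's own vector with the mean of its sampled neighbors."""
    out = np.zeros((features.shape[0], W_self.shape[1]))
    for node in range(features.shape[0]):
        sampled = sample_neighbors(adj, node, num_samples, rng)
        if sampled:
            neigh_mean = features[sampled].mean(axis=0)
        else:
            neigh_mean = np.zeros(features.shape[1])
        h = features[node] @ W_self + neigh_mean @ W_neigh
        out[node] = np.maximum(h, 0.0)  # ReLU
    # L2-normalize each row so embeddings can be compared by inner product
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}  # toy graph as adjacency lists
features = rng.normal(size=(4, 3))            # node attribute vectors
W1_self, W1_neigh = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
W2_self, W2_neigh = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

h1 = aggregate_once(features, adj, W1_self, W1_neigh, num_samples=2, rng=rng)
h2 = aggregate_once(h1, adj, W2_self, W2_neigh, num_samples=2, rng=rng)
```

After the second round, each row of `h2` depends on nodes up to two hops away, which is the step-by-step long-distance propagation described earlier.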

Overall, graph model recall is a promising direction worth exploring.

For ranking models, the optimization objective reflects what we want the recommender system to do well and is often tied to business objectives. Here we mainly explore it from a technical perspective: multi-objective optimization and ListWise optimization are currently the most common directions of technological evolution. The ListWise optimization objective can be used in both the ranking stage and the re-ranking stage, so it is discussed in the re-ranking part; here we mainly introduce multi-objective optimization;

Model expressiveness represents whether the model can make full use of effective features and feature combinations. Explicit feature combination, new types of feature extractors, the application of reinforcement learning, and automatic exploration of model structure by AutoML are the obvious directions of technological evolution in this regard;

From the perspective of features and information, the main directions of technological evolution are how to adopt richer new types of features and how to expand and fuse information and features. Separating users' long-term and short-term interests, using user behavior sequence data, graph neural networks, and multimodal fusion are the main technological trends in this regard.

1.1 Model Optimization Objectives - Multi-Objective Optimization

Multi-objective optimization of recommender systems (optimizing for clicks, interactions, duration, and so on at the same time) is not just a trend; it is already the state of R&D at many companies. In a recommender system, different optimization objectives may pull against each other, and multi-objective optimization aims to balance their effects. When it works well, it has a large impact on business results. In short, multi-objective optimization is a technical direction that deserves the attention of R&D staff working on recommender systems.

From a technical point of view, there are two key issues in multi-objective optimization. The first is the model structure for multiple optimization objectives; the second is how to define the importance of the different optimization objectives (how to find the optimal hyperparameters).
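Both issues can be made concrete with a small sketch. It assumes a shared-bottom structure (one shared layer, one head per objective) as a simple answer to the structure question, and a weighted sum of per-objective log losses as a simple answer to the importance question. The objective names, weights, and random parameters are illustrative assumptions; the article does not prescribe this particular design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_loss(y_true, y_prob, eps=1e-12):
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

def shared_bottom_forward(x, W_shared, heads):
    """Issue 1 (structure): one shared layer, one output head per objective."""
    hidden = np.maximum(x @ W_shared, 0.0)
    return {name: sigmoid(hidden @ w) for name, w in heads.items()}

def weighted_multi_objective_loss(preds, labels, weights):
    """Issue 2 (importance): a weighted sum whose weights are hyperparameters."""
    return sum(weights[name] * log_loss(labels[name], preds[name]) for name in preds)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
W_shared = rng.normal(size=(4, 3))
heads = {"click": rng.normal(size=3), "like": rng.normal(size=3)}
labels = {"click": np.array([1, 0, 1, 0, 0, 1]), "like": np.array([0, 0, 1, 0, 0, 0])}
weights = {"click": 1.0, "like": 0.5}  # assumed values; normally tuned

preds = shared_bottom_forward(x, W_shared, heads)
loss = weighted_multi_objective_loss(preds, labels, weights)
```

Changing `weights` shifts the balance between objectives without touching the model structure, which is why the two issues can be studied separately.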

2.1 Model Expression Capability - Explicit Feature Combination

If you summarize the history of CTR model evolution in industry, you will find that feature engineering and the automation of feature combination have always been the most important direction driving practical recommender system technology, bar none. The earliest LR models relied basically on manual feature engineering and manual feature combination: simple and effective, but time-consuming and laborious. Next came LR + GBDT, which automated high-order feature combination, and the FM model, which automated second-order feature combination. After that came DNN models; a plain DNN model essentially takes the feature Embeddings of the FM model and adds a few MLP hidden layers on top to perform implicit, nonlinear, automatic feature combination.
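To make "second-order feature combination automation" concrete, here is a minimal sketch of the FM second-order term, assuming a dense feature vector `x` and a factor matrix `V` with one latent vector per feature (both filled with random values here). The naive version sums over every feature pair; the fast version uses the standard FM reformulation and gives the same value.

```python
import numpy as np

def fm_second_order_naive(x, V):
    """Sum over all feature pairs i < j of <v_i, v_j> * x_i * x_j."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += float(V[i] @ V[j]) * x[i] * x[j]
    return total

def fm_second_order_fast(x, V):
    """Same value in O(n*k): 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2)."""
    xv = x @ V
    x2v2 = (x ** 2) @ (V ** 2)
    return 0.5 * float(np.sum(xv ** 2 - x2v2))

rng = np.random.default_rng(0)
x = rng.normal(size=6)       # feature values
V = rng.normal(size=(6, 4))  # one 4-dimensional latent vector per feature
naive = fm_second_order_naive(x, V)
fast = fm_second_order_fast(x, V)
```

The latent vectors in `V` play the role of the feature Embeddings that the DNN models later reuse as input to their MLP layers.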

2.2 Model Expression Capability - Evolution of Feature Extractors

From the perspective of feature extractors, the most commonly used feature extractor in current mainstream DNN ranking models is still the MLP structure, whereas the image domain commonly uses CNNs and the NLP domain uses RNNs and Transformers.

The MLP structure usually consists of two or three MLP hidden layers. Theoretical studies have also shown that MLP structures are inefficient at capturing feature combinations.

CNN is a very effective structure for capturing local feature associations, but it is not well suited to recommendation models with pure feature input, because the order in which features are fed in has no inherent sequential meaning in the recommendation domain. CNN is also weak at capturing long-distance feature relationships, and RNN cannot be computed in parallel, so it is slow.

Transformer, as the newest and most effective feature extractor in the NLP field, is in terms of its working mechanism actually very well suited to recommendation. Why? The core lies in Transformer's Multi-Head Self Attention (MHA) mechanism. In NLP, the MHA structure judges the degree of relevance between any two words in the input sentence. Applied to recommendation, this means using MHA to combine any pair of features, and as mentioned above, feature combination is a very important part of recommendation. From this perspective, Transformer is particularly suitable for modeling feature combination: one Transformer Block represents second-order feature combination, and more Transformer Blocks represent higher-order feature combinations. In practice, however, applying Transformer to recommendation does not show an obvious advantage, or even any advantage at all; its effect is basically similar to or only slightly better than a typical MLP structure. This suggests that Transformer needs to be adapted to the characteristics of the recommendation domain rather than copied directly from its NLP structure.
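The pairwise relevance judgment can be illustrated with single-head scaled dot-product self-attention over feature-field embeddings. This is a minimal sketch under assumptions: one head instead of multiple heads, random projection matrices instead of trained ones, and no residual connection, layer normalization, or feed-forward sublayer.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, Wq, Wk, Wv):
    """Single-head attention over feature-field embeddings E of shape (n_fields, d)."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # relevance of every field pair
    return weights @ V, weights

rng = np.random.default_rng(0)
n_fields, d = 5, 8  # assumed sizes
E = rng.normal(size=(n_fields, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(E, Wq, Wk, Wv)
```

Row `i` of `weights` says how strongly field `i` attends to every other field, so each output row is a combination of feature pairs; stacking such layers is what yields the higher-order combinations mentioned above.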

The original post included here a screenshot of another of Mr. Zhang's diagrams about Transformer. It is enough to show how significant Transformer is, though I do not fully understand it yet, haha~

2.3 Application of AutoML in Recommendation

AutoML started to appear in early 2017 and has boomed over the last three years, with very important research progress in areas such as images and NLP. In these areas it is now possible to use AutoML to find model structures that work better than human-designed ones.

2.4 Reinforcement Learning for Recommendations

Reinforcement learning is actually well suited to modeling recommendation scenarios. In general, reinforcement learning has several key elements: state, action, and reward. In the recommendation scenario, we can define the state St as the set of items in the user's behavior history. The action space of the recommender system is the list of results recommended to the user given the current state St. Under this definition the recommender system's action space is enormous, which rules out many reinforcement learning methods that cannot model a huge action space. The reward is the value of the user's interactions with the list given by the recommender system; for example, the reward can be defined as 1 if the user clicks an item, 5 if the user purchases an item, and so on. With these elements defined, recommendation can be modeled using typical reinforcement learning.
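The state and reward definitions can be written down directly. The sketch below uses the reward values from the example (1 for a click, 5 for a purchase); the discount factor, the session data, and the function names are assumptions added for illustration, and no policy learning is shown.

```python
REWARDS = {"click": 1.0, "purchase": 5.0}  # values from the example in the text

def step_reward(feedback):
    """Reward for one recommended list: sum of the rewards of the user's interactions."""
    return sum(REWARDS.get(event, 0.0) for event in feedback)

def discounted_return(rewards, gamma=0.9):
    """Discounted sum of rewards; gamma is an assumed discount factor."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def next_state(state, feedback_items):
    """State S_t is the user's behavior history; new interactions extend it."""
    return state + list(feedback_items)

# Hypothetical session: user feedback on three consecutive recommended lists.
session = [["click"], [], ["click", "purchase"]]
rewards = [step_reward(f) for f in session]
G = discounted_return(rewards, gamma=0.5)
```

The hard part that this sketch leaves out is the action: choosing a whole list from a huge catalog at each step, which is the enormous action space discussed above.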

3.1 Multi-modal information fusion

Multi-modal fusion, technically, essentially maps information of different modalities, through means such as Embedding encoding, into a unified semantic space, so that information from different modalities that expresses the same semantics becomes directly comparable. For example, the natural-language word "apple" and a picture of an apple should, after both are encoded as embeddings by some technique, have very high similarity; this means the knowledge from the different modalities has been mapped into the same semantic space. You can then, for example, use the text "apple" to search for photos that contain apples.

3.2 Long-term interest/short-term interest separation

For recommender systems, it is very important to describe user interests accurately. Two main approaches are commonly used. One is to characterize user interests by user-side features, which is the most common; the other is to characterize user interests by the sequence of items the user has acted on.

List Wise re-ranking can be discussed from two perspectives: one is the optimization objective or loss function; the other is the model structure of the recommendation module.

When Learning to Rank is used for ranking in recommender systems, there are three common optimization objectives: Point Wise, Pair Wise, and List Wise. So it should be made clear first that List Wise does not refer to a specific model or type of model; it refers to the model's optimization objective or loss function. In theory, all kinds of different models can be trained with a List Wise loss.

The simplest loss function definition is Point Wise: take the user features and the features of a single item as input and score that item, without considering the order between items, that is, which one should be ranked ahead of which. Both training and online inference are obviously simple, direct, and efficient this way, but the disadvantage is that it ignores the direct associations between items, which are actually useful for ranking.

Pair Wise loss trains the model directly on the order relationship between two items; the optimization objective is of the form "item A should be ranked higher than item B". Pair Wise loss has been widely used in the recommendation field. BPR loss, for example, is a typical and very effective Pair Wise loss function that is often used, especially with implicit feedback.

List Wise loss pays attention to the order of the items across the whole list and optimizes the model from the perspective of the overall order of the list. In recommendation, List Wise loss is still used relatively little because training data is difficult to produce, training is slow, and online inference is slow, among other reasons. But because it focuses on the overall optimality of the ranking results, it is also something many recommender systems are currently working on.
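The three kinds of objective can be compared side by side in a short sketch. The Point Wise loss here is a log loss on one item and the Pair Wise loss is BPR as named above; the List Wise loss is a ListNet-style softmax cross-entropy, chosen as one common example and not taken from the article. All inputs are plain model scores.

```python
import numpy as np

def pointwise_loss(score, label):
    """Log loss on a single item: each item is scored independently."""
    p = 1.0 / (1.0 + np.exp(-score))
    return float(-(label * np.log(p) + (1 - label) * np.log(1 - p)))

def bpr_loss(score_pos, score_neg):
    """BPR pairwise loss: -log sigmoid(s_pos - s_neg); small when pos outranks neg."""
    return float(np.log1p(np.exp(-(score_pos - score_neg))))

def listwise_softmax_loss(scores, relevance):
    """ListNet-style loss: cross-entropy between softmax(relevance) and softmax(scores)."""
    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()
    target = softmax(np.asarray(relevance, dtype=float))
    pred = softmax(np.asarray(scores, dtype=float))
    return float(-np.sum(target * np.log(pred)))

relevance = [3.0, 2.0, 1.0]  # hypothetical graded labels for a three-item list
good = listwise_softmax_loss([3.0, 2.0, 1.0], relevance)
bad = listwise_softmax_loss([1.0, 2.0, 3.0], relevance)
```

Only the list-level loss needs the whole list as one training example, which is one reason its training data is harder to produce than for the other two.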

From the model structure perspective: the re-ranking module is usually placed after the fine ranking module, and fine ranking has already scored the recommended items fairly accurately, so the input to the re-ranking module is usually the Top-scored output of the fine ranking module, that is, it is already ordered. The scores or order produced by fine ranking are very important reference information for the re-ranking module. The order of the fine ranking output therefore matters, and a model that can take the sequential nature of its input into account is the natural first choice for re-ranking. The most common models that account for sequence order are RNN and Transformer, so these two types of models are often used in the re-ranking module. The general practice is to feed the ordered Top results of fine ranking into the RNN or Transformer. At the feature level, the RNN or Transformer can fuse the context of the current item, that is, the features of the other items in the ranked list, and so assess the effect from the list as a whole. For each input position, the RNN or Transformer fuses these features and outputs a new predicted score, and the items are re-ranked by the new scores, which achieves the goal of fusing contextual information and re-ranking.

References:

1. Recommender System Technology Evolution Trends: From Recall to Ranking to Re-ranking

/p/100019681

2. Typical work on model recall:

FM model recall: Four Recall Models for Recommender Systems: The All-Powerful FM Model

DNN two-tower recall: Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations

3. User Behavior Sequence Recall Typical Work:

GRU: Recurrent Neural Networks with Top-k Gains for Session-based Recommendations

CNN: Personalized Top-N Sequential Recommendation via Convolutional Sequence Embedding

Transformer: Self-Attentive Sequential Recommendation

4. Knowledge Graph Fusion Recall Typical Work:

KGAT: Knowledge Graph Attention Network for Recommendation

RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems

5. Typical work of graph neural network model recall:

GraphSAGE: Inductive Representation Learning on Large Graphs

PinSage: Graph Convolutional Neural Networks for Web-Scale Recommender Systems

6. Model Multi-Objective Optimization Typical Work:

MMOE: Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

Pareto Optimality: A Pareto-Efficient Algorithm for Multiple Objective Optimization in E-Commerce Recommendation

7. Typical Work on Explicit Feature Combination:

Deep & Cross: Deep & Cross Network for Ad Click Predictions

xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems

8. Typical work on feature extractors:

AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks

DeepFM: An End-to-End Wide & Deep Learning Framework for CTR Prediction

9. Comparison of the CNN/RNN feature extractors: /p/54743941

10. Typical work on AutoML in recommendation:

ENAS structure search: Applications of AutoML to Ranking Network Structure Search in Recommendation

Bilinear Feature Combination: FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction

11. Typical work on reinforcement learning in recommendation:

YouTube: Top-K Off-Policy Correction for a REINFORCE Recommender System

YouTube: Reinforcement Learning for Slate-based Recommender Systems: A Tractable Decomposition and Practical Methodology

12. Typical work on multimodal fusion:

DNN Recall: Collaborative Multi-modal deep learning for the personalized product retrieval in Facebook Marketplace

Sort: Image Matters: Visually modeling user behaviors using Advanced Model Server

13. Typical work on long and short-term interest separation:

1. Neural News Recommendation with Long- and Short-term User Representations

2. Sequence-Aware Recommendation with Long-Term and Short-Term Attention Memory Networks

14. List Wise Re-ranking Typical Work:

1. Personalized Re-ranking for Recommendation

2. Learning a Deep Listwise Context Model for Ranking Refinement
