3.3-User Segmentation Analysis

Introduction: In product growth analysis we often want to focus on users who meet certain conditions. We want to know not only the overall behavior of these people (number of visits, visit duration, etc.) but also which subgroups differ most from one another. User segmentation lets us analyze the different groups in depth, uncover the reasons behind the indicator numbers, and find ways to achieve user growth.

I. Applications of User Segmentation

In day-to-day data work we often receive requests like this: we want to focus on the users who meet certain conditions, and we want to know not only their overall behavior (number of visits, visit duration, etc.) but also exactly who they are. After reviewing their data, we export the user list and send targeted tip messages. Sometimes we also want to look further at how certain people actually operate a particular feature. User segmentation is the tool for this kind of request: it helps us analyze the differences between groups in depth, uncover the reasons behind the indicator numbers, and find ways to achieve user growth.

Take segmentation by user profile as an example. Its core value lies in pinpointing the characteristics of each group and uncovering potential user groups. It lets websites, advertisers, enterprises, and advertising agencies fully understand how user groups differ and, based on those differences, helps customers find marketing opportunities and operational directions and strengthen their core influence.

II. Types of User Segmentation

Type I: no segmentation, such as pushing to all active users or mass text messaging. The drawback is that nothing is targeted, which easily annoys users.

Type II: segmentation by basic user information, such as grouping by registration information. Compared with no segmentation this is somewhat targeted, but because it does not reflect a real understanding of the users, the results often fall short of expectations.

Type III: segmentation by user profile, such as age, gender, geography, and user preferences. The focus of profile building is "labeling" the user group; a label is usually a manually defined, highly distilled feature identifier. Combining the labels of a segment outlines a three-dimensional "portrait" of that user group. Profile-based segments let us truly understand certain characteristics of users, which helps a great deal in business promotion.

Type IV: segmentation by user behavior. Building on profile-based segments, this stage adds users' behavioral characteristics. For example, different marketing and promotion strategies can be developed according to users' registration channels and activity habits.

Type V: segmentation by clustering and predictive modeling. Clustering models divide users into groups based on a combination of their characteristics, for example into recreational, idle-online ("hang"), social, and office types. Predictive models try to anticipate users' next attitudes and behaviors (for example, what they want to know or what they want to do). For this reason they are very helpful for turning complex behavioral processes into marketing automation.

III. Common User Segmentation Dimensions

1. Demographic indicators: age, gender, geography

2. Payment status: free, trial, and paid users

3. Purchase history: unpaid users, one-time paying users, repeat paying users

4. Access location: the region where the user uses the product

5. Frequency of use: how often users use the product

6. Depth of use: light, medium, and heavy users

7. Ad clicks: users who clicked on ads vs. users who did not

IV. Common Clustering Methods

Having introduced some segmentation approaches and ideas above, we now focus on user segmentation by clustering.

Clustering methods can be divided into hierarchical clustering (agglomerative methods, divisive methods, dendrograms) and non-hierarchical clustering (partitional clustering, spectral clustering, etc.). The methods most commonly used for Internet user segmentation are K-means clustering and two-step clustering (both are treated here as partitional methods).

Characteristics of cluster analysis:

It is simple and intuitive.

It is mainly used in exploratory research. The analysis can yield several possible solutions, and choosing the final one requires the researcher's subjective judgment and follow-up analysis.

Cluster analysis will return a solution with some number of categories whether or not the data actually contain distinct categories.

The solution depends on the clustering variables the researcher chooses; adding or removing a few variables can substantially change the final solution.

The researcher should therefore pay particular attention to the factors that can affect the results.

Outliers and unusual variables have a large impact on clustering.

When the clustering variables are measured on inconsistent scales, they need to be standardized beforehand.
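As an illustration of the standardization step, here is a minimal sketch in plain Python. It assumes z-score scaling (subtract the column mean, divide by the column standard deviation), which is one common choice and not necessarily the one used in the original analysis; the `standardize` function and the sample users are hypothetical.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column of a list of equal-length numeric rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Population standard deviation; a constant column gets a divisor of 1.0
    # so that it maps to 0 instead of raising ZeroDivisionError.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical users: (age in years, online minutes per day).
users = [(18, 600), (25, 120), (32, 300), (45, 60)]
scaled = standardize(users)
```

Without this step, the online-minutes column would dominate any Euclidean distance simply because its numbers are larger than the ages.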

Weaknesses of cluster analysis:

Clustering is an unsupervised method and does not determine by itself how many classes the data should be divided into.

It is unrealistic to expect it to find clearly separated classes or segments of roughly equal size.

It clusters samples; the relationships among the variables have to be worked out by the researcher.

It does not automatically give an optimal clustering result.

The process of applying cluster analysis:

(1) Selection of clustering variables

When selecting features we start from certain assumptions and try to choose variables that influence how the product is used. These variables generally cover users' attitudes, opinions, and behaviors that are closely related to the product. Cluster analysis also places requirements on the variables themselves: 1. their values should differ significantly across the objects being studied; 2. they should not be highly correlated with one another.

First, more clustering variables is not better: variables without significant differences contribute nothing substantive to the clustering and may bias the results. Second, including highly correlated variables is equivalent to weighting them, which amplifies the role of one aspect in the user classification. There are two ways to identify suitable clustering variables: 1. run a cluster analysis on the variables themselves and pick one representative variable from each resulting cluster; 2. run principal component analysis or factor analysis and use the resulting new variables as clustering variables.

(2) Cluster analysis

Compared with the preparation that precedes it, running the clustering itself is remarkably simple: once the data is ready, import it into a statistical tool, run it, and the results come out. The question that arises here is how many classes the users should be divided into. Usually several criteria are combined: 1. look at the inflection point (hierarchical clustering produces an agglomeration coefficient plot, and the candidates are generally the few cluster counts near the inflection point); 2. judge from experience or product characteristics (user variability differs between products); 3. make sure the result can be explained clearly and logically.
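The "inflection point" criterion can be approximated numerically. The sketch below assumes one simple heuristic, picking the cluster count where the curve bends most sharply (the largest second difference). The `elbow_point` function and the error values are hypothetical, and in practice the result should still be weighed against the other two criteria.

```python
def elbow_point(coefficients):
    """Return the cluster count at the sharpest bend of the curve.

    coefficients[i] is the within-cluster error (or agglomeration
    coefficient) for i + 1 clusters; the values are expected to fall
    as the number of clusters grows.
    """
    if len(coefficients) < 3:
        raise ValueError("need at least three cluster counts to find a bend")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(coefficients) - 1):
        # Second difference: how much the decrease slows down at this point.
        bend = coefficients[i - 1] - 2 * coefficients[i] + coefficients[i + 1]
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k

# Hypothetical error values for k = 1..6 clusters.
errors = [1000.0, 520.0, 210.0, 180.0, 160.0, 150.0]
suggested_k = elbow_point(errors)
```

Here the error stops dropping quickly after three clusters, so the heuristic suggests looking at solutions around k = 3.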

(3) Find out the important characteristics of each category of users

Once a classification scheme is chosen, we go back and observe how users in each category perform on each variable. Based on the results of the analysis of variance, we color-code each category's level on each indicator. In the figure below, red means "well above average", yellow means "average", and blue means "well below average", and likewise for the other variables. In the end we find the important characteristics that distinguish each category of users from the others.
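The labeling idea can be sketched for a single variable as follows. This is a simplified stand-in for the variance test mentioned above: it only compares each cluster's mean with the overall mean, measured in overall standard deviations. The `label_clusters` function, the 0.5 cut-off, and the sample data are all hypothetical.

```python
from statistics import mean, pstdev

def label_clusters(values, labels, threshold=0.5):
    """Label each cluster's mean on one variable relative to the overall mean.

    threshold is a hypothetical cut-off in overall standard deviations.
    """
    overall_mean = mean(values)
    overall_std = pstdev(values) or 1.0
    by_cluster = {}
    for value, label in zip(values, labels):
        by_cluster.setdefault(label, []).append(value)
    result = {}
    for label, members in by_cluster.items():
        z = (mean(members) - overall_mean) / overall_std
        if z > threshold:
            result[label] = "well above average"
        elif z < -threshold:
            result[label] = "well below average"
        else:
            result[label] = "average"
    return result

# Hypothetical daily message counts and cluster assignments.
messages = [2, 3, 1, 20, 22, 19, 80, 75, 90]
clusters = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
profile = label_clusters(messages, clusters)
```

Repeating this for every clustering variable gives a cluster-by-variable table of labels, which is the kind of table the color coding summarizes.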

V. K-Means Clustering for QQ User Segmentation

In this case we first look at K-Means clustering (also known as fast clustering), one of the most commonly used non-hierarchical clustering methods. Because its computation is simple and intuitive and it is relatively fast (compared with hierarchical clustering), K-Means is often the first algorithm used in exploratory analysis. And because it is so widely adopted, it also saves a lot of explanation time when communicating with collaborators.

1. Algorithmic principle of K-means:

1. Randomly take k elements as the centers of the k clusters.

2. Calculate the similarity of the remaining elements to the centers of the k clusters, and assign each of these elements to the cluster with the highest similarity.

3. Based on the clustering results, recalculate the centers of each of the k clusters by taking the arithmetic mean of the respective dimensions of all elements in the cluster.

4. Recluster all elements according to the new centers.

5. Repeat steps 3 and 4 until the clustering result no longer changes, then output the result.

Suppose the raw data set is (X1, X2, ..., Xn), where each Xi is a d-dimensional vector. Given the number of groups k (k ≤ n), K-means clustering partitions the data into k classes S = {S1, S2, ..., Sk} so as to minimize the following expression, where μi denotes the mean of the points in class Si:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{X_j \in S_i} \lVert X_j - \mu_i \rVert^2$$
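The iteration described in the steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the QQ analysis. The function names, the sample points, the optional `initial_centers` argument (added only so the example is reproducible), and the use of squared Euclidean distance as the dissimilarity measure are all assumptions.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, initial_centers=None, max_iter=100, seed=0):
    """Plain K-means; returns (centers, assignments)."""
    rng = random.Random(seed)
    if initial_centers is None:
        # Step 1: pick k distinct points at random as the initial centers.
        centers = [list(p) for p in rng.sample(points, k)]
    else:
        centers = [list(c) for c in initial_centers]
    assignments = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignments

def within_cluster_ss(points, centers, assignments):
    """Sum of squared distances to the assigned centers (the K-means objective)."""
    return sum(squared_distance(p, centers[a]) for p, a in zip(points, assignments))

# Hypothetical two-dimensional data with two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, assignments = kmeans(points, k=2, initial_centers=[points[0], points[3]])
objective = within_cluster_ss(points, centers, assignments)
```

Because the starting centers are random by default, different seeds can converge to different local optima, so it is common to run the algorithm several times and keep the run with the smallest objective value.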

2. QQ user segmentation context and objectives:

QQ has more than 500 million daily login users, covering a wide range of social groups (different ages, industries, interests, etc.). This broad user base needs to be segmented before targeted operational activities can be run.

3. Clustering variable selection: user profile characteristics, user state characteristics, user activity characteristics

4. Clustering analysis and results: using correlation analysis and variable importance analysis, we removed some of the poorer variables and then trained repeatedly on the remaining 11 variables (varying the target number of clusters, the variables included, and the tolerance for individual differences within a group) to arrive at the final clustering.

5. Interpretation and naming of the results

Cluster 1 characteristics: unknown or low age, few friends, very low activity and stickiness. Name: low-end, low-age group

Cluster 2 characteristics: young age, relatively high foreground online time and message activity. Name: active student group

Cluster 3 characteristics: average age of about 27, very high activity on both PC and mobile. Name: workplace high-stickiness group

Cluster 4 characteristics: average age of about 28, very low foreground online time and message activity. Name: workplace low-stickiness group

Cluster 5 characteristics: higher age, long mobile online hours but very little messaging. Name: senior low-activity group

VI. Comparing Two-Step Clustering with K-Means Clustering

The K-Means method described earlier has the advantages of being simple, intuitive, and fast. Its drawbacks are that it can only use numerical variables and cannot include categorical variables, and that it is very sensitive to outliers, which can easily and severely distort the clustering results. Moreover, when the data set is large (which is very common at Tencent) and not all data points fit in memory, K-Means cannot be run on a single machine. Two-step clustering overcomes these drawbacks: it can include both categorical and numerical variables, and it runs smoothly when hardware is limited or the data set is very large. Two-step clustering can be regarded as a combination of an improved BIRCH clustering algorithm and hierarchical clustering: it first uses BIRCH's "clustering feature tree" to pre-cluster the data into subclasses and then uses those subclasses as the input to hierarchical clustering.

1. The principle of two-step clustering:

Step 1: Pre-clustering:

Build a clustering feature tree (CFT) that divides the data into many subclasses.

At the start, one observation is placed at the root node of the tree, which records that observation's variable information. Each subsequent observation is then compared with the existing nodes, using the specified distance measure as the similarity criterion, and is placed in the most similar node; if no sufficiently similar node is found, a new node is created for it. Outliers are identified and removed in this step, so they do not distort the results as easily as in K-Means.
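The core of the pre-clustering pass can be sketched as follows. This is a deliberately flattened illustration: it keeps a flat list of subclusters instead of a tree, uses Euclidean distance to the subcluster centroid, and only flags single-member subclusters as outlier candidates instead of removing them. The `precluster` function, the distance threshold, and the observations are hypothetical, and real BIRCH or two-step implementations are more elaborate.

```python
import math

def precluster(points, threshold):
    """Single-pass pre-clustering into subclusters.

    Each subcluster is summarised by a count and a per-dimension sum,
    from which its centroid can be recovered without storing the points.
    """
    subclusters = []  # each item: {"n": count, "sum": [per-dimension sums]}
    for point in points:
        best, best_dist = None, math.inf
        for sub in subclusters:
            centroid = [s / sub["n"] for s in sub["sum"]]
            dist = math.dist(point, centroid)
            if dist < best_dist:
                best, best_dist = sub, dist
        if best is not None and best_dist <= threshold:
            # Similar enough: absorb the point into the existing subcluster.
            best["n"] += 1
            best["sum"] = [s + x for s, x in zip(best["sum"], point)]
        else:
            # Nothing similar enough: start a new subcluster.
            subclusters.append({"n": 1, "sum": list(point)})
    return subclusters

# Hypothetical observations; the last one sits far from everything else.
observations = [(1.0, 1.0), (1.1, 0.9), (5.0, 5.0), (0.9, 1.2), (5.2, 4.9), (20.0, 20.0)]
subs = precluster(observations, threshold=1.0)
outlier_candidates = [s for s in subs if s["n"] == 1]
```

Because each subcluster stores only summary statistics, the data can be processed in one pass without holding every point in memory, which is what makes the approach usable on very large data sets.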

Step 2: Formal clustering:

The subclasses from the first step are used as input and clustered again with hierarchical clustering (using the log-likelihood function as the distance measure). At each stage, the Schwarz Bayesian Information Criterion (BIC) is used to evaluate how well the current classification fits the data, and at the end a classification scheme that meets the criterion is given.

The number of classes can be determined automatically or specified manually according to business needs.

2. Comparison of the effect of two-step clustering:

Applying two-step clustering to the same data as in Section V, the best model results are as follows.

3. Interpretation of the results of the two-step clustering

Cluster 1 characteristics: unknown age or underage, few friends, very low activity and stickiness. Name: low-end, underage group

Cluster 2 characteristics: young age, relatively high foreground online time and message activity. Name: students or new job seekers

Cluster 3 characteristics: average age of 24, low online time and message activity. Name: low-activity youth group

Cluster 4 characteristics: average age of 25, long online time but low activity. Name: low-activity youth group

Cluster 5 characteristics: average age of 28, low mobile use but high PC activity. Name: high-activity office group

Cluster 6 characteristics: higher age, long mobile online hours but very little messaging. Name: senior low-activity group

VII. Business Case: Mining Special Behavioral Patterns of Mobile QQ User Groups through K-Means Clustering

1. Business needs

In this case, the product manager wants to understand the behavioral patterns of logged-in inactive Mobile QQ users and to segment this huge user base by different combinations of behaviors, so as to attend to the different needs of different groups, or even tap into needs in vertical domains, and then take action on the product or operations side.

2. Analysis Objectives

1. Discover segments of users whose behavioral patterns differ from those of typical users in the overall user base

2. Roughly estimate the number of users in each segment

3. Understand the behavioral characteristics and user profiles of each segment

4. Based on the above results, make product or operations suggestions, or define directions for further exploration in bringing in users

3. Analysis process

a) Feature extraction

The analysis focuses on users' click behavior in Mobile QQ, such as deleting messages, viewing friends' profile pages, and tapping the friends' updates button. We therefore start from the reported click events and sum the number of clicks for each user. To capture typical behavior, four complete weeks (28 days in total) of data are selected, with no holidays in the time window. In addition, for the sake of computational performance and the iterative nature of exploratory analysis, only one in a thousand users is randomly sampled from the overall QQ user base.

b) Feature filtering

In the feature extraction phase, nearly 200 click features were extracted. Some of them have very low coverage, used by only 1 in 100 users over the 28 days; these low-coverage features are removed first.

In addition, as mentioned earlier, highly correlated variables interfere with the clustering process. Here the Pearson correlation coefficient is computed for every pair of features, and among highly correlated features (correlation coefficient greater than 0.5) only the one with the widest coverage is kept, so as to preserve as much of the difference between users as possible.
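The two filtering rules above can be sketched as follows. This is an illustrative reading of the procedure, not the original code. The function names, the feature names, the click counts, and the exact coverage cut-off are hypothetical; the 0.5 correlation threshold comes from the text.

```python
from statistics import mean

def coverage(column):
    """Share of users with at least one click on the feature."""
    return sum(1 for v in column if v > 0) / len(column)

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / (var_x * var_y) ** 0.5

def filter_features(features, min_coverage=0.01, max_corr=0.5):
    """features maps a feature name to its per-user click counts."""
    # Rule 1: drop features used by fewer than min_coverage of users.
    kept = {n: c for n, c in features.items() if coverage(c) >= min_coverage}
    # Rule 2: visit features from widest to narrowest coverage, so that
    # within a highly correlated pair the wider one is the one that survives.
    selected = []
    for name in sorted(kept, key=lambda n: coverage(kept[n]), reverse=True):
        if all(pearson(kept[name], kept[s]) <= max_corr for s in selected):
            selected.append(name)
    return selected

# Hypothetical click counts for six users.
clicks = {
    "send_message": [5, 3, 8, 1, 0, 2],
    "open_chat": [10, 6, 16, 2, 0, 0],   # tracks send_message closely
    "view_profile": [0, 4, 0, 3, 6, 0],
    "rare_button": [0, 0, 0, 0, 0, 0],   # nobody used it
}
selected = filter_features(clicks)
```

In this toy data `open_chat` is dropped because it correlates strongly with the wider-coverage `send_message`, and `rare_button` is dropped for lack of coverage.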

c) Feature transformation: exploration

After the two steps above, I ran several rounds of exploratory clustering, but without exception the result was one super-large class plus dozens of very small classes (a few users or a dozen each). Such a result clearly runs against the goal of the analysis. For one thing, the small groups found are too small to be of value from a business perspective; for another, the super-large class is essentially the whole user base, and the differences among the users in it cannot be identified.

Why does this happen? Mainly because click behavior basically follows a power-law distribution: a large number of users are concentrated in the low-frequency range while a very small number of users have very high frequencies. In a typical clustering algorithm, the high-frequency users are grouped into a handful of small classes and the large number of low-frequency users are grouped into one super-large class.

The typical remedy for this situation is to take the logarithm of the frequencies, so that the power-law distribution is transformed into an approximately normal distribution before clustering. In this study, taking the natural logarithm improved the clustering only slightly; the result was still one super-large class plus several classes with very few people. The reason lies in a characteristic of click behavior data: core features and popular entries, such as the chat box and the friends button, receive a large number of clicks, while relatively unpopular features have a large number of 0 values. In this case, taking logarithms does little to help.
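A small sketch of the logarithmic transform shows both what it does and why it falls short here. It assumes ln(1 + x) in place of the plain natural logarithm so that zero counts stay defined; the `log_transform` function and the click counts are hypothetical.

```python
import math

def log_transform(counts):
    """Apply ln(1 + x) so that zero counts stay defined (and stay at 0)."""
    return [math.log1p(c) for c in counts]

# Hypothetical click counts for one feature across eight users.
clicks = [0, 0, 0, 1, 2, 5, 40, 3000]
transformed = log_transform(clicks)

# The extreme value is pulled in from 3000 to about 8, but the
# three zero values are still zeros after the transform.
zeros_after = transformed.count(0.0)
```

The long tail is compressed, yet every user who never used the feature still has a 0, so sparse niche features remain indistinguishable for most users.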

Going back to the goal of this analysis, we need to "discover segments of users whose behavioral patterns differ from those of typical users in the overall user base". If we discard the unpopular features and look only at the popular ones, we cannot identify the relatively niche behavioral patterns, and the goal of the analysis is not met. This kind of numerical sparseness reminds me of text categorization. In the bag-of-words model used in text categorization, the word vector of each "document" also contains a large number of 0 values, and the bag-of-words solution is to weight the word vectors with the TF-IDF method.

d) Feature transformation - TF-IDF

In the bag-of-words model for text categorization, documents (e.g. a news article, a microblog post, a comment) need to be grouped by the topic they discuss, and each document contains many words (terms). TF (Term Frequency) is the ratio of the number of occurrences of a word in a document to the total number of words in that document. This simple calculation shows which words are more frequent in a document without being affected by the length of the document itself.

On the other hand, some words are "popular" words that appear in every article and do little to distinguish its subject (e.g. "report" and "reporter" in news). The weight of such words needs to be reduced. This is done by computing (total number of documents / number of documents containing the word) and taking its logarithm: a word that appears in every article gets a weight of 0, and the fewer documents contain the word, the larger the value. This quantity is the IDF (Inverse Document Frequency).

From the discussion above, the reader may already have guessed that if we replace the concept of "document" with "user" and "occurrences of a word" with "clicks on a feature", we can categorize types of user behavior. First, the feature preferences of low-frequency users are reflected by the TF calculation, so these users are no longer lumped into a single low-frequency class merely because their overall usage is lower than that of high-frequency users. At the same time, IDF gives niche features more weight, making it easier for niche preferences to stand out in clustering.
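The analogy can be made concrete with a minimal sketch that treats each user as a "document" and each clickable feature as a "term". It assumes the basic form TF × ln(N / n_feature) with no smoothing; the `tf_idf` function, the feature names, and the click counts are hypothetical, and the original analysis may have used a different variant.

```python
import math

def tf_idf(click_counts):
    """click_counts: one dict per user mapping feature name -> click count."""
    n_users = len(click_counts)
    # Number of users who clicked each feature at least once.
    users_with = {}
    for user in click_counts:
        for feature, count in user.items():
            if count > 0:
                users_with[feature] = users_with.get(feature, 0) + 1
    weighted = []
    for user in click_counts:
        total = sum(user.values())
        row = {}
        for feature, count in user.items():
            if count == 0:
                continue
            tf = count / total  # share of this user's clicks
            idf = math.log(n_users / users_with[feature])
            row[feature] = tf * idf
        weighted.append(row)
    return weighted

# Hypothetical click counts for three users.
users = [
    {"chat": 900, "delete_message": 100},  # heavy user
    {"chat": 9, "delete_message": 1},      # light user, same preferences
    {"chat": 5, "nearby_people": 5},       # niche preference
]
weights = tf_idf(users)
```

The heavy and the light user end up with the same weights because TF looks only at each user's click proportions, the universally used `chat` feature gets weight 0, and the niche `nearby_people` feature gets the largest weight.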

e) Clustering results

After this feature transformation, clustering with the K-Means algorithm gives results that are much more in line with the analysis goals. From the large data set we found middle school students who often delete messages on QQ, lonely men who avidly browse "people nearby" but seldom chat, young people in big cities who treat QQ as a news client, and silent followers who seldom talk but often visit their friends' pages. The size, behavioral characteristics, and background characteristics of each group were estimated. Based on this data, we are exploring ways to improve the product.

Summary

The biggest change that user segmentation brings to user data research is breaking down data silos and truly getting to know users. By analyzing the characteristics of the users behind a given indicator (their demographic attributes, behavioral characteristics, etc.), we can discover the causes of product problems and find opportunities or directions for effective product improvement.

In cluster analysis, the selection and preparation of features is very important: 1. suitable variables need to differ significantly across samples; 2. the variables must not be strongly correlated with one another, otherwise methods such as PCA should be used to reduce the dimensionality first; 3. the data needs to be transformed according to the characteristics of the data itself and of the business (standardization, taking logarithms, etc.).

The choice of clustering algorithm should take into account the characteristics of the data (whether there are categorical variables, whether there are outliers, the data volume, whether the data forms clusters) as well as computation speed (exploratory analysis often requires fast computation) and accuracy (whether the clusters can be identified accurately). The parameters of the algorithm, such as the number of classes K in K-Means, need to be chosen by combining technical indicators with business background, so as to arrive at a logically sound classification scheme.