The difference between classification and clustering and their common algorithms

1. The difference between classification and clustering

Classification: a classifier usually has to be told, for each training example, "this item belongs to such-and-such a category." Ideally, the classifier learns from the training set it is given and thereby acquires the ability to classify previously unseen data. Because labelled training data is provided, this process is usually called supervised learning.

Clustering, simply put, means grouping similar things together. When clustering, we do not care what a particular category actually is; the goal is only to place similar items in the same group. A clustering algorithm therefore usually needs only a way to compute the similarity between items in order to start working. Because it usually needs no labelled training data, clustering is called unsupervised learning in machine learning.
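To make the point about similarity concrete, here is a minimal sketch of grouping that uses nothing but a distance function and no labels. It is a simple "leader"-style grouping chosen for illustration, not one of the named algorithms discussed in this article; the function name `group_by_similarity`, the sample points, and the threshold value are all assumptions made for the example.

```python
import math

def group_by_similarity(points, threshold):
    """Place each point in the first group whose first member lies within
    `threshold` of it; otherwise start a new group. No labels are used."""
    groups = []
    for p in points:
        for g in groups:
            # g[0] acts as the representative ("leader") of the group.
            if math.dist(p, g[0]) <= threshold:
                g.append(p)
                break
        else:
            # No existing group is similar enough, so open a new one.
            groups.append([p])
    return groups

# Hypothetical unlabelled data: two visibly separate clumps of 2-D points.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (0.9, 1.1), (5.2, 4.9)]
groups = group_by_similarity(points, threshold=1.0)
for g in groups:
    print(g)
```

The result depends on the order of the points and on the threshold, which is one reason more careful algorithms are normally used in practice.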

2. Common classification and clustering algorithms

Classification, simply put, means assigning an item to one of a set of existing categories based on its characteristics or attributes. For example, text classification, which comes up often in natural language processing (NLP), is a classification problem, and general pattern classification methods can be applied to it. Commonly used classification algorithms include decision trees, the naive Bayes classifier, support vector machine (SVM) classifiers, neural networks, the k-nearest neighbor method (kNN), and fuzzy classification methods.
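Of the algorithms listed above, k-nearest neighbor is probably the easiest to sketch. The following is a minimal illustration using only the Python standard library, not a production implementation; the function name `knn_predict`, the two-feature vectors, and the labels "sports" and "finance" are hypothetical. Note that every training item carries a known label, which is what makes this supervised.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; query: a feature vector."""
    # Keep the k labelled examples closest to the query (Euclidean distance).
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote among the labels of those neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy training set: the category of each item is known in advance.
training_set = [
    ((1.0, 1.2), "sports"),
    ((0.8, 1.0), "sports"),
    ((1.1, 0.9), "sports"),
    ((4.0, 4.2), "finance"),
    ((4.3, 3.9), "finance"),
    ((3.8, 4.1), "finance"),
]

print(knn_predict(training_set, (1.0, 1.0)))
print(knn_predict(training_set, (4.1, 4.0)))
```

For real text classification the feature vectors would come from the documents themselves (for example, word counts), which this sketch leaves out.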

Classification, as a supervised learning method, requires that the categories be clearly known in advance, and it assumes that every item to be classified belongs to one of them. In many cases these conditions are not met, especially when processing massive amounts of data, where preprocessing the data to satisfy the requirements of a classification algorithm would be very costly. In such cases a clustering algorithm can be considered.

K-means clustering is the most typical clustering algorithm. It is a partitioning method, as are the K-MEDOIDS and CLARANS algorithms. Many other clustering algorithms exist: hierarchical methods such as the BIRCH, CURE, and CHAMELEON algorithms; density-based methods such as the DBSCAN, OPTICS, and DENCLUE algorithms; grid-based methods such as the STING, CLIQUE, and WAVE-CLUSTER algorithms; and model-based methods.
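As a rough illustration of K-means, the sketch below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points, stopping when the centroids no longer change. It is a simplified version under stated assumptions: the initial centroids are passed in by hand rather than chosen randomly, and the function name `k_means`, the sample points, and the starting centroids are made up for the example.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Return (centroids, clusters) after running a basic K-means loop."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, clusters

# Hypothetical unlabelled data and hand-picked starting centroids (k = 2).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Unlike the classification case, no labels are supplied here; the number of clusters is fixed by how many starting centroids are given, and a different starting choice can lead to a different result.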