Machine Learning - Clustering
Concept
- Clustering is the process of finding groups of similar instances in data; these groups are called clusters
- Data instances that are similar or near to each other are grouped into one cluster
- Data instances that are (very) different or far away from each other should be in different clusters
- Clusters are unlabelled, and no a priori grouping of the data instances is given
- Thus, clustering is often known as unsupervised learning
Approaches
- K-means
$$label_i = \arg\min_{1 \le j \le K} \sqrt{\sum_{d=1}^{n} (x_{i,d} - a_{j,d})^2}$$

$$a_j = \frac{1}{N(C_j)} \sum_{i \in C_j} x_i$$

Here $x_i$ is the $i$-th sample with $n$ features, $a_j$ is the centroid of cluster $C_j$, and $N(C_j)$ is the number of samples in $C_j$.
The K-means algorithm assumes the sample set is $T = \{X_1, X_2, X_3, \dots, X_m\}$. Using the Euclidean distance formula [1], the algorithm proceeds as follows:
- Determine the value of K (the number of clusters)
- Randomly choose K initial centroids
- Repeat:
  - Assign each data point to the nearest centroid
  - Update the centroids based on the new partition of the data
- Until the stopping criterion is met
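The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `kmeans` and the parameters `max_iter`, `tol` and `seed` are assumptions made for this example, and the loop stops when the centroids barely move, which is only one possible stopping criterion.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-means sketch; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance from every point
        # to every centroid, shape (m, k); take the nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stopping criterion (assumed here): centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Six toy points forming two loose groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Because the initial centroids are drawn at random, the labels and centroids printed here can change with `seed`.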

How do we determine whether a clustering is good (according to K-means)?
- Minimise the Sum of Squared Error (SSE) from the data points to their corresponding centroids:

$$SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2$$

- $C_j$ denotes the $j^{th}$ cluster, $m_j$ is the centroid of cluster $C_j$ (written $a_j$ in the update formula above), and $dist(x, m_j)$ denotes the distance between data point $x$ and its centroid.
- Hence, the stopping criterion for the iterative estimation of the centroids is often based on the change in SSE
- A very small change in SSE between iterations indicates convergence.
- Sometimes a fixed number of iterations is used instead.
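As a small worked example of the SSE formula, the sketch below computes it for four hand-picked points and two centroids, and shows one way to phrase the "very small change" test. The helper names `sse` and `converged` and the threshold `tol` are assumptions for illustration only.

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())

def converged(prev_sse, curr_sse, tol=1e-4):
    """True when the SSE changed by less than tol between two iterations."""
    return abs(prev_sse - curr_sse) < tol

X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [4.0, 2.0]])

print(sse(X, labels, centroids))   # 1 + 1 + 4 + 4 = 10.0
print(converged(10.0, 10.00001))   # True
print(converged(12.5, 10.0))       # False
```

scikit-learn's `KMeans` reports the same quantity for a fitted model as the attribute `inertia_`.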
Example:
- Step 1: randomly initialise the centroids
- Step 2: assign each data point to its nearest centroid
- Step 3: recalculate the centroids
- Repeat steps 2 and 3
- Stop once the assignments converge

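The following scikit-learn example generates four synthetic blobs and clusters them with K-means:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # the old sklearn.datasets.samples_generator path no longer exists

# 2000 two-dimensional samples drawn around four centres
X, y = make_blobs(n_samples=2000, n_features=2,
                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2], random_state=9)

# Raw data, before clustering
plt.scatter(X[:, 0], X[:, 1], marker='+')
plt.show()

# Fit K-means with K=4 and colour each point by its predicted cluster
y_pred = KMeans(n_clusters=4, random_state=9).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()
```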
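The example above fixes `n_clusters=4` because the data were generated with four centres. When K is not known, one common heuristic (not part of the original example) is to fit several values of K and compare the resulting SSE. The sketch below does this on the same synthetic data; the range of K and the variable name `sse_by_k` are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as in the example above
X, _ = make_blobs(n_samples=2000, n_features=2,
                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
                  cluster_std=[0.4, 0.2, 0.2, 0.2], random_state=9)

sse_by_k = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=9).fit(X)
    sse_by_k[k] = model.inertia_  # scikit-learn's name for the SSE

for k, value in sse_by_k.items():
    print(k, round(value, 1))
```

The SSE usually keeps shrinking as K grows, so the value itself does not pick K; people typically look for the point where the decrease levels off.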
Summary
- Generally fast (although an iterative process)
- Still one of the most popular clustering algorithms
- A fuzzified version is often more robust
- The number of clusters must be known in advance, and different values of K give different results
  - In some cases, choosing K is not easy
- Provides only a locally optimal solution
  - Results depend on the initialisation
- Sensitive to outliers
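The dependence on initialisation can be seen on a tiny hand-checkable example. The sketch below runs the assignment and update steps on the four corners of a 4 x 1 rectangle from two different starting points; the helper `lloyd` and its parameter `n_iter` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def lloyd(X, centroids, n_iter=20):
    """Alternate assignment and update steps from the given starting centroids.

    Assumes no cluster ever becomes empty (true for the toy data below).
    """
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    sse = float(((X - centroids[labels]) ** 2).sum())
    return labels, sse

# Corners of a 4 x 1 rectangle
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

labels_a, sse_a = lloyd(X, X[[0, 3]])  # start from opposite corners
labels_b, sse_b = lloyd(X, X[[0, 1]])  # start from two corners on the same side

print(labels_a, sse_a)  # [0 0 1 1] 1.0
print(labels_b, sse_b)  # [0 1 0 1] 16.0
```

Both runs have converged, yet the second is stuck in a solution with sixteen times the SSE. A common mitigation is to rerun with several initialisations and keep the best result, which is what the `n_init` parameter of scikit-learn's `KMeans` does.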
[1] Euclidean distance formula
The Euclidean distance is the straight-line distance between two points.
2D formula: $distance = \sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$
3D formula: $distance = \sqrt{(x_1-x_2)^2+(y_1-y_2)^2+(z_1-z_2)^2}$
Following this pattern, the formula in N dimensions is:
$distance = \sqrt{\sum_{i=1}^{n}(x_{i,1}-x_{i,2})^2}$, where $x_{i,1}$ and $x_{i,2}$ are the $i$-th coordinates of the two points.
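The N-dimensional formula translates directly into code. The function name `euclidean` below is an assumption for illustration; the last line shows the same calculation with NumPy's `np.linalg.norm`.

```python
import math
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean([0, 0], [3, 4]))        # 2D: sqrt(9 + 16) = 5.0
print(euclidean([1, 2, 3], [3, 4, 4]))  # 3D: sqrt(4 + 4 + 1) = 3.0
print(np.linalg.norm(np.array([0, 0]) - np.array([3, 4])))  # 5.0
```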
[2] chaowu1993. (2018). KMeans原理与源码实现 [KMeans: principles and source-code implementation]. Retrieved from https://blog.csdn.net/weixin_40479663/article/details/82974625
