Machine Learning Notes: Clustering

Latest recommended article published on 2026-02-11 01:52:56

This content is licensed under the CC 4.0 BY-SA copyright agreement.


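This article introduces clustering methods in machine learning, in particular the K-means algorithm. Clustering is the process of finding groups of similar data and is usually regarded as unsupervised learning. K-means iterates to find the clustering that minimises the sum of squared errors from the data points to their assigned centroids. The algorithm initialises K centroids, then repeatedly assigns each data point to its nearest centroid and updates the centroids, until the change in the sum of squared errors is very small or a preset number of iterations is reached. The method is fast and widely used, but the number of clusters must be set in advance, the result depends on the choice of initial centroids, and it is sensitive to outliers.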

Machine Learning - Clustering

Concept

Clustering is the process of finding similarity groups in data, called clusters
- Data instances that are similar or near to each other are grouped into one cluster
- Data instances that are (very) different or far away from each other should be in different clusters
- Clusters are unlabelled and no a priori grouping of the data instances is given
- Thus, clustering is often known as unsupervised learning

Approaches

K-means
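$label_i = \arg\min_{1 \le j \le K} \sqrt{\sum_{d=1}^{n}(x_{i,d} - a_{j,d})^2}$
$a_j = \frac{1}{N(C_j)}\sum_{i\in C_j}x_i$
Here $x_i$ is the $i^{th}$ sample with $n$ features ($x_{i,d}$ is its $d^{th}$ feature), $a_j$ is the centroid of cluster $C_j$, and $N(C_j)$ is the number of samples in $C_j$.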
The K-means algorithm assumes the sample set is $T = \{x_1, x_2, x_3, \ldots, x_m\}$.
Using the Euclidean distance formula [1], the algorithm proceeds as follows:
Determine the value of K (the number of clusters)
Randomly choose K initial centroids
Repeat:
- Assign each data point to the nearest centroid
- Update the centroids based on the new partition of the data
Until the stopping criterion is met
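
The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used by any library: the function name `kmeans`, the parameters `max_iter` and `seed`, and the choice to stop when the assignments no longer change are all assumptions made for this example.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Illustrative K-means on an (m, n) array X; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Randomly choose K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance from every point
        # to every centroid, then pick the nearest centroid per point.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        # Assumed stopping criterion: the assignments did not change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Usage: two well-separated pairs of points, K = 2.
X = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Which cluster gets index 0 depends on the random initialisation, so only the grouping, not the label values, is meaningful.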

How do we decide whether a clustering is good (according to K-means)?
K-means aims to minimise the Sum of Squared Error (SSE) from the data points to their corresponding centroids:
$SSE = \sum_{j=1}^{k}\sum_{x\in C_j} dist(x,m_j)^2$
$C_j$ denotes the $j^{th}$ cluster, $m_j$ is the centroid of cluster $C_j$ (written $a_j$ in the centroid update formula above), and $dist(x,m_j)$ denotes the distance between data point $x$ and its centroid.
Hence, the stopping criterion for the iterative estimation of the centroids is often based on the change in SSE
- A very small change in SSE indicates convergence.
- Sometimes a fixed number of iterations is used instead.
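
As a small worked example of the SSE and of an SSE-based stopping test, the sketch below evaluates two candidate sets of centroids for the same assignment. The helper names `sse` and `has_converged` and the threshold `tol` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def sse(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]  # centroids[labels] picks one centroid per point
    return float((diffs ** 2).sum())


def has_converged(prev_sse, curr_sse, tol=1e-4):
    """Assumed stopping test: the SSE changed by less than tol."""
    return abs(prev_sse - curr_sse) < tol


X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

poor_centroids = np.array([[0.0, 0.0], [10.0, 0.0]])   # sitting on two data points
mean_centroids = np.array([[0.0, 1.0], [10.0, 1.0]])   # the cluster means

print(sse(X, labels, poor_centroids))  # 8.0
print(sse(X, labels, mean_centroids))  # 4.0
print(has_converged(8.0, 4.0))         # False
```

Moving each centroid to the mean of its cluster halves the SSE here, which is why the update step uses the mean.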

Example:

Step 1: random initialisation of centroids
Step 2: assign each data point to the nearest centroid
Step 3: recalculate the centroids
Repeat steps 2 and 3 until convergence

	import matplotlib.pyplot as plt
	# In recent scikit-learn releases make_blobs is imported from sklearn.datasets
	from sklearn.datasets import make_blobs
	from sklearn.cluster import KMeans
	# Generate 2000 two-dimensional points around four centres
	X, y = make_blobs(n_samples=2000, n_features=2, centers=[[-1, -1], [0, 0], [1, 1], [2, 2]], cluster_std=[0.4, 0.2, 0.2, 0.2], random_state=9)
	# Plot the raw, unlabelled data
	plt.scatter(X[:, 0], X[:, 1], marker='+')
	plt.show()
	# Cluster with K = 4 and colour each point by its predicted cluster
	y_pred = KMeans(n_clusters=4, n_init=10, random_state=9).fit_predict(X)
	plt.scatter(X[:, 0], X[:, 1], c=y_pred)
	plt.show()

Summary

Generally fast (although it is an iterative process)
Still one of the most popular clustering algorithms
- A fuzzified version (for example fuzzy c-means) is often more robust
The number of clusters K must be chosen in advance, and different K values give different results
- In some cases, choosing K is not easy
Provides only a locally optimal solution
- Results depend on the initialisation
Sensitive to outliers
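
The sensitivity to outliers follows directly from the centroid being a mean. The toy numbers below are made up for illustration, and the sketch assumes the outlier has been assigned to the same cluster as the four regular points.

```python
import numpy as np

# Four points forming a tight cluster, plus one far-away outlier (toy data).
cluster = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])
outlier = np.array([[50.0, 50.0]])

# Centroid of the clean cluster: the mean of its points.
centroid_clean = cluster.mean(axis=0)

# Centroid if the outlier is assigned to the same cluster.
centroid_with_outlier = np.vstack([cluster, outlier]).mean(axis=0)

# How far the single outlier dragged the centroid.
shift = np.linalg.norm(centroid_with_outlier - centroid_clean)

print(centroid_clean)         # [1.5 1.5]
print(centroid_with_outlier)  # [11.2 11.2]
print(shift)
```

A single extreme point moves the centroid well outside the original cluster, which in turn can change how the remaining points are assigned in the next iteration.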

[1] Euclidean distance formula
The Euclidean distance is the straight-line distance between two points.
2D formula: $\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$
3D formula: $\sqrt{(x_1-x_2)^2+(y_1-y_2)^2+(z_1-z_2)^2}$
Following this pattern, the formula in $n$ dimensions is:
$\sqrt{\sum_{i=1}^{n}(x_{i,1}-x_{i,2})^2}$, where $x_{i,1}$ and $x_{i,2}$ are the $i^{th}$ coordinates of the first and second point.
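
The n-dimensional formula translates directly into code. The function name `euclidean_distance` is just a name chosen for this sketch; it works for points of any dimension as long as both have the same length.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("both points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean_distance((0, 0), (3, 4)))        # 5.0
print(euclidean_distance((1, 2, 3), (3, 4, 4)))  # 3.0
```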