Machine Learning Notes: Clustering

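This article introduces clustering methods in machine learning, in particular the K-means algorithm. Clustering is the process of finding groups of similar data and is usually regarded as unsupervised learning. K-means iteratively searches for the clustering that minimises the sum of squared errors between data points and the centroids of their clusters. The algorithm initialises K centroids, then repeatedly assigns each data point to its nearest centroid and updates the centroids, until the change in the sum of squared errors becomes very small or a preset number of iterations is reached. The method is fast and widely used, but the number of clusters must be set in advance, the result depends on the choice of initial centroids, and it is sensitive to outliers.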

Machine Learning - Clustering

Concept

  1. Clustering is the process of finding groups of similar instances in data; these groups are called clusters
    • Data instances that are similar or near to each other are grouped into one cluster
    • Data instances that are (very) different or far away from each other should be in different clusters
    • Clusters are unlabelled, and no a priori grouping of the data instances is given
    • Thus, clustering is often known as unsupervised learning
Approaches
  • K-means
    Assume the sample set is $T = \{x_1, x_2, x_3, \ldots, x_m\}$, where each sample has $n$ features.
    Each sample $x_i$ is labelled with the index of its nearest centroid $a_j$, measured by the Euclidean distance [1]:
    $label_i = \arg\min_{1 \le j \le K} \sqrt{\sum_{d=1}^{n} (x_{i,d} - a_{j,d})^2}$
    Each centroid is the mean of the $N(C_j)$ samples assigned to cluster $C_j$:
    $a_j = \frac{1}{N(C_j)} \sum_{i \in C_j} x_i$
    The algorithm proceeds as follows:
  • Determine the value of K (the number of clusters)
  • Randomly choose K initial centroids
  • Repeat:
    • Assign each data point to the nearest centroid
    • Update the centroids based on the new partition of the data
  • Until the stopping criterion is met
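
The steps above can be sketched in plain NumPy. This is a minimal illustration, not a reference implementation: the function name `kmeans`, its parameters, the rule that an empty cluster keeps its previous centroid, and the choice to stop when the centroids no longer move are all assumptions made for this sketch.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means on an (m, n) array X; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Randomly choose K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: Euclidean distance from every point to every centroid,
        # then pick the index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        # Assumption of this sketch: an empty cluster keeps its old centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stopping criterion used here: the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of 2-D points (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initial centroids, so only the grouping itself is meaningful, not the label values.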
    How do we decide whether a clustering is good (according to K-means)?
  • Minimise the Sum of Squared Error (SSE) from data points to their corresponding centroids
    $SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2$
  • $C_j$ denotes the $j^{th}$ cluster, $m_j$ is the centroid of cluster $C_j$ (written $a_j$ above), and $dist(x, m_j)$ denotes the distance between data point $x$ and its centroid.
  • Hence, the stopping criterion for the iterative estimation of the centroids is often based on the change in SSE
    • A very small change in SSE between iterations indicates convergence.
    • Sometimes a fixed number of iterations is used instead.
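
As a small worked example of the SSE and of an SSE-based stopping test, the sketch below uses NumPy. The helper names `sse` and `has_converged` and the tolerance `tol=1e-4` are hypothetical choices for illustration, and the distance is assumed to be Euclidean.

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    diffs = X - centroids[labels]  # row i is x_i minus the centroid of its cluster
    return float(np.sum(diffs ** 2))

def has_converged(prev_sse, curr_sse, tol=1e-4):
    """True when the SSE changed by less than tol between two iterations."""
    return abs(prev_sse - curr_sse) < tol

# Four points in two clusters; every point lies at distance 1 from its centroid.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

print(sse(X, labels, centroids))     # 4.0
print(has_converged(4.0, 4.00001))   # True
print(has_converged(10.0, 4.0))      # False
```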

Example:

  • Step 1: random initialisation of the centroids
    [Figure: Step 1]
  • Step 2: assign each data point to its nearest centroid
    [Figure: Step 2]
  • Step 3: recalculate the centroids
    [Figure: Step 3]
  • Repeat steps 2 and 3:
    [Figure: loop]
  • Until convergence
    [Figure: final result]
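	# K-means on synthetic 2-D data with scikit-learn
	import matplotlib.pyplot as plt
	from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in newer releases
	from sklearn.cluster import KMeans
	# 2000 points drawn around four centres with different spreads
	X, y = make_blobs(n_samples=2000, n_features=2,
	                  centers=[[-1, -1], [0, 0], [1, 1], [2, 2]],
	                  cluster_std=[0.4, 0.2, 0.2, 0.2], random_state=9)
	plt.scatter(X[:, 0], X[:, 1], marker='+')  # raw, unlabelled data
	plt.show()
	# Fit K-means with K=4 and colour each point by its predicted cluster
	y_pred = KMeans(n_clusters=4, random_state=9).fit_predict(X)
	plt.scatter(X[:, 0], X[:, 1], c=y_pred)
	plt.show()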

Summary

  • Generally fast (although an iterative process)
  • Still one of the most popular clustering algorithms
    • A fuzzified version is often more robust
  • The number of clusters K must be known in advance, and different values of K give different results
    • In some cases, choosing K is not easy
  • Provides only a local solution
    • Results depend on the initialisation
  • Sensitive to outliers
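
The dependence on initialisation can be seen on a tiny hand-made data set. The sketch below runs the assign/update loop from two different starting points; the helper `lloyd`, the four rectangle corners, and both sets of initial centroids are assumptions chosen for illustration (the helper does not handle empty clusters, which do not occur here).

```python
import numpy as np

def lloyd(X, init_centroids, n_iters=50):
    """Run the assign/update loop from given initial centroids; returns (labels, centroids, sse)."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(n_iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    sse = float(np.sum((X - centroids[labels]) ** 2))
    return labels, centroids, sse

# Corners of a wide, flat rectangle: the natural split is left pair vs right pair.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

_, _, sse_good = lloyd(X, [[0.0, 0.5], [4.0, 0.5]])  # starts near the left/right split
_, _, sse_bad = lloyd(X, [[2.0, 0.0], [2.0, 1.0]])   # starts on a bottom/top split

print(sse_good)  # 1.0
print(sse_bad)   # 16.0
```

Both runs are stable under further iterations, yet the second one is stuck in a much worse local solution. Running the algorithm from several initialisations and keeping the result with the lowest SSE is a common way to reduce this risk.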

[1] Euclidean distance formula
The Euclidean distance is the straight-line distance between two points.
2D formula: $distance = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
3D formula: $distance = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$
Following this pattern, the formula in $n$ dimensions is:
$distance = \sqrt{\sum_{i=1}^{n} (x_{i,1} - x_{i,2})^2}$, where $x_{i,1}$ and $x_{i,2}$ are the $i$-th coordinates of the two points.
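
The n-dimensional formula can be checked numerically with a few lines of NumPy. The function name `euclidean_distance` is a hypothetical helper for this sketch; `np.linalg.norm(p - q)` computes the same quantity.

```python
import numpy as np

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Square the coordinate-wise differences, sum them, take the square root.
    return float(np.sqrt(np.sum((p - q) ** 2)))

print(euclidean_distance([0, 0], [3, 4]))              # 5.0 (2D)
print(euclidean_distance([1, 2, 3], [4, 6, 3]))        # 5.0 (3D)
print(euclidean_distance([0, 0, 0, 0], [1, 1, 1, 1]))  # 2.0 (4D)
```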

[2] chaowu1993. (2018). KMeans原理与源码实现 (KMeans: Principles and Source Code Implementation). Retrieved from https://blog.csdn.net/weixin_40479663/article/details/82974625
