Machine Learning 07 - Unsupervised Learning


This content is licensed under the CC 4.0 BY-SA license.


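This article covers the K-means clustering algorithm from Andrew Ng's Machine Learning course and practical notes on using it, including initialization tricks and ways to choose the number of clusters. It also describes how principal component analysis (PCA) is used for dimensionality reduction and gives advice on choosing how many principal components to retain.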

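I am studying Andrew Ng's Stanford Machine Learning course and take notes regularly for review and consolidation.
My knowledge is limited, so please bear with any errors or omissions and feel free to point them out.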

7.1 Clustering

7.1.1 K-means algorithm

Intuition

The K-means algorithm alternates between two steps :

Cluster assignment step
Move centroid step

The algorithm is illustrated in the figure below :

[Figure: steps]

Symbols

$c^{(i)}$ : index of cluster ( $1,2,\dots K$ ) to which example $x^{(i)}$ is currently assigned
$\mu_{k}$ : cluster centroid $k \; (\mu_{k} \in \mathbb{R}^{n})$
$\mu_{c^{(i)}}$ : cluster centroid of cluster to which example $x^{(i)}$ has been assigned

Optimization objective

$\underset{c^{(1)},\cdots ,c^{(m)}, \mu _{1},\cdots ,\mu _{K}}{\text{min}} \; J(c^{(1)},\cdots ,c^{(m)},\mu _{1},\cdots ,\mu _{K})=\frac{1}{m}\sum_{i=1}^{m}\left \| x^{(i)}-\mu _{c^{(i)}} \right \|^{2}$

K-means algorithm - Algorithm 4

Randomly initialize $K$ cluster centroids $\mu_{1}, \mu_{2}, \dots, \mu_{K} \in \mathbb{R}^{n}$
Repeat{
$\quad$ for $i=1$ to $m$
$\qquad$ $c^{(i)}:=$ index (from $1$ to $K$ ) of cluster centroid closest to $x^{(i)}$
$\quad$ for $k=1$ to $K$
$\qquad$ $\mu _{k}:=$ average (mean) of points assigned to cluster $k$
}
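The two steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names, the toy dataset, the stopping rule (stop when the assignments no longer change), and the handling of empty clusters are all assumptions of this sketch. Cluster indices run from 0 to K-1 here instead of 1 to K.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2 between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(X, K, max_iters=100, seed=0):
    """Run K-means on a list of points X; returns (c, mu)."""
    rng = random.Random(seed)
    # Assumption: use K distinct training examples as the initial centroids.
    mu = rng.sample(X, K)
    c = None
    for _ in range(max_iters):
        # Cluster assignment step: c[i] = index of the centroid closest to X[i].
        new_c = [min(range(K), key=lambda k: squared_distance(x, mu[k])) for x in X]
        if new_c == c:  # assignments unchanged, so the centroids would not move
            break
        c = new_c
        # Move centroid step: mu[k] = mean of the points assigned to cluster k.
        for k in range(K):
            members = [x for x, ci in zip(X, c) if ci == k]
            if members:  # assumption: an empty cluster keeps its old centroid
                mu[k] = mean_point(members)
    return c, mu

def distortion(X, c, mu):
    """Cost J = (1/m) * sum_i ||x_i - mu_{c_i}||^2."""
    return sum(squared_distance(x, mu[ci]) for x, ci in zip(X, c)) / len(X)

# Toy data: two well-separated groups of three points each.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
c, mu = k_means(X, K=2)
print("assignments:", c)
print("centroids:", mu)
print("J =", distortion(X, c, mu))
```

Which group receives index 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.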

7.1.2 Important tricks

Because the $K$ cluster centroids are initialized randomly, different initializations can converge to different solutions, and K-means may get stuck in a local optimum instead of the global one.

For example :

[Figure: local optimum]

Random Initialization

For $i=1$ to $100$ {
$\quad$ Randomly initialize K-means.
$\quad$ Run K-means. Get $c^{(1)},\cdots ,c^{(m)},\mu _{1},\cdots ,\mu _{K}$ .
$\quad$ Compute cost function (distortion)
$\qquad$ $J(c^{(1)},\cdots ,c^{(m)},\mu _{1},\cdots ,\mu _{K})$
}
Pick clustering that gave lowest cost $J(c^{(1)},\cdots ,c^{(m)},\mu _{1},\cdots ,\mu _{K})$ .

For small $K$ (roughly $K=2$ to $10$), running multiple random initializations helps considerably. When $K$ is large, a single run is already likely to give a reasonably good solution, so extra initializations help less.
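The random-initialization loop can be sketched as follows. This is a self-contained illustration under assumptions: the helper names, the toy dataset with three groups, and the choice of training examples as initial centroids are not from the original notes.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(X, K, rng, max_iters=100):
    """One K-means run from a random initialization; returns (c, mu, J)."""
    mu = rng.sample(X, K)  # assumption: K distinct training examples as centroids
    c = None
    for _ in range(max_iters):
        new_c = [min(range(K), key=lambda k: sq_dist(x, mu[k])) for x in X]
        if new_c == c:
            break
        c = new_c
        for k in range(K):
            members = [x for x, ci in zip(X, c) if ci == k]
            if members:  # an empty cluster keeps its old centroid
                mu[k] = tuple(sum(v) / len(members) for v in zip(*members))
    J = sum(sq_dist(x, mu[ci]) for x, ci in zip(X, c)) / len(X)
    return c, mu, J

def best_of_random_inits(X, K, n_inits=100, seed=0):
    """Run K-means n_inits times and keep the run with the lowest distortion J."""
    rng = random.Random(seed)
    runs = [k_means(X, K, rng) for _ in range(n_inits)]
    best = min(runs, key=lambda run: run[2])
    return best, [run[2] for run in runs]

# Toy data: three well-separated groups of three points each.
X = [(0, 0), (0, 1), (1, 0),
     (10, 10), (10, 11), (11, 10),
     (20, 0), (20, 1), (21, 0)]
(best_c, best_mu, best_J), all_J = best_of_random_inits(X, K=3)
print("lowest J over all runs :", best_J)
print("highest J over all runs:", max(all_J))
```

Comparing the lowest and highest cost across runs gives a quick sense of how much the initialization matters for a given dataset.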

Number of Clusters

Choosing the number of clusters is a judgment call. There is no single correct answer, and the choice is often based on experience with the data.

One approach to try (not always effective) is the elbow method : plot the cost $J$ against the number of clusters $K$ and choose the $K$ at the "elbow", where $J$ stops decreasing rapidly.
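A rough sketch of the elbow method is shown here, printing a crude text plot of $J$ against $K$ instead of drawing a figure. The helper names, the toy dataset, and the use of several random initializations per $K$ are assumptions of this sketch.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means_cost(X, K, rng, max_iters=100):
    """Run K-means once from a random initialization and return its cost J."""
    mu = rng.sample(X, K)
    c = None
    for _ in range(max_iters):
        new_c = [min(range(K), key=lambda k: sq_dist(x, mu[k])) for x in X]
        if new_c == c:
            break
        c = new_c
        for k in range(K):
            members = [x for x, ci in zip(X, c) if ci == k]
            if members:
                mu[k] = tuple(sum(v) / len(members) for v in zip(*members))
    return sum(sq_dist(x, mu[ci]) for x, ci in zip(X, c)) / len(X)

def lowest_cost(X, K, n_inits=50, seed=0):
    """Lowest cost found over several random initializations."""
    rng = random.Random(seed)
    return min(k_means_cost(X, K, rng) for _ in range(n_inits))

# Toy data with three natural groups.
X = [(0, 0), (0, 1), (1, 0),
     (10, 10), (10, 11), (11, 10),
     (20, 0), (20, 1), (21, 0)]

costs = {K: lowest_cost(X, K) for K in range(1, 7)}
for K, J in costs.items():
    bar = "#" * int(round(J))  # crude text stand-in for the J-K plot
    print(f"K={K}  J={J:8.3f}  {bar}")
```

On data with clear groups the cost drops sharply until $K$ reaches the number of groups and flattens afterwards. On real data the curve is often smooth, which is why the method does not always give a clear answer.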

Sometimes K-means is run to serve a later, downstream purpose. In that case, evaluate each choice of $K$ with a metric for how well it serves that purpose.

7.2 Dimensionality Reduction

7.2.1 Intuition

The intuition is to project the data onto a lower-dimensional surface, for example from 2D onto a line (1D) or from 3D onto a plane (2D).

Applications : data compression, data visualization …

7.2.2 Principal Component Analysis

To reduce data from $n$ dimensions to $k$ dimensions, PCA does the following :

Find $k$ vectors $u^{(1)},u^{(2)},\cdots u^{(k)}\in \mathbb{R}^{n}$ onto which to project the data so as to minimize the projection error.

Principal Component Analysis - Algorithm 5

Preprocessing : apply feature scaling / mean normalization (ensure every feature has zero mean)
Calculate the covariance matrix :
$\quad \Sigma = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)})(x^{(i)})^{T}$ (named Sigma in the code that follows)
Compute the singular value decomposition :
$\quad$ [U, S, V] = svd(Sigma);
$\quad$ Ureduce = U(:, 1 : k);
$\quad$ z = Ureduce' * x;
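The same steps can be sketched in Python with NumPy, as a counterpart to the Octave lines above. This is an illustrative sketch under assumptions: examples are stored as rows of `X`, only mean normalization is applied (feature scaling is left out), and the function name `pca` and the toy data are not from the original notes.

```python
import numpy as np

def pca(X, k):
    """X has shape (m, n), one example per row. Returns (mu, U_reduce, S, Z)."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    X_norm = X - mu                     # mean normalization (no feature scaling here)
    Sigma = (X_norm.T @ X_norm) / m     # n x n covariance matrix
    U, S, _ = np.linalg.svd(Sigma)      # S holds the singular values in decreasing order
    U_reduce = U[:, :k]                 # first k columns of U, shape (n, k)
    Z = X_norm @ U_reduce               # row i is the transpose of Ureduce' * x_i
    return mu, U_reduce, S, Z

# Toy data: four points lying exactly on the line y = 2x.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
mu, U_reduce, S, Z = pca(X, k=1)
print("principal direction:", U_reduce.ravel())
print("projected data     :", Z.ravel())
```

The sign of the principal direction returned by the SVD is arbitrary, so the projected values may come out negated. Both are equally valid.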

Reconstruction from Compressed Representation

$x_{approx}^{(i)} = U_{\text{reduce}}z^{(i)}, \; i=1,2,\cdots, m$

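A small NumPy sketch of reconstruction follows. It assumes examples are stored as rows and that the mean removed during preprocessing is added back after multiplying by $U_{\text{reduce}}$. That last step and all function names are assumptions of this sketch, not part of the formula above.

```python
import numpy as np

def pca_fit(X, k):
    """Return (mu, U_reduce, S) for data X of shape (m, n)."""
    mu = X.mean(axis=0)
    X_norm = X - mu
    Sigma = (X_norm.T @ X_norm) / X.shape[0]
    U, S, _ = np.linalg.svd(Sigma)
    return mu, U[:, :k], S

def project(X, mu, U_reduce):
    """Compress: z = U_reduce' * x for each (mean-normalized) row of X."""
    return (X - mu) @ U_reduce

def reconstruct(Z, mu, U_reduce):
    """Approximate reconstruction: x_approx = U_reduce * z, then undo the mean shift."""
    return Z @ U_reduce.T + mu

# Toy data: points close to, but not exactly on, a line.
X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]])

mu, U_reduce, S = pca_fit(X, k=1)
X_approx = reconstruct(project(X, mu, U_reduce), mu, U_reduce)
mse = np.mean(np.sum((X - X_approx) ** 2, axis=1))
print("mean squared projection error with k=1:", mse)

# With k = n nothing is discarded, so the reconstruction is exact.
mu2, U2, _ = pca_fit(X, k=2)
X_full = reconstruct(project(X, mu2, U2), mu2, U2)
print("exact with k=2:", np.allclose(X_full, X))
```

With $k<n$ the reconstruction only recovers the component of each point that lies along the retained directions, which is why the result is an approximation.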
7.2.3 Choose the $k$

Here $k$ (the dimension of $z$ ) is also called the number of principal components.

Typically, choose $k$ to be the smallest value such that
$\frac{\frac{1}{m}\sum_{i=1}^{m}\left \| x^{(i)}-x_{approx}^{(i)} \right \|^{2}}{\frac{1}{m}\sum_{i=1}^{m}\left \| x^{(i)} \right \|^{2}}\leq 0.01$
The threshold $0.01$ means that $99\%$ of the variance is retained.

An easier way to compute this is shown below :

Choose the $k$ - Algorithm 6

[U, S, V] = svd(Sigma)
Pick the smallest value of $k$ for which

$\frac{\sum_{i=1}^{k}S_{ii}}{\sum_{i=1}^{n}S_{ii}}\geq 0.99$

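The selection rule can be sketched in a few lines of Python. The function name `choose_k` and the example singular values are hypothetical, chosen only to make the arithmetic easy to follow. The input is assumed to be the diagonal of `S`, sorted in decreasing order as `svd` returns it.

```python
def choose_k(singular_values, variance_to_retain=0.99):
    """Smallest k whose leading singular values reach the requested share of the total."""
    total = sum(singular_values)
    running = 0.0
    for k, s in enumerate(singular_values, start=1):
        running += s
        if running / total >= variance_to_retain:
            return k
    return len(singular_values)

# Hypothetical diagonal of S (illustrative values only).
S_diag = [800.0, 195.0, 3.0, 2.0]
print(choose_k(S_diag))         # share after 2 components: 995 / 1000 = 0.995
print(choose_k(S_diag, 0.999))  # needs all 4 components
```

Because the SVD is computed only once, trying several thresholds costs almost nothing, unlike rerunning PCA for every candidate $k$.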
7.2.4 Advice for applying PCA

Supervised learning speedup

Given a dataset : $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots (x^{(m)},y^{(m)})$ , $\; x^{(i)} \in \mathbb{R}^{n}$

Extract the inputs to get an unlabeled dataset $x^{(1)},x^{(2)},\cdots ,x^{(m)}$ .
Apply the PCA algorithm to get $z^{(1)},z^{(2)},\cdots ,z^{(m)}$ .
Pair each $z^{(i)}$ with its original label $y^{(i)}$ .

Finally, we get the new training set : $(z^{(1)},y^{(1)}),(z^{(2)},y^{(2)}),\cdots (z^{(m)},y^{(m)})$ , $\; z^{(i)} \in \mathbb{R}^{k}$

Note :

The mapping $x^{(i)}\rightarrow z^{(i)}$ should be defined by running PCA only on the training set.
This mapping can then be applied to the cross-validation and test sets.
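The note above can be illustrated with a short NumPy sketch: the mean and $U_{\text{reduce}}$ are computed from the training inputs only, and the same mapping is then reused for the other sets. The function names, the synthetic data, and the 60/20/20 split are assumptions made for illustration.

```python
import numpy as np

def fit_pca(X_train, k):
    """Learn the mapping (mean and U_reduce) from the training inputs only."""
    mu = X_train.mean(axis=0)
    X_norm = X_train - mu
    Sigma = (X_norm.T @ X_norm) / X_train.shape[0]
    U, _, _ = np.linalg.svd(Sigma)
    return mu, U[:, :k]

def apply_pca(X, mu, U_reduce):
    """Apply an already-fitted mapping x -> z to any set of inputs."""
    return (X - mu) @ U_reduce

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # synthetic inputs, for illustration only
X_train, X_cv, X_test = X[:60], X[60:80], X[80:]

mu, U_reduce = fit_pca(X_train, k=2)       # fitted on the training set only
Z_train = apply_pca(X_train, mu, U_reduce)
Z_cv = apply_pca(X_cv, mu, U_reduce)       # reuses the training mean and U_reduce
Z_test = apply_pca(X_test, mu, U_reduce)
print(Z_train.shape, Z_cv.shape, Z_test.shape)
```

Recomputing the mean or the SVD on the cross-validation or test inputs would leak information from those sets into the mapping, which is what the note warns against.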

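Bad use of PCA : to prevent overfitting

That is : using $z^{(i)}$ instead of $x^{(i)}$ to reduce the number of features to $k<n$ , in the hope that fewer features make overfitting less likely.

Reason : PCA does not look at the labels $y^{(i)}$ , so it may throw away valuable information. Regularization is a better way to address overfitting.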

Consider machine learning without PCA first

Before implementing PCA, first try running whatever you want to do with the raw/original data $x^{(i)}$ . Only if that does not work well enough should you implement PCA and consider using $z^{(i)}$ .