A Comprehensive Guide to Machine Learning: The k-Nearest Neighbors Algorithm

This content is licensed under CC 4.0 BY-SA.

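This article walks through the k-nearest neighbors (k-NN) algorithm: how it works, its strengths and weaknesses, and where it applies. It implements k-NN in Python, applies it to improving matches on a dating site and to a handwritten digit recognition system, and closes with the algorithm's efficiency problem and a possible optimization.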

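I. Overview of the k-nearest neighbors algorithm

The k-nearest neighbors algorithm classifies data by measuring the distance between feature values. Pros: high accuracy, insensitive to outliers, no assumptions about the input data. Cons: computationally expensive, memory-intensive. Works with: numeric and nominal values.

How it works: we start with a set of sample data, also called the training set, in which every sample carries a label, so we know which class each sample belongs to. When new, unlabeled data arrives, each of its features is compared with the corresponding feature of the samples in the training set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general we only look at the k most similar samples in the training set, which is where the k in k-nearest neighbors comes from; k is usually an integer no greater than 20. Finally, the class that occurs most often among those k samples becomes the class of the new data.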
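The working principle above can be sketched in a few lines. This is a minimal illustration, not the implementation developed later in the article: the function name `knn_predict` and the toy data are assumptions made for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(x, samples, labels, k):
    """Classify x by majority vote among its k nearest training samples."""
    # Euclidean distance from x to every training sample
    distances = np.sqrt(((samples - x) ** 2).sum(axis=1))
    # Indices of the k closest samples
    nearest = distances.argsort()[:k]
    # Count the labels among those neighbors and return the most common one
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per sample, two classes.
samples = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(knn_predict(np.array([0.9, 0.9]), samples, labels, k=3))
print(knn_predict(np.array([0.1, 0.1]), samples, labels, k=3))
```

The first query sits next to the two 'A' samples and the second next to the two 'B' samples, so with k=3 the vote is 2 to 1 in each case.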

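General workflow of the k-nearest neighbors algorithm:

Collect data: any method.
Prepare data: numeric values are needed for the distance calculation; a structured data format is best.
Analyze data: any method.
Train the algorithm: this step does not apply to k-NN.
Test the algorithm: compute the error rate.
Use the algorithm: first feed in sample data and obtain structured output, then run k-NN to determine which class each input belongs to, and finally act on the computed classes.

1. Preparation: importing data with Python

Create a Python module named kNN.py: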

from numpy import * # import the scientific computing package
import operator # import the operator module

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

'''
>>> group
array([[ 1. ,  1.1],
       [ 1. ,  1. ],
       [ 0. ,  0. ],
       [ 0. ,  0.1]])
>>> labels
['A', 'A', 'B', 'B']
>>>
'''

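2. Implementing the kNN algorithm

This function uses the k-nearest neighbors algorithm to assign each data point to a class. In pseudocode, for every point in the dataset whose class is unknown:

Compute the distance between the current point and every point in the labeled dataset;
Sort the distances in increasing order;
Take the k points closest to the current point;
Count how often each class occurs among those k points;
Return the most frequent class among the k points as the predicted class of the current point.

# Four parameters: the input vector to classify; the training sample set; the label vector; k, the number of nearest neighbors to use
def classify0(inX, dataSet, labels, k):
    # shape[0] is the length of the first dimension (rows); shape[1] is the number of columns of a 2-D array
    dataSetSize = dataSet.shape[0] # number of samples in dataSet
    # tile(inX, (dataSetSize, 1)) repeats inX dataSetSize times along the rows and once along the columns
    # so that it can be subtracted from every sample when computing the Euclidean distance
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2 # square every element
    sqDistances = sqDiffMat.sum(axis = 1) # sum the squares in each row
    distances = sqDistances ** 0.5 # Euclidean distance to every point
    # argsort returns the indices that sort distances in ascending order
    sortedDistIndicies = distances.argsort()
    # a dictionary, comparable to map in C++
    classCount = {}
    for i in range(k): # tally the classes of the k nearest neighbors
        voteIlabel = labels[sortedDistIndicies[i]] # label of the i-th nearest neighbor
        # count how often voteIlabel occurs among the k samples; in this example A : 1, B : 2
        # get() returns the value for the key, or the default 0 if the key is absent
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() returns the dictionary's (key, value) pairs
    # sort those pairs in descending order
    # key = operator.itemgetter(1) sorts by the second element of each pair (the count)
    sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)
    # return the most common label among the k samples
    return sortedClassCount[0][0]

# An example
group, labels = createDataSet()
print(classify0([1, 0], group, labels, 3)) # B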

Running the intermediate steps of the code above with inX = [1, 0] gives:

diffMat = tile([1, 0], (dataSetSize, 1)) - dataSet
'''
array([[ 0. , -1.1],
       [ 0. , -1. ],
       [ 1. ,  0. ],
       [ 1. , -0.1]])
'''
sqDiffMat = diffMat ** 2
'''
array([[ 0.  ,  1.21],
       [ 0.  ,  1.  ],
       [ 1.  ,  0.  ],
       [ 1.  ,  0.01]])
'''
sqDistances = sqDiffMat.sum(axis = 1)
'''
 array([ 1.21,  1.  ,  1.  ,  1.01])
'''
distances = sqDistances ** 0.5
'''
array([ 1.1       ,  1.        ,  1.        ,  1.00498756])
'''
sortedDistIndicies = distances.argsort()
'''
array([1, 2, 3, 0], dtype=int64)
'''
sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)
'''
[('B', 2), ('A', 1)]
'''
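The tile call in classify0 builds an explicit copy of the input vector for every row. NumPy broadcasting gives the same difference matrix without that copy. The sketch below checks this on the same four points; the variable names are chosen for this example, and `kind="stable"` is passed to argsort only so that the two points at distance exactly 1.0 keep their original order.

```python
import numpy as np

data_set = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
in_x = np.array([1.0, 0.0])

# Explicit copy of in_x for every row, as classify0 does.
diff_tiled = np.tile(in_x, (data_set.shape[0], 1)) - data_set

# Broadcasting: the 1-D vector is stretched across the rows automatically.
diff_broadcast = in_x - data_set

# Both approaches produce the same difference matrix.
same = np.array_equal(diff_tiled, diff_broadcast)

distances = np.sqrt((diff_broadcast ** 2).sum(axis=1))
order = distances.argsort(kind="stable")

print(same)
print(order)
```

Broadcasting avoids allocating the tiled matrix, which matters more as the training set grows.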

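II. Using k-nearest neighbors to improve matches on a dating site

Helen wants our classifier to do a better job of sorting her matches into the right categories. She has also collected some data that the dating site does not record, which she believes will help categorize her matches.

She has met three kinds of people:

People she does not like
People of average charm
Very charming people

Workflow:

Collect data: a text file is provided.
Prepare data: parse the text file with Python.
Analyze data: draw 2-D scatter plots with Matplotlib.
Train the algorithm: this step does not apply to k-NN.
Test the algorithm: use part of Helen's data as test samples. Test samples differ from the other samples in that they are already classified; whenever the predicted class differs from the actual class, count one error.
Use the algorithm: build a simple command-line program in which Helen enters a few features to find out whether a person is her type.

1. Preparing data: parsing data from a text file

Helen's samples have three features:

Frequent flyer miles earned per year
Percentage of time spent playing video games
Liters of ice cream consumed per week

The collected data is stored in datingTestSet2.txt. In kNN.py, create a function named file2Matrix that converts the file into the format we need.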

def file2Matrix(filename):
    fr = open(filename)  # open the file
    arrayOLines = fr.readlines() # readlines() reads every line and returns a list
    fr.close() # the file is no longer needed once the lines are in memory
    numberOfLines = len(arrayOLines) # number of lines in the file
    returnMat = zeros((numberOfLines, 3)) # numberOfLines x 3 matrix of zeros
    classLabelVector = [] # list of class labels
    index = 0
    for line in arrayOLines: # process every line of the file
        line = line.strip() # strip() removes leading and trailing whitespace (including the newline) by default
        listFromLine = line.split('\t') # split the line on tabs into a list of fields
        # returnMat[index, :] selects only row index; returnMat[index:] would select every row from index onward
        returnMat[index, : ] = listFromLine[0 : 3] # store the first three fields in the feature matrix
        classLabelVector.append(int(listFromLine[-1])) # store the last field as the class label
        index += 1
    return returnMat, classLabelVector

datingDataMat, datingLabels = file2Matrix('datingTestSet2.txt')
print(datingDataMat)
print(datingLabels)
'''
[[  4.09200000e+04   8.32697600e+00   9.53952000e-01]
 [  1.44880000e+04   7.15346900e+00   1.67390400e+00]
 [  2.60520000e+04   1.44187100e+00   8.05124000e-01]
 ...,
 [  2.65750000e+04   1.06501020e+01   8.66627000e-01]
 [  4.81110000e+04   9.13452800e+00   7.28045000e-01]
 [  4.37570000e+04   7.88260100e+00   1.33244600e+00]]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3,
 2, 1, 2, 3, 2, 3, 2, 3, 2, 1, 3, 1, 3, 1, 2, 1, 1, 2, 3, 3, ...
'''
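For a purely numeric, tab-separated file like this one, `numpy.loadtxt` can do the same parsing in one call. The sketch below is an alternative under that assumption, not the method used in this article; to stay self-contained it reads an in-memory stand-in built from the first three rows of the output above rather than the real datingTestSet2.txt.

```python
import io

import numpy as np

# Stand-in for datingTestSet2.txt: three tab-separated rows (features, then label).
sample_file = io.StringIO(
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)

table = np.loadtxt(sample_file, delimiter="\t")  # parses every column as float
features = table[:, :3]                          # first three columns are the features
labels = table[:, 3].astype(int).tolist()        # last column is the class label

print(features.shape)
print(labels)
```

With the real file, the `io.StringIO` object would be replaced by the file name. This only works while every column is numeric; a file with text labels would still need per-line handling.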

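Next, we need to understand what the data actually means. The usual approach is to visualize it.

2. Analyzing data: creating scatter plots with Matplotlib

First, plot the raw data as a scatter plot with Matplotlib by entering the following commands: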

import matplotlib
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[ : ,1], datingDataMat[ : ,2])
plt.show()

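[Figure: scatter plot of the raw data]

The scatter plot uses the second and third columns of the datingDataMat matrix, which hold the features "percentage of time spent playing video games" and "liters of ice cream consumed per week".

Re-enter the scatter call with the class labels controlling marker size and color to get a color-coded plot: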

ax.scatter(datingDataMat[ : ,1], datingDataMat[ : ,2], 15.0 * array(datingLabels), 15.0 * array(datingLabels))

[Figure: scatter plot of video game time against ice cream consumption, colored by class]

ax.scatter(datingDataMat[ : ,0], datingDataMat[ : ,1], 15.0 * array(datingLabels), 15.0 * array(datingLabels))

[Figure: scatter plot of frequent flyer miles against video game time, colored by class]

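3. Preparing data: normalizing values

Features with different value ranges are usually normalized, for example to the range 0 to 1 or -1 to 1; otherwise the feature with the largest values (here, frequent flyer miles) dominates the distance. The formula $newValue = (oldValue - min) / (max - min)$ scales a feature to the range 0 to 1. Add an autoNorm() function to kNN.py that automatically maps the numeric features into the 0 to 1 interval.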

# Based on the formula: newValue = (oldValue - min) / (max - min)
def autoNorm(dataSet):
    # min(0) returns the minimum of each column as a vector; min(1) would return the minimum of each row
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0) # maximum of each column as a vector
    ranges = maxVals - minVals
    m = dataSet.shape[0] # number of rows in the dataset
    normDataSet = dataSet - tile(minVals, (m, 1)) # oldValue - min
    normDataSet = normDataSet / tile(ranges, (m, 1)) # (oldValue - min) / (max - min)
    return normDataSet, ranges, minVals

normMat, ranges, minVals = autoNorm(datingDataMat)
print('Normalized data:\n', normMat)
print('Column ranges (max - min):\n', ranges)
print('Column minimums:\n', minVals)

'''
Normalized data:
 [[ 0.44832535  0.39805139  0.56233353]
 [ 0.15873259  0.34195467  0.98724416]
 [ 0.28542943  0.06892523  0.47449629]
 ...,
 [ 0.29115949  0.50910294  0.51079493]
 [ 0.52711097  0.43665451  0.4290048 ]
 [ 0.47940793  0.3768091   0.78571804]]
Column ranges (max - min):
 [  9.12730000e+04   2.09193490e+01   1.69436100e+00]
Column minimums:
 [ 0.        0.        0.001156]
'''
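One edge case autoNorm does not handle is a column whose values are all identical: its range is 0 and the division fails. A minimal sketch of a guarded variant follows; the name `auto_norm`, the choice to map constant columns to 0, and the toy matrix are all assumptions of this example.

```python
import numpy as np


def auto_norm(data_set):
    """Min-max scale each column to [0, 1]; constant columns map to 0."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Replace zero ranges by 1 so the division is always defined.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    # Broadcasting subtracts and divides column-wise without tile().
    return (data_set - min_vals) / safe_ranges, ranges, min_vals


# Hypothetical data: the third column is constant.
data = np.array([[10.0, 2.0, 5.0],
                 [20.0, 4.0, 5.0],
                 [30.0, 8.0, 5.0]])
norm, ranges, min_vals = auto_norm(data)
print(norm)
```

The first column becomes 0, 0.5, 1, the second 0, 1/3, 1, and the constant third column is all zeros instead of NaN.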

4. Testing the algorithm: validating the classifier as a complete program

Define a function datingClassTest() to measure the classifier's error rate.

def datingClassTest():
    hoRatio = 0.10 # hold out 10% of the data for testing
    # load the features into datingDataMat and the labels into the list datingLabels
    datingDataMat, datingLabels = file2Matrix('datingTestSet2.txt')
    # normalize the features
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0] # number of samples
    numTestVecs = int(m * hoRatio) # number of test samples
    errorCount = 0 # number of misclassifications
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs : m, :],
                                     datingLabels[numTestVecs : m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if(classifierResult != datingLabels[i]):
            errorCount += 1.0
    print('the total error rate is: ', errorCount / float(numTestVecs))

datingClassTest()
'''
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
......
......
the total error rate is:  0.05
'''

As the output shows, the kNN error rate is 0.05, that is, 5%.
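datingClassTest() uses the first 10% of the rows as the test set, which only gives a fair estimate if the file is already in random order. A hedged sketch of a shuffled hold-out follows. The helper names `knn_label` and `holdout_error`, the fixed seeds, and the synthetic two-cluster data are assumptions of this example and not part of kNN.py.

```python
import numpy as np


def knn_label(x, samples, labels, k=3):
    """Majority label among the k nearest samples (squared Euclidean distance)."""
    nearest = ((samples - x) ** 2).sum(axis=1).argsort()[:k]
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[counts.argmax()]


def holdout_error(samples, labels, ho_ratio=0.10, seed=0):
    """Shuffle the rows, hold out ho_ratio of them for testing, return the error rate."""
    order = np.random.default_rng(seed).permutation(len(samples))
    samples, labels = samples[order], labels[order]
    n_test = int(len(samples) * ho_ratio)
    errors = sum(
        knn_label(samples[i], samples[n_test:], labels[n_test:]) != labels[i]
        for i in range(n_test)
    )
    return errors / n_test


# Synthetic, already-normalized data: two well-separated clusters, sorted by class.
rng = np.random.default_rng(1)
samples = np.vstack([rng.random((50, 3)) * 0.2, rng.random((50, 3)) * 0.2 + 0.8])
labels = np.array([1] * 50 + [2] * 50)

error_rate = holdout_error(samples, labels)
print(error_rate)
```

Because the synthetic rows are sorted by class, an unshuffled split would test on class 1 only; shuffling first puts both classes in the test set.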

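5. Using the algorithm: building a complete, usable system

Now use the classifier above to classify people for Helen from their features.

def classifyPerson():
    resultList = ['someone she does not like', 'someone of average charm', 'someone very charming']
    percentTats = float(input("Percentage of time spent playing video games? "))
    ffMiles = float(input("Frequent flyer miles earned per year? "))
    iceCream = float(input("Liters of ice cream consumed per week? "))
    datingDataMat, datingLabels = file2Matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # the sample to classify must be normalized as well
    classifierResult = classify0((inArr - minVals)/ranges, normMat, datingLabels, 3)
    print("This person is: {}".format(resultList[classifierResult - 1]))

classifyPerson()
'''
Percentage of time spent playing video games? 10
Frequent flyer miles earned per year? 40000
Liters of ice cream consumed per week? 0.83
This person is: someone very charming
'''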

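III. Example: a handwriting recognition system

The digits to recognize have already been processed with image software so that they share the same color and size: black-and-white images 32 pixels wide by 32 pixels high.

A handwriting recognition system using k-NN:

Collect data: text files are provided.
Prepare data: write a function img2vector() that converts the image format into the vector format the classifier uses.
Analyze data: inspect the data at the Python prompt to make sure it meets the requirements.
Train the algorithm: this step does not apply to k-NN.
Test the algorithm: write a function that uses part of the provided dataset as test samples. Test samples differ from the other samples in that they are already classified; whenever the predicted class differs from the actual class, count one error.
Use the algorithm: not done in this example. If you are interested, you can build a complete application that extracts digits from images and recognizes them; the US mail sorting system is a real-world system of this kind.

1. Preparing data: converting images to test vectors

# Convert a 32 * 32 pixel image into a 1 * 1024 vector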
def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

testVector = img2vector('testDigits/0_13.txt')
print(testVector[0, 0 : 31])
print(testVector[0, 32 : 63])
'''
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]
'''
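The nested loops in img2vector can also be written as one list comprehension followed by a reshape. The sketch below is an equivalent formulation under stated assumptions: the function name `text_image_to_vector` is made up here, and the input is a hypothetical in-memory 32 x 32 image instead of a file from testDigits.

```python
import numpy as np


def text_image_to_vector(lines, size=32):
    """Flatten the first `size` characters of the first `size` lines into a 1 x size*size vector."""
    rows = [[int(ch) for ch in line[:size]] for line in lines[:size]]
    return np.array(rows, dtype=float).reshape(1, size * size)


# Hypothetical 32 x 32 image: all zeros except a vertical stroke in column 16.
lines = ["0" * 16 + "1" + "0" * 15 + "\n" for _ in range(32)]
vector = text_image_to_vector(lines)

print(vector.shape)
print(vector[0, 16], vector[0, 32 + 16])
```

Row i of the image occupies positions 32 * i to 32 * i + 31 of the vector, the same layout that img2vector produces.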

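2. Testing the algorithm: recognizing handwritten digits with k-NN

Approach: read the training samples, convert each text image into a 1 * 1024 vector, and record the true digit each file represents as its label. Read the test samples the same way. Then use classify0() to find the k nearest neighbors of each test sample, and compute the error rate.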

from os import listdir

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits') # list the files in the training folder
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024)) # holds the training samples
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0] # the part before the '.'
        classNumStr = int(fileStr.split('_')[0]) # the part before the '_'
        hwLabels.append(classNumStr) # the true digit this image represents
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print('the classifier came back with:', classifierResult, 'the real answer is:', classNumStr)
        if(classifierResult != classNumStr):
            errorCount += 1.0
    print('total number of errors:', errorCount)
    print('total error rate:', errorCount / float(mTest))

handwritingClassTest()
'''
.......
.......
the classifier came back with: 9 the real answer is: 9
the classifier came back with: 9 the real answer is: 9
the classifier came back with: 9 the real answer is: 9
the classifier came back with: 9 the real answer is: 9
the classifier came back with: 9 the real answer is: 9
the classifier came back with: 9 the real answer is: 9
the classifier came back with: 9 the real answer is: 9
the classifier came back with: 9 the real answer is: 9
total number of errors: 11.0
total error rate: 0.011627906976744186
'''

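In practice this algorithm is slow: every distance computation covers 1024 features, and each test vector must be compared against every training vector, of which there are many. The next chapter uses the k-d tree approach, an optimization of k-nearest neighbors search that can save a great deal of computation.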
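To give a rough idea of how a k-d tree avoids comparing a query against every stored point, here is a minimal sketch. It is not code from this article: it finds only the single nearest neighbor, stores the tree as nested dictionaries, and uses random 2-D points, all of which are assumptions made for illustration.

```python
import numpy as np


def build_kd_tree(points, depth=0):
    """Recursively build a k-d tree as nested dicts; returns None for an empty set."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]            # cycle through the dimensions
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2                    # the median point becomes this node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return (squared distance, point) of the stored point closest to target."""
    if node is None:
        return best
    d2 = float(((node["point"] - target) ** 2).sum())
    if best is None or d2 < best[0]:
        best = (d2, node["point"])
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the best hit so far.
    if diff ** 2 < best[0]:
        best = nearest(far, target, best)
    return best


# Hypothetical data: 200 random 2-D points and one query.
rng = np.random.default_rng(0)
points = rng.random((200, 2))
tree = build_kd_tree(points)
dist2, neighbor = nearest(tree, np.array([0.5, 0.5]))
print(dist2, neighbor)
```

The pruning test on the splitting plane is what saves work: whole subtrees are skipped when they cannot contain a closer point. The saving is largest in low dimensions and shrinks as the number of features grows, so for 1024-dimensional digit vectors the gain may be limited.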

to be continued…

The complete code follows:

"""
Created on 2020-6-29
kNN: k Nearest Neighbors
@author: bernardo
"""

from numpy import *
import operator
from os import listdir


def createDataSet():
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels


def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()     
    classCount={}          
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def file2matrix(filename):
    love_dictionary={'largeDoses':3, 'smallDoses':2, 'didntLike':1}
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)            #get the number of lines in the file
    returnMat = zeros((numberOfLines,3))        #prepare matrix to return
    classLabelVector = []                       #prepare labels return   
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index,:] = listFromLine[0:3]
        if(listFromLine[-1].isdigit()):
            classLabelVector.append(int(listFromLine[-1]))
        else:
            classLabelVector.append(love_dictionary.get(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector

    
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m,1))
    normDataSet = normDataSet/tile(ranges, (m,1))   #element wise divide
    return normDataSet, ranges, minVals
   
def datingClassTest():
    hoRatio = 0.10      #hold out 10%
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')       #load data setfrom file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        print("the classifier came back with: {}, the real answer is: {}".format(classifierResult, datingLabels[i]))
        if (classifierResult != datingLabels[i]): errorCount += 1.0
    print("the total error rate is:{}".format(errorCount/float(numTestVecs)))
    print(errorCount)
    
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input(\
                                  "percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per week?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream, ])
    classifierResult = classify0((inArr - \
                                  minVals)/ranges, normMat, datingLabels, 3)
    print("You will probably like this person: {}".format(resultList[classifierResult - 1]))
    
def img2vector(filename):
    returnVect = zeros((1,1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0,32*i+j] = int(lineStr[j])
    return returnVect

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('digits/trainingDigits')           #load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]     #take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('digits/trainingDigits/{}'.format(fileNameStr))
    testFileList = listdir('digits/testDigits')        #iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]     #take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/{}'.format(fileNameStr))
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: {}, the real answer is: {}".format(classifierResult, classNumStr))
        if (classifierResult != classNumStr): errorCount += 1.0
    print("\nthe total number of errors is: {}".format(errorCount))
    print("\nthe total error rate is: {}".format(errorCount/float(mTest)))