Kaggle Practice Competition (1): Digit Recognizer

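This post walks through recognizing handwritten digits with the k-nearest neighbors (KNN) algorithm, implemented in Python as a first attempt at Kaggle's handwritten digit dataset. The final score was 0.96.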

Problem

Recognize handwritten digits.

Approach

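This is the first Kaggle problem I have worked on: recognizing handwritten digits. Each digit is a 28*28 image stored as a 784-element vector. I ran a plain KNN with k = 3 and Euclidean distance as the metric. The final score was 0.96.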

def knn(inX,num):
    # trainMat, labelList and outFile are module-level globals set by the main script
    dataSet = trainMat
    labels = labelList
    k = 3

    # Euclidean distance from inX to every training row
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX,(dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sumDiffMat = sqDiffMat.sum(axis=1)
    distances = sumDiffMat**0.5
    sortedDistances = distances.argsort()

    # count the labels of the k nearest neighbors
    classCount = {}
    for i in range(k):
        vote = labels[sortedDistances[i]]
        classCount[vote] = classCount.get(vote,0) + 1

    # majority vote; on a tie the label counted first wins
    # (dicts keep insertion order in Python 3.7+)
    maxCount = 0
    ans = ''
    for label,count in classCount.items():
        if count > maxCount:
            ans = label
            maxCount = count
    print(str(num+1) + ' = ' + ans)
    outFile.write(str(num+1) + ',' + ans + '\n')
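
The distance-and-vote logic can be tried in isolation on toy data. What follows is a minimal sketch, not the code used for the submission: knn_predict, toy_train and toy_labels are hypothetical names, NumPy broadcasting stands in for tile(), and collections.Counter stands in for the manual vote count.

```python
from collections import Counter

import numpy as np


def knn_predict(x, train, labels, k=3):
    """Return the majority label among the k training rows closest to x."""
    # Broadcasting subtracts x from every row, so no explicit tile() is needed.
    distances = np.sqrt(((train - x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: four 2-D points standing in for the 784-element pixel vectors.
toy_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
toy_labels = ['0', '0', '1', '1']

prediction = knn_predict(np.array([0.2, 0.4]), toy_train, toy_labels)
print(prediction)  # two of the three nearest points carry the label '0'
```

With k = 3 and two classes a vote can never be split evenly, but with ten digit classes a 1-1-1 split is possible, which is why the tie-breaking rule matters on the real data.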

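I will come back and optimize this once I have learned more.
On the implementation side, I used a Python thread pool (multiprocessing.dummy) to process the test samples, which saved some time.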

from multiprocessing.dummy import Pool

outFile = open("out2.csv",'w')
pool = Pool()
# one knn call per test row; starmap unpacks each (row, index) pair
pool.starmap(knn,zip(testMat,range(n)))
pool.close()
pool.join()
outFile.close()
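
Because every thread writes to the same file, the rows land in whatever order the threads finish. One possible alternative, sketched here under assumptions, is to have the worker return its result and write everything from the main thread, since starmap returns results in input order. classify and write_predictions are hypothetical names, classify is a dummy stand-in for knn, and the ImageId,Label header is assumed from Kaggle's sample submission file.

```python
import csv
import os
import tempfile
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API


def classify(sample, index):
    """Hypothetical stand-in for knn(): returns (ImageId, label) instead of writing."""
    return index + 1, str(sum(sample) % 10)


def write_predictions(samples, path):
    """Classify samples on a thread pool, then write the rows from the main thread."""
    with Pool() as pool:
        # starmap returns results in input order, whichever thread finishes first.
        rows = pool.starmap(classify, zip(samples, range(len(samples))))
    with open(path, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(['ImageId', 'Label'])  # header is an assumption, see above
        writer.writerows(rows)
    return rows


# Small demonstration in a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_out.csv')
    demo_rows = write_predictions([[1, 2, 3], [4, 5, 6]], demo_path)
    with open(demo_path) as f:
        demo_text = f.read()
print(demo_rows)
```

Keeping the file handle out of the worker also means the function no longer depends on a global outFile.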

Code

from numpy import tile, zeros
import csv
from multiprocessing.dummy import Pool

# not used below: Pool.starmap already unpacks each argument tuple
def knn_warp(args):
    return knn(*args)

def knn(inX,num):
    # trainMat, labelList and outFile are module-level globals set by the main script
    dataSet = trainMat
    labels = labelList
    k = 3

    # Euclidean distance from inX to every training row
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX,(dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sumDiffMat = sqDiffMat.sum(axis=1)
    distances = sumDiffMat**0.5
    sortedDistances = distances.argsort()

    # count the labels of the k nearest neighbors
    classCount = {}
    for i in range(k):
        vote = labels[sortedDistances[i]]
        classCount[vote] = classCount.get(vote,0) + 1

    # majority vote; on a tie the label counted first wins
    # (dicts keep insertion order in Python 3.7+)
    maxCount = 0
    ans = ''
    for label,count in classCount.items():
        if count > maxCount:
            ans = label
            maxCount = count
    print(str(num+1) + ' = ' + ans)
    outFile.write(str(num+1) + ',' + ans + '\n')

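def readTrain(row,i):
    # store the label (kept as a string) and the 784 pixel values of training row i
    labelList[i] = row['label']

    for x in range(0, 784):
        trainMat[i, x] = int(row['pixel' + str(x)])
    print(str(i))

def readTest(row,i):
    # only the pixel columns pixel0..pixel783 are read for test row i
    for x in range(0, 784):
        testMat[i, x] = int(row['pixel' + str(x)])
    print(str(i))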

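# labelList, trainMat, testMat and outFile are module-level names assigned below;
# a global statement is only needed inside a function, so none is required here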

if __name__ == '__main__':

    # count the training rows (minus the header line) to size the containers
    with open('train.csv') as f:
        m = len(f.readlines()) - 1
    labelList = list(range(m))
    trainMat = zeros((m,784))

    # fill labelList and trainMat, one readTrain call per CSV row
    with open('train.csv') as f:
        f_csv = csv.DictReader(f)
        pool = Pool()
        pool.starmap(readTrain, zip(f_csv, range(m)))
        pool.close()
        pool.join()

    # same two passes for the test set
    with open('test.csv') as f:
        n = len(f.readlines()) - 1
    testMat = zeros((n,784))

    with open('test.csv') as f:
        f_csv = csv.DictReader(f)
        pool = Pool()
        pool.starmap(readTest, zip(f_csv, range(n)))
        pool.close()
        pool.join()

    # classify every test row; knn writes "id,label" lines to outFile
    outFile = open("out2.csv",'w')
    pool = Pool()
    pool.starmap(knn,zip(testMat,range(n)))
    pool.close()
    pool.join()
    outFile.close()
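
The listing fills the matrices one pixel at a time. As a possible simplification, NumPy can parse a whole CSV in one call. This is only a sketch under assumptions: load_train is a hypothetical helper, the label is assumed to be the first column with the pixel columns after it, and the labels come back as integers rather than strings, so they would need str() before being written out.

```python
import io

import numpy as np


def load_train(source):
    """Return (labels, pixels) from a CSV whose first column is the label."""
    # skiprows=1 drops the header line; ndmin=2 keeps a single-row file 2-D.
    data = np.loadtxt(source, delimiter=',', skiprows=1, dtype=int, ndmin=2)
    return data[:, 0], data[:, 1:]


# Tiny in-memory stand-in for train.csv with three pixel columns instead of 784.
demo_csv = io.StringIO("label,pixel0,pixel1,pixel2\n7,0,255,3\n1,9,0,0\n")
demo_labels, demo_pixels = load_train(demo_csv)
print(demo_labels, demo_pixels.shape)
```

np.loadtxt also accepts a file name, so load_train('train.csv') would read the real file directly under the same column-order assumption.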

