Machine Learning-Programming Exercise 1: Linear Regression

数据描述

        数据:房屋面积+卧室个数+房价。其中,房屋面积和卧室个数位自变量,房价为因变量。数据下载链接

导入python包

import numpy as np
import matplotlib
import matplotlib.pyplot as plt


子函数描述

#将文本记录转化为Numpy
#filename为文件名,k为属性维数
def file2matrix(filename,k):
    fr = open(filename)
    numberOfLines = len(fr.readlines())         #get the number of lines in the file
    returnMat = np.zeros((numberOfLines,k))        #prepare matrix to return
    classLabelVector = np.zeros((numberOfLines,1))                #prepare labels return   
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split(',')
        returnMat[index,:] = listFromLine[0:k]
        classLabelVector[index,:] = float(listFromLine[-1])
        index += 1
    return returnMat,classLabelVector

#特征标准化
def featureNormalize(X):
    X_norm = np.array(X)
    #mu = np.zeros((1,X.shape[1]))
    #sigma = np.zeros((1, X.shape[1]))
    mu = np.mean(X, axis = 0)#对每列求均值
    sigma = np.std(X, axis = 0)#对每列求标准差
    X_norm = (X_norm - np.tile(mu, (X.shape[0],1)))/np.tile(sigma, (X.shape[0],1))
    return X_norm, mu, sigma

# 损失函数
def computeCostMulti(X, y, theta):
    m = len(y)
    J = np.sum((y - np.dot(X, theta))**2) / (2 * m)
    return J

#梯度下降
def gradientDescentMulti(X, y, theta, alpha, numb_iters):
    # Initialize some useful values
    m = len(y) # number of training examples
    J_history = []
    
    for i in range(numb_iters):
        theta -= alpha / m * np.dot(X.T, (np.dot(X, theta) - y))
        J_history.append(computeCostMulti(X, y, theta))
    return theta, J_history
       
# Normal Equation
# theta = (X'X)^(-1)X'y
def normalEqn(X, y):
    from numpy.linalg import inv
    return np.dot(np.dot(inv(np.dot(X.T, X)), X.T), y)


读入数据

returnMat, LabelVector = file2matrix('ex1data2.txt',2)

          这里,我们对文件进行读取。首先,open文件并读取其行数,定义存放特征数据和预测数据的变量(这里,我们使用的类型是np.array。其实我们也可以使用np.mat,都是大同小异)。然后,利用strip移除字符串头尾指定的字符(默认为空格或换行符)或字符序列,利用split(',')指定分隔符(这里为',')对字符串进行切片,将前两列存入特征数据中,最后一列存入预测数据中。这里,如果预测数据是英文字母,我们还需要对其进行类型的转换。returnMat, LabelVector分别为存放特征数据和预测数据的变量。


特征标准化

# Part 1: Feature Normalization
X, mu, sigma = featureNormalize(returnMat)


        我们对returnMat的数据求每一列的均值和标准差,这里使用到的函数为numpy自带的np.mean和np.std,具体使用详情可以查阅官方文件。然后使用np.tile对变量内容复制成输入矩阵同大小的矩阵。最后,(returnMat-mean)/std为标准化之后的结果,这里的‘/’是点除。

输入数据扩充

# Add intercept term to X
X = np.column_stack((np.ones((X.shape[0],1)), X))


        我们在特征数据的第一列插入一列1矩阵,方便与之后delta的常数项系数相乘。

梯度下降

# Part 2: Gradient Descent
# Choose some alpha value
alpha = 0.01
num_iters = 400
# Init Theta and Run Gradient Descent 
theta = np.zeros((X.shape[1],1))
theta, J_history = gradientDescentMulti(X, LabelVector, theta, alpha, num_iters)



梯度下降法的公式参见下图


        由公式可以知道,对于常数求偏导之后不会乘x。所以我们将x转置,第一列常数项全部为1,任何数乘以1都是其本身。而且利用x转置乘以误差项也可以达到将所有样例求和的目的。

theta -= alpha / m * np.dot(X.T, (np.dot(X, theta) - y))


迭代误差结果图

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(range(len(J_history)), J_history)
plt.show()


预测

# 预测
# Estimate the price of a 1650 sq-ft, 3 br house
# Recall that the first column of X is all-ones. Thus, it does not need to be normalized.

price = np.dot(np.column_stack((np.array([[1]]), ((np.array([[1650, 3]]) - mu) / sigma))), theta)

print("Predicted price of a 1650 sq-ft, 3 br house is $", price[0][0])


        结果:Predicted price of a 1650 sq-ft, 3 br house is $ 289221.547371

       这里,我们需要注意的是数据刚开始就标准化,所以这里的数据也需要进行标准化。


Normal Equations

# Part 3: Normal Equations
X, y = file2matrix('ex1data2.txt',2)
# Add intercept term to X
X = np.column_stack((np.ones((X.shape[0],1)), X))
theta1 = normalEqn(X, LabelVector)


# 预测
# Estimate the price of a 1650 sq-ft, 3 br house
# Recall that the first column of X is all-ones. Thus, it does not need to be normalized.

price1 = np.dot(np.column_stack((np.array([[1]]), np.array([[1650, 3]]))), theta1)

print("Predicted price of a 1650 sq-ft, 3 br house is $", price1[0][0])




最小二乘法的delta用矩阵的形式表示为:


结果:Predicted price of a 1650 sq-ft, 3 br house is $ 293081.464335。这里,因为刚开始没有对数据进行标准化处理,所以在预测时候也不需要进行标准化处理。


Programming Exercise 1: Linear Regression Machine Learning Introduction In this exercise, you will implement linear regression and get to see it work on data. Before starting on this programming exercise, we strongly recom- mend watching the video lectures and completing the review questions for the associated topics. To get started with the exercise, you will need to download the starter code and unzip its contents to the directory where you wish to complete the exercise. If needed, use the cd command in Octave/MATLAB to change to this directory before starting this exercise. You can also find instructions for installing Octave/MATLAB in the “En- vironment Setup Instructions” of the course website. Files included in this exercise ex1.m - Octave/MATLAB script that steps you through the exercise ex1 multi.m - Octave/MATLAB script for the later parts of the exercise ex1data1.txt - Dataset for linear regression with one variable ex1data2.txt - Dataset for linear regression with multiple variables submit.m - Submission script that sends your solutions to our servers [?] warmUpExercise.m - Simple example function in Octave/MATLAB [?] plotData.m - Function to display the dataset [?] computeCost.m - Function to compute the cost of linear regression [?] gradientDescent.m - Function to run gradient descent [†] computeCostMulti.m - Cost function for multiple variables [†] gradientDescentMulti.m - Gradient descent for multiple variables [†] featureNormalize.m - Function to normalize features [†] normalEqn.m - Function to compute the normal equations ? indicates files you will need to complete † indicates optional exercises
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值