[Case Study] Boston Houseprice

Original post · first published 2023-12-10 16:44:38 · last modified 2023-12-10 17:00:22

Licensed under CC 4.0 BY-SA.

Tags: #machine-learning #pandas #python #numpy #matplotlib

Contents

1. Importing libraries and data
2. Feature processing
3. Modeling and prediction

1. Importing libraries and data

1.1 Importing libraries

# Core libraries
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
from scipy import stats
from scipy.stats import norm

# Machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import sklearn

1.2 Suppressing warnings

# Silence warning messages through the warnings filter
import warnings
warnings.filterwarnings('ignore')

1.3 Loading the CSV data

# Load the training and test sets, using the Id column as the index
train = pd.read_csv(r'Desktop\train.csv', index_col="Id")
test = pd.read_csv(r'Desktop\test.csv', index_col="Id")
print(train.shape)
print(test.shape)

(1460, 80)
(1459, 79)

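2. Feature processing

Steps:

Inspect the makeup of each feature
Fill in missing values
Inspect the distribution of each feature
Transform skewed features toward normality
Construct new features
Drop features that are not needed
Remove outliers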
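
The first two steps (inspecting each feature and filling missing values) usually start from a per-column count of missing values. A minimal sketch, assuming a pandas DataFrame; `missing_summary` is a hypothetical helper that is not part of the original notebook, and the toy frame only stands in for train.csv.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and share of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "share": counts / len(df),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)

# Toy data standing in for the real training set (values are made up)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})
report = missing_summary(toy)
print(report)
```

Columns near the top of such a report are candidates for dropping or for an explicit fill value; columns that do not appear need no imputation.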

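2.1 Feature meanings

House prices depend on many factors. The competition data provides a wide range of information about each house together with its final sale price.
We use this data to train a model and then predict the prices of other houses.

Our group's working definitions of the features: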

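Feature	Meaning	Feature	Meaning
SalePrice	Sale price (target)	MSSubClass	Building class
MSZoning	General zoning classification	LotFrontage	Linear feet of street connected to the property
LotArea	Lot size	Street	Type of road access
Alley	Type of alley access	LotShape	General shape of the property
LandContour	Flatness of the property	Utilities	Utilities available
LotConfig	Lot configuration	LandSlope	Slope of the property
Neighborhood	Location within the city	Condition1	Proximity to a main road or railroad
Condition2	Proximity to a main road or railroad (if a second is present)	BldgType	Type of dwelling
HouseStyle	Style of dwelling	OverallQual	Overall material and finish quality
OverallCond	Overall condition rating	YearBuilt	Original construction year
YearRemodAdd	Remodel date	RoofStyle	Type of roof
RoofMatl	Roof material	Exterior1st	Exterior covering
Exterior2nd	Exterior covering (if more than one material)	MasVnrType	Masonry veneer type
MasVnrArea	Masonry veneer area	ExterQual	Exterior material quality
ExterCond	Condition of the exterior material	Foundation	Type of foundation
BsmtQual	Height of the basement	BsmtCond	General condition of the basement
BsmtExposure	Walkout or garden-level basement walls	BsmtFinType1	Quality of the basement finished area
BsmtFinSF1	Type 1 finished area	BsmtFinType2	Quality of the second finished area (if present)
BsmtFinSF2	Type 2 finished area	BsmtUnfSF	Unfinished basement area
TotalBsmtSF	Total basement area	Heating	Type of heating
HeatingQC	Heating quality and condition	CentralAir	Central air conditioning
Electrical	Electrical system	1stFlrSF	First-floor area
2ndFlrSF	Second-floor area	LowQualFinSF	Low-quality finished area
GrLivArea	Above-ground living area	BsmtFullBath	Basement full bathrooms
BsmtHalfBath	Basement half bathrooms	FullBath	Full bathrooms above ground
HalfBath	Half bathrooms above ground	BedroomAbvGr	Bedrooms above ground
KitchenAbvGr	Number of kitchens	KitchenQual	Kitchen quality
TotRmsAbvGrd	Total rooms above ground	Functional	Home functionality rating
Fireplaces	Number of fireplaces	FireplaceQu	Fireplace quality
GarageType	Garage location	GarageYrBlt	Year the garage was built
GarageFinish	Interior finish of the garage	GarageCars	Garage capacity in cars
GarageArea	Garage area	GarageQual	Garage quality
GarageCond	Garage condition	PavedDrive	Paved driveway
WoodDeckSF	Wood deck area	OpenPorchSF	Open porch area
EnclosedPorch	Enclosed porch area	3SsnPorch	Three-season porch area
ScreenPorch	Screen porch area	PoolArea	Pool area
PoolQC	Pool quality	Fence	Fence quality
MiscFeature	Miscellaneous feature	MiscVal	Value of the miscellaneous feature
MoSold	Month sold	YrSold	Year sold
SaleType	Type of sale	SaleCondition	Condition of sale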

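2.2 A first look at the target and the features

2.2.1 Sale price

# SalePrice: sale price (the target)

train['SalePrice'].value_counts()   # values and their frequencies
train['SalePrice'].isnull().sum()   # check for missing values
train['SalePrice'] = np.log1p(train['SalePrice'])   # the price is skewed, so work with log(1 + price)
sns.distplot(train['SalePrice'], fit=norm)   # histogram, KDE and fitted normal curve (distplot is deprecated since seaborn 0.11)

# Probability plot: compare the target with a theoretical normal distribution
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)

[Figure: histogram of log-transformed SalePrice with fitted normal curve, and its probability plot]

A normal distribution is fitted to the target column, giving the normal curve closest to the data.
The probability plot compares the theoretical normal distribution with the observed one; after the log transform the target is close to normal.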
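
The effect of the log transform can also be checked numerically rather than only by eye. The sketch below uses synthetic log-normal "prices" (an assumption, not the competition data) to compare skewness before and after `np.log1p`, and shows that `np.expm1` undoes the transform, which is needed to turn log-scale predictions back into prices.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" standing in for SalePrice
prices = np.exp(rng.normal(loc=12.0, scale=0.4, size=2000))

log_prices = np.log1p(prices)

skew_raw = skew(prices)
skew_log = skew(log_prices)
print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# A model trained on log1p(price) predicts on the log scale;
# np.expm1 is the inverse of np.log1p
recovered = np.expm1(log_prices)
```

A skewness close to zero after the transform is consistent with the near-straight probability plot described above.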

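2.2.2 Total area

# totalArea: combined area (above-ground living area + garage + basement)

train['totalArea'] = train['GrLivArea'] + train['GarageArea'] + train['TotalBsmtSF']
train = train.drop(train[train['totalArea'] > 8000].index)   # drop extreme outliers
train['totalArea'] = np.log1p(train['totalArea'])

# Histogram with KDE and fitted normal curve
sns.distplot(train['totalArea'], fit=norm)
# Probability plot against the normal distribution
fig = plt.figure()
res = stats.probplot(train['totalArea'], plot=plt)
# Relationship between total area and price
fig = plt.figure()
plt.plot(train['totalArea'],train['SalePrice'],'o')

# test: fill missing values, then build the same feature
test['GrLivArea'] = test['GrLivArea'].fillna(0)
test['GarageArea'] = test['GarageArea'].fillna(0)
test['TotalBsmtSF'] = test['TotalBsmtSF'].fillna(0)
test['totalArea'] = test['GrLivArea'] + test['GarageArea'] + test['TotalBsmtSF']
test['totalArea'] = np.log1p(test['totalArea'])

[Figure: histogram, probability plot and price scatter plot for log-transformed totalArea]

A normal distribution is fitted to this column, giving the normal curve closest to the data.
The probability plot compares the theoretical normal distribution with the observed one; the log-transformed total area is close to normal.
Total area and price are clearly strongly linearly related.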

2.2.3 Sale condition

# SaleCondition: condition of sale

print(train['SaleCondition'].value_counts())
# train['SaleCondition'].isnull().sum()

# Drop very large houses
train = train.drop(train[train['GrLivArea'] > 4000].index)

# Scatter plot, one series per sale condition
# (drawn before filtering, otherwise only the 'Normal' series would contain points)
for cond in ['Normal', 'Partial', 'Abnorml', 'Family', 'Alloca', 'AdjLand']:
    mask = train['SaleCondition'] == cond
    plt.plot(train['GrLivArea'][mask], train['SalePrice'][mask], 'o')

# Keep only normal sales for the rest of the analysis
train = train[train['SaleCondition'] == 'Normal']

Normal     1198
Partial     123
Abnorml     101
Family       20
Alloca       12
AdjLand       4
Name: SaleCondition, dtype: int64





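[Figure: GrLivArea against log SalePrice, one series per sale condition]

The scatter plot shows a clear trend: living area and price are strongly linearly related.
Across the six sale conditions, above-ground living area and sale price are roughly linear within each condition.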
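
The "roughly linear within each condition" reading of the scatter plot can be backed by a per-group correlation. A minimal sketch on synthetic data (two made-up sale conditions sharing one slope, not the real training set), using pandas `groupby` and `Series.corr`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
# Synthetic data: two sale conditions, price on a log-like scale
toy = pd.DataFrame({
    "SaleCondition": rng.choice(["Normal", "Partial"], size=n),
    "GrLivArea": rng.uniform(600, 3000, size=n),
})
toy["SalePrice"] = 11.0 + 0.0004 * toy["GrLivArea"] + rng.normal(0, 0.1, size=n)

# Pearson correlation between area and price within each condition
per_group_r = {
    cond: group["GrLivArea"].corr(group["SalePrice"])
    for cond, group in toy.groupby("SaleCondition")
}
print(per_group_r)
```

Groups with only a handful of rows (such as AdjLand with 4) give unstable correlations, so the number is best read together with the group size.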

2.2.4 Zoning

# MSZoning: general zoning classification

# Inspect the value distribution and drop a few outliers
print(train['MSZoning'].value_counts())
train['MSZoning'].isnull().sum()
train = train.drop(train[(train['GrLivArea'] > 3000) & (train['MSZoning'] == 'RM')].index)
train = train.drop(train[(train['GrLivArea'] > 3000) & (train['MSZoning'] == 'RH')].index)

# Scatter plot, one series per zone
for zone in ['RL', 'RM', 'FV', 'RH', 'C (all)']:
    mask = train['MSZoning'] == zone
    plt.plot(train['GrLivArea'][mask], train['SalePrice'][mask], 'o')

# test: fill missing values with the most common zone, RL
test['MSZoning'] = test['MSZoning'].fillna('RL')

RL         954
RM         189
FV          39
RH          11
C (all)      4
Name: MSZoning, dtype: int64

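[Figure: GrLivArea against log SalePrice, one series per zoning category]

The scatter plot shows that the price distribution differs visibly between zoning categories.
Within each of the five zoning categories, above-ground living area and sale price are roughly linearly related.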
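
A per-category summary table is a compact way to check whether price levels really differ between zones. The sketch below uses made-up log prices for three zones; only the `groupby(...).agg(...)` pattern is the point.

```python
import pandas as pd

toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", "RM", "FV", "FV"],
    "SalePrice": [12.2, 12.4, 11.6, 11.8, 12.3, 12.5],  # made-up log prices
})

# Count, median and spread of the price per zone, highest median first
summary = (
    toy.groupby("MSZoning")["SalePrice"]
       .agg(["count", "median", "std"])
       .sort_values("median", ascending=False)
)
print(summary)
```

Medians that differ by more than the within-group spread support the visual impression; categories with very few rows, such as C (all) with 4, deserve less weight.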

2.2.5 Lot frontage

# LotFrontage: linear feet of street connected to the property

# Drop outliers and fill missing values with the mean
# print(train['LotFrontage'].value_counts().sort_index())
train = train.drop(train[train['LotFrontage'] > 300].index)
train['LotFrontage'] = train['LotFrontage'].fillna(train['LotFrontage'].mean())
# Flag very short frontages (direct assignment avoids chained indexing)
train['not_LotFrontage'] = (train['LotFrontage'] < 25).astype(float)
index = train['LotFrontage'] > 25

# Log-transform and plot
train['LotFrontage'] = np.log1p(train['LotFrontage'])
# train['LotFrontage'].isnull().sum()
sns.distplot(train['LotFrontage'], fit=norm)
fig = plt.figure()
res = stats.probplot(train['LotFrontage'][index], plot=plt)

# test: same treatment
test['LotFrontage'] = test['LotFrontage'].fillna(test['LotFrontage'].mean())
test['not_LotFrontage'] = (test['LotFrontage'] < 25).astype(float)
test['LotFrontage'] = np.log1p(test['LotFrontage'])

[Figure: histogram with fitted normal curve and probability plot for log-transformed LotFrontage]

Even after the log transform, the fitted normal curve does not match the data well.
In the probability plot, the points below -1 and above 2 on the theoretical-quantile axis deviate from the normal line.
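
How far a probability plot departs from a straight line can be summarised with the correlation coefficient that `scipy.stats.probplot` returns alongside the fitted line. A small sketch on synthetic samples (not the LotFrontage column itself):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
normal_like = rng.normal(size=1000)
skewed = rng.exponential(size=1000)

# Without plot=..., probplot returns ((theoretical, ordered), (slope, intercept, r))
(_, _), (_, _, r_normal) = stats.probplot(normal_like, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed, dist="norm")

print(f"r (normal sample) = {r_normal:.3f}, r (skewed sample) = {r_skewed:.3f}")
```

An r close to 1 means the points hug the line; a visibly lower value matches the tail deviations described above.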

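2.2.6 Lot area

# LotArea: lot size

# Drop outliers
# print(train['LotArea'].value_counts().sort_index())
train = train.drop(train[train['LotArea'] > 100000].index)

# Log-transform and plot
train['LotArea'] = np.log1p(train['LotArea'])
sns.distplot(train['LotArea'], fit=norm)
fig = plt.figure()
res = stats.probplot(train['LotArea'], plot=plt)
fig = plt.figure()
plt.plot(train['LotArea'],train['SalePrice'],'o')

# test: log-transform
test['LotArea'] = np.log1p(test['LotArea'])

[Figure: histogram, probability plot and price scatter plot for log-transformed LotArea]

Even after the log transform, the fitted normal curve does not match the data well.
In the probability plot, the points below -1 and above 2 on the theoretical-quantile axis deviate from the normal line.
The scatter plot shows no pronounced linear relationship between lot area and price.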

2.2.7 Street and alley access

# Street, Alley: type of road and alley access

# Inspect the value distribution of both features
print(train['Street'].value_counts().sort_index())
print(train['Alley'].value_counts().sort_index())

# Too little usable data, so both columns are dropped
train = train.drop(['Street', 'Alley'], axis=1)
test = test.drop(['Street', 'Alley'], axis=1)

Grvl       4
Pave    1186
Name: Street, dtype: int64
Grvl    43
Pave    28
Name: Alley, dtype: int64

* Street is almost entirely Pave and Alley is recorded for only 71 houses, so both columns are dropped.
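
Columns like these can be found systematically instead of one at a time. The helper below is a hypothetical sketch (the name `low_information_columns` and both thresholds are assumptions, not part of the original notebook); it flags columns that are mostly missing or dominated by a single value.

```python
import numpy as np
import pandas as pd

def low_information_columns(df, max_dominant_share=0.99, max_missing_share=0.9):
    """List columns that are mostly missing or dominated by one value."""
    flagged = []
    for col in df.columns:
        missing_share = df[col].isnull().mean()
        shares = df[col].value_counts(normalize=True)   # shares among non-missing rows
        top_share = shares.max() if len(shares) else 0.0
        if missing_share > max_missing_share or top_share > max_dominant_share:
            flagged.append(col)
    return flagged

# Toy frame with made-up values
toy = pd.DataFrame({
    "Street": ["Pave"] * 199 + ["Grvl"],
    "Alley": [np.nan] * 190 + ["Grvl"] * 10,
    "LotShape": ["Reg"] * 100 + ["IR1"] * 100,
})
to_drop = low_information_columns(toy)
print(to_drop)
```

The same list can then be passed to `drop(..., axis=1)` on both train and test so the two frames keep identical columns.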

2.2.8 Lot shape

# LotShape: general shape of the property

# Inspect the value distribution
print(train['LotShape'].value_counts().sort_index())
print(train['LotShape'].isnull().sum()/len(train))

# Scatter plot, one series per shape
for shape in ['IR1', 'Reg', 'IR2', 'IR3']:
    mask = train['LotShape'] == shape
    plt.plot(train['GrLivArea'][mask], train['SalePrice'][mask], 'o')

IR1    396
IR2     31
IR3      6
Reg    757
Name: LotShape, dtype: int64
0.0





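[Figure: GrLivArea against log SalePrice, one series per lot shape]

The scatter plot shows that prices for the different lot shapes fall within a similar range, with no clear separation between them.
Within each of the four lot shapes, above-ground living area and sale price are roughly linearly related.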

2.2.9 Land contour

# LandContour: flatness of the property

# Inspect the value distribution
print(train['LandContour'].value_counts().sort_index())
print(train['LandContour'].isnull().sum()/len(train))

# Scatter plot, one series per contour type
for contour in ['Lvl', 'Bnk', 'HLS', 'Low']:
    mask = train['LandContour'] == contour
    plt.plot(train['GrLivArea'][mask], train['SalePrice'][mask], 'o')

Bnk      49
HLS      33
Low      29
Lvl    1079
Name: LandContour, dtype: int64
0.0





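[Figure: GrLivArea against log SalePrice, one series per land contour]

The scatter plot shows that prices for the different land contours fall within a similar range, with no clear separation between them.
Within each of the four land contours, above-ground living area and sale price do not show a definite linear relationship.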

2.2.10 Utilities

# Utilities: utilities available

# Inspect the value distribution
print(train['Utilities'].value_counts().sort_index())
print(train['Utilities'].isnull().sum()/len(train))

# The column carries no information, so it is dropped
train = train.drop('Utilities', axis=1)
test = test.drop('Utilities', axis=1)

AllPub    1190
Name: Utilities, dtype: int64
0.0

* Only one value (AllPub) remains in the training data, so the column is dropped.

2.2.11 Lot configuration

# LotConfig: lot configuration

# Inspect the value distribution
print(train['LotConfig'].value_counts().sort_index())
print(train['LotConfig'].isnull().sum()/len(train))

# Scatter plot, one series per configuration (FR2 is not plotted)
for config in ['Corner', 'Inside', 'CulDSac', 'FR3']:
    mask = train['LotConfig'] == config
    plt.plot(train['GrLivArea'][mask], train['SalePrice'][mask], 'o')

Corner     209
CulDSac     77
FR2         42
FR3          4
Inside     858
Name: LotConfig, dtype: int64
0.0





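[Figure: GrLivArea against log SalePrice, one series per lot configuration]

The scatter plot shows that prices for the different lot configurations fall within a similar range, with no clear separation between them.
Within each of the four plotted configurations, above-ground living area and sale price are roughly linearly related.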

2.2.12 Land slope

# LandSlope: slope of the property

# Inspect the value distribution
print(train['LandSlope'].value_counts().sort_index())
print(train['LandSlope'].isnull().sum()/len(train))

# Scatter plot, one series per slope class
for slope in ['Gtl', 'Mod', 'Sev']:
    mask = train['LandSlope'] == slope
    plt.plot(train['GrLivArea'][mask], train['SalePrice'][mask], 'o')

Gtl    1130
Mod      55
Sev       5
Name: LandSlope, dtype: int64
0.0





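[Figure: GrLivArea against log SalePrice, one series per land slope]

The scatter plot shows that prices for the different slopes fall within a similar range, with no clear separation between them.
Within each of the three slope classes, above-ground living area and sale price are roughly linearly related.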

2.2.13 Location and surroundings

# Neighborhood, Condition1, Condition2: location and proximity to roads or railroads

# Inspect the value distributions
print(train['Neighborhood'].value_counts().sort_index())
print(train['Neighborhood'].isnull().sum()/len(train))

print(train['Condition1'].value_counts().sort_index())
print(train['Condition1'].isnull().sum()/len(train))

print(train['Condition2'].value_counts().sort_index())
print(train['Condition2'].isnull().sum()/len(train))

# Scatter plot for the three most common Condition1 values
for cond in ['Norm', 'Artery', 'Feedr']:
    mask = train['Condition1'] == cond
    plt.plot(train['GrLivArea'][mask], train['SalePrice'][mask], 'o')

Blmngtn     12
Blueste      2
BrDale      12
BrkSide     54
ClearCr     22
CollgCr    129
Crawfor     43
Edwards     82
Gilbert     64
IDOTRR      29
MeadowV     16
Mitchel     42
NAmes      197
NPkVill      8
NWAmes      64
NoRidge     36
NridgHt     45
OldTown     92
SWISU       22
Sawyer      67
SawyerW     50
Somerst     49
StoneBr     16
Timber      26
Veenker     11
Name: Neighborhood, dtype: int64
0.0
Artery      39
Feedr       67
Norm      1025
PosA         7
PosN        17
RRAe         8
RRAn        21
RRNe         2
RRNn         4
Name: Condition1, dtype: int64
0.0
Artery       1
Feedr        5
Norm      1179
PosA         1
RRAe         1
RRAn         1
RRNn         2
Name: Condition2, dtype: int64
0.0





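[Figure: GrLivArea against log SalePrice for Condition1 = Norm, Artery and Feedr]

Neighborhood is spread thinly over 25 values, so it is not examined in detail.
Condition2 is concentrated almost entirely in 'Norm', so it is not examined in detail either.
The scatter plot shows that the price distribution differs visibly between Condition1 values.
Within each of the three plotted Condition1 values, above-ground living area and sale price are roughly linearly related.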

2.2.14 Dwelling type

# BldgType: type of dwelling

# Inspect the value distribution
print(train['BldgType'].value_counts().sort_index())
print(train['BldgType'].isnull().sum()/len(train))

# Scatter plot, one series per dwelling type
for bldg in ['1Fam', '2fmCon', 'Duplex', 'TwnhsE', 'Twnhs']:
    mask = train['BldgType'] == bldg
    plt.plot(train['GrLivArea'][mask], train['SalePrice'][mask], 'o')

1Fam      999
2fmCon     27
Duplex     37
Twnhs      37
TwnhsE     90
Name: BldgType, dtype: int64
0.0





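[Figure: GrLivArea against log SalePrice, one series per dwelling type]

The scatter plot shows that prices for the different dwelling types fall within a similar range, with no clear separation between them.
Within each of the five dwelling types, above-ground living area and sale price are roughly linearly related.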

2.2.15 Overall quality

# OverallQual: overall material and finish quality

# Inspect the value distribution
print(train['OverallQual'].value_counts().sort_index())
print(train['OverallQual'].isnull().sum()/len(train))

# Plot (no log transform is applied to this rating)
sns.distplot(train['OverallQual'], fit=norm)
fig = plt.figure()
res = stats.probplot(train['OverallQual'], plot=plt)
fig = plt.figure()
plt.plot(train['OverallQual'],train['SalePrice'],'o')

1       2
2       2
3      16
4      98
5     340
6     327
7     252
8     120
9      25
10      8
Name: OverallQual, dtype: int64
0.0





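[Figure: histogram, probability plot and price scatter plot for OverallQual]

A normal distribution is fitted to this column, giving the normal curve closest to the data.
The probability plot compares the theoretical normal distribution with the observed one; the ratings are close to normal.
Overall quality and price are clearly strongly linearly related.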

2.2.16 Overall condition

# OverallCond: overall condition rating

# Inspect the value distribution and drop outliers
print(train['OverallCond'].value_counts().sort_index())
print(train['OverallCond'].isnull().sum()/len(train))
# SalePrice is already on the log scale (2.2.1), so the threshold is log-transformed too
train = train.drop(train[(train['OverallCond'] == 2) & (train['SalePrice'] > np.log1p(300000))].index)

# Plot (no log transform is applied to this rating)
sns.distplot(train['OverallCond'], fit=norm)
fig = plt.figure()
res = stats.probplot(train['OverallCond'], plot=plt)
fig = plt.figure()
plt.plot(train['OverallCond'],train['SalePrice'],'o')

1      1
2      2
3     18
4     49
5    626
6    225
7    180
8     70
9     19
Name: OverallCond, dtype: int64
0.0





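[Figure: histogram, probability plot and price scatter plot for OverallCond]

The fitted normal curve does not match the data well.
In the probability plot, the points below -2 and above 1 on the theoretical-quantile axis deviate from the normal line.
The scatter plot shows no pronounced linear relationship between the overall condition rating and price.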
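
Because SalePrice was log-transformed in 2.2.1, any later filter on price has to use a threshold on the same scale. The sketch below, with made-up prices, shows that comparing log prices against a raw amount such as 300000 matches nothing, while transforming the threshold with `np.log1p` selects the intended row.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "OverallCond": [2, 2, 5],
    "SalePrice": [394432.0, 85000.0, 310000.0],  # made-up raw prices
})
toy["SalePrice"] = np.log1p(toy["SalePrice"])    # same transform as in 2.2.1

# Log prices are around 11 to 13, so a raw threshold never matches
naive_mask = (toy["OverallCond"] == 2) & (toy["SalePrice"] > 300000)
# Transforming the threshold the same way restores the intended filter
log_mask = (toy["OverallCond"] == 2) & (toy["SalePrice"] > np.log1p(300000))

print(naive_mask.sum(), log_mask.sum())
```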

2.2.17 Year built

# YearBuilt: original construction year

# Derive two features: a flag for houses built after 2000, and the age in 2010
train['newhouse'] = (train['YearBuilt'] > 2000).astype(float)
train['age'] = 2010 - train['YearBuilt']
print(train['age'].value_counts().sort_index())
print(train['YearBuilt'].value_counts().sort_index())
print(train['YearBuilt'].isnull().sum()/len(train))

# Scatter plot
plt.plot(train['YearBuilt'],train['SalePrice'],'o')

# test: same derived features
test['newhouse'] = (test['YearBuilt'] > 2000).astype(float)
test['age'] = 2010 - test['YearBuilt']

1       3
2       9
3      19
4      24
5      39
       ..
125     2
128     1
130     2
135     1
138     1
Name: age, Length: 110, dtype: int64
1872     1
1875     1
1880     2
1882     1
1885     2
        ..
2005    39
2006    24
2007    19
2008     9
2009     3
Name: YearBuilt, Length: 110, dtype: int64
0.0

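[Figure: YearBuilt against log SalePrice]

The scatter plot shows that the price distribution differs visibly with the construction year.
Construction year and sale price are roughly linearly related.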

2.2.18 Remodel date

# YearRemodAdd: remodel date

# Inspect the value distribution and flag the value 1950, which is by far the most common
print(train['YearRemodAdd'].value_counts().sort_index())
print(train['YearRemodAdd'].isnull().sum()/len(train))
train['1950house'] = (train['YearRemodAdd'] == 1950).astype(float)

# Plot
sns.distplot(train['YearRemodAdd'], fit=norm)
fig = plt.figure()
res = stats.probplot(train['YearRemodAdd'], plot=plt)
fig = plt.figure()
plt.plot(train['YearRemodAdd'],train['SalePrice'],'o')

# test: same flag
test['1950house'] = (test['YearRemodAdd'] == 1950).astype(float)

1950    150
1951      3
1952      4
1953      8
1954     12
       ... 
2006     48
2007     36
2008     22
2009      6
2010      1
Name: YearRemodAdd, Length: 61, dtype: int64
0.0

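[Figure: histogram, probability plot and price scatter plot for YearRemodAdd]

A normal distribution is fitted to this column, giving the normal curve closest to the data.
The probability plot compares the theoretical normal distribution with the observed one; the column is close to normal.
Remodel date and price are clearly strongly linearly related.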

2.2.19 Foundation type

# Foundation: type of foundation

# Inspect the value distribution
print(train['Foundation'].value_counts().sort_index())
print(train['Foundation'].isnull().sum()/len(train))

# Scatter plot for the three most common foundation types
for foundation in ['BrkTil', 'CBlock', 'PConc']:
    mask = train['Foundation'] == foundation
    plt.plot(train['GrLivArea'][mask], train['SalePrice'][mask], 'o')

BrkTil    124
CBlock    551
PConc     486
Slab       21
Stone       5
Wood        3
Name: Foundation, dtype: int64
0.0





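[Figure: GrLivArea against log SalePrice for the BrkTil, CBlock and PConc foundations]

The scatter plot shows that prices for the different foundation types fall within a similar range, with no clear separation between them.
Within each of the three plotted foundation types, above-ground living area and sale price are roughly linearly related.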

2.2.20 Basement height and related ratings

# BsmtQual: height of the basement (plus BsmtCond, BsmtExposure, BsmtFinType1)

# Fill missing values
train['BsmtQual'] = train['BsmtQual'].fillna("unknown")
train['BsmtCond'] = train['BsmtCond'].fillna("TA")
train['BsmtExposure'] = train['BsmtExposure'].fillna("unknown")
train['BsmtFinType1'] = train['BsmtFinType1'].fillna("unknown")

test['BsmtQual'] = test['BsmtQual'].fillna("unknown")
test['BsmtCond'] = test['BsmtCond'].fillna("TA")
test['BsmtExposure'] = test['BsmtExposure'].fillna("unknown")
test['BsmtFinType1'] = test['BsmtFinType1'].fillna("unknown")

* These columns are incomplete, so they are only filled and not analyzed further.
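
Writing the fill values once and applying them to both frames keeps train and test consistent. A minimal sketch using `DataFrame.fillna` with a per-column dictionary; the toy frames are made up, and the fill values mirror the ones used above.

```python
import numpy as np
import pandas as pd

train_toy = pd.DataFrame({"BsmtQual": ["TA", "Gd", np.nan],
                          "BsmtCond": ["TA", np.nan, "TA"]})
test_toy = pd.DataFrame({"BsmtQual": [np.nan, "Ex"],
                         "BsmtCond": [np.nan, "Fa"]})

# One mapping of column -> fill value, applied identically to both frames
fill_values = {"BsmtQual": "unknown", "BsmtCond": "TA"}
train_toy = train_toy.fillna(value=fill_values)
test_toy = test_toy.fillna(value=fill_values)

print(test_toy)
```

Adding another column to the dictionary then changes both frames at once, which avoids filling a column in one frame and forgetting it in the other.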

2.2.21 Type 1 finished basement area

# BsmtFinSF1: Type 1 finished basement area

# Inspect the value distribution and flag houses with no Type 1 finished area
print(train['BsmtFinSF1'].value_counts())
print(train['BsmtFinSF1'].isnull().sum()/len(train))
train['noSF1'] = (train['BsmtFinSF1'] == 0).astype(float)

# Scatter plot
plt.plot(train['BsmtFinSF1'],train['SalePrice'],'o')

# test: fill missing values, then add the same flag
test['BsmtFinSF1'] = test['BsmtFinSF1'].fillna(0)
test['BsmtFinSF2'] = test['BsmtFinSF2'].fillna(0)
test['noSF1'] = (test['BsmtFinSF1'] == 0).astype(float)

0      353
24       9
16       8
662      5
686      5
      ... 
762      1
763      1
769      1
772      1
609      1
Name: BsmtFinSF1, Length: 554, dtype: int64
0.0

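[Figure: BsmtFinSF1 against log SalePrice]

The scatter plot shows that the price distribution differs visibly with the Type 1 finished basement area.
Type 1 finished basement area and sale price are roughly linearly related.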

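2.2.22 Basement condition

# BsmtCond: general condition of the basement

# Fill missing values (BsmtCond was already filled with "TA" in 2.2.20, so these calls change nothing)
train['BsmtCond'] = train['BsmtCond'].fillna(0)
test['BsmtCond'] = test['BsmtCond'].fillna(0)

* The column is incomplete, so it is only filled and not analyzed further.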

2.2.23 Unfinished and total basement area

# BsmtUnfSF, TotalBsmtSF, noTotalBsmtSF

# Inspect the value distribution and flag houses without a basement
print(train['BsmtUnfSF'].value_counts())
print(train['TotalBsmtSF'].isnull().sum()/len(train))
train['noTotalBsmtSF'] = (train['TotalBsmtSF'] == 0).astype(float)
index = train['TotalBsmtSF'] > 0

# Log-transform and plot (houses with a basement only)
train['TotalBsmtSF'] = np.log1p(train['TotalBsmtSF'])
sns.distplot(train['TotalBsmtSF'][index], fit=norm)
fig = plt.figure()
res = stats.probplot(train['TotalBsmtSF'][index], plot=plt)
fig = plt.figure()
plt.plot(train['TotalBsmtSF'][index],train['SalePrice'][index],'o')

# test: fill missing values, add the flag, log-transform
test['TotalBsmtSF'] = test['TotalBsmtSF'].fillna(0)
test['noTotalBsmtSF'] = (test['TotalBsmtSF'] == 0).astype(float)
test['TotalBsmtSF'] = np.log1p(test['TotalBsmtSF'])

0      100
384      8
572      7
300      6
319      5
      ... 
778      1
779      1
783      1
784      1
568      1
Name: BsmtUnfSF, Length: 668, dtype: int64
0.0

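[Figure: histogram, probability plot and price scatter plot for log-transformed TotalBsmtSF]

After the log transform, the fitted normal curve matches the data well.
In the probability plot, only the points below -2 on the theoretical-quantile axis deviate from the normal line.
The scatter plot shows no pronounced linear relationship between basement area and price.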
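
Several area features share this pattern: many exact zeros (the house lacks the feature) and a skewed positive part. The helper below is a hypothetical sketch of the "flag the zeros, then log-transform" step written without chained indexing; the function name and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def flag_zero_and_log(df, col, flag_name):
    """Add a 0/1 'feature absent' flag and log1p-transform the column."""
    out = df.copy()                                   # leave the input untouched
    out[flag_name] = (out[col] == 0).astype(float)
    out[col] = np.log1p(out[col])
    return out

toy = pd.DataFrame({"TotalBsmtSF": [0.0, 856.0, 1262.0, 0.0]})
result = flag_zero_and_log(toy, "TotalBsmtSF", "noTotalBsmtSF")

# Mask for rows where the feature exists, e.g. for plotting the positive part
positive = result["noTotalBsmtSF"] == 0
print(result)
```

Because `log1p(0)` is 0, the zeros stay at zero and the flag lets a linear model treat "no basement" separately from "small basement".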

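2.2.24 Heating and electrical system

# Heating: type of heating (only Electrical is filled in this step)

# Fill missing values with the standard breaker value
train['Electrical'] = train['Electrical'].fillna('SBrkr')
test['Electrical'] = test['Electrical'].fillna('SBrkr')

* The column is incomplete, so it is only filled and not analyzed further.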

2.2.25 First-floor area

# 1stFlrSF: first-floor area

# Inspect the value distribution
print(train['1stFlrSF'].value_counts())
print(train['1stFlrSF'].isnull().sum()/len(train))

# Log-transform and plot
train['1stFlrSF'] = np.log1p(train['1stFlrSF'])
sns.distplot(train['1stFlrSF'], fit=norm)
fig = plt.figure()
res = stats.probplot(train['1stFlrSF'], plot=plt)
fig = plt.figure()
# Plot all rows; the earlier `index` mask belonged to TotalBsmtSF
plt.plot(train['1stFlrSF'],train['SalePrice'],'o')

# test: log-transform
test['1stFlrSF'] = np.log1p(test['1stFlrSF'])

864     22
1040    15
912     12
848     11
894     10
        ..
1304     1
1307     1
1309     1
1310     1
2053     1
Name: 1stFlrSF, Length: 647, dtype: int64
0.0

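![Histogram, probability plot and price scatter plot for log-transformed 1stFlrSF](https://img-blog.csdnimg.cn/direct/e2449ee5957646aba892818ade57a9ca.png#pic_center)

After the log transform, the fitted normal curve matches the data fairly well.
In the probability plot, the points near -3 on the theoretical-quantile axis deviate from the normal line.
The scatter plot shows no pronounced linear relationship between first-floor area and price.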

2.2.26 Second-floor area

# 2ndFlrSF: second-floor area

# Inspect the value distribution and flag houses without a second floor
print(train['2ndFlrSF'].value_counts())
print(train['2ndFlrSF'].isnull().sum()/len(train))
train['no2ndFlrSF'] = (train['2ndFlrSF'] == 0).astype(float)

# Plot
sns.distplot(train['2ndFlrSF'], fit=norm)

# test: same flag
test['no2ndFlrSF'] = (test['2ndFlrSF'] == 0).astype(float)

0       662
504       8
600       6
896       6
672       6
       ... 
1243      1
838       1
836       1
834       1
1796      1
Name: 2ndFlrSF, Length: 358, dtype: int64
0.0

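[Figure: histogram of 2ndFlrSF with fitted normal curve]

The fitted normal curve does not match the data well, because more than half of the houses have no second floor.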

2.2.27 Low-quality finished area

# LowQualFinSF: low-quality finished area

# Inspect the value distribution and flag houses that have any such area
print(train['LowQualFinSF'].value_counts())
print(train['LowQualFinSF'].isnull().sum()/len(train))
train['haveLowQual'] = (train['LowQualFinSF'] > 0).astype(float)

# test: same flag
test['haveLowQual'] = (test['LowQualFinSF'] > 0).astype(float)

0      1170
80        2
360       2
479       1
473       1
420       1
397       1
390       1
384       1
514       1
234       1
232       1
205       1
156       1
144       1
120       1
481       1
53        1
528       1
Name: LowQualFinSF, dtype: int64
0.0

* The values are concentrated almost entirely at 0, so the feature is not analyzed further.

2.2.28 Above-ground living area

# GrLivArea: above-ground living area

# Inspect the value distribution
print(train['GrLivArea'].value_counts())
print(train['GrLivArea'].isnull().sum()/len(train))

# Log-transform and plot
train['GrLivArea'] = np.log1p(train['GrLivArea'])
sns.distplot(train['GrLivArea'], fit=norm);
fig = plt.figure()
res = stats.probplot(train['GrLivArea'], plot=plt)
fig = plt.figure()
plt.plot(train['GrLivArea'],train['SalePrice'],'o')

# test: log-transform
test['GrLivArea'] = np.log1p(test['GrLivArea'])

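[Figure: histogram, probability plot and price scatter plot for log-transformed GrLivArea]

A normal distribution is fitted to this column, giving the normal curve closest to the data.
The probability plot compares the theoretical normal distribution with the observed one; the log-transformed living area is close to normal.
Above-ground living area and price are clearly strongly linearly related.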

2.2.29 Full bathrooms above ground

# FullBath: full bathrooms above ground

# Inspect the value distribution
print(train['FullBath'].value_counts())
print(train['FullBath'].isnull().sum()/len(train))

# Plot (no log transform is applied to this count)
sns.distplot(train['FullBath'], fit=norm);
fig = plt.figure()
res = stats.probplot(train['FullBath'], plot=plt)
fig = plt.figure()
plt.plot(train['FullBath'],train['SalePrice'],'o')

2    606
1    563
3     16
0      5
Name: FullBath, dtype: int64
0.0





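[Figure: histogram, probability plot and price scatter plot for FullBath]

A normal distribution is fitted to this column, giving the normal curve closest to the data.
The probability plot compares the theoretical normal distribution with the observed one; the column is close to normal.
The number of full bathrooms above ground and price are clearly strongly linearly related.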

2.2.30 Basement full bathrooms

# BsmtFullBath: basement full bathrooms

# Inspect the value distribution
print(train['BsmtFullBath'].value_counts())
print(train['BsmtFullBath'].isnull().sum()/len(train))

# Log-transform and plot
train['BsmtFullBath'] = np.log1p(train['BsmtFullBath'])
sns.distplot(train['BsmtFullBath'], fit=norm);
fig = plt.figure()
res = stats.probplot(train['BsmtFullBath'], plot=plt)
fig = plt.figure()
plt.plot(train['BsmtFullBath'],train['SalePrice'],'o')

# test: fill missing values with the mean, then log-transform
test['BsmtFullBath'] = test['BsmtFullBath'].fillna(test['BsmtFullBath'].mean())
test['BsmtFullBath'] = np.log1p(test['BsmtFullBath'])

0    687
1    497
2      6
Name: BsmtFullBath, dtype: int64
0.0

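[Figure: histogram, probability plot and price scatter plot for log-transformed BsmtFullBath]

Even after the log transform, the fitted normal curve does not match the data well.
The points in the probability plot deviate from the normal line.
The scatter plot shows no pronounced linear relationship between the number of basement full bathrooms and price.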
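
One reason the fit stays poor is that a count with three distinct values still has only three distinct values after `np.log1p`; the transform changes the spacing but cannot make the column continuous. A tiny sketch with made-up counts:

```python
import numpy as np
import pandas as pd

counts = pd.Series([0, 1, 0, 2, 1, 0, 1, 0])   # made-up bathroom counts
logged = np.log1p(counts)

n_levels_before = counts.nunique()
n_levels_after = logged.nunique()
print(sorted(logged.unique()))
```

For such low-cardinality counts, treating the value as a category or leaving it untransformed are alternatives worth comparing; that choice is outside what the original write-up tests.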

2.2.31 Total rooms above ground

# TotRmsAbvGrd: total rooms above ground

# Inspect the value distribution
print(train['TotRmsAbvGrd'].value_counts())
print(train['TotRmsAbvGrd'].isnull().sum()/len(train))

# Log-transform and plot
train['TotRmsAbvGrd'] = np.log1p(train['TotRmsAbvGrd'])
sns.distplot(train['TotRmsAbvGrd'], fit=norm);
fig = plt.figure()
res = stats.probplot(train['TotRmsAbvGrd'], plot=plt)
fig = plt.figure()
plt.plot(train['TotRmsAbvGrd'],train['SalePrice'],'o')

# test: fill missing values with the mean, then log-transform
test['TotRmsAbvGrd'] = test['TotRmsAbvGrd'].fillna(test['TotRmsAbvGrd'].mean())
test['TotRmsAbvGrd'] = np.log1p(test['TotRmsAbvGrd'])

6     328
7     267
5     243
8     150
4      77
9      58
10     33
3      16
11     10
12      7
2       1
Name: TotRmsAbvGrd, dtype: int64
0.0

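[Figure: histogram, probability plot and price scatter plot for log-transformed TotRmsAbvGrd]

A normal distribution is fitted to this column, giving the normal curve closest to the data.
The probability plot compares the theoretical normal distribution with the observed one; the column is close to normal.
The number of rooms above ground and price are clearly strongly linearly related.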

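2.2.32 Functionality rating and fireplaces

# Functional: home functionality rating (this step only inspects Fireplaces and drops FireplaceQu)

# Inspect the value distribution
print(train['Fireplaces'].value_counts())
print(train['Fireplaces'].isnull().sum()/len(train))

# Too little usable data, so FireplaceQu is dropped
train = train.drop('FireplaceQu',axis=1)
test = test.drop('FireplaceQu',axis=1)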

0    563
1    531
2     92
3      4
Name: Fireplaces, dtype: int64
0.0

* There is too little usable data (563 of the 1190 houses have no fireplace), so FireplaceQu is dropped.

2.2.33 Garage location

# GarageType: garage location

# Inspect the value distribution
print(train['GarageType'].value_counts())
print(train['GarageType'].isnull().sum()/len(train))

# Fill missing values
train['GarageType'] = train['GarageType'].fillna('unknown')
test['GarageType'] = test['GarageType'].fillna('unknown')

Attchd     705
Detchd     337
BuiltIn     65
Basment     12
CarPort      6
2Types       4
Name: GarageType, dtype: int64
0.05126050420168067

* About 5% of the values are missing; they are filled with 'unknown' and the feature is not analyzed further.

2.2.33 Year the garage was built

# GarageYrBlt: year the garage was built

# Inspect the value distribution
print(train['GarageYrBlt'].value_counts())
print(train['GarageYrBlt'].isnull().sum()/len(train))

# Fill missing values with the column mean
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(train['GarageYrBlt'].mean())
test['GarageYrBlt'] = test['GarageYrBlt'].fillna(test['GarageYrBlt'].mean())

# Scatter plot
plt.plot(train['GarageYrBlt'],train['SalePrice'],'o')

2004.0    52
2003.0    47
2005.0    41
1977.0    31
2000.0    26
          ..
1937.0     1
1906.0     1
1947.0     1
1900.0     1
1933.0     1
Name: GarageYrBlt, Length: 96, dtype: int64
0.05126050420168067





[Figure: GarageYrBlt against log SalePrice]

The scatter plot shows no visible difference in the price distribution across garage construction years.
The year the garage was built and sale price are not linearly related.

2.2.34 Garage interior finish

# GarageFinish: interior finish of the garage

# Inspect the value distribution
print(train['GarageFinish'].value_counts())
print(train['GarageFinish'].isnull().sum()/len(train))

# Fill missing values
train['GarageFinish'] = train['GarageFinish'].fillna('unknown')
test['GarageFinish'] = test['GarageFinish'].fillna('unknown')

Unf    526
RFn    340
Fin    263
Name: GarageFinish, dtype: int64
0.05126050420168067

* About 5% of the values are missing; they are filled with 'unknown' and the feature is not analyzed further.

2.2.35 Garage capacity

# GarageCars: garage capacity in cars

# Inspect the value distribution
print(train['GarageCars'].value_counts())
print(train['GarageCars'].isnull().sum()/len(train))

# Log-transform and plot
train['GarageCars'] = np.log1p(train['GarageCars'])
sns.distplot(train['GarageCars'], fit=norm);
fig = plt.figure()
res = stats.probplot(train['GarageCars'], plot=plt)
fig = plt.figure()
plt.plot(train['GarageCars'],train['SalePrice'],'o')

# test: fill missing values, then apply the same log transform as for train
test['GarageCars'] = test['GarageCars'].fillna(0)
test['GarageCars'] = np.log1p(test['GarageCars'])

2    694
1    325
3    106
0     61
4      4
Name: GarageCars, dtype: int64
0.0

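[Figure: histogram, probability plot and price scatter plot for log-transformed GarageCars]

A normal distribution is fitted to this column, giving the normal curve closest to the data.
The probability plot compares the theoretical normal distribution with the observed one; the column is close to normal.
Garage capacity and price are clearly strongly linearly related.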
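
Any transform applied to a training column has to be applied to the same test column, otherwise the model sees the feature on two different scales. One way to make that hard to forget is to put the steps in a single function and call it on both frames. A minimal sketch; `transform_garage_cars` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd

train_toy = pd.DataFrame({"GarageCars": [2.0, 1.0, 0.0, 3.0]})
test_toy = pd.DataFrame({"GarageCars": [2.0, np.nan, 1.0]})

def transform_garage_cars(df):
    """Fill missing capacity with 0, then apply log1p; returns a new frame."""
    out = df.copy()
    out["GarageCars"] = np.log1p(out["GarageCars"].fillna(0))
    return out

# Calling one function on both frames keeps their scales identical
train_toy = transform_garage_cars(train_toy)
test_toy = transform_garage_cars(test_toy)
print(train_toy["GarageCars"].max(), test_toy["GarageCars"].max())
```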

2.2.36 Garage area

# GarageArea: garage area

# Inspect the value distribution and flag houses without a garage
print(train['GarageArea'].value_counts())
print(train['GarageArea'].isnull().sum()/len(train))
train['noGarageArea'] = (train['GarageArea'] == 0).astype(float)
index = train['GarageArea'] > 0

# Log-transform and plot (houses with a garage only)
train['GarageArea'] = np.log1p(train['GarageArea'])
sns.distplot(train['GarageArea'][index], fit=norm);
fig = plt.figure()
res = stats.probplot(train['GarageArea'][index], plot=plt)
fig = plt.figure()
plt.plot(train['GarageArea'],train['SalePrice'],'o')

# test: fill missing values, add the flag, log-transform
test['GarageArea'] = test['GarageArea'].fillna(0)
test['noGarageArea'] = (test['GarageArea'] == 0).astype(float)
test['GarageArea'] = np.log1p(test['GarageArea'])

0      61
440    44
576    44
240    34
484    30
       ..
605     1
604     1
602     1
601     1
526     1
Name: GarageArea, Length: 385, dtype: int64
0.0

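[Figure: histogram, probability plot and price scatter plot for log-transformed GarageArea]

After the log transform, the fitted normal curve matches the data fairly well.
In the probability plot, the points above 2 on the theoretical-quantile axis deviate from the normal line.
The scatter plot shows no pronounced linear relationship between garage area and price.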

2.2.37 Garage quality

# GarageQual: garage quality

# Inspect the value distribution
print(train['GarageQual'].value_counts())
print(train['GarageQual'].isnull().sum()/len(train))

# Fill missing values
train['GarageQual'] = train['GarageQual'].fillna('unknown')
test['GarageQual'] = test['GarageQual'].fillna('unknown')

TA    1073
Fa      40
Gd      12
Po       2
Ex       2
Name: GarageQual, dtype: int64
0.05126050420168067

* About 5% of the values are missing; they are filled with 'unknown' and the feature is not analyzed further.

2.2.38 Garage condition

# GarageCond: garage condition

# Inspect the value distribution
print(train['GarageCond'].value_counts())
print(train['GarageCond'].isnull().sum()/len(train))

# Fill missing values
train['GarageCond'] = train['GarageCond'].fillna('unknown')
test['GarageCond'] = test['GarageCond'].fillna('unknown')

TA    1083
Fa      32
Gd       7
Po       5
Ex       2
Name: GarageCond, dtype: int64
0.05126050420168067

* About 5% of the values are missing; they are filled with 'unknown' and the feature is not analyzed further.

2.2.39 Wood deck area

# WoodDeckSF: wood deck area

# Flag houses without a wood deck (the flag is based on WoodDeckSF, as for test below)
train['noWoodDeckSF'] = (train['WoodDeckSF'] == 0).astype(float)
index = train['WoodDeckSF'] > 0
train['WoodDeckSF'] = np.log1p(train['WoodDeckSF'])

# Plot (houses with a wood deck only for the distribution plots)
sns.distplot(train['WoodDeckSF'][index], fit=norm);
fig = plt.figure()
res = stats.probplot(train['WoodDeckSF'][index], plot=plt)
fig = plt.figure()
plt.plot(train['WoodDeckSF'],train['SalePrice'],'o')

# test: same flag and log transform
test['noWoodDeckSF'] = (test['WoodDeckSF'] == 0).astype(float)
test['WoodDeckSF'] = np.log1p(test['WoodDeckSF'])

[Figure: histogram, probability plot and price scatter plot for log-transformed WoodDeckSF]

After the log transform, the fitted normal curve matches the data fairly well.
In the probability plot, the points above 2 on the theoretical-quantile axis deviate from the normal line.
The scatter plot shows no pronounced linear relationship between wood deck area and price.

2.2.40 Open porch area

# OpenPorchSF: open porch area

# Flag houses without an open porch (1 = none), then log-transform the area
train['noOpenPorchSF']  = pd.Series(np.zeros((len(train))),index = train.index)
train.loc[train['OpenPorchSF'] == 0, 'noOpenPorchSF'] = 1
index = train['OpenPorchSF'] > 0
train['OpenPorchSF'] = np.log1p(train['OpenPorchSF'])

# Plot the distribution, the probability plot, and the scatter plot against SalePrice
sns.distplot(train['OpenPorchSF'][index], fit=norm);
fig = plt.figure()
res = stats.probplot(train['OpenPorchSF'][index], plot=plt)
fig = plt.figure()
plt.plot(train['OpenPorchSF'],train['SalePrice'],'o')

# Apply the same flag and transform to the test set
test['noOpenPorchSF']  = pd.Series(np.zeros((len(test))),index = test.index)
test.loc[test['OpenPorchSF'] == 0, 'noOpenPorchSF'] = 1
test['OpenPorchSF'] = np.log1p(test['OpenPorchSF'])

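[Figure: distribution, probability plot, and scatter plot of OpenPorchSF]

After the log transform, the distribution matches the fitted normal curve reasonably well.
In the probability plot, the points deviate from the normal line for theoretical quantiles below -2 and above 2.
The scatter plot shows no significant linear relationship between open porch area and price.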
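Reading a probability plot by eye can be backed up with a number: when `stats.probplot` is called without a `plot` argument it also returns the correlation coefficient `r` of the least-squares line through the plotted points. The sketch below uses synthetic right-skewed data (an assumption standing in for a porch-area column) to compare `r` before and after `np.log1p`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, right-skewed positive values standing in for nonzero areas
skewed = rng.lognormal(mean=4.0, sigma=0.8, size=500)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
(_, _), (_, _, r_raw) = stats.probplot(skewed, dist='norm')
(_, _), (_, _, r_log) = stats.probplot(np.log1p(skewed), dist='norm')

print(f'r before log1p: {r_raw:.3f}, after log1p: {r_log:.3f}')
```

An `r` closer to 1 means the points lie closer to the straight line, i.e. the sample looks more normal.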

2.2.41 Enclosed porch area

# EnclosedPorch: enclosed porch area

# Inspect the value distribution and flag houses without an enclosed porch (1 = none)
print(train['EnclosedPorch'].value_counts())
train['noEnclosedPorch']  = pd.Series(np.zeros((len(train))),index = train.index)
train.loc[train['EnclosedPorch'] == 0, 'noEnclosedPorch'] = 1
index = train['EnclosedPorch'] > 0

# Plot the distribution, the probability plot, and the scatter plot against SalePrice
sns.distplot(train['EnclosedPorch'][index], fit=norm);
fig = plt.figure()
res = stats.probplot(train['EnclosedPorch'][index], plot=plt)
fig = plt.figure()
plt.plot(train['EnclosedPorch'],train['SalePrice'],'o')

# Apply the same flag to the test set
test['noEnclosedPorch']  = pd.Series(np.zeros((len(test))),index = test.index)
test.loc[test['EnclosedPorch'] == 0, 'noEnclosedPorch'] = 1

0      1012
112      12
216       5
120       5
144       5
       ... 
160       1
162       1
169       1
170       1
386       1
Name: EnclosedPorch, Length: 107, dtype: int64

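[Figure: distribution, probability plot, and scatter plot of EnclosedPorch]

The distribution of the nonzero values matches the fitted normal curve reasonably well (unlike the other porch features, EnclosedPorch is not log-transformed in the code above).
In the probability plot, the points deviate from the normal line for theoretical quantiles below -2 and above 2.
The scatter plot shows no significant linear relationship between enclosed porch area and price.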

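2.2.42 Three-season porch area

# 3SsnPorch: three-season porch area

# Inspect the value distribution and the share of missing values
print(train['3SsnPorch'].value_counts())
print(train['3SsnPorch'].isnull().sum()/len(train))

# The values are highly concentrated at 0, so drop the feature
train = train.drop('3SsnPorch',axis=1)
test = test.drop('3SsnPorch',axis=1)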

0      1172
144       2
168       2
180       1
96        1
130       1
140       1
162       1
508       1
407       1
196       1
216       1
238       1
245       1
290       1
320       1
182       1
Name: 3SsnPorch, dtype: int64
0.0

*The values are highly concentrated at 0 (1172 of 1190 rows), so the feature is dropped.
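"Highly concentrated" can be made concrete as the share of rows taken by the most frequent value. The sketch below rebuilds a column with the same counts as the printed output (the individual nonzero values are placeholders) and computes that share; `dominant_share` is a hypothetical helper name.

```python
import pandas as pd

def dominant_share(series):
    """Fraction of rows taken by the most frequent value (NaN counts as a value)."""
    return series.value_counts(dropna=False).iloc[0] / len(series)

# 1172 zeros, two values that occur twice, and 14 values that occur once: 1190 rows
col = pd.Series([0] * 1172 + [144, 168] * 2 + list(range(200, 214)))

share = dominant_share(col)
print(round(share, 4))
```

A column where this share is close to 1 has almost no variation for a model to learn from, which is the argument for dropping it.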

2.2.43 Screen porch area

# ScreenPorch: screen porch area

# Flag houses without a screen porch (1 = none)
train['noScreenPorch']  = pd.Series(np.zeros((len(train))),index = train.index)
train.loc[train['ScreenPorch'] == 0, 'noScreenPorch'] = 1
index = train['ScreenPorch'] > 0

# Log-transform the area, then plot the nonzero values
train['ScreenPorch'] = np.log1p(train['ScreenPorch'])
sns.distplot(train['ScreenPorch'][index], fit=norm);
fig = plt.figure()
res = stats.probplot(train['ScreenPorch'][index], plot=plt)
fig = plt.figure()
plt.plot(train['ScreenPorch'][index],train['SalePrice'][index],'o')

# Apply the same flag and transform to the test set
test['noScreenPorch']  = pd.Series(np.zeros((len(test))),index = test.index)
test.loc[test['ScreenPorch'] == 0, 'noScreenPorch'] = 1
test['ScreenPorch'] = np.log1p(test['ScreenPorch'])

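[Figure: distribution, probability plot, and scatter plot of ScreenPorch]

After the log transform, the distribution matches the fitted normal curve reasonably well.
In the probability plot, the points deviate from the normal line for theoretical quantiles below -1.5 and above 1.5.
The scatter plot is highly dispersed and shows no general trend.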

2.2.44 Pool area and quality

# PoolArea, PoolQC: pool area and pool quality

# Inspect the value distributions and the shares of missing values
print(train['PoolArea'].value_counts())
print(train['PoolArea'].isnull().sum()/len(train))
print(train['PoolQC'].value_counts())
print(train['PoolQC'].isnull().sum()/len(train))

# PoolArea is almost always 0 and PoolQC is almost always missing, so drop both
train = train.drop(['PoolArea', 'PoolQC'],axis=1)
test = test.drop(['PoolArea', 'PoolQC'],axis=1)

0      1187
648       1
576       1
519       1
Name: PoolArea, dtype: int64
0.0
Fa    2
Gd    1
Name: PoolQC, dtype: int64
0.9974789915966387

*PoolArea is 0 for almost every house and PoolQC is missing in over 99% of rows, so both features are dropped.

2.2.45 Miscellaneous features

# Fence (fence quality), MiscFeature (miscellaneous feature), MiscVal (value of the miscellaneous feature), MoSold (month sold), YrSold (year sold)

# Inspect the value distributions and the shares of missing values
misc_cols = ['Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold']
for col in misc_cols:
    print(train[col].value_counts())
    print(train[col].isnull().sum()/len(train))

# Drop these features from both sets
train = train.drop(misc_cols,axis=1)
test = test.drop(misc_cols,axis=1)

MnPrv    135
GdPrv     51
GdWo      45
MnWw       9
Name: Fence, dtype: int64
0.7983193277310925
Shed    43
Othr     2
Gar2     2
TenC     1
Name: MiscFeature, dtype: int64
0.9596638655462185
0        1143
400        10
500         7
700         4
450         4
2000        4
600         4
1200        2
480         2
1150        1
800         1
15500       1
3500        1
560         1
2500        1
1300        1
1400        1
350         1
8300        1
Name: MiscVal, dtype: int64
0.0
6     219
7     194
5     179
4     118
8      89
3      86
10     67
11     59
9      46
2      46
12     45
1      42
Name: MoSold, dtype: int64
0.0
2009    287
2007    262
2008    261
2006    226
2010    154
Name: YrSold, dtype: int64
0.0

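*Fence and MiscFeature are mostly missing and MiscVal is 0 for almost every house, so these features, together with MoSold and YrSold, are dropped.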

2.2.46 Masonry veneer type

# MasVnrType: masonry veneer type

# Inspect the value distribution, then fill missing values with 'None'
print(train['MasVnrType'].value_counts())
train['MasVnrType'] = train['MasVnrType'].fillna('None')

# Scatter plot of GrLivArea against SalePrice, one series per veneer type
plt.plot(train['GrLivArea'][train['MasVnrType'] == 'None'],train['SalePrice'][train['MasVnrType'] == 'None'],'o')
plt.plot(train['GrLivArea'][train['MasVnrType'] == 'BrkFace'],train['SalePrice'][train['MasVnrType'] == 'BrkFace'],'o')
plt.plot(train['GrLivArea'][train['MasVnrType'] == 'Stone'],train['SalePrice'][train['MasVnrType'] == 'Stone'],'o')

# Fill missing values in the test set
test['MasVnrType'] = test['MasVnrType'].fillna('None')

None       730
BrkFace    374
Stone       73
BrkCmn       9
Name: MasVnrType, dtype: int64

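[Figure: scatter plot of GrLivArea against SalePrice for the veneer types None, BrkFace, and Stone]

The scatter plot shows that sale prices fall within a similar range for every masonry veneer type, so the veneer type on its own has no significant effect.
Within each of the three veneer types, above-ground living area and sale price do not follow a clear linear relationship.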
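A per-group correlation coefficient is one way to put a number on what the scatter plot suggests. The sketch below computes the Pearson correlation between GrLivArea and SalePrice within each veneer type. It runs on synthetic data that was deliberately generated with a linear trend, so its output says nothing about the real data set; only the computation pattern carries over.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
# Synthetic stand-in for the training frame (log-scale area and price)
demo = pd.DataFrame({
    'MasVnrType': rng.choice(['None', 'BrkFace', 'Stone'], size=n),
    'GrLivArea': rng.uniform(6.5, 8.0, size=n),
})
demo['SalePrice'] = 5.0 + 0.9 * demo['GrLivArea'] + rng.normal(0, 0.2, size=n)

# Pearson correlation of GrLivArea and SalePrice inside each veneer type
per_group_r = {
    name: group['GrLivArea'].corr(group['SalePrice'])
    for name, group in demo.groupby('MasVnrType')
}
for name, r in per_group_r.items():
    print(f'{name}: r = {r:.3f}')
```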

2.2.47 Masonry veneer area

# MasVnrArea: masonry veneer area

# Fill missing values with 0 and flag houses without veneer (1 = none)
train['MasVnrArea'] = train['MasVnrArea'].fillna(0)
train['noMasVnrArea']  = pd.Series(np.zeros((len(train))),index = train.index)
train.loc[train['MasVnrArea'] == 0, 'noMasVnrArea'] = 1
index = train['MasVnrArea'] > 0

# Log-transform the area, then plot the nonzero values
train['MasVnrArea'] = np.log1p(train['MasVnrArea'])
sns.distplot(train['MasVnrArea'][index], fit=norm);
fig = plt.figure()
res = stats.probplot(train['MasVnrArea'][index], plot=plt)
fig = plt.figure()
plt.plot(train['MasVnrArea'][index],train['SalePrice'][index],'o')

# Apply the same fill, flag, and transform to the test set
test['MasVnrArea'] = test['MasVnrArea'].fillna(0)
test['noMasVnrArea']  = pd.Series(np.zeros((len(test))),index = test.index)
test.loc[test['MasVnrArea'] == 0, 'noMasVnrArea'] = 1
test['MasVnrArea'] = np.log1p(test['MasVnrArea'])

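[Figure: distribution, probability plot, and scatter plot of MasVnrArea]

After the log transform, the distribution matches the fitted normal curve reasonably well.
In the probability plot, the points deviate from the normal line for theoretical quantiles below -1.5 and above 1.5.
The scatter plot is highly dispersed and shows no general trend.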

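2.2.48 Second basement finish type (if present)

# BsmtFinType2: rating of the second finished basement area (if present)

# Inspect the value distribution and the share of missing values
print(train['BsmtFinType2'].value_counts())
print(train['BsmtFinType2'].isnull().sum()/len(train))

# Fill missing values with the most frequent category
train['BsmtFinType2'] = train['BsmtFinType2'].fillna('Unf')
test['BsmtFinType2'] = test['BsmtFinType2'].fillna('Unf')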

Unf    1012
Rec      47
LwQ      39
BLQ      29
ALQ      17
GLQ      13
Name: BsmtFinType2, dtype: int64
0.02773109243697479

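*About 3% of the values are missing and most of the rest are Unf, so this feature carries little general information.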

2.2.49 Miscellaneous features 2

# KitchenQual (kitchen quality), Functional (home functionality rating), Exterior1st (exterior covering), Exterior2nd (second exterior covering, if any), BsmtHalfBath (basement half bathrooms), BsmtFullBath (basement full bathrooms), BsmtUnfSF (unfinished basement area)

# Inspect the value distributions and the shares of missing values
misc_cols2 = ['KitchenQual', 'Functional', 'Exterior1st', 'Exterior2nd',
              'BsmtHalfBath', 'BsmtFullBath', 'BsmtUnfSF']
for col in misc_cols2:
    print(train[col].value_counts())
    print(train[col].isnull().sum()/len(train))

# Fill missing values in the test set (train has none in these columns)
test['KitchenQual'] = test['KitchenQual'].fillna('TA')
test['Functional'] = test['Functional'].fillna('Typ')
test['Exterior1st'] = test['Exterior1st'].fillna('unknown')
test['Exterior2nd'] = test['Exterior2nd'].fillna('unknown')
test['BsmtHalfBath'] = test['BsmtHalfBath'].fillna(0)
test['BsmtFullBath'] = test['BsmtFullBath'].fillna(0)
test['BsmtUnfSF'] = test['BsmtUnfSF'].fillna(0)

TA    637
Gd    466
Ex     53
Fa     34
Name: KitchenQual, dtype: int64
0.0
Typ     1100
Min2      32
Min1      27
Mod       14
Maj1      13
Maj2       4
Name: Functional, dtype: int64
0.0
VinylSd    385
HdBoard    202
MetalSd    189
Wd Sdng    179
Plywood     86
BrkFace     46
CemntBd     42
WdShing     21
Stucco      21
AsbShng     15
ImStucc      1
CBlock       1
BrkComm      1
AsphShn      1
Name: Exterior1st, dtype: int64
0.0
VinylSd    376
HdBoard    186
MetalSd    183
Wd Sdng    177
Plywood    115
CmentBd     41
Wd Shng     31
BrkFace     24
Stucco      21
AsbShng     16
ImStucc      7
Brk Cmn      6
Stone        3
AsphShn      3
CBlock       1
Name: Exterior2nd, dtype: int64
0.0
0    1125
1      65
Name: BsmtHalfBath, dtype: int64
0.0
0.000000    687
0.693147    497
1.098612      6
Name: BsmtFullBath, dtype: int64
0.0
0      100
384      8
572      7
300      6
319      5
      ... 
778      1
779      1
783      1
784      1
568      1
Name: BsmtUnfSF, Length: 668, dtype: int64
0.0

*These features are incomplete, highly dispersed, or highly concentrated, so they carry little general information.

2.3 Correlation analysis

2.3.1 Checking for missing values

# Largest per-column count of missing values in each set
print(train.isnull().sum().max())
print(test.isnull().sum().max())

0
1
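The output shows that at least one column of `test` still contains a missing value. To see which columns are affected, the per-column counts can be filtered instead of reduced to a maximum. The sketch below does this on a small made-up frame; which column is actually affected in the real test set is not stated in the source, so the column names here are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up frame: one numeric column has a single missing value
demo = pd.DataFrame({
    'GarageCars': [2.0, np.nan, 1.0],
    'LotArea': [8450, 9600, 11250],
    'SaleType': ['WD', 'WD', 'New'],
})

# Keep only the columns that still contain missing values, largest count first
missing = demo.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```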

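2.3.2 Key correlated features

# Correlation matrix over the numeric columns
corrmat = train.select_dtypes(include='number').corr()
k = 10
# The k columns most correlated with SalePrice (SalePrice itself comes first)
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
print(cols)
cm = train[cols].corr()
f, ax = plt.subplots(figsize=(14, 11))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True,fmt='.2f')
plt.show()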

Index(['SalePrice', 'totalArea', 'OverallQual', 'GrLivArea', 'GarageCars',
       '1stFlrSF', 'FullBath', 'YearBuilt', 'TotRmsAbvGrd', 'YearRemodAdd'],
      dtype='object')

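[Figure: correlation heatmap of SalePrice and the nine features most correlated with it]

Besides SalePrice itself, the nine features ['totalArea', 'OverallQual', 'GrLivArea', 'GarageCars', '1stFlrSF', 'FullBath', 'YearBuilt', 'TotRmsAbvGrd', 'YearRemodAdd'] show a clear correlation with the sale price.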

2.4 Standardization

# Stack train and test so that both get the same one-hot columns
concat_data = pd.concat([train,test])

# One-hot encode the categorical features
dummies_data = pd.get_dummies(concat_data.drop('SalePrice', axis=1))

# Standardize every column to zero mean and unit variance
dummies_data = StandardScaler().fit_transform(dummies_data)
X = dummies_data[:len(train)]
print(X.shape)
submission_data = dummies_data[-len(test):]
print(submission_data.shape)
y = concat_data.iloc[:len(train)]['SalePrice'].values

(1190, 281)
(1459, 281)
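The code above fits the scaler on the stacked train and test rows. A common alternative, shown here only as a sketch and not as the author's method, is to one-hot encode jointly (so the columns line up) but fit `StandardScaler` on the training rows alone and reuse it for the test rows. The frames and column names below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up frames standing in for train and test
train_demo = pd.DataFrame({'area': [50.0, 80.0, 120.0], 'zone': ['RL', 'RM', 'RL']})
test_demo = pd.DataFrame({'area': [95.0, 60.0], 'zone': ['RL', 'FV']})

# One-hot encode jointly so both sets share the same columns
combined = pd.get_dummies(pd.concat([train_demo, test_demo]), dtype=float)
X_train_raw = combined.iloc[:len(train_demo)]
X_test_raw = combined.iloc[len(train_demo):]

# Fit the scaler on the training rows only, then apply it to both sets
scaler = StandardScaler().fit(X_train_raw)
X_train_demo = scaler.transform(X_train_raw)
X_test_demo = scaler.transform(X_test_raw)
print(X_train_demo.shape, X_test_demo.shape)
```

With this split, the means and variances used for scaling never depend on the rows the model will later be evaluated on.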

3. Modeling and prediction

3.1 Model selection

import numpy as np
from sklearn import preprocessing
from sklearn import linear_model, svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Use the nine features most correlated with SalePrice
cols = ['totalArea', 'OverallQual', 'GrLivArea', 'GarageCars',
       '1stFlrSF', 'YearBuilt', 'FullBath', 'YearRemodAdd', 'TotRmsAbvGrd']
x = train[cols].values
y = train['SalePrice'].values

# Standardize features and target, then hold out 33% of the rows for evaluation
x_scaled = preprocessing.StandardScaler().fit_transform(x)
y_scaled = preprocessing.StandardScaler().fit_transform(y.reshape(-1,1))
X_train,X_test, y_train, y_test = train_test_split(x_scaled, y_scaled, test_size=0.33, random_state=42)

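# Compare three models: support vector regression, random forest, and Bayesian ridge regression
clfs = {
        'svm':svm.SVR(), 
        'RandomForestRegressor':RandomForestRegressor(n_estimators=400),
        'BayesianRidge':linear_model.BayesianRidge()
       }
for clf in clfs:
    try:
        clfs[clf].fit(X_train, y_train)
        y_pred = clfs[clf].predict(X_test)
        # Caution: y_pred is 1-D and y_test is a column vector, so the subtraction
        # broadcasts to an n x n matrix, and positive and negative errors cancel in the sum
        print(clf + " cost:" + str(np.sum(y_pred-y_test)/len(y_pred)) )
    except Exception as e:
        print(clf + " Error:")
        print(str(e))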

svm cost:6.8075372464129
RandomForestRegressor cost:2.6014082689666065
BayesianRidge cost:-2.9514604668616684

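*Based on the results above, the ridge regression model is taken to be the best, so ridge regression is used for modeling and prediction.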
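In the loop above, `y_pred` is 1-D while `y_test` is a column vector, and the errors are summed with their signs, so the printed cost is hard to interpret as an error measure. For comparison, a root-mean-squared-error (RMSE) version could look like the sketch below. It runs on synthetic data from `make_regression` rather than the housing data, keeps the target 1-D, and uses `n_estimators=50` only to keep the run short; all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn import linear_model, svm
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the nine selected features and the target
X_demo, y_demo = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)
# ravel() keeps the scaled target 1-D so y_pred and y_test have matching shapes
y_demo = StandardScaler().fit_transform(y_demo.reshape(-1, 1)).ravel()

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

models = {
    'svm': svm.SVR(),
    'RandomForestRegressor': RandomForestRegressor(n_estimators=50, random_state=0),
    'BayesianRidge': linear_model.BayesianRidge(),
}
rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # RMSE: square the errors before averaging so they cannot cancel
    rmse[name] = float(np.sqrt(mean_squared_error(y_te, pred)))
    print(f'{name} RMSE: {rmse[name]:.4f}')
```

With RMSE, lower is always better, which makes the three models directly comparable.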

3.2 Training the model

from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hold out 20% of the training data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Candidate values for the regularization strength alpha
alphas = [.0001, .0003, .0005, .0007, .0009, .01, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50,100,200,300,500]

# Mean 10-fold cross-validated RMSE for each alpha
scores = [np.sqrt(-cross_val_score(Ridge(alpha=alpha), X_train, y_train, scoring="neg_mean_squared_error", cv = 10)).mean()
          for alpha in alphas]

# Plot the cross-validated RMSE against alpha
plt.plot(alphas, scores, label=Ridge.__name__)
plt.legend(loc='center')
plt.xlabel('alpha')
plt.ylabel('cross validation score')
plt.tight_layout()
plt.show()

# The plot shows the lowest RMSE at alpha = 50, so refit with alpha = 50 and compute the RMSE on the held-out split
reg = linear_model.Ridge(alpha=50)
reg.fit(X_train,y_train)
y_pred = reg.predict(X_test)
print('RMSE:',mean_squared_error(y_test,y_pred)**0.5)

[Figure: cross-validated RMSE versus alpha for ridge regression]

RMSE: 0.08428597203119031

The plot shows that the RMSE is lowest at alpha = 50, so 50 is chosen as alpha. The resulting RMSE lies between 0.08 and 0.1, which is a small error.
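Instead of reading the best alpha off the plot, it can be picked programmatically as the candidate with the lowest cross-validated RMSE. The sketch below shows this on synthetic data with a shorter candidate list and 5 folds (both choices are assumptions made to keep the example quick); the alpha it selects has no bearing on the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for X_train, y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=15.0, random_state=0)

alphas = [0.01, 0.1, 1, 10, 50, 100]
# Mean cross-validated RMSE for each candidate alpha
scores = [
    np.sqrt(-cross_val_score(Ridge(alpha=a), X_demo, y_demo,
                             scoring='neg_mean_squared_error', cv=5)).mean()
    for a in alphas
]

# The candidate with the smallest mean RMSE
best_alpha = alphas[int(np.argmin(scores))]
print('best alpha:', best_alpha)
```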

import xgboost as xgb

regr = xgb.XGBRegressor(
                 colsample_bytree=0.2,
                 gamma=0.0,
                 learning_rate=0.05,
                 max_depth=6,
                 min_child_weight=1.5,
                 n_estimators=7200,
                 reg_alpha=0.9,
                 reg_lambda=0.6,
                 subsample=0.2,
                 random_state=42,
                 verbosity=0)

regr.fit(X_train,y_train)

# Predict on the held-out split and compute the RMSE
y_pred = regr.predict(X_test)
print('RMSE:',mean_squared_error(y_test,y_pred)**0.5)

# Predict on the Kaggle test set and map the predictions back with expm1
y_pred_xgb = regr.predict(submission_data)
y_pred_xgb = np.expm1(y_pred_xgb)

RMSE: 0.10272477705432077

The XGBoost model gives an RMSE of about 0.1, close to that of the ridge regression model, so its error is also small.

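3.3 Prediction results

# Refit ridge regression on all training rows and predict on the Kaggle test set
reg = linear_model.Ridge(alpha=50)
reg.fit(X,y)
y_test_ridge = reg.predict(submission_data)
y_test_ridge = np.expm1(y_test_ridge)

# The final prediction is the mean of the ridge and XGBoost predictions
result = (y_test_ridge+y_pred_xgb)/2

# Plot above-ground living area against the predictions as a sanity check
plt.plot(np.expm1(test['GrLivArea']),result,'o')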

[<matplotlib.lines.Line2D at 0x21638b44820>]

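[Figure: scatter plot of above-ground living area against the predicted sale price]

The plot shows that the predicted price is positively correlated with above-ground living area, which is consistent with the feature analysis, so the predictions look reasonable.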
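Both models predict on a log scale, and the code maps each prediction back with `np.expm1` (the inverse of `np.log1p`) before averaging. The sketch below illustrates that order of operations with made-up prices and hypothetical prediction errors; none of the numbers come from the actual models.

```python
import numpy as np

# Made-up prices and their log1p-transformed values
prices = np.array([120000.0, 185000.0, 310000.0])
log_prices = np.log1p(prices)

# Hypothetical log-scale predictions from two models
pred_a = log_prices + np.array([0.05, -0.02, 0.01])
pred_b = log_prices + np.array([-0.03, 0.04, -0.01])

# Map each prediction back to the price scale first, then average
blend = (np.expm1(pred_a) + np.expm1(pred_b)) / 2
print(blend)
```

Averaging after `expm1` gives the arithmetic mean of the two price predictions; averaging the log-scale predictions first and then applying `expm1` would be a different blend.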

# Save the predictions as a submission file
my_submission = pd.DataFrame({'Id':test.index,'SalePrice': result})
my_submission.to_csv('ex4-submission.csv', index=False)