Day03_Pandas基础
一. pandas层次化索引
-
创建多层索引
-
隐式构造
- Series也可以创建多层索引
df = DataFrame(np.random.rand(4,2), index=[['a','a','b','b'],[1,2,1,2]], columns=['data1','data2']) df 输出: data1 data2 a 1 0.495897 0.130101 2 0.551563 0.450209 b 1 0.664648 0.701401 2 0.442188 0.500506 s = Series(np.random.rand(4),index = [['a','a','b','b'],[1,2,1,2]]) s 输出: a 1 0.192162 2 0.236828 b 1 0.560971 2 0.751398 dtype: float64 -
显示构造MultiIndex
-
创建MultiIndex
# 使用数组 Mindex = pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]]) Mindex 输出: MultiIndex(levels=[['a', 'b'], [1, 2]], labels=[[0, 0, 1, 1], [0, 1, 0, 1]]) # 使用tuple Mindex = pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)]) Mindex 输出: MultiIndex(levels=[['a', 'b'], [1, 2]], labels=[[0, 0, 1, 1], [0, 1, 0, 1]]) # 使用product Mindex = pd.MultiIndex.from_product([['a','b'],[1,2]]) Mindex 输出: MultiIndex(levels=[['a', 'b'], [1, 2]], labels=[[0, 0, 1, 1], [0, 1, 0, 1]]) -
人口统计
index = [('California',2000),('California',2010), ('New York',2000),('New York',2010), ('Texas',2000),('Texas',2010)] Mindex = pd.MultiIndex.from_tuples(index) Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]]) pop = DataFrame(populations,index=Mindex,columns = ['人口']) pop 输出: 人口 California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561
-
-
练习:
创建一个DataFrame,表示出张三李四期中期末各科成绩a = np.random.randint(0,150,(8,2)) Mindex = pd.MultiIndex.from_product([['语文','数学','英语','理综'],['期中','期末']]) ddd = DataFrame(data = a, index = Mindex, columns = ['张三','李四']) ddd 输出: 张三 李四 语文 期中 90 55 期末 59 20 数学 期中 132 85 期末 65 116 英语 期中 57 28 期末 73 145 理综 期中 142 48 期末 120 54
-
-
多层列索引
- 除了行索引index,列索引columns也能用同样的方法创建多层索引
a = np.random.randint(0,150,(2,8)) Mindex = pd.MultiIndex.from_product([['语文','数学','英语','理综'],['期中','期末']]) ddd = DataFrame(data = a, index = ['张三','李四'], columns = Mindex) ddd 输出: 语文 数学 英语 理综 期中 期末 期中 期末 期中 期末 期中 期末 张三 57 110 65 3 127 66 146 32 李四 73 24 57 19 35 133 57 127 -
多层索引对象的索引与切片操作
-
Series的操作
- 【重要】对于Series来说,直接中括号[]与使用.loc()完全一样,因此,推荐使用中括号索引和切片
index = [('California',2000),('California',2010), ('New York',2000),('New York',2010), ('Texas',2000),('Texas',2010)] index = [['California','California','New York','New York','Texas','Texas'],[2000,2010,2000,2010,2000,2010]] # index = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]]) #相同的效果 populations = [33871648,37253956, 18976457,19378102, 20851820,25145561] pop = Series(populations,index = index) pop 输出: California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 dtype: int64-
索引
pop['California'] 输出: 2000 33871648 2010 37253956 dtype: int64 -
切片
pop['California':'New York'] 输出: California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 dtype: int64 pop[::-1] 输出: Texas 2010 25145561 2000 20851820 New York 2010 19378102 2000 18976457 California 2010 37253956 2000 33871648 dtype: int64 pop[:,2010] 输出: California 37253956 New York 19378102 Texas 25145561 dtype: int64
-
DataFrame的操作
-
可以直接使用列名称来进行列索引
index = [('California',2000),('California',2010), ('New York',2000),('New York',2010), ('Texas',2000),('Texas',2010)] populations = [33871648,37253956, 18976457,19378102, 20851820,25145561] Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]]) pop = DataFrame(populations,index=Mindex,columns=['population']) pop 输出: population California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 pop['population'] 输出: California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 Name: population, dtype: int64 -
使用行索引需要用ix(),loc()等函数
【极其重要】推荐使用loc()函数pop.loc['California'] 输出: population 2000 33871648 2010 37253956 pop.loc[('California',2000)] 输出: population 33871648 Name: (California, 2000), dtype: int64 pop.loc['California',2000] 输出: population 33871648 Name: (California, 2000), dtype: int64 pop.loc['California',2000][0] 输出: 33871648 pop.loc['California'].loc[2000] 输出: population 33871648 Name: 2000, dtype: int64 pop.loc[('California',2000),'population'] 输出: 33871648 -
注意在对行索引的时候,若一级行索引还有多个,对二级行索引会遇到问题!也就是说,无法直接对二级索引进行索引,必须让二级索引变成一级索引后才能对其进行索引!
pop.loc['California':'New York',2000] 输出: TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2000] of <class 'int'>
-
-
-
索引的堆(unstack)
-
unstack()
-
【小技巧】使用unstack()的时候,level等于哪一个,哪一个就消失,出现在列里
populations = [33871648,37253956, 18976457,19378102, 20851820,25145561] Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]]) pop = DataFrame(populations,index=Mindex,columns=['population']) pop 输出: population California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 /////////////////////////////////////////////////////////////////////////////// pop.unstack(level=0) 输出: population California New York Texas 2000 33871648 18976457 20851820 2010 37253956 19378102 25145561 pop.unstack(level=1) 输出: population 2000 2010 California 33871648 37253956 New York 18976457 19378102 Texas 20851820 25145561
-
-
-
聚合操作
-
【注意】
需要指定level
【小技巧】和unstack()相反,聚合的时候,level等于哪一个,哪一个就保留populations = [33871648,37253956, 18976457,19378102, 20851820,25145561] Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]]) pop = DataFrame(populations,index=Mindex,columns=['population']) pop 输出: population California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 pop.mean(level=0) 输出: population California 35562802.0 New York 19177279.5 Texas 22998690.5 pop.mean(level=1) 输出: population 2000 2.456664e+07 2010 2.725921e+07 -
练习:
计算各个科目期中期末平均成绩
计算各科目张三李四的最高分df = DataFrame(np.random.randint(0,150,(6,3)), index = pd.MultiIndex.from_product([['张三','李四','王五'],['期中','期末']]), columns = ['语文','数学','编程']) display(df) print("计算各个科目期中期末平均成绩") display(df.mean(level = 1)) print("计算各科目张三李四的最高分") display(df.loc['张三':'李四'].max(level = 1)) 输出: 语文 数学 编程 张三 期中 5 0 136 期末 139 4 92 李四 期中 26 74 52 期末 51 105 18 王五 期中 117 34 51 期末 58 55 18 计算各个科目期中期末平均成绩 语文 数学 编程 期中 49.333333 36.000000 79.666667 期末 82.666667 54.666667 42.666667 计算各科目张三李四的最高分 语文 数学 编程 期中 26 74 136 期末 139 105 92
-
二. pandas拼接操作
-
级联:pd.concat, pd.append
-
回顾numpy的级联
x = np.array([1,2,3]).reshape(1,3) y = np.array([4,5,6]).reshape(1,3) z = np.array([7,8,9]).reshape(1,3) display(x,y,z) np.concatenate([x,y,z]) 输出: array([[1, 2, 3]]) array([[4, 5, 6]]) array([[7, 8, 9]]) Out[7]: array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) # 练习: # 生成2个3*3的矩阵,对其分别进行两个维度上的级联 x = np.random.randn(3,3) x 输出: array([[-0.09120671, -0.52299759, -0.16349379], [ 0.36666059, -1.3587184 , 0.81883706], [ 1.0621649 , 0.72906367, -0.41325065]]) np.concatenate([x,x]) 输出: array([[-0.09120671, -0.52299759, -0.16349379], [ 0.36666059, -1.3587184 , 0.81883706], [ 1.0621649 , 0.72906367, -0.41325065], [-0.09120671, -0.52299759, -0.16349379], [ 0.36666059, -1.3587184 , 0.81883706], [ 1.0621649 , 0.72906367, -0.41325065]]) np.concatenate([x,x],axis=1) 输出: array([[-0.09120671, -0.52299759, -0.16349379, -0.09120671, -0.52299759, -0.16349379], [ 0.36666059, -1.3587184 , 0.81883706, 0.36666059, -1.3587184 , 0.81883706], [ 1.0621649 , 0.72906367, -0.41325065, 1.0621649 , 0.72906367, -0.41325065]]) -
创建生成DataFrame的函数
def make_df(cols,ind): ''' 快速生成一个DataFrame ''' data = {c:[str(c)+str(i) for i in ind] for c in cols} return DataFrame(data,ind) 调用: make_df('ABC', range(3)) 输出: A B C 0 A0 B0 C0 1 A1 B1 C1 2 A2 B2 C2 -
使用pd.concat()级联
# pandas使用pd.concat函数,与np.concatenate函数类似,只是多了一些参数: pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)-
简单级联
df1 = make_df('AB',[1,2]) df2 = make_df('AB',[3,4]) display(df1,df2,pd.concat([df1,df2])) 输出: A B 1 A1 B1 2 A2 B2 A B 3 A3 B3 4 A4 B4 A B 1 A1 B1 2 A2 B2 3 A3 B3 4 A4 B4-
可以通过设置axis来改变级联方向
df3 = make_df('WX',[1,3]) df4 = make_df('QT',[5,8]) display(df3,df4,pd.concat([df3,df4],axis=1)) 输出: W X 1 W1 X1 3 W3 X3 Q T 5 Q5 T5 8 Q8 T8 W X Q T 1 W1 X1 NaN NaN 3 W3 X3 NaN NaN 5 NaN NaN Q5 T5 8 NaN NaN Q8 T8 -
注意index在级联时可以重复
x = make_df('AB',[0,1]) y = make_df('AB',[1,8]) pd.concat([x,y]) 输出: A B 0 A0 B0 1 A1 B1 1 A1 B1 8 A8 B8 可以选择忽略index,重新索引 x = make_df('AB',[0,1]) y = make_df('AB',[1,8]) pd.concat([x,y],ignore_index=True) 输出: A B 0 A0 B0 1 A1 B1 2 A4 B4 3 A8 B8 -
使用多层索引
display(pd.concat([x,y])) display(pd.concat([x,y],keys=['x','y'])) display(x,y) display(pd.concat([x,y],axis = 1)) display(pd.concat([x,y],axis = 1,keys = ['v','r'])) 输出: A B 0 A0 B0 1 A1 B1 1 A1 B1 8 A8 B8 A B x 0 A0 B0 1 A1 B1 y 1 A1 B1 8 A8 B8 A B 0 A0 B0 1 A1 B1 A B 1 A1 B1 8 A8 B8 A B A B 0 A0 B0 NaN NaN 1 A1 B1 A1 B1 8 NaN NaN A8 B8 v r A B A B 0 A0 B0 NaN NaN 1 A1 B1 A1 B1 8 NaN NaN A8 B8
-
-
不匹配级联
# 不匹配指的是级联的维度的索引不一致。例如纵向级联时列索引不一致,横向级联时行索引不一致-
外连接:补NaN(默认模式)
x = make_df('ABC',[1,2]) y = make_df('BCD',[3,4]) display(x,y,pd.concat([x,y])) 输出: A B C 1 A1 B1 C1 2 A2 B2 C2 B C D 3 B3 C3 D3 4 B4 C4 D4 A B C D 1 A1 B1 C1 NaN 2 A2 B2 C2 NaN 3 NaN B3 C3 D3 4 NaN B4 C4 D4 -
内连接:只连接匹配的项
x = make_df('ABC',[1,2]) y = make_df('BCD',[3,4]) display(x,y,pd.concat([x,y],join="inner")) 输出: A B C 1 A1 B1 C1 2 A2 B2 C2 B C D 3 B3 C3 D3 4 B4 C4 D4 B C 1 B1 C1 2 B2 C2 3 B3 C3 4 B4 C4 -
连接指定轴
x = make_df('ABC',[1,2]) y = make_df('BCD',[3,4]) display(x,y,pd.concat([x,y],join_axes=[x.columns])) 输出: A B C 1 A1 B1 C1 2 A2 B2 C2 B C D 3 B3 C3 D3 4 B4 C4 D4 A B C 1 A1 B1 C1 2 A2 B2 C2 3 NaN B3 C3 4 NaN B4 C4 -
练习:
假设【期末】考试ddd2的成绩没有张三的,只有李四、王老五、赵小六的,使用多种方法级联np.random.seed(15) a = np.random.randint(150,size=(5,3)) ddd2 = DataFrame(a,columns=["王五","李四","赵小六"], index = ["语文","数学","英语","理综","Python"]) np.random.seed(10) ddd = DataFrame(np.random.randint(150,size = (5,3)), columns = ['张三','李四','小莲'], index = ["语文","数学","英语","理综","Python"]) display(ddd2) display(ddd) 输出: 王五 李四 赵小六 语文 140 133 119 数学 128 85 118 英语 17 23 62 理综 10 143 105 Python 39 37 19 张三 李四 小莲 语文 9 125 15 数学 64 113 123 英语 113 8 73 理综 0 40 115 Python 16 100 139 df = pd.concat([ddd,ddd2],axis = 1,ignore_index=False) df pd.concat([ddd,ddd2],keys=["期中","期末"],axis = 0)
-
-
-
使用pd.appnd()级联
# 由于在后面级联的使用非常普遍,因此有一个函数append专门用于在后面添加 x = make_df('AB',np.random.randint(10,size = 2)) y = make_df('AB',np.random.randint(10,20,size = 2)) display(x.append(y)) 输出: A B 6 A6 B6 8 A8 B8 11 A11 B11 18 A18 B18-
练习:
新建一个只有张三李四王老五的期末考试成绩单ddd3,使用append()与期中考试成绩表ddd级联np.random.seed(15) a = np.random.randint(150,size=(5,3)) ddd2 = DataFrame(a,columns=["王五","李四","赵小六"], index = ["语文","数学","英语","理综","Python"]) np.random.seed(10) ddd = DataFrame(np.random.randint(150,size = (5,3)), columns = ['张三','李四','小莲'], index = ["语文","数学","英语","理综","Python"]) display(ddd2) display(ddd) 输出: 王五 李四 赵小六 语文 140 133 119 数学 128 85 118 英语 17 23 62 理综 10 143 105 Python 39 37 19 张三 李四 小莲 语文 9 125 15 数学 64 113 123 英语 113 8 73 理综 0 40 115 Python 16 100 139 df = pd.concat([ddd,ddd2],axis = 1,ignore_index=False) df pd.concat([ddd,ddd2],keys=["期中","期末"],axis = 0)
-
-
-
合并:pd.merge
merge与concat的区别在于,merge需要依据某一共同的行或列来进行合并 使用pd.merge()合并时,会自动根据两者相同column名称的那一列,作为key来进行合并。 注意每一列元素的顺序不要求一致-
一对一合并
import numpy as np import pandas as pd from pandas import Series,DataFrame df1 = DataFrame({'employee':['Bob','Jake','Lisa'], 'group':['Accounting','Engineering','Engineering'], }) df2 = DataFrame({'employee':['Lisa','Bob','Jake'], 'hire_date':[2004,2008,2012], }) display(df1,df2) 输出: employee group 0 Bob Accounting 1 Jake Engineering 2 Lisa Engineering employee hire_date 0 Lisa 2004 1 Bob 2008 2 Jake 2012 df3 = pd.merge(df1,df2) df3 输出: employee group hire_date 0 Bob Accounting 2008 1 Jake Engineering 2012 2 Lisa Engineering 2004 -
多对一合并
df3 = DataFrame({ 'employee':['Lisa','Jake'], 'group':['Accounting','Engineering'], 'hire_date':[2004,2016]}) df4 = DataFrame({'group':['Accounting','Engineering','Engineering'], 'supervisor':['Carly','Guido','Steve'] }) display(df3,df4,pd.merge(df3,df4)) 输出: employee group hire_date 0 Lisa Accounting 2004 2 Jake Engineering 2016 group supervisor 0 Accounting Carly 1 Engineering Guido 2 Engineering Steve employee group hire_date supervisor 0 Lisa Accounting 2004 Carly 1 Jake Engineering 2016 Guido 2 Jake Engineering 2016 Steve -
多对多合并
df1 = DataFrame({'employee':['Bob','Jake','Lisa'], 'group':['Accounting','Engineering','Engineering']}) df5 = DataFrame({'group':['Engineering','Engineering','HR'], 'supervisor':['Carly','Guido','Steve'] }) display(df1,df5,pd.merge(df1,df5))#多对多 display(pd.concat([df1,df5])) 输出: employee group 0 Bob Accounting 1 Jake Engineering 2 Lisa Engineering group supervisor 0 Engineering Carly 1 Engineering Guido 2 HR Steve employee group supervisor 0 Jake Engineering Carly 1 Jake Engineering Guido 2 Lisa Engineering Carly 3 Lisa Engineering Guido employee group supervisor 0 Bob Accounting NaN 1 Jake Engineering NaN 2 Lisa Engineering NaN 0 NaN Engineering Carly 1 NaN Engineering Guido 2 NaN HR Steve -
key的规范化
- 使用on=显式指定哪一列为key
df1 = DataFrame({'employee':['Jack',"Summer","Steve"], 'group':['Accounting','Finance','Marketing']}) df2 = DataFrame({'employee':['Jack','Bob',"Jake"], 'hire_date':[2003,2009,2012], 'group':['Accounting','sell','ceo']}) display(df1,df2,pd.merge(df1,df2),pd.merge(df1,df2,on = 'employee')) 输出: employee group 0 Jack Accounting 1 Summer Finance 2 Steve Marketing employee group hire_date 0 Jack Accounting 2003 1 Bob sell 2009 2 Jake ceo 2012 employee group hire_date 0 Jack Accounting 2003 employee group_x group_y hire_date 0 Jack Accounting Accounting 2003-
使用left_on和right_on指定左右两边的列作为key
df1 = DataFrame({'employee':['Bobs','Linda','Bill'], 'group':['Accounting','Product','Marketing'], 'hire_date':[1998,2017,2018]}) df5 = DataFrame({'name':['Lisa','Bobs','Bill'], 'hire_dates':[1998,2016,2007]}) display(df1,df5,pd.merge(df1,df5,left_on = 'employee',right_on = 'name')) 输出: employee group hire_date 0 Bobs Accounting 1998 1 Linda Product 2017 2 Bill Marketing 2018 hire_dates name 0 1998 Lisa 1 2016 Bobs 2 2007 Bill employee group hire_date hire_dates name 0 Bobs Accounting 1998 2016 Bobs 1 Bill Marketing 2018 2007 Bill -
练习:
假设有两份成绩单,除了ddd是张三李四王老五之外,还有ddd4是张三和赵小六的成绩单,如何合并?
如果ddd4中张三的名字被打错了,成为了张十三,怎么办?
自行练习多对一,多对多的情况a1 = np.array([[122,107],[21,14],[110,38],[128,111],[50,50]]) ddd1 = DataFrame(a1,columns=["张三","李四"], index = ["语文","数学","英语","理综","计算机"]) a2 = [[122,120],[21,25],[110,95],[128,61],[50,45]] ddd2 = DataFrame(np.random.randint(100,size = (5,3)),columns=["李四","赵小六",'张五'], index = ["语文","数学","英语","理综","计算机"]) display(ddd1,ddd2) pd.merge(ddd1,ddd2,left_index = True,right_index = True,on = ['李四']) 输出: 张三 李四 语文 122 107 数学 21 14 英语 110 38 理综 128 111 计算机 50 50 李四 赵小六 张五 语文 95 76 85 数学 89 15 4 英语 28 79 21 理综 99 58 28 计算机 21 5 58 Out[94]: 张三 李四 赵小六 张五 语文 122 107 76 85 数学 21 14 15 4 英语 110 38 79 21 理综 128 111 58 28 计算机 50 50 5 58
-
内合并与外合并
-
内合并:只保留两者都有的key(默认模式)
df6 = DataFrame({'name':['Peter','Paul','Mary'], 'food':['fish','beans','bread']} ) df7 = DataFrame({'name':['Mary','Joseph'], 'drink':['wine','beer']}) display(df6,df7,pd.merge(df6,df7)) 输出: food name 0 fish Peter 1 beans Paul 2 bread Mary drink name 0 wine Mary 1 beer Joseph food name drink 0 bread Mary winee -
外合并 how=‘outer’:补NaN
df6 = DataFrame({'name':['Peter','Paul','Mary'], 'food':['fish','beans','bread']} ) df7 = DataFrame({'name':['Mary','Joseph'], 'drink':['wine','beer']}) display(df6,df7,pd.merge(df6,df7,how='outer')) 输出: food name 0 fish Peter 1 beans Paul 2 bread Mary drink name 0 wine Mary 1 beer Joseph food name drink 0 fish Peter NaN 1 beans Paul NaN 2 bread Mary wine 3 NaN Joseph beer -
左合并、右合并:how=‘left’,how=‘right’
df6 = DataFrame({'name':['Peter','Paul','Mary'], 'food':['fish','beans','bread']} ) df7 = DataFrame({'name':['Mary','Joseph'], 'drink':['wine','beer']}) display(df6,df7,pd.merge(df6,df7,how='left')) display(df6,df7,pd.merge(df6,df7,how='right')) 输出: food name 0 fish Peter 1 beans Paul 2 bread Mary drink name 0 wine Mary 1 beer Joseph food name drink 0 fish Peter NaN 1 beans Paul NaN 2 bread Mary wine food name drink 0 bread Mary wine 1 NaN Joseph beer -
练习:
如果只有张三赵小六语数英三个科目的成绩,如何合并?
考虑应用情景,使用多种方式合并ddd与ddd4import numpy as np import pandas as pd from pandas import Series,DataFrame np.random.seed(10) df1 = DataFrame({'张三':[127,119,99,119,73],'李四':[125,64,123,99,139]}, index = ['语文','数学','英语','理综','编程'],columns = ['张三','李四']) df2 = DataFrame({'张三':[122,21,110], '赵小六':[78,96,83]}, index = ['语文','数学','英语'] ) display(df1,df2) pd.merge(df1,df2,how = 'left',on = '张三',left_index = True,right_index = True) 输出: 张三 李四 语文 127 125 数学 119 64 英语 99 123 理综 119 99 编程 73 139 张三 赵小六 语文 122 78 数学 21 96 英语 110 83 Out[56]: 张三 李四 赵小六 语文 127 125 78.0 数学 119 64 96.0 英语 99 123 83.0 理综 119 99 NaN 编程 73 139 NaN
-
-
列冲突的解决
-
当列冲突时,即有多个列名称相同时,需要使用on=来指定哪一个列作为key
df8 = DataFrame({'name':['Peter','Paul','Mary'], 'rank':[1,2,3]}) df9 = DataFrame({'name':['Peter','Paul','Mary'], 'rank':[5,6,7]}) display(df8,df9,pd.merge(df8,df9,on = 'name',suffixes=['rank_L','rank_R'])) 输出: name rank 0 Peter 1 1 Paul 2 2 Mary 3 name rank 0 Peter 5 1 Paul 6 2 Mary 7 name rankrank_L rankrank_R 0 Peter 1 5 1 Paul 2 6 2 Mary 3 7 -
练习:
假设有两个同学都叫李四,ddd5、ddd6都是张三和李四的成绩表,如何合并?np.random.seed(100) df4 = DataFrame(np.random.randint(150,size = (5,2)), columns = ['张三','李四'], index = ['语文','数学','英语','理综','计算机']) df5 = DataFrame({'张三':[8,67,87,128,50], '李四':[24,103,83,45,58]}, index = ['语文','数学','英语','理综','计算机'] ) display(df4,df5) pd.merge(df4,df5,left_index = True,right_index = True,suffixes=['-期中','-期末']) 输出: 张三 李四 语文 8 24 数学 67 103 英语 87 79 理综 138 94 计算机 98 53 张三 李四 语文 8 24 数学 67 103 英语 87 83 理综 128 45 计算机 50 58 张三-期中 李四-期中 张三-期末 李四-期末 语文 8 24 8 24 数学 67 103 67 103 英语 87 79 87 83 理综 138 94 128 45 计算机 98 53 50 58
-
-
三. pandas案例
美国各州人口数据分析
四. 回顾:Series/DataFrame运算与ndarray运算的区别
-
ndarray有广播,通过重复已有值来计算
a = np.array([1,2,3]) b = np.array([2]) display(a,b,a+b) # 输出: array([1, 2, 3]) array([2]) array([3, 4, 5]) -
Series与DataFrame没有广播,如果对应index没有值,则记为NaN;或者使用add的fill_value来补缺失值
# Series s = Series([1,2,3]) s2 = Series([2]) display(s,s+1,s+s2) display(s.add(s2,fill_value=0)) 输出: 0 1 1 2 2 3 dtype: int64 0 2 1 3 2 4 dtype: int64 0 3.0 1 NaN 2 NaN dtype: float64 0 3.0 1 2.0 2 3.0 dtype: float64 # DataFrame df = DataFrame({'A':[1,2,3], 'B':[4,5,6]}) df2 = DataFrame(np.random.randint(10,size = (3,2)),columns=['B','c']) display(df,df2,df+df2,df.add(df2,fill_value = 0),df2.add(df,fill_value = 0).astype(int)) 输出: A B 0 1 4 1 2 5 2 3 6 B c 0 8 4 1 7 0 2 2 8 A B c 0 NaN 12 NaN 1 NaN 12 NaN 2 NaN 8 NaN A B c 0 1.0 12 4.0 1 2.0 12 0.0 2 3.0 8 8.0 A B c 0 1 12 4 1 2 12 0 2 3 8 8
本文详细介绍了Pandas的基础知识,包括层次化索引的创建、多层列索引操作,以及数据拼接和合并。重点讨论了如何使用loc()进行索引,concat()和append()的级联操作,以及merge()的一对一、多对一和多对多合并。同时,提供了多个练习案例,如计算平均成绩、处理缺失数据和解决列冲突问题。

484

被折叠的 条评论
为什么被折叠?



