Day03_Pandas

最新推荐文章于 2025-02-07 22:26:20 发布

原创最新推荐文章于 2025-02-07 22:26:20 发布 · 1k 阅读

4 ·

本内容遵循CC 4.0 BY-SA版权协议

标签

#python #numpy #数据分析

本文详细介绍了Pandas的基础知识，包括层次化索引的创建、多层列索引操作，以及数据拼接和合并。重点讨论了如何使用loc()进行索引，concat()和append()的级联操作，以及merge()的一对一、多对一和多对多合并。同时，提供了多个练习案例，如计算平均成绩、处理缺失数据和解决列冲突问题。

Day03_Pandas基础

一. pandas层次化索引

创建多层索引

隐式构造

Series也可以创建多层索引

df = DataFrame(np.random.rand(4,2),
               index=[['a','a','b','b'],[1,2,1,2]],
              columns=['data1','data2'])
df

输出：
		  data1		  data2
a	1	0.495897	0.130101
	2	0.551563	0.450209
b	1	0.664648	0.701401
	2	0.442188	0.500506
	

s = Series(np.random.rand(4),index = [['a','a','b','b'],[1,2,1,2]])
s

输出：
a  1    0.192162
    2    0.236828
b  1    0.560971
    2    0.751398
dtype: float64

显示构造MultiIndex

创建MultiIndex

# 使用数组
Mindex = pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
Mindex

输出：
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
           
      
# 使用tuple
Mindex = pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
Mindex

输出：
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
           

# 使用product
Mindex = pd.MultiIndex.from_product([['a','b'],[1,2]])
Mindex

输出：
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

人口统计

index = [('California',2000),('California',2010),
        ('New York',2000),('New York',2010),
        ('Texas',2000),('Texas',2010)]
Mindex = pd.MultiIndex.from_tuples(index)

Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])

pop = DataFrame(populations,index=Mindex,columns = ['人口'])
pop

输出：
						人口
California	2000	33871648
			2010	37253956
New York	2000	18976457
			2010	19378102
Texas	2000	20851820
		2010	25145561

练习：
创建一个DataFrame，表示出张三李四期中期末各科成绩

a = np.random.randint(0,150,(8,2))
Mindex = pd.MultiIndex.from_product([['语文','数学','英语','理综'],['期中','期末']])
ddd = DataFrame(data = a,
                index = Mindex,
                columns = ['张三','李四'])
ddd

输出：
		  张三	李四
语文	期中	90		55
	 期末	 59		 20
数学	期中	132 	85
	  期末 65		 116
英语	期中	57		28
	 期末	 73		 145
理综	期中	142 	48
	 期末	 120	 54

多层列索引

除了行索引index，列索引columns也能用同样的方法创建多层索引

a = np.random.randint(0,150,(2,8))
Mindex = pd.MultiIndex.from_product([['语文','数学','英语','理综'],['期中','期末']])
ddd = DataFrame(data = a,
                index = ['张三','李四'],
                columns = Mindex)
ddd

输出：
语文	数学	英语	理综
期中	期末	期中	期末	期中	期末	期中	期末
张三	57	110	65	3	127	66	146	32
李四	73	24	57	19	35	133	57	127

多层索引对象的索引与切片操作

Series的操作

【重要】对于Series来说，直接中括号[]与使用.loc()完全一样，因此，推荐使用中括号索引和切片

index = [('California',2000),('California',2010),
        ('New York',2000),('New York',2010),
        ('Texas',2000),('Texas',2010)]
index = [['California','California','New York','New York','Texas','Texas'],[2000,2010,2000,2010,2000,2010]]
# index = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]]) #相同的效果
populations = [33871648,37253956,
              18976457,19378102,
              20851820,25145561]
pop = Series(populations,index = index)
pop

输出：
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

索引

pop['California']

输出：
2000    33871648
2010    37253956
dtype: int64

切片

pop['California':'New York']

输出：
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64


pop[::-1]

输出：
Texas       2010    25145561
            2000    20851820
New York    2010    19378102
            2000    18976457
California  2010    37253956
            2000    33871648
dtype: int64


pop[:,2010]

输出：
California    37253956
New York      19378102
Texas         25145561
dtype: int64

DataFrame的操作

可以直接使用列名称来进行列索引

index = [('California',2000),('California',2010),
        ('New York',2000),('New York',2010),
        ('Texas',2000),('Texas',2010)]
populations = [33871648,37253956,
              18976457,19378102,
              20851820,25145561]
Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])
pop = DataFrame(populations,index=Mindex,columns=['population'])
pop

输出：
population
California	2000	33871648
2010	37253956
New York	2000	18976457
2010	19378102
Texas	2000	20851820
2010	25145561

pop['population']

输出：
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
Name: population, dtype: int64

使用行索引需要用ix()，loc()等函数
【极其重要】推荐使用loc()函数

pop.loc['California']

输出：
population
2000	33871648
2010	37253956

pop.loc[('California',2000)]

输出：
population    33871648
Name: (California, 2000), dtype: int64

pop.loc['California',2000]

输出：
population    33871648
Name: (California, 2000), dtype: int64

pop.loc['California',2000][0]

输出：
33871648

pop.loc['California'].loc[2000]

输出：
population    33871648
Name: 2000, dtype: int64

pop.loc[('California',2000),'population']

输出：
33871648

注意在对行索引的时候，若一级行索引还有多个，对二级行索引会遇到问题！也就是说，无法直接对二级索引进行索引，必须让二级索引变成一级索引后才能对其进行索引！
```
pop.loc['California':'New York',2000]

输出：
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2000] of <class 'int'>
```

索引的堆（unstack）

unstack()

【小技巧】使用unstack()的时候，level等于哪一个，哪一个就消失，出现在列里

populations = [33871648,37253956,
              18976457,19378102,
              20851820,25145561]
Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])
pop = DataFrame(populations,index=Mindex,columns=['population'])
pop

输出：
							population
California	2000		33871648
				2010		37253956
New York	2000		18976457
				2010		19378102
Texas		2000		20851820
				2010		25145561

///////////////////////////////////////////////////////////////////////////////
pop.unstack(level=0)

输出：
		population
		California		New York		Texas
2000	33871648		18976457		20851820
2010	37253956		19378102		25145561

pop.unstack(level=1)

输出：
					population
						2000				2010
California		33871648		37253956
New York		18976457		19378102
Texas			20851820		25145561

聚合操作

【注意】
需要指定level
【小技巧】和unstack()相反，聚合的时候，level等于哪一个，哪一个就保留

populations = [33871648,37253956,
              18976457,19378102,
              20851820,25145561]
Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])
pop = DataFrame(populations,index=Mindex,columns=['population'])
pop

输出：
							population
California	2000			33871648
				2010			37253956
New York	2000			18976457
				2010			19378102
Texas		2000			20851820
				2010			25145561


pop.mean(level=0)

输出：
					population
California		35562802.0
New York		19177279.5
Texas			22998690.5

pop.mean(level=1)

输出：
				population
2000			2.456664e+07
2010			2.725921e+07

练习：
计算各个科目期中期末平均成绩
计算各科目张三李四的最高分

df = DataFrame(np.random.randint(0,150,(6,3)),
               index = pd.MultiIndex.from_product([['张三','李四','王五'],['期中','期末']]),
              columns = ['语文','数学','编程'])
display(df)
print("计算各个科目期中期末平均成绩")
display(df.mean(level = 1))
print("计算各科目张三李四的最高分")
display(df.loc['张三':'李四'].max(level = 1))

输出：
					语文	数学	编程
张三	期中		5		0		136
		期末		139	4		92
李四	期中		26		74		52
		期末		51		105	18
王五	期中		117	34		51
		期末		58		55		18

计算各个科目期中期末平均成绩

		语文				数学			    编程
期中	49.333333		36.000000		79.666667
期末	82.666667		54.666667		42.666667

计算各科目张三李四的最高分

		语文		数学		编程
期中		26		74		136
期末		139		105		92

二. pandas拼接操作

级联：pd.concat, pd.append

回顾numpy的级联

x = np.array([1,2,3]).reshape(1,3)
y = np.array([4,5,6]).reshape(1,3)
z = np.array([7,8,9]).reshape(1,3)
display(x,y,z)
np.concatenate([x,y,z])

输出：
array([[1, 2, 3]])
array([[4, 5, 6]])
array([[7, 8, 9]])
Out[7]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])


# 练习：
# 生成2个3*3的矩阵，对其分别进行两个维度上的级联
x = np.random.randn(3,3)
x

输出：
array([[-0.09120671, -0.52299759, -0.16349379],
       [ 0.36666059, -1.3587184 ,  0.81883706],
       [ 1.0621649 ,  0.72906367, -0.41325065]])

np.concatenate([x,x])

输出：
array([[-0.09120671, -0.52299759, -0.16349379],
       [ 0.36666059, -1.3587184 ,  0.81883706],
       [ 1.0621649 ,  0.72906367, -0.41325065],
       [-0.09120671, -0.52299759, -0.16349379],
       [ 0.36666059, -1.3587184 ,  0.81883706],
       [ 1.0621649 ,  0.72906367, -0.41325065]])


np.concatenate([x,x],axis=1)

输出：
array([[-0.09120671, -0.52299759, -0.16349379, -0.09120671, -0.52299759,
        -0.16349379],
       [ 0.36666059, -1.3587184 ,  0.81883706,  0.36666059, -1.3587184 ,
         0.81883706],
       [ 1.0621649 ,  0.72906367, -0.41325065,  1.0621649 ,  0.72906367,
        -0.41325065]])

创建生成DataFrame的函数

def make_df(cols,ind):
    '''
    快速生成一个DataFrame
    '''
    data = {c:[str(c)+str(i) for i in ind] for c in cols}
    return DataFrame(data,ind)


调用：
make_df('ABC', range(3))

输出：

    A   B	  C
0	A0 	B0	  C0
1	A1	B1	  C1
2	A2 	B2	  C2

使用pd.concat()级联

# pandas使用pd.concat函数，与np.concatenate函数类似，只是多了一些参数：
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)

简单级联

df1 = make_df('AB',[1,2])
df2 = make_df('AB',[3,4])
display(df1,df2,pd.concat([df1,df2]))

输出：

	A	B
1	A1	B1
2	A2	B2
A	B
3	A3	B3
4	A4	B4
A	B
1	A1	B1
2	A2	B2
3	A3	B3
4	A4	B4

可以通过设置axis来改变级联方向

df3 = make_df('WX',[1,3])
df4 = make_df('QT',[5,8])
display(df3,df4,pd.concat([df3,df4],axis=1))

输出：

	W	X
1	W1	X1
3	W3	X3
	Q	T
5	Q5	T5
8	Q8	T8
	W	X	Q	T
1	W1	X1	NaN	NaN
3	W3	X3	NaN	NaN
5	NaN	NaN	Q5	T5
8	NaN	NaN	Q8	T8

注意index在级联时可以重复

x = make_df('AB',[0,1])
y = make_df('AB',[1,8])
pd.concat([x,y])

输出：

    A	  B
0	A0	 B0
1	A1 	 B1
1	A1   B1
8	A8	 B8

可以选择忽略index，重新索引
x = make_df('AB',[0,1])
y = make_df('AB',[1,8])
pd.concat([x,y],ignore_index=True)

输出：
 

    A	  B
0	A0	 B0
1	A1	 B1
2	A4	 B4
3	A8	 B8

使用多层索引

display(pd.concat([x,y]))
display(pd.concat([x,y],keys=['x','y']))
display(x,y)
display(pd.concat([x,y],axis = 1))
display(pd.concat([x,y],axis = 1,keys = ['v','r']))

输出：

	A	B
0	A0	B0
1	A1	B1
1	A1	B1
8	A8	B8


		A	B
x	0	A0	B0
	1	A1	B1
y	1	A1	B1
	8	A8	B8


	A	B
0	A0	B0
1	A1	B1
	A	B
1	A1	B1
8	A8	B8

	A	B		A	B
0	A0	B0		NaN	NaN
1	A1	B1		A1	B1
8	NaN	NaN	   A8	B8

	v				r
	A	B		A	B
0	A0	B0		NaN	NaN
1	A1	B1		A1	B1
8	NaN	NaN 	A8	B8

不匹配级联

# 不匹配指的是级联的维度的索引不一致。例如纵向级联时列索引不一致，横向级联时行索引不一致

外连接：补NaN（默认模式）

x = make_df('ABC',[1,2])
y = make_df('BCD',[3,4])
display(x,y,pd.concat([x,y]))

输出：
A	B	C
1	A1	B1	C1
2	A2	B2	C2
B	C	D
3	B3	C3	D3
4	B4	C4	D4
A	B	C	D
1	A1	B1	C1	NaN
2	A2	B2	C2	NaN
3	NaN	B3	C3	D3
4	NaN	B4	C4	D4

内连接：只连接匹配的项

x = make_df('ABC',[1,2])
y = make_df('BCD',[3,4])
display(x,y,pd.concat([x,y],join="inner"))
输出：
		A	B		C
1		A1	B1		C1
2		A2	B2		C2

		B		C		D
3		B3		C3		D3
4		B4		C4		D4

	B		C
1	B1		C1
2	B2		C2
3	B3		C3
4	B4		C4

连接指定轴

x = make_df('ABC',[1,2])
y = make_df('BCD',[3,4])
display(x,y,pd.concat([x,y],join_axes=[x.columns]))

输出：
	A	B	    C
1	A1	B1	    C1
2	A2	B2 	C2
	B	 C	D
3	B3	 C3	D3
4	B4	 C4	D4

	A   B	 C
1	A1	B1	 C1
2	A2	B2	 C2
3	NaN	B3	 C3
4	NaN	B4	 C4

练习：
假设【期末】考试ddd2的成绩没有张三的，只有李四、王老五、赵小六的，使用多种方法级联

np.random.seed(15)
a = np.random.randint(150,size=(5,3))
ddd2 = DataFrame(a,columns=["王五","李四","赵小六"],
                index = ["语文","数学","英语","理综","Python"])
np.random.seed(10)
ddd = DataFrame(np.random.randint(150,size = (5,3)),
                                  columns = ['张三','李四','小莲'],
                                  index = ["语文","数学","英语","理综","Python"])
display(ddd2)
display(ddd)

输出：
王五	李四	赵小六
语文	140	133	119
数学	128	85	118
英语	17	23	62
理综	10	143	105
Python	39	37	19
张三	李四	小莲
语文	9	125	15
数学	64	113	123
英语	113	8	73
理综	0	40	115
Python	16	100	139

df = pd.concat([ddd,ddd2],axis = 1,ignore_index=False)
df

pd.concat([ddd,ddd2],keys=["期中","期末"],axis = 0)

使用pd.appnd()级联

# 由于在后面级联的使用非常普遍，因此有一个函数append专门用于在后面添加

x = make_df('AB',np.random.randint(10,size = 2))
y = make_df('AB',np.random.randint(10,20,size = 2))
display(x.append(y))

输出：

	  A		     B
6	  A6		  B6
8	  A8		  B8
11   A11	      B11
18	 A18	      B18

练习：
新建一个只有张三李四王老五的期末考试成绩单ddd3，使用append()与期中考试成绩表ddd级联

np.random.seed(15)
a = np.random.randint(150,size=(5,3))
ddd2 = DataFrame(a,columns=["王五","李四","赵小六"],
                index = ["语文","数学","英语","理综","Python"])
np.random.seed(10)
ddd = DataFrame(np.random.randint(150,size = (5,3)),
                                  columns = ['张三','李四','小莲'],
                                  index = ["语文","数学","英语","理综","Python"])
display(ddd2)
display(ddd)

输出：
王五	李四	赵小六
语文	140	133	119
数学	128	85	118
英语	17	23	62
理综	10	143	105
Python	39	37	19
张三	李四	小莲
语文	9	125	15
数学	64	113	123
英语	113	8	73
理综	0	40	115
Python	16	100	139

df = pd.concat([ddd,ddd2],axis = 1,ignore_index=False)
df

pd.concat([ddd,ddd2],keys=["期中","期末"],axis = 0)

合并：pd.merge

merge与concat的区别在于，merge需要依据某一共同的行或列来进行合并
使用pd.merge()合并时，会自动根据两者相同column名称的那一列，作为key来进行合并。
注意每一列元素的顺序不要求一致

一对一合并

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                'group':['Accounting','Engineering','Engineering'],
                })
df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                'hire_date':[2004,2008,2012],
                })
display(df1,df2)

输出：

 	employee	group
0	Bob		Accounting
1	Jake	Engineering
2	Lisa	Engineering


	employee	hire_date
0	Lisa	2004
1	Bob		2008
2	Jake	2012


df3 = pd.merge(df1,df2)
df3

输出：
	employee	group	hire_date
0	Bob		Accounting	2008
1	Jake	Engineering	2012
2	Lisa	Engineering	2004

多对一合并

df3 = DataFrame({
    'employee':['Lisa','Jake'],
    'group':['Accounting','Engineering'],
    'hire_date':[2004,2016]})
df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                       'supervisor':['Carly','Guido','Steve']
                })
display(df3,df4,pd.merge(df3,df4))

输出：
	employee		group		hire_date
0		Lisa		Accounting		2004
2	 	Jake		Engineering		2016

		group			supervisor
0		Accounting		Carly
1		Engineering		Guido
2		Engineering		Steve

	employee  group	  hire_date	 supervisor
0	Lisa	Accounting	2004	Carly
1	Jake	Engineering	2016	Guido
2	Jake	Engineering	2016	Steve

多对多合并

df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                 'group':['Accounting','Engineering','Engineering']})
df5 = DataFrame({'group':['Engineering','Engineering','HR'],
                'supervisor':['Carly','Guido','Steve']
                })
display(df1,df5,pd.merge(df1,df5))#多对多
display(pd.concat([df1,df5]))

输出：
	employee	group
0	Bob			Accounting
1	Jake		Engineering
2	Lisa		Engineering

	group		supervisor
0	Engineering	Carly
1	Engineering	Guido
2		HR		Steve

employee	group	supervisor
0	Jake	Engineering	Carly
1	Jake	Engineering	Guido
2	Lisa	Engineering	Carly
3	Lisa	Engineering	Guido


   	employee		group		supervisor
0		Bob			Accounting		NaN
1		Jake		Engineering		NaN
2		Lisa		Engineering		NaN
0		NaN			Engineering		Carly
1		NaN			Engineering		Guido
2		NaN				HR			Steve

key的规范化

使用on=显式指定哪一列为key

df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
                 'group':['Accounting','Finance','Marketing']})

df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
                 'hire_date':[2003,2009,2012],
                'group':['Accounting','sell','ceo']})
display(df1,df2,pd.merge(df1,df2),pd.merge(df1,df2,on = 'employee'))

输出：

employee	group
0	Jack	Accounting
1	Summer	Finance
2	Steve	Marketing


employee	group	hire_date
0	Jack	Accounting	2003
1	Bob	sell	2009
2	Jake	ceo	2012

employee	group	hire_date
0	Jack	Accounting	2003

	employee	 group_x	group_y	hire_date
0	Jack 		Accounting	Accounting	2003

使用left_on和right_on指定左右两边的列作为key

df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                'group':['Accounting','Product','Marketing'],
               'hire_date':[1998,2017,2018]})
df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
                'hire_dates':[1998,2016,2007]})

display(df1,df5,pd.merge(df1,df5,left_on = 'employee',right_on = 'name'))


输出：
	employee	group		hire_date
0		Bobs	Accounting	1998
1		Linda	Product			2017
2		Bill	Marketing			2018

 hire_dates	 	name
0	1998			Lisa
1	2016			Bobs
2	2007			Bill


employee	group	hire_date	hire_dates	name
0	Bobs	Accounting	1998	2016	Bobs
1	Bill	Marketing	2018	2007	Bill

练习：
假设有两份成绩单，除了ddd是张三李四王老五之外，还有ddd4是张三和赵小六的成绩单，如何合并？
如果ddd4中张三的名字被打错了，成为了张十三，怎么办？
自行练习多对一，多对多的情况

a1 = np.array([[122,107],[21,14],[110,38],[128,111],[50,50]])
ddd1 = DataFrame(a1,columns=["张三","李四"],
                index = ["语文","数学","英语","理综","计算机"])

a2 = [[122,120],[21,25],[110,95],[128,61],[50,45]]
ddd2 = DataFrame(np.random.randint(100,size = (5,3)),columns=["李四","赵小六",'张五'],
                index = ["语文","数学","英语","理综","计算机"])
display(ddd1,ddd2)
pd.merge(ddd1,ddd2,left_index = True,right_index = True,on = ['李四'])

输出：
		张三		李四
语文		122		107
数学		21		14
英语		110		38
理综		128		111
计算机		50		50

		李四		 赵小六		张五
语文		95			76			85
数学		89			15			4
英语		28			79			21
理综		99			58			28
计算机	   21			5			58


Out[94]:
		张三		李四	  赵小六	   张五
语文		122		107		76			85  
数学		21		14		15			4    
英语		110		38		79			21  
理综		128		111		58			28  
计算机	   50		50		5			58

内合并与外合并

内合并：只保留两者都有的key（默认模式）

df6 = DataFrame({'name':['Peter','Paul','Mary'],
               'food':['fish','beans','bread']}
               )
df7 = DataFrame({'name':['Mary','Joseph'],
                'drink':['wine','beer']})
display(df6,df7,pd.merge(df6,df7))

输出：

	food	name
0	fish	Peter
1	beans	Paul
2	bread	Mary

	drink	name
0	wine	Mary
1	beer	Joseph

	food	name	drink
0	bread	Mary	winee

外合并 how=‘outer’：补NaN

df6 = DataFrame({'name':['Peter','Paul','Mary'],
               'food':['fish','beans','bread']}
               )
df7 = DataFrame({'name':['Mary','Joseph'],
                'drink':['wine','beer']})
display(df6,df7,pd.merge(df6,df7,how='outer'))


输出：
	food	name
0	fish	Peter
1	beans	Paul
2	bread	Mary

	drink	name
0	wine	Mary
1	beer	Joseph

 	food	   name	       drink 
0	fish		Peter		NaN  
1	beans		Paul		NaN  
2	bread		Mary	 	wine 
3	NaN		  Joseph		beer

左合并、右合并：how=‘left’，how=‘right’

df6 = DataFrame({'name':['Peter','Paul','Mary'],
               'food':['fish','beans','bread']}
               )
df7 = DataFrame({'name':['Mary','Joseph'],
                'drink':['wine','beer']})
display(df6,df7,pd.merge(df6,df7,how='left'))
display(df6,df7,pd.merge(df6,df7,how='right'))

输出：
	food	name
0	fish	Peter
1	beans	Paul
2	bread	Mary

	drink	name
0	wine	Mary
1	beer	Joseph

	food	name	drink	
0	fish	Peter	NaN		
1	beans	Paul	NaN		
2	bread	Mary	wine		


	food	name	drink	
0	bread	Mary	wine		
1	NaN		Joseph	beer

练习：
如果只有张三赵小六语数英三个科目的成绩，如何合并？
考虑应用情景，使用多种方式合并ddd与ddd4

import numpy as np
import pandas as pd
from pandas import Series,DataFrame
np.random.seed(10)
df1 = DataFrame({'张三':[127,119,99,119,73],'李四':[125,64,123,99,139]},
                index = ['语文','数学','英语','理综','编程'],columns = ['张三','李四'])

df2 = DataFrame({'张三':[122,21,110],
                 '赵小六':[78,96,83]},
                 index = ['语文','数学','英语']
                )
display(df1,df2)
pd.merge(df1,df2,how = 'left',on = '张三',left_index = True,right_index = True)
输出：
			张三		李四
语文		127		  125
数学		119		  64
英语		99			123
理综		119		   99
编程		73			139

			张三		赵小六
语文		122		    78
数学		21			96
英语		110		    83

Out[56]:
		张三	  李四	赵小六
语文		127		125		78.0
数学		119		64		96.0
英语		99		123		83.0
理综		119		99		NaN
编程		73		139		NaN

列冲突的解决

当列冲突时，即有多个列名称相同时，需要使用on=来指定哪一个列作为key

df8 = DataFrame({'name':['Peter','Paul','Mary'],
                'rank':[1,2,3]})
df9 = DataFrame({'name':['Peter','Paul','Mary'],
                'rank':[5,6,7]})
display(df8,df9,pd.merge(df8,df9,on = 'name',suffixes=['rank_L','rank_R']))

输出：
	name		rank
0	Peter		1
1	Paul		2
2	Mary		3
	name	rank
0	Peter		5
1	Paul		6
2	Mary		7

	name	rankrank_L	rankrank_R
0	Peter		1			5			
1	Paul		2			6			
2	Mary		3			7

练习：
假设有两个同学都叫李四，ddd5、ddd6都是张三和李四的成绩表，如何合并？

np.random.seed(100)
df4 = DataFrame(np.random.randint(150,size = (5,2)),
               columns = ['张三','李四'],
               index = ['语文','数学','英语','理综','计算机'])
df5 = DataFrame({'张三':[8,67,87,128,50],
                 '李四':[24,103,83,45,58]},
                 index = ['语文','数学','英语','理综','计算机']
                )
display(df4,df5)
pd.merge(df4,df5,left_index = True,right_index = True,suffixes=['-期中','-期末'])


输出：
		张三  	李四
语文		8		24
数学		67		103
英语		87		79
理综		138	    94
计算机	   98		53

		张三		李四
语文		8		24
数学		67		103
英语		87		83
理综		128	45
计算机  	50		58


	张三-期中	李四-期中	张三-期末	李四-期末
语文	8	24	8	24
数学	67	103	67	103
英语	87	79	87	83
理综	138	94	128	45
计算机	98	53	50	58

三. pandas案例

美国各州人口数据分析

四. 回顾：Series/DataFrame运算与ndarray运算的区别

ndarray有广播，通过重复已有值来计算

a = np.array([1,2,3])
b = np.array([2])
display(a,b,a+b)


# 输出：
array([1, 2, 3])
array([2])
array([3, 4, 5])

Series与DataFrame没有广播，如果对应index没有值，则记为NaN；或者使用add的fill_value来补缺失值

# Series
s = Series([1,2,3])
s2 = Series([2])
display(s,s+1,s+s2)
display(s.add(s2,fill_value=0))

输出：
0    1
1    2
2    3
dtype: int64
0    2
1    3
2    4
dtype: int64
0    3.0
1    NaN
2    NaN
dtype: float64
0    3.0
1    2.0
2    3.0
dtype: float64

    
    
    
# DataFrame
df = DataFrame({'A':[1,2,3],
               'B':[4,5,6]})
df2 = DataFrame(np.random.randint(10,size = (3,2)),columns=['B','c'])
display(df,df2,df+df2,df.add(df2,fill_value = 0),df2.add(df,fill_value = 0).astype(int))


输出：
	A	B
0	1	4
1	2	5
2	3	6
	B	c
0	8	4
1	7	0
2	2	8

	A		B			c
0	NaN	12		NaN
1	NaN	12		NaN
2	NaN	 8		NaN

	A		B		c
0	1.0	12		4.0
1	2.0	12		0.0
2	3.0	8		8.0

	A		B		c
0	1		12		4
1	2		12		0
2	3		8		8