Day03_Pandas

本文详细介绍了Pandas的基础知识,包括层次化索引的创建、多层列索引操作,以及数据拼接和合并。重点讨论了如何使用loc()进行索引,concat()和append()的级联操作,以及merge()的一对一、多对一和多对多合并。同时,提供了多个练习案例,如计算平均成绩、处理缺失数据和解决列冲突问题。

Day03_Pandas基础

一. pandas层次化索引

  1. 创建多层索引

    • 隐式构造

      • Series也可以创建多层索引
      df = DataFrame(np.random.rand(4,2),
                     index=[['a','a','b','b'],[1,2,1,2]],
                    columns=['data1','data2'])
      df
      
      输出:
      		  data1		  data2
      a	1	0.495897	0.130101
      	2	0.551563	0.450209
      b	1	0.664648	0.701401
      	2	0.442188	0.500506
      	
      
      s = Series(np.random.rand(4),index = [['a','a','b','b'],[1,2,1,2]])
      s
      
      输出:
      a  1    0.192162
          2    0.236828
      b  1    0.560971
          2    0.751398
      dtype: float64
      
    • 显示构造MultiIndex

      • 创建MultiIndex

        # 使用数组
        Mindex = pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
        Mindex
        
        输出:
        MultiIndex(levels=[['a', 'b'], [1, 2]],
                   labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
                   
              
        # 使用tuple
        Mindex = pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
        Mindex
        
        输出:
        MultiIndex(levels=[['a', 'b'], [1, 2]],
                   labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
                   
        
        # 使用product
        Mindex = pd.MultiIndex.from_product([['a','b'],[1,2]])
        Mindex
        
        输出:
        MultiIndex(levels=[['a', 'b'], [1, 2]],
                   labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
        
      • 人口统计

        index = [('California',2000),('California',2010),
                ('New York',2000),('New York',2010),
                ('Texas',2000),('Texas',2010)]
        Mindex = pd.MultiIndex.from_tuples(index)
        
        Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])
        
        pop = DataFrame(populations,index=Mindex,columns = ['人口'])
        pop
        
        输出:
        						人口
        California	2000	33871648
        			2010	37253956
        New York	2000	18976457
        			2010	19378102
        Texas	2000	20851820
        		2010	25145561
        
    • 练习:
      创建一个DataFrame,表示出张三李四期中期末各科成绩

      a = np.random.randint(0,150,(8,2))
      Mindex = pd.MultiIndex.from_product([['语文','数学','英语','理综'],['期中','期末']])
      ddd = DataFrame(data = a,
                      index = Mindex,
                      columns = ['张三','李四'])
      ddd
      
      输出:
      		  张三	李四
      语文	期中	90		55
      	 期末	 59		 20
      数学	期中	132 	85
      	  期末 65		 116
      英语	期中	57		28
      	 期末	 73		 145
      理综	期中	142 	48
      	 期末	 120	 54
      
  2. 多层列索引

    • 除了行索引index,列索引columns也能用同样的方法创建多层索引
    a = np.random.randint(0,150,(2,8))
    Mindex = pd.MultiIndex.from_product([['语文','数学','英语','理综'],['期中','期末']])
    ddd = DataFrame(data = a,
                    index = ['张三','李四'],
                    columns = Mindex)
    ddd
    
    输出:
    语文	数学	英语	理综
    期中	期末	期中	期末	期中	期末	期中	期末
    张三	57	110	65	3	127	66	146	32
    李四	73	24	57	19	35	133	57	127
    
  3. 多层索引对象的索引与切片操作

    • Series的操作

      • 【重要】对于Series来说,直接中括号[]与使用.loc()完全一样,因此,推荐使用中括号索引和切片
      index = [('California',2000),('California',2010),
              ('New York',2000),('New York',2010),
              ('Texas',2000),('Texas',2010)]
      index = [['California','California','New York','New York','Texas','Texas'],[2000,2010,2000,2010,2000,2010]]
      # index = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]]) #相同的效果
      populations = [33871648,37253956,
                    18976457,19378102,
                    20851820,25145561]
      pop = Series(populations,index = index)
      pop
      
      输出:
      California  2000    33871648
                  2010    37253956
      New York    2000    18976457
                  2010    19378102
      Texas       2000    20851820
                  2010    25145561
      dtype: int64
      
      • 索引

        pop['California']
        
        输出:
        2000    33871648
        2010    37253956
        dtype: int64
        
      • 切片

        pop['California':'New York']
        
        输出:
        California  2000    33871648
                    2010    37253956
        New York    2000    18976457
                    2010    19378102
        dtype: int64
        
        
        pop[::-1]
        
        输出:
        Texas       2010    25145561
                    2000    20851820
        New York    2010    19378102
                    2000    18976457
        California  2010    37253956
                    2000    33871648
        dtype: int64
        
        
        pop[:,2010]
        
        输出:
        California    37253956
        New York      19378102
        Texas         25145561
        dtype: int64
        
    • DataFrame的操作

      • 可以直接使用列名称来进行列索引

        index = [('California',2000),('California',2010),
                ('New York',2000),('New York',2010),
                ('Texas',2000),('Texas',2010)]
        populations = [33871648,37253956,
                      18976457,19378102,
                      20851820,25145561]
        Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])
        pop = DataFrame(populations,index=Mindex,columns=['population'])
        pop
        
        输出:
        population
        California	2000	33871648
        2010	37253956
        New York	2000	18976457
        2010	19378102
        Texas	2000	20851820
        2010	25145561
        
        pop['population']
        
        输出:
        California  2000    33871648
                    2010    37253956
        New York    2000    18976457
                    2010    19378102
        Texas       2000    20851820
                    2010    25145561
        Name: population, dtype: int64
        
      • 使用行索引需要用ix(),loc()等函数
        【极其重要】推荐使用loc()函数

        pop.loc['California']
        
        输出:
        population
        2000	33871648
        2010	37253956
        
        pop.loc[('California',2000)]
        
        输出:
        population    33871648
        Name: (California, 2000), dtype: int64
        
        pop.loc['California',2000]
        
        输出:
        population    33871648
        Name: (California, 2000), dtype: int64
        
        pop.loc['California',2000][0]
        
        输出:
        33871648
        
        pop.loc['California'].loc[2000]
        
        输出:
        population    33871648
        Name: 2000, dtype: int64
        
        pop.loc[('California',2000),'population']
        
        输出:
        33871648
        
      • 注意在对行索引的时候,若一级行索引还有多个,对二级行索引会遇到问题!也就是说,无法直接对二级索引进行索引,必须让二级索引变成一级索引后才能对其进行索引!

        pop.loc['California':'New York',2000]
        
        输出:
        TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2000] of <class 'int'>
        
  4. 索引的堆(unstack)

    • unstack()

      • 【小技巧】使用unstack()的时候,level等于哪一个,哪一个就消失,出现在列里

        populations = [33871648,37253956,
                      18976457,19378102,
                      20851820,25145561]
        Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])
        pop = DataFrame(populations,index=Mindex,columns=['population'])
        pop
        
        输出:
        							population
        California	2000		33871648
        				2010		37253956
        New York	2000		18976457
        				2010		19378102
        Texas		2000		20851820
        				2010		25145561
        
        ///////////////////////////////////////////////////////////////////////////////
        pop.unstack(level=0)
        
        输出:
        		population
        		California		New York		Texas
        2000	33871648		18976457		20851820
        2010	37253956		19378102		25145561
        
        pop.unstack(level=1)
        
        输出:
        					population
        						2000				2010
        California		33871648		37253956
        New York		18976457		19378102
        Texas			20851820		25145561
        
  5. 聚合操作

    • 【注意】
      需要指定level
      【小技巧】和unstack()相反,聚合的时候,level等于哪一个,哪一个就保留

      populations = [33871648,37253956,
                    18976457,19378102,
                    20851820,25145561]
      Mindex = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])
      pop = DataFrame(populations,index=Mindex,columns=['population'])
      pop
      
      输出:
      							population
      California	2000			33871648
      				2010			37253956
      New York	2000			18976457
      				2010			19378102
      Texas		2000			20851820
      				2010			25145561
      
      
      pop.mean(level=0)
      
      输出:
      					population
      California		35562802.0
      New York		19177279.5
      Texas			22998690.5
      
      pop.mean(level=1)
      
      输出:
      				population
      2000			2.456664e+07
      2010			2.725921e+07
      
    • 练习:
      计算各个科目期中期末平均成绩
      计算各科目张三李四的最高分

      df = DataFrame(np.random.randint(0,150,(6,3)),
                     index = pd.MultiIndex.from_product([['张三','李四','王五'],['期中','期末']]),
                    columns = ['语文','数学','编程'])
      display(df)
      print("计算各个科目期中期末平均成绩")
      display(df.mean(level = 1))
      print("计算各科目张三李四的最高分")
      display(df.loc['张三':'李四'].max(level = 1))
      
      输出:
      					语文	数学	编程
      张三	期中		5		0		136
      		期末		139	4		92
      李四	期中		26		74		52
      		期末		51		105	18
      王五	期中		117	34		51
      		期末		58		55		18
      
      计算各个科目期中期末平均成绩
      
      		语文				数学			    编程
      期中	49.333333		36.000000		79.666667
      期末	82.666667		54.666667		42.666667
      
      计算各科目张三李四的最高分
      
      		语文		数学		编程
      期中		26		74		136
      期末		139		105		92
      

二. pandas拼接操作

  1. 级联:pd.concat, pd.append

    • 回顾numpy的级联

      x = np.array([1,2,3]).reshape(1,3)
      y = np.array([4,5,6]).reshape(1,3)
      z = np.array([7,8,9]).reshape(1,3)
      display(x,y,z)
      np.concatenate([x,y,z])
      
      输出:
      array([[1, 2, 3]])
      array([[4, 5, 6]])
      array([[7, 8, 9]])
      Out[7]:
      array([[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]])
      
      
      # 练习:
      # 生成2个3*3的矩阵,对其分别进行两个维度上的级联
      x = np.random.randn(3,3)
      x
      
      输出:
      array([[-0.09120671, -0.52299759, -0.16349379],
             [ 0.36666059, -1.3587184 ,  0.81883706],
             [ 1.0621649 ,  0.72906367, -0.41325065]])
      
      np.concatenate([x,x])
      
      输出:
      array([[-0.09120671, -0.52299759, -0.16349379],
             [ 0.36666059, -1.3587184 ,  0.81883706],
             [ 1.0621649 ,  0.72906367, -0.41325065],
             [-0.09120671, -0.52299759, -0.16349379],
             [ 0.36666059, -1.3587184 ,  0.81883706],
             [ 1.0621649 ,  0.72906367, -0.41325065]])
      
      
      np.concatenate([x,x],axis=1)
      
      输出:
      array([[-0.09120671, -0.52299759, -0.16349379, -0.09120671, -0.52299759,
              -0.16349379],
             [ 0.36666059, -1.3587184 ,  0.81883706,  0.36666059, -1.3587184 ,
               0.81883706],
             [ 1.0621649 ,  0.72906367, -0.41325065,  1.0621649 ,  0.72906367,
              -0.41325065]])
      
    • 创建生成DataFrame的函数

      def make_df(cols,ind):
          '''
          快速生成一个DataFrame
          '''
          data = {c:[str(c)+str(i) for i in ind] for c in cols}
          return DataFrame(data,ind)
      
      
      调用:
      make_df('ABC', range(3))
      
      输出:
      
          A   B	  C
      0	A0 	B0	  C0
      1	A1	B1	  C1
      2	A2 	B2	  C2
      
    • 使用pd.concat()级联

      # pandas使用pd.concat函数,与np.concatenate函数类似,只是多了一些参数:
      pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
                keys=None, levels=None, names=None, verify_integrity=False,
                copy=True)
      
      • 简单级联

        df1 = make_df('AB',[1,2])
        df2 = make_df('AB',[3,4])
        display(df1,df2,pd.concat([df1,df2]))
        
        输出:
        
        	A	B
        1	A1	B1
        2	A2	B2
        A	B
        3	A3	B3
        4	A4	B4
        A	B
        1	A1	B1
        2	A2	B2
        3	A3	B3
        4	A4	B4
        
        • 可以通过设置axis来改变级联方向

          df3 = make_df('WX',[1,3])
          df4 = make_df('QT',[5,8])
          display(df3,df4,pd.concat([df3,df4],axis=1))
          
          输出:
          
          	W	X
          1	W1	X1
          3	W3	X3
          	Q	T
          5	Q5	T5
          8	Q8	T8
          	W	X	Q	T
          1	W1	X1	NaN	NaN
          3	W3	X3	NaN	NaN
          5	NaN	NaN	Q5	T5
          8	NaN	NaN	Q8	T8
          
        • 注意index在级联时可以重复

          x = make_df('AB',[0,1])
          y = make_df('AB',[1,8])
          pd.concat([x,y])
          
          输出:
          
              A	  B
          0	A0	 B0
          1	A1 	 B1
          1	A1   B1
          8	A8	 B8
          
          可以选择忽略index,重新索引
          x = make_df('AB',[0,1])
          y = make_df('AB',[1,8])
          pd.concat([x,y],ignore_index=True)
          
          输出:
           
          
              A	  B
          0	A0	 B0
          1	A1	 B1
          2	A4	 B4
          3	A8	 B8
          
        • 使用多层索引

          display(pd.concat([x,y]))
          display(pd.concat([x,y],keys=['x','y']))
          display(x,y)
          display(pd.concat([x,y],axis = 1))
          display(pd.concat([x,y],axis = 1,keys = ['v','r']))
          
          输出:
          
          	A	B
          0	A0	B0
          1	A1	B1
          1	A1	B1
          8	A8	B8
          
          
          		A	B
          x	0	A0	B0
          	1	A1	B1
          y	1	A1	B1
          	8	A8	B8
          
          
          	A	B
          0	A0	B0
          1	A1	B1
          	A	B
          1	A1	B1
          8	A8	B8
          
          	A	B		A	B
          0	A0	B0		NaN	NaN
          1	A1	B1		A1	B1
          8	NaN	NaN	   A8	B8
          
          	v				r
          	A	B		A	B
          0	A0	B0		NaN	NaN
          1	A1	B1		A1	B1
          8	NaN	NaN 	A8	B8
          
      • 不匹配级联

        # 不匹配指的是级联的维度的索引不一致。例如纵向级联时列索引不一致,横向级联时行索引不一致
        
        • 外连接:补NaN(默认模式)

          x = make_df('ABC',[1,2])
          y = make_df('BCD',[3,4])
          display(x,y,pd.concat([x,y]))
          
          输出:
          A	B	C
          1	A1	B1	C1
          2	A2	B2	C2
          B	C	D
          3	B3	C3	D3
          4	B4	C4	D4
          A	B	C	D
          1	A1	B1	C1	NaN
          2	A2	B2	C2	NaN
          3	NaN	B3	C3	D3
          4	NaN	B4	C4	D4
          
        • 内连接:只连接匹配的项

          x = make_df('ABC',[1,2])
          y = make_df('BCD',[3,4])
          display(x,y,pd.concat([x,y],join="inner"))
          输出:
          		A	B		C
          1		A1	B1		C1
          2		A2	B2		C2
          
          		B		C		D
          3		B3		C3		D3
          4		B4		C4		D4
          
          	B		C
          1	B1		C1
          2	B2		C2
          3	B3		C3
          4	B4		C4
          
        • 连接指定轴

          x = make_df('ABC',[1,2])
          y = make_df('BCD',[3,4])
          display(x,y,pd.concat([x,y],join_axes=[x.columns]))
          
          输出:
          	A	B	    C
          1	A1	B1	    C1
          2	A2	B2 	C2
          	B	 C	D
          3	B3	 C3	D3
          4	B4	 C4	D4
          
          	A   B	 C
          1	A1	B1	 C1
          2	A2	B2	 C2
          3	NaN	B3	 C3
          4	NaN	B4	 C4
          
        • 练习:
          假设【期末】考试ddd2的成绩没有张三的,只有李四、王老五、赵小六的,使用多种方法级联

          np.random.seed(15)
          a = np.random.randint(150,size=(5,3))
          ddd2 = DataFrame(a,columns=["王五","李四","赵小六"],
                          index = ["语文","数学","英语","理综","Python"])
          np.random.seed(10)
          ddd = DataFrame(np.random.randint(150,size = (5,3)),
                                            columns = ['张三','李四','小莲'],
                                            index = ["语文","数学","英语","理综","Python"])
          display(ddd2)
          display(ddd)
          
          输出:
          王五	李四	赵小六
          语文	140	133	119
          数学	128	85	118
          英语	17	23	62
          理综	10	143	105
          Python	39	37	19
          张三	李四	小莲
          语文	9	125	15
          数学	64	113	123
          英语	113	8	73
          理综	0	40	115
          Python	16	100	139
          
          df = pd.concat([ddd,ddd2],axis = 1,ignore_index=False)
          df
          
          pd.concat([ddd,ddd2],keys=["期中","期末"],axis = 0)
          
          
    • 使用pd.appnd()级联

      # 由于在后面级联的使用非常普遍,因此有一个函数append专门用于在后面添加
      
      x = make_df('AB',np.random.randint(10,size = 2))
      y = make_df('AB',np.random.randint(10,20,size = 2))
      display(x.append(y))
      
      输出:
      
      	  A		     B
      6	  A6		  B6
      8	  A8		  B8
      11   A11	      B11
      18	 A18	      B18
      
      • 练习:
        新建一个只有张三李四王老五的期末考试成绩单ddd3,使用append()与期中考试成绩表ddd级联

        np.random.seed(15)
        a = np.random.randint(150,size=(5,3))
        ddd2 = DataFrame(a,columns=["王五","李四","赵小六"],
                        index = ["语文","数学","英语","理综","Python"])
        np.random.seed(10)
        ddd = DataFrame(np.random.randint(150,size = (5,3)),
                                          columns = ['张三','李四','小莲'],
                                          index = ["语文","数学","英语","理综","Python"])
        display(ddd2)
        display(ddd)
        
        输出:
        王五	李四	赵小六
        语文	140	133	119
        数学	128	85	118
        英语	17	23	62
        理综	10	143	105
        Python	39	37	19
        张三	李四	小莲
        语文	9	125	15
        数学	64	113	123
        英语	113	8	73
        理综	0	40	115
        Python	16	100	139
        
        df = pd.concat([ddd,ddd2],axis = 1,ignore_index=False)
        df
        
        pd.concat([ddd,ddd2],keys=["期中","期末"],axis = 0)
        
        
  2. 合并:pd.merge

    merge与concat的区别在于,merge需要依据某一共同的行或列来进行合并
    使用pd.merge()合并时,会自动根据两者相同column名称的那一列,作为key来进行合并。
    注意每一列元素的顺序不要求一致
    
    • 一对一合并

      import numpy as np
      import pandas as pd
      from pandas import Series,DataFrame
      df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                      'group':['Accounting','Engineering','Engineering'],
                      })
      df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                      'hire_date':[2004,2008,2012],
                      })
      display(df1,df2)
      
      输出:
      
       	employee	group
      0	Bob		Accounting
      1	Jake	Engineering
      2	Lisa	Engineering
      
      
      	employee	hire_date
      0	Lisa	2004
      1	Bob		2008
      2	Jake	2012
      
      
      df3 = pd.merge(df1,df2)
      df3
      
      输出:
      	employee	group	hire_date
      0	Bob		Accounting	2008
      1	Jake	Engineering	2012
      2	Lisa	Engineering	2004
      
    • 多对一合并

      df3 = DataFrame({
          'employee':['Lisa','Jake'],
          'group':['Accounting','Engineering'],
          'hire_date':[2004,2016]})
      df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                             'supervisor':['Carly','Guido','Steve']
                      })
      display(df3,df4,pd.merge(df3,df4))
      
      输出:
      	employee		group		hire_date
      0		Lisa		Accounting		2004
      2	 	Jake		Engineering		2016
      
      		group			supervisor
      0		Accounting		Carly
      1		Engineering		Guido
      2		Engineering		Steve
      
      	employee  group	  hire_date	 supervisor
      0	Lisa	Accounting	2004	Carly
      1	Jake	Engineering	2016	Guido
      2	Jake	Engineering	2016	Steve
      
    • 多对多合并

      df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                       'group':['Accounting','Engineering','Engineering']})
      df5 = DataFrame({'group':['Engineering','Engineering','HR'],
                      'supervisor':['Carly','Guido','Steve']
                      })
      display(df1,df5,pd.merge(df1,df5))#多对多
      display(pd.concat([df1,df5]))
      
      输出:
      	employee	group
      0	Bob			Accounting
      1	Jake		Engineering
      2	Lisa		Engineering
      
      	group		supervisor
      0	Engineering	Carly
      1	Engineering	Guido
      2		HR		Steve
      
      employee	group	supervisor
      0	Jake	Engineering	Carly
      1	Jake	Engineering	Guido
      2	Lisa	Engineering	Carly
      3	Lisa	Engineering	Guido
      
      
         	employee		group		supervisor
      0		Bob			Accounting		NaN
      1		Jake		Engineering		NaN
      2		Lisa		Engineering		NaN
      0		NaN			Engineering		Carly
      1		NaN			Engineering		Guido
      2		NaN				HR			Steve
      
    • key的规范化

      • 使用on=显式指定哪一列为key
      df1 = DataFrame({'employee':['Jack',"Summer","Steve"],
                       'group':['Accounting','Finance','Marketing']})
      
      df2 = DataFrame({'employee':['Jack','Bob',"Jake"],
                       'hire_date':[2003,2009,2012],
                      'group':['Accounting','sell','ceo']})
      display(df1,df2,pd.merge(df1,df2),pd.merge(df1,df2,on = 'employee'))
      
      输出:
      
      employee	group
      0	Jack	Accounting
      1	Summer	Finance
      2	Steve	Marketing
      
      
      employee	group	hire_date
      0	Jack	Accounting	2003
      1	Bob	sell	2009
      2	Jake	ceo	2012
      
      employee	group	hire_date
      0	Jack	Accounting	2003
      
      	employee	 group_x	group_y	hire_date
      0	Jack 		Accounting	Accounting	2003
      
      • 使用left_on和right_on指定左右两边的列作为key

        df1 = DataFrame({'employee':['Bobs','Linda','Bill'],
                        'group':['Accounting','Product','Marketing'],
                       'hire_date':[1998,2017,2018]})
        df5 = DataFrame({'name':['Lisa','Bobs','Bill'],
                        'hire_dates':[1998,2016,2007]})
        
        display(df1,df5,pd.merge(df1,df5,left_on = 'employee',right_on = 'name'))
        
        
        输出:
        	employee	group		hire_date
        0		Bobs	Accounting	1998
        1		Linda	Product			2017
        2		Bill	Marketing			2018
        
         hire_dates	 	name
        0	1998			Lisa
        1	2016			Bobs
        2	2007			Bill
        
        
        employee	group	hire_date	hire_dates	name
        0	Bobs	Accounting	1998	2016	Bobs
        1	Bill	Marketing	2018	2007	Bill
        
      • 练习:
        假设有两份成绩单,除了ddd是张三李四王老五之外,还有ddd4是张三和赵小六的成绩单,如何合并?
        如果ddd4中张三的名字被打错了,成为了张十三,怎么办?
        自行练习多对一,多对多的情况

        a1 = np.array([[122,107],[21,14],[110,38],[128,111],[50,50]])
        ddd1 = DataFrame(a1,columns=["张三","李四"],
                        index = ["语文","数学","英语","理综","计算机"])
        
        a2 = [[122,120],[21,25],[110,95],[128,61],[50,45]]
        ddd2 = DataFrame(np.random.randint(100,size = (5,3)),columns=["李四","赵小六",'张五'],
                        index = ["语文","数学","英语","理综","计算机"])
        display(ddd1,ddd2)
        pd.merge(ddd1,ddd2,left_index = True,right_index = True,on = ['李四'])
        
        输出:
        		张三		李四
        语文		122		107
        数学		21		14
        英语		110		38
        理综		128		111
        计算机		50		50
        
        		李四		 赵小六		张五
        语文		95			76			85
        数学		89			15			4
        英语		28			79			21
        理综		99			58			28
        计算机	   21			5			58
        
        
        Out[94]:
        		张三		李四	  赵小六	   张五
        语文		122		107		76			85  
        数学		21		14		15			4    
        英语		110		38		79			21  
        理综		128		111		58			28  
        计算机	   50		50		5			58  
        
        
    • 内合并与外合并

      • 内合并:只保留两者都有的key(默认模式)

        df6 = DataFrame({'name':['Peter','Paul','Mary'],
                       'food':['fish','beans','bread']}
                       )
        df7 = DataFrame({'name':['Mary','Joseph'],
                        'drink':['wine','beer']})
        display(df6,df7,pd.merge(df6,df7))
        
        输出:
        
        	food	name
        0	fish	Peter
        1	beans	Paul
        2	bread	Mary
        
        	drink	name
        0	wine	Mary
        1	beer	Joseph
        
        	food	name	drink
        0	bread	Mary	winee
        
      • 外合并 how=‘outer’:补NaN

        df6 = DataFrame({'name':['Peter','Paul','Mary'],
                       'food':['fish','beans','bread']}
                       )
        df7 = DataFrame({'name':['Mary','Joseph'],
                        'drink':['wine','beer']})
        display(df6,df7,pd.merge(df6,df7,how='outer'))
        
        
        输出:
        	food	name
        0	fish	Peter
        1	beans	Paul
        2	bread	Mary
        
        	drink	name
        0	wine	Mary
        1	beer	Joseph
        
         	food	   name	       drink 
        0	fish		Peter		NaN  
        1	beans		Paul		NaN  
        2	bread		Mary	 	wine 
        3	NaN		  Joseph		beer  
        
      • 左合并、右合并:how=‘left’,how=‘right’

        df6 = DataFrame({'name':['Peter','Paul','Mary'],
                       'food':['fish','beans','bread']}
                       )
        df7 = DataFrame({'name':['Mary','Joseph'],
                        'drink':['wine','beer']})
        display(df6,df7,pd.merge(df6,df7,how='left'))
        display(df6,df7,pd.merge(df6,df7,how='right'))
        
        输出:
        	food	name
        0	fish	Peter
        1	beans	Paul
        2	bread	Mary
        
        	drink	name
        0	wine	Mary
        1	beer	Joseph
        
        	food	name	drink	
        0	fish	Peter	NaN		
        1	beans	Paul	NaN		
        2	bread	Mary	wine		
        
        
        	food	name	drink	
        0	bread	Mary	wine		
        1	NaN		Joseph	beer		
        
      • 练习:
        如果只有张三赵小六语数英三个科目的成绩,如何合并?
        考虑应用情景,使用多种方式合并ddd与ddd4

        import numpy as np
        import pandas as pd
        from pandas import Series,DataFrame
        np.random.seed(10)
        df1 = DataFrame({'张三':[127,119,99,119,73],'李四':[125,64,123,99,139]},
                        index = ['语文','数学','英语','理综','编程'],columns = ['张三','李四'])
        
        df2 = DataFrame({'张三':[122,21,110],
                         '赵小六':[78,96,83]},
                         index = ['语文','数学','英语']
                        )
        display(df1,df2)
        pd.merge(df1,df2,how = 'left',on = '张三',left_index = True,right_index = True)
        输出:
        			张三		李四
        语文		127		  125
        数学		119		  64
        英语		99			123
        理综		119		   99
        编程		73			139
        
        			张三		赵小六
        语文		122		    78
        数学		21			96
        英语		110		    83
        
        Out[56]:
        		张三	  李四	赵小六
        语文		127		125		78.0
        数学		119		64		96.0
        英语		99		123		83.0
        理综		119		99		NaN
        编程		73		139		NaN
        
    • 列冲突的解决

      • 当列冲突时,即有多个列名称相同时,需要使用on=来指定哪一个列作为key

        df8 = DataFrame({'name':['Peter','Paul','Mary'],
                        'rank':[1,2,3]})
        df9 = DataFrame({'name':['Peter','Paul','Mary'],
                        'rank':[5,6,7]})
        display(df8,df9,pd.merge(df8,df9,on = 'name',suffixes=['rank_L','rank_R']))
        
        输出:
        	name		rank
        0	Peter		1
        1	Paul		2
        2	Mary		3
        	name	rank
        0	Peter		5
        1	Paul		6
        2	Mary		7
        
        	name	rankrank_L	rankrank_R
        0	Peter		1			5			
        1	Paul		2			6			
        2	Mary		3			7			
        
      • 练习:
        假设有两个同学都叫李四,ddd5、ddd6都是张三和李四的成绩表,如何合并?

        np.random.seed(100)
        df4 = DataFrame(np.random.randint(150,size = (5,2)),
                       columns = ['张三','李四'],
                       index = ['语文','数学','英语','理综','计算机'])
        df5 = DataFrame({'张三':[8,67,87,128,50],
                         '李四':[24,103,83,45,58]},
                         index = ['语文','数学','英语','理综','计算机']
                        )
        display(df4,df5)
        pd.merge(df4,df5,left_index = True,right_index = True,suffixes=['-期中','-期末'])
        
        
        输出:
        		张三  	李四
        语文		8		24
        数学		67		103
        英语		87		79
        理综		138	    94
        计算机	   98		53
        
        		张三		李四
        语文		8		24
        数学		67		103
        英语		87		83
        理综		128	45
        计算机  	50		58
        
        
        	张三-期中	李四-期中	张三-期末	李四-期末
        语文	8	24	8	24
        数学	67	103	67	103
        英语	87	79	87	83
        理综	138	94	128	45
        计算机	98	53	50	58		
        

三. pandas案例

美国各州人口数据分析

四. 回顾:Series/DataFrame运算与ndarray运算的区别

  1. ndarray有广播,通过重复已有值来计算

    a = np.array([1,2,3])
    b = np.array([2])
    display(a,b,a+b)
    
    
    # 输出:
    array([1, 2, 3])
    array([2])
    array([3, 4, 5])
    
  2. Series与DataFrame没有广播,如果对应index没有值,则记为NaN;或者使用add的fill_value来补缺失值

    # Series
    s = Series([1,2,3])
    s2 = Series([2])
    display(s,s+1,s+s2)
    display(s.add(s2,fill_value=0))
    
    输出:
    0    1
    1    2
    2    3
    dtype: int64
    0    2
    1    3
    2    4
    dtype: int64
    0    3.0
    1    NaN
    2    NaN
    dtype: float64
    0    3.0
    1    2.0
    2    3.0
    dtype: float64
    
        
        
        
    # DataFrame
    df = DataFrame({'A':[1,2,3],
                   'B':[4,5,6]})
    df2 = DataFrame(np.random.randint(10,size = (3,2)),columns=['B','c'])
    display(df,df2,df+df2,df.add(df2,fill_value = 0),df2.add(df,fill_value = 0).astype(int))
    
    
    输出:
    	A	B
    0	1	4
    1	2	5
    2	3	6
    	B	c
    0	8	4
    1	7	0
    2	2	8
    
    	A		B			c
    0	NaN	12		NaN
    1	NaN	12		NaN
    2	NaN	 8		NaN
    
    	A		B		c
    0	1.0	12		4.0
    1	2.0	12		0.0
    2	3.0	8		8.0
    
    	A		B		c
    0	1		12		4
    1	2		12		0
    2	3		8		8
    
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值