How to replace values in a DataFrame column: Python essentials on handling missing data in Pandas

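This article explains how to handle missing data in a Pandas DataFrame: checking for missing values with isnull() and notnull(), filling them with fillna(), replace() and interpolate(), and dropping the rows or columns that contain them with dropna(). The example code shows when each method applies and what it does.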

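Missing data arises when no information is provided for one or more items, or for a whole unit. It is a serious problem in real-world datasets, and the lost values usually cannot be recovered.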

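In Pandas, missing data is represented by two values:

  • None: a Python singleton object, often used to mark missing data in Python code.
  • NaN (short for Not a Number): a special floating-point value recognized by every system that uses the standard IEEE 754 floating-point representation.

Pandas treats None and NaN as essentially interchangeable for indicating missing or null values.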
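
The interchangeability of None and NaN can be checked directly. The following is a minimal sketch; the variable names are illustrative and do not come from the original article.

```python
import numpy as np
import pandas as pd

# A numeric Series built with None: pandas stores the gap as NaN (float64).
numeric = pd.Series([1, None, 3])

# Both markers are reported as missing by pd.isna().
none_is_missing = pd.isna(None)
nan_is_missing = pd.isna(np.nan)

print(numeric.dtype)            # float64
print(numeric.isna().tolist())  # [False, True, False]
print(none_is_missing, nan_is_missing)
```

Because NaN compares unequal to itself, a test such as value == np.nan never succeeds; the detection functions listed next are the reliable way to find missing values.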

A Pandas DataFrame offers several useful functions for detecting, removing and replacing null values:

  • isnull()
  • notnull()
  • dropna()
  • fillna()
  • replace()
  • interpolate()

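Using isnull() and notnull()

To check for missing values in a Pandas DataFrame, use the isnull() and notnull() functions.

Using isnull()

To find the null values in a Pandas DataFrame, call isnull(). It returns a DataFrame of booleans that is True wherever the value is NaN.

Code 1: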

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)
  
# using isnull() function  
df.isnull()

Output:

[Image: the boolean DataFrame returned by df.isnull()]
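
A boolean mask is often more useful once it is summarized. As a small sketch on the same score data, summing the mask counts the missing values per column; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

scores = {'First Score': [100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(scores)

# True counts as 1, so summing the boolean mask counts NaN values per column.
missing_per_column = df.isnull().sum()

# Total number of missing cells in the whole DataFrame.
total_missing = int(df.isnull().sum().sum())

print(missing_per_column)
print(total_missing)  # 3
```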

Code 2:

# importing pandas package 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv") 
    
# creating bool series True for NaN values 
bool_series = pd.isnull(data["Gender"]) 
    
# filtering data 
# displaying data only with Gender = NaN 
data[bool_series] 

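Output:
As the output image shows, only the rows where Gender is null are displayed.

[Image: rows of employees.csv whose Gender is NaN]

Using notnull()

To find the non-null values in a Pandas DataFrame, call notnull(). It returns a DataFrame of booleans that is False wherever the value is NaN.

Code 3: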

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe using dictionary
df = pd.DataFrame(scores)
  
# using notnull() function 
df.notnull()

Output:

[Image: the boolean DataFrame returned by df.notnull()]

Code 4:

# importing pandas package 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv") 
    
# creating bool series that is True for non-NaN values
bool_series = pd.notnull(data["Gender"])

# filtering data
# displaying data only where Gender is not NaN
data[bool_series]

Output:
As the output image shows, only the rows where Gender is not null are displayed.

[Image: rows of employees.csv whose Gender is not NaN]
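
Since employees.csv is not included with the article, the same filtering can be tried on a small in-memory table. This is only a sketch: the rows below are a hypothetical stand-in, not the real file's contents.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv (the real file is not shown here).
data = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Cy", "Dee"],
    "Gender": ["Female", np.nan, "Male", None],
})

# Rows where Gender is missing, and rows where it is present.
missing_gender = data[data["Gender"].isnull()]
known_gender = data[data["Gender"].notnull()]

print(missing_gender.index.tolist())  # [1, 3]
print(known_gender.index.tolist())    # [0, 2]
```

The two masks are complements of each other, so every row lands in exactly one of the two results.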

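Using fillna(), replace() and interpolate()

The fillna(), replace() and interpolate() functions fill the null values in a DataFrame by replacing each NaN with a value of your choosing.

The interpolate() function fills NA values by applying one of several interpolation techniques, rather than by hard-coding the replacement value.

Code 1: fill null values with a single value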

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)
  
# filling missing value using fillna()  
df.fillna(0)

Output:

[Image: the DataFrame with every NaN replaced by 0]
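
A single constant is not always the best fill value. As a sketch on the same score data, fillna() also accepts a Series or a dict keyed by column name, so each column can receive its own value. The choice of the mean, 0 and -1 here is purely illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, 45, 56, np.nan],
                   'Third Score': [np.nan, 40, 80, 98]})

# Fill each column's gaps with that column's own mean (mean() skips NaN).
filled_with_mean = df.fillna(df.mean())

# Alternatively, pass a dict to choose a different fill value per column.
# Columns not named in the dict ('Third Score') are left untouched.
filled_per_column = df.fillna({'First Score': 0, 'Second Score': -1})

print(filled_with_mean)
print(filled_per_column)
```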

Code 2: fill null values with the previous value

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)

# filling a missing value with the previous one
# (ffill() replaces fillna(method='pad'), which is deprecated in recent pandas)
df.ffill()

Output:

[Image: the DataFrame with each NaN replaced by the previous value in its column]

Code 3: fill null values with the next value

# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np

# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)

# filling a missing value with the next one
# (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
df.bfill()

Output:

[Image: the DataFrame with each NaN replaced by the next value in its column]
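
Forward and backward filling are also available as the dedicated ffill() and bfill() methods, which accept a limit argument that caps how many consecutive NaN values get filled. The short Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # propagate the last valid value forward
backward = s.bfill()         # pull the next valid value backward
limited = s.ffill(limit=1)   # fill at most one consecutive NaN per gap

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0, nan]
print(limited.tolist())   # [1.0, 1.0, nan, 4.0, 4.0]
```

Note that a backward fill leaves a trailing NaN in place, because there is no later value to copy from.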

Code 4: fill null values in a CSV file

# importing pandas package 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv")
  
# Printing the first 10 to 24 rows of
# the data frame for visualization   
data[10:25]

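[Image: rows 10 to 24 of employees.csv]

Now we fill every null value in the Gender column with "No Gender".

# importing pandas package
import pandas as pd

# making data frame from csv file
data = pd.read_csv("employees.csv")

# filling null values using fillna(); assigning the result back avoids the
# chained in-place call, which recent pandas versions warn about
data["Gender"] = data["Gender"].fillna("No Gender")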
  
data

Output:

[Image: employees.csv with the null Gender values replaced by "No Gender"]
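
To try the column-level fill without the CSV file, here is a sketch on a hypothetical in-memory table. The names and salaries are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv.
data = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Cy"],
    "Gender": ["Female", np.nan, "Male"],
    "Salary": [5000.0, np.nan, 4200.0],
})

# Fill only the Gender column; the NaN in Salary is left as it is.
data["Gender"] = data["Gender"].fillna("No Gender")

print(data)
```

Filling one column at a time keeps a text placeholder such as "No Gender" out of numeric columns, where it would change the column's dtype.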

Code 5: fill null values using the replace() method

# importing pandas package 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv")
  
# Printing the first 10 to 24 rows of
# the data frame for visualization   
data[10:25]

Output:

[Image: rows 10 to 24 of employees.csv]

Now we replace every NaN value in the DataFrame with -99.

# importing pandas package
import pandas as pd

# importing numpy, which provides np.nan
import numpy as np

# making data frame from csv file
data = pd.read_csv("employees.csv")

# replace every NaN value in the dataframe with -99
data.replace(to_replace = np.nan, value = -99)

Output:

[Image: the DataFrame with every NaN replaced by -99]
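
replace() works in both directions, which helps when a dataset already uses a sentinel such as -99 for missing entries. The sketch below uses made-up sensor readings; the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -99 was used as a "missing" sentinel.
readings = pd.DataFrame({"temp": [21.5, -99.0, 19.0],
                         "humidity": [-99.0, 40.0, 42.0]})

# Turn the sentinel into a real missing value so pandas can detect it.
cleaned = readings.replace(-99, np.nan)

# The reverse direction, as in the article: NaN back to -99.
restored = cleaned.replace(np.nan, -99)

print(cleaned.isnull().sum().tolist())  # [1, 1]
print(restored.equals(readings))        # True
```

Converting sentinels to NaN first means functions such as mean() skip those cells instead of averaging in -99.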

Code 6: use the interpolate() function to fill missing values with the linear method.

# importing pandas as pd 
import pandas as pd 
    
# Creating the dataframe  
df = pd.DataFrame({"A":[12, 4, 5, None, 1], 
                   "B":[None, 2, 54, 3, None], 
                   "C":[20, 16, None, 3, 8], 
                   "D":[14, 3, None, None, 6]}) 
    
# Print the dataframe 
df 

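[Image: the DataFrame before interpolation]

Let's interpolate the missing values with the linear method. Note that the linear method ignores the index and treats the values as equally spaced.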

# to interpolate the missing values 
df.interpolate(method ='linear', limit_direction ='forward')

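Output:

[Image: the DataFrame after linear interpolation]

As the output shows, the value in the first row of column B could not be filled: the fill direction is forward, and there is no earlier value to interpolate from.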
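
The interpolated numbers can be verified by hand: a gap between 5 and 1 becomes 3, and the two gaps between 3 and 6 become 4 and 5. The sketch below repeats the interpolation on the same data and also tries limit_direction='both', a standard option of interpolate() that the article does not use.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [None, 2, 54, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})

# Forward only: the leading NaN in column B has nothing before it.
forward = df.interpolate(method='linear', limit_direction='forward')

# Both directions: the leading NaN is filled from the nearest valid value.
both = df.interpolate(method='linear', limit_direction='both')

print(forward["A"].tolist())  # [12.0, 4.0, 5.0, 3.0, 1.0]
print(forward["D"].tolist())  # [14.0, 3.0, 4.0, 5.0, 6.0]
print(both["B"].tolist())     # [2.0, 2.0, 54.0, 3.0, 3.0]
```

The trailing NaN in column B is filled with the last valid value, 3, in both cases, since there is no later point to interpolate toward.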

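Using dropna()

To remove null values from a DataFrame, use the dropna() function, which can drop the rows or columns that contain null values in several ways.

Code 1: drop rows with at least one null value.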

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, 40, 80, 98],
          'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)
    
df

[Image: the DataFrame before dropping rows]

Now drop the rows that have at least one NaN (null) value.

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, 40, 80, 98],
          'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)
  
# using dropna() function  
df.dropna()

Output:

[Image: the DataFrame after dropna(), with only the fully populated row left]
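
Dropping every row that has any gap can discard a lot of data. As a sketch on the same frame, the subset and thresh parameters of dropna() give finer control; the particular columns and threshold chosen here are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                   'Second Score': [30, np.nan, 45, 56],
                   'Third Score': [52, 40, 80, 98],
                   'Fourth Score': [np.nan, np.nan, np.nan, 65]})

# Default behaviour: drop a row if any of its values is missing.
strict = df.dropna()

# Drop a row only if it is missing a value in the listed columns.
by_subset = df.dropna(subset=['First Score', 'Second Score'])

# Keep only the rows that have at least 3 non-missing values.
by_thresh = df.dropna(thresh=3)

print(strict.index.tolist())     # [3]
print(by_subset.index.tolist())  # [0, 3]
print(by_thresh.index.tolist())  # [0, 3]
```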

Code 2: drop rows in which all values are missing.

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, np.nan, 80, 98],
          'Fourth Score':[np.nan, np.nan, np.nan, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)
    
df

[Image: the DataFrame before dropping rows]

Now drop only the rows in which every value is missing (NaN).

# importing pandas as pd
import pandas as pd
  
# importing numpy as np
import numpy as np
  
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, np.nan, 80, 98],
          'Fourth Score':[np.nan, np.nan, np.nan, 65]}

df = pd.DataFrame(scores)
  
# using dropna() function    
df.dropna(how = 'all')

Output:

[Image: the DataFrame after dropna(how='all'), with the fully empty row removed]
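
The difference between how='all' and the default how='any' is easy to confirm on this frame. The sketch below also builds the equivalent boolean mask by hand; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, np.nan, np.nan, 95],
                   'Second Score': [30, np.nan, 45, 56],
                   'Third Score': [52, np.nan, 80, 98],
                   'Fourth Score': [np.nan, np.nan, np.nan, 65]})

# Rows in which every value is NaN: exactly the rows how='all' removes.
all_missing = df.isnull().all(axis=1)

kept_all = df.dropna(how='all')   # drops only fully empty rows
kept_any = df.dropna(how='any')   # drops rows with any NaN (the default)

print(all_missing.tolist())     # [False, True, False, False]
print(kept_all.index.tolist())  # [0, 2, 3]
print(kept_any.index.tolist())  # [3]
```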

Code 3: drop columns with at least one null value.

# importing pandas as pd
import pandas as pd
   
# importing numpy as np
import numpy as np
   
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, np.nan, 80, 98],
          'Fourth Score':[60, 67, 68, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)
     
df

[Image: the DataFrame before dropping columns]

Now drop the columns that have at least one missing value.

# importing pandas as pd
import pandas as pd
   
# importing numpy as np
import numpy as np
   
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, np.nan, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, np.nan, 80, 98],
          'Fourth Score':[60, 67, 68, 65]}

# creating a dataframe from dictionary
df = pd.DataFrame(scores)
  
# using dropna() function     
df.dropna(axis = 1)

Output:

[Image: the DataFrame after dropna(axis=1), with only the Fourth Score column left]
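
Before dropping columns it can help to see which ones would be removed. The following sketch lists them, and then uses thresh to keep columns that are mostly complete; the threshold of 3 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'First Score': [100, np.nan, np.nan, 95],
                   'Second Score': [30, np.nan, 45, 56],
                   'Third Score': [52, np.nan, 80, 98],
                   'Fourth Score': [60, 67, 68, 65]})

# Names of the columns that contain at least one NaN.
columns_with_nan = df.columns[df.isnull().any()].tolist()

# Drop every column that has a NaN.
complete_only = df.dropna(axis=1)

# Keep columns that have at least 3 non-missing values.
mostly_complete = df.dropna(axis=1, thresh=3)

print(columns_with_nan)                  # ['First Score', 'Second Score', 'Third Score']
print(complete_only.columns.tolist())    # ['Fourth Score']
print(mostly_complete.columns.tolist())  # ['Second Score', 'Third Score', 'Fourth Score']
```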

Code 4: drop rows with at least one null value in a CSV file

# importing pandas module 
import pandas as pd 
    
# making data frame from csv file 
data = pd.read_csv("employees.csv") 
    
# making new data frame with dropped NA values 
new_data = data.dropna(axis = 0, how ='any') 
    
new_data

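Output:

[Image: employees.csv with every row that contains a null value removed]

Now we compare the lengths of the two DataFrames to find out how many rows had at least one null value.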

print("Old data frame length:", len(data))
print("New data frame length:", len(new_data)) 
print("Number of rows with at least 1 NA value: ", (len(data)-len(new_data)))

Output:

Old data frame length: 1000
New data frame length: 764
Number of rows with at least 1 NA value:  236

Since the difference is 236, there are 236 rows with at least one null value in some column.
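
The same count can be obtained without building a second DataFrame. The sketch below uses a small hypothetical table in place of employees.csv, so its numbers differ from the 236 reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for employees.csv.
data = pd.DataFrame({"Gender": ["Female", np.nan, "Male", "Male"],
                     "Salary": [5000.0, 4100.0, np.nan, 3900.0],
                     "Team": ["Sales", None, "Legal", "Sales"]})

# True for every row that has at least one missing value.
rows_with_na = data.isnull().any(axis=1)

count_direct = int(rows_with_na.sum())
count_by_length = len(data) - len(data.dropna())

print(count_direct)     # 2
print(count_by_length)  # 2
```

Both approaches agree; the mask version additionally tells you which rows are affected, through data[rows_with_na].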
