Covariance, correlation coefficients, and significance tests with pandas

Latest recommended article published on 2023-01-29 18:16:44

This content is licensed under CC 4.0 BY-SA.

Tags

#python #pandas


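This post looks at data analysis with the pandas library in Python: filtering rows by the value in a column, running sentiment analysis, and plotting the results. The author runs into an error when plotting because of the value range of polarity, and shows how the date column is converted to a datetime type and set as the DataFrame index. The post also covers an attempt to count how often each star rating occurs, which ended with the mode being used as the summary statistic instead.

How to filter rows by the value in a column.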

import pandas as pd

pcfr = pd.read_excel('hair.xlsx')
df = pcfr
# Replace the string after '==' with the product title you want
m = df[df['product_title']=='remington ac2015 t|studio salon collection pearl ceramic hair dryer, deep purple']
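
Here is a minimal sketch of the same boolean-mask filter on a tiny made-up frame. The rows and titles are invented for illustration and do not come from hair.xlsx. It also shows `str.contains` as a looser match, with `regex=False` so that a character such as `|` in a title is treated literally, and `isin` for matching several titles at once.

```python
import pandas as pd

# Toy data standing in for hair.xlsx; all values are hypothetical.
demo = pd.DataFrame({
    "product_title": ["dryer a|pro", "dryer b", "dryer a|pro", "brush c"],
    "star_rating": [5, 3, 4, 1],
})

# Exact match: the comparison yields a boolean Series used as a row mask.
exact = demo[demo["product_title"] == "dryer a|pro"]

# Substring match; regex=False keeps '|' from being read as regex alternation.
partial = demo[demo["product_title"].str.contains("a|pro", regex=False)]

# Match any of several titles at once.
several = demo[demo["product_title"].isin(["dryer b", "brush c"])]
```

The exact comparison is the safest choice when the full title is known. The substring form is handy when only part of the product name is.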

The sentiment analysis function.

from textblob import TextBlob

def s_c_f(df):

    # Drop duplicate rows and rows without review text.
    # (The original calls here only inspected the data and changed nothing.)
    df = df.drop_duplicates().dropna(subset=['review_body']).copy()

    # Convert the date column.
    # Parsing with dateutil's parser.parse kept raising errors, see
    # https://stackoverflow.com/questions/51367393/when-i-use-apply-function-in-pandas-it-shows-typeerror-must-be-string-not-fl
    # A fix that worked for one table would break on the next one,
    # so pd.to_datetime is used instead.
    # df['review_date'] = df.review_date.apply(lambda x : parser.parse(str(x)))
    # df['review_date'] = df.review_date.apply(parser.parse)
    df['review_date'] = pd.to_datetime(df['review_date'])

    # Use the date as the index
    df = df.set_index('review_date')

    ## sentiment analysis
    # func for polarity, in [-1, 1]; returns None if the text cannot be processed
    def sentiment_calc(text):
        try:
            return TextBlob(text).sentiment.polarity
        except Exception:
            return None

    # func for subjectivity, in [0, 1]
    def sentiment_calc_sub(text):
        try:
            return TextBlob(text).sentiment.subjectivity
        except Exception:
            return None

    df['polarity'] = df['review_body'].apply(sentiment_calc)
    df['subjectivity'] = df['review_body'].apply(sentiment_calc_sub)

    return df

tmp = s_c_f(pcfr)
g = tmp.reset_index()
g.head(2)
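
The "must be string, not float" error mentioned in the function's comments typically shows up when a date cell is empty, because pandas stores it as NaN (a float) and a string parser cannot handle that. Below is a minimal sketch of a more forgiving conversion, assuming it is acceptable for bad cells to become `NaT`. The sample values are made up.

```python
import pandas as pd

# Hypothetical raw date column with one missing and one malformed entry.
raw = pd.Series(["08/31/2015", None, "not a date", "01/05/2014"])

# errors="coerce" turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Inspect which rows failed, then drop them if appropriate.
bad_rows = parsed.isna()
clean = parsed.dropna()
```

Checking `bad_rows` before dropping anything shows how many reviews would be lost.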

This is what g looks like now. The last two columns, polarity and subjectivity, did not fit into the screenshot.
[Image: screenshot of g.head(2)]

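A rather nice plot: group by two columns of the DataFrame (the date goes on the x-axis and each star rating becomes its own band) and draw the mean of a third column as a stacked area chart. You cannot plot polarity directly this way, though, because its range is [-1, 1] and pandas raises ValueError: When stacked is True, each column must be either all positive or negative.1 contains both positive and negative values. Each column has to be entirely positive or entirely negative.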

g.groupby(
    ['review_date','star_rating']
)['subjectivity'].mean().unstack().plot(
    kind='area',
    figsize=(12,8),
    cmap="Blues", # defaults to orangish
)

This is what the plot looks like.
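
If you do want an area chart of polarity, one possible workaround is to map it from [-1, 1] onto [0, 1] first so that every value is non-negative. The sketch below uses invented data. `g_demo`, `polarity_01`, and `plot_polarity_area` are hypothetical names and not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for g; values are made up.
g_demo = pd.DataFrame({
    "review_date": pd.to_datetime(
        ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-02"]
    ),
    "star_rating": [1, 5, 1, 5],
    "polarity": [-0.6, 0.8, -0.2, 0.4],
})

# Map polarity from [-1, 1] onto [0, 1] so stacking is allowed.
g_demo["polarity_01"] = (g_demo["polarity"] + 1) / 2

# Same groupby/unstack shape as the subjectivity plot: dates as rows,
# one column per star rating.
table = (
    g_demo.groupby(["review_date", "star_rating"])["polarity_01"]
    .mean()
    .unstack()
)

def plot_polarity_area(tbl):
    """Draw the stacked area chart; needs matplotlib installed."""
    return tbl.plot(kind="area", figsize=(12, 8), cmap="Blues")
```

Calling `plot_polarity_area(table)` draws the chart. Another option that should also avoid the error is to keep the raw polarity and pass `stacked=False` to `plot`, which gives overlapping rather than stacked areas.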

# Build a new DataFrame that keeps only the columns needed
grey = g[['review_date', 'star_rating', 'polarity', 'subjectivity']].copy()

grey at this point.
[Image: screenshot of grey]

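# Covariance matrix; the datetime column is left out because newer pandas versions may refuse to include it
grey[['star_rating', 'polarity', 'subjectivity']].cov()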

[Image: screenshot of the covariance matrix]

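# Correlation coefficients (Pearson by default), again on the numeric columns only
grey[['star_rating', 'polarity', 'subjectivity']].corr()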
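
To make the link between the two tables concrete: the Pearson correlation is the covariance divided by the product of the two standard deviations. A small sketch on made-up numbers (`demo` is hypothetical data, not the review set):

```python
import numpy as np
import pandas as pd

# Invented values for illustration.
demo = pd.DataFrame({
    "star_rating": [1, 2, 3, 4, 5],
    "polarity": [-0.5, -0.1, 0.0, 0.3, 0.6],
})

cov = demo.cov()    # sample covariance, ddof=1
corr = demo.corr()  # Pearson correlation by default

# r = cov(x, y) / (std(x) * std(y)); Series.std() also uses ddof=1.
r_manual = cov.loc["star_rating", "polarity"] / (
    demo["star_rating"].std() * demo["polarity"].std()
)
```

Because covariance depends on the units of each column, the correlation matrix is usually the easier one to read when comparing star ratings with sentiment scores.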

[Image: screenshot of the correlation matrix]

import scipy.stats as stats
# Significance test: the first value is the Pearson correlation coefficient, the second is the p-value
stats.pearsonr(grey['star_rating'], grey['polarity'])

Out[53]:(0.41693928228069066, 0.0)
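
A p-value printed as 0.0 most likely means the true value is too small for a float to represent, which can happen with a large sample and a moderate correlation. It is unlikely to be exactly zero. Here is a minimal sketch of the same test on invented data, with missing values dropped first since `pearsonr` does not skip them. `demo`, `pair`, and the 0.05 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Made-up ratings and polarity scores, with one missing polarity.
demo = pd.DataFrame({
    "star_rating": [1, 2, 3, 4, 5, 5, 4, 2],
    "polarity": [-0.5, -0.1, 0.0, 0.3, 0.6, None, 0.2, -0.3],
})

# Drop rows where either column is missing before running the test.
pair = demo[["star_rating", "polarity"]].dropna()

# pearsonr returns the coefficient and the two-sided p-value.
r, p_value = stats.pearsonr(pair["star_rating"], pair["polarity"])

# Conventional 5% significance level (an assumption, pick what suits you).
significant = p_value < 0.05
```

The null hypothesis is that the two columns are uncorrelated, so a small p-value means the observed correlation is unlikely to be due to chance alone.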

Set review_date as the index, which makes the later groupby by date easier.

grey = grey.set_index('review_date')
# type(grey)
# grey.index
grey.head(2)

You can see the before and after.
[Image: screenshot comparing grey before and after set_index]

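...I have kind of forgotten what the chunk of code below was for. I think I wanted to count, for each year, how many times star_rating was 1, 2, 3, 4, and 5. After fiddling with it for ages without success, I just computed the mode instead. Reference: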

# Rows for a single year (partial-string indexing on the DatetimeIndex)
grey.loc['2015']

# Yearly mean of each numeric column
by_year = grey.groupby(grey.index.year)
tmp2 = by_year[['star_rating', 'polarity', 'subjectivity']].mean()

# Most frequent star rating per year; mode() can return several values
# on a tie, so take the first one. Computing it per group keeps each
# mode on the row of its own year (groupby sorts years in ascending order).
tmp2['star_mode'] = by_year['star_rating'].agg(lambda s: s.mode().iloc[0])
tmp2.head(20)
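
For the original goal, counting how often each star rating occurs in each year, one approach that might work is `groupby` plus `value_counts` followed by `unstack`. This is only a sketch on invented data. `demo` and its dates are hypothetical.

```python
import pandas as pd

# Toy frame with a DatetimeIndex, standing in for grey.
demo = pd.DataFrame(
    {"star_rating": [5, 5, 1, 4, 4, 4, 2]},
    index=pd.to_datetime([
        "2014-03-01", "2014-07-15", "2014-11-30",
        "2015-01-10", "2015-02-20", "2015-06-05", "2015-08-31",
    ]),
)

years = demo.index.year

# One row per year, one column per rating, cells hold the counts.
counts = (
    demo.groupby(years)["star_rating"]
    .value_counts()
    .unstack(fill_value=0)
)

# The per-year mode, for comparison with the table above.
mode_per_year = demo.groupby(years)["star_rating"].agg(
    lambda s: s.mode().iloc[0]
)
```

Ratings that never occur in any year simply do not get a column, and `fill_value=0` fills in year and rating combinations that are missing.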