[Data Mining] 1. LDA-Based User Interest Modeling (Interest Tag Generation Model): A User Interest Mining Model

!pip install gensim -i https://mirrors.aliyun.com/pypi/simple/
!pip install  pyLDAvis  -i https://mirrors.aliyun.com/pypi/simple/
!pip install snownlp -i https://mirrors.aliyun.com/pypi/simple/
!pip install wordcloud -i https://mirrors.aliyun.com/pypi/simple/ 
......

Install any other dependencies as you need them; they are not all listed here.

1.2 Importing dependencies

Import the dependencies into the environment:

import numpy as np
import pandas as pd
import jieba
import gensim
from gensim.models import LdaModel, CoherenceModel
from gensim.corpora import Dictionary
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from PIL import Image
import warnings
import codecs
import re
import time
import matplotlib
from snownlp import SnowNLP

# Set a font that can render Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

# Suppress warnings
warnings.filterwarnings("ignore")

2. Data preprocessing

2.1 Loading the data

First, load the dataset. It contains a redundant column, "Unnamed: 0", so drop that column right after loading:

df=pd.read_csv("../datasets/extopic.csv")
df = df.drop(columns=['Unnamed: 0'])
df

[Screenshot of the output]
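An "Unnamed: 0" column usually appears when a CSV was saved together with its index. The sketch below assumes that is the case here, which the source does not confirm, and uses a small hypothetical CSV string rather than extopic.csv. It shows that passing index_col=0 to pd.read_csv is one possible way to avoid the extra column without a separate drop.

```python
import io
import pandas as pd

# Hypothetical CSV text shaped like a file saved with DataFrame.to_csv():
# the first header cell is empty, which pandas names "Unnamed: 0" on load.
csv_text = ",NewMessage\n0,第一条消息\n1,第二条消息\n"

# Default load: the saved index comes back as an extra "Unnamed: 0" column.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Alternative: treat the first column as the index, so nothing needs dropping.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df_demo.columns))
```

Dropping the column explicitly, as the post does, gives the same set of data columns; the difference is only whether the saved values are kept as the index.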

2.2 Handling anomalous values

Handle anomalous values (here, missing values in the NewMessage column):

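# Handle missing values
print("The NewMessage column contains %d null values." % df['NewMessage'].isnull().sum())
df[df.isnull().any(axis=1)]  # rows with at least one null; displayed only when run as the last expression of a cell
df = df[pd.notnull(df['NewMessage'])].copy()  # keep rows whose NewMessage is not null

df['content'] = df['NewMessage'].astype(str)  # convert every value to str
df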

[Screenshot of the output]

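2.3 Chinese text processing

Chinese text processing consists of extracting the Chinese characters, filtering out single characters, and removing duplicate rows: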

# Chinese text processing
def extract_chinese(text):
    # Match runs of consecutive Chinese characters (U+4E00 to U+9FA5)
    chinese_pattern = re.compile(r'[\u4e00-\u9fa5]+')
    chinese_words = chinese_pattern.findall(text)
    return ' '.join(chinese_words)

df['content'] = df['content'].apply(extract_chinese)
# Drop runs that consist of a single character
df['content'] = df['content'].apply(lambda x: ' '.join([word for word in x.split() if len(word) > 1]))
df.drop_duplicates(inplace=True)  # remove duplicate rows
df.head()

[Screenshot of the output]
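To make the effect of this cleaning step concrete, here is a minimal sketch that applies the same regular expression and the same length filter to one made-up string. The helper name clean_text and the sample sentence are illustrative only and do not come from the dataset.

```python
import re

# Same character range as extract_chinese above.
CHINESE_RUN = re.compile(r'[\u4e00-\u9fa5]+')

def clean_text(text):
    """Keep runs of Chinese characters that are longer than one character."""
    runs = CHINESE_RUN.findall(text)
    return ' '.join(run for run in runs if len(run) > 1)

# Hypothetical sample mixing Chinese, Latin letters, digits and punctuation.
sample = "今天天气good真好123，我 爱NLP"
cleaned = clean_text(sample)
print(cleaned)  # 今天天气 真好
```

Letters, digits and punctuation act as separators, so "我" and "爱" each form a one-character run and are removed by the length filter.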

2.4 Sentiment analysis

Sentiment analysis estimates the sentiment polarity of each text:

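# Map a sentiment score to a label (defined here, but not applied in this cell)
def classify_sentiment(sentiment):
    if sentiment >= 0.5:
        return 'positive'
    else:
        return 'negative'

# Compute the SnowNLP sentiment score of a text
def classify_sentiments(text):
    if pd.isna(text) or text.strip() == '':
        return None  # return None (or another default) for empty text
    s = SnowNLP(text)
    sentiment = s.sentiments
    return sentiment

# Handle null values
df['content'] = df['content'].fillna('')  # fill nulls with an empty string

# Score the sentiment of every row in the content column
df['sentiment'] = df['content'].apply(classify_sentiments)

# Drop rows that have no sentiment score
df = df.dropna(subset=['sentiment'])
df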

[Screenshot of the output]
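The cell above stores a numeric score in the sentiment column, while classify_sentiment is defined but never called. As a rough sketch of how the 0.5 threshold could turn scores into labels, the example below uses hypothetical score values (not real SnowNLP output) and plain Python lists instead of the DataFrame.

```python
def classify_sentiment(score):
    """Label a score as 'positive' when it is at least 0.5, else 'negative'."""
    return 'positive' if score >= 0.5 else 'negative'

# Hypothetical scores in [0, 1]; None stands for a text that could not be scored.
scores = [0.91, 0.5, 0.12, None]

# Leave missing scores unlabeled instead of comparing None with a number.
labels = [classify_sentiment(s) if s is not None else None for s in scores]
print(labels)  # ['positive', 'positive', 'negative', None]
```

On the DataFrame, the equivalent would be to apply classify_sentiment to the sentiment column after the rows without a score have been dropped.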

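2.5 jieba word segmentation

(1) Inspecting the processed text

Before segmenting with jieba, we can look at the cleaned sentences in full, for example the first five: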

content = df['content'].values.tolist()
content[:5]

[Screenshot of the output]

(2) jieba segmentation

Segment the text with jieba and output the first ten words:

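segment = []
for line in content:
    try:
        segs = jieba.lcut(line)  # segment the line into words
        for seg in segs:
            # keep tokens longer than one character, skipping line breaks
            if len(seg) > 1 and seg != '\r\n':
                segment.append(seg)
    except Exception:
        print(line)  # print the line that failed to segment
        continue
segment[:10]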

[Screenshot of the output]

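(3) Stop-word filtering

Read the stop-word list, then filter out the stop words. The list used here is this one: stop-word list

The filtering works as follows: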

# Read the stop-word file
try:
    stopwords = pd.read_csv(r'../datasets/stop_words.txt', header=None, encoding='utf-8', names=['stopword'], on_bad_lines='skip')
except pd.errors.ParserError as e:
    print(f"ParserError: {e}")
    stopwords = pd.read_csv(r'../datasets/stop_words.txt', header=None, encoding='utf-8', names=['stopword'], delimiter='\t', on_bad_lines='skip')

# Strip surrounding whitespace so each entry is a single word or phrase
stopwords = stopwords['stopword'].str.strip()

words_df = pd.DataFrame({'segment': segment})

# Filter out the stop words
words_df = words_df[~words_df['segment'].isin(stopwords)]

# Show the first few rows
words_df.head()

[Screenshot of the output]
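pd.read_csv splits on commas by default, so a stop-word line that contains a comma may not be read as a single entry. As a possible alternative, the sketch below reads a one-word-per-line file directly into a set and filters a token list against it. The helper load_stopwords, the temporary file and the sample tokens are all hypothetical, and the sketch assumes the stop-word file holds one entry per line.

```python
import os
import tempfile

def load_stopwords(path, encoding='utf-8'):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}

# Build a small hypothetical stop-word file to run against.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stop_words_demo.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write("的\n,\n我们\n\n")
    stopword_set = load_stopwords(path)

# Hypothetical tokens, as they might come out of the segmentation step.
tokens = ['我们', '喜欢', '数据', '挖掘']
filtered = [t for t in tokens if t not in stopword_set]
print(filtered)  # ['喜欢', '数据', '挖掘']
```

A set built this way can also be passed to words_df['segment'].isin(...) in place of the pandas Series.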

3. Generating the word cloud

3.1 Counting words

Next, count how often each word occurs and sort the words by count in descending order:

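# Count the occurrences of each word; the count column is named "计数" ("count")
words_stat = words_df.groupby(by=['segment'])['segment'].agg([("计数", np.size)])
# Sort by count, from highest to lowest
words_stat = words_stat.reset_index().sort_values(by=["计数"], ascending=False)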
words_stat.head()
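The same count-and-sort idea can be checked on a small scale with the standard library. The sketch below uses collections.Counter on a hypothetical token list; it illustrates the logic and is not a replacement for the pandas version above.

```python
from collections import Counter

# Hypothetical tokens standing in for the filtered segment column.
tokens = ['数据', '模型', '数据', '兴趣', '模型', '数据']

# Count each token, then list (word, count) pairs from most to least frequent.
counts = Counter(tokens)
ranked = counts.most_common()
print(ranked)  # [('数据', 3), ('模型', 2), ('兴趣', 1)]
```

Each pair corresponds to one row of words_stat: the word in the segment column and its frequency in the "计数" column.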