Removing Stop Words from English Text

This content is licensed under CC 4.0 BY-SA.

Tags

#nlp

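This post walks through an example of stop-word removal using Python's nltk library. It filters English stop words out of a passage about removing amoxicillin from aqueous solution, and shows the whole process from tokenizing the text to applying the stop-word filter.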

You need to install nltk first. After that you also need the stopwords corpus, which has to sit under the corpora folder.
The folder must be in exactly the right place. Otherwise it is a real pain: you will keep getting an error saying that stopwords cannot be found.
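If placing the folder by hand keeps failing, one alternative is to let nltk report where it searches and download whatever is missing. The following is only a minimal sketch: the helper name `ensure_nltk_data` and the `REQUIRED` table are made up for illustration, and it assumes a working internet connection. Newer NLTK releases may ask for `punkt_tab` instead of `punkt`, in which case the table would need adjusting.

```python
# Resource path inside the nltk_data folder -> package name for nltk.download
# (hypothetical table, adjust it to the resources you actually need)
REQUIRED = {
    "corpora/stopwords": "stopwords",
    "tokenizers/punkt": "punkt",
}


def ensure_nltk_data(required=REQUIRED):
    """Download each required NLTK resource that cannot be found locally."""
    # Imported here so the helper can be defined before nltk is installed
    import nltk

    # These are the folders nltk searches; the corpora folder must be in one of them
    print("NLTK searches these folders:", nltk.data.path)

    for resource_path, package in required.items():
        try:
            nltk.data.find(resource_path)  # raises LookupError if missing
        except LookupError:
            nltk.download(package)
```

Calling `ensure_nltk_data()` once before the main script should be enough; the printed folder list also shows where a manually downloaded stopwords folder would have to go.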

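import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Uncomment on the first run if the data is not installed yet
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'):
# nltk.download('stopwords')
# nltk.download('punkt')

# Insert the text to be filtered here
text = """Removal of amoxicillin from aqueous solution using sludge-based activated carbon modified ."""

# English stop words, stored as a set for fast membership tests
stop_words = set(stopwords.words('english'))
# Split the text into word and punctuation tokens
word_tokens = word_tokenize(text)

# Keep every token that is not a stop word. The comparison is case-sensitive,
# so capitalized stop words such as "The" are kept.
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print("\n\nFiltered Sentence \n\n")
print(" ".join(filtered_sentence))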

The output is shown below. It comes from the full text; the sample in the code above is abbreviated.

Filtered Sentence 


Removal amoxicillin aqueous solution using sludge-based activated carbon modified walnut shell nano-titanium dioxide . Dewatered municipal sludge used raw material prepare activated carbon ( SAC ) , SAC modified walnut shell nano-titanium dioxide ( MSAC ) . The results showed MSAC higher specific surface area ( S-BET ) ( 279.147 ( 2 ) /g ) total pore volume ( V-T ) ( 0.324 cm ( 3 ) /g ) SAC . 

Process finished with exit code 0
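The output above still contains "The" and punctuation tokens such as "(" and ".", because the filter compares tokens case-sensitively and only against the stop-word list. A possible variation, not part of the original script, is to lowercase each token before the lookup and to drop tokens that contain no letters or digits. The function name `filter_tokens` and its `drop_punctuation` parameter are hypothetical; the stop-word set is passed in, so in practice it would be `set(stopwords.words('english'))`.

```python
def filter_tokens(tokens, stop_words, drop_punctuation=True):
    """Return the tokens that are not stop words, compared case-insensitively."""
    kept = []
    for token in tokens:
        # Lowercase only for the lookup, so the original casing is preserved
        if token.lower() in stop_words:
            continue
        # Optionally skip tokens with no letter or digit, such as "(" or "."
        if drop_punctuation and not any(ch.isalnum() for ch in token):
            continue
        kept.append(token)
    return kept


# Small hand-made example; with nltk the tokens would come from word_tokenize(text)
sample_tokens = ["The", "results", "showed", "that", "MSAC", "had", "(", "SAC", ")", "."]
sample_stop_words = {"the", "that", "had"}
print(" ".join(filter_tokens(sample_tokens, sample_stop_words)))
```

Hyphenated tokens such as "sludge-based" are kept, since they contain letters. Whether numbers and units should also survive depends on what the filtered text is used for.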

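I am a complete beginner myself, a master's student from a humanities background...
I am just keeping a record of how I process my data.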