Python: Word-Frequency Counting in 10 Lines of Code

Latest recommended article published 2026-05-05 09:42:47

Original article


Tags

#collections #functions #libraries #python #Counter


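This article shows how to use the Counter class from Python's collections module to count how often each word occurs in English e-books, and how to merge the per-book counts into a single dictionary-style result.

Problem: given two English e-books (which also contain some lines of Chinese), count the occurrences of each word in each book, then add the two sets of counts together. The result should be presented as a dictionary: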

{'the': 2154, 'and': 1394, 'to': 1080, 'of': 871, 'a': 861, 'his': 639, 'The': 637, 'in': 515, 'he': 461, 'with': 310, 'that': 308, 'you': 295, 'for': 280, 'A': 269, 'was': 258, 'him': 246, 'I': 234, 'had': 220, 'as': 217, 'not': 215, 'by': 196, 'on': 189, 'it': 178, 'be': 164, 'at': 153, 'from': 149, 'they': 149, 'but': 149, 'is': 144, 'her': 144, 'their': 143, 'who': 131, 'all': 121, 'one': 119, 'which': 119,}  # partial results shown
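To make the "count each, then add" step concrete, here is a minimal sketch that uses two short in-memory strings in place of the e-books. The sample strings and variable names are made up for illustration; `Counter` and its `+` operator come from the standard `collections` module.

```python
from collections import Counter

# Hypothetical stand-ins for the text of the two books.
text_a = "the cat and the dog"
text_b = "the bird and a fish"

# Count whitespace-separated words in each text separately.
count_a = Counter(text_a.split())
count_b = Counter(text_b.split())

# Adding two Counters sums the counts key by key.
total = count_a + count_b

print(dict(total))
print(total.most_common(2))  # [('the', 3), ('and', 2)]
```

Because the words are split on whitespace only, 'the' and 'The' are counted as different keys, which matches the sample result above.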

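With Python's powerful standard library, the solution takes only 10 lines of code (the two documents used in this article can be downloaded from http://pan.baidu.com/s/1pKuO7fP):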

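import re, collections

def get_words(file):
    with open(file) as f:
        words_box = []
        for line in f:
            # Keep only lines that start with an ASCII letter or digit, so Chinese lines are skipped.
            # `+` rather than `*`: `*` also matches the empty string, so it would accept every line.
            if re.match(r'[a-zA-Z0-9]+', line):
                words_box.extend(line.strip().split())
    return collections.Counter(words_box)

print(get_words('emma.tx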
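
One detail worth checking in a line filter like this is the regex quantifier. As a small sketch (the sample lines and the helper name `starts_with_ascii` are invented for illustration), `re.match` with `*` succeeds on any line, because a zero-length match at the start of the string still counts as a match. Requiring at least one leading ASCII letter or digit is what skips lines that begin with Chinese characters.

```python
import re

# Invented sample lines: one English, one starting with Chinese.
lines = ["Emma Woodhouse, handsome, clever", "第一章 爱玛"]

def starts_with_ascii(line, pattern):
    """Return True if `pattern` matches at the start of `line`."""
    return re.match(pattern, line) is not None

# `*` allows zero characters, so every line matches.
star_results = [starts_with_ascii(l, r'[a-zA-Z0-9]*') for l in lines]

# `+` requires at least one ASCII letter or digit at the start.
plus_results = [starts_with_ascii(l, r'[a-zA-Z0-9]+') for l in lines]

print(star_results)  # [True, True]
print(plus_results)  # [True, False]
```

Note that the stricter pattern also skips English lines that begin with a quotation mark or leading whitespace; whether that matters depends on how the e-book text is laid out.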