from sklearn.feature_extraction.text import TfidfVectorizer

def make_corpus(doc_files):
    # Generator: yields one document at a time instead of loading them all
    for doc in doc_files:
        yield load_doc_from_file(doc)  # load_doc_from_file is a custom function for loading a doc from file

file_list = ...  # list of files you want to load
corpus = make_corpus(file_list)
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(corpus)
Yes, you can: just make your corpus an iterator. For example, if your documents reside on disk, you can define an iterator that takes the list of file names as an argument and returns the documents one by one, without loading everything into memory at once.
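One practical detail: a plain generator is exhausted after a single pass, so it can be fitted on but not reused for a later transform. A minimal sketch of one way around that, assuming the documents are UTF-8 text files, is to wrap the file list in a small class whose `__iter__` restarts the pass each time. The `FileCorpus` class and the demo files below are hypothetical and only illustrate the idea.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer


class FileCorpus:
    """Re-iterable corpus: every pass re-opens the files and yields one document at a time."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __iter__(self):
        for path in self.paths:
            # Assumption: each document is a UTF-8 encoded text file.
            with open(path, encoding="utf-8") as handle:
                yield handle.read()


# Hypothetical demo data: three small text files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
texts = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
paths = []
for i, text in enumerate(texts):
    path = os.path.join(tmp_dir, f"doc_{i}.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    paths.append(path)

corpus = FileCorpus(paths)
vectorizer = TfidfVectorizer(min_df=1)
vectorizer.fit(corpus)                # first pass: learns the vocabulary and IDF weights
tfidf = vectorizer.transform(corpus)  # second pass: works because FileCorpus restarts
```

Only one document's text is held in memory at a time during each pass, although the learned vocabulary and the resulting sparse matrix still have to fit in memory. If the documents are plain files, `TfidfVectorizer(input='filename')` fitted directly on the list of paths is another option worth considering.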
https://stackoverflow.com/questions/16453855/tfidfvectorizer-for-corpus-that-cannot-fit-in-memory
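This article describes how to use TfidfVectorizer on a large document collection that cannot be loaded into memory all at once, by defining an iterator that reads the documents one by one and so avoids running out of memory.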




