Stopword removal with NLTK

Question 1

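I am trying to process user-entered text by removing stopwords with the nltk toolkit, but stopword removal also strips words like "and", "or" and "not". I want these words to survive stopword removal, because they are operators needed later when the text is processed as a query. I don't know which words can act as operators in a text query, and I also want to remove unnecessary words from my text.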

Question 2

I suggest you create your own list of operator words and take them out of the stopword list. Sets can be conveniently subtracted, so:

operators = set(('and', 'or', 'not'))
stop = set(stopwords...) - operators

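Then you can simply test whether a word is in or not in the set, without relying on whether your operators happen to be part of the stopword list. This also lets you switch to another stopword list or add an operator later.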

if word.lower() not in stop:
    # use word
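
Put together, the approach might look like the sketch below. To keep it runnable without downloading the NLTK corpus, base_stop is a tiny hand-written stand-in for stopwords.words('english'), and filter_query is a hypothetical helper name; neither comes from the answer above.

```python
# Small stand-in for stopwords.words('english') (assumption, for illustration only)
base_stop = {'a', 'an', 'the', 'is', 'and', 'or', 'not', 'of', 'to'}

# Words that must survive because they act as query operators
operators = {'and', 'or', 'not'}
stop = base_stop - operators

def filter_query(text):
    """Drop stopwords but keep operator words (hypothetical helper)."""
    return [word for word in text.split() if word.lower() not in stop]

print(filter_query("the cat and the dog not a bird"))
# ['cat', 'and', 'dog', 'not', 'bird']
```

With the real NLTK list you would only swap base_stop for set(stopwords.words('english')); the subtraction and the membership test stay the same.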

Question 3

NLTK has a built-in stopword list made up of 2,400 stopwords for 11 languages (Porter et al.), see http://nltk.org/book/ch02.html

>>> from nltk import word_tokenize
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> sentence = "this is a foo bar sentence"
>>> print([i for i in sentence.lower().split() if i not in stop])
['foo', 'bar', 'sentence']
>>> [i for i in word_tokenize(sentence.lower()) if i not in stop] 
['foo', 'bar', 'sentence']

I recommend looking at using tf-idf to remove stopwords, see "Effects of Stemming on the term frequency?"
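
To give a rough idea of what the tf-idf suggestion could mean in practice, here is a simplified sketch that uses only the idf part: words that occur in almost every document get an idf near zero and can be treated as stopwords. The function name low_idf_words, the max_idf threshold and the toy documents are all assumptions made for illustration, not something from the linked question.

```python
import math
from collections import Counter

def low_idf_words(documents, max_idf=0.3):
    """Return words whose inverse document frequency is at most max_idf."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document
        doc_freq.update(set(doc.lower().split()))
    return {w for w, df in doc_freq.items() if math.log(n_docs / df) <= max_idf}

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the fish swam away",
]
print(sorted(low_idf_words(docs)))
# ['the']
```

On a real corpus the threshold would need tuning, and a proper tokenizer would replace the plain split().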

Question 4

@alvas's answer does the job, but it can be done much faster. Assume you have documents: a list of strings.

from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}']) # remove it if you need punctuation 

for doc in documents:
    list_of_words = [i.lower() for i in wordpunct_tokenize(doc) if i.lower() not in stop_words]

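Note that because you are searching in a set here (not in a list), the lookup would theoretically be len(stop_words)/2 times faster, which matters if you need to process many documents.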

For 5000 documents of approximately 300 words each, the difference is 1.8 seconds for my example versus 20 seconds for @alvas's.
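
If you want to check the set-versus-list difference on your own machine, a minimal timing sketch could look like this. The stopword list here is synthetic (about 180 made-up entries, an assumption) so that the snippet runs without NLTK data; actual timings will vary.

```python
import timeit

# Synthetic stand-in for a stopword list (assumption: about 180 entries)
stop_list = ['word%d' % i for i in range(180)]
stop_set = set(stop_list)
tokens = ['token%d' % i for i in range(300)]  # none of these are stopwords

def filter_with(container):
    # Same filtering logic for both containers; only the lookup cost differs
    return [t for t in tokens if t not in container]

list_time = timeit.timeit(lambda: filter_with(stop_list), number=200)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=200)
print('list: %.4fs  set: %.4fs' % (list_time, set_time))
```

Since none of the tokens are stopwords, every list lookup has to scan the whole list, which is the worst case for the list version.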

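P.S. In most cases you split the text into words in order to perform some other classification task that uses tf-idf. So it would most probably be better to use a stemmer as well: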

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

and use [porter.stem(i.lower()) for i in wordpunct_tokenize(doc) if i.lower() not in stop_words] inside the loop.

Question 5

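@alvas has a good answer. But again, it depends on the nature of the task. Suppose, for example, that in your application you want to treat every conjunction (e.g. and, or, but, if, while) and every determiner (e.g. the, a, some, most, every, no) as a stopword, and consider all other parts of speech legitimate. Then you might want to look into a solution that uses the part-of-speech tagset to discard words, see table 5.1: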

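import nltk

# nltk.pos_tag returns Penn Treebank tags by default, not the simplified
# DET/CNJ tags: 'DT' = determiner, 'CC' = coordinating conjunction.
# Subordinating conjunctions such as "if" share the tag 'IN' with prepositions.
STOP_TYPES = {'DT', 'CC'}

text = "some data here "
tokens = nltk.pos_tag(nltk.word_tokenize(text))
good_words = [w for w, wtype in tokens if wtype not in STOP_TYPES]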

Question 6

You can use string.punctuation together with the built-in NLTK stopword list:

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

def tokenize(text):
    # Split into sentences, then into words, and flatten into one token list
    return [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]

def removeStopWords(words):
    # English stopwords plus every single punctuation character
    customStopWords = set(stopwords.words('english') + list(punctuation))
    return [word for word in words if word not in customStopWords]

# text is the input string you want to clean
words = tokenize(text)
wordsWOStopwords = removeStopWords(words)

Complete list of NLTK stopwords