How to get rid of punctuation using the NLTK tokenizer?

125

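I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words. How can I get rid of the punctuation? Also, word_tokenize doesn't work with multiple sentences: dots are attached to the last word.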

— 轻浮

12

Why don't you remove the punctuation yourself? nltk.word_tokenize(the_text.translate(None, string.punctuation)) should work in python2, while in python3 you can do nltk.word_tokenize(the_text.translate(dict.fromkeys(map(ord, string.punctuation)))).

— Bakuriu

3

This doesn't work. Nothing happens to the text.

— lizarisk

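The workflow assumed by NLTK is that you first tokenize into sentences and then tokenize each sentence into words. That is why word_tokenize() does not work with multiple sentences. To get rid of the punctuation, you can use a regular expression or Python's isalnum() function.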

— Suzana

2

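It does work: (>>> 'with dot.'.translate(None, string.punctuation) 'with dot' note that there is no dot at the end of the result). It may cause problems if you have things like 'end of sentence.No space', in which case do this instead: the_text.translate(string.maketrans(string.punctuation, ' '*len(string.punctuation))), which replaces all punctuation with spaces.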

— Bakuriu

Oops, it does work, but not with Unicode strings.

— lizarisk, 2013
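In Python 3 every str is Unicode, so the two-argument translate call discussed in these comments no longer applies. Below is a minimal sketch of the same idea (replacing each punctuation character with a space) built on str.maketrans. The sample text comes from the comment above; the variable names are only illustrative, and string.punctuation covers ASCII punctuation only.

```python
import string

# Map every ASCII punctuation character to a space (Python 3 str).
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

text = "end of sentence.No space"

# Replacing with spaces, rather than deleting, keeps "sentence" and "No" apart.
words = text.translate(table).split()
print(words)  # ['end', 'of', 'sentence', 'No', 'space']
```

Deleting the characters instead would glue the two words into "sentenceNo", which is the problem the comment warns about.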

162

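Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else: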

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']

— rmalouf

55

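Note that if you use this option, you lose the natural-language features specific to word_tokenize, such as splitting apart contractions. You can naively split on the regex \w+ without any need for NLTK.

— sffc, 2015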

3

To illustrate @sffc's comment, you might lose words such as "Mr."

— geekazoid

It turns "n't" into "t". How can I get rid of that?

— Ms. Ashikur Rahman
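The two points raised in these comments (splitting on \w+ needs no NLTK, and contractions get broken so that "n't" ends up as "t") can be seen with the standard library alone. This is only a sketch: the second pattern is a hypothetical variant that allows apostrophes inside a word, not something proposed in the thread, and the last two words of the sample text were added for illustration.

```python
import re

text = "Eighty-seven miles to go, yet.  Onward! I can't stop."

# Plain \w+: the same idea as RegexpTokenizer(r'\w+'), standard library only.
naive = re.findall(r"\w+", text)
print(naive)
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward', 'I', 'can', 't', 'stop']

# Hypothetical variant: allow an apostrophe inside a word so "can't" stays whole.
keep_contractions = re.findall(r"\w+(?:'\w+)*", text)
print(keep_contractions)
# ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward', 'I', "can't", 'stop']
```

Neither pattern keeps the period in an abbreviation such as "Mr.", so that loss remains.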

46

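You don't really need NLTK to remove punctuation. You can remove it with plain Python. For strings (Python 2 str):

import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)  # two-argument form: Python 2 byte strings only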

Or for unicode (this is also the form that works on a Python 3 str):

import string
translate_table = dict((ord(char), None) for char in string.punctuation)
s = s.translate(translate_table)  # translate() returns a new string

Then use this string in your tokenizer.

P.S. The string module has some other sets of characters that can be removed (like digits).

— 萨尔瓦多·达利

3

You can also remove all punctuation with a list comprehension: a = "*fa,fd.1lk#$"; print("".join([w for w in a if w not in string.punctuation]))

— 约翰尼·张
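The P.S. in the answer above mentions that the string module offers other character sets, such as digits. As a rough Python 3 sketch, the sets can be concatenated and handed to str.maketrans as its third argument (characters to delete). The sample sentence is made up for illustration.

```python
import string

s = "Room 42: call 555-1234, now!"

# Third argument of str.maketrans lists the characters to delete.
table = str.maketrans("", "", string.punctuation + string.digits)

cleaned = s.translate(table)
words = cleaned.split()
print(words)  # ['Room', 'call', 'now']
```

Because the characters are deleted rather than replaced, the cleaned string can contain runs of spaces; split() without arguments absorbs them.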

32

The code below removes all punctuation marks as well as non-alphabetic characters. Copied from their book.

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)

words=[word.lower() for word in words if word.isalpha()]

print(words)

Output

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

— Madura Pradeep

17

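Just be aware that with this method you lose the word "not" in cases like "can't" or "don't", which may be very important for understanding and classifying the sentence. It is better to use sentence.translate(string.maketrans("", ""), chars_to_remove), where chars_to_remove can be ".,':;!?"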

— MikeL

3

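@MikeL You can get around words like "can't" and "don't" by importing contractions and calling contractions.fix(sentence_here) before tokenizing. It turns "can't" into "cannot" and "don't" into "do not".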

— zipline86 '19
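The string.maketrans call in @MikeL's comment is Python 2. A possible Python 3 counterpart, sketched here under that assumption, passes the characters to delete as the third argument of str.maketrans. The chars_to_remove value is the one from the comment; the sentence is borrowed from the answer above.

```python
# Characters suggested in the comment; note that the apostrophe is included.
chars_to_remove = ".,':;!?"
table = str.maketrans("", "", chars_to_remove)

sentence = "I can't do this now, because I'm so tired."
cleaned = sentence.translate(table)
words = cleaned.split()
print(words)
# ['I', 'cant', 'do', 'this', 'now', 'because', 'Im', 'so', 'tired']
```

With this approach "can't" survives as the single token "cant" instead of being cut into pieces, though it is not expanded to "cannot"; expanding is what the contractions suggestion addresses.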

16

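As noted in the comments, start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have a unicode string, make sure it is a unicode object (not a "str" encoded with some encoding like "utf-8").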

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
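# Python 3: filter() returns an iterator, so wrap it in list() before printing.
# `word not in ',-'` is a substring test; it drops only the tokens ',', '-' and ',-'.
print(list(filter(lambda word: word not in ',-', tokens)))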

— 帕洛

14

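Most of the complexity in the Penn Treebank tokenizer has to do with handling punctuation properly. Why use an expensive tokenizer that handles punctuation well if you are only going to strip the punctuation out?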

— rmalouf

3

word_tokenize is a function that returns [token for sent in sent_tokenize(text, language) for token in _treebank_word_tokenize(sent)]. So I think your answer is doing what nltk already does: using sent_tokenize() before using word_tokenize(). At least this is the case for nltk3.

— Kurt Bourbaki

2

@rmalouf because you don't need punctuation-only tokens? So you want did and n't but not .

— 西普里安Tomoiagă

11

I just used the following code, which removed all the punctuation:

import nltk

# raw is assumed to hold the input text as a string
tokens = nltk.wordpunct_tokenize(raw)

type(tokens)

text = nltk.Text(tokens)

type(text)  

words = [w.lower() for w in text if w.isalpha()]

— 愿

2

Why convert the tokens to a Text?

— Sadik
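On that question: the final list comprehension only needs an iterable of strings, so it runs on the plain token list just as well. The sketch below avoids NLTK altogether by using re.findall with the pattern \w+|[^\w\s]+, which, as far as I know, is the pattern behind wordpunct_tokenize; treat that equivalence as an assumption. The sample sentence is borrowed from another answer in this thread.

```python
import re

raw = "I can't do this now, because I'm so tired."

# Assumption: this pattern mirrors what nltk.wordpunct_tokenize does.
tokens = re.findall(r"\w+|[^\w\s]+", raw)
print(tokens)
# ['I', 'can', "'", 't', 'do', 'this', 'now', ',', 'because', 'I', "'", 'm', 'so', 'tired', '.']

# The comprehension works directly on the list; no nltk.Text wrapper is involved.
words = [w.lower() for w in tokens if w.isalpha()]
print(words)
# ['i', 'can', 't', 'do', 'this', 'now', 'because', 'i', 'm', 'so', 'tired']
```

Note how this style of tokenizing splits "can't" into "can", "'" and "t", so the contraction is lost once the punctuation tokens are filtered out.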

6

I think you need some sort of regular expression matching (the following code is in Python 3):

import string
import re
import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time."
l = nltk.word_tokenize(s)
# re.escape keeps characters such as ']' and '\' literal inside the character class
ll = [x for x in l if not re.fullmatch('[' + re.escape(string.punctuation) + ']+', x)]
print(l)
print(ll)

Output:

['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

This should work well in most cases, since it removes punctuation while preserving tokens like "n't", which cannot be obtained from regex tokenizers such as wordpunct_tokenize.

— 全干

This will also remove things like ... and -- while preserving contractions, which s.translate(None, string.punctuation) won't

— CJ杰克逊

5

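Sincerely asking, what is a word? If you assume that a word consists of alphabetic characters only, you are wrong, because words such as can't will be broken into pieces (such as can and t) if you remove punctuation before tokenization, which is very likely to affect your program negatively.

Hence the solution is to tokenize first and then remove the punctuation tokens.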

import string

from nltk.tokenize import word_tokenize

tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']

tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']

...and then, if you wish, you can replace certain tokens, such as 'm with am.

— 波拉·M·阿尔珀
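One detail of the filter above: token not in string.punctuation is a substring test against the punctuation string, so a multi-character token such as "..." or "--" passes through. A small sketch of the difference, using a hypothetical token list rather than real tokenizer output:

```python
import string

# Hypothetical token list, shaped like what a tokenizer might produce.
tokens = ["Wait", "...", "really", "--", "I", "'m", "ok", "."]

# Substring test from the answer: drops only tokens found verbatim in string.punctuation.
substring_filtered = [t for t in tokens if t not in string.punctuation]
print(substring_filtered)
# ['Wait', '...', 'really', '--', 'I', "'m", 'ok']

# Stricter alternative: drop a token when every one of its characters is punctuation.
all_punct_filtered = [
    t for t in tokens if not all(ch in string.punctuation for ch in t)
]
print(all_punct_filtered)
# ['Wait', 'really', 'I', "'m", 'ok']
```

Both variants keep "'m", because it contains a letter.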

4

I use this code to remove punctuation:

import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print(tokens)
    print(words)

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")

And if you want to check whether a token is a valid English word, you may need PyEnchant.

Tutorial:

 import enchant
 d = enchant.Dict("en_US")
 d.check("Hello")
 d.check("Helo")
 d.suggest("Helo")

— 真女5

2

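Beware that this solution kills contractions. That is because word_tokenize uses the standard tokenizer, TreebankWordTokenizer, which splits contractions (e.g. can't into ca and n't). However, n't is not alphanumeric and gets lost in the process.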

— Diego Ferri 18'Jan
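To make that concrete without running a tokenizer, the sketch below starts from the token list shown in the word_tokenize output earlier on this page and compares isalnum() with a more lenient test. The lenient test (keep a token if it contains at least one alphanumeric character) is my own illustrative alternative, not something the comment proposes.

```python
# Tokens as word_tokenize splits them, copied from the output shown earlier on this page.
tokens = ["I", "ca", "n't", "do", "this", "now", ",",
          "because", "I", "'m", "so", "tired", "."]

# isalnum() requires every character to be alphanumeric, so "n't" and "'m" are dropped.
strict = [t.lower() for t in tokens if t.isalnum()]
print(strict)
# ['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired']

# Lenient alternative: keep any token that has at least one alphanumeric character.
lenient = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]
print(lenient)
# ['i', 'ca', "n't", 'do', 'this', 'now', 'because', 'i', "'m", 'so', 'tired']
```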

1

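Remove punctuation (this also removes ".", as part of the punctuation handling) using the code below:

import sys
import unicodedata
from nltk.tokenize import word_tokenize

tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
text_string = text_string.translate(tbl)  # text_string no longer contains punctuation
w = word_tokenize(text_string)  # now tokenize the string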

Sample input/output:

direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']

— ascii_walker

Thank you very much
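What the category-based table in this answer adds over string.punctuation is coverage of non-ASCII punctuation such as curly quotes, dashes and the ellipsis character. A self-contained sketch of the comparison follows; the sample text is made up, apart from "3.89 cr", which is taken from the sample input above and shows why 3.89 comes out as 389.

```python
import string
import sys
import unicodedata

# Table that deletes every code point whose Unicode category starts with 'P'.
punct_table = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

# Curly quotes, an em dash and an ellipsis character, written as escapes.
text = "\u201cHello\u201d \u2014 world\u2026 3.89 cr"

# ASCII-only removal leaves the non-ASCII punctuation in place.
ascii_only = text.translate(str.maketrans("", "", string.punctuation))

# Category-based removal strips it; the decimal point goes too.
unicode_aware = text.translate(punct_table)
print(unicode_aware.split())  # ['Hello', 'world', '389', 'cr']
```

Building the table walks the whole Unicode range once, so in a real program it is worth constructing it a single time and reusing it.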

1

Just adding to @rmalouf's solution: this will not include any numbers, because \w+ is equivalent to [a-zA-Z0-9_]

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'[a-zA-Z]')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

— Himanshu Aggarwal

This one creates one token for each letter.

— Rishabh Gupta
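The per-letter behaviour comes from the missing + quantifier: [a-zA-Z] matches exactly one character at a time. The sketch below shows the difference with re.findall, on the assumption that RegexpTokenizer in its default mode collects matches the same way.

```python
import re

text = "Eighty-seven miles to go, yet.  Onward!"

# Without a quantifier, every single letter is its own match.
per_letter = re.findall(r"[a-zA-Z]", text)
print(per_letter[:6])  # ['E', 'i', 'g', 'h', 't', 'y']

# With +, each run of letters becomes one token, and digits are still excluded.
per_word = re.findall(r"[a-zA-Z]+", text)
print(per_word)  # ['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
```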

1

You can do it in one line without nltk (python 3.x).

import string
string_text= string_text.translate(str.maketrans('','',string.punctuation))

— 尼桑·维克拉玛拉特那