我正在寻找一种将文本拆分为n-gram的方法。通常我会做类似的事情:
import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print string_bigrams
我知道nltk仅提供二元组和三元组,但是有没有办法将我的文本分为四克,五克甚至一百克?
谢谢!
我正在寻找一种将文本拆分为n-gram的方法。通常我会做类似的事情:
import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print string_bigrams
我知道nltk仅提供二元组和三元组,但是有没有办法将我的文本分为四克,五克甚至一百克?
谢谢!
Answers:
其他用户提供的基于本地Python的出色答案。但是这就是nltk
方法(以防万一,OP会因为重新发明nltk
库中已经存在的内容而受到惩罚)。
有一个NGRAM模块,人们很少使用nltk
。这不是因为很难读取ngram,而是基于ngram训练模型,其中n> 3将导致大量数据稀疏。
from nltk import ngrams
sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
print grams
sixgrams
吗?
我很惊讶这还没有出现:
In [34]: sentence = "I really like python, it's pretty awesome.".split()
In [35]: N = 4
In [36]: grams = [sentence[i:i+N] for i in xrange(len(sentence)-N+1)]
In [37]: for gram in grams: print gram
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
仅使用nltk工具
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
def get_ngrams(text, n ):
n_grams = ngrams(word_tokenize(text), n)
return [ ' '.join(grams) for grams in n_grams]
输出示例
get_ngrams('This is the simplest text i could think of', 3 )
['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']
为了使ngram保持数组格式,只需删除 ' '.join
这是做n-gram的另一种简单方法
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize,2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams=ngrams(tokenize,3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams=ngrams(tokenize,4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]
对于需要二元组或三元组的情况,人们已经很好地回答了,但是在这种情况下,如果需要句子的每一个整组,您可以使用nltk.util.everygrams
>>> from nltk.util import everygrams
>>> message = "who let the dogs out"
>>> msg_split = message.split()
>>> list(everygrams(msg_split))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]
如果您有一个限制,如三字母组的最大长度应为3,则可以使用max_len参数来指定它。
>>> list(everygrams(msg_split, max_len=2))
[('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]
您可以修改max_len参数以实现任意克,即4克,5克,6甚至100克。
可以修改前面提到的解决方案以实现上面提到的解决方案,但是此解决方案比这要简单得多。
欲了解更多信息,请点击这里
而且,当您只需要一个特定的语法,例如bigram或trigram等时,可以使用MAHassan的答案中提到的nltk.util.ngrams。
您可以使用以下命令轻松启动自己的功能itertools
:
from itertools import izip, islice, tee
s = 'spam and eggs'
N = 3
trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]
izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
我不太了解。
使用python的buildin构建双字母组的一种更优雅的方法zip()
。只需通过将原始字符串转换为列表split()
,然后正常传递一次列表,然后一次偏移一个元素即可。
string = "I really like python, it's pretty awesome."
def find_bigrams(s):
input_list = s.split(" ")
return zip(input_list, input_list[1:])
def find_ngrams(s, n):
input_list = s.split(" ")
return zip(*[input_list[i:] for i in range(n)])
find_bigrams(string)
[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]
我从未处理过nltk,但是在一些小班项目中做了N-grams。如果要查找字符串中所有N-gram出现的频率,可以采用这种方法。D
会给你N个单词的直方图。
D = dict()
string = 'whatever string...'
strparts = string.split()
for i in range(len(strparts)-N): # N-grams
try:
D[tuple(strparts[i:i+N])] += 1
except:
D[tuple(strparts[i:i+N])] = 1
collections.Counter(tuple(strparts[i:i+N]) for i in xrange(len(strparts)-N))
的工作速度将比“试穿”的工作速度更快
对于four_grams,它已经在NLTK中,这是一段代码,可以帮助您实现这一目标:
from nltk.collocations import *
import nltk
#You should tokenize your text
text = "I do not like green eggs and ham, I do not like them Sam I am!"
tokens = nltk.wordpunct_tokenize(text)
fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
for fourgram, freq in fourgrams.ngram_fd.items():
print fourgram, freq
希望对您有所帮助。
您可以使用sklearn.feature_extraction.text.CountVectorizer:
import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))
输出:
4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']
您可以设置为ngram_size
任何正整数。也就是说,您可以将文本拆分为四克,五克甚至一百克。
如果效率是一个问题,您必须构建多个不同的n-gram(如您所说的最多一百个),但是您想使用纯python,我会这样做:
from itertools import chain
def n_grams(seq, n=1):
"""Returns an itirator over the n-grams given a listTokens"""
shiftToken = lambda i: (el for j,el in enumerate(seq) if j>=i)
shiftedTokens = (shiftToken(i) for i in range(n))
tupleNGrams = zip(*shiftedTokens)
return tupleNGrams # if join in generator : (" ".join(i) for i in tupleNGrams)
def range_ngrams(listTokens, ngramRange=(1,2)):
"""Returns an itirator over all n-grams for n in range(ngramRange) given a listTokens."""
return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))
用法:
>>> input_list = input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]
〜与NLTK相同的速度:
import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
重新发布我以前的答案。
Nltk很棒,但有时对于某些项目来说是一项开销:
import re
def tokenize(text, ngrams=1):
text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
text = re.sub(r'\s+', ' ', text)
tokens = text.split()
return [tuple(tokens[i:i+ngrams]) for i in xrange(len(tokens)-ngrams+1)]
使用示例:
>> text = "This is an example text"
>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]
您可以使用以下代码获得所有4-6克的代码,而无需以下其他软件包:
from itertools import chain
def get_m_2_ngrams(input_list, min, max):
for s in chain(*[get_ngrams(input_list, k) for k in range(min, max+1)]):
yield ' '.join(s)
def get_ngrams(input_list, n):
return zip(*[input_list[i:] for i in range(n)])
if __name__ == '__main__':
input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
for s in get_m_2_ngrams(input_list, 4, 6):
print(s)
输出如下:
I am aware that
am aware that nltk
aware that nltk only
that nltk only offers
nltk only offers bigrams
only offers bigrams and
offers bigrams and trigrams
bigrams and trigrams ,
and trigrams , but
trigrams , but is
, but is there
but is there a
is there a way
there a way to
a way to split
way to split my
to split my text
split my text in
my text in four-grams
text in four-grams ,
in four-grams , five-grams
four-grams , five-grams or
, five-grams or even
five-grams or even hundred-grams
I am aware that nltk
am aware that nltk only
aware that nltk only offers
that nltk only offers bigrams
nltk only offers bigrams and
only offers bigrams and trigrams
offers bigrams and trigrams ,
bigrams and trigrams , but
and trigrams , but is
trigrams , but is there
, but is there a
but is there a way
is there a way to
there a way to split
a way to split my
way to split my text
to split my text in
split my text in four-grams
my text in four-grams ,
text in four-grams , five-grams
in four-grams , five-grams or
four-grams , five-grams or even
, five-grams or even hundred-grams
I am aware that nltk only
am aware that nltk only offers
aware that nltk only offers bigrams
that nltk only offers bigrams and
nltk only offers bigrams and trigrams
only offers bigrams and trigrams ,
offers bigrams and trigrams , but
bigrams and trigrams , but is
and trigrams , but is there
trigrams , but is there a
, but is there a way
but is there a way to
is there a way to split
there a way to split my
a way to split my text
way to split my text in
to split my text in four-grams
split my text in four-grams ,
my text in four-grams , five-grams
text in four-grams , five-grams or
in four-grams , five-grams or even
four-grams , five-grams or even hundred-grams
您可以在此博客上找到更多详细信息
大约七年后,这是一个更优雅的答案collections.deque
:
def ngrams(words, n):
d = collections.deque(maxlen=n)
d.extend(words[:n])
words = words[n:]
for window, word in zip(itertools.cycle((d,)), words):
print(' '.join(window))
d.append(word)
words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']
输出:
In [15]: ngrams(words, 3)
I am become
am become death,
become death, the
death, the destroyer
the destroyer of
In [16]: ngrams(words, 4)
I am become death,
am become death, the
become death, the destroyer
death, the destroyer of
In [17]: ngrams(words, 1)
I
am
become
death,
the
destroyer
of
In [18]: ngrams(words, 2)
I am
am become
become death,
death, the
the destroyer
destroyer of
如果您想为具有恒定内存使用量的大字符串提供纯迭代器解决方案:
from typing import Iterable
import itertools
def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
input_iters = [
map(lambda m: m.group(0), re.finditer(token_regex, input))
for n in range(ngram_size)
]
# Skip first words
for n in range(1, ngram_size): list(map(next, input_iters[n:]))
output_iter = itertools.starmap(
lambda *args: " ".join(args),
zip(*input_iters)
)
return output_iter
测试:
input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))
输出:
['If you want a pure',
'you want a pure iterator',
'want a pure iterator solution',
'a pure iterator solution for',
'pure iterator solution for large',
'iterator solution for large strings',
'solution for large strings with',
'for large strings with constant',
'large strings with constant memory',
'strings with constant memory usage']