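I'm playing around with the Natural Language Toolkit (NLTK).
Its documentation (the Book and the HOWTOs) is quite bulky, and the examples are sometimes slightly too advanced.
Are there any good but basic examples of uses/applications of NLTK? I'm thinking of something like the NLTK articles on the Stream Hacker blog.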
Answers:
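Here is my own practical example, for the benefit of anyone else who looks this question up (sorry about the sample text, it was the first thing I found on Wikipedia):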
import pprint

import nltk

# Requires the Brown corpus: run nltk.download('brown') once beforehand.
tokenizer = None
tagger = None


def init_nltk():
    """Build the tokenizer and tagger once and cache them in module globals."""
    global tokenizer
    global tagger
    # A token is either a run of word characters or a run of punctuation.
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
    # Tags each word with its most frequent tag in the Brown corpus;
    # words that never occur in the corpus are tagged None.
    tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())


def tag_sort_key(pair):
    """Sort by tag, untagged (None) words first, as Python 2's cmp() did."""
    return (pair[1] is not None, pair[1] or '')


def tag(text):
    """Tokenize the text and return (word, tag) pairs sorted by tag."""
    if not tokenizer:
        init_nltk()
    tokenized = tokenizer.tokenize(text)
    tagged = tagger.tag(tokenized)
    tagged.sort(key=tag_sort_key)
    return tagged


def main():
    text = """Mr Blobby is a fictional character who featured on Noel
Edmonds' Saturday night entertainment show Noel's House Party,
which was often a ratings winner in the 1990s. Mr Blobby also
appeared on the Jamie Rose show of 1997. He was designed as an
outrageously over the top parody of a one-dimensional, mute novelty
character, which ironically made him distinctive, absurd and popular.
He was a large pink humanoid, covered with yellow spots, sporting a
permanent toothy grin and jiggling eyes. He communicated by saying
the word "blobby" in an electronically-altered voice, expressing
his moods through tone of voice and repetition.
There was a Mrs. Blobby, seen briefly in the video, and sold as a
doll.
However Mr Blobby actually started out as part of the 'Gotcha'
feature during the show's second series (originally called 'Gotcha
Oscars' until the threat of legal action from the Academy of Motion
Picture Arts and Sciences[citation needed]), in which celebrities
were caught out in a Candid Camera style prank. Celebrities such as
dancer Wayne Sleep and rugby union player Will Carling would be
enticed to take part in a fictitious children's programme based around
their profession. Mr Blobby would clumsily take part in the activity,
knocking over the set, causing mayhem and saying "blobby blobby
blobby", until finally when the prank was revealed, the Blobby
costume would be opened - revealing Noel inside. This was all the more
surprising for the "victim" as during rehearsals Blobby would be
played by an actor wearing only the arms and legs of the costume and
speaking in a normal manner.[citation needed]"""
    tagged = tag(text)
    # Drop duplicate (word, tag) pairs, then sort again because a set is unordered.
    unique_pairs = list(set(tagged))
    unique_pairs.sort(key=tag_sort_key)
    pprint.pprint(unique_pairs)


if __name__ == '__main__':
    main()
Output:
[('rugby', None),
('Oscars', None),
('1990s', None),
('",', None),
('Candid', None),
('"', None),
('blobby', None),
('Edmonds', None),
('Mr', None),
('outrageously', None),
('.[', None),
('toothy', None),
('Celebrities', None),
('Gotcha', None),
(']),', None),
('Jamie', None),
('humanoid', None),
('Blobby', None),
('Carling', None),
('enticed', None),
('programme', None),
('1997', None),
('s', None),
("'", "'"),
('[', '('),
('(', '('),
(']', ')'),
(',', ','),
('.', '.'),
('all', 'ABN'),
('the', 'AT'),
('an', 'AT'),
('a', 'AT'),
('be', 'BE'),
('were', 'BED'),
('was', 'BEDZ'),
('is', 'BEZ'),
('and', 'CC'),
('one', 'CD'),
('until', 'CS'),
('as', 'CS'),
('This', 'DT'),
('There', 'EX'),
('of', 'IN'),
('inside', 'IN'),
('from', 'IN'),
('around', 'IN'),
('with', 'IN'),
('through', 'IN'),
('-', 'IN'),
('on', 'IN'),
('in', 'IN'),
('by', 'IN'),
('during', 'IN'),
('over', 'IN'),
('for', 'IN'),
('distinctive', 'JJ'),
('permanent', 'JJ'),
('mute', 'JJ'),
('popular', 'JJ'),
('such', 'JJ'),
('fictional', 'JJ'),
('yellow', 'JJ'),
('pink', 'JJ'),
('fictitious', 'JJ'),
('normal', 'JJ'),
('dimensional', 'JJ'),
('legal', 'JJ'),
('large', 'JJ'),
('surprising', 'JJ'),
('absurd', 'JJ'),
('Will', 'MD'),
('would', 'MD'),
('style', 'NN'),
('threat', 'NN'),
('novelty', 'NN'),
('union', 'NN'),
('prank', 'NN'),
('winner', 'NN'),
('parody', 'NN'),
('player', 'NN'),
('actor', 'NN'),
('character', 'NN'),
('victim', 'NN'),
('costume', 'NN'),
('action', 'NN'),
('activity', 'NN'),
('dancer', 'NN'),
('grin', 'NN'),
('doll', 'NN'),
('top', 'NN'),
('mayhem', 'NN'),
('citation', 'NN'),
('part', 'NN'),
('repetition', 'NN'),
('manner', 'NN'),
('tone', 'NN'),
('Picture', 'NN'),
('entertainment', 'NN'),
('night', 'NN'),
('series', 'NN'),
('voice', 'NN'),
('Mrs', 'NN'),
('video', 'NN'),
('Motion', 'NN'),
('profession', 'NN'),
('feature', 'NN'),
('word', 'NN'),
('Academy', 'NN-TL'),
('Camera', 'NN-TL'),
('Party', 'NN-TL'),
('House', 'NN-TL'),
('eyes', 'NNS'),
('spots', 'NNS'),
('rehearsals', 'NNS'),
('ratings', 'NNS'),
('arms', 'NNS'),
('celebrities', 'NNS'),
('children', 'NNS'),
('moods', 'NNS'),
('legs', 'NNS'),
('Sciences', 'NNS-TL'),
('Arts', 'NNS-TL'),
('Wayne', 'NP'),
('Rose', 'NP'),
('Noel', 'NP'),
('Saturday', 'NR'),
('second', 'OD'),
('his', 'PP$'),
('their', 'PP$'),
('him', 'PPO'),
('He', 'PPS'),
('more', 'QL'),
('However', 'RB'),
('actually', 'RB'),
('also', 'RB'),
('clumsily', 'RB'),
('originally', 'RB'),
('only', 'RB'),
('often', 'RB'),
('ironically', 'RB'),
('briefly', 'RB'),
('finally', 'RB'),
('electronically', 'RB-HL'),
('out', 'RP'),
('to', 'TO'),
('show', 'VB'),
('Sleep', 'VB'),
('take', 'VB'),
('opened', 'VBD'),
('played', 'VBD'),
('caught', 'VBD'),
('appeared', 'VBD'),
('revealed', 'VBD'),
('started', 'VBD'),
('saying', 'VBG'),
('causing', 'VBG'),
('expressing', 'VBG'),
('knocking', 'VBG'),
('wearing', 'VBG'),
('speaking', 'VBG'),
('sporting', 'VBG'),
('revealing', 'VBG'),
('jiggling', 'VBG'),
('sold', 'VBN'),
('called', 'VBN'),
('made', 'VBN'),
('altered', 'VBN'),
('based', 'VBN'),
('designed', 'VBN'),
('covered', 'VBN'),
('communicated', 'VBN'),
('needed', 'VBN'),
('seen', 'VBN'),
('set', 'VBN'),
('featured', 'VBN'),
('which', 'WDT'),
('who', 'WPS'),
('when', 'WRB')]
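The None tags at the top of that output are words the unigram tagger never saw in the Brown corpus: a unigram tagger only remembers the most frequent tag for each training word and has no answer for anything else. Here is a minimal pure-Python sketch of that idea, not NLTK's actual implementation. The toy training sentences and the names train_unigram and tag_tokens are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy training data standing in for nltk.corpus.brown.tagged_sents().
TRAIN = [
    [("the", "AT"), ("show", "NN"), ("was", "BEDZ"), ("popular", "JJ")],
    [("they", "PPSS"), ("show", "VB"), ("the", "AT"), ("costume", "NN")],
    [("the", "AT"), ("show", "NN"), ("started", "VBD")],
]


def train_unigram(tagged_sents):
    """Map each training word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(model, text):
    """Tokenize with the same regex as the answer, then look each token up."""
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    # dict.get returns None for unseen tokens, mirroring the None tags above.
    return [(token, model.get(token)) for token in tokens]


model = train_unigram(TRAIN)
print(tag_tokens(model, "the show was blobby."))
# [('the', 'AT'), ('show', 'NN'), ('was', 'BEDZ'), ('blobby', None), ('.', None)]
```

In NLTK itself, a common way to avoid the None tags is to give the tagger a fallback, for example nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN')), which tags every unknown word as a noun.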
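('called', 'VBN') says that "called" is a past participle verb. It looks like global is used so that the variables can be assigned inside a function's scope (so that they don't have to be passed in every time the function is called).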
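If you would rather not use module-level globals, one alternative (a sketch, not what the answer above does) is to cache the expensive setup with functools.lru_cache. The names get_tokenizer and tokenize are hypothetical, and a compiled regex stands in for the NLTK tokenizer and tagger so that the example runs without any corpus download.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def get_tokenizer():
    """Build the tokenizer on first use; later calls return the cached object."""
    # Stand-in for costly setup such as training a tagger on the Brown corpus.
    return re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return get_tokenizer().findall(text)


print(tokenize("Mr Blobby's costume"))
# ['Mr', 'Blobby', "'", 's', 'costume']
```

The setup still runs only once, but no function has to declare or modify a global variable.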
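NLP in general is very useful, so you might want to broaden your search to general applications of text analytics. I used NLTK to help with MOSS 2010 by generating a file taxonomy from extracted concept maps. It worked really well. It doesn't take long before files start to cluster in useful ways.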
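Often, to understand text analytics you have to think at a tangent to the ways you are used to thinking. For example, text analytics is extremely useful for discovery. Most people, though, don't even know the difference between search and discovery. If you read up on those subjects, you will likely "discover" ways you might want to put NLTK to work.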
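Also, consider your world view of text files without NLTK. You have a bunch of random-length strings separated by whitespace and punctuation. Some of the punctuation changes how it is used, such as the period (which is also a decimal point and a postfix marker for an abbreviation). With NLTK you get words and, more to the point, you get parts of speech. Now you have a handle on the content. Use NLTK to discover the concepts and actions in the document. Use NLTK to get at the "meaning" of the document. Meaning in this case refers to the essential relationships in the document.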
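As a rough illustration of "concepts and actions", here is a small sketch that groups already-tagged words by the prefix of their Brown tag. Treating nouns (NN...) as concepts and verbs (VB...) as actions is a simplification made for this example, and group_by_role is a hypothetical helper, not an NLTK function.

```python
from collections import defaultdict

# A few (word, tag) pairs copied from the tagger output earlier in this thread.
tagged = [
    ("costume", "NN"), ("celebrities", "NNS"), ("Academy", "NN-TL"),
    ("take", "VB"), ("revealed", "VBD"), ("knocking", "VBG"),
    ("the", "AT"), ("blobby", None),
]


def group_by_role(pairs):
    """Bucket words into concepts (nouns), actions (verbs) and unknown words."""
    roles = defaultdict(list)
    for word, tag in pairs:
        if tag is None:
            roles["unknown"].append(word)   # word the tagger had never seen
        elif tag.startswith("NN"):
            roles["concepts"].append(word)  # NN, NNS, NN-TL, ...
        elif tag.startswith("VB"):
            roles["actions"].append(word)   # VB, VBD, VBG, VBN, ...
    return dict(roles)


print(group_by_role(tagged))
# {'concepts': ['costume', 'celebrities', 'Academy'], 'actions': ['take', 'revealed', 'knocking'], 'unknown': ['blobby']}
```

Note that Brown marks proper nouns as NP, so names such as "Noel" would need their own branch.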
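It is a good thing to be curious about NLTK. Text analytics is set to break out in a big way over the next few years. Those who understand it will be better placed to take advantage of the new opportunities.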
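I'm the author of streamhacker.com (and thanks for the mention, I get a good amount of click traffic from this particular question). What specifically are you trying to do? NLTK has a lot of tools for doing various things, but it is somewhat lacking in clear information on what to use the tools for and how best to use them. It is also oriented towards academic problems, so it can be heavy going to translate the teaching examples into practical solutions.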