How to check if a word is an English word with Python?

134

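I want to check in a Python program whether a word is in the English dictionary.

I believe the nltk wordnet interface might be the way to go, but I have no clue how to use it for such a simple task.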

def is_english_word(word):
    pass  # how do I implement is_english_word?

is_english_word(token.lower())

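In the future, I might want to check whether the singular form of a word is in the dictionary (e.g., properties -> property -> English word). How would I achieve that?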

python nltk wordnet

— Barthelemy

215

For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There's a tutorial, or you could just dive straight in:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>

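PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but it can use any of the OpenOffice ones if you need more languages.

There appears to be a pluralisation library called inflect, but I have no idea whether it is any good.

— Katriel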

2

Thank you, I didn't know about PyEnchant, and it is indeed much more useful for the kind of checks I want to make.

— Barthelemy

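It doesn't recognize <helo>? It is not a common word, but I know <helo> as an abbreviation for <helicopter>, and I don't know <Helot>. I just wanted to point out that the solution is not one-size-fits-all, and that a different project might require different dictionaries or a different approach altogether.

— dmh, 2012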

15

The package is basically impossible for me to install. Super frustrating.

— Monica Heddneck

9

Enchant is currently not supported for 64-bit Python on Windows :(( github.com/rfk/pyenchant/issues/42

— Ricky

9

pyenchant is no longer maintained. pyhunspell has more recent activity. Also, /usr/share/dict/ and /var/lib/dict may be referenced on *nix setups.

— pkfm

48

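It won't work well with WordNet, because WordNet does not contain all English words. Another possibility, based on NLTK and without enchant, is NLTK's words corpus: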

>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True

— Sadik

5

The same remark applies here too: it is a lot faster when converted to a set: set(words.words())
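A rough sketch of why the set helps. It uses a made-up word list instead of the NLTK corpus, so the names and sizes below are assumptions for illustration only. Membership in a list scans element by element, while a set does a hash lookup.

```python
import timeit

# Hypothetical stand-in for words.words(): 200,000 synthetic "words".
word_list = ["word%d" % i for i in range(200_000)]
word_set = set(word_list)  # built once, reused for every lookup

probe = "word199999"  # last element: worst case for the list scan

# Both containers give the same answer; only the lookup cost differs.
assert (probe in word_list) == (probe in word_set)

list_time = timeit.timeit(lambda: probe in word_list, number=20)
set_time = timeit.timeit(lambda: probe in word_set, number=20)
print("list: %.5fs  set: %.5fs" % (list_time, set_time))
```

The one-off cost of building the set is paid back quickly if the check runs more than a handful of times.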

— Iulius Curt, 2014

— Watch out

2

Note: words like pasta or burger are not found in this list.

— Paroksh Saxena

45

Using NLTK:

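from nltk.corpus import wordnet  # requires nltk.download('wordnet')

word_to_test = "hello"  # example input

# synsets() returns an empty list when WordNet has no entry for the word
if not wordnet.synsets(word_to_test):
    print("Not an English word")
else:
    print("English word")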

If you have trouble installing wordnet or want to try other approaches, you should refer to this article.

— Susheel Javadi

2

This is especially useful for cygwin users, because installing enchant is quite problematic.

— alehro, 2012

27

WordNet doesn't contain every word in English; it only contains a small subset.

— justhalf

2

On top of wordnet missing a ton of common words (like "would" and "how"), this is noticeably slower than kindall's solution.

— Ryan Epp, 2013

3

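Furthermore, wordnet.synsets doesn't simply check whether a word is in it. It attempts to lemmatize the word first. So it converts "saless" (not a real English word) to "sales".

— Lyndon White

That is a flawed way of doing this, considering how synsets work. Put in "tilt" to see what I mean.

— RetroCode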

37

Use a set to store the word list, because looking words up will be faster:

with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # should be True if you have a good english_words.txt

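To answer the second part of the question: the plurals would already be in a good word list, but if you wanted to exclude them from the list for some reason, you could indeed write a function to handle it. English pluralization rules are tricky enough, though, that I'd just include the plurals in the word list to begin with.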
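If the word list really does lack plurals, a fallback could look roughly like the sketch below. The suffix rules and the names singular_candidates, is_word_or_plural and demo_words are assumptions made up for illustration; they cover only regular plurals, which is exactly why shipping the plurals in the list is the simpler option.

```python
def singular_candidates(word):
    """Yield possible singular forms using a few naive suffix rules."""
    if word.endswith("ies") and len(word) > 3:
        yield word[:-3] + "y"      # properties -> property
    if word.endswith("es") and len(word) > 2:
        yield word[:-2]            # boxes -> box
    if word.endswith("s") and len(word) > 1:
        yield word[:-1]            # cats -> cat


def is_word_or_plural(word, english_words):
    """True if the word, or a naive singular form of it, is in the set."""
    word = word.lower()
    if word in english_words:
        return True
    return any(c in english_words for c in singular_candidates(word))


# Tiny hypothetical word list, just for the demonstration.
demo_words = {"property", "box", "cat"}
print(is_word_or_plural("Properties", demo_words))  # True
print(is_word_or_plural("boxes", demo_words))       # True
print(is_word_or_plural("mice", demo_words))        # False: irregular plural
```

Irregular forms such as "mice" slip through, and a rule this loose can also accept non-words that merely end in "s".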

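As for where to find English word lists, I found several just by Googling "English word list". Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English if you specifically want one of those dialects.

— kindall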

9

If you make english_words a set instead of a list, then is_english_word will run a lot faster.

— dan04

I had actually just redone it as a dict, but you're right, a set is even better. Updated.

— kindall, 2010

1

You can also ditch .xreadlines() and just iterate over word_file.

— FogleBird, 2010

3

Under Ubuntu, the packages wamerican and wbritish provide American and British English word lists as /usr/share/dict/*-english. The package info gives wordlist.sourceforge.net as a reference.
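A minimal sketch of loading such a list, assuming the wamerican file is installed at /usr/share/dict/american-english. The helper name load_word_list is made up here, and it falls back to an empty set when the file is absent.

```python
import os


def load_word_list(path="/usr/share/dict/american-english"):
    """Return the words in `path` as a lowercase set (empty if the file is absent)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8", errors="ignore") as word_file:
        return {line.strip().lower() for line in word_file if line.strip()}


english_words = load_word_list()
print(len(english_words), "words loaded")
```

Lowercasing everything means proper nouns in the list match regardless of case, which may or may not be what a given project wants.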

— intuited

1

I found a GitHub repository that contains 479k English words.

— haolee

6

For a faster NLTK-based solution, you could hash the set of words to avoid a linear search.

from nltk.corpus import words as nltk_words

# Build the lookup table once, outside the function
# (requires nltk.download('words')).
dictionary = dict.fromkeys(nltk_words.words(), None)

def is_english_word(word):
    # Key membership in a dict is a hash lookup, not a linear search.
    return word in dictionary

— Eb Abadi

2

Instead of a dictionary, use a set.

— jhuang

4

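I found three package-based solutions to this problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). Pyenchant could not be installed easily on Win64 with py3. Wordnet doesn't work very well because its corpus isn't complete. So I chose the solution from @Sadik's answer, and use 'set(words.words())' to speed it up.

First: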

pip3 install nltk
python3

import nltk
nltk.download('words')

Then:

from nltk.corpus import words
setofwords = set(words.words())

print("hello" in setofwords)
>>True

— Young Yang

3

Using the SpellChecker from pyEnchant.checker:

from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return False if ((len(errors) > 4) or len(quote.split()) < 3) else True

print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))

> False
> True

— grizmin

1

It returns True if the text has at least 3 words and no more than 4 errors (unrecognised words). In general, these settings work pretty well for my use case.
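The same thresholds can be tried without enchant by counting tokens that are missing from a plain word set. This is only a sketch: looks_english, vocabulary and the tokenising regex are assumptions for illustration, not part of PyEnchant.

```python
import re


def looks_english(text, english_words, max_errors=4, min_words=3):
    """Heuristic: enough words, and few enough of them unrecognised."""
    if len(text.split()) < min_words:
        return False
    tokens = re.findall(r"[A-Za-z']+", text)
    errors = [t for t in tokens if t.lower() not in english_words]
    return len(errors) <= max_errors


# Tiny hypothetical vocabulary; a real one would come from a word list file.
vocabulary = {"two", "things", "are", "infinite", "the",
              "universe", "and", "human", "stupidity"}

print(looks_english("Two things are infinite: the universe and human stupidity", vocabulary))
print(looks_english("qq ww ee rr tt yy", vocabulary))
```

With fixed thresholds like these, long texts tolerate proportionally fewer unknown words than short ones, so a ratio might suit some inputs better.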

— grizmin

1

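For a semantic web approach, you could run a sparql query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request that returns results in JSON format, then parse them with the python 'json' module. If it is not an English word, you will get no results.

As another idea, you could query Wiktionary's API.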
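A hedged sketch of the Wiktionary idea, using the standard MediaWiki query module (action=query with a titles parameter), which marks absent pages with a "missing" key. The helper names and the User-Agent string are made up for illustration. Note that English Wiktionary also has entries for words from other languages, so an existing page is only a rough signal.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def build_query_url(word):
    """Build a MediaWiki 'query' URL asking about a single page title."""
    params = {"action": "query", "titles": word, "format": "json"}
    return API_URL + "?" + urllib.parse.urlencode(params)


def page_exists(response):
    """True if any returned page lacks the 'missing' marker."""
    pages = response.get("query", {}).get("pages", {})
    return any("missing" not in page for page in pages.values())


def has_wiktionary_entry(word):
    """Network call: ask Wiktionary whether a page titled `word` exists."""
    request = urllib.request.Request(
        build_query_url(word),
        headers={"User-Agent": "word-check-sketch/0.1"},  # hypothetical agent name
    )
    with urllib.request.urlopen(request, timeout=10) as reply:
        return page_exists(json.load(reply))
```

Calling has_wiktionary_entry("hello") performs one HTTP request per word, so this suits occasional checks rather than bulk filtering.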

— burkestar

1

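For all Linux/Unix users

If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files. These contain all of the words in that specific language. You can access them from every programming language, which is why I thought you might want to know about this.

Now, for Python-specific users, the Python code below should assign the list words the value of every single word: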

import re

# Read the system word list once and close the file afterwards
with open("/usr/share/dict/words", "r") as file:
    words = re.sub(r"[^\w]", " ", file.read()).split()

def is_word(word):
    return word.lower() in words

is_word("tarts") ## Returns True
is_word("jwiefjiojrfiorj") ## Returns False

Hope this helps!

— Linux4Life531