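1. Introduction
Here is a way to approach this problem systematically: if you have an algorithm that plays hangman well, then you can take the difficulty of each word to be the number of wrong guesses your program would make when guessing that word.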
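2. Aside on hangman strategy
There is an idea implicit in some of the other answers and comments: that the optimal strategy for the solver is to base its decisions on the frequency of letters in English, or on the frequency of words in some corpus. This is a seductive idea, but it is not quite right. The solver does best if it accurately models the distribution of words chosen by the setter, and a human setter may well choose words for their rarity or for their avoidance of frequently used letters. For example, although E is the most frequently used letter in English, if the setter always chooses from the words JUGFUL, RHYTHM, SYZYGY, and ZYTHUM, then a perfect solver does not start by guessing E!
The best approach to modelling the setter depends on the context, but I guess that some kind of Bayesian inductive inference would work well in a context where the solver plays many games against the same setter, or against a group of similar setters.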
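To make the point about modelling the setter concrete, here is a minimal sketch. It assumes the setter's distribution is already known and is given as a dict from words to probabilities; `miss_probability` and `setter` are illustrative names, not part of the solver described below, and no Bayesian updating is attempted.

```python
from string import ascii_lowercase

def miss_probability(letter, setter_distribution):
    """Probability that 'letter' is absent from the setter's word,
    given a dict mapping each candidate word to its probability."""
    return sum(p for word, p in setter_distribution.items()
               if letter not in word)

# Assumption: the setter picks uniformly from these four awkward words.
setter = {w: 0.25 for w in ('jugful', 'rhythm', 'syzygy', 'zythum')}

# The first guess least likely to be wrong under this model of the setter.
best = min(ascii_lowercase, key=lambda g: miss_probability(g, setter))
print(best, miss_probability(best, setter))  # y 0.25
print(miss_probability('e', setter))         # 1.0
```

Under this model E is always a wrong guess, while Y misses only when the word is JUGFUL.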
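3. A hangman algorithm
Here I'll outline a solver that is pretty good (but far from perfect). It models the setter as choosing words uniformly from a fixed dictionary. It is a greedy algorithm: at each stage it guesses the letter that minimizes the number of misses, that is, the number of candidate words that do not contain the guess. For example, if no guesses have been made so far, and the possible words are DEED, DEAD and DARE, then:
- if you guess D or E, there are no misses;
- if you guess A, there is one miss (DEED);
- if you guess R, there are two misses (DEED and DEAD);
- if you guess any other letter, there are three misses.
So either D or E is a good guess in this situation.
(Thanks to Colonel Panic in the comments for pointing out that correct guesses are free in hangman. I totally forgot this in my first attempt!)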
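The miss counts in the DEED, DEAD, DARE example can be checked with a few lines of Python. This is only a sketch of the cost calculation; the names `misses`, `candidates`, `costs` and `best` are chosen for illustration.

```python
from string import ascii_lowercase

def misses(guess, words):
    """Number of candidate words that do not contain 'guess'."""
    return sum(guess not in word for word in words)

candidates = ['deed', 'dead', 'dare']

# Cost of every possible first guess.
costs = {g: misses(g, candidates) for g in ascii_lowercase}

# All letters that achieve the minimum cost.
lowest = min(costs.values())
best = [g for g in ascii_lowercase if costs[g] == lowest]
print(best)                               # ['d', 'e']
print(costs['a'], costs['r'], costs['z'])  # 1 2 3
```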
4. Implementation
Here is an implementation of this algorithm in Python:
    from collections import defaultdict
    from string import ascii_lowercase

    def partition(guess, words):
        """Apply the single letter 'guess' to the sequence 'words' and return
        a dictionary mapping the pattern of occurrences of 'guess' in a
        word to the list of words with that pattern.

        >>> words = 'deed even eyes mews peep star'.split()
        >>> sorted(list(partition('e', words).items()))
        [(0, ['star']), (2, ['mews']), (5, ['even', 'eyes']), (6, ['deed', 'peep'])]

        """
        result = defaultdict(list)
        for word in words:
            key = sum(1 << i for i, letter in enumerate(word) if letter == guess)
            result[key].append(word)
        return result

    def guess_cost(guess, words):
        """Return the cost of a guess, namely the number of words that don't
        contain the guess.

        >>> words = 'deed even eyes mews peep star'.split()
        >>> guess_cost('e', words)
        1
        >>> guess_cost('s', words)
        3

        """
        return sum(guess not in word for word in words)

    def word_guesses(words, wrong = 0, letters = ''):
        """Given the collection 'words' that match all letters guessed so far,
        generate tuples (wrong, nguesses, word, guesses) where
        'word' is the word that was guessed;
        'guesses' is the sequence of letters guessed;
        'wrong' is the number of these guesses that were wrong;
        'nguesses' is len(guesses).

        >>> words = 'deed even eyes heel mere peep star'.split()
        >>> from pprint import pprint
        >>> pprint(sorted(word_guesses(words)))
        [(0, 1, 'mere', 'e'),
         (0, 2, 'deed', 'ed'),
         (0, 2, 'even', 'en'),
         (1, 1, 'star', 'e'),
         (1, 2, 'eyes', 'en'),
         (1, 3, 'heel', 'edh'),
         (2, 3, 'peep', 'edh')]

        """
        if len(words) == 1:
            yield wrong, len(letters), words[0], letters
            return
        best_guess = min((g for g in ascii_lowercase if g not in letters),
                         key = lambda g:guess_cost(g, words))
        best_partition = partition(best_guess, words)
        letters += best_guess
        for pattern, words in best_partition.items():
            for guess in word_guesses(words, wrong + (pattern == 0), letters):
                yield guess
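To see how a single word earns its score, it can help to trace one game. The sketch below replays the same greedy rule against one secret word; `play` is an illustrative helper, not part of the implementation above, and it assumes the secret is in the dictionary and that the dictionary has no duplicates.

```python
from string import ascii_lowercase

def play(secret, dictionary):
    """Play one greedy game against 'secret'; return (wrong, guesses)."""
    candidates = [w for w in dictionary if len(w) == len(secret)]
    guesses, wrong = '', 0
    while len(candidates) > 1:
        # Greedy step: the unguessed letter absent from the fewest candidates.
        guess = min((g for g in ascii_lowercase if g not in guesses),
                    key=lambda g: sum(g not in w for w in candidates))
        guesses += guess
        wrong += guess not in secret
        # Keep only words showing 'guess' in the same positions as the secret.
        pattern = [c == guess for c in secret]
        candidates = [w for w in candidates
                      if [c == guess for c in w] == pattern]
    return wrong, guesses

words = 'deed even eyes heel mere peep star'.split()
print(play('peep', words))  # (2, 'edh')
print(play('mere', words))  # (0, 'e')
```

These results agree with the corresponding rows of the `word_guesses` doctest.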
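5. Example results
Using this strategy it is possible to evaluate the difficulty of guessing each word in a collection. Here I consider the six-letter words in my system dictionary: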
    >>> words = [w.strip() for w in open('/usr/share/dict/words') if w.lower() == w]
    >>> six_letter_words = set(w for w in words if len(w) == 6)
    >>> len(six_letter_words)
    15066
    >>> results = sorted(word_guesses(six_letter_words))
The easiest words to guess in this dictionary (together with the sequence of guesses the solver needs to guess them) are as follows:
    >>> from pprint import pprint
    >>> pprint(results[:10])
    [(0, 1, 'eelery', 'e'),
     (0, 2, 'coneen', 'en'),
     (0, 2, 'earlet', 'er'),
     (0, 2, 'earner', 'er'),
     (0, 2, 'edgrew', 'er'),
     (0, 2, 'eerily', 'el'),
     (0, 2, 'egence', 'eg'),
     (0, 2, 'eleven', 'el'),
     (0, 2, 'enaena', 'en'),
     (0, 2, 'ennead', 'en')]
and the hardest words are these:
    >>> pprint(results[-10:])
    [(12, 16, 'buzzer', 'eraoiutlnsmdbcfg'),
     (12, 16, 'cuffer', 'eraoiutlnsmdbpgc'),
     (12, 16, 'jugger', 'eraoiutlnsmdbpgh'),
     (12, 16, 'pugger', 'eraoiutlnsmdbpcf'),
     (12, 16, 'suddle', 'eaioulbrdcfghmnp'),
     (12, 16, 'yucker', 'eraoiutlnsmdbpgc'),
     (12, 16, 'zipper', 'eraoinltsdgcbpjk'),
     (12, 17, 'tuzzle', 'eaioulbrdcgszmnpt'),
     (13, 16, 'wuzzer', 'eraoiutlnsmdbpgc'),
     (13, 17, 'wuzzle', 'eaioulbrdcgszmnpt')]
The reason these are hard is that after you have guessed -UZZLE, you still have seven possibilities left:
    >>> ' '.join(sorted(w for w in six_letter_words if w.endswith('uzzle')))
    'buzzle guzzle muzzle nuzzle puzzle tuzzle wuzzle'
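With seven candidates that differ only in their first letter, each further guess can confirm at most one word, so the solver may need up to six more wrong guesses. One way to look for such families in a wordlist is sketched below; `families` is an illustrative helper and the short wordlist is made up for the example.

```python
from collections import defaultdict

def families(words, position=0):
    """Group words that are identical except at 'position', and return
    (pattern, members) pairs sorted with the largest family first."""
    groups = defaultdict(list)
    for w in words:
        # Replace the letter at 'position' with '-' to form the family key.
        groups[w[:position] + '-' + w[position + 1:]].append(w)
    return sorted(groups.items(), key=lambda kv: -len(kv[1]))

sample = ['buzzle', 'guzzle', 'muzzle', 'eleven', 'puzzle']
print(families(sample)[0])
# ('-uzzle', ['buzzle', 'guzzle', 'muzzle', 'puzzle'])
```

Large families are a sign that a wordlist contains words the greedy solver will find hard.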
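6. Choice of wordlist
Of course, when preparing wordlists for your children you wouldn't start with your computer's system dictionary; you would start with a list of words that you think they are likely to know. For example, you might have a look at Wiktionary's lists of the most frequently used words in various English corpora.
For example, among the 1,700 six-letter words in the 10,000 most common words in Project Gutenberg as of 2006, the most difficult ten are these: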
    [(6, 10, 'losing', 'eaoignvwch'),
     (6, 10, 'monkey', 'erdstaoync'),
     (6, 10, 'pulled', 'erdaioupfh'),
     (6, 10, 'slaves', 'erdsacthkl'),
     (6, 10, 'supper', 'eriaoubsfm'),
     (6, 11, 'hunter', 'eriaoubshng'),
     (6, 11, 'nought', 'eaoiustghbf'),
     (6, 11, 'wounds', 'eaoiusdnhpr'),
     (6, 11, 'wright', 'eaoithglrbf'),
     (7, 10, 'soames', 'erdsacthkl')]
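(Soames Forsyte is a character in the Forsyte Saga by John Galsworthy; the wordlist has been converted to lower-case, so it wasn't possible for me to quickly remove proper names.)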
f(w) = (# unique letters) * (7 - # vowels) * (sum of the positions of unique letters in a list, ordered by frequency). From there, you can split the range of the function into three segments and call those your difficulties.
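The formula leaves several details open, so the following is only one possible reading, sketched under stated assumptions: vowels are a, e, i, o, u counted with repetition; positions are 1-based; the frequency ordering `FREQ_ORDER` is one commonly quoted ordering of English letters; and the labels 'easy', 'medium' and 'hard' are made up for the three segments.

```python
# Assumption: one commonly quoted ordering of English letters by frequency.
FREQ_ORDER = 'etaoinshrdlcumwfgypbvkjxqz'

def f(word):
    """Heuristic difficulty score, one reading of the formula above."""
    unique = set(word)
    vowels = sum(c in 'aeiou' for c in word)  # assumption: with repetition
    positions = sum(FREQ_ORDER.index(c) + 1 for c in unique)  # 1-based
    return len(unique) * (7 - vowels) * positions

def bucket(score, lo, hi):
    """Place 'score' in one of three equal segments of the range [lo, hi]."""
    third = (hi - lo) / 3
    if score < lo + third:
        return 'easy'
    if score < lo + 2 * third:
        return 'medium'
    return 'hard'

scores = {w: f(w) for w in ('eleven', 'wuzzle')}
lo, hi = min(scores.values()), max(scores.values())
print({w: bucket(s, lo, hi) for w, s in scores.items()})
```

Unlike the solver-based measure, this score looks at each word in isolation and takes no account of which other words are in the list.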