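In the spirit of xkcd
Write a program that plays regex golf with arbitrary pairs of lists. The program should at least attempt to make the regex short; a program that just outputs /^(item1|item2|item3|item4)$/ or similar is not allowed.
Scoring is based on the ability to generate the shortest regex. The test lists are those of successful and unsuccessful US presidential candidates (thanks @Peter). Of course, the program must work for all disjoint lists, so simply returning a hard-coded answer for the presidents list does not count.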
Answers:
use Regexp::Assemble;@ARGV=shift;my$r=new Regexp::Assemble;chomp,add$r "^\Q$_\E\$"while<>;$_=as_string$r;s/\(\?:/(/g;print
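This uses the CPAN module Regexp::Assemble to optimize the regular expressions. Because what is a better language for regular expressions than Perl?
Also, here is the readable version, just for fun (made with the help of -MO=Deparse).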
use Regexp::Assemble;
my $r = Regexp::Assemble->new;
while (<>) {
    chomp($_);
    $r->add("^\Q$_\E\$");
}
$_ = $r->as_string;
# Replace wasteful (?:, even if it's technically correct.
s/\(\?:/(/g;
print $_;
Sample output (I pressed CTRL-D after item4).
$ perl assemble.pl
item1
atem2
item3
item4
^(item[134]|atem2)$
Also, as a bonus, here is the regex for every word in the question.
^(a((ttemp)?t|llowed\.|rbitrary)?|\/\^item1\|atem2\|item3\|item4\$\/|s(ho(rt,|uld)|imilar)|p((air|lay)s|rogram)|(Writ|mak|Th)e|l(ists\.|east)|o([fr]|utputs)|t(h(at|e)|o)|(jus|no)t|regex|golf|with|is)$
Also, the list of presidents (262 bytes).
^(((J(effer|ack|ohn)s|W(ashingt|ils)|Nix)o|Van Bure|Lincol)n|C(l(eveland|inton)|oolidge|arter)|H(a(r(rison|ding)|yes)|oover)|M(cKinley|adison|onroe)|T(a(ylor|ft)|ruman)|R(oosevelt|eagan)|G(arfield|rant)|Bu(chanan|sh)|P(ierce|olk)|Eisenhower|Kennedy|Adams|Obama)$
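The item[134] grouping in the sample output comes from merging words that share a prefix. As a rough illustration of that idea only, here is a minimal Python sketch that builds a prefix trie and prints it back as a regex. It is not how Regexp::Assemble is implemented (the module does considerably more, for example the shared suffixes visible in the presidents regex), and the names build_trie, trie_to_regex and assemble are made up for this example. It assumes a non-empty list of literal strings.

```python
import re

def build_trie(words):
    """Build a nested-dict prefix trie; the key '' marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}
    return trie

def trie_to_regex(node):
    """Return a regex fragment matching every suffix stored below this node."""
    children = sorted(ch for ch in node if ch)
    if not children:
        return ''  # only the end marker is left, nothing more to match
    branches, singles = [], []
    for ch in children:
        rest = trie_to_regex(node[ch])
        if rest:
            branches.append(re.escape(ch) + rest)
        else:
            singles.append(re.escape(ch))  # branch that ends after one character
    if len(singles) > 1:
        branches.append('[' + ''.join(singles) + ']')  # merge into a character class
    else:
        branches.extend(singles)
    optional = '' in node  # a word may also end right here
    if len(branches) == 1 and not optional:
        return branches[0]
    return '(' + '|'.join(branches) + ')' + ('?' if optional else '')

def assemble(words):
    """Anchor the trie regex so that it matches exactly the given words."""
    return '^' + trie_to_regex(build_trie(words)) + '$'

print(assemble(['item1', 'atem2', 'item3', 'item4']))
# ^(atem2|item[134])$
```

The alternatives come out in sorted order, so the result differs from the Perl output in ordering only.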
Not my solution (obviously I'm not Peter Norvig!), but here is his solution to the (slightly modified) problem: http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313.ipynb
The program he gives is as follows (his work, not mine):
import re  # needed by matches()

def findregex(winners, losers):
    "Find a regex that matches all winners but no losers (sets of strings)."
    # Make a pool of candidate components, then pick from them to cover winners.
    # On each iteration, add the best component to 'cover'; finally disjoin them together.
    pool = candidate_components(winners, losers)
    cover = []
    while winners:
        best = max(pool, key=lambda c: 3*len(matches(c, winners)) - len(c))
        cover.append(best)
        pool.remove(best)
        winners = winners - matches(best, winners)
    return '|'.join(cover)

def candidate_components(winners, losers):
    "Return components, c, that match at least one winner, w, but no loser."
    parts = set(mappend(dotify, mappend(subparts, winners)))
    wholes = {'^'+winner+'$' for winner in winners}
    return wholes | {p for p in parts if not matches(p, losers)}

def mappend(function, *sequences):
    """Map the function over the arguments. Each result should be a sequence.
    Append all the results together into one big list."""
    results = map(function, *sequences)
    return [item for result in results for item in result]

def subparts(word):
    "Return a set of subparts of word, consecutive characters up to length 4, plus the whole word."
    return set(word[i:i+n] for i in range(len(word)) for n in (1, 2, 3, 4))

def dotify(part):
    "Return all ways to replace a subset of chars in part with '.'."
    if part == '':
        return {''}
    else:
        return {c+rest for rest in dotify(part[1:]) for c in ('.', part[0])}

def matches(regex, strings):
    "Return a set of all the strings that are matched by regex."
    return {s for s in strings if re.search(regex, s)}

answer = findregex(winners, losers)
answer
# 'a.a|i..n|j|li|a.t|a..i|bu|oo|n.e|ay.|tr|rc|po|ls|oe|e.a'
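where winners and losers are the sets of winners and losers respectively (or any two lists, of course); see the article for a detailed explanation.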
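A result like this is easy to sanity-check independently of the search: every winner must match and no loser may match. Below is a small self-contained sketch of such a check. The name check_regex and the tiny winners/losers sets are made up for illustration; they are a toy subset of the names, not the full test lists.

```python
import re

def check_regex(regex, winners, losers):
    """Return (missed_winners, matched_losers); both are empty if the regex is valid."""
    missed_winners = {w for w in winners if not re.search(regex, w)}
    matched_losers = {l for l in losers if re.search(regex, l)}
    return missed_winners, matched_losers

# Toy data for illustration only.
winners = {'washington', 'adams', 'jefferson'}
losers = {'pinckney', 'king', 'clay'}

missed, wrong = check_regex('j|a..i|a.a', winners, losers)
print(sorted(missed), sorted(wrong))   # [] []

missed, wrong = check_regex('in', winners, losers)
print(sorted(missed), sorted(wrong))   # ['adams', 'jefferson'] ['king', 'pinckney']
```

Like matches in the program above, it uses re.search, so a component may match anywhere in a name unless it is anchored explicitly.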
USING:
    formatting fry
    grouping
    kernel
    math math.combinatorics math.ranges
    pcre
    sequences sets ;
IN: xkcd1313

: name-set ( str -- set )
    "\\s" split members ;

: winners ( -- set )
    "washington adams jefferson jefferson madison madison monroe
monroe adams jackson jackson vanburen harrison polk taylor pierce buchanan
lincoln lincoln grant grant hayes garfield cleveland harrison cleveland mckinley
mckinley roosevelt taft wilson wilson harding coolidge hoover roosevelt
roosevelt roosevelt roosevelt truman eisenhower eisenhower kennedy johnson nixon
nixon carter reagan reagan bush clinton clinton bush bush obama obama" name-set ;

: losers ( -- set )
    "clinton jefferson adams pinckney pinckney clinton king adams
jackson adams clay vanburen vanburen clay cass scott fremont breckinridge
mcclellan seymour greeley tilden hancock blaine cleveland harrison bryan bryan
parker bryan roosevelt hughes cox davis smith hoover landon wilkie dewey dewey
stevenson stevenson nixon goldwater humphrey mcgovern ford carter mondale
dukakis bush dole gore kerry mccain romney" name-set winners diff
    { "fremont" } diff "fillmore" suffix ;

: matches ( seq regex -- seq' )
    '[ _ findall empty? not ] filter ;

: mconcat ( seq quot -- set )
    map concat members ; inline

: dotify ( str -- seq )
    { t f } over length selections [ [ CHAR: . rot ? ] "" 2map-as ] with map ;

: subparts ( str -- seq )
    1 4 [a,b] [ clump ] with mconcat ;

: candidate-components ( winners losers -- seq )
    [
        [ [ "^%s$" sprintf ] map ]
        [ [ subparts ] mconcat [ dotify ] mconcat ] bi append
    ] dip swap [ matches empty? ] with filter ;

: find-cover ( winners candidates -- cover )
    swap [ drop { } ] [
        2dup '[ _ over matches length 3 * swap length - ] supremum-by [
            [ dupd matches diff ] [ rot remove ] bi find-cover
        ] keep prefix
    ] if-empty ;

: find-regex ( winners losers -- regex )
    dupd candidate-components find-cover "|" join ;

: verify ( winners losers regex -- ? )
    swap over [
        dupd matches diff "Error: should match but did not: %s\n"
    ] [
        matches "Error: should not match but did: %s\n"
    ] 2bi* [
        dupd '[ ", " join _ printf ] unless-empty empty?
    ] 2bi@ and ;

: print-stats ( legend winners regex -- )
    dup length rot "|" join length over /
    "separating %s: '%s' (%d chars %.1f ratio)\n" printf ;

: (find-both) ( winners losers legend -- )
    -rot 2dup find-regex [ verify t assert= ] 3keep nip print-stats ;

: find-both ( winners losers -- )
    [ "1 from 2" (find-both) ] [ swap "2 from 1" (find-both) ] 2bi ;
IN: scratchpad winners losers find-both
separating 1 from 2: 'a.a|a..i|j|li|a.t|i..n|bu|oo|ay.|n.e|ma|oe|po|rc|ls|l.v' (55 chars 4.8 ratio)
separating 2 from 1: 'go|e..y|br|cc|hu|do|k.e|.mo|o.d|s..t|ss|ti|oc|bl|pa|ox|av|st|du|om|cla|k..g' (75 chars 3.3 ratio)
It is the same algorithm as Norvig's. If hurting readability is the goal, you can probably golf away many more characters.
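For readers who do not read Factor: the "ratio" printed by print-stats is the length of the trivial alternation (all names joined with "|") divided by the length of the regex that was found. A minimal Python sketch of that one calculation follows; the function name compression_ratio and the two-name example are made up for illustration.

```python
def compression_ratio(words, regex):
    """Length of the trivial 'w1|w2|...' alternation divided by the regex length."""
    trivial = '|'.join(words)
    return len(trivial) / len(regex)

# Toy example: 'adams|jefferson' is 15 characters, 'j|a.a' is 5.
print(compression_ratio(['adams', 'jefferson'], 'j|a.a'))  # 3.0
```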
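My code is not golfed or condensed, but you can check it out at https://github.com/amitayd/regexp-golf-coffeescript/ (or specifically the algorithm in src/regexpGolf.coffee).
It is based on Peter Norvig's algorithm, with two improvements (and optional randomness added as well).
For the winners/losers in this challenge, I found a 76-character regex using it: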
[Jisn]e..|[dcih]..o|[AaG].a|[sro].i|T[ar]|[PHx]o|V|[oy]e|lev|sh$|u.e|rte|nle
More details are in my blog post about porting the solver to CoffeeScript.
/^item1|atem2|item3|item4$/ may have unintended precedence (the string must either begin with item1, contain atem2, contain item3, or end with item4).
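The difference is easy to see by running both forms against a few strings. Here is a short Python check; the sample strings are made up for illustration. Without the group, ^ binds only to the first alternative and $ only to the last.

```python
import re

ungrouped = re.compile(r'^item1|atem2|item3|item4$')
grouped = re.compile(r'^(item1|atem2|item3|item4)$')

for s in ['item1', 'item1xyz', 'xxatem2xx', 'xyzitem4']:
    print(s, bool(ungrouped.search(s)), bool(grouped.search(s)))
# item1 True True
# item1xyz True False
# xxatem2xx True False
# xyzitem4 True False
```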