297

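This is a 1.2 MB ASCII text file containing the text of Herman Melville's Moby-Dick; or, The Whale. Your task is to write a program or function (or class, etc.; see below) which will be given this file one character at a time, and at each step must guess the next character.

This is code-challenge. Your score will be

2*L + E

where L is the size of your submission in bytes, and E is the number of characters it guesses incorrectly. The lowest score wins.

Further particulars

Your submission will be a program or function (etc.) that will be called or invoked or sent data multiple times (1215235 times, to be exact). When it is called for the nth time it will be given the nth character of whale.txt or whale2.txt and it must output its guess for the (n+1)th character. The E component of its score will be the total number of characters that it guesses incorrectly.

Most submissions will need to store some state in between invocations, so that they can track how many times they have been called and what the previous inputs were. You can do this by writing to an external file, by using static or global variables, by submitting a class rather than a function, by using a state monad, or by whatever else works for your language. Your submission must include any code required to initialise its state before the first invocation.

Your program should run deterministically, so that it always makes the same guesses given the same input (and hence always gets the same score).

Your answer must include not only your submission, but also the code you used to calculate the E part of its score. This need not be written in the same language as your submission, and will not be counted towards its byte count. You are encouraged to make it readable.

Regarding the interface between your submission and this score-calculating program, anything is fine, as long as your program always gives one byte of output before receiving its next byte of input. (So, for example, you can't just pass it a string containing all of the input and get a string back containing all of the output.)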
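As a rough illustration of the interface and the scoring rule, here is a minimal sketch of a score calculator in Python. The names `count_errors`, `total_score` and `always_space` are hypothetical, and the sketch assumes the submission is exposed as a function that takes one byte and returns one byte.

```python
def count_errors(predict, data):
    """Feed data to predict one byte at a time and count wrong guesses.

    predict receives the n-th byte and must return its guess for the
    (n+1)-th byte before it is given any further input.
    """
    errors = 0
    for i in range(len(data) - 1):
        guess = predict(data[i])
        if guess != data[i + 1]:
            errors += 1
    return errors


def total_score(submission_bytes, errors):
    """Score as defined by the challenge: 2*L + E."""
    return 2 * submission_bytes + errors


def always_space(byte):
    """Trivial predictor used only to exercise the harness."""
    return 32


sample = b"a b c"
sample_errors = count_errors(always_space, sample)
```

A real run would read whale2.txt in binary mode and pass its contents as `data`.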

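You must actually run the test program and calculate/verify your score before submitting your entry. If your submission runs too slowly for you to be able to verify its score then it is not qualified to compete, even if you know what its score would be in principle.

The L component of your score will be calculated according to the usual rules for code golf challenges. If your submission will contain multiple files, please take note of the rules on scoring and directory structure in that case. Any data that your code uses must be included in your L score.

You may import existing libraries but may not load any other external files, and your code may not access the whale.txt or whale2.txt file in any way other than described above. You may not load any pre-trained neural networks or other sources of statistical data. (It's fine to use neural networks, but you have to include the weight data in your submission and count it towards your byte count.) If for some reason your language or libraries include a feature that provides some or all of the text of Moby Dick, you may not use that feature. Aside from that you can use any other built-in or library features that you like, including ones relating to text processing, prediction or compression, as long as they're part of your language or its standard libraries. For more exotic, specialised routines that include sources of statistical data, you would have to implement them yourself and include them in your byte count.

It is likely that some submissions will include components that are themselves generated by code. If this is the case, please include in your answer the code that was used to produce them, and explain how it works. (As long as this code is not needed to run your submission it will not be included in your byte count.)

For historical reasons, there are two versions of the file, and you may use either of them in an answer. In whale2.txt (linked above) the text is not wrapped, so newlines appear only at the end of paragraphs. In the original whale.txt the text is wrapped to a width of 74 characters, so you have to predict the end of each line as well as predicting the text. This makes the challenge more fiddly, so whale2.txt is recommended for new answers. Both files are the same size, 1215236 bytes.

To summarise, all answers should include the following things:

Your submission itself. (The code, plus any data files it uses; these can be links if they're large.)
An explanation of how your code works. Please explain the I/O method as well as how it predicts the next character. The explanation of your algorithm is important, and good explanations will earn bounties from me.
The code you used to evaluate your score. (If this is identical to a previous answer you can just link to it.)
Any code you used to generate your submission, along with an explanation of that code. This includes code that you used to optimise parameters, generate data files etc. (This doesn't count towards your byte count but should be included in your answer.)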


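Bounties

From time to time I'll offer bounties to encourage different approaches.

The first one, 50 points, was awarded to A. Rex for the best-scoring answer at the time.

The second, 100 points, was also awarded to A. Rex, for the same answer, because they added a very good explanation to their existing answer.

The next bounty, 200 points, will be awarded to either

A competitive answer that uses a new technique. (This will be based on my subjective judgment since it's my rep that goes into the bounty, but you can trust me to be fair. Note that your answer needs to contain sufficient explanation for me to understand how it works!) Such an answer needn't take the top score; it just needs to do reasonably well compared to existing answers. I'm particularly keen to see solutions based on recurrent neural networks, but I'll award the bounty to anything that seems different enough from the Markov models that dominate the current top scores.

Or:

Anyone else who beats A. Rex's top score (currently 444444), using any method.

Once the 200 point bounty is claimed I will most likely offer a 400 point one, updating the requirements accordingly.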

code-challenge optimization compression

— Nathaniel

Comments are not for extended discussion; this conversation has been moved to chat.

— Dennis

9

xkcd.com/1960 appears to be a reference to this challenge!

— A. Rex

I thought about compressing it... but my computer crashed *shrug*

— 太久了

135

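///, 2*1 + 1020874 = 1020876

Prints a space.

— daniero

Comments are not for extended discussion; this conversation has been moved to chat.

— Dennis

That is some extremely smart reward hacking! You must be an AGI ;)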

— Alex

97

Node.js, 2*224 + 524279 = 524727

Please refer to the change log at the end of this post for score updates.

A function taking and returning a byte.

a=[...l='14210100'],m={},s={},b={}
f=c=>a.some((t,n)=>x=s[y=l.slice(n)]>t|/^[A-Z '"(]/.test(y)&&b[y],l+=String.fromCharCode(c),a.map((_,n)=>(m[x=l.slice(n)]=-~m[x])<s[y=l.slice(n,8)]||(s[y]=m[x],b[y]=c)),l=l.slice(1))&&x||32

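It consists of a simple PPM model that looks at the last 8 characters to predict the next one.

We trust a pattern of length L when we have encountered it at least T[L] times, where T is an array of arbitrary thresholds: [1,1,2,1,2,3,5,2]. Furthermore, we always trust a pattern whose first character matches [A-Z '"(].

We select the longest trusted pattern and return the prediction with the highest score associated with this pattern at the time of the call.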
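The idea of "longest trusted pattern" can be sketched in ungolfed form as follows. This is a simplified illustration in Python, not a port of the golfed function: the class name is hypothetical, the way the thresholds are indexed by pattern length is an assumption, and the special case for patterns starting with [A-Z '"(] is left out.

```python
from collections import Counter, defaultdict

MAX_LEN = 8
# Threshold values from the list quoted above; indexing them by pattern
# length (index 0 unused) is an assumption made for this sketch.
THRESHOLDS = [0, 1, 1, 2, 1, 2, 3, 5, 2]


class TrustedContextPredictor:
    def __init__(self):
        self.history = ""
        # pattern -> Counter of the characters that followed it
        self.counts = defaultdict(Counter)

    def feed(self, ch):
        """Learn that ch followed the recent patterns, then guess the next char."""
        for k in range(1, min(MAX_LEN, len(self.history)) + 1):
            self.counts[self.history[-k:]][ch] += 1
        self.history = (self.history + ch)[-MAX_LEN:]
        # Try the longest pattern first and back off to shorter ones.
        for k in range(min(MAX_LEN, len(self.history)), 0, -1):
            seen = self.counts.get(self.history[-k:])
            if seen and sum(seen.values()) >= THRESHOLDS[k]:
                return seen.most_common(1)[0][0]
        return " "  # nothing trusted yet: fall back to a space


predictor = TrustedContextPredictor()
guesses = [predictor.feed(c) for c in "abcabcabc"]
```

On a short repetitive input such as the one above, the first few guesses are the fallback space, after which the predictor starts following the repeating pattern.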

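Notes

This is obviously not optimized for speed, but it runs in about 15 seconds on my laptop.
If we were allowed to repeat the process several times in a row without resetting the model, the number of errors would converge to ~268000 after 5 iterations.

The current success rate of the prediction function is ~56.8%. As noticed by @immibis in the comments, if bad and correct guesses are mixed together, the result is barely readable.

For instance, this snippet near the end of the book: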

Here be it said, that this pertinacious pursuit of one particular whale,[LF]
continued through day into night, and through night into day, is a thing[LF]
by no means unprecedented in the South sea fishery.

becomes:

"e e be it said, that thes woacangtyous sarsuet of tie oort cular thale[LF][LF]
 orsinued toeough tir on e togh   and sheough toght an o ters af t shin[LF][LF]
be to means insrocedented tn hhe sputh Sevsaonh ry,

By replacing bad guesses with underscores, we have a better idea of what the function got right:

_e_e be it said, that th_s _____n___ous __rsu_t of __e __rt_cular _hale_[LF]
_o__inued t__ough ___ _n__ __gh__ and _h_ough __ght _n_o ____ __ _ _hin_[LF]
b_ _o means _n_r_cedented _n _he __uth _e_____h_ry_

NB: The above example was created with a previous version of the code, working on the first version of the input file.

Test code

/**
  The prediction function f() and its variables.
*/
a=[...l='14210100'],m={},s={},b={}
f=c=>a.some((t,n)=>x=s[y=l.slice(n)]>t|/^[A-Z '"(]/.test(y)&&b[y],l+=String.fromCharCode(c),a.map((_,n)=>(m[x=l.slice(n)]=-~m[x])<s[y=l.slice(n,8)]||(s[y]=m[x],b[y]=c)),l=l.slice(1))&&x||32

/**
  A closure containing the test code and computing E.
  It takes f as input.
  (f can't see any of the variables defined in this scope.)
*/
;
(f => {
  const fs = require('fs');

  let data = fs.readFileSync('whale2.txt'),
      len = data.length,
      err = 0;

  console.time('ElapsedTime');

  data.forEach((c, i) => {
    i % 100000 || console.log((i * 100 / len).toFixed(1) + '%');

    if(i < len - 1 && f(c) != data[i + 1]) {
      err++;
    }
  })

  console.log('E = ' + err);
  console.timeEnd('ElapsedTime');
})(f)

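Change log

524727 - saved 19644 points by switching to whale2.txt (challenge update)
544371 - saved 327 points by forcing patterns starting with a capital letter, a quote, a double-quote or an opening parenthesis to be always trusted as well
544698 - saved 2119 points by forcing patterns starting with a space to be always trusted
546817 - saved 47 points by adjusting the thresholds and golfing the prediction function
546864 - saved 1496 points by extending the maximum pattern length to 8 characters
548360 - saved 6239 points by introducing the notion of trusted patterns, with thresholds depending on their length
554599 - saved 1030 points by improving the line feed prediction
555629 - saved 22 points by golfing the prediction function
555651 - saved 40 points by golfing the prediction function
555691 - initial score

— Arnauld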

44

For the curious: no, this does not produce anything like Moby Dick. It's a whole lot of

sidg tlanses,oeth to, shuld hottut tild aoersors Ch, th! Sa, yr! Sheu arinning whales aut ihe e sl he traaty of rrsf tg homn  Bho dla tiasot  a shab  sor ty, af etoors tnd hocket sh bts ait mtubb tiddin tis aeewnrs, dnhost maundy cnd sner aiwt d boelh  cheugh  -aaieiyns   aasiyns  taaeiins! th, tla

. Sometimes it does manage to get a few complete words. Like whales.

— immibis

23

@immibis The title of the challenge was chosen wisely. This is Moby Dick, approximately. :-)

— Arnauld

3

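@Nathaniel There have been many updates, so that would be barely readable and not really informative. I've added a change log instead, with short explanations of the improvements.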

— Arnauld

45

I think your program is actually doing a perfect translation into Gaelic.

— Beska

1

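@Draco18s It's hard to tell whether this comma was a good or a bad guess. If it was a bad guess, the prediction function may have legitimately attempted to put a letter after whatever other letter was actually there instead of the comma, once it received it.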

— Arnauld

91

Perl，2·70525 + 326508 = 467558

Predictor

$m=($u=1<<32)-1;open B,B;@e=unpack"C*",join"",<B>;$e=2903392593;sub u{int($_[0]+($_[1]-$_[0])*pop)}sub o{$m&(pop()<<8)+pop}sub g{($h,%m,@b,$s,$E)=@_;if($d eq$h){($l,$u)=(u($l,$u,$L),u($l,$u,$U));$u=o(256,$u-1),$l=o($l),$e=o(shift@e,$e)until($l^($u-1))>>24}$M{"@c"}{$h}++-++$C{"@c"}-pop@c for@p=($h,@c=@p);@p=@p[0..19]if@p>20;@c=@p;for(@p,$L=0){$c="@c";last if" "ne pop@c and@c<2 and$E>99;$m{$_}+=$M{$c}{$_}/$C{$c}for sort keys%{$M{$c}};$E+=$C{$c}}$s>5.393*$m{$_}or($s+=$m{$_},push@b,$_)for sort{$m{$b}<=>$m{$a}}sort keys%m;$e>=u($l,$u,$U=$L+$m{$_}/$s)?$L=$U:return$d=$_ for sort@b}

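To run this program, you need this file here, which must be named B. (You can change this filename in the second instance of the character B above.) See below for how to generate this file.

The program uses a combination of Markov models essentially as in this answer by user2699, but with a few small modifications. This produces a distribution for the next character. We use information theory to decide whether to accept an error or spend bits of storage in B encoding hints (and if so, how). We use arithmetic coding to optimally store fractional bits from the model.

The program is 582 bytes long (including an unnecessary final newline) and the binary file B is 69942 bytes long, so under the rules for scoring multiple files, we score L as 582 + 69942 + 1 = 70525.

The program almost certainly requires a 64-bit (little-endian?) architecture. It takes approximately 2.5 minutes to run on an m5.large instance on Amazon EC2.

Test code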

# Golfed submission
require "submission.pl";

use strict; use warnings; use autodie;

# Scoring length of multiple files adds 1 penalty
my $length = (-s "submission.pl") + (-s "B") + 1;

# Read input
open my $IN, "<", "whale2.txt";
my $input = do { local $/; <$IN> };

# Run test harness
my $errors = 0;
for my $i ( 0 .. length($input)-2 ) {
    my $current = substr $input, $i, 1;
    my $decoded = g( $current );

    my $correct = substr $input, $i+1, 1;
    my $error_here = 0 + ($correct ne $decoded);
    $errors += $error_here;
}

# Output score
my $score = 2 * $length + $errors;
print <<EOF;
length $length
errors $errors
score  $score
EOF

The test harness assumes the submission is in the file submission.pl, but this can easily be changed in the second line.

Text comparison

"And did none of ye see it before?" cried Ahab, hailing the perched men all around him.\\"I saw him almost that same instant, sir, that Captain 
"And wid note of te fee bt seaore   cried Ahab, aasling the turshed aen inl atound him. \"' daw him wsoost thot some instant, wer, that Saptain 
"And _id no_e of _e _ee _t _e_ore__ cried Ahab, _a_ling the __r_hed _en __l a_ound him._\"_ _aw him ___ost th_t s_me instant, __r, that _aptain 

Ahab did, and I cried out," said Tashtego.\\"Not the same instant; not the same--no, the doubloon is mine, Fate reserved the doubloon for me. I 
Ahab aid  ind I woued tut,  said tashtego, \"No, the same instant, tot the same -tow nhe woubloon ws mane. alte ieserved the seubloon ior te, I 
Ahab _id_ _nd I ___ed _ut,_ said _ashtego__\"No_ the same instant_ _ot the same_-_o_ _he _oubloon _s m_ne_ __te _eserved the __ubloon _or _e_ I 

only; none of ye could have raised the White Whale first. There she blows!--there she blows!--there she blows! There again!--there again!" he cr
gnly  towe of ye sould have tersed the shite Whale aisst  Ihere ihe blows! -there she blows! -there she blows! Ahere arains -mhere again!  ce cr
_nly_ _o_e of ye _ould have ___sed the _hite Whale _i_st_ _here _he blows!_-there she blows!_-there she blows! _here a_ain__-_here again!_ _e cr

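This sample (chosen in another answer) occurs rather late in the text, so the model is quite developed by this point. Remember that the model is augmented by 70 kilobytes of "hints" that directly help it guess the characters; it is not driven simply by the short snippet of code above.

Generating hints

The following program accepts the exact submission code above (on standard input) and generates the exact B file above (on standard output):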

@S=split"",join"",<>;eval join"",@S[0..15,64..122],'open W,"whale2.txt";($n,@W)=split"",join"",<W>;for$X(0..@W){($h,$n,%m,@b,$s,$E)=($n,$W[$X]);',@S[256..338],'U=0)',@S[343..522],'for(sort@b){$U=($L=$U)+$m{$_}/$s;if($_ eq$n)',@S[160..195],'X<128||print(pack C,$l>>24),',@S[195..217,235..255],'}}'

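It takes approximately as long to run as the submission, as it performs similar computations.

Explanation

In this section, we will attempt to describe what this solution does in sufficient detail that you could "try it at home" yourself. The main technique that differentiates this answer from the others is the "rewind" mechanism described a few sections down, but before we get there, we need to set up the basics.

Model

The basic ingredient of the solution is a language model. For our purposes, a model is something that takes some amount of English text and returns a probability distribution on the next character. When we use the model, the English text will be some (correct) prefix of Moby Dick. Please note that the desired output is a distribution, and not just a single guess for the most likely character.

In our case, we essentially use the model in this answer by user2699. We didn't use the model from the highest-scoring answer (other than our own) by Anders Kaseorg precisely because we were unable to extract a distribution rather than a single best guess. In theory, that answer computes a weighted geometric mean, but we got somewhat poor results when we interpreted that too literally. We "stole" a model from another answer because our "secret sauce" isn't the model but rather the overall approach. If someone has a "better" model, then they should be able to get better results using the rest of our techniques.

As a remark, most compression methods such as Lempel-Ziv can be seen as being a "language model" in this way, though one might have to squint a little bit. (It's particularly tricky for something that does a Burrows-Wheeler transform!) Also, note that the model by user2699 is a modification of a Markov model; essentially nothing else is competitive for this challenge or perhaps even for modeling text in general.

Overall architecture

For the purposes of understanding, it's nice to break up the overall architecture into several pieces. From the highest-level perspective, there needs to be a little bit of state management code. This isn't particularly interesting, but for completeness we want to stress that at every point the program is asked for the next guess, it has available to it a correct prefix of Moby Dick. We do not use our past incorrect guesses in any way. For efficiency's sake, the language model can probably reuse its state from the first N characters to compute its state for the first (N+1) characters, but in principle, it could recompute things from scratch every time it is invoked.

Let's set this basic "driver" of the program aside and peek inside the part that guesses the next character. It helps conceptually to separate three parts: the language model (discussed above), a "hints" file, and an "interpreter". At each step, the interpreter will ask the language model for a distribution for the next character and possibly read some information from the hints file. Then it will combine these parts into a guess. Precisely what information is in the hints file as well as how it is used will be explained later, but for now it helps to keep these parts separate mentally. Note that implementation-wise, the hints file is literally a separate (binary) file, but it could have been a string or something stored inside the program. As an approximation,

If one is using a standard compression method such as bzip2 as in this answer, the "hints" file corresponds to the compressed file. The "interpreter" corresponds to the decompressor, while the "language model" is a bit implicit (as mentioned above).

Why use a hints file?

Let's pick a simple example to analyze further. Suppose that the text is N characters long and well-approximated by a model wherein every character is (independently) the letter E with probability slightly less than a half, T similarly with probability slightly less than a half, and A with probability 1/1000 = 0.1%. Let's assume no other characters are possible; in any case, the A is pretty similar to the case of a previously unseen character out of the blue.

If we operated in the L ≈ 0 regime (as most, but not all, of the other answers to this question do), there's no better strategy for the interpreter than to pick one of E and T. On average, it will get about half of the characters correct. So E ≈ N/2 and the score ≈ N/2 also. However, if we use a compression strategy, then we can compress to a little more than one bit per character. Because L is counted in bytes, we get L ≈ N/8 and thus score ≈ N/4, twice as good as the previous strategy.

Achieving this rate of a little more than one bit per character for this model is slightly nontrivial, but one method is arithmetic coding.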
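The arithmetic in this toy example can be checked with a few lines of Python. The exact probabilities 0.4995 for E and T are an assumption chosen to match "slightly less than a half" with A at 0.1%; the variable names are for illustration only.

```python
import math

# Toy model from the text; 0.4995 is an assumed value for "slightly less than half".
p = {"E": 0.4995, "T": 0.4995, "A": 0.001}

# Shannon entropy in bits per character: the rate an ideal coder achieves.
entropy = -sum(q * math.log2(q) for q in p.values())

# Strategy 1 (L close to 0): always guess one of E/T, so score/N is the error rate.
guess_only_score_per_char = 1 - max(p.values())

# Strategy 2 (E = 0): store entropy/8 bytes per character, each byte costs 2 points.
compress_score_per_char = 2 * entropy / 8
```

With these numbers the entropy comes out a little above one bit per character, so the compression strategy scores roughly N/4 against roughly N/2 for pure guessing.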

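Arithmetic coding

As is commonly known, an encoding is a way of representing some data using bits/bytes. For example, ASCII is a 7 bit/character encoding of English text and related characters, and it is the encoding of the original Moby Dick file under consideration. If some letters are more common than others, then a fixed-width encoding like ASCII is not optimal. In such a situation, many people reach for Huffman coding. This is optimal if you want a fixed (prefix-free) code with an integer number of bits per character.

However, arithmetic coding is even better. Roughly speaking, it is able to use "fractional" bits to encode information. There are many guides to arithmetic coding available online. We'll skip the details here (especially of the practical implementation, which can be a little tricky from a programming perspective) because of the other resources available online, but if someone complains, maybe this section can be fleshed out more.

If one has text actually generated by a known language model, then arithmetic coding provides an essentially optimal encoding of text from that model. In some sense, this "solves" the compression problem for that model. (Thus in practice, the main issue is that the model isn't known, and some models are better than others at modeling human text.) If it were not allowed to make errors in this contest, then in the language of the previous section, one way to produce a solution to this challenge would have been to use an arithmetic encoder to generate a "hints" file from the language model and then use an arithmetic decoder as the "interpreter".

In this essentially optimal encoding, we end up spending -log_2(p) bits for a character with probability p, and the overall bit-rate of the encoding is the Shannon entropy. This means that a character with probability near 1/2 takes about one bit to encode, while one with probability 1/1000 takes about 10 bits (because 2^10 is roughly 1000).

But the scoring metric for this challenge was well-chosen to avoid compression as the optimal strategy. We'll have to figure out some way to make some errors as a tradeoff for getting a shorter hints file. For example, one strategy one might try is a simple branching strategy: we generally try to use arithmetic encoding when we can, but if the probability distribution from the model is "bad" in some way we just guess the most likely character and don't try encoding it.

Why make errors?

Let's analyze the example from before in order to motivate why we might want to make errors "intentionally". If we use arithmetic coding to encode the correct character, we will spend roughly one bit in the case of an E or T, but about ten bits in the case of an A.

Overall, this is a pretty good encoding, spending a little over a bit per character even though there are three possibilities; basically, the A is fairly unlikely and we don't end up spending its corresponding ten bits too often. However, wouldn't it be nice if we could just make an error instead in the case of an A? After all, the metric for the problem considers 1 byte = 8 bits of length to be equivalent to 2 errors; thus it seems like one should prefer an error instead of spending more than 8/2 = 4 bits on a character. Spending more than a byte to save one error definitely sounds suboptimal!

The "rewind" mechanism

This section describes the main clever aspect of this solution, which is a way to handle incorrect guesses at no cost in length.

For the simple example we've been analyzing, the rewind mechanism is particularly straightforward. The interpreter reads one bit from the hints file. If it is a 0, it guesses E. If it is a 1, it guesses T. The next time it is called, it sees what the correct character is. If the hints file is set up well, we can ensure that in the case of an E or T, the interpreter guesses correctly. But what about A? The idea of the rewind mechanism is to simply not code A at all. More precisely, if the interpreter later learns that the correct character was an A, it metaphorically "rewinds the tape": it returns the bit it read previously. The bit it read does intend to code E or T, but not now; it will be used later. In this simple example, this basically means that it keeps guessing the same character (E or T) until it gets it right; then it reads another bit and keeps going.

The encoding for this hints file is very simple: turn all of the Es into 0 bits and Ts into 1 bits, all while ignoring the As entirely. By the analysis at the end of the previous section, this scheme makes some errors but reduces the score overall by not encoding any of the As. As a smaller effect, it actually saves on the length of the hints file as well, because we end up using exactly one bit for each E and T, instead of slightly more than a bit.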
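For this toy E/T/A alphabet, the rewind mechanism can be sketched in a few lines of Python. This is an illustration of the idea only, not the Perl submission: the function names are hypothetical, the hints are kept as a plain list of bits rather than an arithmetic-coded file, and the first character is assumed to be given for free since nothing is guessed before it.

```python
def make_hints(text):
    """One bit per E or T after the first character; A is never encoded."""
    return [0 if ch == "E" else 1 for ch in text[1:] if ch in "ET"]


def count_errors(text):
    """Replay the interpreter over text and return (errors, hint bits used)."""
    hints = make_hints(text)
    pos = 0      # index of the hint bit currently in use
    errors = 0
    for actual in text[1:]:
        # Peek at the current bit; it is only consumed after a correct guess.
        bit = hints[pos] if pos < len(hints) else 0
        guess = "ET"[bit]
        if guess == actual:
            pos += 1      # the bit has done its job, move to the next one
        else:
            errors += 1   # "rewind": keep the same bit for the next guess
    return errors, len(hints)


result = count_errors("TEATTAAEET")
```

In this sketch every A costs exactly one error and zero hint bits, while every E and T costs one hint bit and no error.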

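A little theorem

How do we decide when to make an error? Suppose that our model gives us a probability distribution P for the next character. We will separate the possible characters into two classes: coded and not coded. If the correct character is not coded, then we will end up using the "rewind" mechanism to accept an error at no cost in length. If the correct character is coded, then we will use some other distribution Q to encode it using arithmetic coding.

But what distribution Q should we choose? It's not too hard to see that the coded characters should all have higher probability (in P) than the not coded characters. Also, the distribution Q should only include the coded characters; after all, we're not coding the other ones, so we shouldn't be "spending" entropy on them. It's a little trickier to see that the probability distribution Q should be proportional to P on the coded characters. Putting these observations together means that we should code the most likely characters but possibly not the less likely characters, and that Q is simply P rescaled on the coded characters.

It furthermore turns out that there's a cool theorem regarding which "cutoff" one should pick for the coded characters: you should code a character as long as it is at least 1/5.393 as likely as the other coded characters combined. This "explains" the appearance of the seemingly random constant 5.393 near the end of the program above. The number 1/5.393 ≈ 0.18542 is the solution to the equation -p log(16) - p log p + (1+p) log(1+p) = 0.

Perhaps it's a reasonable idea to write out this procedure in code. This snippet is in C++: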

// Assume the model is computed elsewhere.
unordered_map<char, double> model;

// Transform p to q
unordered_map<char, double> code;
priority_queue<pair<double,char>> pq;
for( char c : CHARS )
    pq.push( make_pair(model[c], c) );
double s = 0, p;
while( !pq.empty() ) {  // stop once every candidate character has been considered
    char c = pq.top().second;
    pq.pop();
    p = model[c];
    if( s > 5.393*p )
        break;
    code[c] = p;
    s += p;
}
for( auto& kv : code ) {
    char c = kv.first;
    code[c] /= s;
}
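The cutoff constant quoted in this section can be checked numerically. The following Python sketch solves the stated equation by bisection; the function name `g` and the bracketing interval [0.1, 0.3] are choices made here for illustration.

```python
import math


def g(p):
    """Left-hand side of the cutoff equation, using natural logarithms."""
    return -p * math.log(16) - p * math.log(p) + (1 + p) * math.log(1 + p)


# g is positive at 0.1 and negative at 0.3, so a root lies in between.
lo, hi = 0.1, 0.3
for _ in range(60):
    mid = (lo + hi) / 2
    if g(mid) > 0:
        lo = mid
    else:
        hi = mid

root = (lo + hi) / 2
cutoff = 1 / root  # the constant that appears as 5.393 in the program
```

The base of the logarithm does not matter here, since every term of the equation is scaled by the same factor.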

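Putting it all together

The previous section is unfortunately a little technical, but if we put all of the other pieces together, the structure is as follows. Whenever the program is asked to predict the next character after a given correct character:

Add the correct character to the known correct prefix of Moby Dick.
Update the (Markov) model of the text.
The secret sauce: if the previous guess was incorrect, rewind the state of the arithmetic decoder to its state before the previous guess!
Ask the Markov model to predict a probability distribution P for the next character.
Transform P to Q using the subroutine from the previous section.
Ask the arithmetic decoder to decode a character from the remainder of the hints file, according to the distribution Q.
Guess the resulting character.

The encoding of the hints file operates similarly. In that case, the program knows what the correct next character is. If it is a character that should be coded, then of course one should use the arithmetic encoder on it; but if it is a not coded character, it just doesn't update the state of the arithmetic encoder.

If you understand the information-theoretic background like probability distributions, entropy, compression, and arithmetic coding but tried and failed to understand this post (except why the theorem is true), let us know and we can try to clear things up. Thanks for reading!

— A. Rex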

8

Wow, impressive answer. I assume there's additional code needed to generate the B file? If so, could you please include that in your answer?

— Nathaniel

8

Excellent! The first (and so far the only) answer to break the 500k score barrier.

— ShreevatsaR

5

"Cried, this is the whale" ... I'm crying

— Phill

5

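As no new answers were posted during the bounty period, I'm awarding it to your answer, as both the best-scoring and the most sophisticated approach. If you ever have time, I would really appreciate a more in-depth explanation of how this answer works, i.e. what exactly is the algorithm?

— Nathaniel

2

@Nathaniel: I added an explanation to this post. Let me know if you think it's detailed enough to reproduce the solution yourself.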

— A. Rex

77

Python 3, 2·267 + 510193 = 510727

Predictor

def p():
 d={};s=b''
 while 1:
  p={0:1};r=range(len(s)+1)
  for i in r:
   for c,n in d.setdefault(s[:i],{}).items():p[c]=p.get(c,1)*n**b'\1\6\f\36AcWuvY_v`\270~\333~'[i]
  c=yield max(sorted(p),key=p.get)
  for i in r:e=d[s[:i]];e[c]=e.get(c,1)+1
  s=b'%c'%c+s[:15]

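This uses a weighted Bayesian combination of the order 0, …, 16 Markov models, with weights [1, 6, 12, 30, 65, 99, 87, 117, 118, 89, 95, 118, 96, 184, 126, 219, 126].

The result is not very sensitive to the selection of these weights, but I optimized them because I could, using the same late acceptance hill-climbing algorithm that I used in my answer to "Put together a Senate majority", where each candidate mutation is just a ±1 increment to a single weight.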
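For readers who find the golfed generator hard to follow, here is an ungolfed sketch of the same scoring rule in Python, written as a class instead of a generator. The class and method names are hypothetical; the weights are the ones listed above, and the counting convention (a first sighting is stored as 2, an unseen byte contributes a factor of 1) is read off the golfed code.

```python
WEIGHTS = [1, 6, 12, 30, 65, 99, 87, 117, 118, 89, 95, 118, 96, 184, 126, 219, 126]


class MixedMarkovPredictor:
    def __init__(self):
        # context (bytes, most recent byte first) -> {next byte: count}
        self.counts = {}
        self.context = b""

    def guess(self):
        """Return the byte with the largest product of count ** weight."""
        scores = {0: 1}  # default guess when nothing has been seen yet
        for order in range(len(self.context) + 1):
            seen = self.counts.get(self.context[:order], {})
            for byte, n in seen.items():
                scores[byte] = scores.get(byte, 1) * n ** WEIGHTS[order]
        # sorted() makes ties resolve to the smallest byte value.
        return max(sorted(scores), key=scores.get)

    def update(self, byte):
        """Record byte as the successor of every context of order 0..16."""
        for order in range(len(self.context) + 1):
            seen = self.counts.setdefault(self.context[:order], {})
            seen[byte] = seen.get(byte, 1) + 1
        self.context = bytes([byte]) + self.context[:15]


model = MixedMarkovPredictor()
first_guess = model.guess()
for b in b"abababab":
    model.update(b)
next_guess = model.guess()
```

Multiplying `count ** weight` across orders is the same as taking a weighted sum of log-counts, which is why longer contexts with large weights dominate once they have been seen a few times.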

Test code

with open('whale2.txt', 'rb') as f:
    g = p()
    wrong = 0
    a = next(g)
    for b in f.read():
        wrong += a != b
        a = g.send(b)
    print(wrong)

— Anders Kaseorg

2

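The right tool for the job. Great score. Nice one.

— agtoever

1

Possible clarification: b"\0\3\6\r\34'&-20'\22!P\n[\26" is an ASCII representation of the weights, where small non-printable values are escaped in octal.

— 心教堂

I've updated the question with a version of the file where the text is not wrapped - you might try re-running your code on that (it might do slightly better)

— Nathaniel

3

Thanks for that explanation - if you could edit a synopsis into the question it would be great. (The experience with my previous challenge "Paint Starry Night" is that these optimisation procedures are the most interesting part of the answers, so it's much better if answers include the code used to do that and an explanation of it. I included a rule in both challenges saying that they should.)

— Nathaniel

1

@Christoph My model combination is actually a weighted geometric mean. But PAQ's averaging in the logistic domain is slightly different; I'll have to see if that's better.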

— Anders Kaseorg

55

Python 3, ~~2*279 + 592920 = 593478~~ ~~2*250 + 592467 = 592967~~ ~~2*271 + 592084 = 592626~~ ~~2*278 + 592059 = 592615~~ ~~2*285 + 586660 = 587230~~ ~~2*320 + 585161 = 585801~~ 2*339 + 585050 = 585728

d=m={}
s=1
w,v='',0
def f(c):
 global w,m,v,s,d
 if w not in m:m[w]={}
 u=m[w];u[c]=c in u and 1+u[c]or 1;v+=1;q=n=' ';w=w*s+c;s=c!=n
 if w in m:_,n=max((m[w][k],k)for k in m[w])
 elif s-1:n=d in'nedtfo'and't'or'a'
 elif'-'==c:n=c
 elif"'"==c:n='s'
 elif'/'<c<':':n='.'
 if v>4*(n!=q)+66:n='\n'
 if s:d=c
 if c<q:w=w[:-1]+q;v=s=0
 return n

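Try it online!

A function using global variables. It learns as it goes, building a model at the word level: given what it's seen so far in this word, what's the most common next character? As more input comes in it learns common words from the text pretty well, and it also learns the most common character to start the next word.

For example:

If what it's seen so far is "Captai" it predicts an "n"
If it's "Captain" it predicts a space
If it's the start of a word, and the last word was "Captain", it predicts an "A"
If the word so far is "A" it predicts an "h" (and then "a" and "b"; similarly for "C").

It doesn't do very well at the start, but by the end there are large parts of actual words coming out. The fallback option is a space, and after a single space it's an "a", unless the preceding letter was one of "nedtfo", a digit, or a hyphen or apostrophe. It also aggressively predicts line breaks after 71 characters, or if a space was expected after 66. Both of those were just tuned to the data ("t" is far more common after a space, but has more often already been predicted, so "a" is a better guess outside those six special cases).

Finding what word pairs went together and preseeding the mapping turned out not to be worthwhile.

It ends up with text like this: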

nl tneund his    I woi tis tnlost ahet toie tn tant  wod, ihet taptain Ahab ses
 snd t
oeed Sft   aoid thshtego    Io, fhe soie tn tant  tot the soie      ahe sewbtoon
swn tagd  aoths eatmved fhe sewbtoon wor ta  I sfey  aote of totsonld nive betse
d ahe
hate Whale iorst  Ihe e ioi beaos! -there soi beaos! -there soi beaos!

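which corresponds to this part of the input:

all around him.

"I saw him almost that same instant, sir, that Captain Ahab did, and I cried out," said Tashtego.

"Not the same instant; not the same--no, the doubloon is mine, Fate reserved the doubloon for me. I only; none of ye could have raised the White Whale first. There she blows!--there she blows!--there she blows!

You can see where proper nouns in particular come out pretty well, but the ends of words are mostly right as well. When it's seen "dou" it expects "doubt", but once the "l" appears it gets "doubloon".

If you run it a second time with the same model it just built, it immediately gets another 92k correct (51.7% -> 59.3%), but it's always just under 60% from the second iteration on.

The measuring code is in the TIO link, or here's a slightly better version: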

total = 0
right = 0
p = None  # previous guess; there is none before the first character
with open('whale.txt') as fp:
    with open('guess.txt', 'w') as dest:
        for l in fp.readlines():
            for c in l:
                last = c
                if p == c: right += 1
                n = f(c)
                p = n
                total += 1
                dest.write(n)
                if total % 10000 == 0:
                    print('{} / {} E={}\r'.format(right, total, total-right), end='')
print('{} / {}: E={}'.format(right, total, total - right))

guess.txt has the guessed output at the end.

— Michael Homer

3

This is an excellent approach!

— Skyler

2

Too many <s></s> ;)

— FantaC

1

+1 because this approach reminded me of the LZW compression algorithm.

— Marcos

25

C++, score: 2*132 + 865821 = 866085

Thanks to @Quentin for saving 217 bytes!

int f(int c){return c-10?"t \n 2  sS \n  -  08........       huaoRooe oioaoheu thpih eEA \n   neo    enueee neue hteht e"[c-32]:10;}

A very simple solution that, given a character, just outputs the character that most frequently appears after the input character.
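The answer does not show how the lookup string was produced. One plausible way to build such a table, sketched here in Python with hypothetical function names, is to count which character most often follows each character in the text:

```python
from collections import Counter, defaultdict


def successor_table(text):
    """Map each character to the character that most often follows it."""
    follow = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        follow[cur][nxt] += 1
    return {cur: seen.most_common(1)[0][0] for cur, seen in follow.items()}


def count_errors(text, table, default=" "):
    """Number of wrong guesses when predicting each next character from the table."""
    return sum(table.get(cur, default) != nxt for cur, nxt in zip(text, text[1:]))


sample = "the theme then"
table = successor_table(sample)
errors = count_errors(sample, table)
```

Unlike the adaptive answers, a table like this is computed offline from the whole file, so its size counts towards L while the predictor itself needs no state.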

Verify the score with:

#include <iostream>
#include <fstream>

int f(int c);

int main()
{
    std::ifstream file;
    file.open("whale2.txt");

    if (!file.is_open())
        return 1;

    char p_ch, ch;
    file >> std::noskipws >> p_ch;
    int incorrect = 0;
    while (file >> std::noskipws >> ch)
    {
        if (f(p_ch) != ch)
            ++incorrect;
        p_ch = ch;
    }

    file.close();

    std::cout << incorrect;
}

Edit: Using whale2.txt gives a better score.

— Steadybox

5

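You can translate this array into a string literal and inline it directly in place of L to save a bunch of characters :)

— Quentin

@Quentin Thanks! Now I'm wondering why I didn't think of that in the first place...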

— Steadybox

20

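Python, 2*516 + 521122 = 522154

Algorithm:

Yet another Python submission. This algorithm calculates the most likely next letter by looking at sequences of length 1,...,l. The sum of probabilities is used, and there are a few tricks to get better results.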

from collections import Counter as C, defaultdict as D
R,l=range,10
s,n='',[D(C) for _ in R(l+1)]
def A(c):
 global s;s+=c;
 if len(s)<=l:return ' '
 P=D(lambda:0)
 for L in R(1,l+1):
  w=''.join(s[-L-1:-1]);n[L][w].update([c]);w=''.join(s[-L:])
  try:
   q,z=n[L][w].most_common(1)[0];x=sum(list(n[L][w].values()))
  except IndexError:continue
  p=z/x
  if x<3:p*=1/(3-x)
  P[q]+=p
 if not P:return ' '
 return max(P.items(),key=lambda i:i[1])[0]
import this, codecs as d
[A(c) for c in d.decode(this.s, 'rot-13')]

Results:

Mostly gibberish, although you can see it picks up on the occasional phrase, such as "Father Mapple".

errors: 521122
TRAINING:
result:  tetlsnowleof the won -opes  aIther Mapple,woneltnsinkeap hsd   lnd the  thth a shoey,aeidorsbine ao
actual: ntal knobs of the man-ropes, Father Mapple cast a look upwards, and then with a truly sailor-like bu
FINAL:
result: mnd wnd round  ahe   ind tveryaonsracting th ards the sol ens-ike aeock tolblescn the sgis of thet t
actual: und and round, then, and ever contracting towards the button-like black bubble at the axis of that s

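Test code:

Pretty simple; it outputs some examples of the text at different points. It uses whale2.txt, as this avoids some extra logic to calculate newlines.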

from minified import A

def score(predict, text):
    errors = 0
    newtext = []
    for i, (actual, current) in  enumerate(zip(text[1:], text[:-1])):
        next = predict(current)
        errors += (actual != next)
        newtext.append(next)
        if (i % (len(text) // 100) == 0):
            print ('.', end='', flush=True)
    return errors, ''.join(newtext)

t = open('whale2.txt')
text = t.read()
err2, text2 = score(A, text)
print('errors:', err2)
print("TRAINING:")
print(text2[100000:100100].replace('\n', '\\n'))
print(text[100001:100101].replace('\n', '\\n'))
print("FINAL:")
print(text2[121400:1215500].replace('\n', '\\n'))
print(text[121401:1215501].replace('\n', '\\n'))

— user2699

3

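Welcome to the site! This is a fantastic first submission. :)

— DJMcMayhem

@DJMcMayhem, thanks for the welcome. I've enjoyed watching for a while now; this is the first contest to catch my attention.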

— user2699

19

C (gcc), ~~679787~~ 652892

~~84~~ 76 bytes, ~~679619~~ 652740 incorrect guesses

p[128][128][128][128];a,b,c,d;g(h){p[a][b][c][d]=h;h=p[a=b][b=c][c=d][d=h];}

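Try it online!

Update: ~27000 points saved with the updated file, 16 points (8 bytes) with a better-golfed function.

Explanation

The way this works is that as the code runs through the text, it memorizes the last character that terminated any given 4-character sequence, and returns that value. It is somewhat similar to Arnauld's approach above, but relies on the inherent likelihood of two given 4-character sequences terminating the same way.

De-golfed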

p[128][128][128][128];
a,b,c,d;
g(h){
    p[a][b][c][d]=h; // Memorize the last character.
    h=p[a=b][b=c][c=d][d=h]; // Read the guess. We save several
                             // bytes with the assignments inside indices.
}
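The same "remember what followed this 4-character sequence last time" idea can be sketched in Python, using a dictionary in place of the 128x128x128x128 array. The class name is hypothetical, and unseen contexts return a NUL character to mirror the zero-initialised C array.

```python
class LastSeenPredictor:
    def __init__(self, order=4):
        self.order = order
        self.context = "\0" * order  # mirrors the zero-initialised globals a, b, c, d
        self.last = {}               # context -> character that followed it last time

    def feed(self, ch):
        """Store ch as the successor of the old context, then look up the new one."""
        self.last[self.context] = ch              # p[a][b][c][d] = h
        self.context = self.context[1:] + ch      # shift the window by one character
        return self.last.get(self.context, "\0")  # unseen contexts read as 0


predictor = LastSeenPredictor()
guesses = [predictor.feed(c) for c in "abcdeabcde"]
```

The predictor only starts producing useful guesses once a 4-character window repeats, which in the sample above happens when "abcd" is seen for the second time.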

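... The TIO link is useless. So the function returns the value of the last assignment?

— user202729

Let me edit the answer with an explanation, then :)

1

@Rogem I've added a de-golfed version (which I did because I couldn't follow it either) - hopefully this doesn't bother you, but please roll back if desired.

— Adam Davis

@AdamDavis on most implementations of C, all global variables start at zero. It's undefined behaviour, so it's only used in code-golf.

— NieDzejkob

1

@NieDzejkob Ah, you're right, thanks! "ANSI-C requires that all uninitialized static/global variables have to be initialized with 0."

— Adam Davis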

16

sh+bzip2, 2*364106 = 728212

~~2 * 381249 + 0 = 762498~~

dd if=$0 bs=1 skip=49|bunzip2&exec cat>/dev/null

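followed by the bzip2-compressed whale2.txt with the first byte missing

Ignores its input; outputs the correct answer. This provides a baseline on one end; daniero provides a baseline on the other end.

Builder script: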

#!/bin/sh
if [ $# -ne 3 ]
then
    echo "Usage $0 gen.sh datafile output.sh"
    exit 1
fi

cat $1 > $3
dd ibs=1 if=$2 skip=1 | bzip2 -9 >> $3
chmod +x $3

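I/O test harness (tcc; cut off the first line for gcc). This test harness can be used by anybody on a suitable platform who submits a complete program that expects read/write I/O. It uses byte-at-a-time I/O to avoid cheating. The child program must flush its output after every byte to avoid blocking.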

#!/usr/bin/tcc -run
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

int main(int argc, char **argv)
{
    volatile int result;
    int readfd[2];
    int writefd[2];
    int cppid;
    int bytecount;
    char c1, c2, c3;
    if (argc != 2) {
        printf("write X approximately -- service host\n");
        printf("Usage: %s serviceprocessbinary < source.txt\n", argv[0]);
        return 1;
    }
    /* Start service process */
    if (pipe(readfd)) {
        perror("pipe()");
        return 3;
    }
    if (pipe(writefd)) {
        perror("pipe()");
        return 3;
    }
    result = 0;
    if (!(cppid = vfork())) {
        char *argtable[3];
        argtable[0] = argv[1];
        argtable[1] = NULL;
        dup2(readfd[0], 0);
        dup2(writefd[1], 1);
        close(readfd[1]);
        close(writefd[0]);
        close(readfd[0]);
        close(writefd[1]);
        execvp(argv[1], argtable);
        if (errno == ENOEXEC) {
            argtable[0] = "/bin/sh";
            argtable[1] = argv[1];
            argtable[2] = NULL;
            /* old standard -- what isn't an executable
             * can be exec'd as a /bin/sh script */
            execvp("/bin/sh", argtable);
            result = ENOEXEC;
        } else {
            result = errno;
        }
        _exit(3);
    } else if (cppid < 0) {
        perror("vfork()");
        return 3;
    }
    if (result) {
        errno = result;
        perror("execvp()");
        return 3;
    }
    close(readfd[0]);
    close(writefd[1]);
    /* check results */
    read(0, &c2, 1);
    bytecount = 1;
    errno = 0;
    while (read(0, &c1, 1) > 0) {
        write(readfd[1], &c2, 1);
        if (read(writefd[0], &c3, 1) <= 0) {
            printf("%d errors (%d bytes)\n", result, bytecount);
            if (errno == 0)
                fprintf(stderr, "pipe: unexpected EOF\n");
            else
                perror("pipe");
            return 3;
        }
        if (c3 != c1)
            ++result;
        c2 = c1;
        ++bytecount;
    }
    printf("%d errors (%d bytes)\n", result, bytecount);
    return 0;
}

— Joshua

6

I think what he's asking is: how does this not violate the

but may not load any other external files, and your code may not access the whale.txt file in any way other than described above.

clause?

8

@Rogem The compressed data is put after what is shown here, and the code accesses itself.

— user202729

4

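The question says: "Your submission will be a program or function (etc.) that will be called or invoked multiple times. When it is called for the nth time it will be given the nth character of whale.txt or whale2.txt and it must output its guess for the (n+1)th character." How is this requirement met? The code outputs the entire text of whale.txt every time it is executed.

— axiac

1

@axiac "anything is fine, as long as your program always gives one byte of output before receiving its next byte of input."

— user202729

5

@axiac given the test harness, I'm happy to regard sending the program one byte from STDIN as "calling or invoking" it. The important thing is that the program returns one byte of output after each byte of input, which this one does actually do when run through the test harness. As the question says, "anything is fine, as long as your program always gives one byte of output before receiving its next byte of input."

— Nathaniel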

13

Python 3, 879766

F=[[0]*123for _ in range(123)]
P=32
def f(C):global P;C=ord(C);F[P][C]+=1;P=C;return chr(max(enumerate(F[C]),key=lambda x:x[1])[0])

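Try it online!

... The /// answer which prints a space gets 10 upvotes, while my code only gets 3 ...

Explanation:

For each character, the program:

increments frequency[prev][char]
finds the character that appears the most times in frequency[char]
and outputs it.

Ungolfed code is in the TIO link, commented out.
The code is 131 bytes.
The code run on my machine reports: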

879504 / 1215235
Time: 62.01348257784468

which gives the total score

2*131 + 879504 = 879766

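Because there's no way to upload a large file to TIO (except asking Dennis), the example in the TIO link only runs the program on a small part of the text.

Compared with the older answer, this one has 362 more incorrect characters, but the code is 255 bytes shorter. The multiplier makes my submission have a lower score.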

— 用户名

13

C#, 378*2 + 569279 = 570035

using System.Collections.Generic;using System.Linq;class P{Dictionary<string,Dictionary<char,int>>m=new
Dictionary<string,Dictionary<char,int>>();string b="";public char N(char
c){if(!m.ContainsKey(b))m[b]=new Dictionary<char,int>();if(!m[b].ContainsKey(c))m[b][c]=0;m[b][c]++;b+=c;if(b.Length>4)b=b.Remove(0,1);return
m.ContainsKey(b)?m[b].OrderBy(k=>k.Value).Last().Key:' ';}}

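This approach uses a look-up table to learn the most common character that follows a given string. The keys of the look-up table have a maximum of 4 characters, so the function first updates the look-up table with the current character and then just checks which character is the most likely to come after the 4 previous ones, including the current one. If those 4 characters are not found in the look-up table, it prints a space.

This version uses the whale2.txt file, as it greatly improves the number of successful guesses.

Following is the code used to test the class: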

using System;
using System.IO;
using System.Text;

public class Program
{
    public static void Main(string[] args)
    {
        var contents = File.OpenText("whale2.txt").ReadToEnd();
        var predictor = new P();

        var errors = 0;
        var generated = new StringBuilder();
        var guessed = new StringBuilder();
        for (var i = 0; i < contents.Length - 1; i++)
        {
            var predicted = predictor.N(contents[i]);
            generated.Append(predicted);
            if (contents[i + 1] == predicted)
                guessed.Append(predicted);
            else
            {
                guessed.Append('_');
                errors++;
            }
        }

        Console.WriteLine("Errors/total: {0}/{1}", errors, contents.Length);
        File.WriteAllText("predicted-whale.txt", generated.ToString());
        File.WriteAllText("guessed-whale.txt", guessed.ToString());

        Console.ReadKey();
    }
}

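The code runs in barely 2 seconds. Just for the record, this is what I get when I modify the size of the keys of the look-up table (including the results of a second run without resetting the model):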

Size   Errors   Errors(2)
-------------------------
1      866162   865850
2      734762   731533
3      621019   604613
4      569279   515744
5      579446   454052
6      629829   396855
7      696912   335034
8      765346   271275
9      826821   210552
10     876471   158263

It would be interesting to know why a key size of 4 characters is the best choice in this algorithm.

Text comparison

Original:

"And did none of ye see it before?" cried Ahab, hailing the perched men all around him.

"I saw him almost that same instant, sir, that Captain Ahab did, and I cried out," said Tashtego.

"Not the same instant; not the same--no, the doubloon is mine, Fate reserved the doubloon for me. I only; none of ye could have raised the White Whale first. There she blows!--there she blows!--there she blows! There again!--there again!"

Recreated:

"Tnd tes note of to seamtn we ore  
sried thab  wedleng the srriead te  a l tneund tes  
"T day tim t lost shet toie tn tand  aor, ahet taptain thab sid  tnd t waued tnt   said teshtego  
"To, ahe shme tn tand  aot the shme whot nhe sewbteodsan tagd  althsteatnved the sewbteodsaor te, I hncy  aote of to sanld bave beised the shate Whale iorst  Bhe e ati boaos  -the   ati boaos  -the   ati boaos  the e anains -ahe   anains

Guesses:

"_nd ___ no_e of __ se____ _e_ore____ried _hab_ ___l_ng the __r___d _e_ a_l ___und _____
"_ _a_ _im ___ost _h_t ___e _n_tan__ __r, _h_t _aptain _hab _id_ _nd _ ___ed __t__ said __shtego__
"_o_ _he s_me _n_tan__ _ot the s_me___o_ _he ___b__o____ _____ __t___e___ved the ___b__o___or _e_ I _n_y_ _o_e of __ ___ld _ave __ised the _h_te Whale __rst_ _he_e ___ b___s__-the__ ___ b___s__-the__ ___ b___s_ _he_e a_ain__-_he__ a_ain__

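Change log

569279 - changed to whale2.txt and thus removed the optimization.
577366 - optimized with code that tried to guess when to return a line feed.
590354 - original version.

— Charlie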

4

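Thanks for showing the variance as you change the key size and column threshold!

— Jeremy Weirich

I've updated the question with a version of the file where the text is not wrapped - you can probably save some points by using that

— Nathaniel

@Nathaniel It does indeed. I've updated the answer.

— Charlie

You can save some bytes by using var instead of declaring the types.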

— Ed T

1

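As the key size gets larger, the number of hits versus misses will go down, and thus more spaces will be output when a shorter key might have guessed the correct character. As the key size gets smaller, the individual guesses are less accurate for segments that match. I suspect this is why a length of four is optimal. If you maintained keys of multiple lengths and used shorter matches when longer ones aren't available, then I expect the hit rate (and thus the score) would be greatly improved at longer key lengths.

— Jeffrey L Whitledge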

11

Java 7，1995个字符，（1995 * 2 + 525158）529148

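Java sucks for small program sizes. Anyway, I tried several extremely complex and tricky approaches that produced surprisingly poor results. I then went back and did a simple approach, which resulted in a smaller program and better results.

The approach really is very simple. It blindly feeds the previous x characters (plus every suffix of those characters) into a hash table, mapped to the current character. It then keeps track of which patterns most accurately predict the current character. If a pattern that precedes a certain character is encountered multiple times, it succeeds in predicting that character. Longer strings take precedence, and for a given string the character that most often follows it takes precedence. The algorithm knows nothing about the type of document or about the English language.

I settled on 9 characters, and on trying to match whole words within those previous 9 characters where possible. Without the word matching, the optimum length is 6 characters, and it produces several thousand more mispredictions.

One interesting observation is that using 20 characters gives bad predictions on the first pass but 99.9 percent accuracy on subsequent passes. The algorithm is basically able to memorize the book in overlapping 20-byte chunks, and these are distinct enough for it to recall the entire book one character at a time.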
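
The memorization effect described above can be reproduced on a toy scale. The sketch below is an assumption-laden Python illustration, not the Java submission: `count_errors` is a hypothetical helper that uses a single fixed context length and guesses a space when the context is unknown.

```python
from collections import defaultdict, Counter

def count_errors(text, context_len, table=None):
    """Run one pass over `text` and return (errors, table).

    `table` maps each context of up to `context_len` characters to a Counter
    of the characters that followed it; pass it back in for a second pass.
    """
    table = defaultdict(Counter) if table is None else table
    errors = 0
    for i in range(len(text) - 1):
        context = text[max(0, i + 1 - context_len): i + 1]
        followers = table.get(context)
        guess = followers.most_common(1)[0][0] if followers else " "
        if guess != text[i + 1]:
            errors += 1
        table[context][text[i + 1]] += 1  # learn only after guessing
    return errors, table

sample = "Call me Ishmael. Some years ago, never mind how long precisely."
first_pass, table = count_errors(sample, 20)
second_pass, _ = count_errors(sample, 20, table)
print(first_pass, second_pass)
```

On this short sample no 20-character window repeats, so the second pass makes no errors at all. The full book has some repeated windows with different continuations, which is consistent with the answer reporting 99.9 percent rather than 100.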

(1950 * 2 + 532919) 536819
(2406 * 2 + 526233) 531045 checking for punctuation to make better guesses
(1995 * 2 + 525158) 529148 more tweaking, golfed away some verbiage

package mobydick;

import java.util.HashMap;

public class BlindRankedPatternMatcher {
    String previousChars = "";
    int FRAGLENGTH = 9;
    // Maps each pattern to the characters seen after it, with their counts.
    HashMap<String, HashMap<String, WordInfo>> patternPredictor = new HashMap<>();

    void addWordInfo(String key, String prediction) {
        HashMap<String, WordInfo> predictions = patternPredictor.get(key);
        if (predictions == null) {
            predictions = new HashMap<>();
            patternPredictor.put(key, predictions);
        }
        WordInfo info = predictions.get(prediction);
        if (info == null) {
            info = new WordInfo(prediction);
            predictions.put(prediction, info);
        }
        info.freq++;
    }

    // Most frequent follower of the pattern, or null if the pattern is unknown.
    String getTopGuess(String pattern) {
        if (patternPredictor.get(pattern) != null) {
            java.util.List<WordInfo> predictions = new java.util.ArrayList<>();
            predictions.addAll(patternPredictor.get(pattern).values());
            java.util.Collections.sort(predictions);
            return predictions.get(0).word;
        }
        return null;
    }

    String mainGuess() {
        // Prefer a match that starts at punctuation or a space (word matching).
        if (trimGuess(",") != null) return trimGuess(",");
        if (trimGuess(";") != null) return trimGuess(";");
        if (trimGuess(":") != null) return trimGuess(":");
        if (trimGuess(".") != null) return trimGuess(".");
        if (trimGuess("!") != null) return trimGuess("!");
        if (trimGuess("?") != null) return trimGuess("?");
        if (trimGuess(" ") != null) return trimGuess(" ");
        // Otherwise try the longest suffix first, then shorter ones.
        for (int x = 0; x < previousChars.length(); x++) {
            String tg = getTopGuess(previousChars.substring(x));
            if (tg != null) {
                return tg;
            }
        }
        return "\n";
    }

    String trimGuess(String c) {
        if (previousChars.contains(c)) {
            String test = previousChars.substring(previousChars.indexOf(c));
            return getTopGuess(test);
        }
        return null;
    }

    public String predictNext(String newChar) {
        if (previousChars.length() < FRAGLENGTH) {
            previousChars += newChar;
        } else {
            for (int x = 0; x < previousChars.length(); x++) {
                addWordInfo(previousChars.substring(x), newChar);
            }
            previousChars = previousChars.substring(1) + newChar;
        }
        return mainGuess();
    }

    class WordInfo implements Comparable<WordInfo> {
        public WordInfo(String text) {
            this.word = text;
        }

        String word;
        int freq = 0;

        @Override
        public int compareTo(WordInfo arg0) {
            // Sort in descending order of frequency.
            return Integer.compare(arg0.freq, this.freq);
        }
    }
}

— Jim W

That's a pretty good score for such a verbose language.

— DJMcMayhem

1

I figured it was worth a shot, since the size of the file leaves a lot of room for improvement compared to the program size.

— Jim W

3

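This doesn't compile under Java 7 (or any Java version, for what it's worth). Could you fix your code? Once that's done, I'll gladly golf it to improve your score.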

— Olivier Grégoire

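Untested, but this should be your exact same code, slightly golfed: 950 bytes. Your current code contains quite a few errors, though, so I'm not sure I filled everything in correctly. Again, it's untested, so just compare the versions to see what I've changed or renamed, and check that everything still works the same as your original code. It can definitely be golfed some more.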

— Kevin Cruijssen

Crap, I wrote this while bored at my old job and didn't take the code with me. I'll have to look at it to see where the typo is.

— Jim W

10

Python 3, 2×496 + 619608 = 620600 (previously 2×497 + 619608 = 620602)

import operator as o
l=''
w=''
d={}
p={}
s=0
def z(x,y):
 return sorted([(k,v) for k,v in x.items() if k.startswith(y)],key=o.itemgetter(1))
def f(c):
 global l,w,d,p,s
 r=' '
 if c in' \n':
  s+=1
  if w in d:d[w]+=1
  else:d[w]=1
  if w:
   if l:
    t=l+' '+w
    if t in p:p[t]+=1
    else:p[t]=1
   n=z(p,w+' ')
   if n:g=n[-1];l=w;w='';r=g[0][len(l)+1]
   else:l=w;w='';r='t'
 else:
  w=w+c;m=z(p,w)
  if m:
   g=m[-1]
   if g[0]==w:
    if s>12:s=0;r='\n'
   else:r=g[0][len(w)]
 return r

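I attempted this independently, but ended up with what is effectively an inferior version of Michael Homer's answer. I hope this doesn't render my answer completely obsolete.

Over time, this builds a dictionary of words (crudely defined as strings terminated by a space or \n, case-sensitive and including punctuation). It then searches this dictionary for words beginning with what it knows so far of the current word, sorts the resulting list by frequency of occurrence (slowly), and guesses that the next character is the next character of the most common matching word. If we already have the most common matching word, or no matching word exists any more, it returns a space.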
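
As a rough illustration of the word-completion step just described, here is a minimal sketch. It follows the prose rather than the golfed code, and `word_counts` and `complete` are hypothetical names introduced only for this example.

```python
from collections import Counter

word_counts = Counter()  # how often each complete word has been seen so far

def complete(prefix):
    """Guess the character after `prefix`, or a space if nothing longer matches."""
    matches = [(n, w) for w, n in word_counts.items() if w.startswith(prefix)]
    if not matches:
        return " "  # no known word starts with this prefix
    best = max(matches)[1]  # most frequent match (ties: lexicographically last)
    # If the best match is the prefix itself, assume the word is finished.
    return best[len(prefix)] if len(best) > len(prefix) else " "

word_counts.update("the whale the white whale the".split())
print(complete("w"), complete("whi"), repr(complete("whale")))  # h t ' '
```

Scanning every dictionary entry on each call is what makes this approach slow, which matches the author's remark about sorting "slowly".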

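It also builds up a disgustingly inefficient dictionary of word pairs. On hitting a word boundary, it guesses that the next character is the first letter of the second word in the most common matching word pair, or t if there is no match. It isn't very clever, though. Following Moby, the program correctly guesses that the next character is D, but then it forgets all about the context and usually ends up calling the whale "Moby Duck" (because the word "Dutch" seems to be more frequent in the first half of the text). It would be easy to fix this by prioritising word pairs over individual words, but I expect the gain would be marginal (since it is usually correct from the third character on, and the word pairs aren't that helpful in the first place).

I could tune this to better match the provided text, but I don't think manually tuning the algorithm based on prior knowledge of the input is really in the spirit of the game. So, other than choosing t as the fallback character after a space (and I probably shouldn't have done that either), I avoided it. I ignored the known line length of the input file and instead inserted a \n after every 13 spaces. This is almost certainly a very poor match; the main intention was to keep the line length reasonable rather than to match the input.

The code isn't exactly fast (about 2 hours on my machine), but overall it gets about half the characters right (49%). I expect the score would be marginally better if run on whale2.txt, but I haven't done that.

The start of the output looks like this: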

T t t t t t t t t L t t t tsher t t t ty t to t t te t t t t t tem t t t d b ta tnL te t tv tath a to tr t tl t l toe g to tf ahe gi te we th austitam ofd laammars, tn te to t tis nf tim oic t t th tn cindkth ae tf t d bh ao toe tr ai tat tnLiat tn to ay to tn hf to tex tfr toe tn toe kex te tia t l t l ti toe ke tf hhe kirl tou tu the tiach an taw th t t Wh tc t d t te the tnd tn tate tl te tf teu tl tn oan. HeAL. tn nn tf r t-H ta t WhALE.... S tn nort ts tlom rhe ka tnd Dr t t tALL th teuli th tis t-H taCTIONARY " t r t o t a t A t . t eALT t I t HLW t I t e t w t AO t t t AOLE, I T t t t ALE t w t t R t EK t T t R tSupplied by wnLw t t iit ty cce thet whe to tal ty tnd

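But by the end, it looks a bit more like... something. My favourite passage from near the end of the book,

and since neither can be mine, let me then tow to pieces, while still chasing thee, though tied to thee, thou damned whale! Thus, I give up the spear!"

comes out as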

I dhrnery oyay ooom the woc Ihal iiw chshtego -tit my ti ddohe bidmer Hh, ho sheee opdeprendera toetis of tygd ahesgapdo tnep tnd tf y arosl tinl ahesgaorsltoak, and tidlhty ai p, cnd telas taep toip syst ho she tachlhe tnd tith ut ay Rnet hor bf toom the wist tord oaeve of ty nsst toip recked,hontain th, tingly toadh af tingly tike 'h, tot a hoet ty oh ost sreat ess iik in ty oh ost sremf Hew hiw"aoom tnl tou oolthert tyand . taoneoo sot an ao syad tytlows of ty oii e oor hoi tike and th ohes if oaped uoueid tf ty ooadh Ih ards the t houle lhesganl p tyt tpdomsuera tiile ah the wist t hrenelidtith the Ioom ti p s di dd o hoinbtn the Ior tid toie o hoetefy oist tyoakh on the Opr tnl toufin and tnl ti dd .mh tf ooueon gaor tnd todce tovther lon by tygd ait my the th aih tapce ciice toill moaneng she thesgh thmd th the thesgaoy d jiile YhE t hrve tpothe woerk "

That would have made The Wrath of Khan a lot more confusing. And "lonely" → "tingly" is a particularly satisfying substitution.

Edit: Saved one byte by deleting an extraneous space

Scoring

#! /usr/bin/env python3
import sys
import os
import mobydick as moby


def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

total = 0
right = 0
real_char = ''
guess_char = 'T'
print('T',end='')
with open("whale.txt") as whale:
    while True:
        if real_char == guess_char:
            right += 1
        real_char = whale.read(1)
        if not real_char:
            eprint(str(right) + " / " + str(total) + " (" +
                str(right/total*100) + "%)")
            size = os.path.getsize("mobydick.py")
            eprint("Source size: " + str(size) + "B")
            eprint("Score: " + str(2*size + total - right))
            sys.exit(0)
        guess_char = moby.f(real_char)
        print(guess_char,end='')
        total += 1

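This runs the program over the text of Moby Dick, writes the "predicted" text to stdout, and abuses stderr to write the score. I'd recommend redirecting the output to a file.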

— georgewatson

2

Welcome to PPCG!

— Martin Ender

1

Would lambda i:i[1] be cheaper than dealing with operator?

— Draconis '18

@Draconis Almost certainly.

— georgewatson

9

C++, 2·62829 + 318786 = 444444

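To run this program, you will need this file here, which must be named C.

The program uses the same combination of Markov models as in our earlier answer. As before, this combination is essentially the model from this answer by user2699, with a few small modifications.

Since this answer uses the exact same model as before, the improvement is a better information-theoretic mechanism than the "rewind" mechanism described earlier. This allows it to make fewer errors while also having a smaller combined length. The program itself isn't golfed very much, because it is not the major contributor to the score.

The program is 2167 bytes long (including all the tabs for indentation and lots of other unnecessary characters, but excluding the test code), and the binary file C is 60661 bytes long, so under the rules for scoring multiple files we score L as 2167 + 60661 + 1 = 62829.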
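
For readers checking the headline number, the arithmetic can be spelled out as below. The variable names are only illustrative; the figures are the ones quoted in this answer, and the formula 2*L + E is the one from the challenge.

```python
program_bytes = 2167     # source size, excluding the test code
data_bytes = 60661       # size of the binary file C
extra_file_penalty = 1   # the "+ 1" in the sum quoted above for the second file
L = program_bytes + data_bytes + extra_file_penalty
E = 318786               # wrong guesses, from the answer's headline
score = 2 * L + E
print(L, score)  # 62829 444444
```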

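The program takes approximately 8 minutes to run on an m5.4xlarge instance on Amazon EC2, and uses a little over 16 GB of memory. (This excessive memory usage isn't necessary; we just didn't optimize that either.)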

#include <map>
#include <queue>
#include <vector>
using namespace std;

FILE *in;
unsigned int a, b = -1, c, d;
string s, t;
double l, h = 1, x[128][129], y[129], m[128];
map<string, int> N;
map<string, double[128]> M;
int G, S;

int f(int C)
{
    int i, j;
    for (i = 0; i <= 20 && i <= S; i++) {
        t = s.substr(S - i);
        N[t]++;
        M[t][C]++;
    }
    s += C;
    S++;

    for (i = 0; i < 128; i++)
        m[i] = 0;

    int E = 0;
    for (i = 20; i >= 0; i--) {
        if (i > S)
            continue;
        t = s.substr(S - i);
        if (i <= 2 && E >= 100 && (i == 0 || t[0] != ' '))
            break;
        if (M.find(t) == M.end())
            continue;
        for (j = 0; j < 128; j++) {
            m[j] += M[t][j] / N[t];
        }
        E += N[t];
    }

    double r = 0;
    for (i = 0; i < 128; i++)
        r += m[i];
    for (i = 0; i < 128; i++)
        m[i] = m[i] / r;

    if (!in) {
        in = fopen("C", "r");
        for (i = 0; i < 4; i++)
            c = c << 8 | getc(in);
    } else {
        l = x[C][G]
            + (l - y[G]) * (x[C][G + 1] - x[C][G]) / (y[G + 1] - y[G]);
        h = x[C][G]
            + (h - y[G]) * (x[C][G + 1] - x[C][G]) / (y[G + 1] - y[G]);
    }

    priority_queue<pair<double, int>> q;
    for (i = 0; i < 128; i++) {
        q.push(make_pair(m[i], i));
    }

    int n = 0;
    double s = 0;
    while (q.size()) {
        i = q.top().second;
        q.pop();
        if (m[i] < s / (n + 15))
            break;
        s += m[i];
        n++;
    }

    r = 0;
    for (i = 0; i < 128; i++) {
        y[i + 1] = m[i] - s / (n + 15);
        if (y[i + 1] < 0)
            y[i + 1] = 0;
        r += y[i + 1];
    }
    for (i = 0; i < 128; i++)
        y[i + 1] /= r;

    for (i = 0; i < 128; i++) {
        r = 0;
        for (j = 0; j < 128; j++) {
            x[i][j + 1] = y[j + 1];
            if (i == j)
                x[i][j + 1] *= 16;
            r += x[i][j + 1];
        }
        for (j = 0; j < 128; j++)
            x[i][j + 1] /= r;
        x[i][0] = 0;
        for (j = 0; j < 128; j++)
            x[i][j + 1] += x[i][j];
    }

    y[0] = 0;
    for (i = 0; i < 128; i++)
        y[i + 1] += y[i];

    for (G = 0; G < 128; G++) {
        if (y[G + 1] <= l)
            continue;
        if (y[G + 1] < h) {
            d = a + (b - a) * ((h - y[G + 1]) / (h - l));
            if (c <= d) {
                b = d;
                l = y[G + 1];
            } else {
                a = d + 1;
                h = y[G + 1];
            }
            while ((a ^ b) < (1 << 24)) {
                a = a << 8;
                b = b << 8 | 255;
                c = c << 8 | getc(in);
            }
        }
        if (h <= y[G + 1])
            return G;
    }
}
// End submission here.  Test code follows.
int main()
{
    FILE *moby = fopen("whale2.txt", "r");

    int E = 0;
    int c = getc(moby);
    while (c != EOF) {
        int guess = f(c);
        c = getc(moby);
        if (c != guess)
            E++;
    }

    printf("E=\t%d\n", E);

    return 0;
}

— A. Rex

7

Python 3, 526640

274 bytes, 526092 errors (using whale2.txt). This can definitely be improved further, but it has reached the "good enough to post" stage.

from collections import*
D=defaultdict
M=[D(lambda:D(int))for i in range(10)]
X=""
def f(c):
 global X;G=D(int)
 for L in range(10):
  M[L][X[:L]][c]+=1;N=M[L][(c+X)[:L]]
  if N:g=max(N,key=lambda k:(N[k],k));G[g]+=N[g]*L**8
 X=(c+X)[:10]
 return max(G,key=lambda k:(G[k],k))

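The idea is to store the frequencies of all runs of 2, 3, 4, ..., 10 characters. For each of these lengths L, we check whether the most recent L-1 characters match a stored pattern; if so, our guess g_L is the most frequent next character following that pattern. We collect up to nine guesses in this way. To decide which guess to use, we weight the frequency of each pattern by its length to the 8th power. The guess with the largest sum of weighted frequencies is chosen. If no patterns match, we guess a space.

(The maximum pattern length and the weighting exponent were chosen by trial and error to give the fewest incorrect guesses.)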
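
To get a feel for how strongly the 8th power favours long patterns, here is a small worked example. The counts are made up for illustration and `weight` is a hypothetical helper, not part of the submission.

```python
def weight(frequency, pattern_len, exponent=8):
    """Vote strength of one pattern: its count times length**exponent."""
    return frequency * pattern_len ** exponent

short_vote = weight(50, 3)  # hypothetical: a 3-character pattern seen 50 times
long_vote = weight(1, 6)    # hypothetical: a 6-character pattern seen once
print(short_vote, long_vote)  # 328050 1679616
```

So a single occurrence of a pattern twice as long outvotes fifty occurrences of the shorter one, which is the "heavily advantaged" behaviour the answer is after.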

Here's my ungolfed work-in-progress version:

from collections import defaultdict

PATTERN_MAX_LEN = 10
prev_chars = ""
patterns = [defaultdict(lambda:defaultdict(int))
            for i in range(PATTERN_MAX_LEN)]
# A pattern dictionary has entries like {" wh": {"i": 5, "a": 9}}

def next_char(c):
    global prev_chars
    guesses = defaultdict(int)
    for pattern_len in range(PATTERN_MAX_LEN):
        # Update patterns dictionary based on pattern and c
        pattern = prev_chars[:pattern_len]
        patterns[pattern_len][pattern][c] += 1
        # Make a guess at the next letter based on pattern (including c)
        pattern = (c + prev_chars)[:pattern_len]
        if pattern in patterns[pattern_len]:
            potential_next_chars = patterns[pattern_len][pattern]
            guess = max(potential_next_chars,
                        key=lambda k:(potential_next_chars[k], k))
            frequency = potential_next_chars[guess]
            # Exact formula TBD--long patterns need to be heavily
            # advantaged, but not too heavily
            weight = frequency * pattern_len ** 8
            guesses[guess] += weight
    # Update prev_chars with the current character
    prev_chars = (c + prev_chars)[:PATTERN_MAX_LEN]
    # Return the highest-weighted guess
    return max(guesses, key=lambda k:(guesses[k], k))

And the test harness:

from textPredictorGolfed import f as next_char
# OR:
# from textPredictor import next_char

total = 0
correct = 0
incorrect = 0

with open("whale2.txt") as file:
    character = file.read(1)
    while character != "":
        guess = next_char(character)
        character = file.read(1)
        if guess == character:
            correct += 1
        else:
            incorrect += 1
        total += 1

print("Errors:", incorrect, "({:.2f}%)".format(100 * incorrect / total))

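Here's some sample output from near the beginning of the text. Already we begin to see the ability to finish common words after seeing their first letter (in, to, and, by; also, apparently, school).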

 you take in hand to school others, and to teach them by what name a whale-fish
xU wshhlnrwn cindkgo dooool)tfhe -; wnd bo so rhoaoe ioy aienisotmhwnqiatl t n

Near the end, there are still plenty of mistakes, but also plenty of very good sequences (shmage seashawks, for instance).

savage sea-hawks sailed with sheathed beaks. On the second day, a sail drew near
shmage seashawks wtidod oith tua dh   tyfr.  Tn the shaond tay, wnltiloloaa niar

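It's interesting to look at some of the mistakes and guess what word the algorithm "expected". For instance, after sail, the program both times predicts o (for sailor, I presume). Or again, after ", a" it expects n, possibly because of the common occurrence of ", and".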

Changelog:

274 * 2 + 526092 = 526640 Golfed the algorithm, at the cost of a few extra errors
306 * 2 + 526089 = 526701 The original version

— DLosc

6

Python 2, score: 2*(407+56574) + 562262 = 676224

Searches for words matching the previous characters in a list of most of the words used in the text, sorted by number of occurrences.

Code:

import zlib
f=open("d","rb")
l=zlib.decompress(f.read()).split()
w=""
def f(c):
 global w
 if c.isalpha():
  w+=c
  try:n=next(x for x in l if x.startswith(w))
  except StopIteration:return" "
  if len(n)>len(w):
   return list(n)[len(w)]
  return" "
 w="";
 n=ord(c)
 if n>31:
  return list("t \n 2  sS \n  -  08........       huaoRooe oioaoheu thpih eEA \n   neo    enueee neue hteht e")[n-32]
 return"\n"

Data: https://www.dropbox.com/s/etmzi6i26lso8xj/d?dl=0
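
The answer does not show how the data file d was produced. A file in the format the code reads (a zlib-compressed, whitespace-separated word list with the most frequent words first) could plausibly be built along these lines. This is an assumption, not the author's generator; `build_word_list`, `pack` and the `min_count` cutoff are hypothetical.

```python
import re
import zlib
from collections import Counter

def build_word_list(text, min_count=2):
    """Return words seen at least `min_count` times, most frequent first."""
    counts = Counter(re.findall(r"[A-Za-z]+", text))
    # Sort by descending count; break ties alphabetically to stay deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, n in ranked if n >= min_count]

def pack(words):
    """Compress the space-separated word list into the bytes of the data file."""
    return zlib.compress(" ".join(words).encode("ascii"), 9)

words = build_word_list("the whale and the whale and the sea")
print(words)  # ['the', 'and', 'whale']
```

Because the submission takes the first list entry that starts with the current prefix, ordering the list by frequency is what makes that first match the most common word.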

Test suite:

incorrect = 0

with open("whale2.txt") as file:
    p_ch = ch = file.read(1)
    while True:
        ch = file.read(1)
        if not ch:
            break
        f_ch = f(p_ch)
        if f_ch != ch:
            incorrect += 1
        p_ch = ch

print incorrect

Edit: Using whale2.txt gives a better score.

— Steadybox

5

C++ (GCC), 725×2 + 527076 = 528526

Yet another prefix-frequency submission. It runs on whale2.txt and gets a score similar to (slightly worse than) the others.

#import<bits/stdc++.h>
char*T="\n !\"$&'()*,-.0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz";
int I[124];std::string P(7,0);struct D{int V=0;std::array<int,81>X{{0}};};std::vector<D>L(1);D
init(){for(int i=81;i--;)I[T[i]]=i;}int
f(int c){P=P.substr(1)+(char)I[c];for(int i=7;i--;){int D=0;for(char
c:P.substr(i)){if(!L[D].X[c]){L[D].X[c]=L.size();L.push_back({});}D=L[D].X[c];}++L[D].V;}std::vector<int>C(81);for(int
i=81;i--;)C[i]=i;for(int
i=0;i<7;++i){int D=0;for(char c:P.substr(i)){D=L[D].X[c];if(!D)break;}if(!D)continue;int M=0;for(int
x:C)M=std::max(M,L[L[D].X[x]].V);C.erase(std::remove_if(C.begin(),C.end(),[&](int
x){return L[L[D].X[x]].V!=M;}),C.end());if(C.size()<2)break;}return T[C[0]];}

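This one greedily finds the longest string that starts with a suffix of the history, and if there are multiple candidates, breaks ties using shorter strings.

For example: if the last 7 characters are abcdefgh, and the strings abcdefghi and abcdefghj appear with the largest frequency among all strings of the form abcdefgh*, the output will be either i or j, with the tie broken using shorter suffixes (bcdefgh, cdefgh, ...).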

For unknown reasons, with anything more than 7 my computer doesn't have enough RAM to run it. Even with 7, I need to close all web browsers to run it.

Test code:

int main() {
    init(); 

    std::cout << "Start ---\n";
    std::time_t start = std::clock();

    std::ifstream file {"whale2.txt"};
    // std::ofstream file_guess {"whale_guess.txt"};
    std::ofstream file_diff {"whale_diff.txt"};
    if (!file.is_open()) {
        std::cout << "File doesn't exist\n";
        return 0;
    }

    char p_ch, ch;
    file >> std::noskipws >> p_ch;
    int incorrect = 0, total = 0;
    // file_diff << p_ch;

    int constexpr line_len = 80;
    std::string correct, guess_diff;
    correct += p_ch;
    guess_diff += '~';

    while (file >> ch) {
        char guess = f(p_ch);

        // file_guess << guess;
/*        if (guess != ch) {
            if (ch == '\n') {
                file_diff << "$";
            } else if (ch == ' ') {
                file_diff << '_';
            } else {
                file_diff << '~';
            }
        } else {
            file_diff << ch;
        }*/
        incorrect += (guess != ch);
        total += 1;
        p_ch = ch;

        if (guess == '\n') guess = '/';
        if (ch == '\n') ch = '/';
        correct += ch; guess_diff += (ch == guess ? ch == ' ' ? ' ' : '~' : guess);
        if (correct.length() == line_len) {
            file_diff << guess_diff << '\n' << correct << "\n\n";
            guess_diff.clear();
            correct.clear();
        }
    }

    file_diff << guess_diff << '\n' << correct << "\n\n";

    file.close();
    file_diff.close();

    std::cout << (std::clock() - start) 
    / double(CLOCKS_PER_SEC) << " seconds, "
    "score = " << incorrect << " / " << total << '\n';
}

Ungolfed:

size_t constexpr N = 7;

int constexpr NCHAR = 81;

std::array<int, NCHAR> const charset = {{
'\n', ' ', '!', '"', '$', '&', '\'', '(', ')', '*', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'
}}; // this actually contains a lot of information, may want to golf it
// (may take the idea of using AndersKaseorg's algorithm, late acceptance hill climbing)

std::array<int, 'z' + 1> const char_index = [](){
    std::array<int, 'z' + 1> char_index;
    for (size_t i = NCHAR; i --> 0;) 
        char_index[charset[i]] = i;
    return char_index;
}(); // IIFE ?

std::string past (N, 0); 
// modifying this may improve the score by a few units

struct node {
    int value = 0;
    std::array<size_t, NCHAR> child_index {{0}};
};
std::vector<node> node_pool (1); // root

int f(int c) {
    past = past.substr(1) + (char) char_index[c];

    for (size_t i = 0; i < N; ++i) {
        // add past.substr(i) to the string
        size_t node = 0;
        for (char c : past.substr(i)) {
            if (node_pool[node].child_index[c] == 0) {
                node_pool[node].child_index[c] = node_pool.size();
                node_pool.emplace_back();
            }
            node = node_pool[node].child_index[c];
        }
        assert(node != 0); // the substring is non-empty
        ++node_pool[node].value;
    }

    std::vector<size_t> candidates (NCHAR);
    std::iota(candidates.begin(), candidates.end(), 0);
    for (size_t i = 0; i < N; ++i) {
        size_t node = 0;
        for (char c : past.substr(i)) {
            node = node_pool[node].child_index[c];
            if (node == 0) break;
        }
        if (node == 0) continue;

        assert(node_pool[0].value == 0);
        int max_value = 0;
        for (size_t x : candidates)
            max_value = std::max(max_value, node_pool[node_pool[node].child_index[x]].value);

        candidates.erase(
            std::remove_if(candidates.begin(), candidates.end(), [&](size_t x){
                return node_pool[node_pool[node].child_index[x]].value != max_value;
            }), candidates.end()
        );

        if (candidates.size() == 1) 
            break;
    }

    return charset[candidates[0]];
}

Example output:

~ ~s  ta~ hard ts tt~~~~~~~ ~doam ~~ ar~ ~ i~~~ ~~~ ~he~~~~,a~ t~~~~ t~ ho~si~  
n--as his wont at intervals--stepped forth from the scuttle in which he leaned, 

~~~ thr~ ~~ t~~ crp~~~~~~~~ a~ wap~~~~~ a~eo~~ h~~ o~~ s~~~ or~~y~ ~  boog~e~~ t
and went to his pivot-hole, he suddenly thrust out his face fiercely, snuffing u

~ a~~ ~h~ ~n~ onitn~oi~~~~~~ ~~a~ ~ cewsoat~  a~ tae~~~~ ~e~~t~~ te~~ ouc~s~i~~ 
p the sea air as a sagacious ship's dog will, in drawing nigh to some barbarous 

ct as I~ iisk~~~~ ~~e~ tls~~~~ i~~~ ~~ soe~e Ae ~ ~~e~ tar~~~~~ trd~  ot ~ h~~~ 
isle. He declared that a whale must be near. Soon that peculiar odor, sometimes

This one is near the end of the text. Most long words are predicted quite accurately (intervals, pivot-hole, distance).

 au t  tf weu~i~ aor~ mre~g~~~ m~t~~ ~~~  ~"NC~X~t~ti~  ~~n~ SNsh A FNECnSERTR O
 on as it rolled five thousand years ago./////Epilogue//"AND I ONLY AM ESCAPED A

NL~~,S~ ~HR~ yO~ -/s~n "~A~~ laeu~ta Vew~, S~e s~~  s~ ~ ain~ t~d ~t~ oirept~~ ~
LONE TO TELL THEE" Job.//The drama's done. Why then here does any one step forth

Upper case doesn't seem to work well.

— user202729

The trie seems to consume more memory than I expected...

— user202729 '18

... and it is also harder to implement.

— user202729

4

Python 2, 756837

Uses something that might be Markov chains?

import zlib
a=eval(zlib.decompress('x\x9cM\x9cis\xda\xcc\xd2\x86\xff\x8a2\xf5\xd4\x81\xb8,\x977l\'\xf9\x90\x12 \x02f\x11G\x02c||*%@,a\x11a1\xe0S\xef\x7f\x7fC\x13\xf75\xdf\xda\xaaa4\xd3\xcb\xddw\xf7\x8c\xfc\xbf\xcc\x8f\xd7E\xe6\xab\x93if\xce\x9d\xcc\x8f\xefG\xd1\x11\xf1\x1b\xa2At\x8e\xa2\'\xe2\xc5Q\xfc,\xa2{\x14+"\x9e3\xf63b\x87\x9f\xb5\x8fb$b\xeb(\x96E\x8c\x18\x1b2\xb6{\x14/D\xfcq\x14\x03\x11}\xc6zG\xb1.b\xc0\xd3\x06\xcb\xa9\xf1\xb3\xcaQl\x88X>\x8a-\x11\xb7G1\x11q\x85\x98\x1c\xc5\x95\x88\xf1Q\xec\x89\x98\x1e\xc5\x81\x88\xa2\xb3X\xc4\x19\xe2\xe4(\xbe\x898\xd6\xc9F\xa8\xe4E\x16\x19\x8a\xc8r^|U\xc9\x8b\xc7\xd8\xfcQ\xf4\x8f\xe2\xbf\x1c\x06\xbc\xa8v6\xef\xba\xb2\x17V\xf6\x92\xe8r6\x07\x9d\xcc\x95EN\xe4\xe9FW\xb6\xd9\xea6M\xa2K\xdf\xact\x86\xf9\xc976Gy\xf2\xce\xef\x96G1\x15q\xf1\xf1\xd4\xcc3\xe6\x8f\xb8\x96\xdf}\xd27\xcf\x1d\x9da\x8e\x1f\xcd\xc5c\\\x11Q\xcf\xfc\x02Q\x9c\xe7\\\xd6\xbe;\x8acY\xe5\x8c\x17\xcfu9F\xc4\x83\xfc\x0c\x076\x0b\x1d;\xc7\x97\xe7_U\x9c\xacT\xfc\xc2\x1a\xbe\xb0\x06\x83\r7b\xd9\x85<\x9d\xe8\x86\xbe|Q\xff\xfc\xf2\xa0\xe2d\xa7?\xfbr\xc5\xbc\x97\x8c\xbd\xd1\xbd}\xb9f@\x8e\x01\xb7\x88\xf7\x88w*\xce\x13v1\xc1ZCv\x1c\xebz\xe7=]\xce\x1c\x9d\xcdg\xe8,U/\x98/\x18`\xed\xf8\x8d\xa7\xe21\'\x1bo\xd4,sk\x80\xb8\xc6L\xc45Oq\xa9M\xac\x9e8\xc7?k\xb8\x9fY\xe9\x80\x9a\x8c\x9d\x8a\x98\xea\xde\x8c\xcc\xbb\x94\xa7\x13\x06\xc8\xca\xfa"\x1e\x98\xa1\xa4\xe1R\xfb\xa1\xb1W+\xf2b\xc0\xa4\x96W\xac\xa8\x15\x10=\x8d\xd3ZC#\xb2F 
\xd7j\xccP\xd78\xadU\x8fbWD"\xbd\xd6Q\xb7\xaf\xb5\x98\x0cH\xac\x85\xfc\x0cH\xac5\x15(k\xdd\x8f\xa7\xa6&\xf1v\xfa\x19\x00Q\xc3\x7fkxuM\xe2\xad(\xa2D\xd6\xabX\xb6&\xfeyy\x14\x1d\xdc\xa4v\x8azY\xdbU\xa4P\xf9\xc4\xcc?\x0fj\x8d\x9f\x135\xf8O\xde\xf7\xd3Q?Ym\xf4\xe9\n\xefY\xe12\xab\x9d:\xc7\n`Y\xfd>\x8a[\x11\xf1\x88\xd5\x9a\xc9\xf6\xcc\x80#\xad\xde\xd5+W\x03\x9e\x12/\xab!\xf3\x8e\x98\x81xY\xf5\x18\xd0g2\xe2e5g\xb2\x05+\x13\x07\x9d\x8b8fCD\xd1j\xca\xcf,X]\x81X+\xb0i\xa5\x88\xf5\'\x1c\x14VW`\xe9\n\x84]\x19u\xaa\x15\x16X\x81\xb0+\x0c\xb7"\'\xbf.N\xab0\xa7?n\xd5\x13^\x179\xb5\xf9\xebB<\xe4\xe1$_[c\x04\xc3\x06\'\x99W\xbd.\xb2\x1ap\xaf\x8b\xb3\x8fy\xcc\x9fW\x19\xe6t\xacE\x18\x1d\xffoR\xf1\xeb\xa2k\xc9/\x96\xfc\x1fk\xfa\x96Z\xe7u\xd1VLx]<\xa9Q^\x17\x1dkL\xd3\x9a\xe7\xdfj\xe4\xd7Eh\x8d\x8fT\xc3\xaf\x8b\x9a5\xben\xc9\ru\xd2\xd7E\xa0\xf6}]\x94\xad1\x15k\x8b\x8f\xd6\xf8\xaa\xf5\xae\xa25\xde\xb7\xe6)Y\xe3\x7fX\xb2g\x8d\xc9[\xeb/(:\xfc[\xd4P9=>X?}\xb7\xe4\x8d\xa5\x92\xad5\xe5\x9b\xb5\x9c\x9d5Fbru\x92\x7f[\xaf]Y\xe3\xd7\x96\xdaf\xd6\x16\xe7\x1a\t\xaf\x8b\x85\xb5\x06\t\x96\xe1I\x1e[\xf3L\xac\xf5\xfc\xb2~;\xb5\x9e\x0f\xac\xf1\x12\xd7\xfb\x93<\xb4\xe6\x1fYk\x8e\xad\xdf\xf6\xac\xdf\xf6u\xfc\x80\x00\x19\x10A\x03\xdcz\xa0ac\x06\x84\xe3\x00>3 2\x07D\xe6\x80\xd8\x1e\x10\xdb\x03\xd8\xc8\xc0\x02\x82\x01\xb9w \xea\xd9\x89\x08\xee\x0c\xe6\xaa\xd8\x01\xba\x19L\xf9\x19\x9a\x1c\xa0\xc8\x01\x807\x00\xf0\x06hq\x00\xd9\x1d\xf4\xd0\x89\xa5\x9e\x985\x80\xb4\x837\xd6\x00\x82\x0f\xf0\xae\x01\x19y\x80\xaf\x0c@\xf0\xc1\xf2cCf\x87Vw\xe8o\x87Vw\x98h\x87]vXk\x07a\xdc\xa1\xf6\x1d\xba\xdea\x81K\x012aR\x977\x88\x97\no\x97W<\x85u]\n\x17;e\xceK(\xda%\xc4\xed\x12\x16x\t7\xdcYV\xbe\x94-I\xba\xbcd\xa3\x97\xec\xee\xf2\\W\xb1\xc3r;l\xb4\xc3r\xbb\xbe\xea}\xd7C\x14s\x9dt\t\xb5\xdb-\xd0\x04>\xb5#)\xed\xe0\xb5;\x12\xd8\x0e\x84\xd8Q8\xec0\xe2\x8e\xe4\xbc[2\x00?\xb9\xc4#\nl\xb3\x80\xe5\n\xa2\x12![\x05\x81G!\x1e\x05AP)\xed\n\x02\xac\x02\xfa\x85\x80\xa75\xc5\xba\x02t\xad  
)\xc5l\x01jW\xe8"\x86\xbcB\xd0RrR\xa1\xc5+\x08\x9d\xc2X\xd5W \xbd\x17f\xba\xcd\x82\xa8Z\xd2N!Q\xf5\x15\xdeU}\x85\x83\xc6@a\xa5\x01U\x10\xa5\x9e\xd8\xee@\x9fN 4\x06,3#\xd5\xaf\x01\xc9\x0c$\xc5\x10\xa8\x13\xe0y\xb2\xd4\x1dO0\x96I\xd5\x16\x93\xadnh\x82\x85\xcc/f \x1f\x18\x06L\xc6\xba\x9c\t\xc8c\xc8\x17\x13j\x8c\xc9L}}\x92\xea\xd2\'\xe2\x88#\x11\xd9\xd0\x04\xaa5\xe9\xf1\xb3D]\xd9\x90\xce&#\xc6\x0e\xd9[\x11\x9d\xf9\xe8\x97dj\xc8\xa5\xc6\xd3\x080dRSP\xbb\x99\x1ac\xeb<%\xf3\x9b\x00\x9d\x91\xf7\ri\xdf<2/I\xdf\xc0Y\x0c\x94\xc5<1\x03\x84\xc5\xc0W\x0ct\xc5\x84,\x07\xb2b\xe0KO\xb2\xb7\x9ah\x07\xf43\xaf\x19uv\x039\x7f\x12MI\x1d\xf3$k/\xc8\x80\x0b\xc5.s\x06\xe6=\xc9\x9e\xa58\x99\xb8\xea\xd7\x13"yr\x81\xed\x01\xb7\x89\xbcN\xb2\xd9\xc4\xe8l\x7f\xcah\x85|\xc3:\x9fp\x89\'0\xefi\xa2\xa29\x81\xe9\xdf\x15\xa5j\xc7\xc9\xe9\xb9\xbc&Gc)\x87\xeb\xe6@\xe4\x1c8\x9d\xcb)\xde\xe6\xc0\xf4\x1cew\x8e\x04\x90#-\xe4.u\xc99RHN\x12\x8b$\xa1\x1cj\xc9\x01{9\xf8w\x19L*\xd3\xf2*S\xf5\x95\x9fxJ\xff\xac\xdcb\x00uc\xb9\x82\xd8`\x00Uj\xb9\xce\x0c@d\x19\x88,\x1f\xd4ve\xca\xb4\xf2\x04\x11RR\x8e\xd5\x1ce*\xab\xb2m\x992&-\x7fV\xfd\x94/\xac\x11(\xa8\xec\xaac\x95\xb5\x92\xfd\x13VZ\xdf\xfeG\xb4\xd2\x16Q;d&\xf3\xcd\xe8l\xaf\x19\xcb\xb52\xce\x87k\x99\x8c{\x14]\x11\xcf\xcd\xc7\x0b\x17$8\x8br.\x00\xbf\x05yqA\xb6\xb4\xe8\xec\x02\xb6v"\xb3\x12\x86\'\xaey\x12\xa1R\'\xa6y\x1aKM\xba@s\'\xea*\x00qb\xae\xa7\xa7{\x9e\x92N\x17$\x97/\x04\x96E\xd2-\x8enQ\xf4\x05I`AA\xbe 
\tX\xf4\x7f\xa1t\xcedv\xe6o\xf8\x98\xcc\x9b\xf9;\xc0d\xb6\xe6\xef6Mf\xf3\xa1T\x93Y#\xae\x18\xfb\xdb\xfc]\x8e\xc9,\x8d\xce{`\xc0\x88\xa7C\xf3Wg&\x93\x98\xbf+3\x7fx\xb6\xce\xdb?\x8a3\x11{\xcc\x1b36\xe5\xe9\xe2\x8fh2\xe6(\xce\x99a\xc6\x0c\x13\xf3\xd7\xf2&3f9\x1dv\xfc\xc4\xd3\x16O#\xdc\x08&\xba\xb8\xc0-\x9bFm\x01\x81]\x00\x88\x0b\xc3\xd8\xae\xbe\xe2T!\x9f\x94\xea\x1f\xc5\xbd\x88E\xb4S@\xcc\xb3M\xcf\xa8{~g\xde\x80\xf56\xf8Y\xfdc\xac\xc9\xd4\xcc_\xe72\x99\n\xda)\x7f\x8c\xcd|eo_\x1du\xb9\xaf\xf4\x1a\xbeZ\xe1\xfe\'Gj\xac\xd6\x8f\x1b\x15\xbdg\xea\x8e\xe6\x9c:\xd3\xd5\t\xfc:\xc8X\x07%\xea\xf0\xf7\xfa\xe9%\x1d\x91\xe9l\xd7\xc9\x12u\x89>\xe9\x82\xd7\x01\xab:\xb5G}\xc3\xc4+D"\xaa\x0e\x08\xd6i\xf6\xd5\x0b\x9a\x0e\xeb4\x06\xeb\x02\xa3\xc2\x1e\xeb5\x05\xad:8[o(\xce\xd6+\xec\xbe\xcd\xcf\x9a\ne\xf5\x88\xe5\x90\x0c\xce_9[X[\x95\xc3\x1aD]S\xca\xac\xd1\xd59f:G\xdb\xe7g\x0c \xf9\x9c\xd3\xeeYgu\x99k\xcc\xb1f\x865\xf6ZS\xf1\xae\xf1\xe7\xb5z\xb9Yg48\xce\x1f\xf4\x15\xdfu2\xf3\x9d\x01\xdfA\xec\xccwG\xcd\xbc\xc62k@kM\x07y\r\xc0\xad\xa98\xd6t\xdd\xd7\x18\x7f\r\xd6\xad\xa1\xab\xeb_\x8a\xcdk\xe0\x7f\r\xb5]\xc3\xf6\xd7\x00\xfd\x1a\xf8_\x93\x14\xd6}\x85\xdeu\x8f\xa7\xb4\xb9\xd7#\xd6\x0b\xd0\xaf\x81\xff55@H\xb9\x15&\xba\x86P&\x93f[\xc8\xca\xc2\xb1\xbe-\x94]\x08\xa7\x0e\xe1\x07!\xdd\xa0\xf0\tQ\xb8\x84\x90\xa3\xb0\xa9\x8e\x1dBAB(H\x88[\x86\xf4\xccC\x02&\xfc\xa1\x8e\x1dz\x1a0a^}<\xa49\x15R\xb0\x85\xb0\x91P\x02F\x90#\xa4\xb8\x0b\xe9\x99\x87\xd4\x84!\xce\x1e\x12\x02!\xbd\xd2\x10\x18\n\xc5\xa3\xaeD\xc4\x81C\xf1\xc4\xbc\x888{\x08\xf6\x84\xa7\x88\x93pH(e\x12J\x99$Us&\xd4\xd4\t\x0c5\xa1\r\x93L\x15\x91\x12|.I\xd4\xc8\t| 
!\xf3\'\x94\x7f\tT+\xe9+\x16$\x90\x8b\x84pI\xf6\x0c\xe0\xb0.\x81\xcd%DC\xb2C$\xf3\'\x84VB\x01\x99\x10\x86\tgf\xc9\xcf\xa3(\\7\x01,\x12t\x9d\xa0\xe0\x84\xfeY\x02\xedO\x80\x90\x84\x92$!\xc5$\xd8;\x01\xfd\x12L\x7fA\xa1\x92\x9c\x0c\'S\xec\xa1w\xfb\x89jjO3dO\t\xbf\'\xa8\xf7\xf0\xb4}\xac\x10\xb2O4\xf8\xf6\xa2\xebO"\x82<{\x94\xb6\xa7E\xb2\xdf\xaa\xc7\\\xd1\x1d\xdd\xa3\x93=\x9a\xda\x8b\xfe$\x87\xedE\x11R\xaf\xecU=f\x8f\xd2\xf6\xec~om\xf9\xeaR\xadqE=rE\xa3\xeb\x8a:\xe7\x8a:\xe7J\xea\x9c{\x11\xa9s\xae\xa8\x94\xae\x04\xc5\xafE$\xbf\\\xd1l\xbb\xa2_u\xc5\xe6\x8a\x12\xca\x82\xe7\xc5\x9a\xc6z\xb1\xae\xb8P$\xc0\x8b`H\xb1\xa8\x10Q\xf4\x15N\x8ad\xe5"\x80T\xa4<*\xb6\x15\xc7\x8a\x1c\xa0\x15#\x85\x93"\xed\x87\xe2D-[\x84P\x14c\x05\xd0"\xa7\x87\xc5\xad\x1a\xaeH\xfe)\x9e\xd4.(S\xb4\xb6\xac\xf64\xc5\x8cr\xb2"\x14\xa8\x88\xbb\x17\xf1\xe6\x8e\xaf\x88\xd4\xa1r\xefp\x9b\xa1C=\xd7\x81rt\xd0_\x87\xf6X\x87\xc2\xb7#\xbb\xff&"-\xafN\x131Q\x07\xed\xd01\xec\x80n\x1d\x1a\x82\x1d\x02\xaa\xa3\x8a0\x1d\xd0\xb6\xe3\xb02\xee\x85t\xb8\x17\xd2\xb1N\x1d;\xec~\xcb\x81\xdf/p\xeaZ\xbc2\'O\'\x1a\x1a\xbf\x12\xb5\xdc/Y\xb0T>\xbfR5\xd7\x1d\xfc\xe6\x8e\xe0\xba\xc3Dw\x04\xc9\x1d\xa5\xfc\x1dArG\xe8\xdc\x11$w9\x8d\x81;\t\x129\x0e\xbb\x93EJ\x82\xb9\xa3\x9dp\xf7E\xc3\xa1\xc5\xed\x8a;\xab\x81F\xeb\xbeb\xc5o\x05\x9dT@\xbd\n\xc0ZaG\x15vT\xc1\xa7*\n\xa1\xa6\x92\xf9(r2\x95g\xf4^\xe1\xeeH\xa5\xc9\xefH\xf7\x95\x10\xb1\xad\xc1S\xc1\xa9*O\xea>\x95\x8a\xee\xb9R\xd7\xf0\xabp\xdf\xa6\x12\xa8\x87V\xc4\x85\x7f\x88\xc8\x8d\x9dJ\x81\xc9\xf2\xea(\x15\xc8E\xa5\xc8\x80\x1f\xac\xa1\xc4S*\xe4\n9\xaaB\xa3\xb5B\xc2\xab\x08\xceK\xbb\xadB2\xaf\x88\xf7\x08\xa2WH\xe6\x15\x12Ae\xa4\xc8Q\xa1\xd7\x98\xa5\xb0\xce\xaeu\rY\x8a\xf0,\r\xd1,\xb6\xf7\xb0a\x16\x92\x90\x85\x82f9O\xce\x92\xad\xb2\x9c\xa8e\xa1$Y\xc8f\x96s\x80,\xa1\x9c\x85E\\\x8b\x01\xe4\xf8?\x0b\xad\xcc\x82\x0b\xd9H\x8d\x95m\xf26i;\n^g\xe9@e\xf1\x87lU\xed\x96-3\x96.h\x96r(+\xfe 
\x80\x9e\xad\xf1b\n\xaa,\x9d\xd8l\x81\x9fy\n\xb6\xd9\x92:W\x96\xcb\x1c\xd9"/\xf6\xd9\x85\xc4\xf71\xb1\x99\xe3!\xb3\xc6@jUT\x0b\xfbv\x13\xa7*\x9eL\xf8$\xa3\x89\xb4\x94PL1c\n\xb1I\xc9\xd1)Q\x99\xd2\x01H\x89\xeb\x94hO\xc9\xe7\xdf\xa8\xae\xbei\xae5\xdf\xa8\x98\xbeQ\xcb}\xb3\x96#\x9e"\x97`R|8\xc5SR\xf1\x1fa0)EP\xfa\x0b\x11\x0fL\xc7\x1a\x10)\xa7\x85)\xae\x9f\xd2\x92O!\xafi\x9f5\xd0\xbeOi\x87y\xa1z`\n7M\x0f\xea\xb8\xe9\x9e\xc9\xe0\xa6\xdf\xacb8%\x1b\xa7\xc4u\xca-\xa3\x14r\x9a\xc2\xc9R\x98Z\x83}6\xe8f6h&4\x92\x8f\xa7\xa6Erk\xf0\xe2\x06i\xb7\x81\xef7\xa08\r*\x9b\x06\xd7\x85\x1a\xa4\xf3\x06d\xa6Am\xd4\xa0\xbaj\xf8\xfc\xec\x07O\x9f\x11\xe1@\r\x9a\t\r\x88O\x03Do\xb4\x18@\x0f\xa2\x01\x8c7:\xec\xc2J\xd1\r\\\xbcA\xc9\xd4\xb0\xda\xb7\x0b\x92m\x03\x8e\xd3\x80\xb36,\x05\xe2\xee\x0bk\xe2\x93me\xff16\x88\x01\xdf\x18W\x8aa+1n\x17\xe3\xa2\xf1P\x8d\x14c\xe6x\xccX\\?\xc6\xf5c\xc2$&-\xc4\x80o\xbc\xd0\xe0\x89q\xaax\xc9\xdb\xc8<\xf1\x8a\xb1\xb0\x99\x18g\x8d9(\x8f\xa9\xbabJ\xb8\x983\xc0\x980\xb9\x82\xac,\x80\x8b\x05Zm\x9dTy#\xbf\x03|b(A\x0c:\xc5\x90\xf7\x98c\x9c\x18\xc3\xc4\xa0^\xcc;b\xe0+\xb6\x88\x8b\xebk`\xbb\x9c\xc0\xb9\x9c\xb5\xb9\x82\xda\x92O\\\xf1}I\x85.G\xb6n\x9e\xb1u\xc4\x1a?\xe3\xac\xcd%\xa6\\\xb2\x8c[\xe6gD\xa5\xfb\xc8+\xda\xea\x11.\'p.gm.w\x86\\\xce\xda\xdc&\xf3r\xd6\xe6\x86\xfa\xd4!\xc5\xba\x9c\xc09\xdc>q)\xf5]2\x8ck\r\xa0#\xe4\x12\x03.g\xba.\xa5\xbeK\xa9\xba\xd9\xf1\x94\xbb4.Wl\\b`\x83\x83\xba\xdc\xa3q9\xecp\xc5W\x85\x1a\xb9\x90\x95\r5\xb2\x8b\xaf\xba\xc4\x80\x0bww\xd7h\x12\xf6\xb5\xe1\xfe\xc2\x86\x1do\xe8vm8\xe1s9~\xdap\x14\xecr\xd8\xe1\xda\xa7K\x1b+s;\xd6\xd5f\x1a\xe0\xaev\xd33\x1bBf\x83;\xbbV\xf7\xd1u1.a\xe0f\x99\x98\x88\xd80`\xe3\xa2,x\xc0\x86H\xdb\x90\xd07\xf0\x80\r\x01\xea\xa0\xee\x11\x17\\G4\x17#\x16\x1c\xb1\x8d\x88P\x8ch]E\x16:G\xb24\xc92\x11\x0b\x8e\xe4\xcdB\x1a"\xbd\xc8o"\x80::\xe9\xb5$\xf2A\x8d\x13a\xf4\x88l\x1a\x01f\x11\x1d\xd7h\xc3\xd8\xa9*0\xa2=\x16QKF)K#\xcfG@r\x84\x0fF\x84D$\x81"\x146J\x18\x10)4DT\xb9Q\x07Q@@\xca\xeb\x88\xcb\xb7\x11\x17u#\x92{TV\x18\x89\xe8JF\xa0OT
g\x00\xd9?\x82\xb7Fy\xe6\xf5\x18Ku3\xc4\x9eC\xac<\x14\xd3\xca\x9d\xcc!.3\xc4e\x86\xda\x1e3C<mH6\x1eb\xef!$q\x88\x07\x8f\xf0\x9e\xa1\x15GC\x02w\x08b\x0c\xe9h\r\xe9h\ri\xb6\x0fi\x97\x0ci\x9a\r\xb1\xcb\x10\xee8\x04\x94\x86\xdc\xe4\x1f\x02kC\xcd\xbbf\xc4\xe6\x1c\xa9\xb4\xa5\xfe>\xb0\xcf\x03\x9b;\xb0\xe5\x03\xfb<\xa0\xb4\x03\xaa<\xa0\xbf\x03\xaf8`\x81\x03v9\xa0\xa9\x11o\xbb\xa63p\xcd\xd5\xafk\xdag\x07K\xab\xd7\\\xfb\xbf&\x8b_\xd3r\xb8\xa6\xe5pM\x1b\xe1\x9a\x0e\xdc\xb5\xac]: \xd7\xec\xf3\xda\xda\'Z=PU\x1e\xe6\xfa\xb3\x03\x08y\xa0\xbds\xe0`\xe3@\xf7\xeb\x00\xf8\x1e\xc8<\x07\x0e+\x0e\xc0\xf7\x81\xabI\x07\xa0\xfe\xb0d\x06\xfc\xe8@\xff\xec\x00\xe8\x1d(\x93}\x0bz|\xd0\xcbg\xcb\xbe\x85o\xbe\xc2\x9e\xf1\x81/\x1f\x8b\xfb\xdc\x88\xf7Aa\x1f\x83\xfaX\xdc\xa7\x7f\xe1\x13\xcb~\xa0p\xe1K\xdcK\xe9\xea\x83\x11~Y\xd1\xc0\x87u\xf8\x12\xe1/"B\xea}>_\xf2\xa9b}j\x01\xbf\xc0\x0cy\x96\x0e\xd5\xf7\xa5\x00\x10\x92\xed\xbf\xf0bN{\xfc\x0e?\x83\xdf\xfb\x94\xf0>=\x1f\x9f\n\xc1\xa7\xe7\xe3\xd3"\xf1q\x19\x9f\xfbZ>\xc7L>W\xe3|\xf1\x08a\xbd\xbex\x84d.\x9fF\x84Oq\xe8\xe3S\xfe\x9e\xb7Au}\x9af>\xd0\xe3C@|r\x91\xbfd\x91\xe2i\xbfE\xa47\xf3|\xf2)1\xe73\x01\xf3\x8co<\x8b9\x9fE\xa4_\xf5La\xf6\x0c\xbd}~V\x13\xfd#\x88$\x14\xfa\x1f.\xc5?\x8b1\xa4)\xf1\x0c\xb3\x99Zh0\xe5lc\x8a\xafN9?\x9d\x02ISh\xfa\x94\xb5O\xc1\xa1)\xa11\xc5\x99\xa7\xc0\xd7\x14o\xbfg\x86{\x1a\xf6\xf7\xf4Y\xef\xef\xf4m\xf79]\xef=Pw\x0fN\xdd\x83^\xf7|\xe0t\x0f\xd2\xdd\x0bzIk\xf4\x1eL\x9bb\xfb)\x1f\xd5Ma\x86\xd3\xa1b\xc4\x14\xc0\x99\x02oS\xe0mJG\x7f\n\xeb\x9d\x92J\xa6P\x87)04\xe5\xb6\xea\x14\xef\x99\xc2d\xa6$\xb9)e\xd9c\xa0\x0e\xf1\xe8+L=J\xf8J[\xf3\x99\xf3\xd5GV\xf6(K\x17\xa2\xf2\x88C<ri\xf4\x11k>b\xa1,*1\x0c\xf8\xafM\x80?c\xf0\xcf\x18\xfc3\xa3?\xe3\x1c\x9f/x\xca\x8d\xa1\xcf\xa0\xe2\x92\x88Y\xa2\xaa%Lo\x89~\x96\x1bDBu\x89\xaa\x96\\D^\xd2\x96\xfcl/~I\xd5\xb4D-K\xd8\xe2\x12;/\xb1\xfe\x92\x84\xb5D\xc7K>\xbf\\b\xfd\x1b\xf2\xe7\xd2\x8a\xbf%j[\x12\x1cK\xd8\xc1\x92\xfe\xc5\x92P\\\xc2:\x96\x98i\x89\x8a\x97(\xfe\x86\xa7\x01c\x03W!\'\xb0\x06h\x88\x9b\x80,\x16\x
80\x0c\x01\x9d\x95\xe0\xb4\r\xf1\xb6\x806_@\x9a\x0fh\xf3\x05c\x8d\xe6\x00\xfa\x15\xd0Y\t\xf8\x10"\xe0\x849\x80\xd6\x05 n@\xfb+ u\x07DR@\xc6\x0f$P\xaa"rn\x15\xd4\x11\xb9\x04\x10Ty\xca\xf5\xc5\xa0\xac0\x1cH\xd2\x14\n\x1d\x94\x18\xcb\xd7\xb2\x01\x07\x04A\x01M\xf1\xe1l\xe0\xf1TR\xa9\xa4\x82\xa0\xc3+\xc8\x94\x01\xb7\xc1\x03:\xdc\x01UE\x10\xaaO\x05Z`\x98\x1en\xd2\xe3\x10\xbb\x87\r{\xd8\xbb\x87\x9b\xf4\xf0\x8d\x1e\xde\xd5\x83\xfd\xf7\xbe2\x16\xaf\xed\xbd\x02v\xbd\x81Z\xa0\x07\\\xf6F\x0c\x80\x8f\xf7z\x0c\x00\x18{TZ=\x82\xab\x97j\x18\xf5\xc6LF \xf6h\x9f\xf56\n\x97=\xdc\xa4\xf7\xc6\xcap\xa9\x1e\x05F\x8f\xa6m\x0f\xe8\xb8\xb0Ab{\xfaC\xc0\xd3\xa13ra5)\xb7\x84\xf0\x05J\xbe@\xc9[\x14wA$]X7E/2\x1c\rl\xad\x1f2\xdd\x96\x8b}[\x8e\xd5\xb6\xd8w\x0b\xa6n\x7f\xf2\xbe\xba:\xcbE\x11\xd1G,!\xfe\x97=]p\'\xec\xa2\xa3\xe2\x16%m\x856\t\xff\xd9\nmz\x17\x91\x8b\x9c[\xda\x8d[\x94\xbf\xc5$\x17\t\xf3\x02\xf7[\x92\xc0\x16\x1e\xb8\x05S\xb6|c\xbe\xa5\'\xba\xe5\x90xK\x83uK\xf9\xb7\xa5\xed\xb5\xe5\xde\xfeVPI\x9aV\xdbX]hK\xf1\xb1\xed)\xae\xb5\x0e\xba\x9c\x16m/\xcf\xeaA\xb6V\xaa\x93{\x0b\xed[\xb4\x17Zd\x94\x16I\xb9ES\xb9\x05]\xf5\x08\xe3\x960\xedc\xef\xdbx\x1c\xc3\xb4\xba\x8a\t-\xb1\x91\x90\xf9\x96\x80\x86\xd4\x0b-\x81\x12\xa9\x17<q*\xb9l\xdd\x82t{\xe2T\xc2*[\xfc\xb3\x82\x16\xa7\x04-N\xc8Z\x94\x19\xad\no\xa3\xa0hq\x87\xbf\x05qm\t\xf4\xc9)\x96WPP\xf6\xf2\xac\xc1\xfa\x19q\xe2q\x19\xc3\x13\x0f\x15\xa6\xe3Uto\x1e\xb7\r<\xaa\x1e\x0f\x84\xf7X\xba\xc7\xb1c\xcb*\xde\xbc\xa6\xc6\xa2\x17\xb1`\xce\x19<\xa0\xd8\xa3\xc0\xf1:<}\xd2\xdd{\x94H\xde3O_P\x8f\xa3\x9e\xdf"j\xbd\xbeb\xa3\x07/\xf5\x06\n}\xde\x08\x91\xa3\x05\x0f\x14\xf4\xe8cyP\x97\x16\xf7\xe8<\xd0\xd5\xe3h\xc1#v<J\x19\x8f\xa3c\x8f\x98\xf4V,\x92\xf3\x04\x8f\x00\xf7 
f\x1e\x9f\xe3y\xf4R=>\xfc\x1c1\xd6\xa1\x976\x82\xef\x8e\xacf$k\x18\x81\x0b\x0e\xa1\xec\xf0\xbd\xbeC#\xd9\xa1\xbd\xecp\x99\xd2Ag\x0e\xd9\xcb\xa1m=\x02\xdd\x1c(\xdc\x88\xb3\x9d\xd1P\xb53"\xd3\x8d\xe8D8\xb0\x15\x87\x96\xc2\x88;\x98\x0e-n\xc7R\t\xc7\xed#\x8c\xe5\xf0\xa5\xd1\x88\xa5\x8f\xc6\xea\x04\x0e\x07\xd5\x0e\x9f\x0c9\x1cn8|t\xe4p\x10\xe2p<\xe2\xf0\xb9\xaf\xc3\xd7\xc1\x0e\xdf\t9|S\xe4p\xce\xe1\xf0\xfd\x91\xc3\x99\x88\xc3\xb7J\x0e\xe7\'\x0e\xdf\t9\x9c]8|S\xe4p\xce\xe1p\xfa\xe1p&\xe2pR\xe2\xf0\xad\x92\xf3\xc2+\x9e\x99\x8c\xd3\x8f\x11\xe1\xe4H>\x94v\x80c\x14+\x1c>\xffv\xfe\xf5!\x1a\'ct\xb2\x7f\x8eO\xa5\xdf\xe7\xc8\x89\xb7\x90=\'\x8b\xc8\xb5\xbf\x11\xd5\x8fC\xfev\xa4B\x95km\x0eu\xab\xc3\xb7\xec\x8e\x94\xbbR\x04\x8f(\x84\x1c)w\x856;R\x04Ki<\x82\xaa9R\xcd~\x11\x91\nc\x04\x81\x1bY\xe9\xe7\x1d\xa2\xf5N\xbd\xf2N&z\xc7\xbb\xde\xb9d\xf8\x0e\x1f\x7f\x87\xa5\xbf\x13#\xef\xef\x1a\xb2\xef\x94`74\x9b\x1cB\xf6f\xa0;z\x87\xd3\xbc\xbb\xbc\xcd\xda\xdcZ\r\xf7\x0ef\xbe\x83\x99m\x0e|\x1c\xf0\xea\x86\n\xff\x06]\xdf\xd0#\xb8\xa1\xefyC\x8f\xe0\x86/\xacnh\x9d\xde\xd0P\xbd\xa1\xf7pC+\xe4\x86\xf5>nu\x17\x0eHZ\x12\xbf\x17\xe4/\xd1\xe5/\xd1\xfb/q\x03\xa9D7\xbeTR\xff,q\xd7\xa8D]R\xa23X\xe2\xba\x7f\tU\x97\xb0E\x89{\x0f%\x0c[\xe2\xf3\x84\x12Ek\x89\xa3\xe6\x92u 
^\x82\xaf\x96\xc4\x02R\x14\x948\xed)\xb9\xcc\xc6\x8d\xbb.\xed\xc9.]\xcd\xae,X\x9a\x80]z\x16]v\xdf\xa5\x90\xea\xc2R\xba\xa2\xbfS\xce\xee\xd28\xee\xe2\xa0].\x83t\xed\xcfA\xce!K)\xd0|N\xa4u\t\x99\xae\xab\xf6\xe8\xe2\xa2]\x8b/t\xf5\x03a\xd3\xa5L\xeeBZ\xba\x14\x02c\x9e\xce\xa8|g\xe4\x92\x19\xb7\x07f\xe4\x92\x19]\x8bY_w:\xa3\xee\x98Q\x1f\xcd\xb8:2\x9b1\xc3\\\x83c\xcd\xe6f\x84\xf8\x0cE\xccH\xc53\x92\xf9\x0c\x7f\x9e\xe1V3R\xf1\x8c+\xd93:\xa63\x90\xe1\x9c/\xd8g\x00\x91\x99Q\xa2\xce0\xc1\x8c\xae\xc7\x8c\x18\x9f\x11_3\xac1\x03Zg\xd6\xe6P\xfb\x0c\x18\x9ea\x81\x07&{`\xb2\x07y\xb1$\x93\x87\x07\x9erq\xf2\xe1Zq\xfa\xe1F\x01\xf7\x81\xcd=\\\xf1\x14\xecx\x00Q\x1e\x04;$\x83<\x08\xa2H/\xb2\xea|\xc4\xb8\xa9\xe2GUb\xaaj9]\x95\x05W\xd9Q\xf5\xa4V\x89\xaaj\xacJ\xa9R\xefT\xb1x\x15\x86X%\xca\xab\x90\x8e*uK\xd5\xd7x\xaf\x12\xc3\xd5\x9a\x06n\x95\xb8\xac\x86\x8aUU\xae\xe5U\xb9\xb1Y\x85\x13\x9f\x91\xc4\xcf:\xfa\xe2\xb3\xa6\xae\xec\x0c\x1ap\x161\x00\xd2q\xc6\xbf$;\xcb\xeb\x80\xefv\xad~\x86{\x9cQ\r\x9f\xd9C.\xf1\x95\xdfh\xb6\x85\xf8\x9b\xff\xfe\xd2\xa4Q\xd0\xdc 
\xc2T\x9b\x07u\xdd&`\xd4\x14#\xc8\x19@\x13\xf6\xd9\x9c\xa8\xb75Sf\x00\x80\x9b\xdc\x82lF\xaa\xcd\xa6hH0\xbe\xd9A$\xa34\xf9\xf8\xb6\xd9U\xfcmr\xa2\xd3\xa4\xbejr7\xb2)\x8a\x95z\xb0I\x1ai\xd2\x15kr\x81\xac\xe9\xf06"\xa9\x89\xce\x9a\x94LM\xeb\xf8\xac\xcf\xc7\xab\xfd\x89j\xb5\xcfU\xa8>t\xa4\x0fI\xe9S\x15\xf4\xa9\xc9\xfb\x16HR\xe6\xf4\xb9\x98\xd1\x07\x7f\xfa`U\x1f\x04\xeb\x93\x9c\xfb\xd8\xb0\xbfa26\xd7\'\xab\xf5\xd9g\x1f|\xeaS\x9c\xf7\t\xcb>\xf0\xd3\xc7\xd1\xfaV\x8b\xe0\x8d\x1d\xbd\xd1s~#X\xdf\xf8\x94\xfc\x8d\xb5\xbf\xb1\xe07\xdd\xa7y\xcb\x18\xfd\x19k\xcfc\xf0<\xdfB\xe5\xa9\xb8\xf3T\xc6\xf9@a$O\xb8\xe7\xdb\xcc\x00\x8d\xc9\x13\xf9y\x02;O\xea\xcd\xd3\xe7\xcb\xe3\xd7y6\x94\xe7\x7ft\xe5\xe9\xd2\xe5\xe9\xe0\xe6\xb1\xe1F\x9b&&\x0fH\xe692\xcbc\x97\xbc\x85\x97yL\xd0fD\x1b\xf5\xb4\x15}3#,\xd7\xde\xe8z\\\x98q\x9b\xfbDm\xc9\xab\xc2\xfd\xda3\x1d\xdb\x06D7\xd6\xcf\xba\n\xa2m)S\xe4\x18\xb6M7\xb7\xcd1M\x9bo\xdf\xda(\xb8\r\x18\xb4\xeb\x1a\xa9m1\x9c\xb0\xc7\xb6\x18NZ\x1am\xba\x1bmxb\x9b\xeb\x9b\xed\xa2\x86r\xfb\x87"@\xdbS#\xb7i\xcc\xb4\xf3\x1a\xcac4\xf9\x89\x1c\xfd\xc9\xba\xaf4\xe6\x9e\xd3\'\x98\xd6\'2\xf3\'\xeb\xbf6|\x02\x9c\xc7\xf0\xe81\x86\x19c\xae\xb15\x96W\x8f9\x14\x19C%>\xd9\xf0>\xb6\x0fY\x80\xe41~5\x06\xd4\xc7\xc0\xc4\x98\x92b\x0cL\x8c\xe1Gc\xf8\xd1\x98o#\xc7\xf4\xa5\xc7\xb0\xea1\x1cm\x0c]\x1ds\x9bjLwaL\x95:\x86\xad\x8f\xb9\xc60\x16\xca(g\xdd\xe3\x01\x1b\x02\r7P\xc6[J\xa0[\xa11\xc2<n\xa1&\xb7P\x93[\xbe\xbc\xbd\xcd\xa99n\xf9\xc7\x11\xb7\x14Q\xb7\xfc\x93\x89[\x8a\xa8[Lw\xcbY\xee\x85e\xf2[<~\x04t\x8e\xfeZ\xf4\xff\xfe\x1f\xfa\xddI\x97'))
global t
t=' '
def f(k):
 global t
 r=a[t+k]if t+k in a else'e';t=k
 return r

— Skyler

1

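Quick explanation: the result of zlib.decompress('...') is {'G?':' ', 'G;':' ','G"':' ',.......}, and a is a dictionary that maps 2 characters to 1 character. It is basically a 2-character variant of Steadybox's answer.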

— user202729
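
The dictionary described in the comment above could be produced along the following lines. This is only a minimal sketch, not the answerer's actual generator: the names build_table and make_predictor are hypothetical, and the fallback 'e' and the initial state ' ' are copied from the answer's f.

```python
from collections import Counter, defaultdict

def build_table(text):
    """Map each 2-character context to its most frequent successor."""
    counts = defaultdict(Counter)
    for i in range(len(text) - 2):
        counts[text[i:i + 2]][text[i + 2]] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def make_predictor(table, fallback="e"):
    """Return a stateful function that guesses the next character."""
    prev = " "  # same initial state as t=' ' in the answer
    def predict(k):
        nonlocal prev
        guess = table.get(prev + k, fallback)  # unseen context: fall back
        prev = k
        return guess
    return predict
```

The predictor only remembers the previous character, so the state kept between calls is a single character, as in the answer.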

1

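As far as I can see, the literal is 17780 bytes. You can reduce it to 11619 characters by removing the whitespace in the decompressed content, which saves 12322 bytes (if I counted correctly). Also, converting the hex escape codes to actual raw characters may save even more bytes.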

— user202729
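
One way to read the figures in the comment above, offered as an assumption rather than as the commenter's own calculation: the literal shrinks by 17780 - 11619 bytes, and because the score counts every byte twice (2*L), the quoted 12322 matches the resulting score improvement.

```python
# Figures taken from the comment; the interpretation is an assumption.
old_literal = 17780
new_literal = 11619

saving_bytes = old_literal - new_literal  # bytes removed from the submission
saving_score = 2 * saving_bytes           # each byte of L counts twice in 2*L + E

print(saving_bytes, saving_score)
```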

If it's raw bytes, how do I post it here?

— Skyler

1

xxd, hexdump, uuencode, or similar

— Peter Taylor

@user202729 Note that Python code cannot contain an actual raw NUL byte.

— mbomb007 '18

4

Haskell, (1904 + 1621 + 208548 + 25646) * 2 + 371705 = 847143

{-# LANGUAGE FlexibleInstances, DeriveGeneric #-}

import Control.Arrow
import Control.Monad
import Control.Monad.Trans.State
import Data.List

import System.IO
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Char8 as BC8
import Data.Ord
import Data.Char
import Data.Monoid
import Data.Maybe (fromJust, catMaybes)
import Data.Function
import qualified Data.Map as Map

import Codec.Compression.Lzma

import Data.Flat

import GHC.Word

maxWordLen :: Integral n => n
maxWordLen = 20

wordSeqDictSize :: Integral n => n
wordSeqDictSize = 255

predict :: [Trie] -> Char -> State ([Either Char Int], String) Char
predict statDict c = do
   (nextChar:future, begunWord) <- get
   case nextChar of
     Left p -> do
       put (future, [])
       return p
     Right lw -> do
       let wpre = begunWord++[c]
       put (future, wpre)
       return $ trieLook (tail wpre) (case drop lw statDict of{(t:_)->t;_->Trie[]})

newtype Trie = Trie [(Char,Trie)] deriving (Show, Generic)
instance Flat Trie

trieLook :: String -> Trie -> Char
trieLook [] (Trie ((p,_):_)) = p
trieLook (c:cs) (Trie m)
 | Just t' <- lookup c m  = trieLook cs t'
trieLook _ _ = ' '

moby :: IO (String -> String)
moby = do
    approxWSeq <- BSL.unpack . decompress <$> BSL.readFile "wordsseq"
    Right fallbackTries <- unflat <$> BS.readFile "dicttries"
    seqWords <- read <$> readFile "seqwords"
    let rdict = Map.fromList $ zip [maxWordLen..wordSeqDictSize] seqWords
    return $ \orig ->
      let reconstructed = approxWSeq >>= \i
             -> if i<maxWordLen then let l = fromIntegral i+1
                                     in replicate l $ Right l
                                else Left <$> rdict Map.! i
      in (`evalState`(reconstructed, ""))
              $ mapM (predict fallbackTries) (' ':orig)

Example:

Call me Ishmael. Some years ago--never mind how long precisely--having
 ap  me ,nhmael.  Hme ?ears |ce--never  usd how long .aacesely--|ubing
little or no money in my purse, and nothing particular to interest me on
little or no ?ivey in my ?efse, and ,uwhing .hrticular to Bdaenest me on
shore, I thought I would sail about a little and see the watery part of
?neae, I thought I would  cfl about a little and see the |rkers part of
the world. It is a way I have of driving off the spleen and regulating
the world. It is a way I have of ,uiving off the |kli   and .ia       
the circulation. Whenever I find myself growing grim about the mouth;
the Ca         . B        I  rtd |yself ,haoing  eom about the ?ivlh;
whenever it is a damp, drizzly November in my soul; whenever I find
Baieever it is a  'mp, ,uiv    Bar      in my  cfl; Baieever I  rtd

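It uses three precomputed auxiliary files:

seqwords contains the 236 most common words.
wordsseq contains an LZMA-compressed sequence of these words and, for every word that is not among the 236 most common, its length.
dicttries contains, for each word length, a decision tree (trie) holding all the remaining words. Entries are picked from these tries as we go.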
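
The byte coding behind wordsseq can be sketched in Python as follows. This is an illustration under assumptions, not a port of the Haskell: encode and decode_hints are hypothetical names, the word list is assumed to be pre-split into chunks of at most 20 characters (as depunct does), and the LZMA step is left out. Codes below 20 stand for "an unknown word of length code + 1"; codes 20..255 index the 236 common words.

```python
from collections import Counter

MAX_WORD_LEN = 20  # mirrors maxWordLen in the Haskell code

def encode(words, n_common=236):
    """Return (common word list, one byte code per word)."""
    common = [w for w, _ in Counter(words).most_common(n_common)]
    code_of = {w: MAX_WORD_LEN + i for i, w in enumerate(common)}
    # Uncommon words are reduced to their length minus one (0..19).
    codes = bytes(code_of.get(w, len(w) - 1) for w in words)
    return common, codes

def decode_hints(common, codes):
    """Expand the codes into one hint per character of the text."""
    hints = []
    for code in codes:
        if code < MAX_WORD_LEN:
            length = code + 1
            hints.extend([("len", length)] * length)   # only the length is known
        else:
            hints.extend(("char", ch) for ch in common[code - MAX_WORD_LEN])
    return hints
```

A "char" hint gives the exact character, while a "len" hint only tells the predictor which per-length trie to consult.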

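This way, we achieve a significantly lower error rate than all the other lossy schemes. Unfortunately, the wordsseq file is still too big to be competitive.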

Here is a complete version that creates the files and does the analysis:

depunct :: String -> [String]
depunct (p:l) = (p:take lm1 wordr) : depunct (drop lm1 wordr ++ srcr)
 where lm1 = maxWordLen-1
       (wordr, srcr) = (`span`l) $ if isAlpha p
                 then \c -> isLetter c || c=='\''
                 else not . isAlpha
depunct []=[]

mhead :: Monoid a => [a] -> a
mhead (h:_) = h
mhead [] = mempty

limit :: [Int] -> [Int]
limit = go 0
 where go z (n:l) | z<100 = n : go (z+n) l
       go _ l = take 1 l

packStr :: String -> Integer
packStr = go 0
 where go n [] = n
       go n (c:cs)
        | c>='a' && c<='z'  = go (28*n + fromIntegral
                                   (1 + fromEnum c - fromEnum 'a')) cs
        | otherwise         = go (28*n) cs


mkTrie :: [String] -> Trie
mkTrie [] = Trie []
mkTrie strs = Trie [ (c, mkTrie . filter (not . null) $ tail<$>l)
                   | l@((c:_):_) <- sortBy (comparing length)
                                  . groupBy ((==)`on`head)
                                  $ sortBy (comparing head) strs ]

mkTries :: [String] -> [Trie]
mkTries rsrc = [ mkTrie $ filter ((==l) . length) rsrc
               | l <- [0..maximum (length<$>rsrc)] ]

main :: IO ()
main = do
    orig <- readFile "whale.txt"
    let wordchopped = depunct orig
        dictRes
          = take 5000
          . map mhead
          . sortBy (comparing $ negate . length)
          . group . sort
          $ wordchopped
        dict = Map.fromList $ zip dictRes [maxWordLen..wordSeqDictSize]
        rdict = Map.fromList $ zip [maxWordLen..wordSeqDictSize] dictRes
        approxWSeq = [ case Map.lookup w dict of
                        Just i -> i
                        Nothing -> fromIntegral (length w - 1) :: Word8
                     | w <- wordchopped ]
        fallbackTries = mkTries . drop (wordSeqDictSize-maxWordLen) $ dictRes
        reconstructed = approxWSeq >>= \i
             -> if i<maxWordLen then let l = fromIntegral i+1
                                     in replicate l $ Right l
                                else Left <$> rdict Map.! i
        predicted = (`evalState`(reconstructed, ""))
              $ mapM (predict fallbackTries) (' ':orig)
        incorrects = length . filter id $ zipWith (/=) orig predicted
    putStrLn $ "longest word: "++show(maximum $ length<$>wordchopped)
    putStrLn $ show incorrects++" errors / "++show (length orig)++" chars"
    BSL.writeFile "wordsseq" . compress $ BSL.pack approxWSeq
    BS.writeFile "dicttries" $ flat fallbackTries
    writeFile "seqwords" . show $ take (256-maxWordLen) dictRes
    writeFile "whale-approx.txt" . unlines $ coLines orig predicted

coLines :: String -> String -> [String]
coLines [] _ = [[],[]]
coLines ('\n':l) (_:m) = []:[]:coLines l m
coLines l ('\n':m) = coLines l ('|':m)
coLines (c:l) (d:m) = case coLines l m of
   (lt:mt:r) -> (c:lt):(d:mt):r

— ceased to turn counterclockwise

3

C++ (WIP), 1923 * 2 + 1017344 = 1021190

#include <map>
#include <random>
#include <string>
#include <type_traits>
#include <vector>

using namespace std;

constexpr minstd_rand::result_type seed = 10087702;

template<typename T>
class discrete_mapped_distribution {
private:
    discrete_distribution<size_t> distr;
    vector<T> values;

public:
    discrete_mapped_distribution() :
            distr(), values() {
    }
    template<typename I, typename = typename enable_if<is_arithmetic<I>::value,
            I>::type>
    discrete_mapped_distribution(map<T, I> distribution) :
            values() {
        vector<I> counts;

        values.reserve(distribution.size());
        counts.reserve(distribution.size());

        for (typename map<T, I>::const_reference count : distribution) {
            values.push_back(count.first);
            counts.push_back(count.second);
        }

        distr = discrete_distribution<size_t>(counts.cbegin(), counts.cend());
    }

    discrete_mapped_distribution(const discrete_mapped_distribution&) = default;
    discrete_mapped_distribution& operator=(const discrete_mapped_distribution&) = default;

    template<typename URNG>
    T operator()(URNG& urng) {
        return values.at(distr(urng));
    }
};

class generator2 {
private:
    static map<char, discrete_mapped_distribution<char>> letters;

    minstd_rand rng;

public:
    static void initDistribution(const string& text) {
        map<char, map<char, uint64_t>> letterDistribution;

        string::const_iterator it = text.cbegin();
        char oldLetter = *it++;

        for (; it != text.cend();) {
            ++(letterDistribution[oldLetter][*it]);
            oldLetter = *it++;
        }

        generator2::letters = map<char, discrete_mapped_distribution<char>>();

        for (map<char, map<char, uint64_t>>::const_reference letter : letterDistribution) {
            generator2::letters[letter.first] = discrete_mapped_distribution<char>(letter.second);
        }
    }

    generator2() :
            rng(seed) {
    }

    char getNextChar(char in) {
        return letters.at(in)(rng);
    }
};

map<char, discrete_mapped_distribution<char>> generator2::letters;

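The solution is currently a work in progress and therefore ungolfed. Also, since the actual code size has barely any impact on the score, I thought I would post my answer first before starting to micro-optimize it.
(The full code is available here: https://github.com/BrainStone/MobyDickRNG - it includes the full program and the seed search.)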

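This solution is based on an RNG. First, I analyze the text: I create a map that counts the occurrences of two consecutive characters. Then I create a distribution map from it. This is all done statically, so it should be in accordance with the rules.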

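Then, while trying to print the text, I do a lookup and draw a random character from the possible ones. While this usually produces worse results than just outputting the most common following letter, there may be "god seeds" that produce better results. That is why the seed is hard-coded. I am currently searching for the best seed, and I will update this answer once I find better ones. So stay posted!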
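
The trade-off described above can be sketched in Python. This is an illustration only: it uses Python's random.Random (a Mersenne Twister) instead of the answer's minstd_rand, so seeds are not comparable, and the function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def successor_counts(text):
    """Count, for every character, which characters follow it."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    return counts

def errors_sampled(text, counts, seed):
    """Errors when each guess is drawn at random, weighted by frequency."""
    rng = random.Random(seed)  # fixed seed keeps the run deterministic
    errors = 0
    for cur, nxt in zip(text, text[1:]):
        chars = list(counts[cur])
        weights = list(counts[cur].values())
        guess = rng.choices(chars, weights=weights)[0]
        errors += guess != nxt
    return errors

def errors_most_common(text, counts):
    """Errors when each guess is simply the most frequent successor."""
    best = {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}
    return sum(best[cur] != nxt for cur, nxt in zip(text, text[1:]))
```

Comparing errors_sampled over many seeds with errors_most_common on the same text shows how far a seed search would have to go to beat the plain "most common successor" baseline.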

If anyone wants to search for seeds themselves or use a different RNG, feel free to fork the repo.

Method used to calculate the score: https://github.com/BrainStone/MobyDickRNG/blob/master/src/search.cpp#L15

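Note that even though the total score is currently the worst, it beats the error count of just outputting spaces. And chances are good that the score will drop as more seeds are checked.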

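Changelog

2018/01/24: Posted initial answer.
Checked seeds: 0-50000. Score: 2305 * 2 + 1017754 = 1022364
2018/01/24: Did some minimal golfing. Added a link to the score calculation method.
Checked seeds: 0-80000. Score: 1920 * 2 + 1017754 = 1021594 (-770)
2018/02/02: New seed (10087702) (did not find the time to fix the submission).
Checked seeds: 0-32000000. Score: 1923 * 2 + 1017344 = 1021190 (-404)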

— BrainStone

Could you include the test harness that evaluates the score in your answer?

— Nathaniel

@Nathaniel I linked directly to the scoring code. Together with the repository, would you consider that enough?

— BrainStone

On reviewing the rules, I noticed that I violate some of them. I will of course update my answer once I have fixed the issues.

— BrainStone

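Then you will end up encoding the text into the random seed. See the esoteric programming language Seed; you might want to reverse-engineer the MT19937 program and beat this answer (if you can).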

— user202729

Nice idea, but it doesn't help. +1 anyway.

— user202729

3

Ruby (1164418)

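I just wanted to see how well I could do without looking at any other answers.
I'm not sure whether this is allowed, because it includes a literal that I generated by analyzing the file, but even if it isn't, it is not as if it was in danger of beating anyone.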

x="\"ect,htabsdd,in,\\nodniwlrfydbulkm;f?ckgwvi0,.*pr;\\\"uz17klI\\n-c'WSpA\\nTwqu8.77!-BeWO5.4.CoP\\n\\\"UHEFu2.?-9.jo6.NI3.MaLYDOGoOAR'QUECziJoxp(\\nYa:\\nVI);K\\nUS*IZEX\\n&\\n$\\n_y[S\""
f=->n{(x.include? n)? x[x.index(n)+1] : ' '}

How I generated `x`

First, I generated a.txt with the following:

grep -o ".." whale2.txt | sort | uniq -c|sort -bn>a.txt

Then I generated a.csv:

cat a.txt | awk '{ print $1","$2 }'|sort -n|tac>a.csv

Then I parsed it into x using the following Ruby script:

f={}
File.open('./a.csv').each{|l|x=l.partition(',')
f[x.last[0..1]]=x.first}
n={}
r={}
f.each{|k,v|if((r.include? k[0]and v>n[k[0]])or not r.include? k[0])and not k[1].nil?
r[k[0]]=k[1]
n[k[0]]=v
end}
s=''
r.each{|k,v|s+=k+v}
puts s.inspect
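
The idea behind the pipeline above (keep, for each character, its most frequent successor and pack the pairs into one string) can be sketched in Python. This is not a translation of the Ruby: pack_pairs and lookup are hypothetical names, the sketch counts overlapping pairs and compares counts as integers, and the lookup inspects key positions only, whereas the Ruby f uses x.index and so matches the first occurrence anywhere in the string.

```python
from collections import Counter

def pack_pairs(text):
    """Build a 'key, value, key, value, ...' string of best successors."""
    best = {}
    for (first, second), n in Counter(zip(text, text[1:])).items():
        if first not in best or n > best[first][1]:
            best[first] = (second, n)  # keep the most frequent successor
    return "".join(k + v for k, (v, _) in best.items())

def lookup(packed, ch, default=" "):
    """Find ch at an even (key) position and return the character after it."""
    for i in range(0, len(packed) - 1, 2):
        if packed[i] == ch:
            return packed[i + 1]
    return default
```

For the packed string "abba", a plain index search for "b" would stop at position 1, which is a value slot; stepping through even positions avoids that.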

How I scored it

w=File.read('whale2.txt')
x="ect,htabsdd,in,\nodniwlrfydbulkm;f?ckgwvi0,.*pr;\"uz17klI\n-c'WSpA\nTwqu8.77!-BeWO5.4.CoP\n\"UHEFu2.?-9.jo6.NI3.MaLYDOGoOAR'QUECziJoxp(\nYa:\nVI);K\nUS*IZEX\n&\n$\n_y[S"
f=->n{(x.include? n)? x[x.index(n)+1] : ' '}

score = 235
w.each_line{|l|v=l[0];l[0..-3].each_char{|n|v+=f[n]};v.split(//).each_with_index{|c,i|if l[i]==c
print c
else
print '_'
score+=1

end}}

puts "FINAL SCORE: #{score}"

— NO_BOOT_DEVICE

I'm sure that is allowed; if you analyzed the file, good job! It is only invalid if the program itself does so.

— Erik the Outgolfer '18

@EriktheOutgolfer >_> (quietly slides "(non-competing)" into the title)

— NO_BOOT_DEVICE

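Why? If this is valid, it is competing, even though it may not beat much. If it is invalid (that is, your solution reads from the file instead of just containing a literal), it should be deleted.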

— Erik the Outgolfer

Hmm. I thought you meant it was invalid if any program analyzed the file, not just the solution.

— NO_BOOT_DEVICE

1

I can't read Ruby, but I think this is valid. Having the literal inside the program is completely fine; it is no problem at all.

— Nathaniel '18

2

Python 3, (146 * 2 + 879757) 880049 bytes

def f(c):return"\n                     t \n 2  sS \n  -  08........       huaoRooe oioaohue thpih eEA \n   neo    enueee neue hteht e"[ord(c)-10]

Try it online!

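A pretty straightforward frequency table. Each position in the string corresponds to the current character's ASCII code (minus 10 = 0x0a = '\n', the lowest character in the file), and the character at each index is the most frequent next character. Assuming I calculated the frequencies correctly...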
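
A table of this shape could be generated roughly as follows. This is a sketch under assumptions, not the answerer's script: build_table is a hypothetical name, the upper bound of 126 and the space used for characters that never occur are illustrative choices.

```python
from collections import Counter, defaultdict

def build_table(text, lowest=10, highest=126, default=" "):
    """One character per ASCII code: the most frequent successor."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    return "".join(
        counts[chr(code)].most_common(1)[0][0] if chr(code) in counts else default
        for code in range(lowest, highest + 1)
    )

def f(c, table, lowest=10):
    """Same lookup as in the answer: index by ASCII code minus 10."""
    return table[ord(c) - lowest]
```

Running build_table over the whole file and pasting the result into the answer's string literal would give the one-line predictor shown above.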

Tested with user202729's test code.

— Kevin

Can you save some bytes by using def f(c):return(" ">c)*c or"t ... e"[ord(c)-32]?

— Neil

0

[Python 3] (644449 * 2 + 0) 1288898 points

Perfect accuracy in only 644449 bytes

import zlib,base64 as s
t=enumerate(zlib.decompress(s.b64decode(b'###')).decode());a=lambda c:next(t)[1]

The full code does not fit in an answer, so I have put it here and replaced the large binary string literal with b'###' in the answer text.

This was generated with the following code, where "modified.py" is the generated file and "cheatsheet.txt" is the whale2.txt file starting from the second character.

import zlib, base64
with open("modified.py","w") as writer:
    writer.write("import zlib,base64 as s\nt=enumerate(zlib.decompress(s.b64decode(")
    with open("cheatsheet.txt","rb") as source:
        text = source.read()
        writer.write(str(base64.b64encode(zlib.compress(text,9))))
    writer.write(')).decode());a=lambda c:next(t)[1]')

The code can be run by adding the following to the end of "modified.py". "whale2.txt" must be in the same directory as "modified.py", and the output is written to "out.txt".

with open("out.txt","w") as writer:
    with open("whale2.txt","r") as reader:
        text = reader.read()
        for b in text:
            c = a(b)
            writer.write(c)

This answer does not directly access whale.txt or whale2.txt. It uses existing standard compression libraries, which the rules explicitly allow.
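
The runner above writes the guesses to a file but does not count errors. A scoring harness in the spirit of the challenge could look like this sketch; count_errors and total_score are hypothetical names, and predict stands for any stateful predictor such as the lambda a defined above.

```python
def count_errors(predict, text):
    """Feed characters one at a time and count wrong guesses (E)."""
    errors = 0
    for cur, nxt in zip(text, text[1:]):
        # The guess must be produced before the next character is revealed.
        errors += predict(cur) != nxt
    return errors

def total_score(size_bytes, errors):
    """Challenge score: twice the submission size plus the error count."""
    return 2 * size_bytes + errors
```

With E = 0 and L = 644449, total_score gives the 1288898 points stated in the title.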

— Legorhin

There may be some "\r\n" that I could not get rid of on Windows when I was counting them.

— Legorhin

2

Yes, that was a typo that propagated.

— Legorhin
