Is it possible to compress random shuffled-deck data?


19

I have real data I am using for a simulated card game. I am only interested in the ranks of the cards, not the suits. However, it is a standard 52-card deck, so there can only be 4 of each rank in the deck. The deck is shuffled well for each hand, and then I output the entire deck to a file. So there are only 13 possible symbols in the output file: 2, 3, 4, 5, 6, 7, 8, 9, T, J, Q, K, A (T = ten). Of course we could bit-pack them using 4 bits per symbol, but then we waste 3 of the 16 possible encodings. We can do better if we group the symbols 4 at a time and then compress them, because 13^4 = 28,561, which "shoehorns" into 15 bits instead of 16. The theoretical bit-packing limit for random data with 13 possible symbols per card is log(13)/log(2) = 3.70044 bits per card. However, there cannot be, for example, 5 kings in a deck. Each rank must appear exactly 4 times in every deck, so the entropy per symbol drops by about half a bit, to roughly 3.2.

OK, so here is my thinking. This data is not completely random. We know there are 4 of each rank in every block of 52 cards (which I call a shuffled deck), so we can make several assumptions and optimizations. One of them is that we don't have to encode the very last card, because we will know what it has to be. Another saving comes if we end on a single rank: for example, if the last 3 cards of the deck are 777, we wouldn't have to encode them, because the decoder would be counting cards up to that point, would see that all the other ranks have already been filled, and would assume the 3 "missing" cards are all 7s.

So my question for this site is: what other optimizations are possible to get an even smaller output file for this type of data, and if we use them, can we beat the theoretical (simple) bit-packing entropy of 3.70044 bits per symbol, or even approach the ultimate entropy limit of about 3.2 bits per symbol on average? If so, how?

When I use a ZIP-type program such as WinZip, I only see about 2:1 compression, which tells me it is just doing "lazy" bit packing down to 4 bits. If I "pre-compress" the data with my own bit packing first, it seems to do better, because when I then run it through the zip program I get a bit more than 2:1 overall. What I am thinking is: why not do all the compression myself, since I know more about the data than a zip program does? I would like to know whether I can beat the entropy "limit" of log(13)/log(2) = 3.70044 bits per symbol. I suspect I can, with the few "tricks" I mentioned and a few more I can probably discover. The output file of course does not have to be "human readable". As long as the encoding is lossless, it is valid.

Here is a link to 3 million human-readable shuffled decks (1 per line). Anyone can "practice" on a small subset of these lines and then move on to the entire file. I will keep updating my best (smallest) file size based on this data.

https://drive.google.com/file/d/0BweDAVsuCEM1amhsNmFITnEwd2s/view

By the way, in case you are interested in what type of card game this data is used for, here is the link to my active question (with a 300-point bounty). I have been told it is a hard problem to solve (exactly), because it would require a huge amount of data storage space. Some simulations do agree with the approximate probabilities, though. No purely mathematical solution has been provided yet. I guess it is too hard.

/math/1882705/probability-2-player-card-game-with-multiple-patterns-to-win-who-has-the-advant

I have a good algorithm that takes 168 bits to encode the first deck in the sample data. That data was generated randomly using the Fisher-Yates shuffle algorithm. It is truly random data, so my newly created algorithm seems to be working very well, which makes me happy.

Regarding the compression "challenge", I am currently at about 160 bits per deck. I think I can get down to maybe 158. Yes, I tried it and I got 158.43 bits per deck. I think I am getting close to the limit of my algorithm, so I succeeded in dropping below 166 bits per deck, but I failed to reach 156 bits, which would be 3 bits per card; still, it was a fun exercise. Perhaps in the future I will think of something to reduce each deck by an average of 2.43 bits or more.


8
If you are generating these shuffled decks yourself (rather than, say, describing the state of physical decks of cards), you don't need to store the deck at all; just store the RNG seed that generated it.
Jasonharper

3
Your description, and that in the answers, is very similar to the concept usually known as range encoding (en.wikipedia.org/wiki/Range_encoding). You adapt the probabilities after each card so that they reflect the remaining possible cards.
H. Idden '16

Comments are not for extended discussion; this conversation has been moved to chat.
Gilles 'SO- stop being evil'

Answers:


3

One more thing to consider: if you only care about compressing the complete set of several million decks, and you don't care what order they are in, you can gain extra encoding flexibility by discarding the ordering information about the set of decks. This would be the case, for example, if you need to load the set in order to enumerate all of the decks and process them, but don't care what order they are processed in.

You start by encoding each deck individually, as the other answers have described. Then sort those encoded values. Store a series of differences between the sorted encoded values (where the first difference starts from encoded deck "0"). Given a large number of decks, the differences will tend to be smaller than the full encoding range, so you can use some form of varint encoding to handle the occasional larger differences while still storing the smaller ones efficiently. The appropriate varint scheme will depend on how many decks you have in the set (and hence on the average difference size).
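A minimal sketch of this delta-plus-varint idea in Python, assuming each deck has already been encoded to an integer (for instance with the 166-bit encoding from the other answers); the function names are just illustrative:

def write_varint(n, out):
    # LEB128-style varint: 7 data bits per byte, high bit set on all but the last byte.
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return

def encode_collection(deck_codes):
    # deck_codes: one integer per deck; the ordering of the decks is deliberately discarded.
    out = bytearray()
    prev = 0
    for code in sorted(deck_codes):
        write_varint(code - prev, out)   # store the gap to the previous (sorted) value
        prev = code
    return bytes(out)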

Unfortunately I don't know how much this would help your compression, but I thought the idea might be worth considering.


1
Roughly speaking, if you have several million random decks, then the average difference will be about one several-millionth of the full range, which means you can expect to save about 20-odd bits per value. You lose a little of that to the varint encoding.
Steve Jessop

2
@DavidJames: if the specific order of the decks isn't important, only that it isn't biased, you can just re-shuffle the 3 million decks after decompression (i.e. don't change any of the decks, just change the order of the list of 3 million decks).
Steve Jessop

2
If the ordering information doesn't matter, this is just one more way to reduce the information content further. If it does matter, it doesn't apply and can be ignored. That said, if the only significance of the decks' ordering is that it is "random", then you can just re-randomize it after decompression, as @SteveJessop said.
Dan Bryant

@DavidJames: seeing that the first 173 of your decks start with KKKK, without looking at the other several million, and concluding that they all start with KKKK, would be a pretty silly thing to do. Especially when they are obviously in sorted order.
user253751 '16

3
@DavidJames: this data is compressed, and the decompression routine can re-randomize it if desired. "Some naive person" gets nothing at all; they wouldn't even work out how to interpret it as playing cards. It is not a flaw in a data storage format (a lossy one, in this case) that someone using it has to RTFM to get the correct data out.
Steve Jessop

34

Here is a complete algorithm that attains the theoretical limit.

Preamble: encoding integer sequences

A sequence of 13 integers "an integer with upper bound a-1, an integer with upper bound b-1, an integer with upper bound c-1, an integer with upper bound d-1, ..., an integer with upper bound m-1" can always be encoded with perfect efficiency.

  1. Take the first integer, multiply it by b, add the second, multiply the result by c, add the third, multiply the result by d, ... multiply the result by m, and add the thirteenth. This produces a unique number between 0 and abcdefghijklm-1.
  2. Write that number down in binary.

The reverse works too. Divide by m: the remainder is the thirteenth integer. Divide the result by l: the remainder is the twelfth integer. Carry on until you divide by b: the remainder is the second integer and the quotient is the first integer.
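A small sketch of this mixed-radix encoding in Python, where bounds holds the upper bounds a, b, c, ..., m in order:

def encode_sequence(values, bounds):
    # values[i] is an integer with 0 <= values[i] < bounds[i].
    n = 0
    for v, b in zip(values, bounds):
        n = n * b + v
    return n

def decode_sequence(n, bounds):
    # Invert encode_sequence by repeated division, last bound first.
    values = []
    for b in reversed(bounds):
        n, r = divmod(n, b)
        values.append(r)
    return list(reversed(values))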

So to encode your cards in the best possible way, all we have to do is find a perfect correspondence between sequences of 13 integers (with the given upper bounds) and the arrangements of your shuffled cards.

Here is how to do that.

The correspondence between shufflings and integer sequences

Start with 0 cards on the table in front of you.

Step 1

Take the four 2s from your pack and place them on the table.

What choices do you have? A card, or cards, can be placed either at the beginning of the sequence already on the table, or after any of the cards already in that sequence. In this case, this means there are 1 + 0 = 1 possible places where cards can go.

The total number of ways of placing 4 cards in 1 place is 1. Encode each such way as a number between 0 and 1-1. There is 1 such number.

I got 1 by considering the ways of writing 0 as the sum of 5 integers: it is 4 × 3 × 2 × 1 / 4!.

Step 2

Take the four 3s from your pack and place them on the table.

What choices do you have? A card, or cards, can be placed either at the beginning of the sequence already on the table, or after any of the cards already in that sequence. In this case, this means there are 1 + 4 = 5 possible places where cards can go.

The total number of ways of placing 4 cards in 5 places is 70. Encode each such way as a number between 0 and 70-1. There are 70 such numbers.

I got 70 by considering the ways of writing 4 as the sum of 5 integers: it is 8 × 7 × 6 × 5 / 4!.

Step 3

Take the four 4s from your pack and place them on the table.

What choices do you have? A card, or cards, can be placed either at the beginning of the sequence already on the table, or after any of the cards already in that sequence. In this case, this means there are 1 + 8 = 9 possible places where cards can go.

The total number of ways of placing 4 cards in 9 places is 495. Encode each such way as a number between 0 and 495-1. There are 495 such numbers.

I got 495 by considering the ways of writing 8 as the sum of 5 integers: it is 12 × 11 × 10 × 9 / 4!.

And so on, until...

Step 13

Take the four aces from your pack and place them on the table.

What choices do you have? A card, or cards, can be placed either at the beginning of the sequence already on the table, or after any of the cards already in that sequence. In this case, this means there are 1 + 48 = 49 possible places where cards can go.

The total number of ways of placing 4 cards in 49 places is 270,725. Encode each such way as a number between 0 and 270,725-1. There are 270,725 such numbers.

I got 270,725 by considering the ways of writing 48 as the sum of 5 integers: it is 52 × 51 × 50 × 49 / 4!.


This procedure produces a 1-1 correspondence between (a) shufflings of the cards, where you don't care about suits, and (b) sequences of integers, where the first is between 0 and 1-1, the second between 0 and 70-1, the third between 0 and 495-1, and so on up to the thirteenth, which is between 0 and 270,725-1.

Referring back to "Encoding integer sequences", you can see that such a sequence of integers is in 1-1 correspondence with the numbers between 0 and (1 × 70 × 495 × ⋯ × 270725) - 1. And if you look at the "product divided by a factorial" expression for each of those integers (given in italics at the end of each step), you will see that this means the numbers between 0 and

$$\frac{52!}{(4!)^{13}} - 1,$$

which my earlier answer showed was the best possible.

So we have a perfect method for compressing your shuffled cards.


The algorithm

Precompute a list of all the ways of writing 0 as the sum of 5 integers, of writing 4 as the sum of 5 integers, of writing 8 as the sum of 5 integers, ..., of writing 48 as the sum of 5 integers. The longest list has 270,725 elements, so it isn't particularly big. (Strictly speaking the precomputation isn't absolutely necessary, because you can easily synthesise each list as and when you need it. I tried it with Microsoft QuickBasic, and even running through the 270,725-element list was faster than the eye could see.)
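A sketch of that precomputation in Python, generating each "ways of writing 4k as the sum of 5 non-negative integers" list in lexicographic order (any fixed order works, since only the position of a 5-tuple within its list matters):

def compositions(total, parts):
    # All ways of writing `total` as an ordered sum of `parts` non-negative integers.
    if parts == 1:
        return [(total,)]
    result = []
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            result.append((first,) + rest)
    return result

# One list per step: writing 0, 4, 8, ..., 48 as the sum of 5 integers.
step_lists = [compositions(4 * step, 5) for step in range(13)]
# len(step_lists[1]) == 70, len(step_lists[2]) == 495, ..., len(step_lists[12]) == 270725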

To convert from a shuffling to a sequence of integers:

The 2s contribute nothing, so let's ignore them. Write down a number between 0 and 1-1.

The 3s: how many 2s are there before the first 3? How many before the second? The third? The fourth? How many after the fourth? The answers are 5 integers that obviously add up to 4. So look up that sequence of 5 integers in the "writing 4 as the sum of 5 integers" list and note its position in that list. That will be a number between 0 and 70-1. Write it down.

The 4s: how many 2s or 3s are there before the first 4? How many before the second? The third? The fourth? How many after the fourth? The answers are 5 integers that obviously add up to 8. So look up that sequence of 5 integers in the "writing 8 as the sum of 5 integers" list and note its position in that list. That will be a number between 0 and 495-1. Write it down.

And so on, until...

The aces: how many non-ace cards are there before the first ace? How many before the second? The third? The fourth? How many after the fourth? The answers are 5 integers that obviously add up to 48. So look up that sequence of 5 integers in the "writing 48 as the sum of 5 integers" list and note its position in that list. That will be a number between 0 and 270,725-1. Write it down.

You have now written down 13 integers. Encode them (as described at the start) into a single number between 0 and $\frac{52!}{(4!)^{13}} - 1$. Write that number out in binary. It will take just under 166 bits.

This is the best possible compression, because it attains the information-theoretic limit.

Decompression is straightforward: go from the big number back to the sequence of 13 integers, and then use them to build up the sequence of cards as already described.


Comments are not for extended discussion; this conversation has been moved to chat.
DW

This solution is unclear to me and incomplete. It doesn't show how to actually obtain the 166-bit number and decode it back into the deck. It is not easy for me to conceive, so I don't know how to implement it. Your step formulas basically just break 52!/(4!)^13 into 13 pieces, which doesn't help me much. I think it would have helped if you had made a diagram for step 2 showing the 70 possible ways to arrange the cards. Your solution is too abstract for my brain to take in and process. I prefer actual examples and illustrations.
David James

23

I suggest that, rather than trying to encode each card separately in 3 or 4 bits, you encode the state of the entire deck in 166 bits. As Martin Kochanski explains, there are fewer than 2^166 possible arrangements of the cards when suits are ignored, which means the state of the entire deck can be stored in 166 bits.

How do you do this compression and decompression algorithmically, in an efficient way? I suggest using lexicographic ordering and binary search. This lets you do the compression and decompression efficiently (in both space and time), without needing a huge lookup table or other unrealistic assumptions.

In more detail: let's order decks according to the lexicographic order on their uncompressed representation, i.e., a deck is represented in uncompressed form as a string such as 22223333444455556666777788889999TTTTJJJJQQQQKKKKAAAA, and decks are ordered by lexicographic order on these strings. Now suppose you have a procedure that, given a deck D, counts the number of decks that come before it (in lexicographic order). Then you can use this procedure to compress a deck: given a deck D, compress it to a 166-bit number by counting the number of decks that come before it and then outputting that number. That number is the compressed representation of the deck.

To decompress, use binary search. Given a number n, you want to find the nth deck in the lexicographic ordering of all decks. You can do this with a procedure along the lines of binary search: pick a candidate deck D0, count the number of decks before D0, and compare that to n. This tells you whether D0 needs to be adjusted to come later or earlier. I suggest you try to get the symbols right one at a time: if you are trying to recover a string like 22223333444455556666777788889999TTTTJJJJQQQQKKKKAAAA, first search to find what to use as the first symbol of the string (just try all 13 possibilities, or use binary search over the 13 possibilities); then, once you have found the right value for the first symbol, search to find the second symbol, and so on.

All that remains is to work out an efficient procedure for counting the number of decks that come before D in lexicographic order. This looks like a straightforward but tedious combinatorial exercise. In particular, I suggest you build a subroutine for the following problem: given a prefix (such as 222234), count the number of decks that start with that prefix. This looks like a very easy exercise in binomial coefficients and factorials. Then you can invoke that subroutine a few times to count the number of decks that come before D.
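A sketch of those two pieces in Python, assuming a deck is a 52-character string over "23456789TJQKA" with exactly four of each rank, and taking "lexicographic" to mean the rank order in that string:

from math import factorial

RANKS = "23456789TJQKA"

def arrangements(counts):
    # Number of decks starting with a given prefix = number of distinct
    # orderings of the multiset of cards that remain after the prefix.
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

def decks_before(deck):
    # Count the decks that come before `deck` in lexicographic order.
    remaining = [4] * 13
    smaller = 0
    for card in deck:
        idx = RANKS.index(card)
        for j in range(idx):                       # put any smaller remaining rank here...
            if remaining[j] > 0:
                remaining[j] -= 1
                smaller += arrangements(remaining) # ...followed by any completion
                remaining[j] += 1
        remaining[idx] -= 1
    return smaller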


Comments are not for extended discussion; this conversation has been moved to chat.
DW

8

The number of possible arrangements of the cards, ignoring suits, is

$$\frac{52!}{(4!)^{13}},$$

whose logarithm, base 2, is 165.976, or 3.1919 bits per card, which is better than the limit you gave.

Any fixed "bits per card" encoding doesn't make sense, since, as you note, the last card can always be encoded in 0 bits, and in many cases the last few cards can be too. That means that for quite a way towards the "tail" of the pack, each card needs far fewer bits than you might think.

By far the best way of compressing the data would be to find 59 more bits' worth of data (59.6 bits, actually) that you want to pack together with the card data anyway, and, writing those 59 bits as a 13-digit number in base 24 (= 4!), assign a suit to each card (one digit chooses among the 4! ways of assigning suits to the aces, another digit does the same for the kings, and so on). You then have a pack of 52 fully distinct cards, and the 52! possibilities really can be encoded in 225.58 bits very easily.

But doing it without the opportunity of encoding those extra bits is also possible to some extent, and I will think about it, as I'm sure everyone else is. Thanks for a really interesting problem!
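For anyone who wants to check these figures, a quick Python calculation:

from math import factorial, log2

deals = factorial(52) // factorial(4) ** 13   # arrangements ignoring suits
print(log2(deals))                            # about 165.976 bits, i.e. 3.1919 per card
print(log2(factorial(52)))                    # about 225.58 bits for 52 fully distinct cards
print(log2(factorial(52)) - log2(deals))      # about 59.6 bits of suit information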


1
Could something akin to ciphertext stealing be used here? That is, the data you encode in those extra 59 bits would be the last 59 bits of the encoded representation?
John Dvorak

@JanD I was thinking of looking into something like that. But then it turned out that an algorithm exists that attains the theoretical limit, is straightforward, and is 100% reliable, so there was no point in looking any further.
Martin Kochanski

@MartinKochanski - I wouldn't word it as "ignoring the suits", because we are still honoring the 4 suits per rank. Better wording might be "the number of distinct possible arrangements of the deck is..."
David James

3

This is a long-solved problem.

When you deal out a 52-card deck, every card you deal has one of up to 13 ranks, with known probabilities. The probabilities change with each card dealt. This is handled optimally using an ancient technique called adaptive arithmetic coding, an improvement on Huffman coding. Usually it is used for known, unchanging probabilities, but it can just as well be used for changing probabilities. Read the Wikipedia article on arithmetic coding:

https://zh.wikipedia.org/wiki/算术编码
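A sketch of the adaptive model this answer has in mind: the probability assigned to each rank at any point is simply its remaining count divided by the number of cards left, and these counts are what would drive an arithmetic (or range) coder. The coder itself is not shown here:

def rank_probabilities(remaining):
    # remaining[i] is how many cards of rank i have not been dealt yet.
    total = sum(remaining)
    return [c / total for c in remaining]

def deal_and_update(remaining, rank):
    # After coding a card of `rank`, update the model for the next card.
    remaining[rank] -= 1

counts = [4] * 13                      # a fresh deck: 4 of each of the 13 ranks
probs = rank_probabilities(counts)     # every rank starts at probability 4/52
deal_and_update(counts, 0)             # a 2 is dealt; its probability drops to 3/51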


OK, but that doesn't answer my question of whether it can approach, match, or beat the theoretical entropy-coding limit. It seems that, since there are n possible decks, each with probability 1/n, entropy coding is the limit and we cannot do better (unless we "cheat" and tell the decoder something about the encoder's input data in advance).
David James

3

Both DW and Martin Kochanski have already described algorithms for constructing a bijection between deals and the integers in the range [0, 52!/(4!)^13), but it seems that neither of them has reduced the problem to its simplest form. (See note 1.)

Suppose we have a (partial) deck described by the ordered list $a$, where $a_i$ is the number of cards of type $i$. In the OP, the initial deck is described by a 13-element list, each element of which is 4. The number of distinct shuffles of such a deck is

$$c(a) = \frac{\left(\sum_i a_i\right)!}{\prod_i a_i!}$$

This is a simple generalization of binomial coefficients, and indeed it can be proven by simply arranging the objects one type at a time, as Martin Kochanski suggests. (See note 2 below.)

Now, from any such (partial) deck, we can select one card at a time from a shuffle, using any $i$ for which $a_i > 0$. The number of unique shuffles beginning with $i$ is

$$\begin{cases} 0 & \text{if } a_i = 0 \\ c(a_1, \dots, a_{i-1}, a_i - 1, a_{i+1}, \dots, a_n) & \text{if } a_i > 0. \end{cases}$$

By the formula above, we have

$$c(a_1, \dots, a_{i-1}, a_i - 1, a_{i+1}, \dots, a_n) = \frac{a_i\, c(a)}{\sum_i a_i}$$

We can then recurse (or iterate) through the deck until the shuffle is complete, observing that at each step the number of shuffles of the remaining (partial) deck that begin with a card lexicographically smaller than the card actually dealt at position $i$ is

$$c(a)\,\frac{\sum_{j=1}^{i-1} a_j}{\sum_{j=1}^{n} a_j}$$

I wrote this program in Python to illustrate the algorithm. Python is as reasonable a pseudocode as any. Note that most of the arithmetic involves extended precision: the values k (the ordinal representing a shuffle) and n (the total number of possible shuffles of the remaining partial deck) are both 166-bit bignums. To translate the code into another language it will be necessary to use some kind of bignum library.

Also, I just use a list of integers rather than card names, and, unlike the mathematics above, the integers here are 0-based.

To encode a shuffle, we walk through it, accumulating at each point the number of shuffles which start with a smaller card, using the formula above:

from math import factorial
T = factorial(52) // factorial(4) ** 13

def encode(vec):
    a = [4] * 13
    cards = sum(a)
    n = T
    k = 0
    for idx in vec:
        k += sum(a[:idx]) * n // cards
        n = a[idx] * n // cards
        a[idx] -= 1
        cards -= 1
    return k

Decoding the 166-bit number is just as simple. At each step we have the description of a partial deck and an ordinal. We need to skip over the shuffles that start with cards smaller than the one corresponding to the ordinal, then we compute and output the selected card, remove it from the remaining deck, and adjust the number of possible shuffles with the selected prefix:

def decode(k):
    vec = []
    a = [4] * 13
    cards = sum(a)
    n = T
    while cards > 0:
        i = cards * k // n
        accum = 0
        for idx in range(len(a)):
            if i < accum + a[idx]:
                k -= accum * n // cards
                n = a[idx] * n // cards
                a[idx] -= 1
                vec.append(idx)
                break
            accum += a[idx]
        cards -= 1
    return vec

I haven't really tried to optimize the code above. I did run it against the entire 3mil.TXT file, checking that decode(encode(line)) reproduced the original line. It took just under 300 seconds. (Seven of the lines can be seen in an online test on ideone.) Rewriting it in a lower-level language, and optimizing the division (which is possible), would probably bring that time down to something manageable.

Since the encoded value is simply an integer, it can be output in 166 bits. There is no value in deleting leading zeros, since then there would be no way to know where one encoding ended; so it really is a 166-bit encoding.

However, it is worth noting that in a practical application it is probably never necessary to encode a shuffle at all. A random shuffle can be produced by generating a 166-bit random number and decoding it. Nor is it really necessary that all 166 bits be random; you could, for example, start with a 32-bit random integer and then fill out the 166 bits using any standard RNG seeded with that 32-bit number. So if the goal is merely to be able to store a large number of random shuffles reproducibly, you can reduce the per-deal storage requirement more or less arbitrarily.
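For instance, a usage sketch relying on the decode function and the total count T defined above:

import random

def random_shuffle():
    # Draw a uniform ordinal in [0, T) and decode it into a deck of rank indices.
    return decode(random.randrange(T))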

If you need to encode a large number of actual deals (generated in some other way), but don't care about the order of the deals, you can delta-encode the sorted list of numbers, saving approximately log2 N bits per deal. (The saving comes from the fact that a sorted sequence has lower entropy than an unsorted sequence. It does not reduce the entropy of any single value in the sequence.)

Supposing we need to encode a sorted list of N k-bit numbers, we can proceed as follows:

  1. Choose p as an integer close to log2 N (either the floor or the ceiling will work; I usually go for the ceiling).

  2. We implicitly divide the range of numbers into 2^p intervals by their binary prefix. Each k-bit number is split into a p-bit prefix and a (k-p)-bit suffix; we write out only the suffixes (in order). This requires N(k-p) bits.

  3. In addition, we create a bit sequence: for each of the 2^p prefixes in turn, we write a 0 for every number having that prefix (if any), followed by a 1. This sequence obviously has 2^p + N bits: 2^p 1s and N 0s.

To decode the numbers, we start a prefix counter at 0 and work through the bit sequence. When we see a 0, we output the current prefix together with the next suffix from the suffix list. When we see a 1, we increment the current prefix.

The total length of the encoding is N(k-p) + N + 2^p, which is very close to N(k-p) + N + N = N(k-p+2), for an average of k-p+2 bits per value.
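A sketch of this scheme, returning the suffix list and the marker bit sequence separately for clarity:

def encode_sorted(numbers, k, p):
    # `numbers` must be sorted, each in [0, 2**k).
    suffixes = [x & ((1 << (k - p)) - 1) for x in numbers]   # low k-p bits, in order
    markers = []
    i = 0
    for prefix in range(1 << p):
        while i < len(numbers) and (numbers[i] >> (k - p)) == prefix:
            markers.append(0)      # one 0 per number carrying this prefix
            i += 1
        markers.append(1)          # close off this prefix
    return suffixes, markers

def decode_sorted(suffixes, markers, k, p):
    numbers = []
    prefix = 0
    it = iter(suffixes)
    for bit in markers:
        if bit == 0:
            numbers.append((prefix << (k - p)) | next(it))
        else:
            prefix += 1
    return numbers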

Notes

  1. $\frac{52!}{(4!)^{13}}$ is 92024242230271040357108320801872044844750000000000, and its logarithm base 2 is approximately 165.9765. In this answer I sometimes pretend that this logarithm is really 166; in the case of generating random ordinals within the range, a rejection algorithm can be used, and it will only very rarely reject a generated random number.
  2. For convenience, I write $S_k = \sum_{i=k}^{n} a_i$; then the $a_1$ objects of type 1 can be placed in $\binom{S_1}{a_1}$ ways, after which the objects of type 2 can be placed in $\binom{S_2}{a_2}$ ways, and so on. Since $\binom{S_i}{a_i} = \frac{S_i!}{a_i!\,(S_i - a_i)!} = \frac{S_i!}{a_i!\,S_{i+1}!}$, that leads to the total count

$$\prod_{i=1}^{n} \frac{S_i!}{a_i!\,S_{i+1}!}$$

which simplifies to the formula above.


Comments are not for extended discussion; this conversation has been moved to chat.
D.W.

@rici - I gave you the +100 bounty cuz you explained your answer in what seems like a better presentation including code while the other answers are more abstract/theoretical, leaving out some details of how to actually implement the encode/decode. As you may know, there are many details when writing code. I admit my algorithm is not the most straightforward, simple, easy to understand either but I actually got it working without much effort and over time I can get it running faster with more compression. So thanks for your answer and keep up the good work.
David James

2

As an alternate solution to this problem, my algorithm uses compound fractional (non integer) bits per card for groups of cards in the deck based on how many unfilled ranks there are remaining. It is a rather elegant algorithm. I checked my encode algorithm by hand and it is looking good. The encoder is outputting what appear to be correct bitstrings (in byte form for simplicity).

The overview of my algorithm is that it uses a combination of groups of cards and compound fractional bit encoding. For example, in my shared test file of 3 million shuffled decks, the first deck starts with the 7 cards 54A236J. The reason I chose a 7 card block size when 13 ranks of cards are possible is because 13^7 "shoehorns" (fits snugly) into 26 bits, since 13^7 = 62,748,517 and 2^26 = 67,108,864. By comparison, packing 4 cards at a time into 15 bits (13^4 = 28,561 and 2^15 = 32,768) costs 15/4 = 3.75 bits per card, whereas 26/7 = 3.714. So the number of bits per card is slightly lower if we use the 26/7 packing method.

So looking at 54A236J, we simply look up the ordinal position of those ranks in our master "23456789TJQKA" list of sorted ranks. For example, the first actual card rank of 5 has a lookup position in the rank lookup string of 4. We just treat these 7 rank positions as a base 13 number starting with 0 (so the position 4 we previously got will actually be a 3). Converted back to base 10 (for checking purposes), we get 15,565,975. In 26 bits of binary we get 00111011011000010010010111.

The decoder works in a very similar way. It takes (for example) that string of 26 bits and converts it back to decimal (base 10) to get 15,565,975, then converts it to base 13 to get the offsets into the rank lookup string, then it reconstructs the ranks one at a time and gets back the original first 7 cards, 54A236J. Note that the block size in bits won't always be 26, but it will always start out at 26 in each deck. The encoder and decoder both have some important information about the deck data even before they operate. That is one exceptionally nice thing about this algorithm.
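A sketch of this 7-card block encode/decode (13-rank mode only), using the same "23456789TJQKA" lookup string described above:

RANKS = "23456789TJQKA"

def pack_block(cards):
    # Treat the 7 rank positions as one base-13 number; 13**7 fits in 26 bits.
    value = 0
    for c in cards:
        value = value * 13 + RANKS.index(c)
    return format(value, "026b")

def unpack_block(bits):
    # Convert the 26 bits back to an integer, then peel off base-13 digits.
    value = int(bits, 2)
    cards = []
    for _ in range(7):
        value, r = divmod(value, 13)
        cards.append(RANKS[r])
    return "".join(reversed(cards))

# pack_block("54A236J") == "00111011011000010010010111"   (decimal 15,565,975)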

Each # of ranks remaining (such as 13, 12, 11, ..., 2, 1) has its own group size and cost (# of bits per card). These were found experimentally, just playing around with powers of 13, 12, 11, ... and powers of 2. I already explained how I got the group size for when we can see 13 ranks, so how about when we drop to 12 unfilled ranks? Same method. Look at the powers of 12 and stop when one of them comes very close to a power of 2 but just slightly under it. 12^5 = 248,832 and 2^18 = 262,144. That is a pretty tight fit. The number of bits encoding this group is 18/5 = 3.6. In the 13 rank group it was 26/7 = 3.714, so as you can see, as the number of unfilled ranks decreases (ranks are filling up, such as 5555 or 3333), the number of bits needed to encode the cards decreases.

Here is my complete list of costs (# of bits per card) for all possible # of ranks to be seen:

13    26/7=3.714=3  5/7
12    18/5=3.600=3  3/5
11      7/2=3.500=3  1/2
10    10/3=3.333=3  1/3
  9    16/5=3.200=3  1/5
  8      3/1=3.000=3
  7    17/6=2.833=2  5/6
  6    13/5=2.600=2  3/5
  5      7/3=2.333=2  1/3
  4      2/1=2.000=2
  3      5/3=1.667=1  2/3
  2      1/1=1.000=1
  1      0/1..4=0.0=0
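These costs come from comparing powers of the remaining rank count with powers of 2, as described above. A small sketch for playing around with them (shown for 12 remaining ranks; the best row there is 5 cards in 18 bits, i.e. 3.6 bits per card, matching the table):

from math import ceil, log2

def block_costs(r, max_group=7):
    # For r possible ranks, list (group size, bits needed, bits per card).
    rows = []
    for group in range(1, max_group + 1):
        bits = ceil(log2(r ** group))   # smallest whole number of bits that can hold r**group values
        rows.append((group, bits, bits / group))
    return rows

for group, bits, per_card in block_costs(12):
    print(group, bits, round(per_card, 3))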

So as you can clearly see, as the number of unfilled ranks decreases (which it will do in every deck), the number of bits needed to encode each card also decreases. You might be wondering what happens if we fill a rank but are not yet done with a group. For example, if the first 7 cards in the deck were 5,6,7,7,7,7,K, what should we do? Easy: the K would normally drop the encoder from 13 rank encoding mode to 12 rank encoding mode. However, since we haven't yet filled the first block of 7 cards in 13 rank encoding mode, we include the K in that block to complete it. There is very little waste this way. There are also cases where, while we are trying to fill a block, the # of filled ranks bumps up by 2 or even more. That is also no problem, as we just fill the block in the current encoding mode and then pick up in the new encoding mode, which may be 1, 2, 3, ... ranks lower, or we may even stay in the same mode (as was the case in the first deck in the datafile, where there are 3 full blocks in the 13 rank encoding mode). This is why it is important to keep the block sizes reasonable, such as between size 1 and 7. If we made it size 20, for example, we would have to fill that block at a higher bitrate than if we let the encoder transition into a more efficient encoding mode (encoding fewer ranks).

When I ran this algorithm (by hand) on the first deck of cards in the data file (which was created using Fisher-Yates unbiased shuffle), I got an impressive 168 bits to encode which is almost identical to optimal binary encoding but requires no knowledge of ordinal positions of all possible decks, no very large numbers, and no binary searches. It does however require binary manipulations and also radix manipulations (powers of 13,12,11...).

Notice also that when the number of unfilled ranks = 1, the overhead is 0 bits per card. Best case (for encoding) is we want the deck to end on a run of the same cards (such as 7777) cuz those get encoded for "free" (no bits required for those). My encode program will suppress any output when the remaining cards are all the same rank. This is cuz the decoder will be counting cards for each deck and will know that if, after seeing card 48, some rank (like 7) has not yet been seen, all 4 remaining cards MUST be 7s. If the deck ends on a pair (such as 77), triple/set (such as 777) or a quad (such as 7777), we get additional savings for that deck using my algorithm.

Another "pretty" thing about this algorithm is that it never needs to use any numbers larger than 32 bit so it wont cause problems in some languages that "don't like" large numbers. Actually the largest numbers need to be on the order of 226 which are used in the 13 rank encoding mode. From there they just get smaller. In fact, if I really wanted to, I could make the program so that it doesn't use anything larger than 16 bit numbers but this is not necessary as most computer languages can easily handle 32 bits well. Also this is beneficial to me since one of the bit functions I am using maxes out at 32 bit. It is a function to test if a bit is set or not.

In the first deck in the datafile, the encoding of cards is as follows (diagram to come later). Format is (groupsize, bits, rank encode mode):

(7,26,13) First 7 cards take 26 bits to encode in 13 rank mode.
(7,26,13)
(7,26,13)
(5,18,12)
(5,18,12)
(3,10,10)
(3,  9,  8)
(6,17,  7)
(5,13,  6)
(3,  5,  3)
(1,  0,  1)

This is a total of 52 cards and 168 bits for an average of about 3.23 bits per card. There is no ambiguity in either the encoder or the decoder. Both count cards and know which encode mode to use/expect.

Also notice that 18 cards, (more than 1/3rd of the deck), are encoded BELOW the 3.2 bits per card "limit". Unfortunately those are not enough cards to bring the overall average below about 3.2 bits per card. I imagine in the best case or near best case (where many ranks fill up early such as 54545454722772277...), the encoding for that particular deck might be under 3 bits per card, but of course it is the average case that counts. I think best case might be if all the quads are dealt in order which might never happen if given all the time in the universe and the fastest supercomputer. Something like 22223333444455556666777788889999TTTTJJJJQQQQKKKKAAAA. Here the rank encode mode would drop fast and the last 4 cards would have 0 bits of overhead. This special case takes only 135 bits to encode.

Also one possible optimization I am considering is to take all the ranks that have only 1 card remaining and treat those all as a special "rank" by placing them in a single "bucket". The reason here is if we do that, the encoder can drop into a more efficient packing mode quicker. For example, if we are in 10 rank encoding mode but we only have one more each of ranks 3, 7, and K, those cards have much less chance of appearing than the other cards so it doesn't make much sense to treat them the same. If instead I dropped to 8 rank encoding mode, which is more efficient than 10 rank mode, perhaps I could use fewer bits for that deck. When I see one of the cards in that special "grouped" bucket of several cards, I would just output that special "rank" (not a real rank but just an indicator we just saw something in that special bucket) and then a few more bits to tell the decoder which card in the bucket I saw, then I would remove that card from the group (since it just filled up). I will trace this by hand to see if any bit savings is possible using it. Note there should be no ambiguity using this special bucket because both the encoder and decoder will be counting cards and will know which ranks have only 1 card remaining. This is important because it makes the encoding process more efficient when the decoder can make correct assumptions without the encoder having to pass extra messages to it.

Here is the first full deck in the 3 million deck data file and a trace of my algorithm on it showing both the block groupings and the transitions to a lower rank encoding mode (like when transitioning from 13 to 12 unfilled ranks) as well as how many bits are needed to encode each block. x and y are used for 11 and 10 respectively because unfortunately they happened on neighboring cards and don't display well juxtaposed.

         26             26             26            18         18       10      9          17           13        5     0
    54A236J  87726Q3  3969AAA  QJK7T  9292Q  36K  J57   T8TKJ4  48Q8T  55K  4
13                                            12                    xy     98         7              6        543     2 1  0

Note that there is some inefficiency when the encode mode wants to transition early in a block (when the block is not yet completed). We are "stuck" encoding that block at a slightly higher bit level. This is a tradeoff. Because of this and because I am not using every possible combination of the bit patterns for each block (except when it is an integer power of 2), this algorithm cannot be optimal but can approach 166 bits per deck. The average on my datafile is around 175. The particular deck was "well behaved" and only required 168 bits. Note that we only got a single 4 at the end of the deck but if instead we got all four 4s there, that is a better case and we would have needed only 161 bits to encode that deck, a case where the packing actually beats the entropy of a straight binary encode of the ordinal position of it.

I now have the code implemented to calculate the bit requirements and it is showing me, on average, about 175 bits per deck with a low of 155 and a high of 183 for the 3 million deck test file. So my algorithm seems to use 9 extra bits per deck vs. the straight binary encode of the ordinal position method. Not too bad at only 5.5% additional storage space required. 176 bits is exactly 22 bytes so that is quite a bit better than 52 bytes per deck. Best case deck (didn't show up in the 3 million deck test file) packs to 136 bits and worst case deck (which did show up in the test file 8206 times) is 183 bits. Analysis shows worst case is when we don't get the first quad until close to (or at) card 40. Then as the encode mode wants to drop quickly, we are "stuck" filling blocks (as large as 7 cards) in a higher bit encoding mode. One might think that not getting any quads until card 40 would be quite rare using a well shuffled deck, but my program is telling me it happened 321 times in the test file of 3 million decks, so that is about 1 out of every 9346 decks. That is more often than I would have expected. I could check for this case and handle it with fewer bits, but it is so rare that it wouldn't affect the average bits enough.

Also here is something else very interesting. If I sort the decks on the raw deck data, the length of prefixes that repeat a significant # of times is only about length 6 (such as 222244). However with the packed data, that length increases to about 16. That means if I sort the packed data, I should be able to get a significant savings by just indicating to the decoder a 16 bit prefix and then just output the remainder of the decks (minus the repeating prefix) that have that same prefix, then go onto the next prefix and repeat. Assuming I save even just 10 bits per deck this way, I should beat the 166 bits per deck. With the enumeration technique stated by others, I am not sure if the prefix would be as long as with my algorithm. Also the packing and unpacking speed using my algorithm is surprisingly good. I could make it even faster too by storing powers of 13, 12, 11, ... in an array and using those instead of expressions like 13^5.

Regarding the 2nd level of compression where I sort the output bitstrings of my algorithm then use "difference" encoding: A very simple method would be to encode the 61,278 unique 16 bit prefixes that show up at least twice in the output data (and a maximum of 89 times reported) simply as a leading bit of 0 in the output to indicate to the 2nd level decompressor that we are encoding a prefix (such as 0000111100001111) and then any packed decks with that same prefix will follow with a 1 leading bit to indicate the non prefix part of the packed deck. The average # of packed decks with the same prefix is about 49 for each prefix, not including the few that are unique (only 1 deck has that particular prefix). It appears I can save about 15 bits per deck using this simple strategy (storing the common prefixes once). So assuming I really do get 15 bit saving per deck and I am already at about 175 bits per deck on the first level packing/compression, that should be a net of about 160 bits per deck, thus beating the 166 bits of the enumeration method.

After the 2nd level of compression using difference (prefix) encoding of the sorted bitstring output of the first encoder, I am now getting about 160 bits per deck. I use length 18 prefix and just store it intact. Since almost all (245013 out of 262144 = 93.5%) of those possible 18 bit prefixes show up, it would be even better to encode the prefixes. Perhaps I can use 2 bits to encode what type of data I have. 00 = regular length 18 prefix stored, 01= "1 up prefix" (same as previous prefix except 1 added), 11 = straight encoding from 1st level packing (approx 175 bits on average). 10=future expansion when I think of something else to encode that will save bits.

Did anyone else beat 160 bits per deck yet? I think I can get mine a little lower with some experimenting and using the 2 bit descriptors I mentioned above. Perhaps it will bottom out at 158ish. My goal is to get it to 156 bits (or better) because that would be 3 bits per card or less. Very impressive. Lots of experimenting to get it down to that level because if I change the first level encoding then I have to retest which is the best 2nd level encoding and there are many combinations to try. Some changes I make may be good for other similar random data but some may be biased towards this dataset. Not really sure but if I get the urge I can try another 3 million deck dataset to see what happens like if I get similar results on it.

One interesting thing (of many) about compression is you are never quite sure when you have hit the limit or are even approaching it. The entropy limit tells us how many bits we need if ALL possible occurrences of those bits occur about equally, but as we know, in reality, that rarely happens with a large number of bits and a (relatively) small # of trials (such as 3 million random decks vs. almost 10^50 possible combinations of 166 bits).

Does anyone have any ideas on how to make my algorithm better like what other cases I should encode that would reduce bits of storage for each deck on average? Anyone?

2 more things: 1) I am somewhat disappointed that more people didn't upvote my solution which although not optimal on space, is still decent and fairly easy to implement (I got mine working fine). 2) I did analysis on my 3 million deck datafile and noticed that the most frequently occurring card where the 1st rank fills (such as 4444) is at card 26. This happens about 6.711% of the time (for 201322 of the 3 million decks). I was hoping to use this info to compress more such as start out in 12 symbol encode mode since we know on average we wont see every rank until about middeck but this method failed to compress any as the overhead of it exceeded the savings. I am looking for some tweaks to my algorithm that can actually save bits.

So does anyone have any ideas what I should try next to save a few bits per deck using my algorithm? I am looking for a pattern that happens frequently enough so that I can reduce the bits per deck even after the extra overhead of telling the decoder what pattern to expect. I was thinking something with the expected probabilities of the remaining unseen cards and lumping all the single card remaining ones into a single bucket. This will allow me to drop into a lower encode mode quicker and maybe save some bits but I doubt it.

Also, F.Y.I., I generated 10 million random shuffles and stored them in a database for easy analysis. Only 488 of them end in a quad (such as 5555). If I pack just those using my algorithm, I get 165.71712 bits on average with a low of 157 bits and a high of 173 bits. Just slightly below the 166 bits using the other encoding method. I am somewhat surprised at how infrequent this case is (about 1 out of every 20,492 shuffles on average).


3
I notice that you've made about 24 edits in the space of 9 hours. I appreciate your desire to improve your answer. However, each time you edit the answer, it bumps this to the top of the front page. For that reason, we discourage excessive editing. If you expect to make many edits, would it be possible to batch up your edits, so you only make one edit every few hours? (Incidentally, note that putting "EDIT:" and "UPDATE:" in your answer is usually poor style. See meta.cs.stackexchange.com/q/657/755.)
D.W.

4
This is not the place to put progress reports, status updates, or blog items. We want fully-formed answers, not "coming soon" or "I have a solution but I'm not going to describe what it is".
D.W.

3
If someone is interested he will find the improved solution. The best way is to wait for full answer and post it then. If you have some updates a blog would do. I do not encourage this, but if you really must (I do not see valid reason why) you can write comment below your post and merge later. I also encourage you to delete all obsolete comments and incorporate them into one seamless question - it gets hard to read all. I try to make my own algorithm, different than any presented, but I am not happy with the results - so I do not post partials to be edited - the answer box is for full ones.
Evil

3
@DavidJames, I do understand. However, that still doesn't change our guidelines: please don't make so many edits. (If you'd like to propose improvements to the website, feel free to make a post on our Computer Science Meta or on meta.stackexchange.com suggesting it. Devs don't read this comment thread.) But in the meantime, we work with the software we have, and making many edits is discouraged because it bumps the question to the top. At this point, limiting yourself to one edit per day might be a good guideline to shoot for. Feel free to use offline editors or StackEdit if that helps!
D.W.

3
I'm not upvoting your answer for several reasons. 1) it is needlessly long and FAR too verbose. You could drastically reduce its presentation. 2) there are better answers posted, which you choose to ignore for reasons unbeknownst to me. 3) asking about lack of upvotes is usually a "red flag" to me. 4) This has constantly remained on the front page due to an INSANE amount of edits.
Nicholas Mancuso