Is it possible to compress random shuffled-deck data?


19

I have real data I am using for a simulated card game. I am only interested in the ranks of the cards, not the suits. However, it is a standard 52-card deck, so there can only be 4 of each rank in the deck. The deck is shuffled well for each hand, and then I output the entire deck to a file. So there are only 13 possible symbols in the output file: 2, 3, 4, 5, 6, 7, 8, 9, T, J, Q, K, A (T = ten). Of course we could bit-pack them using 4 bits per symbol, but then we waste 3 of the 16 possible encodings. We can do better if we group the symbols 4 at a time and then compress them, because 13^4 = 28,561, which "shoehorns" into 15 bits instead of 16. The theoretical bit-packing limit for random data with 13 possible symbols per card is log(13)/log(2) = 3.70044 bits per card. However, there cannot be, for example, 5 kings in a deck. Each rank must appear exactly 4 times in every deck, so the entropy per symbol drops by about half a bit, to roughly 3.2.

OK, so here is my thinking. This data is not completely random. We know there are 4 of each rank in every block of 52 cards (which I call a shuffled deck), so we can make several assumptions and optimizations. One of them is that we don't have to encode the very last card, because we will know what it has to be. Another saving comes if we end on a single rank: for example, if the last 3 cards of the deck are 777, we wouldn't have to encode them, because the decoder would be counting cards up to that point, would see that all the other ranks have already been filled, and would assume the 3 "missing" cards are all 7s.

So my question for this site is: what other optimizations are possible to get an even smaller output file for this type of data, and if we use them, can we beat the theoretical (simple) bit-packing entropy of 3.70044 bits per symbol, or even approach the ultimate entropy limit of about 3.2 bits per symbol on average? If so, how?

When I use a ZIP-type program such as WinZip, I only see about 2:1 compression, which tells me it is just doing "lazy" bit packing down to 4 bits. If I "pre-compress" the data with my own bit packing first, it seems to do better, because when I then run it through the zip program I get a bit more than 2:1 overall. What I am thinking is: why not do all the compression myself, since I know more about the data than a zip program does? I would like to know whether I can beat the entropy "limit" of log(13)/log(2) = 3.70044 bits per symbol. I suspect I can, with the few "tricks" I mentioned and a few more I can probably discover. The output file of course does not have to be "human readable". As long as the encoding is lossless, it is valid.

Here is a link to 3 million human-readable shuffled decks (1 per line). Anyone can "practice" on a small subset of these lines and then move on to the entire file. I will keep updating my best (smallest) file size based on this data.

https://drive.google.com/file/d/0BweDAVsuCEM1amhsNmFITnEwd2s/view

By the way, in case you are interested in what type of card game this data is used for, here is the link to my active question (with a 300-point bounty). I have been told it is a hard problem to solve (exactly), because it would require a huge amount of data storage space. Some simulations do agree with the approximate probabilities, though. No purely mathematical solution has been provided yet. I guess it is too hard.

/math/1882705/probability-2-player-card-game-with-multiple-patterns-to-win-who-has-the-advant

I have a good algorithm that takes 168 bits to encode the first deck in the sample data. That data was generated randomly using the Fisher-Yates shuffle algorithm. It is truly random data, so my newly created algorithm seems to be working very well, which makes me happy.

Regarding the compression "challenge", I am currently at about 160 bits per deck. I think I can get down to maybe 158. Yes, I tried it and I got 158.43 bits per deck. I think I am getting close to the limit of my algorithm, so I succeeded in dropping below 166 bits per deck, but I failed to reach 156 bits, which would be 3 bits per card; still, it was a fun exercise. Perhaps in the future I will think of something to reduce each deck by an average of 2.43 bits or more.


8
If you are generating these shuffled decks yourself (rather than, say, describing the state of physical decks of cards), you don't need to store the deck at all; just store the RNG seed that generated it.
Jasonharper

3
Your description, and that in the answers, is very similar to the concept usually known as range encoding (en.wikipedia.org/wiki/Range_encoding). You adapt the probabilities after each card so that they reflect the remaining possible cards.
H. Idden '16

Comments are not for extended discussion; this conversation has been moved to chat.
Gilles 'SO- stop being evil'

Answers:


3

One more thing to consider: if you only care about compressing the complete set of several million decks, and you don't care what order they are in, you can gain extra encoding flexibility by discarding the ordering information about the set of decks. This would be the case, for example, if you need to load the set in order to enumerate all of the decks and process them, but don't care what order they are processed in.

You start by encoding each deck individually, as the other answers have described. Then sort those encoded values. Store a series of differences between the sorted encoded values (where the first difference starts from encoded deck "0"). Given a large number of decks, the differences will tend to be smaller than the full encoding range, so you can use some form of varint encoding to handle the occasional larger differences while still storing the smaller ones efficiently. The appropriate varint scheme will depend on how many decks you have in the set (and hence on the average difference size).
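A minimal sketch of this delta-plus-varint idea in Python, assuming each deck has already been encoded to an integer (for instance with the 166-bit encoding from the other answers); the function names are just illustrative:

def write_varint(n, out):
    # LEB128-style varint: 7 data bits per byte, high bit set on all but the last byte.
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return

def encode_collection(deck_codes):
    # deck_codes: one integer per deck; the ordering of the decks is deliberately discarded.
    out = bytearray()
    prev = 0
    for code in sorted(deck_codes):
        write_varint(code - prev, out)   # store the gap to the previous (sorted) value
        prev = code
    return bytes(out)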

Unfortunately I don't know how much this would help your compression, but I thought the idea might be worth considering.


1
Roughly speaking, if you have several million random decks, then the average difference will be about one several-millionth of the full range, which means you can expect to save about 20-odd bits per value. You lose a little of that to the varint encoding.
Steve Jessop

2
@DavidJames: if the specific order of the decks isn't important, only that it isn't biased, you can just re-shuffle the 3 million decks after decompression (i.e. don't change any of the decks, just change the order of the list of 3 million decks).
Steve Jessop

2
If the ordering information doesn't matter, this is just one more way to reduce the information content further. If it does matter, it doesn't apply and can be ignored. That said, if the only significance of the decks' ordering is that it is "random", then you can just re-randomize it after decompression, as @SteveJessop said.
Dan Bryant

@DavidJames: seeing that the first 173 of your decks start with KKKK, without looking at the other several million, and concluding that they all start with KKKK, would be a pretty silly thing to do. Especially when they are obviously in sorted order.
user253751 '16

3
@DavidJames: this data is compressed, and the decompression routine can re-randomize it if desired. "Some naive person" gets nothing at all; they wouldn't even work out how to interpret it as playing cards. It is not a flaw in a data storage format (a lossy one, in this case) that someone using it has to RTFM to get the correct data out.
Steve Jessop

34

Here is a complete algorithm that attains the theoretical limit.

Preamble: encoding integer sequences

A sequence of 13 integers "an integer with upper bound a-1, an integer with upper bound b-1, an integer with upper bound c-1, an integer with upper bound d-1, ..., an integer with upper bound m-1" can always be encoded with perfect efficiency.

  1. Take the first integer, multiply it by b, add the second, multiply the result by c, add the third, multiply the result by d, ... multiply the result by m, and add the thirteenth. This produces a unique number between 0 and abcdefghijklm-1.
  2. Write that number down in binary.

The reverse works too. Divide by m: the remainder is the thirteenth integer. Divide the result by l: the remainder is the twelfth integer. Carry on until you divide by b: the remainder is the second integer and the quotient is the first integer.
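A small sketch of this mixed-radix encoding in Python, where bounds holds the upper bounds a, b, c, ..., m in order:

def encode_sequence(values, bounds):
    # values[i] is an integer with 0 <= values[i] < bounds[i].
    n = 0
    for v, b in zip(values, bounds):
        n = n * b + v
    return n

def decode_sequence(n, bounds):
    # Invert encode_sequence by repeated division, last bound first.
    values = []
    for b in reversed(bounds):
        n, r = divmod(n, b)
        values.append(r)
    return list(reversed(values))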

So to encode your cards in the best possible way, all we have to do is find a perfect correspondence between sequences of 13 integers (with the given upper bounds) and the arrangements of your shuffled cards.

Here is how to do that.

The correspondence between shufflings and integer sequences

Start with 0 cards on the table in front of you.

Step 1

Take the four 2s from your pack and place them on the table.

What choices do you have? A card, or cards, can be placed either at the beginning of the sequence already on the table, or after any of the cards already in that sequence. In this case, this means there are 1 + 0 = 1 possible places where cards can go.

The total number of ways of placing 4 cards in 1 place is 1. Encode each such way as a number between 0 and 1-1. There is 1 such number.

I got 1 by considering the ways of writing 0 as the sum of 5 integers: it is 4 × 3 × 2 × 1 / 4!.

Step 2

Take the four 3s from your pack and place them on the table.

What choices do you have? A card, or cards, can be placed either at the beginning of the sequence already on the table, or after any of the cards already in that sequence. In this case, this means there are 1 + 4 = 5 possible places where cards can go.

The total number of ways of placing 4 cards in 5 places is 70. Encode each such way as a number between 0 and 70-1. There are 70 such numbers.

I got 70 by considering the ways of writing 4 as the sum of 5 integers: it is 8 × 7 × 6 × 5 / 4!.

Step 3

Take the four 4s from your pack and place them on the table.

What choices do you have? A card, or cards, can be placed either at the beginning of the sequence already on the table, or after any of the cards already in that sequence. In this case, this means there are 1 + 8 = 9 possible places where cards can go.

The total number of ways of placing 4 cards in 9 places is 495. Encode each such way as a number between 0 and 495-1. There are 495 such numbers.

I got 495 by considering the ways of writing 8 as the sum of 5 integers: it is 12 × 11 × 10 × 9 / 4!.

And so on, until...

Step 13

Take the four aces from your pack and place them on the table.

What choices do you have? A card, or cards, can be placed either at the beginning of the sequence already on the table, or after any of the cards already in that sequence. In this case, this means there are 1 + 48 = 49 possible places where cards can go.

The total number of ways of placing 4 cards in 49 places is 270,725. Encode each such way as a number between 0 and 270,725-1. There are 270,725 such numbers.

I got 270,725 by considering the ways of writing 48 as the sum of 5 integers: it is 52 × 51 × 50 × 49 / 4!.


This procedure produces a 1-1 correspondence between (a) shufflings of the cards, where you don't care about suits, and (b) sequences of integers, where the first is between 0 and 1-1, the second between 0 and 70-1, the third between 0 and 495-1, and so on up to the thirteenth, which is between 0 and 270,725-1.

Referring back to "Encoding integer sequences", you can see that such a sequence of integers is in 1-1 correspondence with the numbers between 0 and (1 × 70 × 495 × ⋯ × 270725) - 1. And if you look at the "product divided by a factorial" expression for each of those integers (given in italics at the end of each step), you will see that this means the numbers between 0 and

$$\frac{52!}{(4!)^{13}} - 1,$$

which my earlier answer showed was the best possible.

So we have a perfect method for compressing your shuffled cards.


The algorithm

Precompute a list of all the ways of writing 0 as the sum of 5 integers, of writing 4 as the sum of 5 integers, of writing 8 as the sum of 5 integers, ..., of writing 48 as the sum of 5 integers. The longest list has 270,725 elements, so it isn't particularly big. (Strictly speaking the precomputation isn't absolutely necessary, because you can easily synthesise each list as and when you need it. I tried it with Microsoft QuickBasic, and even running through the 270,725-element list was faster than the eye could see.)
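A sketch of that precomputation in Python, generating each "ways of writing 4k as the sum of 5 non-negative integers" list in lexicographic order (any fixed order works, since only the position of a 5-tuple within its list matters):

def compositions(total, parts):
    # All ways of writing `total` as an ordered sum of `parts` non-negative integers.
    if parts == 1:
        return [(total,)]
    result = []
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            result.append((first,) + rest)
    return result

# One list per step: writing 0, 4, 8, ..., 48 as the sum of 5 integers.
step_lists = [compositions(4 * step, 5) for step in range(13)]
# len(step_lists[1]) == 70, len(step_lists[2]) == 495, ..., len(step_lists[12]) == 270725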

To convert from a shuffling to a sequence of integers:

The 2s contribute nothing, so let's ignore them. Write down a number between 0 and 1-1.

The 3s: how many 2s are there before the first 3? How many before the second? The third? The fourth? How many after the fourth? The answers are 5 integers that obviously add up to 4. So look up that sequence of 5 integers in the "writing 4 as the sum of 5 integers" list and note its position in that list. That will be a number between 0 and 70-1. Write it down.

The 4s: how many 2s or 3s are there before the first 4? How many before the second? The third? The fourth? How many after the fourth? The answers are 5 integers that obviously add up to 8. So look up that sequence of 5 integers in the "writing 8 as the sum of 5 integers" list and note its position in that list. That will be a number between 0 and 495-1. Write it down.

And so on, until...

The aces: how many non-ace cards are there before the first ace? How many before the second? The third? The fourth? How many after the fourth? The answers are 5 integers that obviously add up to 48. So look up that sequence of 5 integers in the "writing 48 as the sum of 5 integers" list and note its position in that list. That will be a number between 0 and 270,725-1. Write it down.

You have now written down 13 integers. Encode them (as described at the start) into a single number between 0 and $\frac{52!}{(4!)^{13}} - 1$. Write that number out in binary. It will take just under 166 bits.

This is the best possible compression, because it attains the information-theoretic limit.

Decompression is straightforward: go from the big number back to the sequence of 13 integers, and then use them to build up the sequence of cards as already described.


Comments are not for extended discussion; this conversation has been moved to chat.
DW

This solution is unclear to me and incomplete. It doesn't show how to actually obtain the 166-bit number and decode it back into the deck. It is not easy for me to conceive, so I don't know how to implement it. Your step formulas basically just break 52!/(4!)^13 into 13 pieces, which doesn't help me much. I think it would have helped if you had made a diagram for step 2 showing the 70 possible ways to arrange the cards. Your solution is too abstract for my brain to take in and process. I prefer actual examples and illustrations.
David James

23

I suggest that, rather than trying to encode each card separately in 3 or 4 bits, you encode the state of the entire deck in 166 bits. As Martin Kochanski explains, there are fewer than 2^166 possible arrangements of the cards when suits are ignored, which means the state of the entire deck can be stored in 166 bits.

How do you do this compression and decompression algorithmically, in an efficient way? I suggest using lexicographic ordering and binary search. This lets you do the compression and decompression efficiently (in both space and time), without needing a huge lookup table or other unrealistic assumptions.

In more detail: let's order decks according to the lexicographic order on their uncompressed representation, i.e., a deck is represented in uncompressed form as a string such as 22223333444455556666777788889999TTTTJJJJQQQQKKKKAAAA, and decks are ordered by lexicographic order on these strings. Now suppose you have a procedure that, given a deck D, counts the number of decks that come before it (in lexicographic order). Then you can use this procedure to compress a deck: given a deck D, compress it to a 166-bit number by counting the number of decks that come before it and then outputting that number. That number is the compressed representation of the deck.

To decompress, use binary search. Given a number n, you want to find the nth deck in the lexicographic ordering of all decks. You can do this with a procedure along the lines of binary search: pick a candidate deck D0, count the number of decks before D0, and compare that to n. This tells you whether D0 needs to be adjusted to come later or earlier. I suggest you try to get the symbols right one at a time: if you are trying to recover a string like 22223333444455556666777788889999TTTTJJJJQQQQKKKKAAAA, first search to find what to use as the first symbol of the string (just try all 13 possibilities, or use binary search over the 13 possibilities); then, once you have found the right value for the first symbol, search to find the second symbol, and so on.

All that remains is to work out an efficient procedure for counting the number of decks that come before D in lexicographic order. This looks like a straightforward but tedious combinatorial exercise. In particular, I suggest you build a subroutine for the following problem: given a prefix (such as 222234), count the number of decks that start with that prefix. This looks like a very easy exercise in binomial coefficients and factorials. Then you can invoke that subroutine a few times to count the number of decks that come before D.
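A sketch of those two pieces in Python, assuming a deck is a 52-character string over "23456789TJQKA" with exactly four of each rank, and taking "lexicographic" to mean the rank order in that string:

from math import factorial

RANKS = "23456789TJQKA"

def arrangements(counts):
    # Number of decks starting with a given prefix = number of distinct
    # orderings of the multiset of cards that remain after the prefix.
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

def decks_before(deck):
    # Count the decks that come before `deck` in lexicographic order.
    remaining = [4] * 13
    smaller = 0
    for card in deck:
        idx = RANKS.index(card)
        for j in range(idx):                       # put any smaller remaining rank here...
            if remaining[j] > 0:
                remaining[j] -= 1
                smaller += arrangements(remaining) # ...followed by any completion
                remaining[j] += 1
        remaining[idx] -= 1
    return smaller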


Comments are not for extended discussion; this conversation has been moved to chat.
DW

8

The number of possible arrangements of the cards, ignoring suits, is

$$\frac{52!}{(4!)^{13}},$$

whose logarithm, base 2, is 165.976, or 3.1919 bits per card, which is better than the limit you gave.

Any fixed "bits per card" encoding doesn't make sense, since, as you note, the last card can always be encoded in 0 bits, and in many cases the last few cards can be too. That means that for quite a way towards the "tail" of the pack, each card needs far fewer bits than you might think.

By far the best way of compressing the data would be to find 59 more bits' worth of data (59.6 bits, actually) that you want to pack together with the card data anyway, and, writing those 59 bits as a 13-digit number in base 24 (= 4!), assign a suit to each card (one digit chooses among the 4! ways of assigning suits to the aces, another digit does the same for the kings, and so on). You then have a pack of 52 fully distinct cards, and the 52! possibilities really can be encoded in 225.58 bits very easily.

But doing it without the opportunity of encoding those extra bits is also possible to some extent, and I will think about it, as I'm sure everyone else is. Thanks for a really interesting problem!
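For anyone who wants to check these figures, a quick Python calculation:

from math import factorial, log2

deals = factorial(52) // factorial(4) ** 13   # arrangements ignoring suits
print(log2(deals))                            # about 165.976 bits, i.e. 3.1919 per card
print(log2(factorial(52)))                    # about 225.58 bits for 52 fully distinct cards
print(log2(factorial(52)) - log2(deals))      # about 59.6 bits of suit information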


1
Could something akin to ciphertext stealing be used here? That is, the data you encode in those extra 59 bits would be the last 59 bits of the encoded representation?
John Dvorak

@JanD I was thinking of looking into something like that. But then it turned out that an algorithm exists that attains the theoretical limit, is straightforward, and is 100% reliable, so there was no point in looking any further.
Martin Kochanski

@MartinKochanski - I wouldn't word it as "ignoring the suits", because we are still honoring the 4 suits per rank. Better wording might be "the number of distinct possible arrangements of the deck is..."
David James

3

This is a long-solved problem.

When you deal out a 52-card deck, every card you deal has one of up to 13 ranks, with known probabilities. The probabilities change with each card dealt. This is handled optimally using an ancient technique called adaptive arithmetic coding, an improvement on Huffman coding. Usually it is used for known, unchanging probabilities, but it can just as well be used for changing probabilities. Read the Wikipedia article on arithmetic coding:

https://zh.wikipedia.org/wiki/算术编码
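A sketch of the adaptive model this answer has in mind: the probability assigned to each rank at any point is simply its remaining count divided by the number of cards left, and these counts are what would drive an arithmetic (or range) coder. The coder itself is not shown here:

def rank_probabilities(remaining):
    # remaining[i] is how many cards of rank i have not been dealt yet.
    total = sum(remaining)
    return [c / total for c in remaining]

def deal_and_update(remaining, rank):
    # After coding a card of `rank`, update the model for the next card.
    remaining[rank] -= 1

counts = [4] * 13                      # a fresh deck: 4 of each of the 13 ranks
probs = rank_probabilities(counts)     # every rank starts at probability 4/52
deal_and_update(counts, 0)             # a 2 is dealt; its probability drops to 3/51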


OK, but that doesn't answer my question of whether it can approach, match, or beat the theoretical entropy-coding limit. It seems that, since there are n possible decks, each with probability 1/n, entropy coding is the limit and we cannot do better (unless we "cheat" and tell the decoder something about the encoder's input data in advance).
David James

3

Both DW and Martin Kochanski have already described algorithms for constructing a bijection between deals and the integers in the range [0, 52!/(4!)^13), but it seems that neither of them has reduced the problem to its simplest form. (See note 1.)

Suppose we have a (partial) deck described by the ordered list $a$, where $a_i$ is the number of cards of type $i$. In the OP, the initial deck is described by a 13-element list, each element of which is 4. The number of distinct shuffles of such a deck is

$$c(a) = \frac{\left(\sum_i a_i\right)!}{\prod_i a_i!}$$

This is a simple generalization of binomial coefficients, and indeed it can be proven by simply arranging the objects one type at a time, as Martin Kochanski suggests. (See note 2 below.)

Now, from any such (partial) deck, we can select one card at a time from a shuffle, using any $i$ for which $a_i > 0$. The number of unique shuffles beginning with $i$ is

$$\begin{cases} 0 & \text{if } a_i = 0 \\ c(a_1, \dots, a_{i-1}, a_i - 1, a_{i+1}, \dots, a_n) & \text{if } a_i > 0. \end{cases}$$

By the formula above, we have

$$c(a_1, \dots, a_{i-1}, a_i - 1, a_{i+1}, \dots, a_n) = \frac{a_i\, c(a)}{\sum_i a_i}$$

We can then recurse (or iterate) through the deck until the shuffle is complete, observing that at each step the number of shuffles of the remaining (partial) deck that begin with a card lexicographically smaller than the card actually dealt at position $i$ is

$$c(a)\,\frac{\sum_{j=1}^{i-1} a_j}{\sum_{j=1}^{n} a_j}$$

I wrote this program in Python to illustrate the algorithm. Python is as reasonable a pseudocode as any. Note that most of the arithmetic involves extended precision: the values k (the ordinal representing a shuffle) and n (the total number of possible shuffles of the remaining partial deck) are both 166-bit bignums. To translate the code into another language it will be necessary to use some kind of bignum library.

Also, I just use a list of integers rather than card names, and, unlike the mathematics above, the integers here are 0-based.

To encode a shuffle, we walk through it, accumulating at each point the number of shuffles which start with a smaller card, using the formula above:

from math import factorial
T = factorial(52) // factorial(4) ** 13

def encode(vec):
    a = [4] * 13
    cards = sum(a)
    n = T
    k = 0
    for idx in vec:
        k += sum(a[:idx]) * n // cards
        n = a[idx] * n // cards
        a[idx] -= 1
        cards -= 1
    return k

Decoding the 166-bit number is just as simple. At each step we have the description of a partial deck and an ordinal. We need to skip over the shuffles that start with cards smaller than the one corresponding to the ordinal, then we compute and output the selected card, remove it from the remaining deck, and adjust the number of possible shuffles with the selected prefix:

def decode(k):
    vec = []
    a = [4] * 13
    cards = sum(a)
    n = T
    while cards > 0:
        i = cards * k // n
        accum = 0
        for idx in range(len(a)):
            if i < accum + a[idx]:
                k -= accum * n // cards
                n = a[idx] * n // cards
                a[idx] -= 1
                vec.append(idx)
                break
            accum += a[idx]
        cards -= 1
    return vec

I haven't really tried to optimize the code above. I did run it against the entire 3mil.TXT file, checking that decode(encode(line)) reproduced the original line. It took just under 300 seconds. (Seven of the lines can be seen in an online test on ideone.) Rewriting it in a lower-level language, and optimizing the division (which is possible), would probably bring that time down to something manageable.

Since the encoded value is simply an integer, it can be output in 166 bits. There is no value in deleting leading zeros, since then there would be no way to know where one encoding ended; so it really is a 166-bit encoding.

However, it is worth noting that in a practical application it is probably never necessary to encode a shuffle at all. A random shuffle can be produced by generating a 166-bit random number and decoding it. Nor is it really necessary that all 166 bits be random; you could, for example, start with a 32-bit random integer and then fill out the 166 bits using any standard RNG seeded with that 32-bit number. So if the goal is merely to be able to store a large number of random shuffles reproducibly, you can reduce the per-deal storage requirement more or less arbitrarily.
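For instance, a usage sketch relying on the decode function and the total count T defined above:

import random

def random_shuffle():
    # Draw a uniform ordinal in [0, T) and decode it into a deck of rank indices.
    return decode(random.randrange(T))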

If you need to encode a large number of actual deals (generated in some other way), but don't care about the order of the deals, you can delta-encode the sorted list of numbers, saving approximately log2 N bits per deal. (The saving comes from the fact that a sorted sequence has lower entropy than an unsorted sequence. It does not reduce the entropy of any single value in the sequence.)

Supposing we need to encode a sorted list of N k-bit numbers, we can proceed as follows:

  1. Choose p as an integer close to log2 N (either the floor or the ceiling will work; I usually go for the ceiling).

  2. We implicitly divide the range of numbers into 2^p intervals by their binary prefix. Each k-bit number is split into a p-bit prefix and a (k-p)-bit suffix; we write out only the suffixes (in order). This requires N(k-p) bits.

  3. In addition, we create a bit sequence: for each of the 2^p prefixes in turn, we write a 0 for every number having that prefix (if any), followed by a 1. This sequence obviously has 2^p + N bits: 2^p 1s and N 0s.

To decode the numbers, we start a prefix counter at 0 and work through the bit sequence. When we see a 0, we output the current prefix together with the next suffix from the suffix list. When we see a 1, we increment the current prefix.

The total length of the encoding is N(k-p) + N + 2^p, which is very close to N(k-p) + N + N = N(k-p+2), for an average of k-p+2 bits per value.
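A sketch of this scheme, returning the suffix list and the marker bit sequence separately for clarity:

def encode_sorted(numbers, k, p):
    # `numbers` must be sorted, each in [0, 2**k).
    suffixes = [x & ((1 << (k - p)) - 1) for x in numbers]   # low k-p bits, in order
    markers = []
    i = 0
    for prefix in range(1 << p):
        while i < len(numbers) and (numbers[i] >> (k - p)) == prefix:
            markers.append(0)      # one 0 per number carrying this prefix
            i += 1
        markers.append(1)          # close off this prefix
    return suffixes, markers

def decode_sorted(suffixes, markers, k, p):
    numbers = []
    prefix = 0
    it = iter(suffixes)
    for bit in markers:
        if bit == 0:
            numbers.append((prefix << (k - p)) | next(it))
        else:
            prefix += 1
    return numbers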

Notes

  1. $\frac{52!}{(4!)^{13}}$ is 92024242230271040357108320801872044844750000000000, and its logarithm base 2 is approximately 165.9765. In this answer I sometimes pretend that this logarithm is really 166; in the case of generating random ordinals within the range, a rejection algorithm can be used, and it will only very rarely reject a generated random number.
  2. For convenience, I write $S_k = \sum_{i=k}^{n} a_i$; then the $a_1$ objects of type 1 can be placed in $\binom{S_1}{a_1}$ ways, after which the objects of type 2 can be placed in $\binom{S_2}{a_2}$ ways, and so on. Since $\binom{S_i}{a_i} = \frac{S_i!}{a_i!\,(S_i - a_i)!} = \frac{S_i!}{a_i!\,S_{i+1}!}$, that leads to the total count

$$\prod_{i=1}^{n} \frac{S_i!}{a_i!\,S_{i+1}!}$$

which simplifies to the formula above.


Comments are not for extended discussion; this conversation has been moved to chat.
D.W.

@rici - I gave you the +100 bounty cuz you explained your answer in what seems like a better presentation including code while the other answers are more abstract/theoretical, leaving out some details of how to actually implement the encode/decode. As you may know, there are many details when writing code. I admit my algorithm is not the most straightforward, simple, easy to understand either but I actually got it working without much effort and over time I can get it running faster with more compression. So thanks for your answer and keep up the good work.
David James

2

As an alternate solution to this problem, my algorithm uses compound fractional (non integer) bits per card for groups of cards in the deck based on how many unfilled ranks there are remaining. It is a rather elegant algorithm. I checked my encode algorithm by hand and it is looking good. The encoder is outputting what appear to be correct bitstrings (in byte form for simplicity).

The overview of my algorithm is that it uses a combination of groups of cards and compound fractional bit encoding. For example, in my shared test file of 3 million shuffled decks, the first deck starts with the 7 cards 54A236J. The reason I chose a 7 card block size when 13 ranks of cards are possible is because 13^7 "shoehorns" (fits snugly) into 26 bits, since 13^7 = 62,748,517 and 2^26 = 67,108,864. By comparison, packing 4 cards at a time into 15 bits (13^4 = 28,561 and 2^15 = 32,768) costs 15/4 = 3.75 bits per card, whereas 26/7 = 3.714. So the number of bits per card is slightly lower if we use the 26/7 packing method.

So looking at 54A236J, we simply look up the ordinal position of those ranks in our master "23456789TJQKA" list of sorted ranks. For example, the first actual card rank of 5 has a lookup position in the rank lookup string of 4. We just treat these 7 rank positions as a base 13 number starting with 0 (so the position 4 we previously got will actually be a 3). Converted back to base 10 (for checking purposes), we get 15,565,975. In 26 bits of binary we get 00111011011000010010010111.

The decoder works in a very similar way. It takes (for example) that string of 26 bits and converts it back to decimal (base 10) to get 15,565,975, then converts it to base 13 to get the offsets into the rank lookup string, then it reconstructs the ranks one at a time and gets back the original first 7 cards, 54A236J. Note that the block size in bits won't always be 26, but it will always start out at 26 in each deck. The encoder and decoder both have some important information about the deck data even before they operate. That is one exceptionally nice thing about this algorithm.
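A sketch of this 7-card block encode/decode (13-rank mode only), using the same "23456789TJQKA" lookup string described above:

RANKS = "23456789TJQKA"

def pack_block(cards):
    # Treat the 7 rank positions as one base-13 number; 13**7 fits in 26 bits.
    value = 0
    for c in cards:
        value = value * 13 + RANKS.index(c)
    return format(value, "026b")

def unpack_block(bits):
    # Convert the 26 bits back to an integer, then peel off base-13 digits.
    value = int(bits, 2)
    cards = []
    for _ in range(7):
        value, r = divmod(value, 13)
        cards.append(RANKS[r])
    return "".join(reversed(cards))

# pack_block("54A236J") == "00111011011000010010010111"   (decimal 15,565,975)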

Each # of ranks remaining (such as 13, 12, 11, ..., 2, 1) has its own group size and cost (# of bits per card). These were found experimentally, just playing around with powers of 13, 12, 11, ... and powers of 2. I already explained how I got the group size for when we can see 13 ranks, so how about when we drop to 12 unfilled ranks? Same method. Look at the powers of 12 and stop when one of them comes very close to a power of 2 but just slightly under it. 12^5 = 248,832 and 2^18 = 262,144. That is a pretty tight fit. The number of bits encoding this group is 18/5 = 3.6. In the 13 rank group it was 26/7 = 3.714, so as you can see, as the number of unfilled ranks decreases (ranks are filling up, such as 5555 or 3333), the number of bits needed to encode the cards decreases.

Here is my complete list of costs (# of bits per card) for all possible # of ranks to be seen:

13    26/7=3.714=3  5/7
12    18/5=3.600=3  3/5
11      7/2=3.500=3  1/2
10    10/3=3.333=3  1/3
  9    16/5=3.200=3  1/5
  8      3/1=3.000=3
  7    17/6=2.833=2  5/6
  6    13/5=2.600=2  3/5
  5      7/3=2.333=2  1/3
  4      2/1=2.000=2
  3      5/3=1.667=1  2/3
  2      1/1=1.000=1
  1      0/1..4=0.0=0
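These costs come from comparing powers of the remaining rank count with powers of 2, as described above. A small sketch for playing around with them (shown for 12 remaining ranks; the best row there is 5 cards in 18 bits, i.e. 3.6 bits per card, matching the table):

from math import ceil, log2

def block_costs(r, max_group=7):
    # For r possible ranks, list (group size, bits needed, bits per card).
    rows = []
    for group in range(1, max_group + 1):
        bits = ceil(log2(r ** group))   # smallest whole number of bits that can hold r**group values
        rows.append((group, bits, bits / group))
    return rows

for group, bits, per_card in block_costs(12):
    print(group, bits, round(per_card, 3))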

So as you can clearly see, as the number of unfilled ranks decreases (which it will do in every deck), the number of bits needed to encode each card also decreases. You might be wondering what happens if we fill a rank but are not yet done with a group. For example, if the first 7 cards in the deck were 5,6,7,7,7,7,K, what should we do? Easy: the K would normally drop the encoder from 13 rank encoding mode to 12 rank encoding mode. However, since we haven't yet filled the first block of 7 cards in 13 rank encoding mode, we include the K in that block to complete it. There is very little waste this way. There are also cases where, while we are trying to fill a block, the # of filled ranks bumps up by 2 or even more. That is also no problem, as we just fill the block in the current encoding mode and then pick up in the new encoding mode, which may be 1, 2, 3, ... ranks lower, or we may even stay in the same mode (as was the case in the first deck in the datafile, where there are 3 full blocks in the 13 rank encoding mode). This is why it is important to keep the block sizes reasonable, such as between size 1 and 7. If we made it size 20, for example, we would have to fill that block at a higher bitrate than if we let the encoder transition into a more efficient encoding mode (encoding fewer ranks).

When I ran this algorithm (by hand) on the first deck of cards in the data file (which was created using Fisher-Yates unbiased shuffle), I got an impressive 168 bits to encode which is almost identical to optimal binary encoding but requires no knowledge of ordinal positions of all possible decks, no very large numbers, and no binary searches. It does however require binary manipulations and also radix manipulations (powers of 13,12,11...).

Notice also that when the number of unfilled ranks = 1, the overhead is 0 bits per card. Best case (for encoding) is we want the deck to end on a run of the same cards (such as 7777) cuz those get encoded for "free" (no bits required for those). My encode program will suppress any output when the remaining cards are all the same rank. This is cuz the decoder will be counting cards for each deck and will know that if, after seeing card 48, some rank (like 7) has not yet been seen, all 4 remaining cards MUST be 7s. If the deck ends on a pair (such as 77), triple/set (such as 777) or a quad (such as 7777), we get additional savings for that deck using my algorithm.

Another "pretty" thing about this algorithm is that it never needs to use any numbers larger than 32 bit so it wont cause problems in some languages that "don't like" large numbers. Actually the largest numbers need to be on the order of 226 which are used in the 13 rank encoding mode. From there they just get smaller. In fact, if I really wanted to, I could make the program so that it doesn't use anything larger than 16 bit numbers but this is not necessary as most computer languages can easily handle 32 bits well. Also this is beneficial to me since one of the bit functions I am using maxes out at 32 bit. It is a function to test if a bit is set or not.

In the first deck in the datafile, the encoding of cards is as follows (diagram to come later). Format is (groupsize, bits, rank encode mode):

(7,26,13) First 7 cards take 26 bits to encode in 13 rank mode.
(7,26,13)
(7,26,13)
(5,18,12)
(5,18,12)
(3,10,10)
(3,  9,  8)
(6,17,  7)
(5,13,  6)
(3,  5,  3)
(1,  0,  1)

This is a total of 52 cards and 168 bits for an average of about 3.23 bits per card. There is no ambiguity in either the encoder or the decoder. Both count cards and know which encode mode to use/expect.

Also notice that 18 cards, (more than 1/3rd of the deck), are encoded BELOW the 3.2 bits per card "limit". Unfortunately those are not enough cards to bring the overall average below about 3.2 bits per card. I imagine in the best case or near best case (where many ranks fill up early such as 54545454722772277...), the encoding for that particular deck might be under 3 bits per card, but of course it is the average case that counts. I think best case might be if all the quads are dealt in order which might never happen if given all the time in the universe and the fastest supercomputer. Something like 22223333444455556666777788889999TTTTJJJJQQQQKKKKAAAA. Here the rank encode mode would drop fast and the last 4 cards would have 0 bits of overhead. This special case takes only 135 bits to encode.

Also one possible optimization I am considering is to take all the ranks that have only 1 card remaining and treat those all as a special "rank" by placing them in a single "bucket". The reason here is if we do that, the encoder can drop into a more efficient packing mode quicker. For example, if we are in 10 rank encoding mode but we only have one more each of ranks 3, 7, and K, those cards have much less chance of appearing than the other cards so it doesn't make much sense to treat them the same. If instead I dropped to 8 rank encoding mode, which is more efficient than 10 rank mode, perhaps I could use fewer bits for that deck. When I see one of the cards in that special "grouped" bucket of several cards, I would just output that special "rank" (not a real rank but just an indicator we just saw something in that special bucket) and then a few more bits to tell the decoder which card in the bucket I saw, then I would remove that card from the group (since it just filled up). I will trace this by hand to see if any bit savings is possible using it. Note there should be no ambiguity using this special bucket because both the encoder and decoder will be counting cards and will know which ranks have only 1 card remaining. This is important because it makes the encoding process more efficient when the decoder can make correct assumptions without the encoder having to pass extra messages to it.

Here is the first full deck in the 3 million deck data file and a trace of my algorithm on it showing both the block groupings and the transitions to a lower rank encoding mode (like when transitioning from 13 to 12 unfilled ranks) as well as how many bits are needed to encode each block. x and y are used for 11 and 10 respectively because unfortunately they happened on neighboring cards and don't display well juxtaposed.

         26             26             26            18         18       10      9          17           13        5     0
    54A236J  87726Q3  3969AAA  QJK7T  9292Q  36K  J57   T8TKJ4  48Q8T  55K  4
13                                            12                    xy     98         7              6        543     2 1  0

Note that there is some inefficiency when the encode mode wants to transition early in a block (when the block is not yet completed). We are "stuck" encoding that block at a slightly higher bit level. This is a tradeoff. Because of this and because I am not using every possible combination of the bit patterns for each block (except when it is an integer power of 2), this algorithm cannot be optimal but can approach 166 bits per deck. The average on my datafile is around 175. The particular deck was "well behaved" and only required 168 bits. Note that we only got a single 4 at the end of the deck but if instead we got all four 4s there, that is a better case and we would have needed only 161 bits to encode that deck, a case where the packing actually beats the entropy of a straight binary encode of the ordinal position of it.

I now have the code implemented to calculate the bit requirements and it is showing me, on average, about 175 bits per deck with a low of 155 and a high of 183 for the 3 million deck test file. So my algorithm seems to use 9 extra bits per deck vs. the straight binary encode of the ordinal position method. Not too bad at only 5.5% additional storage space required. 176 bits is exactly 22 bytes so that is quite a bit better than 52 bytes per deck. Best case deck (didn't show up in the 3 million deck test file) packs to 136 bits and worst case deck (which did show up in the test file 8206 times) is 183 bits. Analysis shows worst case is when we don't get the first quad until close to (or at) card 40. Then as the encode mode wants to drop quickly, we are "stuck" filling blocks (as large as 7 cards) in a higher bit encoding mode. One might think that not getting any quads until card 40 would be quite rare using a well shuffled deck, but my program is telling me it happened 321 times in the test file of 3 million decks, so that is about 1 out of every 9346 decks. That is more often than I would have expected. I could check for this case and handle it with fewer bits, but it is so rare that it wouldn't affect the average bits enough.

Also here is something else very interesting. If I sort the decks on the raw deck data, the length of prefixes that repeat a significant # of times is only about length 6 (such as 222244). However with the packed data, that length increases to about 16. That means if I sort the packed data, I should be able to get a significant savings by just indicating to the decoder a 16 bit prefix and then just output the remainder of the decks (minus the repeating prefix) that have that same prefix, then go onto the next prefix and repeat. Assuming I save even just 10 bits per deck this way, I should beat the 166 bits per deck. With the enumeration technique stated by others, I am not sure if the prefix would be as long as with my algorithm. Also the packing and unpacking speed using my algorithm is surprisingly good. I could make it even faster too by storing powers of 13, 12, 11, ... in an array and using those instead of expressions like 13^5.

Regarding the 2nd level of compression where I sort the output bitstrings of my algorithm then use "difference" encoding: A very simple method would be to encode the 61,278 unique 16 bit prefixes that show up at least twice in the output data (and a maximum of 89 times reported) simply as a leading bit of 0 in the output to indicate to the 2nd level decompressor that we are encoding a prefix (such as 0000111100001111) and then any packed decks with that same prefix will follow with a 1 leading bit to indicate the non prefix part of the packed deck. The average # of packed decks with the same prefix is about 49 for each prefix, not including the few that are unique (only 1 deck has that particular prefix). It appears I can save about 15 bits per deck using this simple strategy (storing the common prefixes once). So assuming I really do get 15 bit saving per deck and I am already at about 175 bits per deck on the first level packing/compression, that should be a net of about 160 bits per deck, thus beating the 166 bits of the enumeration method.

After the 2nd level of compression using difference (prefix) encoding of the sorted bitstring output of the first encoder, I am now getting about 160 bits per deck. I use length 18 prefix and just store it intact. Since almost all (245013 out of 262144 = 93.5%) of those possible 18 bit prefixes show up, it would be even better to encode the prefixes. Perhaps I can use 2 bits to encode what type of data I have. 00 = regular length 18 prefix stored, 01= "1 up prefix" (same as previous prefix except 1 added), 11 = straight encoding from 1st level packing (approx 175 bits on average). 10=future expansion when I think of something else to encode that will save bits.

Did anyone else beat 160 bits per deck yet? I think I can get mine a little lower with some experimenting and using the 2 bit descriptors I mentioned above. Perhaps it will bottom out at 158ish. My goal is to get it to 156 bits (or better) because that would be 3 bits per card or less. Very impressive. Lots of experimenting to get it down to that level because if I change the first level encoding then I have to retest which is the best 2nd level encoding and there are many combinations to try. Some changes I make may be good for other similar random data but some may be biased towards this dataset. Not really sure but if I get the urge I can try another 3 million deck dataset to see what happens like if I get similar results on it.

One interesting thing (of many) about compression is you are never quite sure when you have hit the limit or are even approaching it. The entropy limit tells us how many bits we need if ALL possible occurrences of those bits occur about equally, but as we know, in reality, that rarely happens with a large number of bits and a (relatively) small # of trials (such as 3 million random decks vs. almost 10^50 possible combinations of 166 bits).

Does anyone have any ideas on how to make my algorithm better like what other cases I should encode that would reduce bits of storage for each deck on average? Anyone?

2 more things: 1) I am somewhat disappointed that more people didn't upvote my solution which although not optimal on space, is still decent and fairly easy to implement (I got mine working fine). 2) I did analysis on my 3 million deck datafile and noticed that the most frequently occurring card where the 1st rank fills (such as 4444) is at card 26. This happens about 6.711% of the time (for 201322 of the 3 million decks). I was hoping to use this info to compress more such as start out in 12 symbol encode mode since we know on average we wont see every rank until about middeck but this method failed to compress any as the overhead of it exceeded the savings. I am looking for some tweaks to my algorithm that can actually save bits.

So does anyone have any ideas what I should try next to save a few bits per deck using my algorithm? I am looking for a pattern that happens frequently enough so that I can reduce the bits per deck even after the extra overhead of telling the decoder what pattern to expect. I was thinking something with the expected probabilities of the remaining unseen cards and lumping all the single card remaining ones into a single bucket. This will allow me to drop into a lower encode mode quicker and maybe save some bits but I doubt it.

Also, F.Y.I., I generated 10 million random shuffles and stored them in a database for easy analysis. Only 488 of them end in a quad (such as 5555). If I pack just those using my algorithm, I get 165.71712 bits on average with a low of 157 bits and a high of 173 bits. Just slightly below the 166 bits using the other encoding method. I am somewhat surprised at how infrequent this case is (about 1 out of every 20,492 shuffles on average).


3
I notice that you've made about 24 edits in the space of 9 hours. I appreciate your desire to improve your answer. However, each time you edit the answer, it bumps this to the top of the front page. For that reason, we discourage excessive editing. If you expect to make many edits, would it be possible to batch up your edits, so you only make one edit every few hours? (Incidentally, note that putting "EDIT:" and "UPDATE:" in your answer is usually poor style. See meta.cs.stackexchange.com/q/657/755.)
D.W.

4
This is not the place to put progress reports, status updates, or blog items. We want fully-formed answers, not "coming soon" or "I have a solution but I'm not going to describe what it is".
D.W.

3
If someone is interested he will find the improved solution. The best way is to wait for full answer and post it then. If you have some updates a blog would do. I do not encourage this, but if you really must (I do not see valid reason why) you can write comment below your post and merge later. I also encourage you to delete all obsolete comments and incorporate them into one seamless question - it gets hard to read all. I try to make my own algorithm, different than any presented, but I am not happy with the results - so I do not post partials to be edited - the answer box is for full ones.
Evil

3
@DavidJames, I do understand. However, that still doesn't change our guidelines: please don't make so many edits. (If you'd like to propose improvements to the website, feel free to make a post on our Computer Science Meta or on meta.stackexchange.com suggesting it. Devs don't read this comment thread.) But in the meantime, we work with the software we have, and making many edits is discouraged because it bumps the question to the top. At this point, limiting yourself to one edit per day might be a good guideline to shoot for. Feel free to use offline editors or StackEdit if that helps!
D.W.

3
I'm not upvoting your answer for several reasons. 1) it is needlessly long and FAR too verbose. You could drastically reduce its presentation. 2) there are better answers posted, which you choose to ignore for reasons unbeknownst to me. 3) asking about lack of upvotes is usually a "red flag" to me. 4) This has constantly remained on the front page due to an INSANE amount of edits.
Nicholas Mancuso