As an alternate solution to this problem, my algorithm uses compound fractional (non-integer) bits per card for groups of cards in the deck, based on how many unfilled ranks remain. It is a rather elegant algorithm. I checked my encode algorithm by hand and it looks good; the encoder is outputting what appear to be correct bitstrings (in byte form for simplicity).
The overview of my algorithm is that it uses a combination of groups of cards and compound fractional bit encoding. For example, in my shared test file of 3 million shuffled decks, the first deck begins with the 7 cards 54A236J. The reason I chose a 7 card block size when 13 ranks are still possible is that 13^7 "shoehorns" (fits snugly) into 26 bits, since 13^7 = 62,748,517 and 2^26 = 67,108,864. Compare that to packing only 4 cards at a time: 13^4 = 28,561 and 2^15 = 32,768, which works out to 15/4 = 3.75 bits per card, while 26/7 = 3.714. So the number of bits per card is slightly lower if we use the 26/7 packing method.
So looking at 54A236J, we simply look up the ordinal position of each of those ranks in our master "23456789TJQKA" list of sorted ranks. For example, the first card rank, 5, is at position 4 in the rank lookup string (counting from 1). We just treat these 7 rank positions as a base 13 number whose digits start at 0 (so the position 4 we previously got actually becomes a 3). Converted back to base 10 (for checking purposes), we get 15,565,975. In 26 bits of binary we get 00111011011000010010010111.
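A minimal Python sketch of this packing step (the function and variable names are mine; in the lower rank encoding modes the lookup string would hold only the unfilled ranks, but for the first blocks it is the full 13-rank string):

```python
RANKS = "23456789TJQKA"   # master rank lookup string

def pack_block(cards, num_bits, lookup=RANKS):
    """Treat a block of ranks as one base-len(lookup) number, emitted as num_bits bits."""
    value = 0
    for card in cards:
        value = value * len(lookup) + lookup.index(card)   # digit = 0-based rank position
    return value, format(value, f"0{num_bits}b")

# First 7 cards of the example deck:
print(pack_block("54A236J", 26))   # (15565975, '00111011011000010010010111')
```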
The decoder works in a very similar way. It takes (for example) that string of 26 bits and converts it back to decimal (base 10) to get 15,565,975, then converts that to base 13 to get the offsets into the rank lookup string, then it reconstructs the ranks one at a time and gets the original first 7 cards, 54A236J. Note that the block size won't always be 26 bits, but it will always start out at 26 in each deck. The encoder and decoder both have some important information about the deck data even before they operate, which is one exceptionally nice thing about this algorithm.
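And the matching unpack step, again just a sketch with my own naming:

```python
RANKS = "23456789TJQKA"   # same master rank string as above

def unpack_block(bits, group_size, lookup=RANKS):
    """Convert the bit string back to a number, peel off base-len(lookup) digits,
    and map them back to ranks (most significant digit = first card)."""
    value = int(bits, 2)
    digits = []
    for _ in range(group_size):
        value, d = divmod(value, len(lookup))
        digits.append(d)
    return "".join(lookup[d] for d in reversed(digits))

print(unpack_block("00111011011000010010010111", 7))   # 54A236J
```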
Each # of ranks remaining (such as 13, 12, 11, ..., 2, 1) has its own group size and cost (# of bits per card). These were found experimentally, just playing around with powers of 13, 12, 11, ... and powers of 2. I already explained how I got the group size when 13 ranks can still be seen, so how about when we drop to 12 unfilled ranks? Same method: look at the powers of 12 and stop when one of them comes very close to a power of 2 while staying just under it. 12^5 = 248,832 and 2^18 = 262,144. That is a pretty tight fit. The number of bits per card for this group is 18/5 = 3.6. In the 13 rank group it was 26/7 = 3.714, so as you can see, as the number of unfilled ranks decreases (ranks are filling up, such as 5555 or 3333), the number of bits to encode each card decreases.
Here is my complete list of costs (# of bits per card) for all possible # of ranks to be seen:
13 ranks: 26/7 = 3.714 (3 5/7) bits per card
12 ranks: 18/5 = 3.600 (3 3/5)
11 ranks: 7/2 = 3.500 (3 1/2)
10 ranks: 10/3 = 3.333 (3 1/3)
9 ranks: 16/5 = 3.200 (3 1/5)
8 ranks: 3/1 = 3.000 (3)
7 ranks: 17/6 = 2.833 (2 5/6)
6 ranks: 13/5 = 2.600 (2 3/5)
5 ranks: 7/3 = 2.333 (2 1/3)
4 ranks: 2/1 = 2.000 (2)
3 ranks: 5/3 = 1.667 (1 2/3)
2 ranks: 1/1 = 1.000 (1)
1 rank: 0 bits (the last 1 to 4 cards cost nothing)
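As a quick sanity check of the table, here is a short Python snippet (the MODES list and formatting are mine) verifying that each group really fits in the stated number of bits, i.e. ranks^group <= 2^bits:

```python
# (unfilled ranks, group size, bits for the whole group), taken from the table above.
# The 1-rank case is omitted because those cards cost 0 bits.
MODES = [(13, 7, 26), (12, 5, 18), (11, 2, 7), (10, 3, 10), (9, 5, 16), (8, 1, 3),
         (7, 6, 17), (6, 5, 13), (5, 3, 7), (4, 1, 2), (3, 3, 5), (2, 1, 1)]

for ranks, group, bits in MODES:
    assert ranks ** group <= 2 ** bits          # the whole group fits in `bits` bits
    print(f"{ranks:2d} ranks: {ranks}^{group} = {ranks ** group:>10,} <= "
          f"2^{bits} = {2 ** bits:>10,}  ->  {bits}/{group} = {bits / group:.3f} bits/card")
```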
So as you can clearly see, as the number of unfilled ranks decreases (which it will do in every deck), the number of bits needed to encode each card also decreases. You might be wondering what happens if we fill a rank but are not yet done with a group. For example, if the first 7 cards in the deck were 5,6,7,7,7,7,K, what should we do? Easy: the K would normally drop the encoder from 13 rank encoding mode to 12 rank encoding mode, but since we haven't yet filled the first block of 7 cards in 13 rank encoding mode, we include the K in that block to complete it. There is very little waste this way. There are also cases where, while we are trying to fill a block, the # of filled ranks bumps up by 2 or even more. That is also no problem: we just fill the block in the current encoding mode, then pick up in the new encoding mode, which may be 1, 2, 3... ranks lower or may even be the same mode (as was the case in the first deck in the datafile, where there are 3 full blocks in the 13 rank encoding mode). This is why it is important to keep the block sizes reasonable, such as between 1 and 7 cards. If we made a block 20 cards, for example, we would have to fill that block at a higher bit rate instead of letting the encoder transition into a more efficient encoding mode (encoding fewer ranks).
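Here is a rough Python sketch of that block/mode bookkeeping (bit cost only, no actual bitstring output; the GROUP table and function name are mine, and I grouped the 8 rank mode as 3 cards / 9 bits to match the trace further down, which is the same 3 bits per card either way):

```python
from collections import Counter

# unfilled ranks -> (block size in cards, bits for the whole block)
GROUP = {13: (7, 26), 12: (5, 18), 11: (2, 7), 10: (3, 10), 9: (5, 16), 8: (3, 9),
         7: (6, 17), 6: (5, 13), 5: (3, 7), 4: (1, 2), 3: (3, 5), 2: (1, 1)}

def deck_cost(deck):
    """Total bits to encode one 52-card rank sequence under the block/mode policy above."""
    seen = Counter()
    total_bits, i = 0, 0
    while i < len(deck):
        unfilled = 13 - sum(1 for r in seen if seen[r] == 4)
        if unfilled <= 1:
            break                        # remaining cards are all the same rank: free
        size, bits = GROUP[unfilled]     # mode is locked in for the whole block,
        seen.update(deck[i:i + size])    # even if ranks fill up inside the block
        total_bits += bits
        i += size
    return total_bits

# First deck of the data file (ranks only; also traced further below):
print(deck_cost("54A236J87726Q33969AAAQJK7T9292Q36KJ57T8TKJ448Q8T55K4"))   # 168
```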
When I ran this algorithm (by hand) on the first deck of cards in the data file (which was created using a Fisher-Yates unbiased shuffle), I got an impressive 168 bits to encode it, which is almost identical to optimal binary encoding but requires no knowledge of the ordinal positions of all possible decks, no very large numbers, and no binary searches. It does however require binary manipulations and also radix manipulations (powers of 13, 12, 11...).
Notice also that when the number of unfilled ranks = 1, the overhead is 0 bits per card. The best case (for encoding) is a deck that ends on a run of the same rank (such as 7777), because those cards get encoded for "free" (no bits required). My encode program suppresses any output when the remaining cards are all the same rank. This works because the decoder is counting cards for each deck and knows that if, after seeing card 48, some rank (like 7) has not yet been filled, all 4 remaining cards MUST be 7s. If the deck ends on a pair (such as 77), a triple/set (such as 777), or a quad (such as 7777), we get additional savings for that deck using my algorithm.
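The decoder-side rule is small enough to show directly (a sketch reusing the RANKS string and a Counter of seen cards, as in the sketches above):

```python
from collections import Counter

RANKS = "23456789TJQKA"

def free_tail(seen):
    """If only one rank is still unfilled, the decoder can fill in the remaining
    cards itself; the encoder sent no bits for them."""
    unfilled = [r for r in RANKS if seen[r] < 4]
    if len(unfilled) == 1:
        rank = unfilled[0]
        return rank * (4 - seen[rank])     # e.g. '7', '77', '777' or '7777'
    return None

seen = Counter({r: 4 for r in RANKS})      # every rank filled...
seen["7"] = 1                              # ...except the 7s, with 1 seen so far
print(free_tail(seen))                     # 777
```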
Another "pretty" thing about this algorithm is that it never needs to use any numbers larger than 32 bit so it wont cause problems in some languages that "don't like" large numbers. Actually the largest numbers need to be on the order of 226 which are used in the 13 rank encoding mode. From there they just get smaller. In fact, if I really wanted to, I could make the program so that it doesn't use anything larger than 16 bit numbers but this is not necessary as most computer languages can easily handle 32 bits well. Also this is beneficial to me since one of the bit functions I am using maxes out at 32 bit. It is a function to test if a bit is set or not.
In the first deck in the datafile, the encoding of cards is as follows (diagram to come later). Format is (groupsize, bits, rank encode mode):
(7,26,13) First 7 cards take 26 bits to encode in 13 rank mode.
(7,26,13)
(7,26,13)
(5,18,12)
(5,18,12)
(3,10,10)
(3, 9, 8)
(6,17, 7)
(5,13, 6)
(3, 5, 3)
(1, 0, 1)
This is a total of 52 cards and 168 bits for an average of about 3.23 bits per card. There is no ambiguity in either the encoder or the decoder. Both count cards and know which encode mode to use/expect.
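A throwaway check of that arithmetic:

```python
blocks = [(7, 26), (7, 26), (7, 26), (5, 18), (5, 18), (3, 10),
          (3, 9), (6, 17), (5, 13), (3, 5), (1, 0)]      # (group size, bits) per block
cards, bits = sum(g for g, _ in blocks), sum(b for _, b in blocks)
print(cards, bits, round(bits / cards, 2))               # 52 168 3.23
```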
Also notice that 18 cards (more than 1/3rd of the deck) are encoded BELOW the 3.2 bits per card "limit". Unfortunately those are not enough cards to bring the overall average below about 3.2 bits per card. I imagine in the best case or near best case (where many ranks fill up early, such as 54545454722772277...), the encoding for that particular deck might be under 3 bits per card, but of course it is the average case that counts. I think the best case might be if all the quads are dealt in order, which might never happen even given all the time in the universe and the fastest supercomputer. Something like 22223333444455556666777788889999TTTTJJJJQQQQKKKKAAAA. Here the rank encode mode would drop fast and the last 4 cards would have 0 bits of overhead. This special case takes only 135 bits to encode.
Also, one possible optimization I am considering is to take all the ranks that have only 1 card remaining and treat them as a single special "rank" by placing them in one "bucket". The reason is that if we do this, the encoder can drop into a more efficient packing mode sooner. For example, if we are in 10 rank encoding mode but we only have one more each of ranks 3, 7, and K, those cards have a much lower chance of appearing than the other cards, so it doesn't make much sense to treat them the same. If instead I dropped to 8 rank encoding mode, which is more efficient than 10 rank mode, perhaps I could use fewer bits for that deck. When I see one of the cards in that special "grouped" bucket, I would just output that special "rank" (not a real rank, just an indicator that we saw something in the bucket) and then a few more bits to tell the decoder which card in the bucket it was, then I would remove that card from the bucket (since its rank just filled up). I will trace this by hand to see if any bit savings are possible using it. Note there should be no ambiguity using this special bucket because both the encoder and decoder will be counting cards and will know which ranks have only 1 card remaining. This is important because it makes the encoding process more efficient when the decoder can make correct assumptions without the encoder having to pass extra messages to it.
Here is the first full deck in the 3 million deck data file and a trace of my algorithm on it showing both the block groupings and the transitions to a lower rank encoding mode (like when transitioning from 13 to 12 unfilled ranks) as well as how many bits needed to encode each block. x and y are used for 11 and 10 respectively because unfortunately they happened on neighboring cards and don't display well juxtaposed.
26 26 26 18 18 10 9 17 13 5 0
54A236J 87726Q3 3969AAA QJK7T 9292Q 36K J57 T8TKJ4 48Q8T 55K 4
13 12 xy 98 7 6 543 2 1 0
Note that there is some inefficiency when the encode mode wants to transition early in a block (when the block is not yet completed); we are "stuck" encoding that block at a slightly higher bit level. This is a tradeoff. Because of this, and because I am not using every possible combination of the bit patterns for each block (except when the block is an integer power of 2), this algorithm cannot be optimal but can approach 166 bits per deck. The average on my datafile is around 175. This particular deck was "well behaved" and only required 168 bits. Note that we only got a single 4 at the end of the deck, but if instead we had gotten all four 4s there, that is a better case and we would have needed only 161 bits to encode that deck, a case where the packing actually beats the entropy of a straight binary encode of the deck's ordinal position.
I now have the code implemented to calculate the bit requirements and it is showing me, on average, about 175 bits per deck with a low of 155 and a high of 183 for the 3 million deck test file. So my algorithm seems to use 9 extra bits per deck vs. the straight binary encode of the ordinal position method. Not too bad at only 5.5% additional storage space required. 176 bits is exactly 22 bytes, so that is quite a bit better than 52 bytes per deck. The best case deck (it didn't show up in the 3 million deck test file) packs to 136 bits and the worst case deck (which did show up in the test file 8206 times) is 183 bits. Analysis shows the worst case is when we don't get the first quad until close to (or at) card 40. Then, as the encode mode wants to drop quickly, we are "stuck" filling blocks (as large as 7 cards) in a higher bit encoding mode. One might think that not getting any quads until card 40 would be quite rare in a well shuffled deck, but my program is telling me it happened 321 times in the test file of 3 million decks, so that is about 1 out of every 9,346 decks. That is more often than I would have expected. I could check for this case and handle it with fewer bits, but it is so rare that it wouldn't affect the average bits enough.
Also here is something else very interesting. If I sort on the raw deck data, the prefixes that repeat a significant # of times are only about length 6 (such as 222244). However with the packed data, that length increases to about 16. That means if I sort the packed data, I should be able to get a significant saving by just indicating to the decoder a 16 bit prefix once and then outputting only the remainder of each deck (minus the repeating prefix) that shares that prefix, then moving on to the next prefix and repeating. Assuming I save even just 10 bits per deck this way, I should beat the 166 bits per deck. With the enumeration technique stated by others, I am not sure the common prefixes would be as long as with my algorithm. Also the packing and unpacking speed using my algorithm is surprisingly good. I could make it even faster by storing the powers of 13, 12, 11... in an array and using those instead of expressions like 13^5.
Regarding the 2nd level of compression, where I sort the output bitstrings of my algorithm and then use "difference" (prefix) encoding: a very simple method would be to take the 61,278 unique 16 bit prefixes that show up at least twice in the output data (and a maximum of 89 times) and store each one with a leading bit of 0, telling the 2nd level decompressor that we are encoding a prefix (such as 0000111100001111); any packed decks with that same prefix then follow with a leading bit of 1, followed by the non-prefix part of the packed deck. The average # of packed decks with the same prefix is about 49 for each prefix, not counting the few prefixes that are unique (only 1 deck has that particular prefix). It appears I can save about 15 bits per deck using this simple strategy (storing the common prefixes once). So assuming I really do get a 15 bit saving per deck and I am already at about 175 bits per deck with the first level packing/compression, that should net about 160 bits per deck, thus beating the 166 bits of the enumeration method.
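Here is a sketch of that 2nd-level bookkeeping in Python (my own function names; a real bitstream would also need framing for the variable-length records, and unlike the description above this sketch writes a prefix record even for prefixes that occur only once, just to keep decoding unambiguous):

```python
from itertools import groupby

def prefix_pack(packed_decks, plen=16):
    """Sort the packed decks, store each plen-bit prefix once (leading-0 record),
    then each deck under it as a leading-1 record holding only the non-prefix part."""
    records = []
    for prefix, group in groupby(sorted(packed_decks), key=lambda s: s[:plen]):
        records.append("0" + prefix)
        records.extend("1" + deck[plen:] for deck in group)
    return records

def prefix_unpack(records, plen=16):
    """Inverse: a leading-0 record sets the current prefix, leading-1 records reattach it."""
    decks, prefix = [], ""
    for rec in records:
        if rec[0] == "0":
            prefix = rec[1:]
        else:
            decks.append(prefix + rec[1:])
    return decks
```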
After the 2nd level of compression using difference (prefix) encoding of the sorted bitstring output of the first encoder, I am now getting about 160 bits per deck. I use length 18 prefixes and just store each one intact. Since almost all (245,013 out of 262,144 = 93.5%) of the possible 18 bit prefixes show up, it would be even better to encode the prefixes themselves. Perhaps I can use 2 bits to encode what type of data follows: 00 = regular length 18 prefix stored intact, 01 = "1 up prefix" (same as the previous prefix except with 1 added), 11 = straight encoding from the 1st level packing (approx 175 bits on average), and 10 = future expansion for when I think of something else to encode that will save bits.
Did anyone else beat 160 bits per deck yet? I think I can get mine a little lower with some experimenting and the 2 bit descriptors I mentioned above. Perhaps it will bottom out around 158. My goal is to get it to 156 bits (or better) because that would be 3 bits per card or less, which would be very impressive. It will take lots of experimenting to get it down to that level, because if I change the first level encoding I have to retest which 2nd level encoding is best, and there are many combinations to try. Some changes I make may be good for other similar random data, but some may be biased towards this dataset. I'm not really sure, but if I get the urge I can try another 3 million deck dataset to see if I get similar results on it.
One interesting thing (of many) about compression is that you are never quite sure when you have hit the limit or are even approaching it. The entropy limit tells us how many bits we need if ALL possible occurrences of those bits occur about equally, but as we know, in reality that rarely happens with a large number of bits and a (relatively) small # of trials (such as 3 million random decks vs. the almost 10^50 possible combinations of 166 bits).
Does anyone have any ideas on how to make my algorithm better like what other cases I should encode that would reduce bits of storage for each deck on average? Anyone?
2 more things: 1) I am somewhat disappointed that more people didn't upvote my solution, which although not optimal on space, is still decent and fairly easy to implement (I got mine working fine). 2) I did some analysis on my 3 million deck datafile and noticed that the most frequent card position where the 1st rank fills (such as 4444) is card 26. This happens about 6.711% of the time (for 201,322 of the 3 million decks). I was hoping to use this info to compress more, such as starting out in 12 symbol encode mode since on average we won't see every rank until about mid-deck, but this method failed to compress anything as its overhead exceeded the savings. I am looking for tweaks to my algorithm that can actually save bits.
So does anyone have any ideas what I should try next to save a few bits per deck using my algorithm? I am looking for a pattern that happens frequently enough that I can reduce the bits per deck even after the extra overhead of telling the decoder what pattern to expect. I was thinking of something involving the expected probabilities of the remaining unseen cards, lumping all the ranks with a single card remaining into one bucket. That would allow me to drop into a lower encode mode quicker and maybe save some bits, but I doubt it.
Also, F.Y.I., I generated 10 million random shuffles and stored them in a database for easy analysis. Only 488 of them end in a quad (such as 5555). If I pack just those using my algorithm, I get 165.71712 bits on average with a low of 157 bits and a high of 173 bits. Just slightly below the 166 bits using the other encoding method. I am somewhat surprised at how infrequent this case is (about 1 out of every 20,492 shuffles on average).