Data structure or algorithm to quickly find differences between strings


19

I have an array of 100,000 strings, all of length k. I want to compare each string to every other string to see if any two strings differ by 1 character. Right now, as I add each string to the array, I'm checking it against every string already in the array, which has a time complexity of n(n-1)/2 · k.

Is there a data structure or algorithm that can compare strings to each other faster than what I'm already doing?

Some additional information:

  • Order matters: abcde and xbcde differ by 1 character, whereas abcde and edcba differ by 4 characters.

  • For each pair of strings that differ by one character, I will be removing one of those strings from the array.

  • Right now I'm only looking for strings that differ by 1 character, but it would be nice if that 1-character difference could be increased to, say, 2, 3, or 4 characters. However, in this case efficiency is more important than the ability to increase the character-difference limit.

  • k is usually in the range 20-40.


4
Searching a dictionary of strings with 1 error is a fairly well-known problem, e.g. cs.nyu.edu/~adi/CGL04.pdf
KWillets

1
Strings of 20-40 characters will use quite a lot of space. You might look at a Bloom filter (en.wikipedia.org/wiki/Bloom_filter) to test whether degenerate strings (the set of all k-mers derived from one, two, or more substitutions on a query k-mer) are "maybe-in" or "definitely-not-in" the set of k-mers. If you get a "maybe-in", compare the two strings further to determine whether it is a false positive. The "definitely-not-in" cases are true negatives, and restricting comparisons to only the potential "maybe-in" matches reduces the total number of letter-by-letter comparisons you have to do.
Alex Reynolds

If you were working with a smaller range of k, you could use a bitset to store a hash table of booleans for all degenerate strings (e.g. github.com/alexpreynolds/kmer-boolean for a toy example). For k = 20-40, however, the space requirements for a bitset are simply too much.
Alex Reynolds

Answers:


12

It is possible to achieve O(nk log k) worst-case running time.

Let's start simple. If you care about an easy-to-implement solution that will be efficient on many inputs, but not all, here is a simple, pragmatic, easy-to-implement solution that suffices in practice for many situations. It does, however, fall back to quadratic running time in the worst case.

Take each string and store it in a hash table, keyed on the first half of the string. Then, iterate over the hash table buckets. For each pair of strings in the same bucket, check whether they differ in 1 character (i.e., check whether their second halves differ in 1 character).

Then, take each string and store it in a hash table, this time keyed on the second half of the string. Again check each pair of strings in the same bucket.

Assuming the strings are well-distributed, the running time will likely be about O(nk). Moreover, if there exists a pair of strings that differ in 1 character, it will be found during one of the two passes (since they differ in only 1 character, the differing character must be in either the first or the second half of the string, so the second or first halves of the two strings must be identical). However, in the worst case (e.g., if all strings start or end with the same k/2 characters), this degrades to O(n²k) running time, so its worst-case running time is no improvement over brute force.

As a performance optimization, if any bucket has too many strings in it, you can repeat the same process recursively to look for a pair that differ by one character. The recursive invocation will be on strings of length k/2.
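Here is a minimal Python sketch of the basic two-pass version described above (without the recursive optimization); the function and variable names are mine, not part of the original answer:

    from collections import defaultdict

    def find_pair_differing_by_one(strings, k):
        half = k // 2
        # Pass 1: bucket on the first half, compare second halves.
        # Pass 2: bucket on the second half, compare first halves.
        for key_slice, rest_slice in ((slice(0, half), slice(half, k)),
                                      (slice(half, k), slice(0, half))):
            buckets = defaultdict(list)
            for s in strings:
                buckets[s[key_slice]].append(s)
            for bucket in buckets.values():
                # Brute force inside a bucket; this is what degrades to quadratic
                # time if one bucket ends up holding most of the strings.
                for i in range(len(bucket)):
                    for j in range(i + 1, len(bucket)):
                        diff = sum(a != b for a, b in
                                   zip(bucket[i][rest_slice], bucket[j][rest_slice]))
                        if diff == 1:
                            return bucket[i], bucket[j]
        return None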

If you're worried about the worst-case running time:

With the above performance optimization, I believe the worst-case running time is O(nk log k).


3
If Ω(n) of the strings share the same first half, which may well happen in real life, then you haven't improved the complexity.
einpoklum

@einpoklum, certainly! That's why I wrote the statement in my second sentence that it falls back to quadratic running time in the worst case, as well as the statement in my last sentence describing how to achieve O(nk log k) worst-case complexity if you care about the worst case. But I guess maybe I didn't express that very clearly, so I edited my answer accordingly. Is it better now?
DW

15

My solution is similar to j_random_hacker's, but it uses only a single hash set.

I would create a hash set of strings. For each string in the input, add k strings to the set. In each of these strings, replace one of the letters with a special character that is not found in any of the strings. While you add them, check that they are not already in the set. If they are, then you have two strings that differ by only (at most) one character.

An example with the strings 'abc' and 'adc':

For abc, we add '*bc', 'a*c' and 'ab*'.

For adc, we add '*dc', 'a*c' and 'ad*'.

When we add 'a*c' the second time, we notice it is already in the set, so we know that there are two strings that differ by only one letter.

The total running time of this algorithm is O(nk²). This is because we create k new strings for all n strings in the input, and for each of those strings we need to compute a hash, which typically takes O(k) time.

Storing all the strings takes O(nk²) space.
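A minimal Python sketch of this idea (this is the straightforward variant that stores the masked strings themselves, so it uses the O(nk²) space mentioned above; it assumes '*' never appears in the input, and the names are mine):

    def find_near_duplicate(strings, k, wildcard='*'):
        seen = {}  # masked variant -> the original string it came from
        for s in strings:
            for i in range(k):
                masked = s[:i] + wildcard + s[i+1:]
                if masked in seen:
                    return seen[masked], s   # differ in at most one character
                seen[masked] = s
        return None

    # Example from the answer: find_near_duplicate(["abc", "adc"], 3)
    # returns ('abc', 'adc') via the shared variant 'a*c'.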

Further improvements

We can improve the algorithm further by not storing the modified strings directly, but instead storing an object with a reference to the original string and the index of the masked character. That way we don't need to create all of the strings, and we only need O(nk) space to store all the objects.

You will need to implement a custom hash function for these objects. We can take the Java implementation as an example, see the Java documentation. The Java hashCode multiplies the Unicode value of each character by 31^(k-i), where k is the string length and i is the one-based index of the character. Note that each altered string differs from the original by only one character, so we can easily compute that character's contribution to the hash code: subtract it and add the masking character instead, which takes O(1) to compute. This allows us to bring the total running time down to O(nk).


4
@JollyJoker Yes, space is a concern with this approach. You could reduce the space by not storing the modified strings, but instead storing an object with a reference to the string and the masked index. That should leave you with O(nk) space.
Simon Prins

To compute the hashes of each string in O(k) time, I think you will need a special homemade hash function (e.g., compute the hash of the original string in O(k) time, then XOR it with each of the deleted characters in O(1) time each, although this is probably a pretty bad hash function in other ways). BTW, this is quite similar to my solution, but with a single hash table instead of k separate ones, and replacing a character with '*' instead of deleting it.
j_random_hacker

@SimonPrins With custom equals and hashCode methods this could work. Just creating the a*b-style string inside those methods should make it bulletproof. I suspect some of the other answers here will have hash-collision problems.
JollyJoker

1
@DW I modified my post to reflect the fact that computing the hashes takes O(k) time, and added a solution to bring the total running time back down to O(nk).
Simon Prins

1
@SimonPrins The worst case might possibly be nk² due to the string-equality checks in the hash set. Of course, the worst case is when every string has the exact same hash, which would require a pretty much hand-crafted set of strings, especially to get the same hash for *bc, a*c and ab*. I wonder whether it could be shown to be impossible?
JollyJoker

7

I would make k hash tables H_1, ..., H_k, each of which has a (k-1)-length string as the key and a list of numbers (string IDs) as the value. The hash table H_i will contain all of the strings processed so far, but with the character at position i deleted. For example, if k = 6, then H_3[ABDEF] will contain a list of all strings seen so far that have the pattern AB?DEF, where ? means "any character". Then to process the j-th input string s_j:

  1. For each i in the range 1 to k:
    • Form the string s_j' by deleting the i-th character of s_j.
    • Look up H_i[s_j']. Every string ID here identifies an original string that is either equal to s_j, or differs only at position i. Output these as matches for string s_j. (If you wish to exclude exact duplicates, make the value type of the hash tables a (string ID, deleted character) pair, so that you can test for those that have had the same character deleted as the one we just deleted from s_j.)
    • Insert j into H_i[s_j'] for future queries to use.

If we store each hash key explicitly, then we must use O(nk²) space and thus have at least that time complexity. But as described by Simon Prins, it is possible to represent a series of modifications to a string (in his case described as changing single characters to '*', in mine as deletions) implicitly, in such a way that all k hash keys for a particular string need just O(k) space, leading to O(nk) space overall and opening up the possibility of O(nk) time too. To achieve this time complexity, we need a way to compute the hashes for all k variations of a length-k string in O(k) time: for example, this can be done using polynomial hashes, as suggested by DW (and this is likely much better than simply XORing the deleted character with the hash of the original string).

Simon Prins' implicit-representation trick also means that the "deletion" of each character is never actually performed, so we can use the usual array-based representation of a string without a performance penalty (rather than the linked lists I had originally suggested).
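A compact Python sketch of the scheme above, with the hash keys stored explicitly (i.e. the O(nk²)-space version rather than the implicit O(k)-per-string representation); identifiers are mine:

    from collections import defaultdict

    def find_one_char_matches(strings, k):
        tables = [defaultdict(list) for _ in range(k)]  # H_1 .. H_k (0-based here)
        matches = []
        for j, s in enumerate(strings):
            for i in range(k):
                key = s[:i] + s[i+1:]             # s with the i-th character deleted
                for other in tables[i][key]:      # equal to s, or differing only at position i
                    matches.append((other, j))
                tables[i][key].append(j)
        return matches  # pairs of string IDs; exact duplicates are reported once per position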


2
Nice solution. An example of a suitable custom hash function would be a polynomial hash.
DW

Thanks @DW. Could you clarify what you mean by "polynomial hash"? Googling the term didn't give me anything that seemed definitive. (Feel free to edit my post directly if you like.)
j_random_hacker

1
Just read the string as a base-q number modulo p, where p is some prime less than your hash-map size, q is a primitive root of p, and q is larger than the alphabet size. It's called a "polynomial hash" because it is like evaluating the polynomial whose coefficients are given by the string at q. I'll leave it as an exercise to figure out how to compute all the desired hashes in O(k) time. Note that this approach is not immune to an adversary unless you randomly choose p and q satisfying the desired conditions.
user21820

1
I think this solution can be further refined by observing that only one of the k hash tables needs to exist at any one time, thus reducing the memory requirement.
Michael Kay

1
@MichaelKay: That won't work if you want to compute the hashes of the possible variations of a string in O(k) time. You still need to store them somewhere. So if you only check one position at a time, you will take k times as long as checking all positions together using k times as many hash-table entries.
user21820

2

Here is a more robust hash-table approach than the polynomial-hash method. First generate k random positive integers r_1..r_k that are coprime to the hash-table size M; namely, 0 ≤ r_i < M. Then hash each string x_1..x_k to (Σ_{i=1..k} x_i·r_i) mod M. There is almost nothing an adversary can do to cause very uneven collisions, because you generate r_1..r_k at run time, and as k increases the maximum probability of collision of any given pair of distinct strings quickly approaches 1/M. It is also obvious how to compute, in O(k) time, all of the possible hashes for each string with one character changed.

If you really want to guarantee uniform hashing, you can generate one random natural number r(i,c) less than M for each pair (i,c), where i ranges from 1 to k and c over the characters, and then hash each string x_1..x_k to (Σ_{i=1..k} r(i,x_i)) mod M. Then the probability of collision of any given pair of distinct strings is exactly 1/M. This approach is better if your character set is relatively small compared to n.
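A small Python sketch of this scheme, including the computation of all single-character-change hashes with O(1) work per position (parameter names are mine; for simplicity it assumes M is prime so that any value in 1..M-1 is coprime to M):

    import random

    def make_hasher(k, M):
        # Random multipliers r_1..r_k; with M prime, every value in 1..M-1 is coprime to M.
        r = [random.randrange(1, M) for _ in range(k)]

        def hash_string(x):
            return sum(ord(c) * ri for c, ri in zip(x, r)) % M

        def wildcard_hashes(x, wildcard='*'):
            # Hash of each variant with one character replaced by the wildcard,
            # each obtained from the base hash with an O(1) update.
            h = hash_string(x)
            return [(h + (ord(wildcard) - ord(c)) * ri) % M for c, ri in zip(x, r)]

        return hash_string, wildcard_hashes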


2

A lot of the algorithms posted here use quite a bit of space on hash tables. Here is a simple algorithm with O(1) auxiliary storage and O((n lg n)·k²) running time.

The trick is to use C_k(a, b), a comparator between two values a and b that returns true if a < b (lexicographically) while ignoring the k-th character. Then the algorithm is as follows.

First, simply sort the strings regularly and do a linear scan to remove any duplicates.

Then, for each character position k:

  1. Sort the strings with C_k as the comparator.

  2. Strings that differ only in the k-th character are now adjacent and can be detected in a linear scan.
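A Python sketch of this algorithm; it uses a sort key that skips the i-th character rather than a custom comparator, which trades the strict O(1) auxiliary storage of the answer for simplicity (names are mine):

    def find_one_char_pairs(strings, k):
        strings = sorted(set(strings))        # regular sort, exact duplicates removed
        pairs = []
        for i in range(k):
            skip_i = lambda s: s[:i] + s[i+1:]
            strings.sort(key=skip_i)          # strings differing only at position i become adjacent
            for a, b in zip(strings, strings[1:]):
                if skip_i(a) == skip_i(b):
                    pairs.append((a, b))
        return pairs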


1

Two strings of length k that differ in one character share a prefix of length l and a suffix of length m such that k = l + m + 1.

The answer by Simon Prins encodes this by storing all prefix/suffix combinations explicitly, i.e. abc becomes *bc, a*c and ab*. That is k = 3, l = 0,1,2 and m = 2,1,0.
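This decomposition also gives a direct constant-space test for a single pair, which may help make the idea concrete (a small illustrative helper of mine, not part of the tree construction below):

    def differs_by_at_most_one(a, b, k):
        l = 0                                   # length of the common prefix
        while l < k and a[l] == b[l]:
            l += 1
        m = 0                                   # length of the common suffix (not overlapping the prefix)
        while m < k - l and a[k - 1 - m] == b[k - 1 - m]:
            m += 1
        return l + m >= k - 1                   # k = l + m + 1 when exactly one character differs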

As valarMorghulis points out, you can organize the words in a prefix tree. There is also the very similar suffix tree. It is fairly easy to augment either tree with the number of leaf nodes below each prefix or suffix; this can be updated in O(k) when inserting a new word.

The reason you want these sibling counts is so that, given a new word, you know whether you want to enumerate all strings with the same prefix or all strings with the same suffix. E.g. for "abc" as input, the possible prefixes are "", "a" and "ab", while the corresponding suffixes are "bc", "c" and "". Obviously, for short suffixes it is better to enumerate siblings in the prefix tree, and vice versa.

As @einpoklum points out, it is certainly possible that all strings share the same k/2 prefix. That is not a problem for this approach: the prefix tree will be linear up to depth k/2, with each node up to depth k/2 being the ancestor of 100,000 leaf nodes. As a result, the suffix tree will be used up to depth (k/2 - 1), which is good because the strings have to differ in their suffixes given that they share prefixes.

[edit] As an optimization, once you have determined the shortest unique prefix of a string, you know that if there is one differing character, it must be the last character of that prefix, and you would have found the near-duplicate when checking a prefix that was one character shorter. So if "abcde" has the shortest unique prefix "abc", there are other strings that start with "ab?" but not with "abc"; i.e., if they differ in only one character, it would be that third character. You no longer need to check for "abc?e".

By the same logic, if you find that "cde" is the unique shortest suffix, then you know you only need to check the length-2 prefix "ab", not prefixes of length 1 or 3.

Note that this method works only for exactly one character difference and does not generalize to 2-character differences; it relies on the one differing character being the separator between an identical prefix and an identical suffix.


Are you suggesting that for each string s and each 1 ≤ i ≤ k, we find the node P[s_1, ..., s_{i-1}] corresponding to the length-(i-1) prefix in the prefix trie, and the node S[s_{i+1}, ..., s_k] corresponding to the length-(k-i) suffix in the suffix trie (each takes amortised O(1) time), and compare the number of descendants of each, choosing whichever has fewer descendants, and then "probing" for the rest of the string in that trie?
j_random_hacker

1
What is the running time of your approach? It looks to me like in the worst case it might be quadratic: consider what happens if every string starts and ends with the same k/4 characters.
D.W.

The optimization idea is clever and interesting. Did you have in mind a particular way to do the check for matches? If "abcde" has the shortest unique prefix "abc", that means we should check for some other string of the form "ab?de". Did you have in mind a particular way to do that, that will be efficient? What's the resulting running time?
D.W.

@D.W.: The idea is that to find strings in the form "ab?de", you check the prefix tree for how many leaf nodes exist below "ab" and the suffix tree for how many nodes exist under "de", then choose the smaller of the two to enumerate. When all strings begin and end with the same k/4 characters, that means the first k/4 nodes in both trees have one child each. And yes, every time you need those trees, they have to be traversed, which is an O(n*k) step.
MSalters

To check for a string of the form "ab?de" in the prefix trie, it suffices to get to the node for "ab", then for each of its children v, check whether the path "de" exists below v. That is, don't bother enumerating any other nodes in these subtries. This takes O(ah) time, where a is the alphabet size and h is the height of the initial node in the trie. h is O(k), so if the alphabet size is O(n) then it is indeed O(nk) time overall, but smaller alphabets are common. The number of children (not descendants) is important, as well as the height.
j_random_hacker

1

Storing strings in buckets is a good way (there are already different answers outlining this).

An alternative solution could be to store strings in a sorted list. The trick is to sort by a locality-sensitive hashing algorithm. This is a hash algorithm which yields similar results when the input is similar[1].

Each time you want to investigate a string, you could calculate its hash and look up the position of that hash in your sorted list (taking O(log(n)) for arrays or O(n) for linked lists). If you find that the neighbours (considering all close neighbours, not only those with an index of +/- 1) of that position are similar (off by one character), you have found your match. If there are no similar strings, you can insert the new string at the position you found (which takes O(1) for linked lists and O(n) for arrays).

One possible locality-sensitive hashing algorithm could be Nilsimsa (with open source implementation available for example in python).

[1]: Note that often hash algorithms, like SHA1, are designed for the opposite: producing greatly differing hashes for similar, but not equal inputs.

Disclaimer: To be honest, I would personally implement one of the nested/tree-organized bucket solutions for a production application. However, the sorted-list idea struck me as an interesting alternative. Note that this algorithm highly depends on the chosen hash algorithm. Nilsimsa is one algorithm I found - there are many more though (for example TLSH, Ssdeep and Sdhash). I haven't verified that Nilsimsa works with my outlined algorithm.
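A rough Python sketch of the sorted-list idea; the locality-sensitive hash is passed in as a function (e.g. a Nilsimsa digest), and the neighbourhood width is an arbitrary placeholder of mine rather than something specified in the answer:

    import bisect

    def off_by_one(a, b):
        return sum(x != y for x, y in zip(a, b)) == 1

    def insert_or_match(entries, s, lsh, window=8):
        """entries is a list of (hash, string) tuples kept sorted by hash."""
        h = lsh(s)
        pos = bisect.bisect_left(entries, (h, s))
        lo, hi = max(0, pos - window), min(len(entries), pos + window)
        for _, neighbour in entries[lo:hi]:      # scan a neighbourhood, not just pos +/- 1
            if off_by_one(neighbour, s):
                return neighbour
        entries.insert(pos, (h, s))
        return None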


1
Interesting idea, but I think we would need to have some bounds on how far apart two hash values can be when their inputs differ by just 1 character -- then scan everything within that range of hash values, instead of just neighbours. (It's impossible to have a hash function that produces adjacent hash values for all possible pairs of strings that differ by 1 character. Consider the length-2 strings in a binary alphabet: 00, 01, 10 and 11. If h(00) is adjacent to both h(10) and h(01) then it must be between them, in which case h(11) can't be adjacent to them both, and vice versa.)
j_random_hacker

Looking at neighbors isn't sufficient. Consider the list abcd, acef, agcd. There exists a matching pair, but your procedure will not find it, as abcd is not a neighbor of agcd.
D.W.

You both are right! With neighbours I didn't mean only "direct neighbours" but thought of "a neighbourhood" of close positions. I didn't specify how many neighbours need to be looked at since that depends on the hash algorithm. But you're right, I should probably note this down in my answer. thanks :)
tessi

1
"LSH... similar items map to the same “buckets” with high probability" - since it's probability algorithm, result isn't guaranteed. So it depends on TS whether he needs 100% solution or 99.9% is enough.
Bulat

1

One could achieve the solution in O(nk + n²) time and O(nk) space using enhanced suffix arrays (a suffix array along with the LCP array), which allow constant-time LCP (Longest Common Prefix) queries (i.e. given two indices of a string, what is the length of the longest common prefix of the suffixes starting at those indices). Here, we can take advantage of the fact that all strings are of equal length. Specifically,

  1. Build the enhanced suffix array of all the n strings concatenated together. Let X = x_1.x_2.x_3....x_n, where x_i, 1 ≤ i ≤ n, is a string in the collection. Build the suffix array and LCP array for X.

  2. Now each x_i starts at position (i-1)k in the zero-based indexing. For each string x_i, take the LCP with each of the strings x_j such that j < i. If the LCP goes beyond the end of x_j then x_i = x_j. Otherwise, there is a mismatch (say x_i[p] ≠ x_j[p]); in this case take another LCP starting at the corresponding positions following the mismatch. If the second LCP goes beyond the end of x_j then x_i and x_j differ by only one character; otherwise there is more than one mismatch.

    for (i = 2; i <= n; ++i) {
        i_pos = (i-1)*k;
        for (j = 1; j < i; ++j) {
            j_pos = (j-1)*k;
            lcp_len = LCP(i_pos, j_pos);
            if (lcp_len < k) {                 // mismatch
                if (lcp_len == k-1) {          // mismatch at the last position
                    // Output the pair (i, j)
                }
                else {
                    second_lcp_len = LCP(i_pos + lcp_len + 1, j_pos + lcp_len + 1);
                    if (lcp_len + second_lcp_len >= k-1) { // second LCP reaches the end
                        // Output the pair (i, j)
                    }
                }
            }
        }
    }
    

You could use SDSL library to build the suffix array in compressed form and answer the LCP queries.

Analysis: Building the enhanced suffix array is linear in the length of X, i.e. O(nk). Each LCP query takes constant time. Thus, the querying time is O(n²).

Generalisation: This approach can also be generalised to more than one mismatch. In general, the running time is O(nk + q·n²) where q is the number of allowed mismatches.

If you wish to remove a string from the collection, instead of checking every j<i, you could keep a list of only 'valid' j.


Can I say that the O(kn²) algo is trivial - just compare each string pair and count the number of matches? And k in this formula can practically be omitted, since with SSE you can count matching bytes in 2 CPU cycles per 16 symbols (i.e. 6 cycles for k=40).
Bulat

Apologies but I could not understand your query. The above approach is O(nk + n²) and not O(kn²). Also, it is virtually alphabet-size independent. It could be used in conjunction with the hash-table approach - once two strings are found to have the same hashes, they could be tested for a single mismatch in O(1) time.
Ritu Kundu

My point is that k=20..40 for the question author and comparing such small strings require only a few CPU cycles, so practical difference between brute force and your approach probably doesn't exist.
Bulat

1

One improvement to all the solutions proposed. They all require O(nk) memory in the worst case. You can reduce it by computing the hashes of the strings with * in place of each character, i.e. *bcde, a*cde..., and processing on each pass only the variants with hash values in a certain integer range. F.e. with even hash values in the first pass and odd hash values in the second one.

You can also use this approach to split the work among multiple CPU/GPU cores.
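A sketch of this pass-splitting idea on top of the wildcard-set approach, doing `passes` passes and keeping only the variants whose hash falls in the current residue class (names are mine):

    def find_near_duplicates_low_memory(strings, k, passes=2):
        for p in range(passes):
            seen = {}
            for s in strings:
                for i in range(k):
                    masked = s[:i] + '*' + s[i+1:]
                    if hash(masked) % passes != p:
                        continue                  # this variant is handled in a different pass
                    if masked in seen and seen[masked] != s:
                        yield seen[masked], s     # differ in at most one character
                    seen[masked] = s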


Clever suggestion! In this case, the original question says n = 100,000 and k ≤ 40, so O(nk) memory doesn't seem likely to be an issue (that might be something like 4MB). Still a good idea worth knowing if one needs to scale this up, though!
D.W.

0

This is a short version of @SimonPrins' answer not involving hashes.

Assuming none of your strings contain an asterisk:

  1. Create a list of size nk where each of your strings occurs in k variations, each having one letter replaced by an asterisk (runtime O(nk²))
  2. Sort that list (runtime O(nk² log(nk)))
  3. Check for duplicates by comparing subsequent entries of the sorted list (runtime O(nk²))

An alternative solution with implicit usage of hashes in Python (can't resist the beauty):

def has_almost_repeats(strings, k):
    variations = [s[:i] + '*' + s[i+1:] for s in strings for i in range(k)]
    return len(set(variations)) < k * len(strings)  # a collision means two strings differ in at most one position

Thanks. Please also mention the k copies of exact duplicates, and I'll +1. (Hmm, just noticed I made the same claim about O(nk) time in my own answer... Better fix that...)
j_random_hacker

@j_random_hacker I don't know what exactly the OP wants reported, so I left step 3 vague but I think it is trivial with some extra work to report either (a) a binary any duplicate/no duplicates result or (b) a list of pairs of strings that differ in at most one position, without duplicates. If we take the OP literally ("...to see if any two strings..."), then (a) seems to be desired. Also, if (b) were desired then of course simply creating a list of pairs may take O(n²) if all strings are equal.
Bananach

0

Here is my take on a 2+ mismatches finder. Note that in this post I consider each string as circular, f.e. a substring of length 2 at index k-1 consists of the symbol str[k-1] followed by str[0]. And a substring of length 2 at index -1 is the same!

If we have M mismatches between two strings of length k, they have a matching substring of length at least mlen(k,M) = ⌈k/M⌉ - 1 since, in the worst case, the mismatched symbols split the (circular) string into M equal-sized segments. F.e. with k=20 and M=4 the "worst" match may have the pattern abcd*efgh*ijkl*mnop*.

Now, the algorithm for searching all mismatches up to M symbols among strings of k symbols:

  • for each i from 0 to k-1
    • split all strings into groups by str[i..i+L-1], where L = mlen(k,M). F.e. if L=4 and you have an alphabet of only 4 symbols (from DNA), this will make 256 groups.
    • Groups smaller than ~100 strings can be checked with the brute-force algorithm
    • For larger groups, we should perform a secondary division:
      • Remove from every string in the group the L symbols we already matched
      • for each j from i-L+1 to k-L-1
        • split all strings into groups by str[j..j+L1-1], where L1 = mlen(k-L,M). F.e. if k=20, M=4 and an alphabet of 4 symbols, then L=4 and L1=3, and this will make 64 groups.
        • the rest is left as an exercise for the reader :D

Why don't we start j from 0? Because we already made these groups with the same value of i, so a job with j <= i-L would be exactly equivalent to a job with the i and j values swapped.
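A partial Python sketch of the first-level grouping only (the secondary division and the brute-force base case are left out, matching the structure above; names are mine):

    from collections import defaultdict
    from math import ceil

    def mlen(k, M):
        # Minimum guaranteed length of a matching circular substring with at most M mismatches.
        return ceil(k / M) - 1

    def first_level_groups(strings, k, M):
        L = mlen(k, M)
        jobs = []
        for i in range(k):
            groups = defaultdict(list)
            for s in strings:
                window = (s + s)[i:i + L]      # circular substring of length L starting at i
                groups[window].append(s)
            jobs.append((i, groups))           # each group is a candidate set sharing str[i..i+L-1]
        return jobs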

Further optimizations:

  • At every position, also consider the strings str[i..i+L-2] & str[i+L]. This only doubles the amount of jobs created, but allows L to be increased by 1 (if my math is correct). So, f.e. instead of 256 groups, you will split the data into 1024 groups.
  • If some L[i] becomes too small, we can always use the * trick: for each i in 0..k-1, remove the i-th symbol from each string and create a job searching for M-1 mismatches in those strings of length k-1.

0

I work every day on inventing and optimizing algos, so if you need every last bit of performance, here is the plan:

  • Check with * in each position independently, i.e. instead of a single job processing n*k string variants, start k independent jobs each checking n strings. You can spread these k jobs among multiple CPU/GPU cores. This is especially important if you are going to check 2+ char diffs. A smaller job size will also improve cache locality, which by itself can make the program 10x faster.
  • If you are going to use hash tables, use your own implementation employing linear probing and a ~50% load factor. It's fast and pretty easy to implement. Or use an existing implementation with open addressing. STL hash tables are slow due to the use of separate chaining.
  • You may try to prefilter data using a 3-state Bloom filter (distinguishing 0/1/1+ occurrences) as proposed by @AlexReynolds.
  • For each i from 0 to k-1 run the following job:
    • Generate 8-byte structs containing a 4-5 byte hash of each string (with * at the i-th position) and the string index, and then either sort them or build a hash table from these records.

For sorting, you may try the following combo:

  • first pass is MSD radix sort in 64-256 ways employing TLB trick
  • second pass is MSD radix sort in 256-1024 ways w/o TLB trick (64K ways total)
  • third pass is insertion sort to fix remaining inconsistencies