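Warmup: random bitvectors
As a warmup, we can start with the case where each bitvector is chosen uniformly at random. Then it turns out that the problem can be solved in $O(n^{1.6} \min(k, \lg n))$ time (more precisely, the $1.6$ can be replaced with $\lg 3$).
We'll consider the following two-set variant of the problem:
Given sets $S, T \subseteq \{0,1\}^k$ of bitvectors, determine whether there exists a non-overlapping pair $s \in S$, $t \in T$.
The basic technique to solve this is divide-and-conquer. Here is an $O(n^{1.6} k)$ time algorithm using divide-and-conquer: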
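Split $S$ and $T$ based upon the first bit position. In other words, form $S_0 = \{s \in S : s_0 = 0\}$, $S_1 = \{s \in S : s_0 = 1\}$, $T_0 = \{t \in T : t_0 = 0\}$, $T_1 = \{t \in T : t_0 = 1\}$.
Now recursively look for a non-overlapping pair from $S_0, T_0$, from $S_0, T_1$, and from $S_1, T_0$. (The combination $S_1, T_1$ can be skipped, since every such pair overlaps in the first bit position.) If any recursive call finds a non-overlapping pair, output it, otherwise output "No non-overlapping pair exists".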
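The two steps above can be sketched in Python roughly as follows. This is a minimal sketch, not a tuned implementation: bitvectors are assumed to be equal-length tuples of 0/1 values, and the names `non_overlapping` and `find_pair` are made up for illustration.

```python
from typing import List, Optional, Tuple

BitVec = Tuple[int, ...]  # assumed representation: a tuple of 0/1 values


def non_overlapping(s: BitVec, t: BitVec) -> bool:
    """True if no position holds a 1 in both vectors."""
    return all(not (a and b) for a, b in zip(s, t))


def find_pair(S: List[BitVec], T: List[BitVec],
              pos: int = 0) -> Optional[Tuple[BitVec, BitVec]]:
    """Divide-and-conquer search for a non-overlapping pair (s in S, t in T).

    Invariant: no pair drawn from the current S and T overlaps in any
    position before `pos`.
    """
    if not S or not T:
        return None
    if pos == len(S[0]):
        # Every position has been split on, so any remaining pair works.
        return S[0], T[0]
    if len(S) == 1 or len(T) == 1:
        # One side has a single vector: a linear scan settles it.
        for s in S:
            for t in T:
                if non_overlapping(s, t):
                    return s, t
        return None
    S0 = [s for s in S if s[pos] == 0]
    S1 = [s for s in S if s[pos] == 1]
    T0 = [t for t in T if t[pos] == 0]
    T1 = [t for t in T if t[pos] == 1]
    # (S1, T1) is skipped: all of those pairs overlap at position `pos`.
    for A, B in ((S0, T0), (S0, T1), (S1, T0)):
        found = find_pair(A, B, pos + 1)
        if found is not None:
            return found
    return None
```

For example, `find_pair([(1, 0, 1), (1, 1, 0)], [(1, 1, 1), (0, 0, 1)])` returns the pair `((1, 1, 0), (0, 0, 1)))`'s two vectors, i.e. `(1, 1, 0)` from the first set and `(0, 0, 1)` from the second.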
Since all bitvectors are chosen randomly, we can expect $|S_b| \approx |S|/2$ and $|T_b| \approx |T|/2$. Thus, we have three recursive calls, and each one cuts the size of the problem in half (both sets shrink by a factor of two). After $\lg \min(|S|,|T|)$ splits, one of the two sets is down to size 1, and the problem can be solved in linear time. We get a recurrence relation along the lines of $T(n) = 3T(n/2) + O(nk)$, whose solution is $T(n) = O(n^{1.6} k)$. Accounting for running time more precisely in the two-set case, we see the running time is $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|)\, k)$.
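For readers who want the missing step, here is one way to unroll the recurrence. It is a sketch that assumes $n$ is a power of two, writes the $O(nk)$ term as $c\,nk$ for some constant $c$ (a name introduced here), and takes $T(1) = O(k)$:

```latex
\begin{align*}
T(n) &= 3\,T(n/2) + c\,nk \\
     &= c\,nk \sum_{i=0}^{\lg n - 1} \left(\tfrac{3}{2}\right)^{i} + 3^{\lg n}\, T(1) \\
     &= O\!\left(nk \cdot \left(\tfrac{3}{2}\right)^{\lg n}\right) + O\!\left(n^{\lg 3}\, k\right) \\
     &= O\!\left(n^{\lg 3}\, k\right) \approx O\!\left(n^{1.585}\, k\right)
\end{align*}
```

The last step uses $(3/2)^{\lg n} = n^{\lg 3 - 1}$, so both terms are $O(n^{\lg 3} k)$.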
This can be further improved by noting that if $k \ge 5 \lg n + 250$, then the probability that a non-overlapping pair exists is exponentially small. In particular, if $x, y$ are two random vectors, the probability that they're non-overlapping is $(3/4)^k$. If $|S| = |T| = n$, there are $n^2$ such pairs, so by a union bound, the probability that a non-overlapping pair exists is at most $n^2 (3/4)^k$. Since $(3/4)^k = 2^{-k \lg(4/3)}$ and $\lg(4/3) \approx 0.415$, this is $\le 1/2^{100}$ whenever $k \ge 5 \lg n + 250$. So, as a pre-processing step, if $k \ge 5 \lg n + 250$, then we can immediately return "No non-overlapping pair exists" (the probability this is incorrect is negligibly small), otherwise we run the above algorithm.
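The union bound is easy to sanity-check numerically. The sketch below works in base-2 logarithms to avoid underflow and computes the smallest $k$ for which $n^2 (3/4)^k$ drops below a target such as $2^{-100}$; the function names are hypothetical.

```python
import math


def log2_union_bound(n: int, k: int) -> float:
    """Base-2 log of the union bound n^2 * (3/4)^k."""
    return 2 * math.log2(n) + k * math.log2(3 / 4)


def min_k_for_bound(n: int, target_log2: float = -100.0) -> int:
    """Smallest k such that n^2 * (3/4)^k <= 2^target_log2."""
    return math.ceil((2 * math.log2(n) - target_log2) / math.log2(4 / 3))
```

The required $k$ grows like $(2 \lg n + 100) / \lg(4/3)$, i.e. it is $O(\lg n)$, which is all the running-time bound needs.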
Thus we achieve a running time of $O(n^{1.6} \min(k, \lg n))$ (or $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|) \min(k, \lg n))$ for the two-set variant proposed above), for the special case where the bitvectors are chosen uniformly at random.
Of course, this is not a worst-case analysis. Random bitvectors are considerably easier than the worst case -- but let's treat it as a warmup, to get some ideas that perhaps we can apply to the general case.
Lessons from the warmup
We can learn a few lessons from the warmup above. First, divide-and-conquer (splitting on a bit position) seems helpful. Second, you want to split on a bit position with as many $1$'s in that position as possible; the more $0$'s there are, the less reduction in subproblem size you get.
Third, this suggests that the problem gets harder as the density of $1$'s gets smaller -- if there are very few $1$'s among the bitvectors (they are mostly $0$'s), the problem looks quite hard, as each split reduces the size of the subproblems only a little bit. So, define the density $\Delta$ to be the fraction of bits that are $1$ (i.e., out of all $nk$ bits), and the density $\Delta(i)$ of bit position $i$ to be the fraction of bitvectors that are $1$ at position $i$.
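To make the two definitions concrete, here is a small sketch that computes them for a list of equal-length 0/1 tuples. The representation and the names `position_densities` and `overall_density` are assumptions for illustration.

```python
from typing import List, Sequence


def position_densities(vectors: Sequence[Sequence[int]]) -> List[float]:
    """Delta(i) for each position i: fraction of vectors with a 1 at i."""
    n = len(vectors)
    k = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(k)]


def overall_density(vectors: Sequence[Sequence[int]]) -> float:
    """Delta: fraction of all n*k bits that are 1."""
    n = len(vectors)
    k = len(vectors[0])
    return sum(sum(v) for v in vectors) / (n * k)
```

For the four vectors `(1,0,0)`, `(1,1,0)`, `(0,0,0)`, `(1,0,0)`, the per-position densities are 0.75, 0.25 and 0.0, and the overall density is 4/12.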
Handling very low density
As a next step, we might wonder what happens if the density is extremely small. It turns out that if the density in every bit position is smaller than $1/\sqrt{k}$, we're guaranteed that a non-overlapping pair exists: there is a (non-constructive) existence argument showing that some non-overlapping pair must exist. This doesn't help us find it, but at least we know it exists.
Why is this the case? Let's say that a pair of bitvectors $x, y$ is covered by bit position $i$ if $x_i = y_i = 1$. Note that every pair of overlapping bitvectors must be covered by some bit position. Now, if we fix a particular bit position $i$, the number of pairs that can be covered by that bit position is at most $(n \Delta(i))^2 < n^2/k$. Summing across all $k$ of the bit positions, we find that the total number of pairs that are covered by some bit position is $< n^2$. This means there must exist some pair that's not covered by any bit position, which implies that this pair is non-overlapping. So if the density is sufficiently low in every bit position, then a non-overlapping pair surely exists.
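The counting argument can be checked mechanically on small inputs. The sketch below (hypothetical helper names, single-set version, vectors as 0/1 tuples) tests the low-density condition with integer arithmetic and computes the upper bound on covered pairs used in the argument. It only certifies existence; it does not locate the pair.

```python
from typing import List, Sequence


def ones_per_position(vectors: Sequence[Sequence[int]]) -> List[int]:
    """Number of vectors with a 1 at each position, i.e. n * Delta(i)."""
    k = len(vectors[0])
    return [sum(v[i] for v in vectors) for i in range(k)]


def low_density_guarantee(vectors: Sequence[Sequence[int]]) -> bool:
    """True if Delta(i) < 1/sqrt(k) for every position i.

    Delta(i) < 1/sqrt(k) is equivalent to c_i^2 * k < n^2 with
    c_i = n * Delta(i), which avoids floating point.
    """
    n = len(vectors)
    k = len(vectors[0])
    return all(c * c * k < n * n for c in ones_per_position(vectors))


def covered_pairs_upper_bound(vectors: Sequence[Sequence[int]]) -> int:
    """Sum over positions of (n * Delta(i))^2: bounds the covered pairs."""
    return sum(c * c for c in ones_per_position(vectors))
```

Whenever `low_density_guarantee` returns `True`, `covered_pairs_upper_bound` is strictly below $n^2$, which is exactly the step that forces an uncovered pair to exist.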
However, I'm at a loss to identify a fast algorithm to find such a non-overlapping pair in this regime, even though one is guaranteed to exist. I don't immediately see any techniques that would yield a running time that has a sub-quadratic dependence on $n$. So, this is a nice special case to focus on, if you want to spend some time thinking about this problem.
Towards a general-case algorithm
In the general case, a natural heuristic seems to be: pick the bit position $i$ with the largest number of $1$'s (i.e., with the highest density), and split on it. In other words:
Find a bit position $i$ that maximizes $\Delta(i)$.
Split $S$ and $T$ based upon bit position $i$. In other words, form $S_0 = \{s \in S : s_i = 0\}$, $S_1 = \{s \in S : s_i = 1\}$, $T_0 = \{t \in T : t_i = 0\}$, $T_1 = \{t \in T : t_i = 1\}$.
Now recursively look for a non-overlapping pair from $S_0, T_0$, from $S_0, T_1$, and from $S_1, T_0$. If any recursive call finds a non-overlapping pair, output it, otherwise output "No non-overlapping pair exists".
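A rough Python sketch of this heuristic follows. Several details are assumptions rather than part of the description above: bitvectors are 0/1 tuples, density is measured over $S$ and $T$ combined, counts are recomputed from scratch at every call instead of being maintained incrementally, and the name `find_pair_by_density` is invented for illustration.

```python
from typing import List, Optional, Tuple

BitVec = Tuple[int, ...]  # assumed representation: a tuple of 0/1 values


def find_pair_by_density(
    S: List[BitVec], T: List[BitVec],
    positions: Optional[Tuple[int, ...]] = None,
) -> Optional[Tuple[BitVec, BitVec]]:
    """Search for a non-overlapping pair, splitting on the densest position.

    `positions` holds the bit positions not yet split on; no pair drawn
    from the current S and T overlaps in any position outside it.
    """
    if not S or not T:
        return None
    if positions is None:
        positions = tuple(range(len(S[0])))

    def ones(p: int) -> int:
        # Number of 1's at position p across both sets (assumed measure).
        return sum(s[p] for s in S) + sum(t[p] for t in T)

    if not positions:
        return S[0], T[0]
    i = max(positions, key=ones)
    if ones(i) == 0:
        # All remaining positions are all-zero, so any pair works.
        return S[0], T[0]
    rest = tuple(p for p in positions if p != i)
    S0 = [s for s in S if s[i] == 0]
    S1 = [s for s in S if s[i] == 1]
    T0 = [t for t in T if t[i] == 0]
    T1 = [t for t in T if t[i] == 1]
    # (S1, T1) is skipped: all of those pairs overlap at position i.
    for A, B in ((S0, T0), (S0, T1), (S1, T0)):
        found = find_pair_by_density(A, B, rest)
        if found is not None:
            return found
    return None
```

The sketch is meant for checking correctness on small inputs; it says nothing about the worst-case running time.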
The challenge is to analyze its performance in the worst case.
Let's assume that as a pre-processing step we first compute the density of every bit position. Also, if $\Delta(i) < 1/\sqrt{k}$ for every $i$, assume that the pre-processing step outputs "A non-overlapping pair exists" (I realize that this doesn't exhibit an example of a non-overlapping pair, but let's set that aside as a separate challenge). All this can be done in $O(nk)$ time. The density information can be maintained efficiently as we do recursive calls; it won't be the dominant contributor to running time.
What will the running time of this procedure be? I'm not sure, but here are a few observations that might help. Each level of recursion reduces the problem size by about $n/\sqrt{k}$ bitvectors (e.g., from $n$ bitvectors to $n - n/\sqrt{k}$ bitvectors). Therefore, the recursion can only go about $\sqrt{k}$ levels deep. However, I'm not immediately sure how to count the number of leaves in the recursion tree (there are far fewer than $3^{\sqrt{k}}$ leaves), so I'm not sure what running time this should lead to.