Finding a pair of non-overlapping bit vectors


17

I give you a list of $n$ bitvectors, each of width $k$. Your goal is to return two bitvectors from the list that share no 1s, or to report that no such pair exists.

For example, if I give you $[00110, 01100, 11000]$ then the only solution is $\{00110, 11000\}$. Alternatively, the input $[111, 011, 110, 101]$ has no solution. Any input containing the all-zeros bitvector $000...0$ and another element $e$ has the trivial solution $\{e, 000...0\}$.

Here's a slightly more difficult example with no solution (each row is a bitvector; black squares are 1s and white squares are 0s):

■ ■ ■ ■ □ □ □ □ □ □ □ □ □
■ □ □ □ ■ ■ ■ □ □ □ □ □ □ 
■ □ □ □ □ □ □ ■ ■ ■ □ □ □
■ □ □ □ □ □ □ □ □ □ ■ ■ ■
□ ■ □ □ □ ■ □ □ □ ■ ■ □ □
□ ■ □ □ ■ □ □ □ ■ □ □ □ ■
□ ■ □ □ □ □ ■ ■ □ □ □ ■ □ <-- All row pairs share a black square
□ □ ■ □ □ □ ■ □ ■ □ ■ □ □
□ □ ■ □ □ ■ □ ■ □ □ □ □ ■
□ □ ■ □ ■ □ □ □ □ ■ □ ■ □
□ □ □ ■ ■ □ □ ■ □ □ ■ □ □
□ □ □ ■ □ □ ■ □ □ ■ □ □ ■
□ □ □ ■ □ ■ □ □ ■ □ □ ■ □

How efficiently can we find two non-overlapping bitvectors, or show that none exist?

The naive algorithm, which just compares every possible pair, is $O(n^2 k)$. Is it possible to do better?
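For concreteness, the naive algorithm can be sketched in a few lines of Python (the function name and representation are my own; vectors are 0/1 strings as in the examples above):

```python
from itertools import combinations

def find_disjoint_pair(vectors):
    """Naive O(n^2 k) scan: return a pair of bitvectors sharing no 1,
    or None if every pair overlaps. Vectors are equal-width 0/1 strings."""
    for a, b in combinations(vectors, 2):
        if all(x == '0' or y == '0' for x, y in zip(a, b)):
            return (a, b)
    return None
```

On the first example above this returns the pair `('00110', '11000')`; on the second it returns `None`.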


Possible reduction: you have a graph $G$ with a vertex for each vector, and an edge between two vertices if the two corresponding vectors share a 1. You want to know whether the diameter of the graph is 2. But it seems hard to do better than $O(n^2 k)$.
François Godi

@FrançoisGodi Any connected graph component with three nodes and a missing edge has diameter at least 2. With an adjacency-list representation, that takes $O(V)$ time to check.
Craig Gidney 2015

@Strilanc Of course, if there is no solution the graph is complete (clearer than saying diameter = 1, you're right), but computing the adjacency-list representation may take a long time.
François Godi 2015

Is $k$ smaller than the machine word size?
Raphael

1
@TomvanderZanden That sounds like it violates invariants the data structure may depend on. In particular, that kind of equality should be transitive. I've already been thinking about using a trie, and I don't see how to avoid a factor-of-2 blowup every time the query mask has a 0.
Craig Gidney

Answers:


10

Warmup: random bitvectors

As a warmup, we can start with the case where each bitvector is chosen independently and uniformly at random. It turns out that the problem can then be solved in $O(n^{1.6} \min(k, \lg n))$ time (more precisely, the $1.6$ can be replaced by $\lg 3$).

We'll consider the following two-set variant of the problem:

Given sets $S, T \subseteq \{0,1\}^k$ of bitvectors, determine whether there is a non-overlapping pair $s \in S$, $t \in T$.

The basic technique for solving this problem is divide-and-conquer. Here is an $O(n^{1.6} k)$-time divide-and-conquer algorithm:

  1. Split $S$ and $T$ based upon the first bit position. In other words, form $S_0 = \{s \in S : s_0 = 0\}$, $S_1 = \{s \in S : s_0 = 1\}$, $T_0 = \{t \in T : t_0 = 0\}$, $T_1 = \{t \in T : t_0 = 1\}$.

  2. Now recursively look for a non-overlapping pair from $S_0, T_0$, from $S_0, T_1$, and from $S_1, T_0$. (The pairing $S_1, T_1$ can be skipped, since any such pair overlaps in the first bit.) If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".

Since all bitvectors are chosen at random, we can expect $|S_b| \approx |S|/2$ and $|T_b| \approx |T|/2$. Thus we make three recursive calls and reduce the size of the problem by a factor of two (both sets shrink by a factor of two). After $\lg \min(|S|, |T|)$ splits, one of the two sets is down to size 1, and the problem can be solved in linear time. We get a recurrence of the form $T(n) = 3T(n/2) + O(nk)$, whose solution is $T(n) = O(n^{1.6} k)$. Accounting for running time more precisely in the two-set case, the running time is $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|) \, k)$.
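A hedged Python sketch of this recursion (the naming is my own; bitvectors are tuples of 0/1 ints, with a brute-force base case once the bits run out or either set is down to one element):

```python
def disjoint_pair(S, T, pos=0):
    """Look for s in S, t in T with s AND t == 0, splitting on bit `pos`.
    Recurses on (S0,T0), (S0,T1), (S1,T0); the (S1,T1) pairing is skipped
    because both sides have a 1 at `pos` and therefore overlap."""
    if not S or not T:
        return None
    k = len(S[0])
    if pos >= k or len(S) == 1 or len(T) == 1:
        # Base case: brute force, O(|S| |T| k).
        for s in S:
            for t in T:
                if all(a == 0 or b == 0 for a, b in zip(s, t)):
                    return (s, t)
        return None
    S0 = [s for s in S if s[pos] == 0]
    S1 = [s for s in S if s[pos] == 1]
    T0 = [t for t in T if t[pos] == 0]
    T1 = [t for t in T if t[pos] == 1]
    return (disjoint_pair(S0, T0, pos + 1)
            or disjoint_pair(S0, T1, pos + 1)
            or disjoint_pair(S1, T0, pos + 1))
```

For instance, `disjoint_pair(V, V)` with `V = [(0,0,1,1,0), (0,1,1,0,0), (1,1,0,0,0)]` finds the disjoint pair from the first example, while the $[111, 011, 110, 101]$ instance yields `None`.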

This can be further improved by noting that if $k \geq 2.5 \lg n + 100$, then the probability that a non-overlapping pair exists is exponentially small. In particular, if $x, y$ are two random vectors, the probability that they're non-overlapping is $(3/4)^k$. If $|S| = |T| = n$, there are $n^2$ such pairs, so by a union bound, the probability that a non-overlapping pair exists is at most $n^2 (3/4)^k$. When $k \geq 2.5 \lg n + 100$, this is $\leq 1/2^{100}$. So, as a pre-processing step, if $k \geq 2.5 \lg n + 100$, then we can immediately return "No non-overlapping pair exists" (the probability this is incorrect is negligibly small); otherwise we run the above algorithm.

Thus we achieve a running time of $O(n^{1.6} \min(k, \lg n))$ (or $O(\min(|S|,|T|)^{0.6} \max(|S|,|T|) \min(k, \lg n))$ for the two-set variant proposed above), for the special case where the bitvectors are chosen uniformly at random.

Of course, this is not a worst-case analysis. Random bitvectors are considerably easier than the worst case -- but let's treat it as a warmup, to get some ideas that perhaps we can apply to the general case.

Lessons from the warmup

We can learn a few lessons from the warmup above. First, divide-and-conquer (splitting on a bit position) seems helpful. Second, you want to split on a bit position with as many 1s in that position as possible: the more 0s there are, the less reduction in subproblem size you get.

Third, this suggests that the problem gets harder as the density of 1s gets smaller -- if there are very few 1s among the bitvectors (they are mostly 0s), the problem looks quite hard, as each split reduces the size of the subproblems only a little. So, define the density $\Delta$ to be the fraction of bits that are 1 (i.e., out of all $nk$ bits), and the density $\Delta(i)$ of bit position $i$ to be the fraction of bitvectors that are 1 at position $i$.

Handling very low density

As a next step, we might wonder what happens if the density is extremely small. It turns out that if the density in every bit position is smaller than $1/k$, we're guaranteed that a non-overlapping pair exists: there is a (non-constructive) existence argument showing that some non-overlapping pair must exist. This doesn't help us find it, but at least we know it exists.

Why is this the case? Let's say that a pair of bitvectors $x, y$ is covered by bit position $i$ if $x_i = y_i = 1$. Note that every pair of overlapping bitvectors must be covered by some bit position. Now, if we fix a particular bit position $i$, the number of pairs that can be covered by that bit position is at most $(n \Delta(i))^2 < n^2/k$. Summing across all $k$ of the bit positions, we find that the total number of pairs that are covered by some bit position is $< n^2$. This means there must exist some pair that's not covered by any bit position, which implies that this pair is non-overlapping. So if the density is sufficiently low in every bit position, then a non-overlapping pair surely exists.
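The counting argument is easy to check mechanically. A small Python sketch (names are my own) that evaluates the certificate; like the argument above it is non-constructive, reporting only that a non-overlapping pair must exist without exhibiting one:

```python
def disjoint_pair_certified(vectors):
    """Non-constructive certificate from the covering argument: each bit
    position i can cover at most (number of 1s at i)^2 ordered pairs, so
    if those squares sum to fewer than n^2 ordered pairs in total, some
    pair is uncovered, i.e. non-overlapping."""
    n, k = len(vectors), len(vectors[0])
    counts = [sum(v[i] for v in vectors) for i in range(k)]  # n * density(i)
    return sum(c * c for c in counts) < n * n
```

When every position has density below $1/k$, the sum is below $k \cdot (n/k)^2 = n^2/k \leq n^2$, so the certificate always fires in that regime.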

However, I'm at a loss to identify a fast algorithm to find such a non-overlapping pair in this regime, even though one is guaranteed to exist. I don't immediately see any techniques that would yield a running time with a sub-quadratic dependence on $n$. So, this is a nice special case to focus on, if you want to spend some time thinking about this problem.

Towards a general-case algorithm

In the general case, a natural heuristic seems to be: pick the bit position $i$ with the largest number of 1s (i.e., with the highest density), and split on it. In other words:

  1. Find a bit position $i$ that maximizes $\Delta(i)$.

  2. Split $S$ and $T$ based upon bit position $i$. In other words, form $S_0 = \{s \in S : s_i = 0\}$, $S_1 = \{s \in S : s_i = 1\}$, $T_0 = \{t \in T : t_i = 0\}$, $T_1 = \{t \in T : t_i = 1\}$.

  3. Now recursively look for a non-overlapping pair from $S_0, T_0$, from $S_0, T_1$, and from $S_1, T_0$. If any recursive call finds a non-overlapping pair, output it; otherwise output "No non-overlapping pair exists".
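The steps above can be sketched as follows (a hedged Python illustration with my own names; two assumptions are worth flagging: density is computed over $S$ and $T$ together, and the set of not-yet-split positions is threaded through so that the recursion is guaranteed to terminate):

```python
def densest_split_search(S, T, positions=None):
    """Recursive search splitting on the remaining bit position with the
    most 1s. Returns a non-overlapping (s, t) pair or None."""
    if not S or not T:
        return None
    if positions is None:
        positions = list(range(len(S[0])))
    if not positions or len(S) == 1 or len(T) == 1:
        # Base case: brute force over the full vectors.
        for s in S:
            for t in T:
                if not any(a and b for a, b in zip(s, t)):
                    return (s, t)
        return None
    # Pick the remaining position with the highest density of 1s.
    i = max(positions, key=lambda j: sum(v[j] for v in S) + sum(v[j] for v in T))
    rest = [j for j in positions if j != i]
    S0 = [s for s in S if s[i] == 0]
    S1 = [s for s in S if s[i] == 1]
    T0 = [t for t in T if t[i] == 0]
    T1 = [t for t in T if t[i] == 1]
    # Skip (S1, T1): those pairs overlap at position i.
    return (densest_split_search(S0, T0, rest)
            or densest_split_search(S0, T1, rest)
            or densest_split_search(S1, T0, rest))
```

This is only a sketch of the heuristic's control flow; it does not implement the density-maintenance or cutoff ideas discussed below.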

The challenge is to analyze its performance in the worst case.

Let's assume that as a pre-processing step we first compute the density of every bit position. Also, if $\Delta(i) < 1/k$ for every $i$, assume that the pre-processing step outputs "A non-overlapping pair exists" (I realize that this doesn't exhibit an example of a non-overlapping pair, but let's set that aside as a separate challenge). All this can be done in $O(nk)$ time. The density information can be maintained efficiently as we do recursive calls; it won't be the dominant contributor to the running time.

What will the running time of this procedure be? I'm not sure, but here are a few observations that might help. Each level of recursion reduces the problem size by about $n/k$ bitvectors (e.g., from $n$ bitvectors to $n - n/k$ bitvectors). Therefore, the recursion can only go about $k$ levels deep. However, I'm not immediately sure how to count the number of leaves in the recursion tree (there are far fewer than $3^k$ leaves), so I'm not sure what running time this should lead to.


ad low density: this seems to be some kind of pigeonhole argument. Maybe if we use your general idea (split w.r.t. the column with the most ones), we get better bounds, because the $(S_1, T_1)$ case (which we don't recurse into) already gets rid of "most" ones?
Raphael

The total number of ones may be a useful parameter. You have already shown a lower bound we can use for cutting off the tree; can we show upper bounds, too? For example, if there are more than $ck$ ones, we have at least $c$ overlaps.
Raphael

By the way, how do you propose we do the first split; arbitrarily? Why not just split the whole input set w.r.t. some column $i$? We only need to recurse in the 0-case (there is no solution among those that share a one at $i$). In expectation, that gives, via $T(n) = T(n/2) + O(nk)$, a bound of $O(nk)$ (for fixed $k$). For a general bound, you have shown (assuming the lower-bound cutoff you propose) that we get rid of at least $n/k$ elements with every split, which seems to imply an $O(nk)$ worst-case bound. Or am I missing something?
Raphael

Ah, that's wrong, of course, since it does not consider 0-1-mismatches. That's what I get for trying to think before breakfast, I guess.
Raphael

@Raphael, there are two issues: (a) the vectors might be mostly zeros, so you can't count on getting a 50-50 split; the recurrence would be something more like $T(n) = T(n - n/k) + O(nk)$, (b) more importantly, it's not enough to just recurse on the 0-subset; you also need to examine pairings between a vector from the 0-subset and a vector from the 1-subset, so there's an additional recursion or two to do. (I think? I hope I got that right.)
D.W.

8

Faster solution when $n \approx k$, using matrix multiplication

Suppose that $n = k$. Our goal is to do better than an $O(n^2 k) = O(n^3)$ running time.

We can think of the bitvectors and bit positions as nodes in a graph. There is an edge between a bitvector node and a bit-position node when the bitvector has a 1 in that position. The resulting graph is bipartite (with the bitvector-representing nodes on one side and the bit-position-representing nodes on the other), and has $n + k = 2n$ nodes.

Given the adjacency matrix $M$ of a graph, we can tell whether there is a two-hop path between two vertices by squaring $M$ and checking whether the resulting matrix has an "edge" between those two vertices (i.e. the edge's entry in the squared matrix is non-zero). For our purposes, a zero entry in the squared adjacency matrix corresponds to a non-overlapping pair of bitvectors (i.e. a solution). A lack of any zeroes means there's no solution.

Squaring an $n \times n$ matrix can be done in $O(n^\omega)$ time, where $\omega$ is known to be under $2.373$ and conjectured to be $2$.

So the algorithm is:

  • Convert the bitvectors and bit positions into a bipartite graph with $n + k$ nodes and at most $nk$ edges. This takes $O(nk)$ time.
  • Compute the adjacency matrix of the graph. This takes $O((n+k)^2)$ time and space.
  • Square the adjacency matrix. This takes $O((n+k)^\omega)$ time.
  • Search the bitvector section of the squared matrix for zero entries. This takes $O(n^2)$ time.

The most expensive step is squaring the adjacency matrix. If $n = k$ then the overall algorithm takes $O((n+k)^\omega) = O(n^\omega)$ time, which is better than the naive $O(n^3)$ time.
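A sketch of these steps in Python with NumPy (the plain `@` product stands in for a fast matrix-multiplication routine, so this sketch itself runs in cubic time; the node layout, names, and diagonal masking to exclude pairing a vector with itself are my own):

```python
import numpy as np

def non_overlapping_pair(bitvectors):
    """Find a non-overlapping pair via the bipartite-graph construction:
    build the (n+k) x (n+k) adjacency matrix, square it, and look for a
    zero entry in the bitvector-vs-bitvector block."""
    B = np.array(bitvectors, dtype=np.int64)   # n x k biadjacency block
    n, k = B.shape
    A = np.zeros((n + k, n + k), dtype=np.int64)
    A[:n, n:] = B
    A[n:, :n] = B.T
    A2 = A @ A                                  # two-hop path counts
    block = A2[:n, :n]                          # bitvector-to-bitvector
    np.fill_diagonal(block, 1)                  # ignore self-pairs
    zeros = np.argwhere(block == 0)
    if zeros.size == 0:
        return None
    i, j = zeros[0]
    return int(i), int(j)
```

On the first example from the question this returns the index pair `(0, 2)`.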

This solution is also faster when $k$ grows not-too-much-slower and not-too-much-faster than $n$. As long as $k \in \Omega(n^{\omega-2})$ and $k \in O(n^{2/(\omega-1)})$, then $(n+k)^\omega$ is better than $n^2 k$. For $\omega \approx 2.373$ that translates to $n^{0.373} \lesssim k \lesssim n^{1.457}$ (asymptotically). If $\omega$ reaches its conjectured limit of $2$, then the bounds widen towards $n^\epsilon \lesssim k \lesssim n^{2-\epsilon}$.


1. This is also better than the naive solution if $k = \Omega(n)$ but $k = o(n^{1.457})$. 2. If $k \gg n$, a heuristic could be: pick a random subset of $n$ bit positions, restrict to those bit positions, and use matrix multiplication to enumerate all pairs that don't overlap in those $n$ bit positions; for each such pair, check whether it solves the original problem. If there aren't many pairs that don't overlap in those $n$ bit positions, this provides a speedup over the naive algorithm. However, I don't know a good upper bound on the number of such pairs.
D.W.

4

This is equivalent to finding a bitvector that is a subset of the complement of another vector; i.e., its 1s occur only where 0s occur in the other.

If $k$ (or the number of 1s) is small, you can get $O(n 2^k)$ time by simply generating all the subsets of the complement of each bitvector and putting them in a trie (using backtracking). If a bitvector is found in the trie (we check each before inserting its complement's subsets), then we have a non-overlapping pair.

If the number of 1s or 0s is bounded by an even lower number than $k$, the exponent can be replaced by that bound. The subset-indexing can be done on either each vector or its complement, so long as probing uses the opposite.
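A compact Python sketch of the complement-subset idea (names are my own; a dict of integer masks stands in for the trie, and the standard submask-enumeration trick replaces the backtracking):

```python
def disjoint_pair_submask(vectors, k):
    """Sketch of the complement-subset idea: each vector (an int mask of
    width k) records every submask of its complement in `seen`; a later
    vector equal to a recorded submask is disjoint from the vector that
    recorded it. O(n * 2^k) overall."""
    full = (1 << k) - 1
    seen = {}                        # submask -> index of recording vector
    for idx, v in enumerate(vectors):
        if v in seen:                # v is a submask of some earlier ~u
            return seen[v], idx
        comp = full & ~v
        sub = comp                   # enumerate all submasks of comp
        while True:
            seen.setdefault(sub, idx)
            if sub == 0:
                break
            sub = (sub - 1) & comp
    return None
```

On `[0b00110, 0b01100, 0b11000]` with `k = 5` this returns the index pair `(0, 2)`.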

There's also a scheme for superset-finding in a trie that stores each vector only once, but does bit-skipping during probes, for what I believe is a similar aggregate complexity; i.e., it has $O(k)$ insertion but $O(2^k)$ searches.


Thanks. The complexity of your solution is $n \cdot 2^{(1-p)k}$, where $p$ is the probability of 1s in the bitvectors. A couple of implementation details, though they are only slight improvements: there's no need to compute and store the complements in the trie; just following the complementary branches when checking for a non-overlapping match is enough. And, taking the 0s directly as wildcards, no special wildcard is needed either.
Mauro Lacy

2

Represent the bitvectors as an $n \times k$ matrix $M$. Take $i$ and $j$ between $1$ and $n$.

$$(MM^T)_{ij} = \sum_l M_{il} M_{jl}.$$

$(MM^T)_{ij}$, the dot product of the $i$th and $j$th vectors, is non-zero if, and only if, vectors $i$ and $j$ share a common 1. So, to find a solution, compute $MM^T$ and return the position of a zero entry, if such an entry exists.
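In NumPy this is a few lines (the plain `@` product below is naive, so it illustrates the construction rather than the fast asymptotics; masking the diagonal to exclude pairing a vector with itself is an assumption on my part):

```python
import numpy as np

def disjoint_pair_gram(vectors):
    """Return indices (i, j) of a zero off-diagonal entry of M M^T,
    i.e. two vectors sharing no common 1, or None if all pairs overlap."""
    M = np.array(vectors, dtype=np.int64)   # n x k bit matrix
    G = M @ M.T                             # G[i][j] = <row i, row j>
    np.fill_diagonal(G, 1)                  # ignore the i == j entries
    zeros = np.argwhere(G == 0)
    if zeros.size == 0:
        return None
    return int(zeros[0][0]), int(zeros[0][1])
```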

Complexity

Using naive multiplication, this requires $O(n^2 k)$ arithmetic operations. If $n = k$, it takes $O(n^{2.37})$ operations using the utterly impractical Coppersmith-Winograd algorithm, or $O(n^{2.8})$ using the Strassen algorithm. If $k = O(n^{0.302})$, then the problem may be solved using $n^{2+o(1)}$ operations.


How is this different from Strilanc's answer?
D.W.

1
@D.W. Using an $n$-by-$k$ matrix instead of an $(n+k)$-by-$(n+k)$ matrix is an improvement. Also, it mentions a way to cut off the factor of $k$ when $k \ll n$, so that might be useful.
Craig Gidney
Licensed under cc by-sa 3.0 with attribution required.