This is related to an open research question, which is known as the "Online Boolean Matrix-Vector Multiplication (OMv) problem". This problem reads as follows (see [1]): Given a binary $n \times n$ matrix $M$ and $n$ binary column vectors $v_1, \dots, v_n$, we need to compute $Mv_i$ before $v_{i+1}$ arrives.
Notice that the problem from the question is somewhat more general: It allows for $m \times n$ matrices and real-valued vectors. Observe that the problem with $n \times n$ matrices and Boolean vectors is "easier", as it is a special case.
Clearly, the naïve algorithm for the Online Boolean Matrix-Vector Multiplication problem (which just performs a standard matrix-vector multiplication for each arriving vector) takes time $O(n^3)$. There is a conjecture (see e.g. [1]) that this cannot be done truly faster. In more detail, the conjecture states that there is no truly subcubic algorithm solving the Online Boolean Matrix-Vector Multiplication problem, i.e. no algorithm with running time $O(n^{3-\varepsilon})$ for some $\varepsilon > 0$.
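For concreteness, here is a minimal Python sketch of the naïve online algorithm (the function name and the tiny example are my own illustration, not taken from [1]):

```python
# Naive OMv sketch: each of the n arriving vectors is multiplied by M
# with the standard O(n^2) row-by-row loop, so processing all n rounds
# takes O(n^3) total. Products/sums are over the Boolean semiring.

def naive_omv(M, vectors):
    """M: n x n list of 0/1 rows; vectors: iterable of binary vectors.
    Yields M v_i (Boolean AND/OR arithmetic) before reading v_{i+1}."""
    n = len(M)
    for v in vectors:                      # n online rounds
        result = [0] * n
        for i in range(n):                 # O(n^2) work per round
            result[i] = int(any(M[i][j] and v[j] for j in range(n)))
        yield result                       # answer emitted before next vector

# Tiny example with the 2x2 identity matrix
M = [[1, 0],
     [0, 1]]
out = list(naive_omv(M, [[1, 0], [1, 1]]))  # → [[1, 0], [1, 1]]
```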
It is known that Williams's algorithm solves this problem in time $O(n^3/\log^2 n)$. See [2] for more details.
It would be a breakthrough in the area of conditional lower bounds if one could prove or disprove the above conjecture.
[1] Unifying and Strengthening Hardness for Dynamic Problems via an Online Matrix-Vector Multiplication Conjecture, by Henzinger, Krinninger, Nanongkai and Saranurak.
[ http://eprints.cs.univie.ac.at/4351/1/OMv_conjecture.pdf ]
[2] Matrix-vector multiplication in sub-quadratic time: (some preprocessing required), by Williams.
[ http://dl.acm.org/citation.cfm?id=1283383.1283490 ]
Update
One of the questions in the comments was as follows: We know $M$ at compile time. Can't we tailor our algorithm to $M$, so that the OMv conjecture does not apply? We will see that this is not the case, unless the OMv conjecture fails.
The proof idea is simple: Assume we could give fast algorithms for all matrices up to some fixed size (e.g. by distinguishing all possible cases). Beyond this size, we use divide and conquer.
Here are the details:
Fix some $n_0 \in \mathbb{N}$, which (without loss of generality) is a power of 2 and bigger than 2. Now assume that for all $n \le n_0$ and all $n \times n$ matrices $M$ we know an algorithm $A_{n,M}$ that, for all vectors $v$, computes $Mv$ in truly subquadratic time, i.e. in time $O(n^{2-\varepsilon})$ for some $\varepsilon > 0$. (Notice that this allows an individual algorithm for each matrix up to size $n_0 \times n_0$.)
Now we will solve OMv in truly subcubic time:
Given a binary matrix $M$ of size $n \times n$, where $n = 2^k$ for some $k$ and $n > n_0$, we use a divide-and-conquer strategy: We divide $M$ into four submatrices $M_1, M_2, M_3, M_4$ of size $2^{k-1} \times 2^{k-1}$. If $2^{k-1} \le n_0$, then we use algorithm $A_{2^{k-1},M_i}$; otherwise, we recurse. (As $n_0$ is a fixed constant, we can pick the correct algorithm in constant time.)
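The divide-and-conquer step can be sketched in Python as follows. This is only an illustration of the recursion: `base_case` is a hypothetical stand-in for the assumed algorithms $A_{n,M}$ (implemented here simply as the naïve product, since the fast per-matrix algorithms are only assumed to exist):

```python
# Divide-and-conquer sketch: split M into four half-size blocks and
# combine the block results with Boolean OR. `base_case` models the
# hypothetical truly-subquadratic algorithms A_{n,M} for n <= N0.

N0 = 4  # illustrative threshold; in the argument, the fixed constant n_0

def base_case(M, v):
    # Stand-in for A_{n,M}; here just the naive Boolean product.
    n = len(M)
    return [int(any(M[i][j] and v[j] for j in range(n))) for i in range(n)]

def dc_mv(M, v):
    """Compute Mv (Boolean semiring), assuming len(M) is a power of 2."""
    n = len(M)
    if n <= N0:
        return base_case(M, v)
    h = n // 2
    M1 = [row[:h] for row in M[:h]]   # top-left block
    M2 = [row[h:] for row in M[:h]]   # top-right block
    M3 = [row[:h] for row in M[h:]]   # bottom-left block
    M4 = [row[h:] for row in M[h:]]   # bottom-right block
    v_top, v_bot = v[:h], v[h:]
    # Top half of Mv is (M1 v_top) OR (M2 v_bot); bottom half analogously.
    top = [a or b for a, b in zip(dc_mv(M1, v_top), dc_mv(M2, v_bot))]
    bot = [a or b for a, b in zip(dc_mv(M3, v_top), dc_mv(M4, v_bot))]
    return top + bot
```

Note that the recursion only touches how the work is routed to the base-case algorithms; the block decomposition itself adds just the cost of the OR-combinations.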
Notice that we will need at most $O(\log n)$ recursion steps. Also, for the $n$ vectors $v_1, \dots, v_n$, we will need $n$ such computations. Thus, to process all matrix-vector multiplications we will need a total computation time of $O(n^{3-\varepsilon} \log n)$.
It is well known that the logarithm grows slower than any polynomial (in particular, slower than any root). Fixing some $\tilde\varepsilon > 0$ with $\tilde\varepsilon < \varepsilon$, we see that our total computation runs in truly subcubic time (in particular, in time $O(n^{3-\tilde\varepsilon})$). Thus, the OMv conjecture would be false.
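Spelled out, this last step is the following inequality: since $\log n = O(n^{\varepsilon - \tilde\varepsilon})$ for any $\tilde\varepsilon < \varepsilon$, we get

$$n^{3-\varepsilon} \log n \;=\; O\!\left(n^{3-\varepsilon} \cdot n^{\varepsilon - \tilde\varepsilon}\right) \;=\; O\!\left(n^{3-\tilde\varepsilon}\right).$$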
(If $M$ has size $m \times n$ and $m$ and $n$ are not powers of 2, then the bounds on the running times still apply, as we can just increase $n$ and $m$ to the next powers of 2.)
Conclusion: If you could make use of case distinctions on the input matrices to derive fast algorithms, then you would refute the OMv conjecture.