使用反向传播训练神经网络的时间复杂度是多少？

17

假设一个NN包含 $n$ 隐藏层， $m$ 训练实例中， $x$ 的功能，和 $n_i$ 在每个层中的节点。使用反向传播训练该NN的时间复杂度是多少？

我对它们如何找到算法的时间复杂度有一个基本的想法，但是这里有4个不同的因素需要考虑，即迭代，层，每层中的节点，训练示例，也许还有更多因素。我在这里找到了答案，但还不够清楚。

除了上文所述，还有其他因素会影响NN训练算法的时间复杂度吗？

— 杜塔
source

另请参阅https://qr.ae/TWttzq。

— nbro

11

我还没有看到来自可靠来源的答案，但是我将尝试通过一个简单的示例（以我目前的知识）自己回答这个问题。

通常，请注意，通常使用矩阵来实现使用反向传播训练MLP。

矩阵乘法的时间复杂度

$M_{ij} * M_{jk}$ 的矩阵乘法的时间复杂度就是 $\mathcal{O}(i*j*k)$ 。

注意，这里我们假设最简单的乘法算法：存在一些其他算法，它们的时间复杂度更好。

前馈算法

前馈传播算法如下。

首先，从层去 $i$ 到 $j$ ，你做

S_{j} = W_{j i} * Z_{i}

$S_j = W_{ji}*Z_i$

然后您应用激活功能

Z_{j} = f (S_{j})

$Z_j = f(S_j)$

如果我们有 $N$ 层（包括输入和输出层），它将运行 $N-1$ 次。

例

例如，让我们为具有 $4$ 层的MLP计算前向通过算法的时间复杂度，其中 $i$ 表示输入层的节点数， $j$ 表示第二层的节点数， $k$ 表示输入层的节点数。第三层和 $l$ 输出层中的节点数。

由于共有 $4$ 层，因此需要 $3$ 矩阵来表示这些层之间的权重。让我们用 $W_{ji}$ ， $W_{kj}$ 和 $W_{lk}$ 表示它们，其中 $W_{ji}$ 是一个具有 $j$ 行和 $i$ 列的矩阵（因此， $W_{ji}$ 包含从第 $i$ 层到第 $j$ 层的权重）。

假设你有 $t$ 的训练例子。为了从第 $i$ 层传播到 $j$ ，我们首先

S_{j t} = W_{j i} * Z_{i t}

$S_{jt} = W_{ji} * Z_{it}$

并且此运算（即矩阵乘法时间复杂度为 $\mathcal{O}(j*i*t)$ 。然后我们应用激活功能

Z_{j t} = f (S_{j t})

$Z_{jt} = f(S_{jt})$

并且这具有 $\mathcal{O}(j*t)$ 时间复杂度，因为它是元素操作。

因此，总的来说，

O (j * i * t + j * t) = O (j * t * (t + 1)) = O (j * i * t)

$\mathcal{O}(j*i*t + j*t) = \mathcal{O}(j*t*(t + 1)) = \mathcal{O}(j*i*t)$

使用相同的逻辑，对于 $j \to k$ ，我们有 $\mathcal{O}(k*j*t)$ ，对于 $k \to l$ ，我们有 $\mathcal{O}(l*k*t)$ 。

总的来说，前馈传播的时间复杂度为

O (j * i * t + k * j * t + l * k * t) = O (t * (i j + j k + k l))

$\mathcal{O}(j*i*t + k*j*t + l*k*t) = \mathcal{O}(t*(ij + jk + kl))$

我不确定是否可以进一步简化。也许只是 $\mathcal{O}(t*i*j*k*l)$ ，但我不确定。

反向传播算法

反向传播算法进行如下。从输出层 $l \to k$ ，我们计算误差信号 $E_{lt}$ ，该矩阵包含层节点的误差信号 $l$

E_{l t} = f^{'} (S_{l t}) ⊙ (Z_{l t} - O_{l t})

$E_{lt} = f'(S_{lt}) \odot {(Z_{lt} - O_{lt})}$

其中 $\odot$ 表示逐元素乘法。注意， $E_{lt}$ 具有 $l$ 行和 $t$ 列：这仅表示每一列都是训练示例 $t$ 的错误信号。

然后我们计算“Δ权重”， $D_{lk} \in \mathbb{R}^{l \times k}$ （层之间 $l$ 和层 $k$ ）

D_{l k} = E_{l t} * Z_{t k}

$D_{lk} = E_{lt} * Z_{tk}$

其中 $Z_{tk}$ 是 $Z_{kt}$ 的转置。

然后，我们调整权重

W_{l k} = W_{l k} - D_{l k}

$W_{lk} = W_{lk} - D_{lk}$

对于 $l \to k$ ，我们的时间复杂度为 $\mathcal{O}(lt + lt + ltk + lk) = \mathcal{O}(l*t*k)$ 。

现在，从 $k \to j$ 。我们首先有

E_{k t} = f^{'} (S_{k t}) ⊙ (W_{k l} * E_{l t})

$E_{kt} = f'(S_{kt}) \odot (W_{kl} * E_{lt})$

然后

D_{k j} = E_{k t} * Z_{t j}

$D_{kj} = E_{kt} * Z_{tj}$

接着

W_{k j} = W_{k j} - D_{k j}

$W_{kj} = W_{kj} - D_{kj}$

where $W_{kl}$ is the transpose of $W_{lk}$ . For $k \to j$ , we have the time complexity $\mathcal{O}(kt + klt + ktj + kj) = \mathcal{O}(k*t(l+j))$ .

And finally, for $j \to i$ , we have $\mathcal{O}(j*t(k+i))$ . In total, we have

O (l t k + t k (l + j) + t j (k + i)) = O (t * (l k + k j + j i))

$\mathcal{O}(ltk + tk(l + j) + tj (k + i)) = \mathcal{O}(t*(lk + kj + ji))$

which is same as feedforward pass algorithm. Since they are same, the total time complexity for one epoch will be

O (t * (i j + j k + k l)) .

$O(t*(ij + jk + kl)).$

This time complexity is then multiplied by number of iterations (epochs). So, we have

O (n * t * (i j + j k + k l)),

$O(n*t*(ij + jk + kl)),$ where

n

$n$ is number of iterations.

Notes

Note that these matrix operations can greatly be paralelized by GPUs.

Conclusion

We tried to find the time complexity for training a neural network that has 4 layers with respectively $i$ , $j$ , $k$ and $l$ nodes, with $t$ training examples and $n$ epochs. The result was $\mathcal{O}(nt*(ij + jk + kl))$ .

We assumed the simplest form of matrix multiplication that has cubic time complexity. We used batch gradient descent algorithm. The results for stochastic and mini-batch gradient descent should be same. (Let me know if you think the otherwise: note that batch gradient descent is the general form, with little modification, it becomes stochastic or mini-batch)

Also, if you use momentum optimization, you will have same time complexity, because the extra matrix operations required are all element-wise operations, hence they will not affect the time complexity of the algorithm.

I'm not sure what the results would be using other optimizers such as RMSprop.

Sources

The following article http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5 describes an implementation using matrices. Although this implementation is using "row major", the time complexity is not affected by this.

If you're not familiar with back-propagation, check this article:

http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4

— M.kazem Akhgary
source

Your answer is great..I could not find any ambiguity till now, but you forgot the no. of iterations part, just add it...and if no one answers in 5 days i'll surely accept your answer

— DuttaA

@DuttaA I tried to put every thing I knew. it may not be 100% correct so feel free to leave this unaccepted :) I'm also waiting for other answers to see what other points I missed.

— M.kazem Akhgary

4

For the evaluation of a single pattern, you need to process all weights and all neurons. Given that every neuron has at least one weight, we can ignore them, and have $\mathcal{O}(w)$ where $w$ is the number of weights, i.e., $n * n_i$ , assuming full connectivity between your layers.

The back-propagation has the same complexity as the forward evaluation (just look at the formula).

So, the complexity for learning $m$ examples, where each gets repeated $e$ times, is $\mathcal{O}(w*m*e)$ .

The bad news is that there's no formula telling you what number of epochs $e$ you need.

— maaartinus
source

From the above answer don't you think itdepends on more factors?

— DuttaA

1

@DuttaA No. There's a constant amount of work per weight, which gets repeated e times for each of m examples. I didn't bother to compute the number of weights, I guess, that's the difference.

— maaartinus

1

I think the answers are same. in my answer I can assume number of weights w = ij + jk + kl. basically sum of n * n_i between layers as you noted.

— M.kazem Akhgary

1

A potential disadvantage of gradient-based methods is that they head for the nearest minimum, which is usually not the global minimum.

This means that the only difference between these search methods is the speed with which solutions are obtained, and not the nature of those solutions.

An important consideration is time complexity, which is the rate at which the time required to find a solution increases with the number of parameters (weights). In short, the time complexities of a range of different gradient-based methods (including second-order methods) seem to be similar.

Six different error functions exhibit a median run-time order of approximately O(N to the power 4) on the N-2-N encoder in this paper:

Lister, R and Stone J "An Empirical Study of the Time Complexity of Various Error Functions with Conjugate Gradient Back Propagation" , IEEE International Conference on Artificial Neural Networks (ICNN95), Perth, Australia, Nov 27-Dec 1, 1995.

Summarised from my book: Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning.

— James V Stone
source

Hi J. Stone. Thanks for trying to contribute to the site. However, please, note that this is not a place for advertising yourself. Anyway, you can surely provide a link to your own books if they are useful for answering the questions and provided you're not just trying to advertise yourself.

— nbro

@nbro If James Stone can provide an insightful answer - and it seems so - then i'm fine with him also mentioning some of his work. Having experts on this network is a solid contribution to the quality and level.

— javadba

Dear nbro, That is a fair comment. I dislike adverts too. But it is possible for a book and/or paper to be relevant to a question, as I believe it is in this case. regards, Jim Stone

— James V Stone