Step-by-step example of reverse-mode automatic differentiation



Not sure whether this question belongs here, but it's closely related to gradient methods in optimization, which seem to be on topic here. Anyway, feel free to migrate it if you think some other community has better expertise in the topic.

In short, I'm looking for a step-by-step example of reverse-mode automatic differentiation. There isn't much literature on the topic, and the existing implementations (such as the one in TensorFlow) are hard to understand without knowing the theory behind them. So I'd be very grateful if someone could show in detail what we pass in, how we process it, and what we take out of the computational graph.

A couple of questions I'm struggling with the most:

  • seeds - why do we need them at all?
  • reverse differentiation rules - I know how to differentiate forwards, but how do we go backwards? E.g. in the example from this section, how do we know that $\bar{w}_2 = \bar{w}_3 w_1$?
  • do we work with symbols or actual values? E.g. in the same example, are $w_i$ and $\bar{w}_i$ symbols or values?

I think Appendix D of "Hands-On Machine Learning with Scikit-Learn and TensorFlow" gives a very good explanation of this. I recommend it.
Agustin Barrachina

Answers:



Let's say we have the expression $z = x_1 x_2 + \sin(x_1)$ and want to find the derivatives $\frac{dz}{dx_1}$ and $\frac{dz}{dx_2}$. Reverse-mode AD splits this task into two parts, namely the forward and reverse passes.

Forward pass

First, we decompose our complex expression into a set of primitive expressions, i.e. expressions consisting of at most a single function call. Note that I also rename the input and output variables for consistency, although it's not strictly necessary:

$$w_1 = x_1$$
$$w_2 = x_2$$
$$w_3 = w_1 w_2$$
$$w_4 = \sin(w_1)$$
$$w_5 = w_3 + w_4$$
$$z = w_5$$

The advantage of this representation is that the differentiation rules for each separate expression are already known. For example, we know that the derivative of $\sin$ is $\cos$, so $\frac{dw_4}{dw_1} = \cos(w_1)$. We will use this fact in the reverse pass below.

Essentially, the forward pass consists of evaluating each of these expressions and saving the results. Say, our inputs are $x_1 = 2$ and $x_2 = 3$. Then we have:

$$w_1 = x_1 = 2$$
$$w_2 = x_2 = 3$$
$$w_3 = w_1 w_2 = 6$$
$$w_4 = \sin(w_1) \approx 0.9$$
$$w_5 = w_3 + w_4 \approx 6.9$$
$$z = w_5 \approx 6.9$$
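
For concreteness, here is a minimal Python sketch of the forward pass (the `w` dictionary is just illustrative bookkeeping, not any particular library's API): each primitive is evaluated once and its result is saved for reuse by the reverse pass.

```python
import math

# Inputs
x1, x2 = 2.0, 3.0

# Evaluate each primitive expression once and save the result;
# the reverse pass reuses these cached values.
w = {}
w[1] = x1                 # w1 = x1 = 2
w[2] = x2                 # w2 = x2 = 3
w[3] = w[1] * w[2]        # w3 = w1 * w2 = 6
w[4] = math.sin(w[1])     # w4 = sin(w1) ≈ 0.909
w[5] = w[3] + w[4]        # w5 = w3 + w4 ≈ 6.909
z = w[5]
print(z)                  # ≈ 6.909
```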

Reverse pass

This is where the magic starts, and it starts with the chain rule. In its basic form, the chain rule states that if you have a variable $t(u(v))$ that depends on $u$, which in turn depends on $v$, then:

$$\frac{dt}{dv} = \frac{dt}{du} \frac{du}{dv}$$

or, if $t$ depends on $v$ via several paths / variables $u_i$, e.g.:

$$u_1 = f(v)$$
$$u_2 = g(v)$$
$$t = h(u_1, u_2)$$

then (see a proof here):

$$\frac{dt}{dv} = \sum_i \frac{dt}{du_i} \frac{du_i}{dv}$$
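
To see this summed chain rule in action, here is a small numeric check with one hypothetical choice of $f$, $g$ and $h$ (the functions and the finite-difference comparison are purely illustrative, not part of the original answer):

```python
import math

# Hypothetical concrete instance of the multi-path chain rule:
# u1 = f(v) = v**2, u2 = g(v) = sin(v), t = h(u1, u2) = u1 * u2
def t_of(v):
    return (v ** 2) * math.sin(v)

v = 1.5
u1, u2 = v ** 2, math.sin(v)

# Sum over the two paths: dt/dv = dt/du1 * du1/dv + dt/du2 * du2/dv
dt_dv = u2 * (2 * v) + u1 * math.cos(v)

# Central finite difference as an independent check
eps = 1e-6
numeric = (t_of(v + eps) - t_of(v - eps)) / (2 * eps)
print(dt_dv, numeric)  # the two values agree to ~1e-9
```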

In terms of expression graphs, if we have a final node $z$ and input nodes $w_i$, and the path from $z$ to $w_i$ goes through intermediate nodes $w_p$ (i.e. $z = g(w_p)$ where $w_p = f(w_i)$), we can find the derivative $\frac{dz}{dw_i}$ as:

$$\frac{dz}{dw_i} = \sum_{p \in \mathrm{parents}(i)} \frac{dz}{dw_p} \frac{dw_p}{dw_i}$$

In other words, to calculate the derivative of the output variable $z$ w.r.t. any intermediate or input variable $w_i$, we only need to know the derivatives w.r.t. its parents and the formula for the derivative of the primitive expression $w_p = f(w_i)$.
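This recurrence translates almost directly into code. The following Python sketch is only illustrative: the `parents` table hard-codes this example's graph and local derivatives, whereas a real AD system would build it automatically while tracing the forward pass.

```python
import math

# Forward pass: evaluate primitives and cache the values
x1, x2 = 2.0, 3.0
w = {1: x1, 2: x2}
w[3] = w[1] * w[2]
w[4] = math.sin(w[1])
w[5] = w[3] + w[4]

# parents[i] = list of (p, dw_p/dw_i), evaluated at the cached values
parents = {
    5: [],                                  # output node
    4: [(5, 1.0)],                          # w5 = w3 + w4
    3: [(5, 1.0)],
    2: [(3, w[1])],                         # w3 = w1 * w2
    1: [(3, w[2]), (4, math.cos(w[1]))],    # w1 feeds both w3 and w4
}

adjoint = {5: 1.0}          # the seed: dz/dw5 = dz/dz = 1, since z = w5
for i in (4, 3, 2, 1):      # visit nodes in reverse topological order
    adjoint[i] = sum(adjoint[p] * d for p, d in parents[i])

print(adjoint[1], adjoint[2])  # ≈ 2.5839 and 2.0
```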

The reverse pass starts at the end (i.e. $\frac{dz}{dz}$) and propagates backwards to all dependencies. Here we have (the expression for the "seed"):

$$\frac{dz}{dz} = 1$$

That may be read as "a change in $z$ results in exactly the same change in $z$", which is pretty obvious.

Then we know that $z = w_5$, so:

$$\frac{dz}{dw_5} = 1$$

$w_5$ depends linearly on $w_3$ and $w_4$, so $\frac{dw_5}{dw_3} = 1$ and $\frac{dw_5}{dw_4} = 1$:

$$\frac{dz}{dw_3} = \frac{dz}{dw_5} \frac{dw_5}{dw_3} = 1 \times 1 = 1$$
$$\frac{dz}{dw_4} = \frac{dz}{dw_5} \frac{dw_5}{dw_4} = 1 \times 1 = 1$$

From the definition $w_3 = w_1 w_2$ and the rules of partial derivatives, we find that $\frac{dw_3}{dw_2} = w_1$. Thus:

$$\frac{dz}{dw_2} = \frac{dz}{dw_3} \frac{dw_3}{dw_2} = 1 \times w_1 = w_1$$

Which, as we already know from the forward pass, is:

$$\frac{dz}{dw_2} = w_1 = 2$$

Finally, $w_1$ contributes to $z$ via $w_3$ and $w_4$. Once again, from the rules of partial derivatives we know that $\frac{dw_3}{dw_1} = w_2$ and $\frac{dw_4}{dw_1} = \cos(w_1)$. Thus:

$$\frac{dz}{dw_1} = \frac{dz}{dw_3} \frac{dw_3}{dw_1} + \frac{dz}{dw_4} \frac{dw_4}{dw_1} = w_2 + \cos(w_1)$$

And again, given known inputs, we can calculate it:

$$\frac{dz}{dw_1} = w_2 + \cos(w_1) = 3 + \cos(2) \approx 2.58$$

Since $w_1$ and $w_2$ are just aliases for $x_1$ and $x_2$, we get our answer:

$$\frac{dz}{dx_1} \approx 2.58$$
$$\frac{dz}{dx_2} = 2$$

And that's it!
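
As a quick sanity check (my addition, not part of the original derivation), both results can be compared against central finite differences of $z$ itself:

```python
import math

def z(x1, x2):
    return x1 * x2 + math.sin(x1)

eps = 1e-6
x1, x2 = 2.0, 3.0
print((z(x1 + eps, x2) - z(x1 - eps, x2)) / (2 * eps))  # ≈ 2.5839 = 3 + cos(2)
print((z(x1, x2 + eps) - z(x1, x2 - eps)) / (2 * eps))  # ≈ 2.0
```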


This description concerns only scalar inputs, i.e. numbers, but in fact it can also be applied to multidimensional arrays such as vectors and matrices. Two things that one should keep in mind when differentiating expressions with such objects:

  1. Derivatives may have much higher dimensionality than the inputs or output, e.g. the derivative of a vector w.r.t. a vector is a matrix, and the derivative of a matrix w.r.t. a matrix is a 4-dimensional array (sometimes referred to as a tensor). In many cases such derivatives are very sparse.
  2. Each component of the output array is an independent function of one or more components of the input array(s). E.g. if $y = f(x)$ and both $x$ and $y$ are vectors, $y_i$ never depends on $y_j$, but only on a subset of the $x_k$. In particular, this means that finding the derivative $\frac{dy_i}{dx_j}$ boils down to tracking how $y_i$ depends on $x_j$ (see the sketch after this list).
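
A tiny NumPy sketch of both points (the elementwise function here is purely illustrative): for an elementwise $y = \sin(x)$, each $y_i$ depends only on $x_i$, so the Jacobian is diagonal, i.e. very sparse.

```python
import numpy as np

# Illustrative elementwise function: y = sin(x), with x and y vectors.
x = np.array([0.5, 1.0, 2.0])
J = np.diag(np.cos(x))   # dy_i/dx_j = cos(x_i) if i == j, else 0
print(J)
```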

The power of automatic differentiation is that it can deal with complicated structures from programming languages like conditions and loops. However, if all you need is algebraic expressions and you have a good enough framework to work with symbolic representations, it's possible to construct fully symbolic expressions. In fact, in this example we could produce the expression $\frac{dz}{dw_1} = w_2 + \cos(w_1) = x_2 + \cos(x_1)$ and calculate this derivative for whatever inputs we want.
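
For instance, a symbolic framework such as SymPy reproduces exactly that expression (this snippet is just an illustration of the symbolic route, not something the original answer used):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
z = x1 * x2 + sp.sin(x1)

dz_dx1 = sp.diff(z, x1)
print(dz_dx1)                                 # x2 + cos(x1)
print(dz_dx1.subs({x1: 2, x2: 3}).evalf())    # ≈ 2.5839
```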


Very useful question/answer. Thanks. Just a little criticism: you seem to move to a tree structure without explaining it (that's when you start talking about parents, etc.).
MadHatter

Also it won't hurt clarifying why we need seeds.
MadHatter

@MadHatter thanks for the comment. I tried to rephrase a couple of paragraphs (the ones that refer to parents) to emphasize the graph structure. I also added "seed" to the text, although this name itself may be misleading in my opinion: in AD the seed is always a fixed expression - $\frac{dz}{dz} = 1$ - not something you can choose or generate.
ffriend

Thanks! I noticed that when you have to set more than one "seed", one generally chooses 1 and 0. I'd like to know why. I mean, one takes the "quotient" of a differential w.r.t. itself, so "1" is at least intuitively justified. But what about 0? And what if one has to pick more than 2 seeds?
MadHatter

As far as I understand, more than one seed is used only in forward-mode AD. In this case you set the seed to 1 for an input variable you want to differentiate with respect to and set the seed to 0 for all the other input variables so that they don't contribute to the output value. In reverse-mode you set the seed to an output variable, and you normally have only one output variable. I guess, you can construct reverse-mode AD pipeline with several output variables and set all of them but one to 0 to get the same effect as in forward mode, but I have never investigated this option.
ffriend
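
To illustrate the seeding described in the comment above, here is a minimal dual-number sketch of forward-mode AD (the `Dual` class and `dsin` helper are purely illustrative): differentiating w.r.t. $x_1$ means seeding $x_1$'s derivative slot with 1 and $x_2$'s with 0, so only $x_1$'s contribution propagates.

```python
import math

# Each value carries (value, derivative); the derivative slot of an
# input variable is its seed.
class Dual:
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __mul__(self, other):    # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    def __add__(self, other):    # sum rule
        return Dual(self.val + other.val, self.dot + other.dot)

def dsin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

# Differentiate z = x1*x2 + sin(x1) w.r.t. x1: seed x1 with 1, x2 with 0
x1 = Dual(2.0, 1.0)
x2 = Dual(3.0, 0.0)
z = x1 * x2 + dsin(x1)
print(z.val, z.dot)  # ≈ 6.909 and ≈ 2.5839 = 3 + cos(2)
```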