Step-by-step example of reverse-mode automatic differentiation



Not sure whether this question belongs here, but it's closely related to gradient methods in optimization, which seem to be on topic here. Anyway, feel free to migrate it if you think some other community has better expertise in the topic.

In short, I'm looking for a step-by-step example of reverse-mode automatic differentiation. There isn't much literature on the topic, and the existing implementations (such as the one in TensorFlow) are hard to understand without knowing the theory behind them. So I'd be very grateful if someone could show in detail what we pass in, how we process it, and what we take out of the computational graph.

A couple of questions I'm struggling with the most:

  • seeds - why do we need them at all?
  • reverse differentiation rules - I know how to differentiate forwards, but how do we go backwards? E.g. in the example from this section, how do we know that $\bar{w}_2 = \bar{w}_3 w_1$?
  • do we work with symbols or actual values? E.g. in the same example, are $w_i$ and $\bar{w}_i$ symbols or values?

I think Appendix D of "Hands-On Machine Learning with Scikit-Learn and TensorFlow" gives a very good explanation of this. I recommend it.
Agustin Barrachina

Answers:



Let's say we have the expression $z = x_1 x_2 + \sin(x_1)$ and want to find the derivatives $\frac{dz}{dx_1}$ and $\frac{dz}{dx_2}$. Reverse-mode AD splits this task into two parts, namely the forward and reverse passes.

Forward pass

First, we decompose our complex expression into a set of primitive expressions, i.e. expressions consisting of at most a single function call. Note that I also rename the input and output variables for consistency, although it's not strictly necessary:

$$w_1 = x_1$$
$$w_2 = x_2$$
$$w_3 = w_1 w_2$$
$$w_4 = \sin(w_1)$$
$$w_5 = w_3 + w_4$$
$$z = w_5$$

The advantage of this representation is that the differentiation rules for each separate expression are already known. For example, we know that the derivative of $\sin$ is $\cos$, so $\frac{dw_4}{dw_1} = \cos(w_1)$. We will use this fact in the reverse pass below.

Essentially, the forward pass consists of evaluating each of these expressions and saving the results. Say, our inputs are $x_1 = 2$ and $x_2 = 3$. Then we have:

$$w_1 = x_1 = 2$$
$$w_2 = x_2 = 3$$
$$w_3 = w_1 w_2 = 6$$
$$w_4 = \sin(w_1) \approx 0.9$$
$$w_5 = w_3 + w_4 \approx 6.9$$
$$z = w_5 \approx 6.9$$
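
For concreteness, here is a minimal Python sketch of the forward pass (the `w` dictionary is just illustrative bookkeeping, not any particular library's API): each primitive is evaluated once and its result is saved for reuse by the reverse pass.

```python
import math

# Inputs
x1, x2 = 2.0, 3.0

# Evaluate each primitive expression once and save the result;
# the reverse pass reuses these cached values.
w = {}
w[1] = x1                 # w1 = x1 = 2
w[2] = x2                 # w2 = x2 = 3
w[3] = w[1] * w[2]        # w3 = w1 * w2 = 6
w[4] = math.sin(w[1])     # w4 = sin(w1) ≈ 0.909
w[5] = w[3] + w[4]        # w5 = w3 + w4 ≈ 6.909
z = w[5]
print(z)                  # ≈ 6.909
```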

Reverse pass

This is where the magic starts, and it starts with the chain rule. In its basic form, the chain rule states that if you have a variable $t(u(v))$ that depends on $u$, which in turn depends on $v$, then:

$$\frac{dt}{dv} = \frac{dt}{du} \frac{du}{dv}$$

or, if $t$ depends on $v$ via several paths / variables $u_i$, e.g.:

$$u_1 = f(v)$$
$$u_2 = g(v)$$
$$t = h(u_1, u_2)$$

then (see a proof here):

$$\frac{dt}{dv} = \sum_i \frac{dt}{du_i} \frac{du_i}{dv}$$
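
To see this summed chain rule in action, here is a small numeric check with one hypothetical choice of $f$, $g$ and $h$ (the functions and the finite-difference comparison are purely illustrative, not part of the original answer):

```python
import math

# Hypothetical concrete instance of the multi-path chain rule:
# u1 = f(v) = v**2, u2 = g(v) = sin(v), t = h(u1, u2) = u1 * u2
def t_of(v):
    return (v ** 2) * math.sin(v)

v = 1.5
u1, u2 = v ** 2, math.sin(v)

# Sum over the two paths: dt/dv = dt/du1 * du1/dv + dt/du2 * du2/dv
dt_dv = u2 * (2 * v) + u1 * math.cos(v)

# Central finite difference as an independent check
eps = 1e-6
numeric = (t_of(v + eps) - t_of(v - eps)) / (2 * eps)
print(dt_dv, numeric)  # the two values agree to ~1e-9
```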

In terms of expression graphs, if we have a final node $z$ and input nodes $w_i$, and the path from $z$ to $w_i$ goes through intermediate nodes $w_p$ (i.e. $z = g(w_p)$ where $w_p = f(w_i)$), we can find the derivative $\frac{dz}{dw_i}$ as:

$$\frac{dz}{dw_i} = \sum_{p \in \mathrm{parents}(i)} \frac{dz}{dw_p} \frac{dw_p}{dw_i}$$

In other words, to calculate the derivative of the output variable $z$ w.r.t. any intermediate or input variable $w_i$, we only need to know the derivatives w.r.t. its parents and the formula for the derivative of the primitive expression $w_p = f(w_i)$.
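This recurrence translates almost directly into code. The following Python sketch is only illustrative: the `parents` table hard-codes this example's graph and local derivatives, whereas a real AD system would build it automatically while tracing the forward pass.

```python
import math

# Forward pass: evaluate primitives and cache the values
x1, x2 = 2.0, 3.0
w = {1: x1, 2: x2}
w[3] = w[1] * w[2]
w[4] = math.sin(w[1])
w[5] = w[3] + w[4]

# parents[i] = list of (p, dw_p/dw_i), evaluated at the cached values
parents = {
    5: [],                                  # output node
    4: [(5, 1.0)],                          # w5 = w3 + w4
    3: [(5, 1.0)],
    2: [(3, w[1])],                         # w3 = w1 * w2
    1: [(3, w[2]), (4, math.cos(w[1]))],    # w1 feeds both w3 and w4
}

adjoint = {5: 1.0}          # the seed: dz/dw5 = dz/dz = 1, since z = w5
for i in (4, 3, 2, 1):      # visit nodes in reverse topological order
    adjoint[i] = sum(adjoint[p] * d for p, d in parents[i])

print(adjoint[1], adjoint[2])  # ≈ 2.5839 and 2.0
```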

The reverse pass starts at the end (i.e. $\frac{dz}{dz}$) and propagates backwards to all dependencies. Here we have (the expression for the "seed"):

$$\frac{dz}{dz} = 1$$

That may be read as "a change in $z$ results in exactly the same change in $z$", which is pretty obvious.

Then we know that $z = w_5$, so:

$$\frac{dz}{dw_5} = 1$$

$w_5$ depends linearly on $w_3$ and $w_4$, so $\frac{dw_5}{dw_3} = 1$ and $\frac{dw_5}{dw_4} = 1$:

$$\frac{dz}{dw_3} = \frac{dz}{dw_5} \frac{dw_5}{dw_3} = 1 \times 1 = 1$$
$$\frac{dz}{dw_4} = \frac{dz}{dw_5} \frac{dw_5}{dw_4} = 1 \times 1 = 1$$

From the definition $w_3 = w_1 w_2$ and the rules of partial derivatives, we find that $\frac{dw_3}{dw_2} = w_1$. Thus:

$$\frac{dz}{dw_2} = \frac{dz}{dw_3} \frac{dw_3}{dw_2} = 1 \times w_1 = w_1$$

Which, as we already know from the forward pass, is:

$$\frac{dz}{dw_2} = w_1 = 2$$

Finally, $w_1$ contributes to $z$ via $w_3$ and $w_4$. Once again, from the rules of partial derivatives we know that $\frac{dw_3}{dw_1} = w_2$ and $\frac{dw_4}{dw_1} = \cos(w_1)$. Thus:

$$\frac{dz}{dw_1} = \frac{dz}{dw_3} \frac{dw_3}{dw_1} + \frac{dz}{dw_4} \frac{dw_4}{dw_1} = w_2 + \cos(w_1)$$

And again, given known inputs, we can calculate it:

$$\frac{dz}{dw_1} = w_2 + \cos(w_1) = 3 + \cos(2) \approx 2.58$$

Since $w_1$ and $w_2$ are just aliases for $x_1$ and $x_2$, we get our answer:

$$\frac{dz}{dx_1} \approx 2.58$$
$$\frac{dz}{dx_2} = 2$$

And that's it!
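
As a quick sanity check (my addition, not part of the original derivation), both results can be compared against central finite differences of $z$ itself:

```python
import math

def z(x1, x2):
    return x1 * x2 + math.sin(x1)

eps = 1e-6
x1, x2 = 2.0, 3.0
print((z(x1 + eps, x2) - z(x1 - eps, x2)) / (2 * eps))  # ≈ 2.5839 = 3 + cos(2)
print((z(x1, x2 + eps) - z(x1, x2 - eps)) / (2 * eps))  # ≈ 2.0
```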


This description concerns only scalar inputs, i.e. numbers, but in fact it can also be applied to multidimensional arrays such as vectors and matrices. Two things that one should keep in mind when differentiating expressions with such objects:

  1. Derivatives may have much higher dimensionality than the inputs or output, e.g. the derivative of a vector w.r.t. a vector is a matrix, and the derivative of a matrix w.r.t. a matrix is a 4-dimensional array (sometimes referred to as a tensor). In many cases such derivatives are very sparse.
  2. Each component of the output array is an independent function of one or more components of the input array(s). E.g. if $y = f(x)$ and both $x$ and $y$ are vectors, $y_i$ never depends on $y_j$, but only on a subset of the $x_k$. In particular, this means that finding the derivative $\frac{dy_i}{dx_j}$ boils down to tracking how $y_i$ depends on $x_j$ (see the sketch after this list).
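
A tiny NumPy sketch of both points (the elementwise function here is purely illustrative): for an elementwise $y = \sin(x)$, each $y_i$ depends only on $x_i$, so the Jacobian is diagonal, i.e. very sparse.

```python
import numpy as np

# Illustrative elementwise function: y = sin(x), with x and y vectors.
x = np.array([0.5, 1.0, 2.0])
J = np.diag(np.cos(x))   # dy_i/dx_j = cos(x_i) if i == j, else 0
print(J)
```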

The power of automatic differentiation is that it can deal with complicated structures from programming languages like conditions and loops. However, if all you need is algebraic expressions and you have a good enough framework to work with symbolic representations, it's possible to construct fully symbolic expressions. In fact, in this example we could produce the expression $\frac{dz}{dw_1} = w_2 + \cos(w_1) = x_2 + \cos(x_1)$ and calculate this derivative for whatever inputs we want.
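
For instance, a symbolic framework such as SymPy reproduces exactly that expression (this snippet is just an illustration of the symbolic route, not something the original answer used):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
z = x1 * x2 + sp.sin(x1)

dz_dx1 = sp.diff(z, x1)
print(dz_dx1)                                 # x2 + cos(x1)
print(dz_dx1.subs({x1: 2, x2: 3}).evalf())    # ≈ 2.5839
```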


Very useful question/answer. Thanks. Just a little criticism: you seem to move to a tree structure without explaining it (that's when you start talking about parents, etc.).
MadHatter

Also it won't hurt clarifying why we need seeds.
MadHatter

@MadHatter thanks for the comment. I tried to rephrase a couple of paragraphs (the ones that refer to parents) to emphasize the graph structure. I also added "seed" to the text, although this name itself may be misleading in my opinion: in AD the seed is always a fixed expression - $\frac{dz}{dz} = 1$ - not something you can choose or generate.
ffriend

Thanks! I noticed that when you have to set more than one "seed", one generally chooses 1 and 0. I'd like to know why. I mean, one takes the "quotient" of a differential w.r.t. itself, so "1" is at least intuitively justified. But what about 0? And what if one has to pick more than 2 seeds?
MadHatter

As far as I understand, more than one seed is used only in forward-mode AD. In this case you set the seed to 1 for an input variable you want to differentiate with respect to and set the seed to 0 for all the other input variables so that they don't contribute to the output value. In reverse-mode you set the seed to an output variable, and you normally have only one output variable. I guess, you can construct reverse-mode AD pipeline with several output variables and set all of them but one to 0 to get the same effect as in forward mode, but I have never investigated this option.
ffriend
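
To illustrate the seeding described in the comment above, here is a minimal dual-number sketch of forward-mode AD (the `Dual` class and `dsin` helper are purely illustrative): differentiating w.r.t. $x_1$ means seeding $x_1$'s derivative slot with 1 and $x_2$'s with 0, so only $x_1$'s contribution propagates.

```python
import math

# Each value carries (value, derivative); the derivative slot of an
# input variable is its seed.
class Dual:
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __mul__(self, other):    # product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    def __add__(self, other):    # sum rule
        return Dual(self.val + other.val, self.dot + other.dot)

def dsin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

# Differentiate z = x1*x2 + sin(x1) w.r.t. x1: seed x1 with 1, x2 with 0
x1 = Dual(2.0, 1.0)
x2 = Dual(3.0, 0.0)
z = x1 * x2 + dsin(x1)
print(z.val, z.dot)  # ≈ 6.909 and ≈ 2.5839 = 3 + cos(2)
```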