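Deep neural network: backpropagation with ReLU

I'm having some difficulty deriving backpropagation with ReLU. I did some work, but I'm not sure whether I'm on the right track.

Cost function: $\frac{1}{2}(y-\hat y)^2$, where $y$ is the true value and $\hat y$ is the predicted value. Also assume that $x$ > 0 always.

1 layer ReLU, where the weight of the first layer is $w_1$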

$\frac{dC}{dw_1}=\frac{dC}{dR}\frac{dR}{dw_1}$

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

2 layer ReLU, where the weight of the first layer is $w_2$, the weight of the second layer is $w_1$, and I want to update the first layer's weight $w_2$

$\frac{dC}{dw_2}=\frac{dC}{dR}\frac{dR}{dw_2}$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

Since $ReLU(w_1*ReLU(w_2x))=w_1w_2x$

3 layer ReLU, where the weight of the first layer is $w_3$, the weight of the second layer is $w_2$, and the weight of the third layer is $w_1$

$\frac{dC}{dw_3}=\frac{dC}{dR}\frac{dR}{dw_3}$

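$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

Since $ReLU(w_1*ReLU(w_2*ReLU(w_3x)))=w_1w_2w_3x$

Since the chain rule only lasts for 2 derivatives, compared to a sigmoid, where it could be as long as $n$ layers.

Say I want to update all 3 layer weights, where $w_1$ is the third layer, $w_2$ is the second layer, and $w_3$ is the first layer

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

If this derivation is correct, how does this prevent vanishing? With sigmoid there are a lot of multiplications by 0.25 in the equation, whereas ReLU does not have any constant value multiplication. If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?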

neural-network backpropagation

— username

@NeilSlater Thanks for your reply! Could you elaborate? I'm not sure what you meant.

— user1157751

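Ah, I think I know what you meant. Well, the reason I asked this question is that I want to be sure whether the derivation is correct. I searched around and didn't find an example of ReLU backpropagation derived completely from scratch.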

— user1157751

Working definitions of the ReLU function and its derivative:

$ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ x, & \text{otherwise}. \end{cases}$

$\frac{d}{dx} ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1, & \text{otherwise}. \end{cases}$

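The derivative is the unit step function. This does ignore a problem at $x=0$, where the gradient is not strictly defined, but that is not a practical concern for neural networks. With the above formula, the derivative at 0 is 1, but you could equally treat it as 0 or 0.5 with no real impact on neural network performance.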
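As a minimal sketch, the two definitions above could be written with NumPy as follows. The function names `relu` and `step` are illustrative choices, and the derivative at exactly 0 is set to 1 only to match the convention used in the formula.

```python
import numpy as np

def relu(x):
    """ReLU: 0 for negative inputs, x otherwise."""
    return np.maximum(x, 0.0)

def step(x):
    """Derivative of ReLU under the convention above: 0 for x < 0, 1 otherwise."""
    return (np.asarray(x) >= 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))  # values 0, 0, 3
print(step(x))  # values 0, 1, 1
```

Changing `>=` to `>` in `step` would give the alternative convention where the derivative at 0 is 0.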

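Simplified network

With those definitions, let's take a look at your example networks.

You are running regression with cost function $C = \frac{1}{2}(y-\hat{y})^2$. You have defined $R$ as the output of the artificial neuron, but you have not defined an input value. I'll add that for completeness and call it $z$, add some indexing by layer, and use lower case for vectors and upper case for matrices. So $r^{(1)}$ is the output of the first layer, $z^{(1)}$ is its input, and $W^{(0)}$ is the weight connecting the neuron to its input $x$ (in a larger network, that might connect to a deeper $r$ value instead). I have also adjusted the index number for the weight matrix; why that is will become clearer for the larger network. NB I am ignoring having more than one neuron in each layer for now.

Looking at your simple 1 layer, 1 neuron network, the feed-forward equations are: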

$z^{(1)} = W^{(0)}x$

$\hat{y} = r^{(1)} = ReLU(z^{(1)})$

The derivative of the cost function with respect to the estimate for an example is:

$\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}} = \frac{\partial}{\partial r^{(1)}}\frac{1}{2}(y-r^{(1)})^2 = \frac{1}{2}\frac{\partial}{\partial r^{(1)}}(y^2 - 2yr^{(1)} + (r^{(1)})^2) = r^{(1)} - y$

Using the chain rule to back propagate to the pre-transform ( $z$ ) value:

$\frac{\partial C}{\partial z^{(1)}} = \frac{\partial C}{\partial r^{(1)}} \frac{\partial r^{(1)}}{\partial z^{(1)}} = (r^{(1)} - y)Step(z^{(1)}) = (ReLU(z^{(1)}) - y)Step(z^{(1)})$

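This $\frac{\partial C}{\partial z^{(1)}}$ is an interim stage and a critical part of backprop, linking the steps together. Derivations often skip this part, because clever combinations of cost function and output layer mean that it simplifies. Here it does not.

To get the gradient with respect to the weight $W^{(0)}$ , it is another iteration of the chain rule: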

$\frac{\partial C}{\partial W^{(0)}} = \frac{\partial C}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(0)}} = (ReLU(z^{(1)}) - y)Step(z^{(1)})x = (ReLU(W^{(0)}x) - y)Step(W^{(0)}x)x$

... because $z^{(1)} = W^{(0)}x$ , therefore $\frac{\partial z^{(1)}}{\partial W^{(0)}} = x$

That is the full solution for your simplest network.
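One way to sanity-check the single-neuron result is to compare it against a finite-difference estimate. The sketch below assumes scalar `w`, `x` and `y` with arbitrary illustrative values; the helper names `cost` and `grad_w` are not from the derivation, they are only labels for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def step(z):
    # Derivative of ReLU, using the convention that it is 1 at z = 0.
    return float(z >= 0)

def cost(w, x, y):
    # C = 1/2 (y - ReLU(w x))^2 for the 1 layer, 1 neuron network.
    return 0.5 * (y - relu(w * x)) ** 2

def grad_w(w, x, y):
    # dC/dW = (ReLU(z) - y) * Step(z) * x, with z = w x.
    z = w * x
    return (relu(z) - y) * step(z) * x

w, x, y = 0.7, 2.0, 3.0  # arbitrary illustrative values
eps = 1e-6
numeric = (cost(w + eps, x, y) - cost(w - eps, x, y)) / (2 * eps)
analytic = grad_w(w, x, y)
print(analytic, numeric)  # both should be close to -3.2
```

Note the sign: the analytic gradient uses $ReLU(z) - y$, not $y - ReLU(z)$, and it includes the $Step(z)$ factor.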

However, in a layered network, you also need to carry the same logic down to the next layer. Also, you typically have more than one neuron in a layer.

More general ReLU network

If we add in more generic terms, then we can work with two arbitrary layers. Call them Layer $(k)$ indexed by $i$ , and Layer $(k+1)$ indexed by $j$ . The weights are now a matrix. So our feed-forward equations look like this:

$z^{(k+1)}_j = \sum_{\forall i} W^{(k)}_{ij}r^{(k)}_i$

$r^{(k+1)}_j = ReLU(z^{(k+1)}_j)$

In the output layer, the initial gradient w.r.t. $r^{output}_j$ is still $r^{output}_j - y_j$ . However, ignore that for now, and look at the generic way to back propagate, assuming we have already found $\frac{\partial C}{\partial r^{(k+1)}_j}$ - just note that this is ultimately where the output cost function gradients come from. Then there are 3 equations we can write out following the chain rule:

First we need to get to the neuron input before applying ReLU:

$\frac{\partial C}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j} \frac{\partial r^{(k+1)}_j}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j}Step(z^{(k+1)}_j)$

We also need to propagate the gradient to previous layers, which involves summing up all connected influences to each neuron:

$\frac{\partial C}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} W^{(k)}_{ij}$

And we need to connect this to the weights matrix in order to make adjustments later:

$\frac{\partial C}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} r^{(k)}_{i}$

You can resolve these further (by substituting in previous values), or combine them (often steps 1 and 2 are combined to relate pre-transform gradients layer by layer). However the above is the most general form. You can also substitute the $Step(z^{(k+1)}_j)$ in equation 1 for whatever the derivative function is of your current activation function - this is the only place where it affects the calculations.
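The three equations can be sketched in NumPy as a single backward step per layer. This is an illustration under stated assumptions, not a full training loop: the names `layer_forward` and `layer_backward` are made up for the example, `W` is stored with shape (inputs, outputs) to match $W^{(k)}_{ij}$, the cost for several outputs is taken to be the sum of the per-output squared errors, and all weights and inputs are drawn positive so every ReLU is active and the finite-difference check is well behaved.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def step(z):
    return (z >= 0).astype(float)

def layer_forward(W, r_prev):
    # z_j = sum_i W_ij r_i, then r_j = ReLU(z_j).
    z = r_prev @ W
    return z, relu(z)

def layer_backward(W, r_prev, z, dC_dr):
    dC_dz = dC_dr * step(z)          # equation 1: through the activation
    dC_dr_prev = W @ dC_dz           # equation 2: sum_j dC/dz_j * W_ij
    dC_dW = np.outer(r_prev, dC_dz)  # equation 3: dC/dz_j * r_i
    return dC_dr_prev, dC_dW

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 1.5, size=3)
y = rng.uniform(0.5, 1.5, size=2)
W0 = rng.uniform(0.1, 1.0, size=(3, 4))
W1 = rng.uniform(0.1, 1.0, size=(4, 2))

def total_cost(W0, W1):
    _, r1 = layer_forward(W0, x)
    _, r2 = layer_forward(W1, r1)
    return 0.5 * np.sum((y - r2) ** 2)

# Forward pass, then back propagate layer by layer.
z1, r1 = layer_forward(W0, x)
z2, r2 = layer_forward(W1, r1)
dC_dr2 = r2 - y
dC_dr1, dC_dW1 = layer_backward(W1, r1, z2, dC_dr2)
_, dC_dW0 = layer_backward(W0, x, z1, dC_dr1)

# Finite-difference check on one entry of the first weight matrix.
eps = 1e-6
W0_plus = W0.copy()
W0_plus[1, 2] += eps
W0_minus = W0.copy()
W0_minus[1, 2] -= eps
numeric = (total_cost(W0_plus, W1) - total_cost(W0_minus, W1)) / (2 * eps)
print(dC_dW0[1, 2], numeric)
```

Swapping `step` for another activation's derivative is the only change needed in `layer_backward` to use a different transfer function.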

Back to your questions:

If this derivation is correct, how does this prevent vanishing?

Your derivation was not correct. However, that does not completely address your concerns.

The difference between using sigmoid versus ReLU is just in the step function compared to e.g. sigmoid's $y(1-y)$ , applied once per layer. As you can see from the generic layer-by-layer equations above, the gradient of the transfer function appears in one place only. The sigmoid's best case derivative adds a factor of 0.25 (when $x = 0, y = 0.5$ ), and it gets worse than that and saturates quickly to near zero derivative away from $x=0$ . The ReLU's gradient is either 0 or 1, and in a healthy network will be 1 often enough to have less gradient loss during backpropagation. This is not guaranteed, but experiments show that ReLU has good performance in deep networks.
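To get a rough feel for the numbers, the sketch below compares only the activation-derivative factors accumulated over $n$ layers, ignoring the weights entirely. It assumes the sigmoid's best case of 0.25 at every layer and an always-active ReLU, which is an idealisation rather than what a real network would show.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid, y(1 - y) with y = sigmoid(x).
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))  # best case: 0.25
print(sigmoid_grad(5.0))  # already well below 0.25 away from x = 0

for n in (5, 10, 20):
    sigmoid_factor = 0.25 ** n  # best-case sigmoid factor over n layers
    relu_factor = 1.0 ** n      # ReLU factor when every unit is active
    print(n, sigmoid_factor, relu_factor)
```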

If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

Yes this can have an impact too. This can be a problem regardless of transfer function choice. In some combinations, ReLU may help keep exploding gradients under control too, because it does not saturate (so large weight norms will tend to be poor direct solutions and an optimiser is unlikely to move towards them). However, this is not guaranteed.
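The effect of the weights alone can be illustrated with a deliberately simplified chain: one neuron per layer, every layer sharing the same weight `w`, and a positive input so each ReLU stays active. These are assumptions made for the sketch, and `input_gradient` is a hypothetical helper name.

```python
def input_gradient(w, n_layers, x=1.0):
    """Gradient of the last output w.r.t. x for a chain of one-neuron ReLU layers."""
    r, grad = x, 1.0
    for _ in range(n_layers):
        z = w * r
        active = 1.0 if z >= 0 else 0.0  # ReLU derivative at this layer
        r = max(z, 0.0)
        grad *= active * w               # each layer contributes Step(z) * w
    return grad

for w in (0.5, 1.0, 1.5):
    print(w, input_gradient(w, 50))
```

With 50 layers the gradient shrinks towards zero for `w = 0.5`, stays at 1 for `w = 1.0`, and grows very large for `w = 1.5`, even though the ReLU derivative itself is always 1 here.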

— Neil Slater

Is the chain rule being applied to $\frac{dC}{d \hat y}$ ?

— user1157751

@user1157751: No, $\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}}$ because $\hat{y} = r^{(1)}$ . The cost function C is simple enough that you can take its derivative immediately. The only thing I haven't shown is the expansion of the square - would you like me to add it?

— Neil Slater

But $C$ is $\frac{1}{2}(y- \hat y)^2$ , don't we need to perform the chain rule so that we can take the derivative with respect to $\hat y$ ? $\frac{dC}{d \hat y}=\frac{dC}{dU}\frac{dU}{d \hat y}$ , where $U = y - \hat y$ . Apologies for asking really simple questions, my maths ability is probably causing trouble for you :(

— user1157751

If expanding makes things simpler, then please do expand the square.

— user1157751

@user1157751: Yes, you could use the chain rule in that way, and it gives the same answer as the one I show. I just expanded the square - I'll show it.

— Neil Slater