The error is adjusted via the cost function. In backpropagation, we are trying to adjust the weights of the hidden layers. The output error I can understand; that is e = d - y [no subscripts].
The questions are:
- How do we get the error of the hidden layers? How do we calculate it?
- If I backpropagate it, should I use it as the cost function of an adaptive filter, or should I update the weights in a pointer (in C/C++) programming sense?
Answers:
I figured I'd answer this here as a self-contained post for anyone interested. It will use the notation described here.
The idea behind backpropagation is to have a set of "training examples" that we use to train our network. Each of these has a known answer, so we can plug them into the neural network and find out how wrong it is.
For example, with handwriting recognition, you would have many samples of handwritten characters paired with the actual characters they represent. The neural network can then be trained via backpropagation to "learn" how to recognize each symbol, so that when it is later given an unknown handwritten character, it can identify the correct one.
Specifically, we feed some training sample into the neural network, see how well it does, then "trickle backwards" to discover how much we can modify each node's weights and bias to get a better result, and then adjust accordingly. As we keep doing this, the network "learns".
There can be other steps included in the training process (for example, dropout), but I will focus mostly on backpropagation, since that is what this question is about.
A partial derivative $\frac{\partial f}{\partial x}$ is the derivative of $f$ with respect to some variable $x$.
For example, if $f(x, y) = x^2 + y^2$, then $\frac{\partial f}{\partial x} = 2x$, since $y^2$ is simply a constant with respect to $x$. Likewise, $\frac{\partial f}{\partial y} = 2y$, since $x^2$ is simply a constant with respect to $y$.
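As a quick sanity check, here is a minimal Python sketch (the helper `numeric_partial` is my own, not from the post) comparing these analytic partials against central finite differences:

```python
def f(x, y):
    return x**2 + y**2

def numeric_partial(g, x, y, wrt="x", h=1e-6):
    # Central finite-difference approximation of a partial derivative.
    if wrt == "x":
        return (g(x + h, y) - g(x - h, y)) / (2 * h)
    return (g(x, y + h) - g(x, y - h)) / (2 * h)

x, y = -2.0, 4.0
print(numeric_partial(f, x, y, "x"), 2 * x)  # both approximately -4.0
print(numeric_partial(f, x, y, "y"), 2 * y)  # both approximately 8.0
```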
The gradient of a function $f$, designated $\nabla f$, is a function containing the partial derivative of $f$ for every one of its variables. Specifically:

$$\nabla f(v_1, v_2, \dots, v_n) = \frac{\partial f}{\partial v_1} \vec{e}_1 + \frac{\partial f}{\partial v_2} \vec{e}_2 + \dots + \frac{\partial f}{\partial v_n} \vec{e}_n,$$

where $\vec{e}_i$ is the unit vector pointing in the direction of variable $v_i$.
Now, once we have computed $\nabla f$ for some function $f$, if we are at position $(v_1, v_2, \dots, v_n)$, we can "slide down" $f$ by moving in the direction $-\nabla f(v_1, v_2, \dots, v_n)$.
With our example $f(x, y) = x^2 + y^2$, the unit vectors are $\vec{e}_1 = (1, 0)$ and $\vec{e}_2 = (0, 1)$, since $v_1 = x$ and $v_2 = y$ and these vectors point in the direction of the $x$ and $y$ axes. Thus, $\nabla f(x, y) = 2x(1, 0) + 2y(0, 1)$.
Now, to "slide down" our function, let's say we are at the point $(-2, 4)$. Then we need to move in the direction $-\nabla f(-2, 4) = -(2 \cdot (-2) \cdot (1, 0) + 2 \cdot 4 \cdot (0, 1)) = -((-4, 0) + (0, 8)) = (4, -8)$.
The magnitude of this vector tells us how steep the hill is (the larger the value, the steeper the hill). In this case, we have $\|(-4, 8)\| = \sqrt{(-4)^2 + 8^2} = \sqrt{80} \approx 8.94$.
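To make the "sliding down" concrete, here is a small Python sketch that repeatedly steps from $(-2, 4)$ in the direction $-\nabla f$; the step size of 0.1 is an arbitrary choice of mine:

```python
def grad_f(x, y):
    # Gradient of f(x, y) = x^2 + y^2.
    return (2 * x, 2 * y)

x, y = -2.0, 4.0
step = 0.1  # arbitrary small step size
for _ in range(50):
    gx, gy = grad_f(x, y)
    x, y = x - step * gx, y - step * gy
print(x, y)  # both approach 0, the minimum of f
```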
The Hadamard product of two matrices is just like matrix addition, except that instead of adding the matrices element-wise, we multiply them element-wise.
Formally, while matrix addition is $A + B = C$, where $C \in \mathbb{R}^{n \times m}$ such that

$$C_{i,j} = A_{i,j} + B_{i,j},$$

the Hadamard product is $A \odot B = C$, where $C \in \mathbb{R}^{n \times m}$ such that

$$C_{i,j} = A_{i,j} \cdot B_{i,j}.$$
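In code, the Hadamard product is just element-wise multiplication. Assuming NumPy, for example:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A + B)  # matrix addition: element-wise sums
print(A * B)  # Hadamard product: element-wise products
```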
(Most of this section is from Nielsen's book.)
We have a set of training samples $\{(S^r, E^r)\}$, where $S^r$ is a single input training sample and $E^r$ is the expected output value of that training sample. We also have our neural network, consisting of weights $W$ and biases $B$. $r$ is used to prevent confusion with the $i$, $j$, and $k$ used in the definition of a feedforward network.
Next, we define the cost function, $C(W, B, S^r, E^r)$, that takes in our neural network and a single training example, and outputs how good it did.
Commonly used is the quadratic cost, defined by

$$C(W, B, S^r, E^r) = 0.5 \sum_j \left(a^L_j - E^r_j\right)^2,$$

where $a^L$ is the output of our neural network, given input sample $S^r$.
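As a sketch, the quadratic cost for a single sample could look like this in NumPy (the function name is mine):

```python
import numpy as np

def quadratic_cost(a_L, E_r):
    # C(W, B, S^r, E^r) = 0.5 * sum_j (a_j^L - E_j^r)^2,
    # where a_L is the network's output for input sample S^r.
    return 0.5 * np.sum((a_L - E_r) ** 2)

print(quadratic_cost(np.array([0.8, 0.2]), np.array([1.0, 0.0])))  # 0.04
```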
Then we want to find $\frac{\partial C}{\partial w^k_{j,i}}$ and $\frac{\partial C}{\partial b^k_j}$ for each node in our feedforward neural network.
We can call this the gradient of $C$ at each neuron because we consider $S^r$ and $E^r$ as constants, since we can't change them when we are trying to learn. And this makes sense - we want to move in a direction relative to $W$ and $B$ that minimizes cost, and moving in the negative direction of the gradient with respect to $W$ and $B$ will do this.
To do this, we define $\delta^k_j$ as the error of neuron $j$ in layer $k$.
We start with computing $a^L$ by plugging $S^r$ into our neural network.
Then we compute the error of our output layer, $\delta^L$, via

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j),$$

which can also be written as

$$\delta^L = \nabla_a C \odot \sigma'(z^L).$$

Next, we find the error $\delta^k$ in terms of the error in the next layer $\delta^{k+1}$, via

$$\delta^k = \left((W^{k+1})^T \delta^{k+1}\right) \odot \sigma'(z^k).$$

Now that we have the error of each node in our neural network, computing the gradient with respect to our weights and biases is easy:

$$\frac{\partial C}{\partial w^k_{j,i}} = \delta^k_j a^{k-1}_i$$

$$\frac{\partial C}{\partial b^k_j} = \delta^k_j$$
Note that the equation for the error of the output layer is the only equation that's dependent on the cost function, so, regardless of the cost function, the last three equations are the same.
As an example, with quadratic cost, we get

$$\delta^L = (a^L - E^r) \odot \sigma'(z^L)$$

for the error of the output layer. This equation can then be plugged into the second equation to get the error of the $L-1$ layer:

$$\delta^{L-1} = \left((W^L)^T \delta^L\right) \odot \sigma'(z^{L-1}) = \left((W^L)^T \left((a^L - E^r) \odot \sigma'(z^L)\right)\right) \odot \sigma'(z^{L-1}).$$

We can repeat this process to find the error of any layer with respect to $C$, which then allows us to compute the gradient of any node's weights and bias with respect to $C$.
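Putting these equations together, here is a minimal NumPy sketch of one backward pass for a fully-connected network with sigmoid activations and quadratic cost. The layer-list representation and function names are my own choices, not from the notation post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(weights, biases, s, e):
    """One backward pass; s is the input sample S^r, e is the expected output E^r."""
    # Forward pass: store every weighted input z^k and activation a^k.
    a, activations, zs = s, [s], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    # Output-layer error with quadratic cost: delta^L = (a^L - E^r) * sigma'(z^L).
    delta = (activations[-1] - e) * sigmoid_prime(zs[-1])
    grads_w = [None] * len(weights)
    grads_b = [None] * len(biases)
    grads_w[-1] = np.outer(delta, activations[-2])  # dC/dw = delta * a^{k-1}
    grads_b[-1] = delta                             # dC/db = delta

    # Propagate backwards: delta^k = ((W^{k+1})^T delta^{k+1}) * sigma'(z^k).
    for k in range(len(weights) - 2, -1, -1):
        delta = (weights[k + 1].T @ delta) * sigmoid_prime(zs[k])
        grads_w[k] = np.outer(delta, activations[k])
        grads_b[k] = delta
    return grads_w, grads_b
```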
I could write up an explanation and proof of these equations if desired, though one can also find proofs of them here. I'd encourage anyone reading this to prove them themselves though, beginning with the definition $\delta^k_j = \frac{\partial C}{\partial z^k_j}$ and applying the chain rule liberally.
For some more examples, I made a list of some cost functions alongside their gradients here.
Now that we have these gradients, we need to use them to learn. In the previous section, we found how to move to "slide down" the curve from some point. In this case, because it is the gradient of some node with respect to the weights and bias of that node, our "coordinate" is the current weights and bias of that node. Since we've already found the gradients with respect to those coordinates, those values are already how much we need to change.
We don't want to slide down the slope at a very fast speed, otherwise we risk sliding past the minimum. To prevent this, we want some "step size" $\eta$.
Then, to find how much we should modify each weight and bias by, we use the gradients we have already computed with respect to the current weights and biases:

$$\Delta w^k_{j,i} = -\eta \delta^k_j a^{k-1}_i$$

$$\Delta b^k_j = -\eta \delta^k_j$$

Thus, our new weights and biases are

$$w^k_{j,i} = w^k_{j,i} + \Delta w^k_{j,i}$$

$$b^k_j = b^k_j + \Delta b^k_j$$
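Continuing the sketch above, the update itself is then just a scaled step against the gradient (the function name is mine):

```python
def apply_gradients(weights, biases, grads_w, grads_b, eta=0.1):
    # Delta w = -eta * dC/dw, so w <- w + Delta w is w - eta * dC/dw
    # (likewise for the biases); eta = 0.1 is an arbitrary default.
    new_weights = [W - eta * gW for W, gW in zip(weights, grads_w)]
    new_biases = [b - eta * gb for b, gb in zip(biases, grads_b)]
    return new_weights, new_biases
```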
Using this process on a neural network with only an input layer and an output layer is called the Delta Rule.
Now that we know how to perform backpropagation for a single sample, we need some way of using this process to "learn" our entire training set.
One option is simply performing backpropagation for each sample in our training data, one at a time. This is pretty inefficient though.
A better approach is Stochastic Gradient Descent. Instead of performing backpropagation for each sample, we pick a small random sample (called a batch) of our training set, then perform backpropagation for each sample in that batch. The hope is that by doing this, we capture the "intent" of the data set, without having to compute the gradient of every sample.
For example, if we had 1000 samples, we could pick a batch of size 50, then run backpropagation for each sample in this batch. The hope is that we were given a large enough training set that it represents the distribution of the actual data we are trying to learn well enough that picking a small random sample is sufficient to capture this information.
However, doing backpropagation for each training example and updating after every single one isn't ideal, because we can end up "wiggling around": individual training samples modify the weights and biases in ways that cancel each other out and prevent us from getting to the minimum we are trying to reach.
To prevent this, we want to go to the "average minimum," because the hope is that, on average, the samples' gradients point down the slope. So, after choosing our batch randomly, we create a mini-batch, which is a small random sample of our batch. Then, given a mini-batch with $n$ training samples, we only update the weights and biases after averaging the gradients of each sample in the mini-batch.
Formally, we do

$$\Delta w^k_{j,i} = \frac{1}{n} \sum_{r=1}^{n} \Delta w^{k,r}_{j,i}$$

and

$$\Delta b^k_j = \frac{1}{n} \sum_{r=1}^{n} \Delta b^{k,r}_j$$

where $\Delta w^{k,r}_{j,i}$ is the computed change in weight for sample $r$, and $\Delta b^{k,r}_j$ is the computed change in bias for sample $r$.
Then, like before, we can update the weights and biases via:

$$w^k_{j,i} = w^k_{j,i} + \Delta w^k_{j,i}$$

$$b^k_j = b^k_j + \Delta b^k_j$$
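Combining the `backprop` and `apply_gradients` sketches from above, one mini-batch update could look like this (simplified: the mini-batch is sampled directly from the training set):

```python
import random

def sgd_step(weights, biases, training_data, batch_size, eta):
    # Pick a small random sample (mini-batch) of the training set.
    batch = random.sample(training_data, batch_size)

    # Sum the gradients of every sample in the mini-batch...
    sum_w = [0 * W for W in weights]
    sum_b = [0 * b for b in biases]
    for s, e in batch:
        grads_w, grads_b = backprop(weights, biases, s, e)
        sum_w = [acc + g for acc, g in zip(sum_w, grads_w)]
        sum_b = [acc + g for acc, g in zip(sum_b, grads_b)]

    # ...then update once, using the averaged gradients.
    avg_w = [g / len(batch) for g in sum_w]
    avg_b = [g / len(batch) for g in sum_b]
    return apply_gradients(weights, biases, avg_w, avg_b, eta)
```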
This gives us some flexibility in how we want to perform gradient descent. If the function we are trying to learn has lots of local minima, this "wiggling around" behavior is actually desirable, because it means we are much less likely to get "stuck" in one local minimum, and more likely to "jump out" of it and hopefully fall into another one that is closer to the global minimum. Thus, we want small mini-batches.
On the other hand, if we know there are very few local minima and gradient descent generally heads towards the global minimum, we want larger mini-batches, because this "wiggling around" behavior will prevent us from going down the slope as fast as we would like. See here.
One option is to pick the largest mini-batch possible, considering the entire batch as one mini-batch. This is called Batch Gradient Descent, since we are simply averaging the gradients of the batch. This is almost never used in practice, however, because it is very inefficient.
I haven't dealt with Neural Networks for some years now, but I think you will find everything you need here:
Neural Networks - A Systematic Introduction, Chapter 7: The backpropagation algorithm
I apologize for not writing the direct answer here, but since I would have to look up the details to remember them (like you), and given that an answer without some backup may even be useless, I hope this is OK. However, if any questions remain, drop a comment and I'll see what I can do.