18

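I am trying to understand the role of the derivative of the sigmoid function in neural networks.

First I plot the sigmoid function, and its derivative at every point computed from the definition, using Python. What exactly is the role of this derivative?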

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def derivative(x, step):
    # Forward finite-difference approximation of the slope of sigmoid at x
    return (sigmoid(x + step) - sigmoid(x)) / step

x = np.linspace(-10, 10, 1000)

y1 = sigmoid(x)
# A step of 1e-6 keeps floating-point round-off small; 1e-13 gives a noisy curve
y2 = derivative(x, 1e-6)

plt.plot(x, y1, label='sigmoid')
plt.plot(x, y2, label='derivative')
plt.legend(loc='upper left')
plt.show()

machine-learning neural-network

— lukassz
source

2

If you have any further questions, feel free to ask.

— JahKnows

23

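The derivative is used in neural networks for the training process called backpropagation. This technique uses gradient descent to find an optimal set of model parameters that minimizes a loss function. In your example you must use the derivative of the sigmoid because that is the activation your single neuron is using.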

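The loss function

The essence of machine learning is to optimize a cost function such that we can either minimize or maximize some target function. This is typically called the loss or cost function. We typically want to minimize this function. The cost function, $C$, associates some penalty with the errors that result when data is passed through the model, as a function of the model parameters.

Let's look at an example where we try to label whether an image contains a cat or a dog. If we had a perfect model, we could give the model a picture and it would tell us whether it is a cat or a dog. However, no model is perfect and it will make mistakes.

When we train our model to infer meaning from input data, we want to minimize the number of mistakes it makes. So we use a training set: data containing many pictures of dogs and cats, each with its associated ground truth label. Each time we run a training iteration of the model, we calculate the cost (the amount of mistakes) of the model. We will want to minimize this cost.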

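Many cost functions exist, each serving its own purpose. A commonly used cost function is the quadratic cost, which is defined as

$C = \frac{1}{N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$.

This is the mean of the squared differences between the predicted label $\hat{y}_i$ and the ground truth label $y_i$ over the $N$ images that we trained on. We will want to minimize this in some way.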
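As a rough illustration of the quadratic cost, here is a minimal Python sketch. The function name `quadratic_cost` and the example labels are made up for illustration, with 1 standing for "dog" and 0 for "cat".

```python
import numpy as np

def quadratic_cost(y_pred, y_true):
    """Mean of the squared differences between predictions and labels."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

# Hypothetical labels: 1 = dog, 0 = cat
y_true = [1, 0, 1, 1]
good_predictions = [0.9, 0.1, 0.8, 0.7]
bad_predictions = [0.2, 0.9, 0.4, 0.1]

print(quadratic_cost(good_predictions, y_true))
print(quadratic_cost(bad_predictions, y_true))
```

Predictions that sit closer to the labels give a smaller cost, which is the quantity training tries to drive down.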

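Minimizing a loss function

Indeed, most of machine learning is simply a family of frameworks that are capable of determining a distribution by minimizing some cost function. The question we can ask is "how can we minimize a function?"

Let's minimize the following function

$y = x^2-4x+6$.

If we plot this we can see that there is a minimum at $x = 2$. To do this analytically we can take the derivative of this function and set it to zero:

$\frac{dy}{dx} = 2x - 4 = 0$

$x = 2$.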

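However, finding a global minimum analytically is often not feasible. So instead we use some optimization techniques. Here as well many different ways exist, such as Newton-Raphson, grid search, etc. Among these is gradient descent. This is the technique used by neural networks.

Gradient descent

Let's use a famous analogy to understand this. Imagine a 2D minimization problem. This is equivalent to being on a mountainous hike in the wilderness. You want to get back down to the village, which you know is at the lowest point. Even if you do not know the cardinal direction of the village, all you need to do is keep taking the steepest way down, and you will eventually get to the village. So we will descend the surface based on the steepness of the slope.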

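Let's take our function

$y = x^2-4x+6$

We will determine the $x$ for which $y$ is minimized. The gradient descent algorithm first says we will pick a random value for $x$. Let us initialize at $x=8$. Then the algorithm will do the following iteratively until we reach convergence.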

$x^{new} = x^{old} - \nu \frac{dy}{dx}$

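where $\nu$ is the learning rate, which we can set to whatever value we like. However, there is a smart way to choose it. Too large and we will never reach our minimum value; too small and we will waste a lot of time before we get there. It is analogous to the size of the steps you take down the steep slope. With steps that are too small you will die on the mountain, because you will never get down. With steps that are too large you risk overshooting the village and ending up on the other side of the mountain. The derivative is the means by which we travel down this slope towards our minimum.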

$\frac{dy}{dx} = 2x - 4$

$\nu = 0.1$

Iterations, one update per line:

$x^{new} = 8 - 0.1(2 * 8 - 4) = 6.8$
$x^{new} = 6.8 - 0.1(2 * 6.8 - 4) = 5.84$
$x^{new} = 5.84 - 0.1(2 * 5.84 - 4) = 5.07$
$x^{new} = 5.07 - 0.1(2 * 5.07 - 4) = 4.45$
$x^{new} = 4.45 - 0.1(2 * 4.45 - 4) = 3.96$
$x^{new} = 3.96 - 0.1(2 * 3.96 - 4) = 3.57$
$x^{new} = 3.57 - 0.1(2 * 3.57 - 4) = 3.25$
$x^{new} = 3.25 - 0.1(2 * 3.25 - 4) = 3.00$
$x^{new} = 3.00 - 0.1(2 * 3.00 - 4) = 2.80$
$x^{new} = 2.80 - 0.1(2 * 2.80 - 4) = 2.64$
$x^{new} = 2.64 - 0.1(2 * 2.64 - 4) = 2.51$
$x^{new} = 2.51 - 0.1(2 * 2.51 - 4) = 2.41$
$x^{new} = 2.41 - 0.1(2 * 2.41 - 4) = 2.32$
$x^{new} = 2.32 - 0.1(2 * 2.32 - 4) = 2.26$
$x^{new} = 2.26 - 0.1(2 * 2.26 - 4) = 2.21$
$x^{new} = 2.21 - 0.1(2 * 2.21 - 4) = 2.16$
$x^{new} = 2.16 - 0.1(2 * 2.16 - 4) = 2.13$
$x^{new} = 2.13 - 0.1(2 * 2.13 - 4) = 2.10$
$x^{new} = 2.10 - 0.1(2 * 2.10 - 4) = 2.08$
$x^{new} = 2.08 - 0.1(2 * 2.08 - 4) = 2.06$
$x^{new} = 2.06 - 0.1(2 * 2.06 - 4) = 2.05$
$x^{new} = 2.05 - 0.1(2 * 2.05 - 4) = 2.04$
$x^{new} = 2.04 - 0.1(2 * 2.04 - 4) = 2.03$
$x^{new} = 2.03 - 0.1(2 * 2.03 - 4) = 2.02$
$x^{new} = 2.02 - 0.1(2 * 2.02 - 4) = 2.02$
$x^{new} = 2.02 - 0.1(2 * 2.02 - 4) = 2.01$
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.01$
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.01$
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$

And we see that the algorithm converges at $x = 2$! We have found the minimum.
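The iterations above can be reproduced with a short loop. This is a minimal sketch; the function names `dy_dx` and `gradient_descent` are illustrative, and the two extra learning rates (0.001 and 1.1) are assumptions chosen only to show the "too small" and "too large" cases described earlier.

```python
def dy_dx(x):
    """Derivative of y = x^2 - 4x + 6."""
    return 2 * x - 4

def gradient_descent(x_start, learning_rate, n_steps):
    """Apply x_new = x_old - learning_rate * dy/dx for n_steps updates."""
    x = x_start
    for _ in range(n_steps):
        x = x - learning_rate * dy_dx(x)
    return x

# Same setup as the worked example: start at x = 8 with a learning rate of 0.1
print(round(gradient_descent(8.0, 0.1, 33), 2))  # prints 2.0

# A very small learning rate has barely moved after the same number of steps
print(gradient_descent(8.0, 0.001, 33))

# A learning rate that is too large overshoots further on every step
print(gradient_descent(8.0, 1.1, 33))
```

For this particular function each update multiplies the distance to the minimum by $(1 - 2\nu)$, so any learning rate above 1 makes that distance grow instead of shrink.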

Applied to neural networks

The first neural networks only had a single neuron, which took in some inputs $x$ and then provided an output $\hat{y}$. A common function used is the sigmoid function

$\sigma(z) = \frac{1}{1+\exp(-z)}$

$\hat{y}(w^Tx) = \frac{1}{1+\exp(-(w^Tx + b))}$

where $w$ is the associated weight for each input $x$ and we have a bias $b$. We then want to minimize our cost function (the factor $\frac{1}{2}$ is a convenience that cancels when we differentiate)

$C = \frac{1}{2N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$.

How to train the neural network?

We will use gradient descent to train the weights based on the output of the sigmoid function and we will use some cost function $C$ and train on batches of data of size $N$ .

$C = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$

$\hat{y}$ is the predicted class obtained from the sigmoid function and $y$ is the ground truth label. We will use gradient descent to minimize the cost function with respect to the weights $w$ . To make life easier we will split the derivative as follows

$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w}$ .

$\frac{\partial C}{\partial \hat{y}} = \hat{y} - y$ for a single sample,

and we have that $\hat{y} = \sigma(w^Tx + b)$ and the derivative of the sigmoid function is $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1-\sigma(z))$. With $z = w^Tx + b$ we also have $\frac{\partial z}{\partial w} = x$, thus

$\frac{\partial \hat{y}}{\partial w} = \frac{1}{1+\exp(-(w^Tx + b))} \left(1 - \frac{1}{1+\exp(-(w^Tx + b))}\right) x$.

So we can then update the weights through gradient descent as

$w^{new} = w^{old} - \eta \frac{\partial C}{\partial w}$

where $\eta$ is the learning rate.
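Putting the pieces together, here is a minimal sketch of training a single sigmoid neuron with this update rule. The toy data, the function name `train_neuron`, the learning rate and the number of epochs are all assumptions made for illustration; the bias is updated with the same chain rule as the weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_neuron(X, y, eta=0.5, n_epochs=2000):
    """Gradient descent on C = 1/(2N) * sum((y_hat - y)^2) for one sigmoid neuron."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    costs = []
    for _ in range(n_epochs):
        y_hat = sigmoid(X @ w + b)
        costs.append(np.sum((y_hat - y) ** 2) / (2 * n_samples))
        # Chain rule: dC/dz = (y_hat - y) * sigma'(z), with sigma'(z) = y_hat * (1 - y_hat)
        delta = (y_hat - y) * y_hat * (1.0 - y_hat)
        grad_w = X.T @ delta / n_samples  # dz/dw = x
        grad_b = np.mean(delta)           # dz/db = 1
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b, costs

# Hypothetical toy data: the label is 1 when the two inputs sum to more than 1
X = np.array([[0.0, 0.0], [0.2, 0.3], [0.9, 0.8],
              [1.0, 1.0], [0.1, 0.4], [0.7, 0.9]])
y = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 1.0])

w, b, costs = train_neuron(X, y)
print(costs[0], costs[-1])
print(np.round(sigmoid(X @ w + b), 2))
```

The `y_hat * (1 - y_hat)` factor is exactly the sigmoid derivative from the question: it scales how much each sample is allowed to change the weights.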

— JahKnows
source

2

Please tell me, why is this process not described so nicely in books? Do you have a blog? What materials for learning neural networks do you recommend? I have test data and I want to train on it. Can I draw the function that I will minimize? I would like to visualize this process to better understand it.

— lukassz

Can you explain backpropagation in this simple way?

— lukassz

1

Amazing Answer...(+1)

— Aditya

1

Backprop is also similar to what JahKnows has explained above. It's just that the gradient is carried all the way from the outputs back to the inputs. A quick Google search will make this clear. The same goes for every other activation function as well.

— Aditya

1

@lukassz, notice that his equation is the same as the one I have for the weight update in the second-to-last equation: $\frac{\partial C}{\partial w} = (\hat{y} - y) * \text{derivative of sigmoid}$. He uses the same cost function as me. Don't forget that you need to take the derivative of the loss function too, which becomes $\hat{y} - y$, where $\hat{y}$ are the predicted labels and $y$ are the ground truth labels.

— JahKnows

2

During the phase where the neural network generates its prediction, it feeds the input forward through the network. For each layer, the layer's input $X$ first goes through an affine transformation $W \cdot X + b$ and is then passed through the sigmoid function $\sigma(W \cdot X + b)$.
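The forward step for one layer can be sketched as follows. The function name `layer_forward` and the layer sizes and values are hypothetical; they are not taken from any particular network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, X, b):
    """Affine transformation followed by an element-wise sigmoid."""
    return sigmoid(W @ X + b)

# Hypothetical layer with 3 inputs and 2 neurons
W = np.array([[0.5, -1.0, 0.25],
              [1.5,  0.0, -0.5]])
b = np.array([0.1, -0.2])
X = np.array([1.0, 2.0, 3.0])

output = layer_forward(W, X, b)
print(output)
```

Each entry of the result is one neuron's activation and always lies strictly between 0 and 1.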

In order to train the network, the output $\hat y$ is then compared to the expected output (or label) $y$ through a cost function $L(y, \hat y)=L\left(y, \sigma(W \cdot X + b)\right)$. The goal of the whole training procedure is to minimize that cost function. In order to do that, a technique called gradient descent is performed, which calculates how we should change $W$ and $b$ so that the cost reduces.

Gradient descent requires calculating the derivative of the cost function w.r.t. $W$ and $b$. In order to do that we must apply the chain rule, because the function we need to differentiate is a composition of two functions: the cost and the sigmoid. As dictated by the chain rule, we must therefore calculate the derivative of the sigmoid function.

One of the reasons that the sigmoid function is popular with neural networks, is because its derivative is easy to compute.
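To make "easy to compute" concrete: the derivative can be written in terms of the sigmoid's own output, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, so no finite-difference step is needed. The sketch below compares this closed form with a numerical estimate; the function names and the step size of 1e-5 are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Closed form: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_derivative(f, x, step=1e-5):
    """Central finite difference, used here only as a cross-check."""
    return (f(x + step) - f(x - step)) / (2.0 * step)

x = np.linspace(-10, 10, 1000)
max_gap = np.max(np.abs(sigmoid_derivative(x) - numerical_derivative(sigmoid, x)))
print(max_gap)
```

In a network the value $\sigma(x)$ is already available from the forward pass, so the derivative costs only one extra multiplication per neuron.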

— M Sef
source

1

In simple words:

The derivative shows a neuron's ability to learn on a particular input.

For example, if the input is 0, 1 or -2, the derivative (the "learning ability") is high and back-propagation will improve the neuron's weights for this sample dramatically.

On the other hand, if the input is 20, the derivative will be very close to 0. It means that back-propagation on this sample will not "teach" this neuron to produce a better result.
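A quick way to check these numbers is to evaluate the sigmoid derivative at the inputs mentioned. This is a minimal sketch, and "input" is taken here to mean the value fed into the sigmoid.

```python
import numpy as np

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# The inputs mentioned above: three near the center, one far out in the tail
for z in [0.0, 1.0, -2.0, 20.0]:
    print(z, sigmoid_derivative(z))
```

The derivative peaks at 0.25 for an input of 0 and is many orders of magnitude smaller at 20.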

The things above are valid for a single sample.

Let's look at the bigger picture, for all samples in the training set. Here we have several situations:

If the derivative is 0 for all samples in your training set AND the neuron always produces wrong results, it means the neuron is saturated (dumb) and will not improve.
If the derivative is 0 for all samples in your training set AND the neuron always produces correct results, it means the neuron has been studying really well and is already as smart as it can be (side note: this case is good, but it may indicate potential overfitting, which is not good).
If the derivative is 0 on some samples and non-zero on other samples AND the neuron produces mixed results, it indicates that this neuron is doing some good work and may improve from further training (though not necessarily, as it depends on the other neurons and the training data you have).

So, when you are looking at the derivative plot, you can see how prepared the neuron is to learn and absorb new knowledge, given a particular input.

— VeganHunter
source

0

The derivative you see here is important in neural networks. It's the reason why people generally prefer something else, such as the rectified linear unit (ReLU).

Do you see the derivative drop at the two ends? What if your network is on the very left side, but it needs to move to the right side? Imagine you're at -10.0 but you want 10.0. The gradient will be too small for your network to converge quickly. We don't want to wait, we want quicker convergence. ReLU doesn't have this problem for positive inputs, where its gradient stays at 1.

We call this problem "Neural Network Saturation".
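The difference in gradients can be sketched numerically. The function names are illustrative, and the ReLU derivative at exactly 0 is taken to be 0 here, which is a common convention rather than something stated in this answer.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of max(0, z); the value at z = 0 is set to 0 by convention."""
    return np.where(z > 0, 1.0, 0.0)

z = np.array([-10.0, -1.0, 1.0, 10.0])
print(sigmoid_grad(z))
print(relu_grad(z))
```

At an input of 10 the sigmoid gradient has shrunk to roughly 4.5e-5 while the ReLU gradient is still 1. Note that for negative inputs the ReLU gradient is exactly 0, so the advantage applies to the positive side only.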

Please see https://www.quora.com/What-is-special-about-rectifier-neural-units-used-in-NN-learning

— HelloWorld
source

Role of the derivative of the sigmoid function in neural networks
