From the perceptron rule to gradient descent: How is a perceptron with a sigmoid activation function different from logistic regression?



Essentially, my question is this: in multilayer perceptrons, perceptrons are used with a sigmoid activation function, so that in the update rule $\hat{y}$ is calculated as

$$\hat{y} = \frac{1}{1+\exp(-\mathbf{w}^T\mathbf{x}_i)}$$

How is this "sigmoid" perceptron different from logistic regression, then?

I would say that a single-layer sigmoid perceptron is equivalent to logistic regression in the sense that both use $\hat{y} = \frac{1}{1+\exp(-\mathbf{w}^T\mathbf{x}_i)}$ in the update rule, and both return $\operatorname{sign}(\hat{y})$ in the prediction step. However, in multilayer perceptrons, the sigmoid activation function is used to return a probability, not an on/off signal, in contrast to logistic regression and the single-layer perceptron.

I think the usage of the term "perceptron" may be a bit ambiguous, so let me provide some background based on my current understanding of single-layer perceptrons:

Classic perceptron rule

First, there is F. Rosenblatt's classic perceptron, where we have a step function:

$$\Delta w_d = \eta\,(y_i - \hat{y_i})\,x_{id} \qquad y_i, \hat{y_i} \in \{-1, 1\}$$

and we update the weights

$$w_k := w_k + \Delta w_k \qquad (k \in \{1, \dots, d\})$$

so that $\hat{y}$ is computed as

$$\hat{y} = \operatorname{sign}(\mathbf{w}^T\mathbf{x}_i) = \operatorname{sign}(w_0 + w_1 x_{i1} + \dots + w_d x_{id})$$
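The classic rule above can be sketched in a few lines. This is a minimal illustration, not code from the original post: the toy data, the learning rate `eta`, and the epoch count are all hypothetical choices; labels are in $\{-1, 1\}$ and the update is applied one sample at a time, exactly as in the formulas.

```python
# A minimal sketch of Rosenblatt's perceptron rule as described above:
# prediction sign(w^T x_i) and online update Δw = η(y_i − ŷ_i)x_i.
# Data, eta, and epochs are illustrative assumptions, not from the post.
import numpy as np

def sign(z):
    return np.where(z >= 0, 1, -1)

def perceptron_fit(X, y, eta=0.1, epochs=10):
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x_0 = 1 for the bias w_0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):                # online: one sample at a time
            y_hat = sign(w @ x_i)
            w += eta * (y_i - y_hat) * x_i        # zero update when already correct
    return w

# A linearly separable toy problem (hypothetical data):
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = perceptron_fit(X, y)
preds = sign(np.hstack([np.ones((4, 1)), X]) @ w)
```

Because the error term $(y_i - \hat{y_i})$ is computed after thresholding, it can only take the values $0$ and $\pm 2$, so correctly classified samples never change the weights.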


Gradient descent

Using gradient descent, we optimize (minimize) the cost function

$$J(\mathbf{w}) = \sum_i \frac{1}{2}(y_i - \hat{y_i})^2 \qquad y_i, \hat{y_i} \in \mathbb{R}$$

Here we have "real" numbers, so I see this as basically analogous to linear regression, with the difference that our classification output is thresholded.

Here, when we update the weights, we take a step in the negative direction of the gradient

$$\Delta w_k = -\eta \frac{\partial J}{\partial w_k} = -\eta \sum_i (y_i - \hat{y_i})(-x_{ik}) = \eta \sum_i (y_i - \hat{y_i})\,x_{ik}$$

But here we have $\hat{y} = \mathbf{w}^T\mathbf{x}_i$ instead of $\hat{y} = \operatorname{sign}(\mathbf{w}^T\mathbf{x}_i)$

$$w_k := w_k + \Delta w_k \qquad (k \in \{1, \dots, d\})$$

Also, in contrast to the classic perceptron rule, we compute the sum of squared errors over a complete pass of the whole training dataset, whereas the classic perceptron rule updates the weights as new training samples arrive (analogous to stochastic gradient descent, i.e. online learning).
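The batch variant described above can be sketched the same way. Again an illustrative example with assumed data and hyperparameters: the activation used in the update is the raw linear output $\mathbf{w}^T\mathbf{x}_i$, the cost is the sum of squared errors, and every update uses the entire training set.

```python
# A sketch of the batch gradient-descent variant described above
# (Adaline-style): the update uses the linear output w^T x_i, the loss is
# J(w) = Σ_i ½(y_i − ŷ_i)², and each step sums over the whole training set.
# Data, eta, and epochs are illustrative assumptions.
import numpy as np

def gd_fit(X, y, eta=0.01, epochs=500):
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # bias column x_0 = 1
    w = np.zeros(X.shape[1])
    costs = []
    for _ in range(epochs):
        y_hat = X @ w                              # linear output, no sign()
        errors = y - y_hat
        w += eta * X.T @ errors                    # Δw_k = η Σ_i (y_i − ŷ_i) x_ik
        costs.append(0.5 * (errors ** 2).sum())
    return w, costs

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, costs = gd_fit(X, y)
# Thresholding happens only at prediction time, not in the update:
preds = np.where(np.hstack([np.ones((4, 1)), X]) @ w >= 0, 1, -1)
```

Note the contrast with the perceptron sketch: here `errors` uses the continuous output, so the cost decreases smoothly with each full pass.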


Sigmoid activation function

Now, here is my question:

In multilayer perceptrons, perceptrons are used with a sigmoid activation function, so that in the update rule $\hat{y}$ is calculated as

$$\hat{y} = \frac{1}{1+\exp(-\mathbf{w}^T\mathbf{x}_i)}$$

How is this "sigmoid" perceptron different from logistic regression, then?
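One concrete observation about the sigmoid unit above (with illustrative, hypothetical weights and inputs): thresholding the sigmoid output at 0.5 gives exactly the same decision as thresholding $\mathbf{w}^T\mathbf{x}$ at 0, so only the interpretation of the output changes, not the decision rule.

```python
# The sigmoid unit from the question. Thresholding sigmoid(w^T x) at 0.5
# is equivalent to sign(w^T x), because sigmoid is monotonic and
# sigmoid(0) = 0.5. Weights and inputs are arbitrary illustrative values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])        # hypothetical weights (w_0 is the bias)
x = np.array([1.0, 3.0, 1.5])         # x_0 = 1 for the bias term
z = w @ x                             # w^T x = 0.5 − 3.0 + 3.0 = 0.5
p = sigmoid(z)                        # interpretable as P(y = 1 | x)
decision_from_p = 1 if p >= 0.5 else -1
decision_from_z = 1 if z >= 0 else -1  # identical decision
```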


Surprisingly, this question itself allowed me to condense my machine learning and neural network fundamentals!
varun

Answers:



Using gradient descent, we optimize (minimize) the cost function

$$J(\mathbf{w}) = \sum_i \frac{1}{2}(y_i - \hat{y_i})^2 \qquad y_i, \hat{y_i} \in \mathbb{R}$$

If you minimize the mean squared error, then it is different from logistic regression. Logistic regression is normally associated with the cross-entropy loss (see, for example, the introductory page of the scikit-learn library).


(I assume the multilayer perceptron is the thing called a neural network.)

If you use the cross-entropy loss (with regularization) for a single-layer neural network, then it is the same as the logistic regression model (a log-linear model). If you use a multilayer network instead, it can be thought of as logistic regression with parametric nonlinear basis functions.


However, in multilayer perceptrons, the sigmoid activation function is used to return a probability, not an on/off signal, in contrast to logistic regression and the single-layer perceptron.

The outputs of both logistic regression and neural networks with sigmoid activation functions can be interpreted as probabilities, since the cross-entropy loss is in fact the negative log-likelihood defined through a Bernoulli distribution.
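This identity is easy to check numerically. A small sketch with arbitrary illustrative probabilities and labels: the binary cross-entropy sum equals the negative log of the Bernoulli likelihood of the labels.

```python
# Numeric check of the claim above: binary cross-entropy is exactly the
# negative log-likelihood of the labels under Bernoulli(p). The values of
# p and y are arbitrary illustrative choices.
import numpy as np

p = np.array([0.9, 0.2, 0.7])   # predicted P(y = 1 | x), e.g. sigmoid outputs
y = np.array([1, 0, 1])         # observed labels

cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Bernoulli likelihood of the same labels, multiplied out and then -log'ed:
likelihood = np.prod(p**y * (1 - p)**(1 - y))
neg_log_likelihood = -np.log(likelihood)
```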



Because gradient descent updates each parameter in a way that reduces the output error, the error must be a continuous function of all the parameters. A threshold-based activation is not differentiable, which is why a sigmoid or tanh activation is used.

Here is a single-layer NN:

$$\frac{dJ(w,b)}{d\omega_{kj}} = \frac{dJ(w,b)}{dz_k}\,\frac{dz_k}{d\omega_{kj}}$$

$$\frac{dJ(w,b)}{dz_k} = (a_k - y_k)\,a_k(1 - a_k)$$

$$\frac{dz_k}{d\omega_{kj}} = x_k$$

$$J(w,b) = \frac{1}{2}(y_k - a_k)^2$$

$$a_k = \operatorname{sigm}(z_k) = \operatorname{sigm}(W_{kj}\,x_k + b_k)$$

If the activation function were a basic step function (a threshold), J would not be differentiable with respect to zk.
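The chain rule above can be verified for a single sigmoid unit by comparing the analytic gradient against a finite-difference approximation. All the scalar values here are illustrative assumptions.

```python
# Implements the chain rule above for one sigmoid unit and checks that
# the analytic gradient dJ/dω = (a − y)·a(1 − a)·x matches a central
# finite-difference estimate. The scalars w, b, x, y are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    a = sigmoid(w * x + b)            # a = sigm(z), z = wx + b
    return 0.5 * (y - a) ** 2         # J = ½(y − a)²

w, b, x, y = 0.8, -0.3, 1.5, 1.0
a = sigmoid(w * x + b)
analytic = (a - y) * a * (1 - a) * x  # dJ/dz · dz/dω

eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
```

With a step activation in place of `sigmoid`, the same finite-difference probe would return 0 almost everywhere, which is exactly why gradient descent cannot use it.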

Here is a link that explains it in general.

Edit: Maybe I misunderstood what you mean by perceptron. If I'm not mistaken, a perceptron is a thresholded weighted sum of the inputs. If you replace the thresholding with a logistic function, it turns into logistic regression. A multi-layer NN with sigmoid (logistic) activation functions is a cascade of layers composed of logistic regressions.


This doesn't answer the question.
Neil G

Thanks for writing this nice comment, but this was not what I was asking for. My question was not "why gradient descent" but "what makes a perceptron with a sigmoid activation function different from logistic regression"

@SebastianRaschka They are the same. What makes you think that they are different? I brought up gradient descent because I saw a mistake in your gradient descent derivation. You assumed y=wTx when you were deriving it. That is why you found the same derivation for both the perceptron and the gradient update.
yasin.yazici

"What makes you think that they are different?" -- the nomenclature, thus I was wondering if there is something else; I am just curious why we have 2 different terms for the same thing. Btw. I don't see any mistake in the gradient descent in my question. y=wjTxji is correct. And I also didn't find the same derivation between "perceptron rule" and "gradient descent" update. The former is done in an online learning manner (sample by sample), the latter is done in batch, and also we minimize the sum of squared errors instead of using a stepwise function.

I think what might have caused the confusion is that you have to distinguish between the "classification" step and the "learning" step. The classification step is always thresholded (-1 or 1, or 0 and 1 if you like). However, the update is different: in the classic perceptron, the update is done via $\eta(y - \operatorname{sign}(\mathbf{w}^T\mathbf{x}_i))\mathbf{x}_i$, whereas in, say, stochastic gradient descent it is $\eta(y - \mathbf{w}^T\mathbf{x}_i)\mathbf{x}_i$
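The distinction in that comment, written out for a single sample with illustrative weights, input, and learning rate: the perceptron error is computed after thresholding, the SGD error from the raw linear output.

```python
# The two single-sample updates from the comment above, side by side.
# The classic perceptron thresholds the output before taking the error;
# SGD on the squared loss uses the raw linear output. All numbers are
# illustrative assumptions.
import numpy as np

eta = 0.1
w = np.array([0.2, -0.4])   # current weights
x = np.array([1.0, 2.0])    # one training sample
y = 1                       # its label, in {-1, 1}

z = w @ x                   # w^T x = 0.2 − 0.8 = −0.6

dw_perceptron = eta * (y - np.sign(z)) * x  # error after sign(): (1 − (−1)) = 2
dw_sgd = eta * (y - z) * x                  # error on raw output: (1 − (−0.6)) = 1.6
```

The perceptron step is all-or-nothing (the error is always 0 or ±2), while the SGD step scales continuously with how wrong the linear output is.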


Intuitively, I think of a multilayer perceptron as computing a nonlinear transformation on my input features, and then feeding these transformed variables into a logistic regression.

The multinomial (that is, N > 2 possible labels) case may make this more clear. In traditional logistic regression, for a given data point, you want to compute a "score", $\beta_i \cdot X$, for each class $i$. And the way you convert these to probabilities is by taking the exponentiated score for the given class over the sum over all classes, $\frac{e^{\beta_i \cdot X}}{\sum_j e^{\beta_j \cdot X}}$. So a class with a large score has a larger share of the combined score and so a higher probability. If forced to predict a single class, you choose the class with the largest probability (which is also the largest score).

I don't know about you, but in my modeling courses and research, I tried all kinds of sensible and stupid transformations of the input features to improve their significance and overall model prediction. Squaring things, taking logs, combining two into a rate, etc. I had no shame, but I had limited patience.

A multilayer perceptron is like a graduate student with way too much time on her hands. Through the gradient descent training and sigmoid activations, it's going to compute arbitrary nonlinear combinations of your original input variables. In the final layer of the perceptron, these variables effectively become the X in the above equation, and your gradient descent also computes an associated final βi. The MLP framework is just an abstraction of this.
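The picture in this answer can be sketched directly: a hidden layer computes nonlinear features, and the final layer is just logistic regression on those features. The weights below are random illustrative values (untrained), purely to show the structure.

```python
# Sketch of the answer's view of an MLP: the hidden layer produces
# nonlinear transformed features h, and the output layer is logistic
# regression on h. Weights are random illustrative values, not trained.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 2.0])      # raw input features
W1 = rng.normal(size=(4, 3))        # hidden layer: 4 learned "transformations"
h = sigmoid(W1 @ x)                 # the transformed variables ("the X above")
beta = rng.normal(size=4)           # final-layer weights (the β)
p = sigmoid(beta @ h)               # logistic regression on h
```

Gradient descent trains `W1` and `beta` jointly, so the "feature engineering" and the final logistic regression are fit at the same time.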

Licensed under cc by-sa 3.0 with attribution required.