What is GELU activation?

18

I was going through the BERT paper, which uses GELU (Gaussian Error Linear Unit) and states the equation as

$\text{GELU}(x) = xP(X \le x) = x\Phi(x),$

which is in turn approximated by

$0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,(x + 0.044715x^3)\right]\right).$

Could you simplify the equation and explain how it has been approximated?

activation-function bert mathematics

— thanatoz

19

GELU function

We can expand the cumulative distribution function of $\mathcal{N}(0, 1)$, i.e. $\Phi(x)$, as follows:

$\text{GELU}(x):=x{\Bbb P}(X \le x)=x\Phi(x)=0.5x\left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$

Note that this is a definition, not an equation (or a relation). The authors have provided some justifications for this proposal, e.g. a stochastic analogy, but mathematically this is just a definition.
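As a minimal sketch of the definition above, the exact GELU can be evaluated with the standard library's `math.erf`. The function name `gelu` and the sample points are illustrative choices, not taken from the paper.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Illustrative sample points (an assumption, chosen only to show the shape).
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x = {x:+.1f}  GELU(x) = {gelu(x):+.6f}")
```

For large positive $x$ the value approaches $x$, and for large negative $x$ it approaches $0$, since $\Phi(x)$ tends to $1$ and $0$ respectively.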

[Figure: plot of GELU]

tanh approximation

For these types of numerical approximation, the key idea is to find a similar function (primarily based on experience), parameterize it, and then fit it to a set of points from the original function.

Knowing that $\text{erf}(x)$ is very close to $\text{tanh}(x)$, and that the first derivative of $\text{erf}\left(\frac{x}{\sqrt{2}}\right)$ coincides with that of $\text{tanh}\left(\sqrt{\frac{2}{\pi}}x\right)$ at $x=0$, where both equal $\sqrt{\frac{2}{\pi}}$, we proceed to fit

$\text{tanh}\left(\sqrt{\frac{2}{\pi}}(x+ax^2+bx^3+cx^4+dx^5)\right)$

(or with more terms) to a set of points

$\left(x_i, \text{erf}\left(\frac{x_i}{\sqrt{2}}\right)\right)$.

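I fitted this function to 20 samples in $(-1.5, 1.5)$ (using https://mycurvefit.com), and here are the coefficients:

[Figure: fitted coefficients]

By setting $a=c=d=0$, $b$ was estimated to be $0.04495641$. With more samples from a wider range (that site only allows 20), the coefficient $b$ gets closer to the paper's $0.044715$. Finally we get

$\text{GELU}(x)=x\Phi(x)=0.5x\left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)\simeq 0.5x\left(1+\text{tanh}\left(\sqrt{\frac{2}{\pi}}(x+0.044715x^3)\right)\right)$

with mean squared error $\sim 10^{-8}$ for $x \in [-10, 10]$.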
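The one-parameter fit (with $a=c=d=0$) can be sketched using only the standard library. This is not the procedure used on the curve-fitting site: it swaps in a plain golden-section search over $b$, and it assumes 20 equally spaced samples from $-1.5$ to $1.5$, so the fitted value need not match $0.04495641$ exactly. The helper names `sse` and `golden_section_min` are hypothetical.

```python
import math

C = math.sqrt(2.0 / math.pi)  # slope of erf(x / sqrt(2)) at x = 0


def sse(b, xs):
    """Sum of squared errors between tanh(C * (x + b x^3)) and erf(x / sqrt(2))."""
    return sum(
        (math.tanh(C * (x + b * x**3)) - math.erf(x / math.sqrt(2.0))) ** 2
        for x in xs
    )


def golden_section_min(f, lo, hi, tol=1e-9):
    """Minimise a 1-D function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        m1 = hi - inv_phi * (hi - lo)
        m2 = lo + inv_phi * (hi - lo)
        if f(m1) < f(m2):
            hi = m2  # minimum lies in [lo, m2]
        else:
            lo = m1  # minimum lies in [m1, hi]
    return 0.5 * (lo + hi)


# Assumed sample set: 20 equally spaced points from -1.5 to 1.5.
xs = [-1.5 + 3.0 * i / 19 for i in range(20)]

# Assumed search interval for b; the paper's 0.044715 lies well inside it.
b_fit = golden_section_min(lambda b: sse(b, xs), 0.0, 0.1)
print(f"fitted b = {b_fit:.6f}")
```

Changing `xs` to a wider or denser grid is a quick way to see how the fitted coefficient depends on the sampling range.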

Note that if we did not utilize the relationship between the first derivatives, the term $\sqrt{\frac{2}{\pi}}$ would have been absorbed into the parameters as follows:

$0.5x\left(1+\text{tanh}\left(0.797885x+0.035677x^3\right)\right)$

which is less beautiful (less analytical, more numerical)!
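The two numeric constants in this form are just $\sqrt{2/\pi}$ and $\sqrt{2/\pi}\cdot 0.044715$. A short arithmetic check (variable names are illustrative):

```python
import math

c = math.sqrt(2.0 / math.pi)
b = 0.044715  # cubic coefficient from the paper's approximation

linear_coeff = c      # coefficient of x once the constant is folded in
cubic_coeff = c * b   # coefficient of x^3 once the constant is folded in

print(f"{linear_coeff:.6f}")
print(f"{cubic_coeff:.6f}")
```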

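Utilizing the parity

As suggested by @BookYourLuck, we can utilize the parity of the functions to restrict the space of polynomials in which we search. That is, since $\text{erf}$ is an odd function, i.e. $f(-x)=-f(x)$, and $\text{tanh}$ is also an odd function, the polynomial $\text{pol}(x)$ inside $\text{tanh}$ should also be odd (it should contain only odd powers of $x$), so that

$\text{erf}(-x)\simeq\text{tanh}(\text{pol}(-x))=\text{tanh}(-\text{pol}(x))=-\text{tanh}(\text{pol}(x))\simeq-\text{erf}(x)$

Previously, we were lucky to end up with (almost) zero coefficients for the even powers $x^2$ and $x^4$. In general, however, an unrestricted fit can give a low-quality approximation in which, for example, a term like $0.23x^2$ is cancelled out by extra terms (even or odd) instead of the fit simply choosing $0x^2$.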
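The parity argument can be checked numerically. The sketch below measures how far $\tanh(\text{pol}(x))$ is from being odd, once for an odd polynomial and once with a hypothetical even term $0.23x^2$ added (the value $0.23$ is only the illustrative figure from the text, and the helper names are assumptions).

```python
import math

C = math.sqrt(2.0 / math.pi)


def approx(x, a, b):
    """tanh(C * (x + a x^2 + b x^3)); a is a hypothetical even-power coefficient."""
    return math.tanh(C * (x + a * x**2 + b * x**3))


def oddness_gap(f, xs):
    """Largest |f(-x) + f(x)| over xs; this is zero for an odd function."""
    return max(abs(f(-x) + f(x)) for x in xs)


xs = [0.25 * i for i in range(1, 13)]  # 0.25, 0.5, ..., 3.0

odd_only = oddness_gap(lambda x: approx(x, 0.0, 0.044715), xs)
with_even = oddness_gap(lambda x: approx(x, 0.23, 0.044715), xs)

print("odd polynomial :", odd_only)
print("with 0.23 x^2  :", with_even)
```

Since $\text{erf}$ itself is odd, any non-zero gap in the second case is approximation error that an odd-only polynomial avoids by construction.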

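Sigmoid approximation

A similar relationship holds between $\text{erf}(x)$ and $2\left(\sigma(x)-\frac{1}{2}\right)$ (sigmoid), which the paper proposes as another approximation, with mean squared error $\sim 10^{-4}$ for $x \in [-10, 10]$.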
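One way to see why a sigmoid works here is that $2\left(\sigma(z)-\frac{1}{2}\right)=\tanh(z/2)$ exactly, so the sigmoid form is again a tanh-shaped stand-in for $\text{erf}$. The sketch below checks that identity and measures the error of $x\,\sigma(1.702x)$ against the exact GELU; the grid spacing of $0.01$ and the function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_sigmoid(x, k=1.702):
    """Sigmoid form x * sigmoid(k x); k = 1.702 is the paper's constant."""
    return x * sigmoid(k * x)


xs = [i / 100.0 for i in range(-1000, 1001)]  # assumed grid on [-10, 10]

# 2 * (sigmoid(z) - 1/2) equals tanh(z / 2) up to floating-point rounding.
identity_gap = max(abs(2.0 * (sigmoid(x) - 0.5) - math.tanh(x / 2.0)) for x in xs)

mse = sum((gelu_exact(x) - gelu_sigmoid(x)) ** 2 for x in xs) / len(xs)
print("identity gap:", identity_gap)
print("sigmoid-form MSE:", mse)
```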

Here is the Python code for generating the data points, fitting the functions, and calculating the mean squared errors:

import math
import numpy as np
import scipy.optimize as optimize


def tahn(xs, a):
    return [math.tanh(math.sqrt(2 / math.pi) * (x + a * x**3)) for x in xs]


def sigmoid(xs, a):
    return [2 * (1 / (1 + math.exp(-a * x)) - 0.5) for x in xs]


print_points = 0
np.random.seed(123)
# xs = [-2, -1, -.9, -.7, 0.6, -.5, -.4, -.3, -0.2, -.1, 0,
#       .1, 0.2, .3, .4, .5, 0.6, .7, .9, 2]
# xs = np.concatenate((np.arange(-1, 1, 0.2), np.arange(-4, 4, 0.8)))
# xs = np.concatenate((np.arange(-2, 2, 0.5), np.arange(-8, 8, 1.6)))
xs = np.arange(-10, 10, 0.001)
erfs = np.array([math.erf(x/math.sqrt(2)) for x in xs])
ys = np.array([0.5 * x * (1 + math.erf(x/math.sqrt(2))) for x in xs])

# Fit tanh and sigmoid curves to erf points
tanh_popt, _ = optimize.curve_fit(tahn, xs, erfs)
print('Tanh fit: a=%5.5f' % tuple(tanh_popt))

sig_popt, _ = optimize.curve_fit(sigmoid, xs, erfs)
print('Sigmoid fit: a=%5.5f' % tuple(sig_popt))

# curves used in https://mycurvefit.com:
# 1. sinh(sqrt(2/3.141593)*(x+a*x^2+b*x^3+c*x^4+d*x^5))/cosh(sqrt(2/3.141593)*(x+a*x^2+b*x^3+c*x^4+d*x^5))
# 2. sinh(sqrt(2/3.141593)*(x+b*x^3))/cosh(sqrt(2/3.141593)*(x+b*x^3))
y_paper_tanh = np.array([0.5 * x * (1 + math.tanh(math.sqrt(2/math.pi)*(x + 0.044715 * x**3))) for x in xs])
tanh_error_paper = (np.square(ys - y_paper_tanh)).mean()
y_alt_tanh = np.array([0.5 * x * (1 + math.tanh(math.sqrt(2/math.pi)*(x + tanh_popt[0] * x**3))) for x in xs])
tanh_error_alt = (np.square(ys - y_alt_tanh)).mean()

# curve used in https://mycurvefit.com:
# 1. 2*(1/(1+2.718281828459^(-(a*x))) - 0.5)
y_paper_sigmoid = np.array([x * (1 / (1 + math.exp(-1.702 * x))) for x in xs])
sigmoid_error_paper = (np.square(ys - y_paper_sigmoid)).mean()
y_alt_sigmoid = np.array([x * (1 / (1 + math.exp(-sig_popt[0] * x))) for x in xs])
sigmoid_error_alt = (np.square(ys - y_alt_sigmoid)).mean()

print('Paper tanh error:', tanh_error_paper)
print('Alternative tanh error:', tanh_error_alt)
print('Paper sigmoid error:', sigmoid_error_paper)
print('Alternative sigmoid error:', sigmoid_error_alt)

if print_points == 1:
    print(len(xs))
    for x, erf in zip(xs, erfs):
        print(x, erf)

Output:

Tanh fit: a=0.04485
Sigmoid fit: a=1.70099
Paper tanh error: 2.4329173471294176e-08
Alternative tanh error: 2.698034519269613e-08
Paper sigmoid error: 5.6479106346814546e-05
Alternative sigmoid error: 5.704246564663601e-05

— Esmailian

2

Why is the approximation needed? Couldn't they just use the erf function?

— SebiSebi

8

First note that

$\Phi(x) = \frac12 \mathrm{erfc}\left(-\frac{x}{\sqrt{2}}\right) = \frac12 \left(1 + \mathrm{erf}\left(\frac{x}{\sqrt2}\right)\right)$

by parity of $\mathrm{erf}$. We need to show that

$\mathrm{erf}\left(\frac x {\sqrt2}\right) \approx \tanh\left(\sqrt{\frac2\pi} \left(x + a x^3\right)\right)$

for $a \approx 0.044715$.

For large values of $x$, both functions are bounded in $[-1, 1]$. For small $x$, the respective Taylor series read

$\tanh(x) = x - \frac{x^3}{3} + o(x^3)$

and

$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \left(x - \frac{x^3}{3}\right) + o(x^3).$

Substituting, we get

$\tanh\left(\sqrt{\frac2\pi} \left(x + a x^3\right)\right) = \sqrt\frac{2}{\pi} \left(x + \left(a-\frac{2}{3\pi}\right)x^3\right) + o(x^3)$

and

$\mathrm{erf}\left(\frac x {\sqrt2}\right) = \sqrt\frac2\pi \left(x - \frac{x^3}{6}\right) + o(x^3).$

Equating the coefficients of $x^3$, i.e. setting $a - \frac{2}{3\pi} = -\frac{1}{6}$, we find

$a \approx 0.04553992412$

which is close to the paper's $0.044715$.
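The last step can be reproduced directly: solving $a - \frac{2}{3\pi} = -\frac{1}{6}$ gives $a = \frac{2}{3\pi} - \frac{1}{6}$. The sketch below evaluates that value and, as an illustration only, prints the pointwise gap between $\mathrm{erf}(x/\sqrt{2})$ and the tanh form for both this $a$ and the paper's $0.044715$ at a few arbitrarily chosen points. The helper name `gap` is hypothetical.

```python
import math

# Equate a - 2/(3*pi) with -1/6 and solve for a.
a_taylor = 2.0 / (3.0 * math.pi) - 1.0 / 6.0
a_paper = 0.044715

C = math.sqrt(2.0 / math.pi)


def gap(x, a):
    """|erf(x / sqrt(2)) - tanh(C * (x + a x^3))| at a single point x."""
    return abs(math.erf(x / math.sqrt(2.0)) - math.tanh(C * (x + a * x**3)))


print(f"a from Taylor matching = {a_taylor:.11f}")
for x in (0.1, 0.5, 1.0, 2.0, 3.0):  # arbitrary sample points
    print(
        f"x = {x:.1f}  gap(Taylor a) = {gap(x, a_taylor):.2e}"
        f"  gap(paper a) = {gap(x, a_paper):.2e}"
    )
```

The Taylor value is determined only by behaviour near $x=0$, whereas a fitted coefficient balances error over a whole range, which is one plausible reason the two numbers differ slightly.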

— BookYourLuck