More steps of the bias-variance decomposition
Indeed, the full derivation is rarely given in textbooks, since it involves a lot of uninspiring algebra. Here is a more complete derivation using the notation of the book "The Elements of Statistical Learning" (page 223).
If we assume that $Y = f(X) + \epsilon$ with $E[\epsilon] = 0$ and $\operatorname{Var}(\epsilon) = \sigma_\epsilon^2$, then we can derive an expression for the expected prediction error of a regression fit $\hat{f}(X)$ at an input point $X = x_0$, using squared-error loss:
$$\operatorname{Err}(x_0) = E\big[(Y - \hat{f}(x_0))^2 \mid X = x_0\big]$$
For notational simplicity, let $\hat{f}(x_0) = \hat{f}$ and $f(x_0) = f$, and recall that $E[f] = f$ and $E[Y] = f$.
$$
\begin{aligned}
E[(Y - \hat{f})^2] &= E[(Y - f + f - \hat{f})^2] \\
&= E[(Y - f)^2] + E[(f - \hat{f})^2] + 2\,E[(f - \hat{f})(Y - f)] \\
&= E[(f + \epsilon - f)^2] + E[(f - \hat{f})^2] + 2\,E[fY - f^2 - \hat{f}Y + \hat{f}f] \\
&= E[\epsilon^2] + E[(f - \hat{f})^2] + 2\,(f^2 - f^2 - f\,E[\hat{f}] + f\,E[\hat{f}]) \\
&= \sigma_\epsilon^2 + E[(f - \hat{f})^2] + 0
\end{aligned}
$$
For the term $E[(f - \hat{f})^2]$ we can use a similar trick as above, adding and subtracting $E[\hat{f}]$ to get
$$
\begin{aligned}
E[(f - \hat{f})^2] &= E\big[(f - E[\hat{f}] + E[\hat{f}] - \hat{f})^2\big] \\
&= E\big[(f - E[\hat{f}])^2\big] + E\big[(\hat{f} - E[\hat{f}])^2\big] \\
&= \big(f - E[\hat{f}]\big)^2 + E\big[(\hat{f} - E[\hat{f}])^2\big] \\
&= \operatorname{Bias}^2[\hat{f}] + \operatorname{Var}[\hat{f}]
\end{aligned}
$$
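For completeness, one step here is usually left implicit: the cross term from the expansion vanishes because $f - E[\hat{f}]$ is a constant and $E\big[E[\hat{f}] - \hat{f}\big] = 0$, so

$$2\,E\big[(f - E[\hat{f}])(E[\hat{f}] - \hat{f})\big] = 2\,(f - E[\hat{f}])\,E\big[E[\hat{f}] - \hat{f}\big] = 0$$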
Putting it together
$$E[(Y - \hat{f})^2] = \sigma_\epsilon^2 + \operatorname{Bias}^2[\hat{f}] + \operatorname{Var}[\hat{f}]$$
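To see the identity in action, here is a minimal Monte Carlo sketch in Python. Everything in it is an illustrative assumption rather than part of the derivation: the true function `f`, the noise level `sigma_eps`, the evaluation point `x0`, and the degree-3 polynomial used as the estimator $\hat{f}$.

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: np.sin(2 * x)   # illustrative true function (assumption)
sigma_eps = 0.3               # noise standard deviation (assumption)
x0 = 0.5                      # evaluation point x_0
n, reps = 50, 10000           # training-set size, Monte Carlo replications

# For each replication: draw a fresh training set and record f_hat(x0)
preds = np.empty(reps)
for r in range(reps):
    X = rng.uniform(-1, 1, n)
    Y = f(X) + rng.normal(0, sigma_eps, n)
    coefs = np.polyfit(X, Y, deg=3)   # illustrative estimator f_hat
    preds[r] = np.polyval(coefs, x0)

# Independent new responses Y at X = x0, one per replication
y_new = f(x0) + rng.normal(0, sigma_eps, reps)

epe = np.mean((y_new - preds) ** 2)   # E[(Y - f_hat)^2 | X = x0]
bias2 = (f(x0) - preds.mean()) ** 2   # Bias^2[f_hat]
var = preds.var()                     # Var[f_hat]

print(f"expected prediction error : {epe:.4f}")
print(f"sigma_eps^2 + Bias^2 + Var: {sigma_eps**2 + bias2 + var:.4f}")
```

Up to Monte Carlo error, the two printed numbers should agree.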
Some comments on why $E[\hat{f}Y] = f\,E[\hat{f}]$
Taken from Alecos Papadopoulos here
Recall that $\hat{f}$ is the predictor we have constructed based on the $m$ data points $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, so we can write $\hat{f} = \hat{f}_m$ to remember that.
On the other hand, $Y$ is the response at a new data point $(x^{(m+1)}, y^{(m+1)})$ that we are trying to predict using the model constructed on the $m$ data points above. So the mean squared error can be written as
$$E\Big[\big(\hat{f}_m(x^{(m+1)}) - y^{(m+1)}\big)^2\Big]$$
Expanding the equation from the previous section
$$E[\hat{f}_m Y] = E\big[\hat{f}_m (f + \epsilon)\big] = E\big[\hat{f}_m f + \hat{f}_m \epsilon\big] = E[\hat{f}_m f] + E[\hat{f}_m \epsilon]$$
The last part of the equation can be viewed as
$$E\big[\hat{f}_m(x^{(m+1)}) \cdot \epsilon^{(m+1)}\big] = 0$$
Since we make the following assumptions about the point $x^{(m+1)}$ (a small numerical check follows the list):
- It was not used when constructing $\hat{f}_m$
- It is independent of all other observations $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$
- It is independent of $\epsilon^{(m+1)}$
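As a quick empirical illustration of this claim (using the same illustrative setup as the sketch above, which is an assumption rather than part of the quoted argument): because the new point never enters the fit, the product $\hat{f}_m(x^{(m+1)}) \cdot \epsilon^{(m+1)}$ averages to roughly zero.

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: np.sin(2 * x)   # illustrative true function (assumption)
sigma_eps, m, reps = 0.3, 50, 10000

fhat_new = np.empty(reps)
eps_new = np.empty(reps)
for r in range(reps):
    # Fit f_hat_m on the m training points only
    X = rng.uniform(-1, 1, m)
    Y = f(X) + rng.normal(0, sigma_eps, m)
    coefs = np.polyfit(X, Y, deg=3)
    # A new point (x^(m+1), eps^(m+1)) that never entered the fit
    x_new = rng.uniform(-1, 1)
    eps_new[r] = rng.normal(0, sigma_eps)
    fhat_new[r] = np.polyval(coefs, x_new)

# E[f_hat_m(x^(m+1)) * eps^(m+1)] should be close to 0
print(np.mean(fhat_new * eps_new))
```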
Other sources with full derivations