Understanding the bias-variance tradeoff derivation


20

I am reading the chapter on the bias-variance tradeoff in "The Elements of Statistical Learning" and I have doubts about the formula on page 29. Let the data arise from a model

$$Y = f(x) + \epsilon,$$

where the error $\epsilon$ has expected value $\hat\epsilon = E[\epsilon] = 0$ and variance $E[(\epsilon - \hat\epsilon)^2] = E[\epsilon^2] = \sigma^2$. Let the expected value of the model's error be

$$E[(Y - f_k(x))^2],$$

where $f_k(x)$ is our learner's prediction at $x$. According to the book, the error is

$$E[(Y - f_k(x))^2] = \sigma^2 + \operatorname{Bias}(f_k)^2 + \operatorname{Var}(f_k(x)).$$

My question is: why is the bias term not 0? Expanding the formula for the error, I get

$$
\begin{aligned}
E[(Y - f_k(x))^2] &= E[(f(x) + \epsilon - f_k(x))^2] \\
&= E[(f(x) - f_k(x))^2] + 2E[(f(x) - f_k(x))\epsilon] + E[\epsilon^2] \\
&= \operatorname{Var}(f_k(x)) + 2E[(f(x) - f_k(x))\epsilon] + \sigma^2
\end{aligned}
$$

and, since $\epsilon$ is an independent random variable, $2E[(f(x) - f_k(x))\epsilon] = 2E[(f(x) - f_k(x))]\,E[\epsilon] = 0$.

Where am I going wrong?

Answers:


19

You are not wrong, but you made an error in one step: $E[(f(x) - f_k(x))^2]$ is not $\operatorname{Var}(f_k(x))$. Rather, $E[(f(x) - f_k(x))^2] = \operatorname{MSE}(f_k(x)) = \operatorname{Var}(f_k(x)) + \operatorname{Bias}^2(f_k(x))$.

$$
\begin{aligned}
E[(Y - f_k(x))^2] &= E[(f(x) + \epsilon - f_k(x))^2] \\
&= E[(f(x) - f_k(x))^2] + 2E[(f(x) - f_k(x))\epsilon] + E[\epsilon^2] \\
&= E\big[\big(f(x) - E(f_k(x)) + E(f_k(x)) - f_k(x)\big)^2\big] + 2E[(f(x) - f_k(x))\epsilon] + \sigma^2 \\
&= \operatorname{Var}(f_k(x)) + \operatorname{Bias}^2(f_k(x)) + \sigma^2.
\end{aligned}
$$

Note: $E\big[(f_k(x) - E(f_k(x)))\,(f(x) - E(f_k(x)))\big] = E\big[f_k(x) - E(f_k(x))\big]\,\big(f(x) - E(f_k(x))\big) = 0$, since $f(x) - E(f_k(x))$ is a constant that can be pulled out of the expectation.
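
To see the corrected identity numerically, here is a quick Monte Carlo sketch. Everything concrete in it (the shrunken-mean learner, the constants, the seed) is an illustrative assumption of mine, not something from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

f_x = 2.0         # true value f(x) at a fixed point x
sigma = 1.0       # noise standard deviation
n_train = 20      # training responses per simulated training set
n_reps = 200_000  # number of simulated training sets

# A deliberately biased learner: shrink the sample mean toward 0, so the
# prediction f_k(x) has both nonzero bias and nonzero variance.
y_train = f_x + sigma * rng.standard_normal((n_reps, n_train))
f_k = 0.8 * y_train.mean(axis=1)    # one prediction per training set

mse = np.mean((f_x - f_k) ** 2)     # E[(f(x) - f_k(x))^2]
bias2 = (f_x - f_k.mean()) ** 2     # Bias^2(f_k(x))
var = f_k.var()                     # Var(f_k(x))

print(f"E[(f - f_k)^2] = {mse:.5f}")          # the mean squared error
print(f"Bias^2 + Var   = {bias2 + var:.5f}")  # agrees with the MSE above
print(f"Var alone      = {var:.5f}")          # noticeably smaller: Var != MSE
```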


Is there an equivalent proof for a binary outcome with cross-entropy as the error measure?
emanuele

1
It doesn't work out as nicely for binary responses. See Example 7.2 in the second edition of "The Elements of Statistical Learning".
Matthew Drury

3
Could you explain how you get from $E\big[\big(f(x) - E(f_k(x)) + E(f_k(x)) - f_k(x)\big)^2\big] + 2E[(f(x) - f_k(x))\epsilon] + \sigma^2$ to $\operatorname{Var}(f_k(x)) + \operatorname{Bias}^2(f_k(x)) + \sigma^2$?
Antoine

16

More steps in the bias-variance decomposition

Indeed, the full derivation is rarely given in textbooks, as it involves a lot of uninspiring algebra. Here is a more complete derivation using the notation of "The Elements of Statistical Learning", page 223.


If we assume $Y = f(X) + \epsilon$ with $E[\epsilon] = 0$ and $\operatorname{Var}(\epsilon) = \sigma_\epsilon^2$, then we can derive an expression for the expected prediction error of a regression fit $\hat f(X)$ at the input $X = x_0$, using squared-error loss:

$$\operatorname{Err}(x_0) = E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big]$$

For notational simplicity, let $\hat f(x_0) = \hat f$ and $f(x_0) = f$, and recall that $E[f] = f$ and $E[Y] = f$.

$$
\begin{aligned}
E[(Y - \hat f)^2] &= E[(Y - f + f - \hat f)^2] \\
&= E[(Y - f)^2] + E[(f - \hat f)^2] + 2E[(f - \hat f)(Y - f)] \\
&= E[(f + \epsilon - f)^2] + E[(f - \hat f)^2] + 2E[fY - f^2 - \hat f Y + \hat f f] \\
&= E[\epsilon^2] + E[(f - \hat f)^2] + 2\big(f^2 - f^2 - f E[\hat f] + f E[\hat f]\big) \\
&= \sigma_\epsilon^2 + E[(f - \hat f)^2] + 0
\end{aligned}
$$

For the term $E[(f - \hat f)^2]$ we can use a similar trick as above, adding and subtracting $E[\hat f]$ to get

$$
\begin{aligned}
E[(f - \hat f)^2] &= E\big[(f - E[\hat f] + E[\hat f] - \hat f)^2\big] \\
&= E\big[(f - E[\hat f])^2\big] + E\big[(E[\hat f] - \hat f)^2\big] + 2E\big[(f - E[\hat f])(E[\hat f] - \hat f)\big] \\
&= \big(f - E[\hat f]\big)^2 + E\big[(\hat f - E[\hat f])^2\big] + 0 \\
&= \operatorname{Bias}^2[\hat f] + \operatorname{Var}[\hat f]
\end{aligned}
$$

The cross term vanishes because $f - E[\hat f]$ is a constant and $E\big[E[\hat f] - \hat f\big] = 0$.

Putting it together

$$E[(Y - \hat f)^2] = \sigma_\epsilon^2 + \operatorname{Bias}^2[\hat f] + \operatorname{Var}[\hat f]$$
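
The full decomposition can also be checked by simulation: refit the same (deliberately underfitting) model on many independent training sets and compare the average squared error at $x_0$ against the three terms. All specifics below (the sine truth, the degree-1 fit, sample sizes, noise level, seed) are illustrative assumptions of mine, not from the book:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # True regression function; an illustrative assumption.
    return np.sin(2 * x)

sigma_eps = 0.3                 # noise standard deviation
x0 = 0.8                        # the fixed evaluation point
n_train, n_reps = 30, 20_000

preds = np.empty(n_reps)        # f_hat(x0) for each simulated training set
sq_err = np.empty(n_reps)       # (Y - f_hat(x0))^2 for a fresh Y at x0
for r in range(n_reps):
    # Fresh training set: all the randomness of f_hat comes from here.
    x = rng.uniform(0.0, 2.0, n_train)
    y = f(x) + sigma_eps * rng.standard_normal(n_train)
    coef = np.polyfit(x, y, deg=1)       # deliberately underfit -> nonzero bias
    preds[r] = np.polyval(coef, x0)
    # Fresh response at x0, independent of the training set.
    y0 = f(x0) + sigma_eps * rng.standard_normal()
    sq_err[r] = (y0 - preds[r]) ** 2

err = sq_err.mean()                      # Err(x0) = E[(Y - f_hat(x0))^2 | X = x0]
bias2 = (f(x0) - preds.mean()) ** 2      # Bias^2[f_hat]
var = preds.var()                        # Var[f_hat]

print(f"Err(x0)                = {err:.4f}")
print(f"sigma^2 + Bias^2 + Var = {sigma_eps**2 + bias2 + var:.4f}")
```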


Some comments on why $E[\hat f Y] = f E[\hat f]$

Taken from Alecos Papadopoulos here

Recall that $\hat f$ is the predictor we have constructed based on the $m$ data points $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$, so we can write $\hat f = \hat f_m$ to remember that.

On the other hand, $Y$ is the response at a new data point $(x^{(m+1)}, y^{(m+1)})$, which we predict using the model constructed on the $m$ data points above. So the Mean Squared Error can be written as

$$E\big[\hat f_m(x^{(m+1)}) - y^{(m+1)}\big]^2$$

Expanding the equation from the previous section

$$E[\hat f_m Y] = E[\hat f_m (f + \epsilon)] = E[\hat f_m f + \hat f_m \epsilon] = E[\hat f_m f] + E[\hat f_m \epsilon]$$

The last part of the equation can be viewed as

$$E\big[\hat f_m(x^{(m+1)})\,\epsilon^{(m+1)}\big] = E\big[\hat f_m(x^{(m+1)})\big]\,E\big[\epsilon^{(m+1)}\big] = 0,$$

since we make the following assumptions about the point $x^{(m+1)}$ (see the numerical sketch after this list):

  • It was not used when constructing $\hat f_m$
  • It is independent of all other observations $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$
  • It is independent of $\epsilon^{(m+1)}$
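
This independence can also be checked numerically. A minimal sketch, assuming an ordinary least-squares line as $\hat f_m$ (the linear truth, the constants, and the seed are my own illustrative choices): average the product $\hat f_m(x^{(m+1)})\,\epsilon^{(m+1)}$ over many replications and observe that it is close to 0.

```python
import numpy as np

rng = np.random.default_rng(2)

sigma_eps, m, n_reps = 0.5, 25, 20_000
prods = np.empty(n_reps)
for r in range(n_reps):
    # Training set used to construct f_hat_m (here an ordinary least-squares line).
    x = rng.uniform(-1.0, 1.0, m)
    y = 1.5 * x + sigma_eps * rng.standard_normal(m)
    slope, intercept = np.polyfit(x, y, deg=1)
    # New point: its noise eps_new was never seen when fitting f_hat_m.
    x_new = rng.uniform(-1.0, 1.0)
    eps_new = sigma_eps * rng.standard_normal()
    prods[r] = (slope * x_new + intercept) * eps_new

print(f"E[f_hat_m(x_new) * eps_new] = {prods.mean():.5f}")  # close to 0
```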

Other sources with full derivations


1
Why is $E[\hat f Y] = f E[\hat f]$? I don't think $Y$ and $\hat f$ are independent, since $\hat f$ is essentially constructed using $Y$.
Felipe Pérez

5
But the question is essentially the same: why is $E[\hat f \epsilon] = 0$? The randomness of $\hat f$ comes from the error $\epsilon$, so I don't see why $\hat f$ and $\epsilon$ would be independent, and hence why $E(\hat f \epsilon) = 0$.
Felipe Pérez

From your clarification, it seems that the in-sample vs. out-of-sample perspective is crucial. Is that so? If we work only in-sample, and thus view $\epsilon$ as a residual, does the bias-variance tradeoff disappear?
markowitz

1
@FelipePérez As far as I understand, the randomness of $\hat f$ comes from the train-test split (which points ended up in the training set and gave $\hat f$ as the trained predictor). In other words, the variance of $\hat f$ comes from all the possible subsets of a given fixed data set that we can take as the training set. Because the data set is fixed, there is no randomness coming from $\epsilon$, and therefore $\hat f$ and $\epsilon$ are independent.
Alberto Santini