Equation (2.11) is a consequence of the following identity. For any two random variables $Z_1$ and $Z_2$ and any function $g(Z_1, Z_2)$,

$$E_{Z_1,Z_2}\big(g(Z_1,Z_2)\big) = E_{Z_2}\Big(E_{Z_1\mid Z_2}\big(g(Z_1,Z_2)\mid Z_2\big)\Big)$$
The notation $E_{Z_1,Z_2}$ is the expectation over the joint distribution. The notation $E_{Z_1\mid Z_2}$ essentially says "integrate over the conditional distribution of $Z_1$ as if $Z_2$ were fixed".
It's easy to verify this in the case that $Z_1$ and $Z_2$ are discrete random variables by just unwinding the definitions involved:
$$
\begin{aligned}
E_{Z_2}\Big(E_{Z_1\mid Z_2}\big(g(Z_1,Z_2)\mid Z_2\big)\Big)
&= E_{Z_2}\Big(\sum_{z_1} g(z_1,Z_2)\Pr(Z_1=z_1\mid Z_2)\Big)\\
&= \sum_{z_2}\Big(\sum_{z_1} g(z_1,z_2)\Pr(Z_1=z_1\mid Z_2=z_2)\Big)\Pr(Z_2=z_2)\\
&= \sum_{z_1,z_2} g(z_1,z_2)\Pr(Z_1=z_1\mid Z_2=z_2)\Pr(Z_2=z_2)\\
&= \sum_{z_1,z_2} g(z_1,z_2)\Pr(Z_1=z_1, Z_2=z_2)\\
&= E_{Z_1,Z_2}\big(g(Z_1,Z_2)\big)
\end{aligned}
$$
The continuous case can either be viewed informally as a limit of this argument, or verified formally once all the measure-theoretic doodads are in place.
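In the discrete case, the identity is also easy to check numerically. The joint distribution and the function $g$ below are arbitrary choices for illustration; any valid probability table and any $g$ would work:

```python
import numpy as np

# Arbitrary joint distribution over (z1, z2) on a 3x2 grid (illustrative values).
P = np.array([[0.10, 0.20],
              [0.05, 0.25],
              [0.15, 0.25]])   # P[i, j] = Pr(Z1 = z1[i], Z2 = z2[j])
z1 = np.array([0.0, 1.0, 2.0])
z2 = np.array([-1.0, 3.0])
g = lambda a, b: (a - b) ** 2  # any function of (z1, z2) works here

G = g(z1[:, None], z2[None, :])   # g evaluated at every grid point

# Left-hand side: expectation over the joint distribution.
lhs = np.sum(G * P)

# Right-hand side: inner conditional expectation given Z2, then outer expectation over Z2.
p_z2 = P.sum(axis=0)              # marginal Pr(Z2 = z2[j])
cond = P / p_z2                   # Pr(Z1 = z1[i] | Z2 = z2[j])
inner = np.sum(G * cond, axis=0)  # E[g(Z1, Z2) | Z2 = z2[j]]
rhs = np.sum(inner * p_z2)

assert np.isclose(lhs, rhs)
```

The assertion passes for any choice of `P` and `g`, since the two sides are algebraically identical by the derivation above.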
To unwind the application, take $Z_1 = Y$, $Z_2 = X$, and $g(x, y) = (y - f(x))^2$. Everything lines up exactly.
The assertion (2.12) asks us to consider minimizing
$$E_X E_{Y\mid X}\big((Y - f(X))^2 \mid X\big)$$
where we are free to choose $f$ as we wish. Again focusing on the discrete case, and picking up halfway through the unwinding above, we see that we are minimizing
$$\sum_x \Big(\sum_y (y - f(x))^2 \Pr(Y=y\mid X=x)\Big)\Pr(X=x)$$
Everything inside the big parentheses is non-negative, and a sum of non-negative quantities can be minimized by minimizing each summand individually; this is possible here because the value $f(x)$ can be chosen independently for each $x$. In context, this means that we can choose $f$ to minimize
$$\sum_y (y - f(x))^2 \Pr(Y=y\mid X=x)$$
individually for each discrete value of x. This is exactly the content of what ESL is claiming, only with fancier notation.
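The per-$x$ minimizer of that inner sum is the conditional mean $f(x) = E(Y \mid X = x)$, which is the regression function ESL arrives at. This can be checked numerically; the conditional distribution below is an arbitrary illustration for one fixed value of $x$:

```python
import numpy as np

y = np.array([0.0, 1.0, 4.0])
p_y_given_x = np.array([0.2, 0.5, 0.3])  # Pr(Y = y | X = x) for one fixed x (illustrative)

def inner_sum(c):
    """The quantity being minimized over c = f(x): sum_y (y - c)^2 Pr(Y = y | X = x)."""
    return np.sum((y - c) ** 2 * p_y_given_x)

cond_mean = np.sum(y * p_y_given_x)  # E(Y | X = x)

# The conditional mean beats every other candidate value of f(x).
for c in np.linspace(-5.0, 5.0, 101):
    assert inner_sum(cond_mean) <= inner_sum(c) + 1e-12
```

Since each summand is a convex quadratic in $c$, setting the derivative to zero gives $c = \sum_y y \Pr(Y=y\mid X=x)$, confirming the loop's result analytically.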