Equation (2.11) is a consequence of the following identity. For any two random variables $Z_1$ and $Z_2$ and any function $g(Z_1, Z_2)$,

$$E_{Z_1,Z_2}\big(g(Z_1,Z_2)\big) = E_{Z_2}\Big(E_{Z_1\mid Z_2}\big(g(Z_1,Z_2)\mid Z_2\big)\Big)$$
The notation $E_{Z_1,Z_2}$ is the expectation over the joint distribution. The notation $E_{Z_1\mid Z_2}$ essentially says "integrate over the conditional distribution of $Z_1$ as if $Z_2$ were fixed".
It's easy to verify this in the case that $Z_1$ and $Z_2$ are discrete random variables by just unwinding the definitions involved:
$$
\begin{aligned}
E_{Z_2}\Big(E_{Z_1\mid Z_2}\big(g(Z_1,Z_2)\mid Z_2\big)\Big)
&= E_{Z_2}\Big(\sum_{z_1} g(z_1,Z_2)\Pr(Z_1=z_1\mid Z_2)\Big)\\
&= \sum_{z_2}\Big(\sum_{z_1} g(z_1,z_2)\Pr(Z_1=z_1\mid Z_2=z_2)\Big)\Pr(Z_2=z_2)\\
&= \sum_{z_1,z_2} g(z_1,z_2)\Pr(Z_1=z_1\mid Z_2=z_2)\Pr(Z_2=z_2)\\
&= \sum_{z_1,z_2} g(z_1,z_2)\Pr(Z_1=z_1, Z_2=z_2)\\
&= E_{Z_1,Z_2}\big(g(Z_1,Z_2)\big)
\end{aligned}
$$
The continuous case can either be viewed informally as a limit of this argument, or verified formally once all the measure-theoretic doodads are in place.
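In the discrete case, the identity is also easy to check numerically. The joint distribution and the function $g$ below are arbitrary choices for illustration; any valid probability table and any $g$ would work:

```python
import numpy as np

# Arbitrary joint distribution over (z1, z2) on a 3x2 grid (illustrative values).
P = np.array([[0.10, 0.20],
              [0.05, 0.25],
              [0.15, 0.25]])   # P[i, j] = Pr(Z1 = z1[i], Z2 = z2[j])
z1 = np.array([0.0, 1.0, 2.0])
z2 = np.array([-1.0, 3.0])
g = lambda a, b: (a - b) ** 2  # any function of (z1, z2) works here

G = g(z1[:, None], z2[None, :])   # g evaluated at every grid point

# Left-hand side: expectation over the joint distribution.
lhs = np.sum(G * P)

# Right-hand side: inner conditional expectation given Z2, then outer expectation over Z2.
p_z2 = P.sum(axis=0)              # marginal Pr(Z2 = z2[j])
cond = P / p_z2                   # Pr(Z1 = z1[i] | Z2 = z2[j])
inner = np.sum(G * cond, axis=0)  # E[g(Z1, Z2) | Z2 = z2[j]]
rhs = np.sum(inner * p_z2)

assert np.isclose(lhs, rhs)
```

The assertion passes for any choice of `P` and `g`, since the two sides are algebraically identical by the derivation above.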
To unwind the application, take $Z_1 = Y$, $Z_2 = X$, and $g(x, y) = (y - f(x))^2$. Everything lines up exactly.
The assertion (2.12) asks us to consider minimizing
$$E_X E_{Y\mid X}\big((Y - f(X))^2 \mid X\big)$$
where we are free to choose $f$ as we wish. Again focusing on the discrete case, and picking up halfway through the unwinding above, we see that we are minimizing
$$\sum_x \Big(\sum_y (y - f(x))^2 \Pr(Y=y\mid X=x)\Big)\Pr(X=x)$$
Everything inside the big parentheses is non-negative, and a sum of non-negative quantities can be minimized by minimizing each summand individually; this is possible here because the value $f(x)$ can be chosen independently for each $x$. In context, this means that we can choose $f$ to minimize
$$\sum_y (y - f(x))^2 \Pr(Y=y\mid X=x)$$
individually for each discrete value of x. This is exactly the content of what ESL is claiming, only with fancier notation.
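The per-$x$ minimizer of that inner sum is the conditional mean $f(x) = E(Y \mid X = x)$, which is the regression function ESL arrives at. This can be checked numerically; the conditional distribution below is an arbitrary illustration for one fixed value of $x$:

```python
import numpy as np

y = np.array([0.0, 1.0, 4.0])
p_y_given_x = np.array([0.2, 0.5, 0.3])  # Pr(Y = y | X = x) for one fixed x (illustrative)

def inner_sum(c):
    """The quantity being minimized over c = f(x): sum_y (y - c)^2 Pr(Y = y | X = x)."""
    return np.sum((y - c) ** 2 * p_y_given_x)

cond_mean = np.sum(y * p_y_given_x)  # E(Y | X = x)

# The conditional mean beats every other candidate value of f(x).
for c in np.linspace(-5.0, 5.0, 101):
    assert inner_sum(cond_mean) <= inner_sum(c) + 1e-12
```

Since each summand is a convex quadratic in $c$, setting the derivative to zero gives $c = \sum_y y \Pr(Y=y\mid X=x)$, confirming the loop's result analytically.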