Intuition behind the variance of the least squares estimator


18

If $X$ has full rank, the inverse of $X^TX$ exists and we get the least squares estimate

$$\hat{\beta} = (X^TX)^{-1}X^TY$$

and

$$\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^TX)^{-1}.$$

How can we intuitively explain the $(X^TX)^{-1}$ term in the variance formula? The technical derivation is clear to me.
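For concreteness, here is a minimal numpy sketch (illustrative only, using arbitrary simulated data and made-up parameter values) that computes both quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
sigma2 = 4.0                        # true error variance (made up)

X = rng.normal(size=(n, p))         # full-rank design matrix
beta = np.array([1.0, -2.0, 0.5])   # true coefficients (made up)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y        # (X^T X)^{-1} X^T y
var_beta_hat = sigma2 * XtX_inv     # sigma^2 (X^T X)^{-1}

print(beta_hat)
print(np.sqrt(np.diag(var_beta_hat)))   # standard errors of the coefficients
```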


3
You may want to add a note pointing out that the formula you state for the variance-covariance matrix of $\hat{\beta}$ (assuming $\hat{\beta}$ is the OLS estimator) is correct only if the conditions of the Gauss-Markov theorem are satisfied and, in particular, only if the variance-covariance matrix of the error terms is given by $\sigma^2 I_n$, where $I_n$ is the $n \times n$ identity matrix and $n$ is the number of rows of $X$ (and $Y$). The formula you provide does not hold in the more general case of non-spherical errors.
Mico

Answers:


13

Consider a simple regression without a constant term, where the single regressor is centered at its sample mean. Then $X^TX$ is ($n$ times) its sample variance, and $(X^TX)^{-1}$ its reciprocal. So the higher the variance, i.e. the variability, of the regressor, the lower the variance of the coefficient estimator: the more variability we have in the explanatory variable, the more precisely we can estimate the unknown coefficient.

Why? Because the more a regressor varies, the more information it contains. With many regressors, this generalizes to the inverse of their variance-covariance matrix, which also takes into account the covariances between the regressors. In the extreme case where $X^TX$ is diagonal, the precision of each estimated coefficient depends only on the variance/variability of the associated regressor (given the variance of the error term).
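For illustration, here is a small simulation sketch (not part of the original argument, with arbitrary parameter choices) comparing the spread of the slope estimator under a low-variability and a high-variability regressor:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, beta, reps = 50, 1.0, 2.0, 5000

def var_of_slope(x_scale):
    """Centered regressor, no intercept: beta_hat = x'y / x'x."""
    x = rng.normal(scale=x_scale, size=n)
    x -= x.mean()
    estimates = []
    for _ in range(reps):
        y = beta * x + rng.normal(scale=sigma, size=n)
        estimates.append(x @ y / (x @ x))
    return np.var(estimates)

print(var_of_slope(0.5))   # low-variability regressor: larger Var(beta_hat)
print(var_of_slope(2.0))   # high-variability regressor: smaller Var(beta_hat)
```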


Could you relate this argument to the fact that the inverse of the variance-covariance matrix yields the partial correlations?
Heisenberg

5

A simple way to view $\sigma^2 (X^TX)^{-1}$ is as the matrix (multivariate) analogue of $\frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}$, which is the variance of the slope coefficient in simple OLS regression. One can even get $\frac{\sigma^2}{\sum_{i=1}^n X_i^2}$ for that variance by omitting the intercept in the model, i.e. by performing regression through the origin.

From either of these formulas it can be seen that larger variability in the predictor variable will generally lead to more precise estimation of its coefficient. This is an idea often exploited in the design of experiments, where by choosing the values of the (non-random) predictors one tries to make the determinant of $X^TX$ as large as possible, the determinant being a measure of variability.
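To make the design-of-experiments remark concrete, here is a small sketch (illustrative only, assuming a single fixed predictor and hypothetical design points) comparing two designs with the same number of observations:

```python
import numpy as np

sigma2 = 1.0
# Two non-random designs with the same number of points:
x_spread  = np.array([-3.0, -3.0, -3.0, 3.0, 3.0, 3.0])   # points pushed to the extremes
x_clumped = np.array([-1.0, -0.5, -0.5, 0.5, 0.5, 1.0])   # points clumped near the center

for x in (x_spread, x_clumped):
    X = np.column_stack([np.ones_like(x), x])      # intercept + predictor
    XtX = X.T @ X
    var_slope = sigma2 * np.linalg.inv(XtX)[1, 1]  # Var(beta_hat) for the slope
    print(np.linalg.det(XtX), var_slope)           # larger det(X'X) -> smaller variance
```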


2

Would a linear transformation of a Gaussian random variable help? Use the rule that if $x \sim \mathcal{N}(\mu, \Sigma)$, then $Ax + b \sim \mathcal{N}(A\mu + b, A \Sigma A^T)$.

Assume that $Y = X\beta + \varepsilon$ is the underlying model and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. Then

$$Y \sim \mathcal{N}(X\beta, \sigma^2 I)$$
$$X^TY \sim \mathcal{N}(X^TX\beta, X^T(\sigma^2 I)X) = \mathcal{N}(X^TX\beta, \sigma^2 X^TX)$$
$$(X^TX)^{-1}X^TY \sim \mathcal{N}\left[\beta, (X^TX)^{-1}\sigma^2\right]$$

So $(X^TX)^{-1}X^T$ is just a complicated scaling matrix that transforms the distribution of $Y$.
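A quick Monte Carlo sketch (illustrative only, with arbitrary choices of $X$, $\beta$, and $\sigma$) confirms that the sampling covariance of $(X^TX)^{-1}X^TY$ matches $\sigma^2(X^TX)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 200, 2, 1.5
X = rng.normal(size=(n, p))
beta = np.array([1.0, -1.0])
A = np.linalg.inv(X.T @ X) @ X.T          # the "scaling matrix" (X'X)^{-1} X'

draws = []
for _ in range(20000):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    draws.append(A @ Y)                   # one draw of beta_hat

print(np.cov(np.array(draws).T))          # empirical covariance of beta_hat
print(sigma**2 * np.linalg.inv(X.T @ X))  # theoretical sigma^2 (X'X)^{-1}
```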

Hope this helps.


Nothing in the derivation of the OLS estimator and its variance requires normality of the error terms. All that's required is $E(\varepsilon) = 0$ and $E(\varepsilon\varepsilon^T) = \sigma^2 I_n$. (Of course, normality is required to show that OLS achieves the Cramér-Rao lower bound, but that's not what the OP's posting is about, is it?)
Mico

2

I'll take a different approach towards developing the intuition that underlies the formula $\operatorname{Var}(\hat{\beta}) = \sigma^2 (X^TX)^{-1}$. When developing intuition for the multiple regression model, it's helpful to consider the bivariate linear regression model, viz.,
$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, \ldots, n.$$
$\alpha + \beta x_i$ is frequently called the deterministic contribution to $y_i$, and $\varepsilon_i$ is called the stochastic contribution. Expressed in terms of deviations from the sample means $(\bar{x}, \bar{y})$, this model may also be written as
$$(y_i - \bar{y}) = \beta (x_i - \bar{x}) + (\varepsilon_i - \bar{\varepsilon}), \qquad i = 1, \ldots, n.$$

To help develop the intuition, we will assume that the simplest Gauss-Markov assumptions are satisfied: $x_i$ nonstochastic, $\sum_{i=1}^n (x_i - \bar{x})^2 > 0$ for all $n$, and $\varepsilon_i \sim \text{iid}(0, \sigma^2)$ for all $i = 1, \ldots, n$. As you already know very well, these conditions guarantee that
$$\operatorname{Var}(\hat{\beta}) = \frac{1}{n} \sigma^2 (\operatorname{Var}(x))^{-1},$$
where $\operatorname{Var}(x)$ is the sample variance of $x$. In words, this formula makes three claims: "The variance of $\hat{\beta}$ is inversely proportional to the sample size $n$, it is directly proportional to the variance of $\varepsilon$, and it is inversely proportional to the variance of $x$."

Why should doubling the sample size, ceteris paribus, cause the variance of $\hat{\beta}$ to be cut in half? This result is intimately linked to the iid assumption applied to $\varepsilon$: Since the individual errors are assumed to be iid, each observation should be treated ex ante as being equally informative. And, doubling the number of observations doubles the amount of information about the parameters that describe the (assumed linear) relationship between $x$ and $y$. Having twice as much information cuts the uncertainty about the parameters in half. Similarly, it should be straightforward to develop one's intuition as to why doubling $\sigma^2$ also doubles the variance of $\hat{\beta}$.

Let's turn, then, to your main question, which is about developing intuition for the claim that the variance of $\hat{\beta}$ is inversely proportional to the variance of $x$. To formalize notions, let us consider two separate bivariate linear regression models, called Model $(1)$ and Model $(2)$ from now on. We will assume that both models satisfy the assumptions of the simplest form of the Gauss-Markov theorem and that the models share the exact same values of $\alpha$, $\beta$, $n$, and $\sigma^2$. Under these assumptions, it is easy to show that $E(\hat{\beta}^{(1)}) = E(\hat{\beta}^{(2)}) = \beta$; in words, both estimators are unbiased. Crucially, we will also assume that whereas $\bar{x}^{(1)} = \bar{x}^{(2)} = \bar{x}$, $\operatorname{Var}(x^{(1)}) \neq \operatorname{Var}(x^{(2)})$. Without loss of generality, let us assume that $\operatorname{Var}(x^{(1)}) > \operatorname{Var}(x^{(2)})$. Which of the two estimators of $\beta$ will have the smaller variance? Put differently, will $\hat{\beta}^{(1)}$ or $\hat{\beta}^{(2)}$ be closer, on average, to $\beta$? From the earlier discussion, we have $\operatorname{Var}(\hat{\beta}^{(k)}) = \frac{1}{n} \sigma^2 / \operatorname{Var}(x^{(k)})$ for $k = 1, 2$. Because $\operatorname{Var}(x^{(1)}) > \operatorname{Var}(x^{(2)})$ by assumption, it follows that $\operatorname{Var}(\hat{\beta}^{(1)}) < \operatorname{Var}(\hat{\beta}^{(2)})$. What, then, is the intuition behind this result?

Because by assumption $\operatorname{Var}(x^{(1)}) > \operatorname{Var}(x^{(2)})$, on average each $x_i^{(1)}$ will be farther away from $\bar{x}$ than is the case, on average, for $x_i^{(2)}$. Let us denote the expected average absolute difference between $x_i$ and $\bar{x}$ by $d_x$. The assumption that $\operatorname{Var}(x^{(1)}) > \operatorname{Var}(x^{(2)})$ implies that $d_x^{(1)} > d_x^{(2)}$. The bivariate linear regression model, expressed in deviations from means, states that $d_y = \beta d_x^{(1)}$ for Model $(1)$ and $d_y = \beta d_x^{(2)}$ for Model $(2)$. If $\beta \neq 0$, this means that the deterministic component of Model $(1)$, $\beta d_x^{(1)}$, has a greater influence on $d_y$ than does the deterministic component of Model $(2)$, $\beta d_x^{(2)}$. Recall that both models are assumed to satisfy the Gauss-Markov assumptions, that the error variances are the same in both models, and that $\beta^{(1)} = \beta^{(2)} = \beta$. Since Model $(1)$ imparts more information about the contribution of the deterministic component of $y$ than does Model $(2)$, it follows that the precision with which the deterministic contribution can be estimated is greater for Model $(1)$ than for Model $(2)$. The converse of greater precision is a lower variance of the point estimate of $\beta$.
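A simulation sketch (illustrative only, with hypothetical fixed designs for Model $(1)$ and Model $(2)$ that share the same mean of $x$) of the two-model comparison:

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha, beta, sigma, reps = 40, 0.5, 2.0, 1.0, 10000

def var_of_slope(x):
    """Monte Carlo variance of the OLS slope for a fixed design x."""
    xc = x - x.mean()
    estimates = []
    for _ in range(reps):
        y = alpha + beta * x + rng.normal(scale=sigma, size=n)
        estimates.append(xc @ (y - y.mean()) / (xc @ xc))
    return np.var(estimates)

x2 = rng.normal(loc=1.0, scale=1.0, size=n)   # Model (2): lower Var(x)
x1 = x2.mean() + 3.0 * (x2 - x2.mean())       # Model (1): same mean, higher Var(x)

print(var_of_slope(x1))   # smaller: Var(beta_hat) for Model (1)
print(var_of_slope(x2))   # larger:  Var(beta_hat) for Model (2)
```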

It is reasonably straightforward to generalize the intuition obtained from studying the simple regression model to the general multiple linear regression model. The main complication is that instead of comparing scalar variances, it is necessary to compare the "size" of variance-covariance matrices. Having a good working knowledge of determinants, traces and eigenvalues of real symmetric matrices comes in very handy at this point :-)


1

Say we have $n$ observations (the sample size) and $p$ parameters.

The covariance matrix $\operatorname{Var}(\hat{\beta})$ of the estimated parameters $\hat{\beta}_1, \hat{\beta}_2$, etc. is a representation of the accuracy of the estimated parameters.

If in an ideal world the data could be perfectly described by the model, then the noise would be $\sigma^2 = 0$. Now, the diagonal entries of $\operatorname{Var}(\hat{\beta})$ correspond to $\operatorname{Var}(\hat{\beta}_1), \operatorname{Var}(\hat{\beta}_2)$, etc. The derived formula for the variance agrees with the intuition that if the noise is lower, the estimates will be more accurate.

In addition, as the number of measurements $n$ gets larger, the variance of the estimated parameters will decrease. The reason is that each entry of $X^TX$ is a sum of $n$ product pairs (the number of columns of $X^T$ and the number of rows of $X$ both equal $n$), so on the whole the absolute values of the entries of $X^TX$ grow with $n$, and the absolute values of the entries of the inverse $(X^TX)^{-1}$ become smaller.

Hence, even if there is a lot of noise, we can still reach good estimates $\hat{\beta}_i$ of the parameters if we increase the sample size $n$.
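As a rough numerical check (a sketch not taken from the reference below), one can watch the diagonal entries of $(X^TX)^{-1}$ shrink as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 3

for n in (20, 200, 2000):
    X = rng.normal(size=(n, p))
    XtX_inv = np.linalg.inv(X.T @ X)
    # Diagonal entries are Var(beta_hat_i)/sigma^2; they shrink roughly like 1/n
    print(n, np.diag(XtX_inv))
```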

I hope this helps.

Reference: Section 7.3 on least squares in Cosentino, Carlo, and Declan Bates. Feedback Control in Systems Biology. CRC Press, 2011.


1

This builds on @Alecos Papadopoulos' answer.

Recall that the result of a least-squares regression doesn't depend on the units of measurement of your variables. Suppose your X-variable is a length measurement, given in inches. Then rescaling X, say by multiplying by 2.54 to change the unit to centimeters, doesn't materially affect things. If you refit the model, the new regression estimate will be the old estimate divided by 2.54.

The $X^TX$ matrix is (proportional to) the variance of $X$, and hence reflects the scale of measurement of $X$. If you change the scale, you have to reflect this in your estimate of $\beta$, and this is done by multiplying by the inverse of $X^TX$.
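A short sketch (illustrative only, using a hypothetical inches-to-centimeters rescaling with made-up data) showing the effect on both the estimate and its variance:

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma = 100, 0.3
x_in = rng.uniform(10, 30, size=n)              # lengths in inches
y = 1.0 + 0.5 * x_in + rng.normal(scale=sigma, size=n)

def fit(x):
    X = np.column_stack([np.ones_like(x), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    return beta_hat[1], (sigma**2 * XtX_inv)[1, 1]   # slope and its variance

slope_in, var_in = fit(x_in)
slope_cm, var_cm = fit(x_in * 2.54)              # same data, rescaled to centimeters

print(slope_in / slope_cm)   # ~2.54: the estimate is divided by the scale factor
print(var_in / var_cm)       # ~2.54**2: the variance shrinks by the square of the factor
```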
