What is the expected correlation between the residuals and the dependent variable?

26

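In multiple linear regression, I can understand that the correlation between the residuals and the predictors is zero, but what is the expected correlation between the residuals and the criterion variable? Should it be zero, or should they be highly correlated? What would that mean?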

regression residuals

— Jfly

4

What is a "criterion variable"?

— whuber

2

@whuber I guess Jfly means the response/outcome/dependent/etc. variable: davidmlane.com/hyperstat/A101702.html It is interesting to see the many names that variables go by: en.wikipedia.org/wiki/...

— Jeromy Anglim

@Jeromy Thanks! I guessed that was the meaning but was not sure. It is a new term to me, and apparently to Wikipedia.

— whuber

I would have thought this would be equal to $E[R^2]$ or something similar, as $R^2=[corr(y,\hat{y})]^2$

— probabilityislogic

$y = f(x) + e$, where $f$ is the regression function, $e$ is the error, and $Cov(f(x),e) = 0$. Then $Corr(y,e) = SD(e)/SD(y) = \sqrt{1-R^2}$. That is the sample statistic; its expected value will be similar but messier.

— Ray Koopman

20

In the regression model:

$y_i=\mathbf{x}_i'\beta+u_i$

the usual assumption is that $(y_i,\mathbf{x}_i,u_i)$, $i=1,...,n$, is an iid sample. Under the assumptions that $E\mathbf{x}_iu_i=0$ and that $E(\mathbf{x}_i\mathbf{x}_i')$ has full rank, the ordinary least squares estimator

$\widehat{\beta}=\left(\sum_{i=1}^n\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\sum_{i=1}^n\mathbf{x}_iy_i$

is consistent and asymptotically normal. The expected covariance between the residual and the response variable is then:

$Ey_iu_i=E(\mathbf{x}_i'\beta+u_i)u_i=Eu_i^2$

If we furthermore assume that $E(u_i|\mathbf{x}_1,...,\mathbf{x}_n)=0$ and $E(u_i^2|\mathbf{x}_1,...,\mathbf{x}_n)=\sigma^2$, we can calculate the expected covariance between $y_i$ and its regression residual:

$\begin{aligned} Ey_i\widehat{u}_i&=Ey_i(y_i-\mathbf{x}_i'\widehat{\beta})\\ &=E(\mathbf{x}_i'\beta+u_i)(u_i-\mathbf{x}_i'(\widehat{\beta}-\beta))\\ &=E(u_i^2)\left(1-E\mathbf{x}_i' \left(\sum_{j=1}^n\mathbf{x}_j\mathbf{x}_j'\right)^{-1}\mathbf{x}_i\right) \end{aligned}$

Now to get the correlation we need to calculate $\text{Var}(y_i)$ and $\text{Var}(\hat{u}_i)$ . It turns out that

$\text{Var}(\hat u_i)=E(y_i\hat{u}_i),$

hence

$\text{Corr}(y_i,\hat u_i)=\sqrt{1-E\mathbf{x}_i' \left(\sum_{j=1}^n\mathbf{x}_j\mathbf{x}_j'\right)^{-1}\mathbf{x}_i}$

Now the term $\mathbf{x}_i' \left(\sum_{j=1}^n\mathbf{x}_j\mathbf{x}_j'\right)^{-1}\mathbf{x}_i$ comes from the diagonal of the hat matrix $H=X(X'X)^{-1}X'$, where $X=[\mathbf{x}_1,...,\mathbf{x}_N]'$. The matrix $H$ is idempotent, hence it satisfies the following property

$\text{trace}(H)=\sum_{i}h_{ii}=\text{rank}(H),$

where $h_{ii}$ is the $i$-th diagonal term of $H$. The $\text{rank}(H)$ is the number of linearly independent variables in $\mathbf{x}_i$, which is usually the number of variables. Let us call it $p$. The number of $h_{ii}$ is the sample size $N$. So we have $N$ nonnegative terms which should sum up to $p$. Usually $N$ is much bigger than $p$, hence a lot of the $h_{ii}$ will be close to zero, meaning that the correlation between the residual and the response variable will be close to 1 for most of the observations.

The term $h_{ii}$ is also used in various regression diagnostics for determining influential observations.
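
A small numerical sketch of the trace property, under assumptions that are not in the answer itself: a simple regression with an intercept (so $p=2$), a simulated predictor, and the closed-form leverage $h_{ii}=1/N+(x_i-\bar{x})^2/S_{xx}$ that applies to that particular design. It treats $X$ as fixed, so it reports $\sqrt{1-h_{ii}}$ per observation rather than the expectation.

```python
import math
import random

random.seed(0)
n = 200  # sample size (N in the text)
x = [random.gauss(0.0, 1.0) for _ in range(n)]  # hypothetical predictor values

# Simple regression with an intercept: design matrix X = [1, x], so p = 2.
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Closed-form diagonal of the hat matrix for this design.
h = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

trace_h = sum(h)                                   # equals p = 2
per_obs_corr = [math.sqrt(1.0 - hi) for hi in h]   # sqrt(1 - h_ii), one per observation
rough_corr = math.sqrt(1.0 - 2 / n)                # sqrt(1 - p/N)

print(f"trace(H) = {trace_h:.6f}")
print(f"sqrt(1 - h_ii) ranges from {min(per_obs_corr):.4f} to {max(per_obs_corr):.4f}")
print(f"sqrt(1 - p/N) = {rough_corr:.4f}")
```

Because the $h_{ii}$ average to $p/N$, the per-observation values cluster around $\sqrt{1-p/N}$ unless some $x_i$ is far from the rest.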

— mpiktas

10

+1 This is exactly the right analysis. But why don't you finish the job and answer the question? The OP asks whether this correlation is "high" and what it might mean.

— whuber

So you could say that the correlation is roughly $\sqrt{1-\frac{p}{N}}$

— probabilityislogic

1

Correlation is different for every observation, but yeah you can say that, provided X does not have outliers.

— mpiktas

21

The correlation depends on the $R^2$ . If $R^2$ is high, it means that much of variation in your dependent variable can be attributed to variation in your independent variables, and NOT your error term.

However, if $R^2$ is low, then it means that much of the variation in your dependent variable is unrelated to variation in your independent variables, and thus must be related to the error term.

Consider the following model:

$Y=X\beta+\varepsilon$ , where $Y$ and $X$ are uncorrelated.

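Assuming sufficient regularity conditions for the CLT to hold, $\hat{\beta}$ will converge to $0$, since $X$ and $Y$ are uncorrelated. Therefore $\hat{Y}=X\hat{\beta}$ will be zero, so that $\varepsilon:=Y-\hat{Y}=Y-0=Y$. Then $\varepsilon$ and $Y$ are perfectly correlated!!!

Holding all else fixed, a higher $R^2$ means a lower correlation between the error and the dependent variable. A low $R^2$ (and hence high correlation between error and dependent) may be due to model misspecification.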
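
To make the dependence on $R^2$ concrete, here is a minimal sketch on simulated data. The data-generating model, the noise levels and the helper names are all hypothetical; the only thing checked is the sample identity quoted elsewhere in this thread, $corr(y,\hat{u})=\sqrt{1-R^2}$, for a least-squares fit with an intercept.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def corr(a, b):
    """Pearson sample correlation."""
    mean_a, mean_b = mean(a), mean(b)
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def fit_simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (fitted values, residuals)."""
    mean_x, mean_y = mean(x), mean(y)
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals

random.seed(1)
n = 500
x = [random.gauss(0.0, 1.0) for _ in range(n)]

results = []  # tuples: (noise sd, R^2, corr(y, residuals), sqrt(1 - R^2))
for noise_sd in (0.1, 1.0, 10.0):
    y = [2.0 * xi + random.gauss(0.0, noise_sd) for xi in x]
    fitted, residuals = fit_simple_ols(x, y)
    r2 = corr(y, fitted) ** 2
    results.append((noise_sd, r2, corr(y, residuals), math.sqrt(max(0.0, 1.0 - r2))))

for noise_sd, r2, c, root in results:
    print(f"noise sd {noise_sd:5.1f}: R^2 = {r2:.4f}, corr(y, resid) = {c:.4f}, sqrt(1 - R^2) = {root:.4f}")
```

The noisier the process, the lower the $R^2$ and the closer the correlation gets to 1, without any misspecification being involved in this simulation.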

— Matt

I find this answer confusing, in part through its use of "$\varepsilon$" to stand both for the error terms in the model and the residuals $Y-\hat Y$. Another point of confusion is the reference to "converge to" even though there is no sequence of anything at all in evidence to which convergence might apply. The assumption that $X$ and $Y$ are uncorrelated seems special and not illustrative of general circumstances. All this obscures whatever this answer might be trying to say or which claims are generally true.

— whuber

17

I find this topic quite interesting and current answers are unfortunately incomplete or partly misleading - despite the relevance and high popularity of this question.

By definition of the classical OLS framework there should be no relationship between $\hat{y}$ and $\hat{u}$, since the residuals are by construction uncorrelated with $\hat{y}$ when deriving the OLS estimator. The variance-minimizing property under homoskedasticity ensures that the residual errors are randomly spread around the fitted values. This can be formally shown by:

$\text{Cov}(\hat{y},\hat{u}|X)=\text{Cov}(Py,My|X)=\text{Cov}(Py,(I-P)y|X)=P\,\text{Cov}(y,y)(I-P)'$

$=P\sigma^2-P\sigma^2=0$

where $M$ and $P$ are idempotent matrices defined as $P=X(X'X)^{-1}X'$ and $M=I-P$.
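
As a quick illustration of the in-sample counterpart of this result, the sketch below fits a simple regression to simulated data and computes the sample covariance between fitted values and residuals. The data and helper names are hypothetical; for a least-squares fit with an intercept this sample covariance is zero up to floating-point error.

```python
import random

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Sample covariance (divisor n - 1)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def fit_simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (fitted values, residuals)."""
    slope = covariance(x, y) / covariance(x, x)
    intercept = mean(y) - slope * mean(x)
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals

random.seed(42)
n = 300
x = [random.uniform(0.0, 10.0) for _ in range(n)]
y = [1.0 + 0.5 * xi + random.gauss(0.0, 2.0) for xi in x]  # hypothetical data

y_hat, u_hat = fit_simple_ols(x, y)
cov_fitted_resid = covariance(y_hat, u_hat)
print(f"Cov(y_hat, u_hat) = {cov_fitted_resid:.2e}")
```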

This result is based on strict exogeneity and homoskedasticity, and practically holds in large samples. The intuition for their uncorrelatedness is the following: the fitted values $\hat{y}$ conditional on $X$ are centered around $\hat{u}$, which are thought of as independently and identically distributed. However, any deviation from the strict exogeneity and homoskedasticity assumptions could cause the explanatory variables to be endogenous and spur a latent correlation between $\hat{u}$ and $\hat{y}$.

Now the correlation between the residuals $\hat{u}$ and the "original" $y$ is a completely different story:

$\text{Cov}(y,\hat{u}|X)=\text{Cov}(y,My|X)=\text{Cov}(y,(I-P)y)=\text{Cov}(y,y)(I-P)=\sigma^2 M$

Some checking in the theory shows that this covariance matrix is identical to the covariance matrix of the residual $\hat{u}$ itself (proof omitted). We have:

$\text{Var}(\hat{u})=\sigma^2 M=\text{Cov}(y,\hat{u}|X)$

If we would like to calculate the (scalar) covariance between $y$ and $\hat{u}$ as requested by the OP, we obtain:

$\implies \text{Cov}_{scalar}(y,\hat{u}|X)=\text{Var}(\hat{u}|X)=\left(\sum u_i^2\right)/N$

(obtained by summing up the diagonal entries of the covariance matrix and dividing by $N$)

The above formula indicates an interesting point. If we test the relationship by regressing $y$ on the residuals $\hat{u}$ (plus a constant), the slope coefficient is $\beta_{\hat{u},y}=1$, which can be easily derived by dividing the above expression by $\text{Var}(\hat{u}|X)$.
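
The unit slope can be checked numerically. This is only a sketch on simulated data with hypothetical helper names: it fits $y$ on $x$, then computes the slope of $y$ on the residuals as the ratio of the sample covariance to the sample variance of the residuals.

```python
import random

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Sample covariance (divisor n - 1)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def ols_residuals(x, y):
    """Residuals from the least-squares fit of y = a + b*x."""
    slope = covariance(x, y) / covariance(x, x)
    intercept = mean(y) - slope * mean(x)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

random.seed(7)
n = 400
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [3.0 - 1.5 * xi + random.gauss(0.0, 1.0) for xi in x]  # hypothetical data

u_hat = ols_residuals(x, y)
cov_y_resid = covariance(y, u_hat)      # sample Cov(y, u_hat)
var_resid = covariance(u_hat, u_hat)    # sample Var(u_hat)
slope_y_on_resid = cov_y_resid / var_resid

print(f"Cov(y, u_hat) = {cov_y_resid:.6f}, Var(u_hat) = {var_resid:.6f}")
print(f"slope of y on u_hat = {slope_y_on_resid:.6f}")
```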

On the other hand, the correlation is the covariance standardized by the respective standard deviations. Now, the variance matrix of the residuals is $\sigma^2 M$, while the variance of $y$ is $\sigma^2 I$. The correlation $\text{Corr}(y,\hat{u})$ therefore becomes:

$\text{Corr}(y,\hat{u})=\frac{\text{Var}(\hat{u})}{\sqrt{\text{Var}(\hat{u})\text{Var}(y)}}=\sqrt{\frac{\text{Var}(\hat{u})}{\text{Var}(y)}}=\sqrt{\frac{\text{Var}(\hat{u})}{\sigma^2}}$

This is the core result which ought to hold in a linear regression. The intuition is that $\text{Corr}(y,\hat{u})$ expresses the error between the true variance of the error term and a proxy for the variance based on residuals. Notice that the variance of $y$ is equal to the variance of $\hat{y}$ plus the variance of the residuals $\hat{u}$. So it can be more intuitively rewritten as:

$\text{Corr}(y,\hat{u})=\frac{1}{\sqrt{1+\frac{\text{Var}(\hat{y})}{\text{Var}(\hat{u})}}}$
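
The rewriting step can be spelled out as follows, reading the variances as sample variances from a fit with an intercept (an assumption added here for illustration; $R^2$ is the usual coefficient of determination of that fit):

$\text{Var}(y)=\text{Var}(\hat{y})+\text{Var}(\hat{u}) \implies \frac{\text{Var}(\hat{u})}{\text{Var}(y)}=\frac{1}{1+\frac{\text{Var}(\hat{y})}{\text{Var}(\hat{u})}}$

$\frac{\text{Var}(\hat{y})}{\text{Var}(\hat{u})}=\frac{R^2}{1-R^2} \implies \text{Corr}(y,\hat{u})=\frac{1}{\sqrt{1+\frac{R^2}{1-R^2}}}=\sqrt{1-R^2}$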

There are two forces at work here. If we have a great fit of the regression line, the correlation is expected to be low due to $\text{Var}(\hat{u})\approx 0$. On the other hand, $\text{Var}(\hat{y})$ is a bit of a fudge to estimate, as it is unconditional and a line in parameter space. Comparing unconditional and conditional variances within a ratio may not be an appropriate indicator after all. Perhaps that is why it is rarely done in practice.

An attempt to conclude the question: the correlation between $y$ and $\hat{u}$ is positive and relates to the ratio of the variance of the residuals and the variance of the true error term, proxied by the unconditional variance in $y$. Hence, it is a bit of a misleading indicator.

Although this exercise may give us some intuition on the workings and inherent theoretical assumptions of an OLS regression, we rarely evaluate the correlation between $y$ and $\hat{u}$. There are certainly more established tests for checking properties of the true error term. Secondly, keep in mind that the residuals are not the error term, and tests on the residuals $\hat{u}$ that make predictions about the characteristics of the true error term $u$ are limited; their validity needs to be handled with utmost care.

For example, I would like to point out a statement made by a previous poster here. It is said that,

"If your residuals are correlated with your independent variables, then your model is heteroskedastic..."

I think that may not be entirely valid in this context. Believe it or not, the OLS residuals $\hat{u}$ are by construction uncorrelated with the independent variable $x_k$. To see this, consider:

$X'\hat{u}=X'My=X'(I-P)y=X'y-X'Py$

$=X'y-X'X(X'X)^{-1}X'y=X'y-X'y=0$

$\implies X'\hat{u}=0 \implies \text{Cov}(X',\hat{u}|X)=0 \implies \text{Cov}(x_{ki},\hat{u}_i|x_{ki})=0$
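
The orthogonality is easy to confirm numerically. In the sketch below the data are simulated with an error spread that grows with $x$ on purpose (a hypothetical setup), to show that the two rows of the residual cross-product for the design matrix $X=[1,x]$ still vanish up to floating-point error.

```python
import random

random.seed(3)
n = 300
x = [random.uniform(1.0, 10.0) for _ in range(n)]
# Hypothetical data whose error spread grows with x (heteroskedastic on purpose).
y = [2.0 + 0.8 * xi + random.gauss(0.0, 0.3 * xi) for xi in x]

# Least-squares fit of y = a + b*x.
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
u_hat = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# The two entries of X'u_hat for X = [1, x].
sum_resid = sum(u_hat)
sum_x_resid = sum(xi * ui for xi, ui in zip(x, u_hat))

print(f"sum of residuals      = {sum_resid:.2e}")
print(f"sum of x * residuals  = {sum_x_resid:.2e}")
```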

However, you may have heard claims that an explanatory variable is correlated with the error term. Notice that such claims are based on assumptions about the whole population with a true underlying regression model that we do not observe first hand. Consequently, checking the correlation between $y$ and $\hat{u}$ seems pointless in a linear OLS framework. When testing for heteroskedasticity, however, we take into account the second conditional moment: for example, we regress the squared residuals on $X$ or a function of $X$, as is often the case with FGLS estimators. This is different from evaluating the plain correlation. I hope this helps to make matters clearer.
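
A rough sketch of that distinction, on simulated data with hypothetical names: the plain correlation between $x$ and the residuals is zero in both cases, while an auxiliary regression of the squared residuals on $x^2$ picks up the heteroskedastic case. The choice of $x^2$ as the function of $X$ is an assumption made for this example only, and the auxiliary $R^2$ is reported without turning it into a formal test.

```python
import random

def mean(values):
    return sum(values) / len(values)

def r_squared(x, y):
    """R^2 of the simple regression of y on x (squared sample correlation)."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy * sxy / (sxx * syy)

def ols_residuals(x, y):
    """Residuals from the least-squares fit of y = a + b*x."""
    mean_x, mean_y = mean(x), mean(y)
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

random.seed(11)
n = 2000
x = [random.uniform(0.0, 3.0) for _ in range(n)]
x_sq = [xi ** 2 for xi in x]

datasets = {
    "homoskedastic": [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x],
    "heteroskedastic": [1.0 + 2.0 * xi + random.gauss(0.0, xi) for xi in x],
}

plain_r2 = {}  # R^2 of residuals on x (zero by construction)
aux_r2 = {}    # R^2 of squared residuals on x^2 (auxiliary regression)
for label, y in datasets.items():
    u_hat = ols_residuals(x, y)
    plain_r2[label] = r_squared(x, u_hat)
    aux_r2[label] = r_squared(x_sq, [ui ** 2 for ui in u_hat])
    print(f"{label:16s} plain R^2 = {plain_r2[label]:.2e}, auxiliary R^2 = {aux_r2[label]:.4f}")
```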

— Majte

1

Note that we have $\frac{var(\hat{u})}{var(y)}=\frac{SSE}{TSS}=1-R^2$ (at least roughly anyway). This gives $corr(y,\hat{u})=\sqrt{1-R^2}$, which is some further intuition about what you mention in later paragraphs.

— probabilityislogic

2

What I find interesting about this answer is that the correlation is always positive.

— probabilityislogic

You state that $Var(y)$ is a matrix, yet you divide by it.

— mpiktas

@probabilityislogic: Not sure if I can follow your step. It would be then under the squareroot 1+(1/1-R^2), which is (2-R^2)/(1-R^2)? Yet what's true is that it remains positive. The intuition is that if you have a line through a scatterplot, and you regress this line on errors from that line, it should be obvious that as the value y of that line increases the value of the residuals increase as well. This is because the residuals are positively dependent on y by construction.

— Majte

@mpiktas: In this case the matrix becomes a scalar, as we are dealing with $y$ in only one dimension.

— Majte

6

Adam's answer is wrong. Even with a model that fits the data perfectly, you can still get a high correlation between the residuals and the dependent variable. That's the reason no regression book asks you to check this correlation. You can find the answer in Dr. Draper's "Applied Regression Analysis" book.

— Jeff

3

Even if correct, this is more of an assertion than an answer according to CV's standards, @Jeff. Would you mind elaborating / backing up your claim? Even just a page number & edition of Draper & Smith would suffice.

— gung - Reinstate Monica

4

So, the residuals are your unexplained variance, the difference between your model's predictions and the actual outcome you're modeling. In practice, few models produced through linear regression will have all residuals close to zero unless linear regression is being used to analyze a mechanical or fixed process.

Ideally, the residuals from your model should be random, meaning they should not be correlated with either your independent or dependent variables (what you term the criterion variable). In linear regression, your error term is normally distributed, so your residuals should be normally distributed as well. If you have significant outliers, or if your residuals are correlated with either your dependent variable or your independent variables, then you have a problem with your model.

If you have significant outliers and non-normal distribution of your residuals, then the outliers may be skewing your weights (Betas), and I would suggest calculating DFBETAS to check the influence of your observations on your weights. If your residuals are correlated with your dependent variable, then there is a significantly large amount of unexplained variance that you are not accounting for. You may also see this if you're analyzing repeated observations of the same thing, due to autocorrelation. This can be checked for by seeing if your residuals are correlated with your time or index variable. If your residuals are correlated with your independent variables, then your model is heteroskedastic (see: http://en.wikipedia.org/wiki/Heteroscedasticity). You should check (if you haven't already) if your input variables are normally distributed, and if not, then you should consider scaling or transforming your data (the most common kinds are log and square-root) in order to make it more normalized.

For both your residuals and your independent variables, you should make a Q-Q plot, as well as perform a Kolmogorov-Smirnov test (this particular implementation is sometimes referred to as the Lilliefors test), to make sure that your values fit a normal distribution.

Three quick things that may be helpful in dealing with this problem: examine the median of your residuals, which should be as close to zero as possible (the mean will almost always be zero as a result of how the error term is fitted in linear regression); run a Durbin-Watson test for autocorrelation in your residuals (especially, as I mentioned before, if you are looking at multiple observations of the same things); and make a partial residual plot, which will help you look for heteroscedasticity and outliers.
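
For reference, the Durbin-Watson statistic itself is simple to compute. The sketch below applies it to two simulated series that stand in for regression residuals (both hypothetical): independent noise, where the statistic sits near 2, and an AR(1) series with coefficient 0.9, where it falls well below 2.

```python
import random

def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over the sum of squares."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    denominator = sum(r ** 2 for r in residuals)
    return numerator / denominator

random.seed(5)
n = 2000

# Stand-in for residuals with no serial dependence.
independent = [random.gauss(0.0, 1.0) for _ in range(n)]

# Stand-in for positively autocorrelated residuals: AR(1) with coefficient rho.
rho = 0.9
autocorrelated = [random.gauss(0.0, 1.0)]
for _ in range(n - 1):
    autocorrelated.append(rho * autocorrelated[-1] + random.gauss(0.0, 1.0))

dw_independent = durbin_watson(independent)
dw_autocorrelated = durbin_watson(autocorrelated)

print(f"DW, independent series:    {dw_independent:.3f}")
print(f"DW, autocorrelated series: {dw_autocorrelated:.3f}")
```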

— Adam

Thank you very much. Your explanation is very helpful to me.

— Jfly

1

+1 Nice, comprehensive answer. I'm going to nitpick on 2 points. "If your residuals are correlated with your independent variables, then your model is heteroskedastic"--I would say that if the variance of your residuals depends on the level of an independent variable, then you have heteroscedasticity. Also, I have heard the Kolmogorov-Smirnov/Lilliefors tests described as "notoriously unreliable," and in practice I have certainly found this to be true. Better to make a subjective determination based on a Q-Q plot or a simple histogram.

— rolando2

4

The claim that "the residuals from your model... should not be correlated with... your... dependent variable" is not generally true, as explained in other answers on this thread. Would you mind correcting this post?

— gung - Reinstate Monica

1

(-1) I think this post is not relevant enough to the question asked. It is good as general advice, but perhaps a case of the "right answer to the wrong question".

— probabilityislogic