How does the distribution of the error term affect the distribution of the response?


Answers:


7

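Maybe I'm off, but I think we ought to be wondering about $f(y \mid \beta, X)$, which is how I read the OP. In the very simplest case of linear regression, if your model is $y = X\beta + \epsilon$, then the only stochastic component in the model is the error term. As such, it determines the sampling distribution of $y$. If $\epsilon \sim N(0, \sigma^2 I)$, then $y \mid X, \beta \sim N(X\beta, \sigma^2 I)$. What @Aniko says is certainly true of $f(y)$ (marginally over $X, \beta$), however. So as it stands, the question is slightly vague.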
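The conditional statement can be checked with a small simulation. This is only a sketch under assumed values: a single predictor, and hypothetical parameters `beta0`, `beta1` and `sigma` that do not come from the thread. Holding `x` fixed and redrawing only the error, the response should centre on the linear predictor with spread `sigma`.

```python
import random
import statistics

def simulate_response(x, beta0, beta1, sigma, n_draws, seed=0):
    """Draw y = beta0 + beta1*x + eps repeatedly at one fixed x, with eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    mean = beta0 + beta1 * x          # the fixed part, X*beta
    return [mean + rng.gauss(0.0, sigma) for _ in range(n_draws)]

# Hypothetical values chosen purely for illustration.
draws = simulate_response(x=2.0, beta0=1.0, beta1=3.0, sigma=0.5, n_draws=20000)

print(statistics.fmean(draws))   # should be close to 1 + 3*2 = 7
print(statistics.stdev(draws))   # should be close to sigma = 0.5
```

Because the only randomness is the error draw, the conditional distribution of the response is just the error distribution shifted by the linear predictor.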


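I like all the comments! They all seem to be right. But I was just looking for the simplest answer :) What happens when you assume the error term is normally distributed. In fact, the other answers make it quite clear what happens in that case! Thanks a lot!
MarkDollar 2011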

17

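The short answer is that you cannot conclude anything about the distribution of $y$, because it depends on the distribution of the $x$'s and on the strength and shape of the relationship. More formally, $y$ will have a "mixture of normals" distribution, which in practice can be pretty much anything.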

Here are two extreme examples to illustrate this:

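  1. Suppose there are only two possible values of $x$, 0 and 1, and $y = 10x + N(0, 1)$. Then $y$ will have a strongly bimodal distribution with bumps at 0 and 10.
  2. Now assume the same relationship, but let $x$ be uniformly distributed on the 0-1 interval with lots of values. Then $y$ will be almost uniformly distributed over the 0-10 interval (with some half-normal tails at the edges).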

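In fact, since every distribution can be approximated arbitrarily well by a mixture of normals, you can really get any distribution for $y$.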
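The two examples can be simulated in a few lines. This is a rough sketch, assuming the two values of $x$ in the first example are equally likely (the answer does not say) and using an arbitrary sample size and seed. The error term is $N(0, 1)$ in both cases, yet the middle of the range of $y$ is nearly empty in one case and well populated in the other.

```python
import random

rng = random.Random(42)
n = 20000

# Example 1: x is 0 or 1 (assumed equally likely), y = 10x + N(0, 1).
y_binary = [10 * rng.choice([0, 1]) + rng.gauss(0, 1) for _ in range(n)]

# Example 2: x uniform on [0, 1], same relationship and same error term.
y_uniform = [10 * rng.random() + rng.gauss(0, 1) for _ in range(n)]

def share_in(values, lo, hi):
    """Fraction of values falling in the interval [lo, hi)."""
    return sum(lo <= v < hi for v in values) / len(values)

# Share of y between 4 and 6: tiny for the bimodal case,
# roughly a fifth for the near-uniform case.
print(share_in(y_binary, 4, 6))
print(share_in(y_uniform, 4, 6))
```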


8
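+1 Re the last statement: I used to make this mistake, too. Mathematically you are right, but in practice it is almost impossible to approximate a non-differentiable spike (as in J-shaped or U-shaped distributions) with normals: normals are too flat at their peaks to capture the density in the spike. You need too many components. Normals are good for approximating distributions whose pdfs are very smooth.
whuber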

1
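@whuber Agreed. In fact, I would not suggest using a normal mixture approximation for any distribution; I just meant to give an extreme counter-example.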
Aniko

5

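We invent the error term by imposing a fictitious model on real data. The distribution of the error term does not affect the distribution of the response.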

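We often assume that the errors are normally distributed and thus try to construct the model such that our estimated residuals are normally distributed. This can be difficult for some distributions of $y$. In those cases, I suppose you could say that the distribution of the response affects the error term.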
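The difference between the error term and the estimated residuals can be made concrete with a simulation. This sketch assumes a one-predictor model with hypothetical true values `beta0`, `beta1` and `sigma`, and fits ordinary least squares by hand; in real data the `errors` list would not be observable.

```python
import random
import statistics

rng = random.Random(1)
n = 500
beta0, beta1, sigma = 2.0, 1.5, 1.0   # hypothetical true values

x = [rng.uniform(0, 10) for _ in range(n)]
errors = [rng.gauss(0, sigma) for _ in range(n)]     # unobserved in practice
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, errors)]

# Ordinary least squares for a single predictor with an intercept.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residuals are computed from the fitted line, not from the true line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Residuals track the true errors closely but are not identical to them.
max_gap = max(abs(r - e) for r, e in zip(residuals, errors))
print(b0, b1, max_gap)
```

The residuals are what normality diagnostics actually examine; the errors themselves belong to the specified model.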


2
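"We often try to construct the model such that our error term is normally distributed" - To be precise, I think you are referring to the residuals $y - X\hat{\beta}$. These are estimates of the error terms in the same way that $X\hat{\beta}$ is an estimate of $E(y) = X\beta$. We want the residuals to look normal because that is what we assumed about the error term to begin with. We "invent" the error term by specifying a model, not by fitting it.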
JMS

I agree with your precision, JMS. +1, and I will adjust the answer.
Thomas Levine

2

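Write the response as $y = m + e$, where $m$ is the model, $y$ is the response and $e$ is the error, so that $y - m = e$. Assigning $e \sim N(0, \sigma^2)$ says that the errors are small on the scale of $\sigma$. An alternative assignment is $\mathrm{Cauchy}(0, \gamma)$, which says that most of the errors are small, but some errors are quite large - the model has the occasional "blunder" or "shocker" in terms of predicting the response.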
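The contrast between normal and Cauchy errors shows up clearly in how often large errors occur. The sketch below assumes unit scale parameters for both distributions (hypothetical choices) and draws the Cauchy values through the inverse CDF, since the standard library has no Cauchy sampler.

```python
import math
import random

rng = random.Random(7)
n = 100000
sigma, gamma = 1.0, 1.0   # hypothetical scale parameters

normal_errors = [rng.gauss(0, sigma) for _ in range(n)]
# Inverse-CDF draw from Cauchy(0, gamma): gamma * tan(pi * (u - 1/2)).
cauchy_errors = [gamma * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def share_beyond(values, cutoff):
    """Fraction of values whose absolute size exceeds the cutoff."""
    return sum(abs(v) > cutoff for v in values) / len(values)

print(share_beyond(normal_errors, 5))   # essentially zero
print(share_beyond(cauchy_errors, 5))   # a noticeable fraction: the "blunders"
```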

In a sense the error distribution is more closely linked to the model than to the response. This can be seen from the non-identifiability of the above equation, for if both $m$ and $e$ are unknown then adding an arbitrary vector $b$ to $m$ and subtracting it from $e$ leads to the same value of $y$: $y = m + e = (m + b) + (e - b) = m' + e'$. The assignment of an error distribution and a model equation basically says which arbitrary vectors are more plausible than others.
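The non-identifiability is easy to see numerically. In this toy sketch all three vectors are made-up values, chosen so that the floating-point arithmetic is exact; shifting the model by `b` and the errors by `-b` reproduces the same response.

```python
m = [1.0, 2.0, 3.0]        # hypothetical model values
e = [0.5, -0.25, 0.125]    # hypothetical errors
b = [10.0, -4.0, 0.5]      # an arbitrary shift vector

y = [mi + ei for mi, ei in zip(m, e)]

# Move the shift from the error into the model.
m_shifted = [mi + bi for mi, bi in zip(m, b)]
e_shifted = [ei - bi for ei, bi in zip(e, b)]
y_again = [mi + ei for mi, ei in zip(m_shifted, e_shifted)]

print(y == y_again)   # True: the data alone cannot tell the two splits apart
```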


"This seems strange because you will only observe y once and only once (y is the complete vector/matrix/etc. of responses). How can this be "distributed"? In my view it can only be distributed in some imaginary ensemble, nothing to do with your actual observed response. At the very least, any such presumption of the response "being distributed" is untestable" I'm confused; are you saying we can't test $H_0: y \sim f_0$ vs $H_1: y \sim f_1$?
JMS

no, sorry, that can't be what you're saying. I'm still confused though. Maybe it's slightly imprecise, but the way I read it he's got $n$ samples of $y_i$ from $Y$ with fixed $x_i$, his model is $Y = X\beta + \epsilon$, and he's wondering what the assumed distribution of $\epsilon$ implies about the distribution of $Y \mid \beta, X$ under his model. Here it would imply that it's normal; we can test that with our sample.
JMS

@JMS - I think I might delete that first paragraph. I don't think it adds anything to my answer (besides confusion).
probabilityislogic

one of my favorite things to add to my answers :)
JMS
Licensed under cc by-sa 3.0 with attribution required.