Interpreting simple linear regression output


20

I ran a simple linear regression of the natural logs of 2 variables to determine whether they are correlated. My output is this:

R^2 = 0.0893

slope = 0.851

p < 0.001

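I am confused. Looking at the R^2 value, I would say that the two variables are not correlated, since it is so close to 0. However, the slope of the regression line is almost 1 (despite looking almost horizontal in the plot), and the p-value indicates that the regression is highly significant.

Does this mean that the two variables are highly correlated? If so, what does the R^2 value indicate?

I should add that the Durbin-Watson statistic was tested in my software and did not reject the null hypothesis (it equalled 1.357). I thought that this tested for independence between the 2 variables. In this case, I would expect the variables to be dependent, since they are 2 measurements of an individual bird. I'm doing this regression as part of a published method to determine the body condition of an individual, so I assumed that using a regression in this way made sense. However, given these outputs, I'm thinking that maybe this method isn't suitable for these birds. Does this seem a reasonable conclusion?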


1
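The DW statistic is a test for serial correlation: that is, it checks whether adjacent error terms are correlated with one another. It says nothing about the correlation between your X and Y! Failing the test indicates that the slope and p-value should be interpreted with caution.
whuber

Ah, ok. That makes more sense than whether the two variables themselves are correlated... after all, I thought that was what I was trying to find using the regression. And the fact that failing the test means I should be cautious about interpreting the slope and p-value makes even more sense in this case! Thanks @whuber!
Mog

1
I'd just like to add that a slope can be highly significant (p-value < .001) even though the relationship is weak, especially with a large sample size. Most of the answers hint at this, in that the slope (even if significant) says nothing about the strength of the relationship.
Glen

You need n to determine the strength of the relationship. Also see stats.stackexchange.com/a/265924/99274
Carl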

Answers:


22

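The estimated value of the slope does not, by itself, tell you the strength of the relationship. The strength of the relationship depends on the size of the error variance and the range of the predictor. Also, a significant p-value doesn't necessarily tell you that there is a strong relationship; the p-value is simply testing whether the slope is exactly 0. For a sufficiently large sample size, even small departures from that hypothesis (e.g. ones of no practical importance) will yield a significant p-value.

Of the three quantities you presented, R^2, the coefficient of determination, gives the greatest indication of the strength of the relationship. In your case, R^2 = .089 means that 8.9% of the variation in your response variable can be explained by a linear relationship with the predictor. What constitutes a "large" R^2 is discipline dependent. For example, in the social sciences R^2 = .2 might be "large", but in controlled environments like a factory setting, R^2 > .9 may be required to say there is a "strong" relationship. In most situations .089 is a very small R^2, so your conclusion that there is a weak linear relationship is probably reasonable.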
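To make the difference between significance and strength concrete, here is a minimal simulation sketch in R. The true slope of 0.85, the noise standard deviation of 2.7 and the two sample sizes are assumptions picked only so that R^2 comes out near the 0.09 in the question; none of this is the OP's data, and `simulate_fit` is a hypothetical helper name.

```r
# Sketch: the same weak relationship fitted at two sample sizes.
# All numbers here are illustrative assumptions, not the OP's data.
set.seed(1)

simulate_fit <- function(n, slope = 0.85, noise_sd = 2.7) {
  x <- rnorm(n)                              # predictor with unit variance
  y <- slope * x + rnorm(n, sd = noise_sd)   # weak signal buried in noise
  s <- summary(lm(y ~ x))
  c(n = n,
    slope = s$coefficients["x", "Estimate"],
    r_squared = s$r.squared,
    p_value = s$coefficients["x", "Pr(>|t|)"])
}

small <- simulate_fit(30)
large <- simulate_fit(10000)
print(rbind(small, large))
```

With the larger sample the p-value should shrink dramatically while R^2 stays low, because the p-value reflects how precisely the slope is estimated, not how much of the variation it explains.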


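Thanks Macro. Very helpful answer. I'm glad you included the part about what, exactly, the p-value is testing. It makes a great deal of sense that the p-value would be so low, considering how close to 1 the slope is. It seems to me, in light of your answer and @jedfrancis', that the r^2 value describes the "cloud" of data points around the regression line. Excellent! That's much clearer now!
Mog

@Macro (+1), fine answer. But how does the "strength of the relationship" depend on "the size of the intercept"? AFAIK the intercept says nothing at all about the correlation or "strength" of a linear relationship.

@whuber, you're right - the intercept is irrelevant and definitely doesn't change the correlation - I was thinking about the regression function y = 10000 + x vs. y = x and somehow thinking of the second one as a stronger relationship (all else held equal), since a larger share of the magnitude of y is due to x in the latter case. Doesn't make much sense now that I think about it. I've edited the post.

4
@macro Fine answer, but (for those new to this topic) I would emphasize that R^2 can be very low even for a strong relationship if that relationship is nonlinear, and especially if it is non-monotonic. My favorite example is the relationship between stress and exam score: very low stress and very high stress both tend to be worse than moderate stress.
Peter Flom

1
@macro Yes, your answer was good, but I've worked with people who don't know a lot of statistics, and I've seen what happens... sometimes what we say is not what they hear!
Peter Flom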

14

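R^2 tells you how much of the variation in the dependent variable is explained by the model. However, one can also interpret R^2 as the squared correlation between the original values of the dependent variable and the fitted values. The exact interpretation and derivation of the coefficient of determination R^2 can be found here.

The proof that the coefficient of determination is equivalent to the squared Pearson correlation coefficient between the observed values y_i and the fitted values ŷ_i can be found here.

The R^2, or coefficient of determination, indicates the strength of your model in explaining the dependent variable. In your case, R^2 = 0.089. That is, your model is able to explain 8.9% of the variation in your dependent variable. Equivalently, the squared correlation coefficient between your y_i and your fitted values ŷ_i is 0.089, which corresponds to a correlation of about 0.30. What constitutes a good R^2 is discipline dependent.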
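The equivalence between R^2 and the squared correlation is easy to check numerically. The following is a small sketch on simulated data (the slope and noise level are assumptions, not the OP's values); it relies on the model being a simple linear regression with an intercept.

```r
# Sketch: for simple linear regression with an intercept, R^2 equals the
# squared Pearson correlation between observed and fitted values.
# The simulated data are an illustrative assumption.
set.seed(2)
x <- rnorm(200)
y <- 0.85 * x + rnorm(200, sd = 2.7)

fit <- lm(y ~ x)
r2_model  <- summary(fit)$r.squared
r2_fitted <- cor(y, fitted(fit))^2   # squared correlation with fitted values
r2_xy     <- cor(x, y)^2             # identical, since fitted values are linear in x

print(c(r2_model = r2_model, r2_fitted = r2_fitted, r2_xy = r2_xy))

# Correlation implied by the R^2 reported in the question
print(sqrt(0.089))
```

All three quantities should agree up to floating-point error, and the last line gives the correlation (roughly 0.30) that corresponds to R^2 = 0.089.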

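Finally, to the last part of your question. You cannot get the Durbin-Watson test to say anything about the correlation between your dependent and independent variables. The Durbin-Watson test tests for serial correlation. It is conducted to examine whether your error terms are correlated with one another.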
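To show what the Durbin-Watson statistic actually measures, here is a hedged sketch that computes it by hand as the sum of squared successive residual differences divided by the sum of squared residuals. The helper name `durbin_watson` and the simulated data (including the AR(1) coefficient of 0.7) are assumptions for illustration; for a formal test with a p-value, a package function such as `dwtest` from lmtest would be the usual route.

```r
# Sketch: the Durbin-Watson statistic computed by hand from residuals.
# It depends on the ordering of the residuals, not on how strongly
# x and y are related. Data are simulated for illustration.
durbin_watson <- function(fit) {
  e <- residuals(fit)
  sum(diff(e)^2) / sum(e^2)
}

set.seed(3)
n <- 200
x <- rnorm(n)

# Independent errors: the statistic should land near 2.
y_indep <- 0.85 * x + rnorm(n, sd = 2.7)

# AR(1) errors with positive autocorrelation: the statistic falls below 2.
e_ar <- as.numeric(arima.sim(model = list(ar = 0.7), n = n, sd = 2.7))
y_ar <- 0.85 * x + e_ar

dw_indep <- durbin_watson(lm(y_indep ~ x))
dw_ar    <- durbin_watson(lm(y_ar ~ x))
print(c(independent = dw_indep, autocorrelated = dw_ar))
```

Both regressions use the same weak x-y relationship; only the dependence between neighbouring error terms differs, and that is the only thing the statistic responds to.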


9

The R^2 value tells you how much of the variation in the data is explained by the fitted model.

The low R^2 value in your study suggests that your data are probably spread widely around the regression line, meaning that the regression model can explain only 8.9% (very little) of the variation in the data.

Have you checked to see whether a linear model is appropriate? Have a look at the distribution of your residuals, as you can use this to assess the fit of the model to your data. Ideally, your residuals should not show a relation with your x values, and if they do, you may want to think about rescaling your variables in a suitable way, or fitting a more appropriate model.
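As a rough illustration of the residual check, the sketch below fits a straight line to simulated data whose true relationship is quadratic (an assumption chosen to make the pattern obvious, not a claim about the OP's birds). The linear fit has an R^2 near zero even though the relationship is strong, and the residual plot shows the curvature.

```r
# Sketch: a strong but non-monotonic relationship that a straight line misses.
# Simulated data; the quadratic form and noise level are assumptions.
set.seed(4)
x <- seq(-3, 3, length.out = 201)
y <- x^2 + rnorm(length(x), sd = 0.5)

fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))

r2_lin  <- summary(fit_lin)$r.squared
r2_quad <- summary(fit_quad)$r.squared
print(c(linear = r2_lin, quadratic = r2_quad))

# Residuals against x: a U-shaped pattern signals a misspecified model.
plot(x, residuals(fit_lin), xlab = "x", ylab = "Residuals of linear fit")
abline(h = 0, lty = 2)
```

If the residuals of the OP's fit look like a structureless cloud instead, a low R^2 more plausibly reflects a genuinely weak linear relationship than a wrong functional form.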


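Thanks @jed. Yes, I had checked the normality of the residuals, and all was well. Your suggestion that the data are spread widely around that regression line is exactly right - the data points look like a cloud around the regression line plotted by the software.
Mog

1
Welcome to our site, @jed, and thank you for your reply! Please note that the slope by itself says almost nothing about the correlation, apart from its sign, because the correlation does not depend on the units in which X and Y are measured but the slope does.
whuber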

1
@whuber is saying that the value of the slope does not tell you anything about the strength of the association unless the variables are standardized. See shabbychef's answer.
wolf.rauch

@wolf.rauch gotcha
jedfrancis

@jed It would be good if you were to correct your reply.
whuber

7

For a linear regression, the fitted slope is going to be the correlation (which, when squared, gives the coefficient of determination, the R^2) times the empirical standard deviation of the regressand (the y) divided by the empirical standard deviation of the regressor (the x). Depending on the scaling of the x and y, you can have a fitted slope equal to one but an arbitrarily small R^2 value.

In short, the slope is not a good indicator of model 'fit' unless you are certain that the scales of the dependent and independent variables must be equal to each other.
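The identity slope = r * sd(y) / sd(x), and the fact that rescaling changes the slope but not R^2, can be checked with a short sketch. The data are simulated and the factor of 1000 is an arbitrary assumption standing in for a change of units.

```r
# Sketch: rescaling x changes the slope but leaves R^2 untouched.
# Simulated data; the factor of 1000 mimics a unit change (e.g. kg to g).
set.seed(5)
x <- rnorm(200)
y <- 0.85 * x + rnorm(200, sd = 2.7)
x_rescaled <- x * 1000

fit_a <- lm(y ~ x)
fit_b <- lm(y ~ x_rescaled)

slope_a <- unname(coef(fit_a)["x"])
slope_b <- unname(coef(fit_b)["x_rescaled"])

# Slope identity from the answer: slope = r * sd(y) / sd(x)
slope_from_r <- cor(x, y) * sd(y) / sd(x)

r2_a <- summary(fit_a)$r.squared
r2_b <- summary(fit_b)$r.squared

print(c(slope_a = slope_a, slope_from_r = slope_from_r, slope_b = slope_b))
print(c(r2_a = r2_a, r2_b = r2_b))
```

The slope shrinks by the same factor that x grew by, while R^2 is identical in both fits, which is why the slope alone cannot measure strength of association.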


1

I like the answers already given, but let me complement them with a different (and more tongue-in-cheek) approach.

Suppose we collect a bunch of observations from 1000 random people, trying to find out if punches in the face are associated with headaches:

Headaches = β0 + β1 · Punch_in_the_face + ε

ε contains all the omitted variables that produce headaches in the general population: stress, how contaminated your city is, lack of sleep, coffee consumption, etc.

For this regression, the β1 might be very significant and very big, but the R^2 will be low. Why? For the vast majority of the population, headaches won't be explained much by punches in the face. In other words, most of the variation in the data (i.e. whether people have few or a lot of headaches) will be left unexplained if you only include punches in the face, but punches in the face are VERY important for headaches.

Graphically, this probably looks like a steep slope but with a very big variation around this slope.
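A quick simulation sketch of this story, with entirely made-up numbers: 20 of 1000 people get punched, a punch adds 5 headaches on average, and everything else that causes headaches is lumped into noise with standard deviation 2.

```r
# Sketch of the "punch in the face" idea: a large effect that is rare in the
# sample explains little of the overall variation. All numbers are made up.
set.seed(7)
n <- 1000
punch <- rep(c(1, 0), times = c(20, n - 20))   # only 2% of people get punched
headaches <- 2 + 5 * punch + rnorm(n, sd = 2)  # other causes live in the noise

s <- summary(lm(headaches ~ punch))
beta1   <- s$coefficients["punch", "Estimate"]
p_value <- s$coefficients["punch", "Pr(>|t|)"]
r2      <- s$r.squared
print(c(beta1 = beta1, p_value = p_value, r_squared = r2))
```

Under these assumptions the estimated effect should be large and clearly significant while R^2 stays small, since most of the variation in headaches comes from people who were never punched.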


0

@Macro had a great answer.

The estimated value of the slope does not, by itself, tell you the strength of the relationship. The strength of the relationship depends on the size of the error variance, and the range of the predictor. Also, a significant p-value doesn't tell you necessarily that there is a strong relationship; the p-value is simply testing whether the slope is exactly 0.

I just want to add a numerical example to show what the case the OP described looks like.

  • Low R^2
  • Significant p-value
  • Slope close to 1.0

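    set.seed(6)
    # Two groups of 100 points: uniform noise on [0, 50], second group shifted up by 10
    y <- c(runif(100) * 50, runif(100) * 50 + 10)
    x <- c(rep(1, 100), rep(10, 100))
    plot(x, y)
    
    # Fit the simple linear regression and overlay the fitted line
    fit <- lm(y ~ x)
    summary(fit)
    abline(fit)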
    
    
    > summary(lm(y~x))
    
    Call:
    lm(formula = y ~ x)
    
    Residuals:
       Min     1Q Median     3Q    Max 
    -24.68 -13.46  -0.87  14.21  25.14 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  25.6575     1.7107  14.998  < 2e-16 ***
    x             0.9164     0.2407   3.807 0.000188 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 15.32 on 198 degrees of freedom
    Multiple R-squared:  0.0682,    Adjusted R-squared:  0.06349 
    F-statistic: 14.49 on 1 and 198 DF,  p-value: 0.0001877
    

[Figure: scatter plot of y against x with the fitted regression line]

Licensed under cc by-sa 3.0 with attribution required.