A quadratic term and an interaction term are each significant on their own, but neither is when both are included


15

As part of an assignment, I had to fit a model with two predictors. I then had to plot the model's residuals against one of the included predictors and modify the model based on that plot. The plot showed a curved trend, so I included a quadratic term for that predictor. The new model showed the quadratic term to be significant. So far so good.

However, the data suggest that an interaction makes sense too. Adding an interaction term to the original model also "fixes" the curved trend, and it too is highly significant when added to the model (without the quadratic term). The problem is that when both the quadratic term and the interaction term are added to the model, one of them is not significant.

Which term (quadratic or interaction) should I include in the model, and why?

Answers:


21

Summary

When the predictors are correlated, a quadratic term and an interaction term carry similar information. This can cause either the quadratic model or the interaction model to be significant; but when both terms are included, because they are so similar, neither may be significant. Standard diagnostics for multicollinearity, such as VIFs, may fail to detect any of this. Even a diagnostic plot designed specifically to detect the use of a quadratic model in place of an interaction may fail to determine which model is best.


Analysis

The crux of this analysis, and its main strength, is that it characterizes the situation described in the question. With such a characterization in hand, it is easy to simulate data that behave accordingly.

Consider two predictors X1 and X2 (which we will automatically standardize so that each has unit variance in the dataset), and suppose the random response Y is determined by these predictors and their interaction, plus independent random error:

Y = β1X1 + β2X2 + β1,2X1X2 + ε

In many cases the predictors are correlated. The dataset might look like this:

[Figure: scatterplot matrix of X1, X2, and Y]

These sample data were generated with β1 = β2 = 1 and β1,2 = 0.1; the correlation between X1 and X2 is 0.85.

This does not necessarily mean we are thinking of X1 and X2 as realizations of random variables: it covers the situation where X1 and X2 are settings in a designed experiment, but for some reason those settings are not orthogonal.
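A data-generating process with these properties is easy to simulate. The original example was evidently done in R; the sketch below uses Python/NumPy instead, and the seed, noise scale, and shared-component construction of the correlation are illustrative assumptions, not taken from the original code:

```python
import numpy as np

rng = np.random.default_rng(17)
n = 150

# Correlated predictors via a shared component (target correlation ~0.85),
# then standardized to zero mean and unit variance as in the text.
x0 = rng.normal(size=n)
x1 = x0 + 0.4 * rng.normal(size=n)
x2 = x0 + 0.4 * rng.normal(size=n)
x1 = (x1 - x1.mean()) / x1.std()
x2 = (x2 - x2.mean()) / x2.std()

# Response with beta1 = beta2 = 1 and a small interaction beta12 = 0.1.
beta12 = 0.1
y = x1 + x2 + beta12 * x1 * x2 + rng.normal(scale=0.25, size=n)

print(round(float(np.corrcoef(x1, x2)[0, 1]), 2))  # near 0.85
```

The shared component `x0` is just one convenient way to induce the correlation; any mechanism giving corr(X1, X2) ≈ 0.85 reproduces the phenomenon.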

Regardless of how the correlation arises, a good way to describe it is in terms of how much the predictors differ from their average, X0 = (X1 + X2)/2. These differences will be fairly small (in the sense that their variances are less than 1); the greater the correlation between X1 and X2, the smaller these differences will be. Writing, then, X1 = X0 + δ1 and X2 = X0 + δ2, we can re-express (say) X2 in terms of X1 as X2 = X1 + (δ2 − δ1). Plugging that into the interaction term only, the model is

Y = β1X1 + β2X2 + β1,2X1(X1 + [δ2 − δ1]) + ε = (β1 + β1,2[δ2 − δ1])X1 + β2X2 + β1,2X1² + ε

Provided the values of β1,2[δ2 − δ1] vary only a little compared to β1, we can gather this variation in with the truly random terms, writing

Y = β1X1 + β2X2 + β1,2X1² + (ε + β1,2[δ2 − δ1]X1)

Thus, if we regress Y against X1, X2, and X1², we will be making an error: the variation in the residuals will depend on X1 (that is, it will be heteroscedastic). This can be seen with a simple variance calculation:

var(ε + β1,2[δ2 − δ1]X1) = var(ε) + [β1,2² var(δ2 − δ1)] X1².

However, if the typical variation in ε substantially exceeds the typical variation in β1,2[δ2 − δ1]X1, that heteroscedasticity will be so low as to be undetectable (and should yield a fine model). (As shown below, one way to look for this violation of regression assumptions is to plot the absolute value of the residuals against the absolute value of X1, remembering first to standardize X1 if necessary.) This is the characterization we were seeking.
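The variance formula above can be checked with a quick Monte Carlo. The numeric values chosen for the standard deviations here are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

beta12 = 0.1
sigma_eps = 0.25   # sd of the error term epsilon (assumed)
sigma_d = 0.3      # sd of delta2 - delta1 (assumed)
x1 = 2.0           # evaluate the conditional variance at a fixed X1

# Empirical variance of eps + beta12*(delta2 - delta1)*X1 at fixed X1.
draws = (rng.normal(scale=sigma_eps, size=200_000)
         + beta12 * rng.normal(scale=sigma_d, size=200_000) * x1)

empirical = float(draws.var())
theoretical = sigma_eps**2 + beta12**2 * sigma_d**2 * x1**2
print(round(empirical, 4), round(theoretical, 4))
```

Note how small the second term is relative to var(ε): this is exactly the "undetectable heteroscedasticity" regime described above.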

Remembering that X1 and X2 were assumed to be standardized to unit variance, this implies the variance of δ2 − δ1 will be relatively small. To reproduce the observed behavior, then, it should suffice to pick a small absolute value for β1,2, but make it large enough (or use a large enough dataset) so that it will be significant.

In short, when the predictors are correlated and the interaction is small but not too small, a quadratic term (in either predictor alone) and an interaction term will be individually significant but confounded with each other. Statistical methods alone are unlikely to help us decide which is better to use.


Example

Let's check this out with the sample data by fitting several models. Recall that β1,2 was set to 0.1 when simulating these data. Although that is small (the quadratic behavior is not even visible in the previous scatterplots), with 150 data points we have a chance of detecting it.

First, the quadratic model:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.03363    0.03046   1.104  0.27130    
x1           0.92188    0.04081  22.592  < 2e-16 ***
x2           1.05208    0.04085  25.756  < 2e-16 ***
I(x1^2)      0.06776    0.02157   3.141  0.00204 ** 

Residual standard error: 0.2651 on 146 degrees of freedom
Multiple R-squared: 0.9812, Adjusted R-squared: 0.9808 

The quadratic term is significant. Its coefficient, 0.068, underestimates β1,2=0.1, but it's of the right size and right sign. As a check for multicollinearity (correlation among the predictors) we compute the variance inflation factors (VIF):

      x1       x2  I(x1^2) 
3.531167 3.538512 1.009199 

Any value less than 5 is usually considered just fine. These are not alarming.

Next, the model with an interaction but no quadratic term:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02887    0.02975    0.97 0.333420    
x1           0.93157    0.04036   23.08  < 2e-16 ***
x2           1.04580    0.04039   25.89  < 2e-16 ***
x1:x2        0.08581    0.02451    3.50 0.000617 ***

Residual standard error: 0.2631 on 146 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9811

      x1       x2    x1:x2 
3.506569 3.512599 1.004566 

All the results are similar to the previous ones. Both are about equally good (with a very tiny advantage to the interaction model).

Finally, let's include both the interaction and quadratic terms:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02572    0.03074   0.837    0.404    
x1           0.92911    0.04088  22.729   <2e-16 ***
x2           1.04771    0.04075  25.710   <2e-16 ***
I(x1^2)      0.01677    0.03926   0.427    0.670    
x1:x2        0.06973    0.04495   1.551    0.123    

Residual standard error: 0.2638 on 145 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.981 

      x1       x2  I(x1^2)    x1:x2 
3.577700 3.555465 3.374533 3.359040

Now, neither the quadratic term nor the interaction term is significant, because each is trying to estimate a part of the interaction in the model. Another way to see this is that nothing was gained (in terms of reducing the residual standard error) when adding the quadratic term to the interaction model or when adding the interaction term to the quadratic model. It is noteworthy that the VIFs do not detect this situation: although the fundamental explanation for what we have seen is the slight collinearity between X1 and X2, which induces a collinearity between X1² and X1·X2, neither is large enough to raise flags.
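The VIF computation itself is simple enough to reproduce from its definition, VIFj = 1/(1 − Rj²), where Rj² comes from regressing predictor j on the others. Here is a minimal NumPy implementation; the simulated data merely stand in for the original, so the printed values will differ from those shown above:

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X, computed as
    1 / (1 - R^2) from regressing that column (with intercept) on the rest."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Toy data mimicking the setup: correlated x1, x2 plus x1^2 and x1*x2 columns.
rng = np.random.default_rng(1)
x0 = rng.normal(size=150)
x1 = x0 + 0.4 * rng.normal(size=150)
x2 = x0 + 0.4 * rng.normal(size=150)
X = np.column_stack([x1, x2, x1**2, x1 * x2])
print(vif(X))  # order: x1, x2, x1^2, x1*x2
```

As in the fitted models above, the linear terms show moderate VIFs from their mutual correlation, while the VIFs for x1² and x1·x2 stay unremarkable even though those two columns are confounded with each other.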

If we had tried to detect the heteroscedasticity in the quadratic model (the first one), we would be disappointed:

[Figure: diagnostic plot of |residuals| against |x1| for the quadratic model, with loess smooth]

In the loess smooth of this scatterplot there is ever so faint a hint that the sizes of the residuals increase with |X1|, but nobody would take this hint seriously.


9

What makes the most sense based on the source of the data?

We cannot answer this question for you, and neither can the computer. The reason we still need statisticians, rather than just statistical programs, is questions like this. Statistics is about more than crunching numbers; it is about understanding the question and the source of the data, and being able to make decisions based on the science, the background, and other information outside the data that the computer looks at. Your teacher is probably hoping that you will contemplate this as part of the assignment. If I had assigned a problem like this (and I have before), I would be more interested in the justification of your answer than in which term you actually chose.

It is probably beyond your current class, but if there is no clear scientific reason for preferring one model over the other, one approach is model averaging: you fit both models (and perhaps several other models as well), then average their predictions together (often weighted by the goodness of fit of the different models).
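That model-averaging idea can be sketched using Akaike weights, one common choice of goodness-of-fit weighting. Everything below (the simulated data, the AIC formula up to an additive constant, and the weighting scheme) is an illustrative assumption, not the answerer's specific recipe:

```python
import numpy as np

def fit_ols(X, y):
    """OLS fit; returns coefficients and a Gaussian AIC (up to a constant)."""
    n = len(y)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / n
    aic = n * np.log(sigma2) + 2 * (X.shape[1] + 1)
    return coef, aic

rng = np.random.default_rng(2)
x0 = rng.normal(size=150)
x1 = x0 + 0.4 * rng.normal(size=150)
x2 = x0 + 0.4 * rng.normal(size=150)
y = x1 + x2 + 0.1 * x1 * x2 + rng.normal(scale=0.25, size=150)

ones = np.ones_like(x1)
designs = {
    "quadratic":   np.column_stack([ones, x1, x2, x1**2]),
    "interaction": np.column_stack([ones, x1, x2, x1 * x2]),
}
fits = {name: fit_ols(X, y) for name, X in designs.items()}
aics = np.array([aic for _, aic in fits.values()])

# Akaike weights: relative support for each model.
w = np.exp(-0.5 * (aics - aics.min()))
w /= w.sum()

# Model-averaged prediction = weighted average of each model's predictions.
preds = np.column_stack([X @ fits[name][0] for name, X in designs.items()])
y_avg = preds @ w
print(dict(zip(designs, np.round(w, 3))))
```

Since the two models fit nearly equally well here, the weights come out close to 50/50, which is precisely why averaging is attractive when the data cannot distinguish them.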

Another option, when possible, is to collect more data, choosing the x values (if you can) so that it becomes clearer what the nonlinear versus interaction effects are.

There are some tools for comparing the fit of non-nested models (AIC, BIC, etc.), but in this case they probably will not show enough of a difference to overrule an understanding of where the data come from and what makes the most sense.


1

In addition to @Greg's answer, another possibility is to include both terms, even if one of them is not significant. Including only statistically significant terms is not a law of the universe.


Thanks Peter and @Greg. I think at this stage of my studies I was looking for an absolute answer to a question that requires at least some qualitative reasoning. Since adding either a quadratic term or an interaction term "fixed" the plot of residuals versus predictor, I wasn't sure which one to include. What surprised me was that including the quadratic term made the interaction term insignificant. I would have thought that if an interaction exists, it would be significant whether or not the quadratic term is included.
Tal Bashan

1
Hi @TalBashan. The famous statistician Donald Cox once said: "There are no routine statistical questions, only questionable statistical routines."
Peter Flom - Reinstate Monica

@PeterFlom Perhaps you mean Sir David Cox?
Michael R. Chernick

Oops. Yes, David, not Donald. Sorry.
Peter Flom
Licensed under cc by-sa 3.0 with attribution required.