Answers:
When the predictors are correlated, a quadratic term and an interaction term will carry similar information. This can cause either the quadratic model or the interaction model to be significant; but when both terms are included, because they are so similar, neither may be significant. Standard diagnostics for multicollinearity, such as VIFs, may fail to detect any of this. Even a diagnostic plot designed specifically to detect the use of a quadratic model in place of an interaction may fail to determine which model is best.

The point of this analysis, and its principal value, is to characterize the situation described in the question. With such a characterization available, it is then easy to simulate data that behave accordingly.

Consider two predictors $X_1$ and $X_2$ (which we will standardize so that each has unit variance in the dataset), and suppose the random response $Y$ is determined by these predictors, their interaction, and independent random error:

$$Y = \beta_1 X_1 + \beta_2 X_2 + \beta_{1,2} X_1 X_2 + \varepsilon.$$

In many situations the predictors are correlated; a dataset might look like this:

[Scatterplot matrix of the simulated data]

These sample data were generated with $\beta_1 = \beta_2 = 1$ and $\beta_{1,2} = 0.1$. The correlation between $X_1$ and $X_2$ is $0.85$.
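For concreteness, here is a minimal R sketch of one way to simulate data with these properties. The variable names (x0, x1, x2, y) and the particular construction via a shared component are illustrative choices, not necessarily the code used to produce the data above:

```r
set.seed(17)
n <- 150                  # matches the 146 residual degrees of freedom below
beta   <- c(1, 1)         # beta_1 and beta_2
beta12 <- 0.1             # the small interaction coefficient beta_{1,2}
sigma  <- 0.25            # error SD, roughly consistent with the fits below

# Correlated predictors: a shared component x0 plus small independent offsets,
# each then standardized to unit variance as described in the text.
x0 <- rnorm(n)
x1 <- scale(x0 + rnorm(n, sd = 0.4))[, 1]
x2 <- scale(x0 + rnorm(n, sd = 0.4))[, 1]
cor(x1, x2)               # should be in the vicinity of 0.85

y <- beta[1]*x1 + beta[2]*x2 + beta12*x1*x2 + rnorm(n, sd = sigma)
```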
This does not necessarily mean we are thinking of $X_1$ and $X_2$ as realizations of random variables: it includes the situation where both $X_1$ and $X_2$ are settings in a designed experiment, but for some reason those settings are not orthogonal.

Regardless of how the correlation arises, a good way to describe it is in terms of how much the predictors differ from their average, $X_0 = (X_1 + X_2)/2$. These differences will be fairly small (in the sense that their variances are less than $1$); the greater the correlation between $X_1$ and $X_2$, the smaller these differences will be. Writing, then, $X_1 = X_0 + \delta_1$ and $X_2 = X_0 + \delta_2$, we can re-express (say) $X_2$ in terms of $X_1$ as $X_2 = X_1 + (\delta_2 - \delta_1)$. Plugging this into the interaction term only, the model is

$$Y = \beta_1 X_1 + \beta_2 X_2 + \beta_{1,2} X_1 \left(X_1 + (\delta_2 - \delta_1)\right) + \varepsilon = \left(\beta_1 + \beta_{1,2}(\delta_2 - \delta_1)\right) X_1 + \beta_2 X_2 + \beta_{1,2} X_1^2 + \varepsilon.$$

Provided the values of $\beta_{1,2}(\delta_2 - \delta_1)$ vary only a little compared to $\beta_1$, we can gather this variation with the true random term, writing

$$Y = \beta_1 X_1 + \beta_2 X_2 + \beta_{1,2} X_1^2 + \left(\varepsilon + \beta_{1,2}(\delta_2 - \delta_1) X_1\right).$$

Thus, if we regress $Y$ against $X_1$, $X_2$, and $X_1^2$, we will make an error: the variation in the residuals will depend on $X_1$ (that is, it will be heteroscedastic). This can be seen with a simple variance calculation:

$$\operatorname{Var}\left(\varepsilon + \beta_{1,2}(\delta_2 - \delta_1) X_1\right) = \operatorname{Var}(\varepsilon) + \beta_{1,2}^2 X_1^2 \operatorname{Var}(\delta_2 - \delta_1).$$
However, if the typical variation in $\varepsilon$ substantially exceeds the typical variation in $\beta_{1,2}(\delta_2 - \delta_1) X_1$, that heteroscedasticity will be so low as to be undetectable (and should yield a fine model). (As shown below, one way to look for this violation of the regression assumptions is to plot the absolute values of the residuals against the absolute values of $X_1$, remembering first to standardize $X_1$ if necessary.) This is the characterization we were seeking.
Remembering that $X_1$ and $X_2$ were assumed to be standardized to unit variance, this implies the variance of $\delta_2 - \delta_1$ will be relatively small. To reproduce the observed behavior, then, it should suffice to pick a small absolute value for $\beta_{1,2}$, but make it large enough (or use a large enough dataset) so that it will be significant.
In short, when the predictors are correlated and the interaction is small but not too small, a quadratic term (in either predictor alone) and an interaction term will be individually significant but confounded with each other. Statistical methods alone are unlikely to help us decide which is better to use.
Let's check this out with the sample data by fitting several models. Recall that $\beta_{1,2}$ was set to $0.1$ when simulating these data. Although that is small (the quadratic behavior is not even visible in the previous scatterplots), with $150$ data points we have a chance of detecting it.
First, the quadratic model:
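The summary output below can be reproduced with something like the following (fit.quad is an illustrative name, assuming data simulated as in the sketch above):

```r
fit.quad <- lm(y ~ x1 + x2 + I(x1^2))
summary(fit.quad)
```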
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03363 0.03046 1.104 0.27130
x1 0.92188 0.04081 22.592 < 2e-16 ***
x2 1.05208 0.04085 25.756 < 2e-16 ***
I(x1^2) 0.06776 0.02157 3.141 0.00204 **
Residual standard error: 0.2651 on 146 degrees of freedom
Multiple R-squared: 0.9812, Adjusted R-squared: 0.9808
The quadratic term is significant. Its coefficient, $0.068$, underestimates $\beta_{1,2} = 0.1$, but it's of the right size and the right sign. As a check for multicollinearity (correlation among the predictors) we compute the variance inflation factors (VIF):
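One way to compute them is with the vif function from the car package (assuming the quadratic fit is stored as fit.quad, as above):

```r
library(car)     # provides vif()
vif(fit.quad)
```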
x1 x2 I(x1^2)
3.531167 3.538512 1.009199
Any value less than $5$ is usually considered just fine. These are not alarming.
Next, the model with an interaction but no quadratic term:
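A sketch of the corresponding fit (again, fit.int is an illustrative name; the VIFs shown after the coefficient table come from the same vif call as before):

```r
fit.int <- lm(y ~ x1 + x2 + x1:x2)
summary(fit.int)
vif(fit.int)     # multicollinearity check, output shown below
```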
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02887 0.02975 0.97 0.333420
x1 0.93157 0.04036 23.08 < 2e-16 ***
x2 1.04580 0.04039 25.89 < 2e-16 ***
x1:x2 0.08581 0.02451 3.50 0.000617 ***
Residual standard error: 0.2631 on 146 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9811
x1 x2 x1:x2
3.506569 3.512599 1.004566
All the results are similar to the previous ones. The two models are about equally good (with a very tiny advantage to the interaction model).
Finally, let's include both the interaction and quadratic terms:
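A sketch of the full fit, using the same illustrative names:

```r
fit.both <- lm(y ~ x1 + x2 + I(x1^2) + x1:x2)
summary(fit.both)
vif(fit.both)    # now includes both the quadratic and interaction terms
```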
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02572 0.03074 0.837 0.404
x1 0.92911 0.04088 22.729 <2e-16 ***
x2 1.04771 0.04075 25.710 <2e-16 ***
I(x1^2) 0.01677 0.03926 0.427 0.670
x1:x2 0.06973 0.04495 1.551 0.123
Residual standard error: 0.2638 on 145 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.981
x1 x2 I(x1^2) x1:x2
3.577700 3.555465 3.374533 3.359040
Now, neither the quadratic term nor the interaction term is significant, because each is trying to estimate a part of the interaction in the model. Another way to see this is that nothing was gained (in terms of reducing the residual standard error) when adding the quadratic term to the interaction model or when adding the interaction term to the quadratic model. It is noteworthy that the VIFs do not detect this situation: although the fundamental explanation for what we have seen is the slight collinearity between $X_1$ and $X_2$, which induces a collinearity between $X_1^2$ and $X_1 X_2$, neither is large enough to raise flags.
If we had tried to detect the heteroscedasticity in the quadratic model (the first one), we would be disappointed:
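A sketch of the diagnostic plot described earlier, plotting the absolute residuals against $|X_1|$ with a loess smooth (the details are illustrative):

```r
res <- abs(residuals(fit.quad))
a1  <- abs(x1)
plot(a1, res, xlab = "|x1|", ylab = "|residual|")
o <- order(a1)
lines(a1[o], fitted(loess(res ~ a1))[o], col = "red")  # loess smooth
```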
In the loess smooth of this scatterplot there is ever so faint a hint that the sizes of the residuals increase with $|X_1|$, but nobody would take this hint seriously.
What makes the most sense based on the source of the data?
We cannot answer this question for you, and neither can the computer. The reason we still need statisticians, rather than just statistical programs, is questions like this. Statistics is about more than just crunching the numbers; it is about understanding the question and the source of the data, and being able to make decisions based on the science, the background, and other information outside the data that the computer looks at. Your teacher is probably hoping that you will contemplate this as part of the assignment. If I had assigned a problem like this (and I have before), I would be more interested in the justification of your answer than in which model you actually chose.
It is probably beyond your current class, but one approach, if there is no clear scientific reason for preferring one model over the other, is model averaging: you fit both models (and possibly several other models as well), then average their predictions together (often weighted by the goodness of fit of the different models).
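For concreteness, here is a rough sketch of one common weighting scheme, Akaike weights, applied to the two fits above (packages such as MuMIn automate this; the object names are the illustrative ones used earlier):

```r
# Akaike weights for the two competing models
aics <- c(quad = AIC(fit.quad), int = AIC(fit.int))
w <- exp(-(aics - min(aics)) / 2)
w <- w / sum(w)

# Model-averaged predictions
pred.avg <- w["quad"] * predict(fit.quad) + w["int"] * predict(fit.int)
```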
Another option, when possible, is to collect more data, if possible choosing the x values so that it becomes clearer what the nonlinear vs. interaction effects are.
There are some tools for comparing the fit of non-nested models (AIC, BIC, etc.), but in this case they probably will not show enough of a difference to overrule an understanding of where the data come from and what makes the most sense.
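In R these comparisons are one-liners, e.g. (using the illustrative fit names from above):

```r
AIC(fit.quad, fit.int)   # lower is better; expect a very small difference here
BIC(fit.quad, fit.int)
```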
In addition to @Greg's answer, another possibility is to include both terms, even if one of them is not significant. Including only statistically significant terms is not a law of the universe.