When you're doing it yourself and paying attention to what you're doing, you develop a sense of when you are over-fitting the model. For one thing, you can track the trend or deterioration in the adjusted R-squared of the model. You can also track a similar deterioration in the p-values of the regression coefficients of the main variables.

But when you are just reading someone else's study and have no insight into their internal model-development process, how can you clearly tell whether a model is over-fitted or not?
Answers:
When I'm fitting a model myself, I generally use information criteria during the fitting process, such as AIC or BIC, or alternatively likelihood-ratio tests for models fitted by maximum likelihood, or F-tests for models fitted by least squares.

All of these are conceptually similar in that they penalize additional parameters. They set a threshold of "additional explanatory power" that each new parameter added to the model must clear. They are all a form of regularization.

For other people's models, I look at the methods section to see whether such techniques were used, and I also rely on rules of thumb, such as the number of observations per parameter: if there are roughly 5 (or fewer) observations per parameter, I start to get suspicious.
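To make this concrete, here is a minimal sketch (assuming Python with numpy and statsmodels; the data and variable names are invented for illustration) of tracking AIC/BIC and the observations-per-parameter rule of thumb as predictors are added:

```python
# Track AIC/BIC and observations-per-parameter while growing a model.
# Simulated data: only the first two predictors carry real signal.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 100, 10
X = rng.normal(size=(n, k))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=n)

for p in range(1, k + 1):
    # Fit with the first p candidate predictors (a stand-in for any stepwise path).
    design = sm.add_constant(X[:, :p])
    fit = sm.OLS(y, design).fit()
    obs_per_param = fit.nobs / design.shape[1]
    print(f"p={p:2d}  AIC={fit.aic:8.2f}  BIC={fit.bic:8.2f}  "
          f"adj R2={fit.rsquared_adj:.3f}  obs/param={obs_per_param:.1f}")
    # Once AIC/BIC stop falling (or obs/param drops toward ~5), the extra
    # parameters look like over-fitting rather than added explanatory power.
```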
Always remember that a variable does not need to be "significant" to belong in a model. It may be a confounder, and if your goal is to estimate the effect of other variables, it should be included on that basis.
I would suggest that this is a question of how the results are reported. Not to "beat the Bayesian drum", but approaching model uncertainty as an inference problem from a Bayesian perspective would go a long way here. And it doesn't have to be a big change either. If the report simply contained the probability that the model is true, that would be very helpful. This is an easy quantity to approximate using BIC. Call the BIC of the m-th model $BIC_m$. Then, given that $M$ models were fitted (and that one of them is true), the probability that the m-th model is the "true" model is:

$$P(\text{model } m \text{ is true} \mid \text{data}) \approx \frac{w_m \exp\left(-\tfrac{1}{2}BIC_m\right)}{\sum_{j=1}^{M} w_j \exp\left(-\tfrac{1}{2}BIC_j\right)}$$
where $w_j$ is proportional to the prior probability for the j-th model. Note that this includes a "penalty" for trying too many models, and the penalty depends on how well the other models fit the data. Usually you will set $w_j = 1$; however, you may have some "theoretical" models within your class that you would expect to be better prior to seeing any data.
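As a small illustration of the formula above, here is a sketch in Python (the BIC values are invented purely for illustration) that turns a set of BICs and optional prior weights $w_j$ into approximate model probabilities:

```python
# Approximate posterior model probabilities from BICs (equal priors by default).
import numpy as np

def model_probabilities(bic, w=None):
    bic = np.asarray(bic, dtype=float)
    w = np.ones_like(bic) if w is None else np.asarray(w, dtype=float)
    # Subtract the smallest BIC before exponentiating for numerical stability;
    # the constant cancels in the ratio.
    rel = w * np.exp(-0.5 * (bic - bic.min()))
    return rel / rel.sum()

bics = [312.4, 315.1, 319.8, 320.2]   # hypothetical BICs for M = 4 fitted models
print(model_probabilities(bics))       # the lowest-BIC model gets most of the mass
```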
Now, if somebody else doesn't report all the BICs from all the models, then I would attempt to infer the above quantity from what you have been given. Suppose you are given the BIC of the reported model, $BIC_m$ - note that BIC is calculable from the mean squared error of a regression model, so you can always get BIC for the reported model. Now, if we take the basic premise that the final model was chosen as the one with the smallest BIC, then $BIC_m = \min_j BIC_j$. Next, suppose you were told that "forward" or "forward stepwise" model selection was used, starting from the intercept with $K$ potential variables. If the final model is of dimension $p$, then the procedure must have tried at least

$$1 + \sum_{j=0}^{p-1} (K - j) = 1 + pK - \frac{p(p-1)}{2}$$
different models (this is exact for forward selection). If backward selection was used, then we know at least

$$1 + \sum_{j=p+1}^{K} j = 1 + \frac{K(K+1) - p(p+1)}{2}$$
models were tried (the +1 comes from the null model or the full model, respectively). Now we could try to be more specific, but these are "minimal" counts that a standard model-selection procedure must satisfy. We could specify a probability model for the number of models tried and for the sizes of the BICs, but simply plugging in some values can be useful here anyway. For example, suppose that all the BICs were $\Delta$ bigger than that of the model chosen, so that $BIC_j = BIC_m + \Delta$ for $j \neq m$; then, with $w_j = 1$, the probability becomes:

$$P(\text{model } m \text{ is true} \mid \text{data}) = \frac{1}{1 + (M - 1)\exp(-\Delta/2)}$$
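These counting bounds and the resulting probability are easy to play with numerically. The sketch below (Python; the values of $K$, $p$ and $\Delta$ are assumptions chosen for illustration, not taken from any study) computes the minimal number of models tried under forward and backward selection and the corresponding probability when every discarded model's BIC exceeds the winner's by $\Delta$:

```python
# Minimal number of models a stepwise search must have tried, and the
# implied probability that the chosen model is the true one.
import numpy as np

def models_tried_forward(K, p):
    # intercept-only model, then K, K-1, ..., K-p+1 candidate additions
    return 1 + sum(K - j for j in range(p))

def models_tried_backward(K, p):
    # full model, then K, K-1, ..., p+1 candidate deletions
    return 1 + sum(range(p + 1, K + 1))

def prob_best_model(M, delta):
    # P = 1 / (1 + (M - 1) * exp(-Delta / 2)): all other BICs are Delta above the winner
    return 1.0 / (1.0 + (M - 1) * np.exp(-0.5 * delta))

K, p, delta = 50, 20, 10.0   # hypothetical search: 50 candidates, 20 kept, BIC gap of 10
for label, M in [("forward", models_tried_forward(K, p)),
                 ("backward", models_tried_backward(K, p))]:
    print(f"{label}: >= {M} models tried, P(chosen model true) ~ {prob_best_model(M, delta):.3f}")
```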
So what this means is that unless $\Delta$ is large or $M$ is small, the probability will be small as well. From an "over-fitting" perspective, this occurs when the BIC of the bigger model is not much bigger than the BIC of the smaller model, so that a non-negligible term appears in the denominator. Plugging the backward-selection formula for $M$ into this expression, we get:

$$P(\text{model } m \text{ is true} \mid \text{data}) = \frac{1}{1 + \frac{K(K+1) - p(p+1)}{2}\exp(-\Delta/2)}$$
Now suppose we invert the problem: given $K$ candidate variables of which backward selection kept $p$, how large would $\Delta$ have to be to make the probability of the chosen model greater than some value $P_0$? We have

$$\Delta \geq 2\log\left(\frac{P_0}{1 - P_0}(M - 1)\right)$$

Setting $P_0 = 0.9$, for instance, gives $\Delta \geq 2\log\left(9(M - 1)\right)$ - so the BIC of the winning model has to win by a lot for the model to be certain.
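As a numerical illustration of this inversion, the sketch below assumes a hypothetical backward-selection search with $K = 50$ candidate variables and $p = 20$ retained (assumed values, not from any particular study) and solves for the $\Delta$ needed to reach $P_0 = 0.9$:

```python
# BIC gap Delta required for the winning model to have probability at least p0.
import numpy as np

def delta_required(M, p0):
    # Solve 1 / (1 + (M - 1) * exp(-Delta / 2)) >= p0 for Delta
    return 2.0 * np.log(p0 * (M - 1) / (1.0 - p0))

K, p = 50, 20                         # hypothetical backward-selection search
M = 1 + sum(range(p + 1, K + 1))      # minimal number of models tried
print(delta_required(M, 0.9))          # roughly 18 BIC units: the winner must win by a lot
```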