Logistic regression: anova chi-square test vs. significance of coefficients (anova() vs summary() in R)


35

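I have a logistic GLM model with 8 variables. I ran a chi-square test in R with anova(glm.model,test='Chisq'), and 2 of the variables turn out to be predictive when ordered at the top of the test and not so much when ordered at the bottom. The summary(glm.model) suggests that their coefficients are not significant (high p-value). In this case it seems the variables are not significant.

I wanted to ask which is the better test of a variable's significance: the coefficient significance in the model summary or the chi-square test from anova(). Also, when is either one better than the other?

I guess it's a broad question, but any pointers on what to consider will be appreciated.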


4
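This is analogous to the distinction between type I and type III sums of squares for testing coefficients in linear models. It may help you to read my answer here: How to interpret type I sequential ANOVA and MANOVA
gung - Reinstate Monica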

Answers:


61

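In addition to @gung's answer, I'll try to provide an example of what the anova function actually tests. I hope this enables you to decide which tests are appropriate for the hypotheses you are interested in testing.

Let's assume that you have an outcome y and 3 predictor variables: x1, x2, and x3. Suppose your logistic regression model is my.mod <- glm(y~x1+x2+x3, family="binomial"). When you run anova(my.mod, test="Chisq"), the function compares the following models in sequential order:

  1. glm(y~1, family="binomial") vs. glm(y~x1, family="binomial")
  2. glm(y~x1, family="binomial") vs. glm(y~x1+x2, family="binomial")
  3. glm(y~x1+x2, family="binomial") vs. glm(y~x1+x2+x3, family="binomial")

So it sequentially compares the smaller model with the next more complex model by adding one variable in each step. Each of those comparisons is done via a likelihood ratio test (LR test; see example below). To my knowledge, these hypotheses are rarely of interest, but this has to be decided by you.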
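Because the comparisons are sequential, the result for a variable can depend on where it sits in the formula, which is what the question describes. Here is a minimal sketch of that effect on simulated data; the variables x1, x2, y and the model names mod_a and mod_b are made up for illustration and are not part of the admissions example.

```r
# Simulated data (hypothetical): x2 is correlated with x1,
# and only x2 has a direct effect on the outcome.
set.seed(1)
n   <- 500
x1  <- rnorm(n)
x2  <- 0.8 * x1 + rnorm(n, sd = 0.6)
dat <- data.frame(x1 = x1, x2 = x2,
                  y  = rbinom(n, 1, plogis(-0.5 + 0.8 * x2)))

# Same model, terms entered in a different order
mod_a <- glm(y ~ x1 + x2, data = dat, family = "binomial")
mod_b <- glm(y ~ x2 + x1, data = dat, family = "binomial")

# Sequential tests: the row for x1 differs between the two tables
tab_a <- anova(mod_a, test = "Chisq")
tab_b <- anova(mod_b, test = "Chisq")
print(tab_a)
print(tab_b)

# Wald tests from summary(): identical up to row order
print(coef(summary(mod_a)))
print(coef(summary(mod_b)))
```

In the first table x1 is compared against the intercept-only model, so it picks up the effect it shares with x2. In the second table x1 is tested after x2 is already in the model.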

Here is an example in R:

mydata      <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)

my.mod <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(my.mod)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# The sequential analysis
anova(my.mod, test="Chisq")

Terms added sequentially (first to last)    

     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                   399     499.98              
gre   1  13.9204       398     486.06 0.0001907 ***
gpa   1   5.7122       397     480.34 0.0168478 *  
rank  3  21.8265       394     458.52 7.088e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# We can make the comparisons by hand (adding a variable in each step)

  # model only the intercept
mod1 <- glm(admit ~ 1,                data = mydata, family = "binomial") 
  # model with intercept + gre
mod2 <- glm(admit ~ gre,              data = mydata, family = "binomial") 
  # model with intercept + gre + gpa
mod3 <- glm(admit ~ gre + gpa,        data = mydata, family = "binomial") 
  # model containing all variables (full model)
mod4 <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial") 

anova(mod1, mod2, test="LRT")

Model 1: admit ~ 1
Model 2: admit ~ gre
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       399     499.98                          
2       398     486.06  1    13.92 0.0001907 ***

anova(mod2, mod3, test="LRT")

Model 1: admit ~ gre
Model 2: admit ~ gre + gpa
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1       398     486.06                       
2       397     480.34  1   5.7122  0.01685 *

anova(mod3, mod4, test="LRT")

Model 1: admit ~ gre + gpa
Model 2: admit ~ gre + gpa + rank
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       397     480.34                          
2       394     458.52  3   21.826 7.088e-05 ***
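To make explicit what each of these comparisons computes, here is a small sketch of a likelihood ratio test done by hand: the test statistic is the drop in residual deviance, referred to a chi-square distribution whose degrees of freedom equal the number of added parameters. The data are simulated and the names (dat, small, large) are hypothetical, so the numbers will not match the admissions output.

```r
# Simulated data (hypothetical), two nested logistic models
set.seed(42)
n   <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.3 + 0.5 * dat$x1 + 0.4 * dat$x2))

small <- glm(y ~ x1,      data = dat, family = "binomial")
large <- glm(y ~ x1 + x2, data = dat, family = "binomial")

# LR statistic = difference in residual deviance
lr_stat <- deviance(small) - deviance(large)
# degrees of freedom = number of extra parameters in the larger model
lr_df   <- df.residual(small) - df.residual(large)
p_hand  <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# The same test via anova()
p_anova <- anova(small, large, test = "LRT")[2, "Pr(>Chi)"]

print(c(by_hand = p_hand, anova = p_anova))
```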

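The p-values in the output of summary(my.mod) come from Wald tests, which test the following hypotheses (the order of the terms does not matter here):

  • For the coefficient of x1: glm(y~x2+x3, family="binomial") vs. glm(y~x1+x2+x3, family="binomial")
  • For the coefficient of x2: glm(y~x1+x3, family="binomial") vs. glm(y~x1+x2+x3, family="binomial")
  • For the coefficient of x3: glm(y~x1+x2, family="binomial") vs. glm(y~x1+x2+x3, family="binomial")

So each coefficient is tested against the full model containing all coefficients. Wald tests are an approximation of the likelihood ratio test. We could also do the likelihood ratio tests (LR tests). Here is how: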

mod1.2 <- glm(admit ~ gre + gpa,  data = mydata, family = "binomial")
mod2.2 <- glm(admit ~ gre + rank, data = mydata, family = "binomial")
mod3.2 <- glm(admit ~ gpa + rank, data = mydata, family = "binomial")

anova(mod1.2, my.mod, test="LRT") # joint LR test for rank

Model 1: admit ~ gre + gpa
Model 2: admit ~ gre + gpa + rank
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       397     480.34                          
2       394     458.52  3   21.826 7.088e-05 ***

anova(mod2.2, my.mod, test="LRT") # LR test for gpa

Model 1: admit ~ gre + rank
Model 2: admit ~ gre + gpa + rank
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1       395     464.53                       
2       394     458.52  1   6.0143  0.01419 *

anova(mod3.2, my.mod, test="LRT") # LR test for gre

Model 1: admit ~ gpa + rank
Model 2: admit ~ gre + gpa + rank
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1       395     462.88                       
2       394     458.52  1   4.3578  0.03684 *
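As a side note, base R's drop1() can run this kind of "leave one term out" likelihood ratio test for every term in a single call, so the reduced models do not have to be fitted by hand. A minimal sketch on simulated data follows; dat, full and no_x2 are hypothetical names and not part of the example above.

```r
# Simulated data (hypothetical) with three predictors
set.seed(7)
n   <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.2 + 0.6 * dat$x1 - 0.4 * dat$x2))

full <- glm(y ~ x1 + x2 + x3, data = dat, family = "binomial")

# One LR test per term, each comparing the full model
# with the model that omits that term
d1 <- drop1(full, test = "LRT")
print(d1)

# Check one row against an explicit model comparison
no_x2   <- glm(y ~ x1 + x3, data = dat, family = "binomial")
p_hand  <- anova(no_x2, full, test = "LRT")[2, "Pr(>Chi)"]
p_drop1 <- d1["x2", "Pr(>Chi)"]
print(c(by_hand = p_hand, drop1 = p_drop1))
```

For a factor such as rank, drop1() removes all of its dummy coefficients at once, so it gives one joint test for the term.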

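The p-values from the likelihood ratio tests are very similar to those obtained from the Wald tests in summary(my.mod) above.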
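For comparison, the Wald p-values reported by summary() can be reproduced by hand: divide each estimate by its standard error and refer the resulting z value to a standard normal distribution. This is a sketch on simulated data; dat and fit are hypothetical names.

```r
# Simulated data (hypothetical)
set.seed(123)
n   <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.2 + 0.5 * dat$x1 + 0.3 * dat$x2))

fit <- glm(y ~ x1 + x2, data = dat, family = "binomial")

est    <- coef(fit)               # coefficient estimates
se     <- sqrt(diag(vcov(fit)))   # standard errors
z      <- est / se                # Wald z statistics
p_wald <- 2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided p-values

# Side by side with the p-values from summary()
p_summary <- coef(summary(fit))[, "Pr(>|z|)"]
print(cbind(by_hand = p_wald, summary = p_summary))
```

The Wald test only needs the full model fit, while each LR test needs a second, reduced fit. That is one practical reason summary() reports Wald tests.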

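Note: The third model comparison, for rank, in anova(my.mod, test="Chisq") is the same as the comparison for rank in anova(mod1.2, my.mod, test="Chisq"). The p-value is the same in both cases, 7.088·10^-5, because each is a comparison between the model without rank and the model containing it.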


1
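+1, this is a nice, comprehensive explanation. One small point: I believe that with test="Chisq" you are not running a likelihood ratio test; you need to set test="LRT" for that, see ?anova.glm.
gung - Reinstate Monica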

6
@gung Thanks for the compliment. test="LRT" and test="Chisq" are synonymous (it says so on the page you linked).
COOLSerdash

2
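No problem, but I think it's actually a good point. test="LRT" is better because it makes it immediately clear that this is a likelihood ratio test. I changed it. Thanks.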
COOLSerdash

4
+1 I'm impressed with your rapid progress in just a month and your ability to provide well-crafted, clear explanations. Thanks for your efforts!
ub

1
Good answer. May I ask how the p-values (7.088e-05, 0.01419, 0.03684) should be interpreted?
TheSimpliFire
Licensed under cc by-sa 3.0 with attribution required.