Does it make sense to use logistic regression with a binary outcome and a binary predictor?


18

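I have a binary outcome variable {0,1} and a predictor variable {0,1}. My thinking is that it doesn't make sense to run a logistic regression unless I include other variables and calculate the odds ratio.

With a single binary predictor, wouldn't calculating the probabilities suffice in place of the odds ratio?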

Answers:


26

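In this case you can collapse your data to

$$\begin{array}{c|cc} X \backslash Y & 0 & 1 \\ \hline 0 & S_{00} & S_{01} \\ 1 & S_{10} & S_{11} \end{array}$$

where $S_{ij}$ is the number of instances with $x = i$ and $y = j$, for $i, j \in \{0,1\}$. Suppose there are $n$ observations overall.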

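If we fit the model $p_i = g^{-1}(x_i^T \beta) = g^{-1}(\beta_0 + \beta_1 1_{x_i = 1})$ (where $g$ is our link function), we'll find that $\hat\beta_0$ is the logit of the proportion of successes when $x_i = 0$, and $\hat\beta_0 + \hat\beta_1$ is the logit of the proportion of successes when $x_i = 1$. In other words,

$$\hat\beta_0 = g\left(\frac{S_{01}}{S_{00} + S_{01}}\right)$$
and
$$\hat\beta_0 + \hat\beta_1 = g\left(\frac{S_{11}}{S_{10} + S_{11}}\right).$$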
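
Since the question asks about the odds ratio, it may help to spell out what the two equations imply for the slope. This is a short worked step, assuming $g$ is the logit link, $g(p) = \log\frac{p}{1-p}$ (the text above keeps $g$ general), and that all four cell counts are positive. Subtracting the first equation from the second gives:

```latex
\hat\beta_1
  = g\!\left(\frac{S_{11}}{S_{10}+S_{11}}\right) - g\!\left(\frac{S_{01}}{S_{00}+S_{01}}\right)
  = \log\frac{S_{11}}{S_{10}} - \log\frac{S_{01}}{S_{00}}
  = \log\frac{S_{11}\,S_{00}}{S_{10}\,S_{01}}
```

So under these assumptions $\exp(\hat\beta_1)$ is the usual cross-product odds ratio of the 2x2 table.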

Let's check this in R.

n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)
y <- rbinom(n, 1, .6)

tbl <- table(x=x,y=y)

mod <- glm(y ~ x, family=binomial())

# all the same at 0.5757576
binomial()$linkinv( mod$coef[1])
mean(y[x == 0])
tbl[1,2] / sum(tbl[1,])

# all the same at 0.5714286
binomial()$linkinv( mod$coef[1] + mod$coef[2])
mean(y[x == 1])
tbl[2,2] / sum(tbl[2,])

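So the logistic regression coefficients are exactly transformations of the proportions coming from the table.

The upshot is that we certainly can analyze this dataset with a logistic regression if we have data coming from a series of Bernoulli random variables, but it turns out to be no different from directly analyzing the resulting contingency table.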
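
The equivalence with the contingency table can also be checked from the other direction. The following is a minimal sketch, not part of the original check: it regenerates the same simulated data so that it runs on its own, collapses them to the 2x2 table, and fits a grouped binomial model with `cbind(success, failure)` as the response. The names `grouped`, `mod_bern`, `mod_bin`, `or_model`, and `or_table` are illustrative, and the code assumes every cell of the table is nonempty.

```r
# Same simulated data as in the check above, regenerated so this block is self-contained
n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)
y <- rbinom(n, 1, .6)

# Observation-level (Bernoulli) fit
mod_bern <- glm(y ~ x, family = binomial())

# Collapse to the 2x2 table: rows index x, columns index y
tbl <- table(x = x, y = y)

# Grouped (binomial) fit: one row per level of x
grouped <- data.frame(
  x       = c(0, 1),
  success = c(tbl["0", "1"], tbl["1", "1"]),
  failure = c(tbl["0", "0"], tbl["1", "0"])
)
mod_bin <- glm(cbind(success, failure) ~ x, family = binomial(), data = grouped)

# The two fits should give the same coefficients up to numerical tolerance
print(rbind(bernoulli = coef(mod_bern), grouped = coef(mod_bin)))

# Odds ratio from the model vs. the cross-product ratio from the table
or_model <- exp(coef(mod_bern)[["x"]])
or_table <- (tbl["1", "1"] * tbl["0", "0"]) / (tbl["1", "0"] * tbl["0", "1"])
print(c(model = or_model, table = or_table))
```

Both comparisons should agree to within the convergence tolerance of `glm`, which is the practical meaning of "no different from analyzing the table".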


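I want to comment on why this works from a theoretical perspective. When we fit a logistic regression, we are using the model $Y_i | x_i \sim \text{Bern}(p_i)$. We then decide to model the mean as a transformation of a linear predictor in $x_i$, or in symbols $p_i = g^{-1}(\beta_0 + \beta_1 x_i)$. In our case we only have two unique values of $x_i$, and therefore only two unique values of $p_i$, say $p_0$ and $p_1$. Because of our independence assumption we have

$$\sum_{i : x_i = 0} Y_i = S_{01} \sim \text{Bin}(n_0, p_0)$$
and
$$\sum_{i : x_i = 1} Y_i = S_{11} \sim \text{Bin}(n_1, p_1).$$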
Note how we're using the fact that the $x_i$, and in turn $n_0$ and $n_1$, are nonrandom: if this were not the case then these would not necessarily be binomial.

This means that

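$$\frac{S_{01}}{n_0} = \frac{S_{01}}{S_{00} + S_{01}} \to_p p_0 \quad \text{and} \quad \frac{S_{11}}{n_1} = \frac{S_{11}}{S_{10} + S_{11}} \to_p p_1.$$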

The key insight here: our Bernoulli RVs are $Y_i | x_i = j \sim \text{Bern}(p_j)$ while our binomial RVs are $S_{j1} \sim \text{Bin}(n_j, p_j)$, but both have the same probability of success. That's the reason why these contingency table proportions are estimating the same thing as an observation-level logistic regression. It's not just some coincidence with the table: it's a direct consequence of the distributional assumptions we have made.
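
The convergence claim can be illustrated with a small simulation. This is only a sketch under assumed values: `p0 = 0.3`, `p1 = 0.7`, the sample sizes, and the alternating fixed design for `x` are all made up for illustration and do not come from the answer.

```r
set.seed(1)
p0 <- 0.3   # assumed P(Y = 1 | x = 0)
p1 <- 0.7   # assumed P(Y = 1 | x = 1)
sizes <- c(50, 500, 5000, 50000)

# For each sample size, draw Y_i | x_i = j ~ Bern(p_j) on a fixed design
# and record the two table proportions S01/n0 and S11/n1
res <- t(sapply(sizes, function(n) {
  x <- rep(c(0, 1), length.out = n)           # nonrandom x, so n0 and n1 are fixed
  y <- rbinom(n, 1, ifelse(x == 1, p1, p0))
  c(n = n, prop0 = mean(y[x == 0]), prop1 = mean(y[x == 1]))
}))
print(res)
```

As `n` grows, the `prop0` and `prop1` columns should settle near the assumed `p0` and `p1`, which is what the convergence in probability above describes.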


1

When you have more than one predictor and all the predictors are binary variables, you could fit a model using Logic Regression [1] (note it's "Logic", not "Logistic"). It's useful when you believe interaction effects among your predictors are prominent. There's an implementation in R (the LogicReg package).

[1] Ruczinski, I., Kooperberg, C., & LeBlanc, M. (2003). Logic regression. Journal of Computational and Graphical Statistics, 12(3), 475-511.


1
The question is specifically about one regressor, thus your answer would better serve as a comment.
Richard Hardy
Licensed under cc by-sa 3.0 with attribution required.