Answers:
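In this case you can collapse your data to

$$\begin{array}{c|cc} x \backslash y & 0 & 1 \\ \hline 0 & S_{00} & S_{01} \\ 1 & S_{10} & S_{11} \end{array}$$

where $S_{ij}$ is the number of instances with $x = i$ and $y = j$, for $i, j \in \{0, 1\}$. Suppose there are $n$ observations overall.
If we fit the model $p_i = g^{-1}(\beta_0 + \beta_1 x_i)$ (where $g$ is our link function), we'll find that $\hat\beta_0$ is the logit of the proportion of successes when $x_i = 0$, and $\hat\beta_0 + \hat\beta_1$ is the logit of the proportion of successes when $x_i = 1$. In other words, $\hat\beta_0 = g\left(\frac{S_{01}}{S_{00} + S_{01}}\right)$ and $\hat\beta_0 + \hat\beta_1 = g\left(\frac{S_{11}}{S_{10} + S_{11}}\right)$.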
Let's check this in R.
n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)
y <- rbinom(n, 1, .6)
tbl <- table(x=x,y=y)
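# observation-level logistic regression of y on the binary predictor x
mod <- glm(y ~ x, family=binomial())
# proportion of successes when x == 0: all the same at 0.5757576
binomial()$linkinv(coef(mod)[1])
mean(y[x == 0])
tbl[1,2] / sum(tbl[1,])
# proportion of successes when x == 1: all the same at 0.5714286
binomial()$linkinv(coef(mod)[1] + coef(mod)[2])
mean(y[x == 1])
tbl[2,2] / sum(tbl[2,])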
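So the logistic regression coefficients are exact transformations of the proportions in the table.
The upshot is that if we have data from a series of Bernoulli random variables, we certainly can analyze the dataset with a logistic regression, but doing so turns out to be no different from directly analyzing the resulting contingency table.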
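As a rough check of that equivalence, the sketch below fits the same model twice: once on the individual 0/1 observations and once on the collapsed table, using R's two-column `cbind(successes, failures)` response. The names `mod_bern`, `mod_binom`, and `grouped` are illustrative and not from the original answer.

```r
# Sketch: observation-level fit vs. fit on the collapsed 2x2 table.
set.seed(123)
n <- 54
x <- rbinom(n, 1, 0.4)
y <- rbinom(n, 1, 0.6)

# Observation-level (Bernoulli) logistic regression
mod_bern <- glm(y ~ x, family = binomial())

# Collapse to one row per value of x: counts of successes and failures
tbl <- table(x = x, y = y)
grouped <- data.frame(
  x       = c(0, 1),
  success = c(tbl["0", "1"], tbl["1", "1"]),
  failure = c(tbl["0", "0"], tbl["1", "0"])
)

# Grouped (binomial) logistic regression on the collapsed counts
mod_binom <- glm(cbind(success, failure) ~ x,
                 family = binomial(), data = grouped)

# Compare the two sets of coefficients (they should agree up to
# the numerical tolerance of the fitting algorithm)
all.equal(unname(coef(mod_bern)), unname(coef(mod_binom)), tolerance = 1e-5)
```

The grouped fit sees only two rows, so it is saturated; the point estimates are what should match, not the residual deviance or degrees of freedom.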
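I want to comment on why this works from a theoretical perspective. When we fit a logistic regression, we are using the model $Y_i \mid x_i \sim \text{Bern}(p_i)$. We then decide to model the mean as a transformation of a linear predictor in $x_i$, or in symbols $p_i = g^{-1}(\beta_0 + \beta_1 x_i)$. In our case we only have two unique values of $x_i$, and therefore only two unique values of $p_i$, say $p_0$ and $p_1$. Because of our independence assumption, we have $\sum_{i : x_i = 0} Y_i = S_{01} \sim \text{Bin}(n_0, p_0)$ and $\sum_{i : x_i = 1} Y_i = S_{11} \sim \text{Bin}(n_1, p_1)$, where $n_j = S_{j0} + S_{j1}$ is the number of observations with $x_i = j$ (recall that $S_{jk}$ counts the observations with $x = j$ and $y = k$).
This means that $\frac{S_{01}}{n_0} = \frac{S_{01}}{S_{00} + S_{01}} \to_p p_0$ and $\frac{S_{11}}{n_1} = \frac{S_{11}}{S_{10} + S_{11}} \to_p p_1$.
The key insight here: our Bernoulli RVs are $Y_i \mid x_i = j \sim \text{Bern}(p_j)$ while our binomial RVs are $S_{j1} \sim \text{Bin}(n_j, p_j)$, but both have the same probability of success. That's why these contingency table proportions estimate the same thing as an observation-level logistic regression. It's not a coincidence of the table: it's a direct consequence of the distributional assumptions we have made.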
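To spell out why the fitted coefficients land exactly on the table proportions, here is a short derivation sketch. The notation is assumed for this sketch: $S_{jk}$ is the number of observations with $x = j$ and $y = k$, $n_j = S_{j0} + S_{j1}$, $p_j$ is the success probability when $x = j$, and $\ell$ is the log-likelihood.

```latex
% Log-likelihood of the independent Bernoulli observations, grouped by x = j
\ell(p_0, p_1) = \sum_{j \in \{0, 1\}} \left[ S_{j1} \log p_j + S_{j0} \log(1 - p_j) \right]

% Setting the derivative with respect to p_j to zero
\frac{\partial \ell}{\partial p_j} = \frac{S_{j1}}{p_j} - \frac{S_{j0}}{1 - p_j} = 0
\quad \Longrightarrow \quad
\hat{p}_j = \frac{S_{j1}}{S_{j0} + S_{j1}} = \frac{S_{j1}}{n_j}

% The map (\beta_0, \beta_1) \mapsto (p_0, p_1) is one-to-one, so by invariance of the MLE
\hat{\beta}_0 = g(\hat{p}_0), \qquad \hat{\beta}_0 + \hat{\beta}_1 = g(\hat{p}_1)
```

The same log-likelihood, up to an additive constant that does not involve $p_0$ or $p_1$, arises from the two binomial counts, which is why both views give identical estimates.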
When you have more than one predictor and all of the predictors are binary, you could fit a model using Logic Regression [1] (note that it's "Logic", not "Logistic"). It's useful when you believe interaction effects among your predictors are prominent. There's an implementation in R (the LogicReg package).
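To give a feel for the kind of predictor logic regression works with, the sketch below builds one Boolean combination of binary predictors by hand and uses it in an ordinary `glm`. This is base R only and is not the LogicReg search procedure; the Boolean expression, the simulated data, and the success probabilities 0.8 and 0.3 are all assumptions made for illustration.

```r
# Sketch only: a hand-written Boolean term, not the LogicReg search itself.
set.seed(1)
n  <- 200
x1 <- rbinom(n, 1, 0.5)
x2 <- rbinom(n, 1, 0.5)
x3 <- rbinom(n, 1, 0.5)

# Hypothetical logic term: (x1 AND NOT x2) OR x3, coded 0/1
logic_term <- as.integer((x1 == 1 & x2 == 0) | x3 == 1)

# Simulated outcome whose success probability depends only on the logic term
# (0.8 and 0.3 are assumed values)
y <- rbinom(n, 1, ifelse(logic_term == 1, 0.8, 0.3))

# With the term fixed in advance, this is a logistic regression
# on a single binary predictor
fit <- glm(y ~ logic_term, family = binomial())
plogis(coef(fit)[1])    # estimated P(y = 1 | logic_term = 0)
plogis(sum(coef(fit)))  # estimated P(y = 1 | logic_term = 1)
```

Logic regression searches over such Boolean combinations rather than having the analyst fix one in advance, which is where the package comes in.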
[1] Ruczinski, I., Kooperberg, C., & LeBlanc, M. (2003). Logic regression. Journal of Computational and Graphical Statistics, 12(3), 475-511.