Does it make sense to use logistic regression with a binary outcome and a binary predictor?

18

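I have a binary outcome variable {0,1} and a binary predictor variable {0,1}. My thinking is that fitting a logistic regression doesn't make sense unless I include other variables and calculate odds ratios.

With a single binary predictor, wouldn't computing the probabilities be enough, compared with the odds ratio?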

— 基瓦尔
26

In this case you can collapse your data to

$$\begin{array}{c|cc} X \backslash Y & 0 & 1 \\ \hline 0 & S_{00} & S_{01} \\ 1 & S_{10} & S_{11} \end{array}$$

where $S_{ij}$ is the number of instances with $x = i$ and $y = j$, for $i,j \in \{0,1\}$. Suppose there are $n$ observations overall.

If we fit the model $p_i = g^{-1}(x_i^T \beta) = g^{-1}(\beta_0 + \beta_1 1_{x_i = 1})$ (where $g$ is our link function), we find that $\hat \beta_0$ is the logit of the proportion of successes when $x_i = 0$ and $\hat \beta_0 + \hat \beta_1$ is the logit of the proportion of successes when $x_i = 1$. In other words,

$$\hat \beta_0 = g\left(\frac{S_{01}}{S_{00} + S_{01}}\right)$$

and

$$\hat \beta_0 + \hat \beta_1 = g\left(\frac{S_{11}}{S_{10} + S_{11}}\right).$$
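With the logit link, subtracting the first equation from the second gives the slope in closed form, which ties the fit back to the odds ratio asked about in the question. A short derivation, assuming all four cells are nonzero so that the logits are finite:

```latex
\begin{align*}
g(p) &= \log\frac{p}{1 - p}, \\
\hat\beta_1
  &= g\left(\frac{S_{11}}{S_{10} + S_{11}}\right)
   - g\left(\frac{S_{01}}{S_{00} + S_{01}}\right) \\
  &= \log\frac{S_{11}}{S_{10}} - \log\frac{S_{01}}{S_{00}}
   = \log\frac{S_{11}\, S_{00}}{S_{10}\, S_{01}}.
\end{align*}
```

So $\exp(\hat\beta_1)$ is the sample odds ratio of the $2 \times 2$ table.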

Let's check this in R.

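n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)  # binary predictor
y <- rbinom(n, 1, .6)  # binary outcome, generated independently of x

# 2x2 contingency table: rows index x, columns index y
tbl <- table(x = x, y = y)

# observation-level logistic regression
mod <- glm(y ~ x, family = binomial())

# proportion of successes when x = 0: all the same at 0.5757576
binomial()$linkinv(coef(mod)[1])
mean(y[x == 0])
tbl[1, 2] / sum(tbl[1, ])

# proportion of successes when x = 1: all the same at 0.5714286
binomial()$linkinv(coef(mod)[1] + coef(mod)[2])
mean(y[x == 1])
tbl[2, 2] / sum(tbl[2, ])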

So the logistic regression coefficients are exact transformations of the proportions in the table.
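The same holds for the odds ratio the question asks about. The sketch below reuses the simulated data from the check above and compares the cross-product ratio of the table with the exponentiated slope; the names `or_table` and `or_glm` are illustrative, not from the original answer.

```r
# Same simulated data as in the check above
n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)
y <- rbinom(n, 1, .6)

tbl <- table(x = x, y = y)
mod <- glm(y ~ x, family = binomial())

# Sample odds ratio from the 2x2 table: (S11 * S00) / (S10 * S01)
or_table <- (tbl[2, 2] * tbl[1, 1]) / (tbl[2, 1] * tbl[1, 2])

# Odds ratio implied by the fitted slope
or_glm <- exp(coef(mod)[["x"]])

# Expected to agree up to the numerical tolerance of the glm fit
all.equal(or_table, or_glm, tolerance = 1e-6)
```

The comparison assumes no empty cells in the table; with a zero cell the sample odds ratio is 0 or undefined and the fitted slope diverges.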

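The upshot is that if we have data coming from a series of Bernoulli random variables we certainly can analyze this dataset with a logistic regression, but it turns out to be no different from directly analyzing the resulting contingency table.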
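One way to see this equivalence in practice is to fit the model twice: once on the individual Bernoulli observations and once on the collapsed two-row table, using a two-column `cbind(successes, failures)` response. This is a minimal sketch on the same simulated data; `collapsed`, `mod_bern`, and `mod_bin` are names chosen here for illustration.

```r
n <- 54
set.seed(123)
x <- rbinom(n, 1, .4)
y <- rbinom(n, 1, .6)
tbl <- table(x = x, y = y)

# Observation-level (Bernoulli) fit
mod_bern <- glm(y ~ x, family = binomial())

# Collapsed data: one row per level of x, with success and failure counts
collapsed <- data.frame(
  x       = c(0, 1),
  success = c(tbl[1, 2], tbl[2, 2]),  # S01, S11
  failure = c(tbl[1, 1], tbl[2, 1])   # S00, S10
)

# Grouped (binomial) fit on the two-row table
mod_bin <- glm(cbind(success, failure) ~ x, family = binomial(),
               data = collapsed)

# Coefficients and standard errors are expected to match
all.equal(coef(mod_bern), coef(mod_bin), tolerance = 1e-6)
all.equal(sqrt(diag(vcov(mod_bern))), sqrt(diag(vcov(mod_bin))),
          tolerance = 1e-6)
```

The residual deviances of the two fits differ, because each is measured against a different saturated model, but the estimates and their standard errors do not.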

I want to comment on why this works from a theoretical perspective. When we fit a logistic regression, we are using the model $Y_i | x_i \stackrel{\perp}{\sim} \text{Bern}(p_i)$. We then decide to model the mean as a transformation of a linear predictor in $x_i$, or in symbols $p_i = g^{-1}\left( \beta_0 + \beta_1 x_i\right)$. In our case we only have two unique values of $x_i$, and therefore only two unique values of $p_i$, say $p_0$ and $p_1$. Because of our independence assumption we have

$$\sum \limits_{i : x_i = 0} Y_i = S_{01} \sim \text{Bin} \left(n_0, p_0\right)$$

and

$$\sum \limits_{i : x_i = 1} Y_i = S_{11} \sim \text{Bin} \left(n_1, p_1\right),$$

where $n_j = S_{j0} + S_{j1}$ is the number of observations with $x_i = j$. Note how we're using the fact that the $x_i$, and in turn $n_0$ and $n_1$, are nonrandom: if this were not the case then these would not necessarily be binomial.

This means that

$$S_{01} / n_0 = \frac{S_{01}}{S_{00} + S_{01}} \to_p p_0 \hspace{2mm} \text{ and } \hspace{2mm} S_{11} / n_1 = \frac{S_{11}}{S_{10} + S_{11}} \to_p p_1.$$
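This convergence can be illustrated with a small simulation. Everything specific in the sketch is an assumption made for illustration and is not part of the original answer: the true probabilities `p0 = 0.3` and `p1 = 0.7`, the sample sizes, and the alternating fixed design for `x`.

```r
set.seed(1)

# Hypothetical true success probabilities (not from the original answer)
p0 <- 0.3
p1 <- 0.7

sizes <- c(100, 10000, 1000000)
props <- data.frame(n = sizes, prop_x0 = NA_real_, prop_x1 = NA_real_)

for (k in seq_along(sizes)) {
  n <- sizes[k]
  x <- rep(c(0, 1), length.out = n)          # fixed, nonrandom design
  y <- rbinom(n, 1, ifelse(x == 1, p1, p0))  # Y_i | x_i ~ Bern(p_{x_i})
  props$prop_x0[k] <- mean(y[x == 0])        # S01 / n0
  props$prop_x1[k] <- mean(y[x == 1])        # S11 / n1
}

props
```

As `n` grows, the two columns of proportions should settle near `p0` and `p1` respectively.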

The key insight here: our Bernoulli RVs are $Y_i | x_i = j \sim \text{Bern}(p_j)$ while our binomial RVs are $S_{j1} \sim \text{Bin}(n_j, p_j)$ , but both have the same probability of success. That's the reason why these contingency table proportions are estimating the same thing as an observation-level logistic regression. It's not just some coincidence with the table: it's a direct consequence of the distributional assumptions we have made.
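The same point can be read off the likelihood. The following is a sketch of the standard argument, written with the symbols used above and assuming an invertible link $g$ and sample proportions strictly between 0 and 1:

```latex
\begin{align*}
L(\beta_0, \beta_1)
  &= \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} \\
  &= p_0^{S_{01}} (1 - p_0)^{S_{00}} \; p_1^{S_{11}} (1 - p_1)^{S_{10}} \\
  &\propto \binom{n_0}{S_{01}} p_0^{S_{01}} (1 - p_0)^{n_0 - S_{01}}
     \cdot \binom{n_1}{S_{11}} p_1^{S_{11}} (1 - p_1)^{n_1 - S_{11}}.
\end{align*}
```

The Bernoulli likelihood and the product of the two binomial likelihoods differ only by constants that do not involve the parameters, so they share a maximizer. Each factor is maximized separately at $\hat p_j = S_{j1} / n_j$, and because $(\beta_0, \beta_1) \mapsto (p_0, p_1)$ is one-to-one, $\hat\beta_0 = g(\hat p_0)$ and $\hat\beta_0 + \hat\beta_1 = g(\hat p_1)$.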

— jld
1

When you have more than one predictor and all the predictors are binary, you could fit a model using Logic Regression [1] (note it's "Logic", not "Logistic"). It's useful when you believe interaction effects among your predictors are prominent. There's an implementation in R (the LogicReg package).

[1] Ruczinski, I., Kooperberg, C., & LeBlanc, M. (2003). Logic regression. Journal of Computational and Graphical Statistics, 12(3), 475-511.

— horaceT
1

The question is specifically about one regressor, thus your answer would better serve as a comment.

— Richard Hardy