Help me understand the adjusted odds ratio in logistic regression



I have been struggling to understand the use of logistic regression in a paper. The paper in question uses logistic regression to predict the probability of a complication during cataract surgery.

What confuses me is that the paper presents a model that assigns an odds ratio of 1.00 to the baseline, described as follows:

Patients whose risk profile was in the reference group for all risk indicators (i.e., adjusted OR = 1.00 for all risk indicators in Table 1) can be considered to have a "baseline risk profile," and the logistic regression model indicates they have a "baseline predicted probability" of PCR or VL or both of 0.736%.

Thus, an odds ratio of 1.00 stands for a probability of 0.00736. But going by the conversion from probability to odds, $o = p/(1-p)$, the odds cannot equal 1: $0.00741 = 0.00736/(1 - 0.00736)$.

It gets more confusing. A composite odds ratio, representing multiple covariates, takes a value different from the baseline and is used to calculate the predicted risk.

...the composite OR from Table 1 is 1.28 × 1.58 × 2.99 × 2.46 × 1.45 × 1.60 = 34.5, which, from the graph in Figure 1, corresponds to a predicted probability of PCR or VL or both of about 20%.

The only way I can arrive at the value given in the example is to combine the baseline probability with the composite odds ratio like this:

$$0.2025 = \frac{34.50 \times 0.00736}{1 + (34.50 \times 0.00736)}.$$
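As a quick arithmetic check (my own sketch, not part of the paper), this formula does reproduce the quoted value:

```python
# Sanity check of the proposed formula; the two inputs are the values
# quoted from the paper's example.
composite_or = 34.50
base_prob = 0.00736

numerator = composite_or * base_prob
prob = numerator / (1 + numerator)
print(round(prob, 4))  # 0.2025
```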

So what is going on here? What is the logic of assigning an odds ratio of 1 to a baseline probability that is not 0.5? The updated formula I propose above yields the correct probability for the example in the paper, but it is not the direct product of odds ratios I was expecting. What is happening?


You may have a simple terminological confusion: $p/(1-p)$ is an odds, not a probability. An odds ratio is one such expression divided by another.
whuber

Answers:



Odds are a way of expressing chances. Odds ratios are just that: one odds divided by another. That means an odds ratio is what you multiply one odds by to produce another odds. Let's see how they work in this typical situation.

Converting between odds and probabilities

The odds of a binary response $Y$ is the ratio of the chance that it happens (coded with $1$) to the chance that it does not (coded with $0$):

$$\operatorname{Odds}(Y) = \frac{\Pr(Y=1)}{\Pr(Y=0)} = \frac{\Pr(Y=1)}{1 - \Pr(Y=1)}.$$

The equivalent expression on the right shows that it suffices to model $\Pr(Y=1)$ in order to find the odds. Conversely, note that we can solve for $\Pr(Y=1)$:

$$\Pr(Y=1) = \frac{\operatorname{Odds}(Y)}{1 + \operatorname{Odds}(Y)} = 1 - \frac{1}{1 + \operatorname{Odds}(Y)}.$$
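These two conversions are worth having as one-liners. A minimal sketch (the 0.00736 value is the paper's baseline probability; the helper names are my own):

```python
def odds_from_prob(p):
    """Convert a probability to odds: Odds = p / (1 - p)."""
    return p / (1 - p)

def prob_from_odds(o):
    """Convert odds back to a probability: Pr = o / (1 + o)."""
    return o / (1 + o)

# The paper's baseline probability of 0.736% corresponds to odds of about
# 0.00741, not 1.00 -- which is the source of the question's confusion.
print(round(odds_from_prob(0.00736), 5))  # 0.00741
```

Note that the two functions are inverses of each other, so round-tripping a probability through them returns it unchanged.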

Logistic regression

Logistic regression models the logarithm of the odds of $Y$ as a linear function of the explanatory variables. Most generally, writing these variables as $x_1, \ldots, x_p$, and including a possible constant term in the linear function, we may name the coefficients (which are to be estimated from the data) $\beta_1, \ldots, \beta_p$ and $\beta_0$. Formally, the model is

$$\log(\operatorname{Odds}(Y)) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$

The odds themselves can be recovered by undoing the logarithm:

$$\operatorname{Odds}(Y) = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p).$$
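In code, recovering the odds (and then the probability) from the linear predictor is just as direct. A minimal sketch with made-up coefficients; none of these numbers come from the paper:

```python
import math

# Hypothetical estimated coefficients -- illustrative only, not from the paper.
beta0 = -4.9                 # intercept
betas = [0.25, 0.46]         # coefficients of two explanatory variables
x = [1.0, 0.5]               # one individual's values of those variables

log_odds = beta0 + sum(b * xi for b, xi in zip(betas, x))  # linear predictor
odds = math.exp(log_odds)    # undo the logarithm
prob = odds / (1 + odds)     # convert odds to a probability
```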

Using categorical variables

Categorical variables, such as age group, gender, presence of Glaucoma, etc., are incorporated by means of "dummy coding." To show that the way a variable is coded does not matter, I will work through a simple example with one small group; its generalization to multiple groups should be obvious. In this study one variable is "pupil size," with three categories, "Large", "Medium", and "Small". (The study treats these as purely categorical, apparently paying no attention to their inherent order.) Intuitively, each category has its own coefficient, say $\alpha_L$ for "Large", $\alpha_M$ for "Medium", and $\alpha_S$ for "Small". This means that, all other things equal,

$$\operatorname{Odds}(Y) = \exp(\alpha_L + \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)$$

for anybody in the "Large" category,

$$\operatorname{Odds}(Y) = \exp(\alpha_M + \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)$$

for anybody in the "Medium" category, and

$$\operatorname{Odds}(Y) = \exp(\alpha_S + \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)$$

for those in the "Small" category.

Creating identifiable coefficients

Look closely at the first two coefficients in each of these expressions, $\alpha_\bullet$ and $\beta_0$, because they allow a simple change to occur: we could pick any number $\gamma$ and, by adding it to $\beta_0$ and subtracting it from each of $\alpha_L$, $\alpha_M$, and $\alpha_S$, we would not change any predicted odds. This is because of the obvious equivalences of the form

$$\alpha_L + \beta_0 = (\alpha_L - \gamma) + (\gamma + \beta_0),$$

etc. Although this presents no problems for the model--it still predicts exactly the same things--it shows that the parameters are not in themselves interpretable. What stays the same when we do this addition-subtraction maneuver are the differences between the coefficients. Conventionally, to address this lack of identifiability, people (and by default, software) choose one of the categories in each variable as the "base" or "reference" and simply stipulate that its coefficient will be zero. This removes the ambiguity.

The paper lists reference categories first; "Large" in this case. Thus, $\alpha_L$ is subtracted from each of $\alpha_L$, $\alpha_M$, and $\alpha_S$, and added to $\beta_0$ to compensate.
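The addition-subtraction maneuver, and the conventional fix of zeroing the reference coefficient, can be checked numerically. A sketch with made-up coefficients, not values from the paper:

```python
import math

# Hypothetical category coefficients and intercept -- illustrative only.
alpha = {"Large": 0.3, "Medium": 0.7, "Small": 1.2}
beta0 = -4.0

def odds(category, alpha, beta0):
    """Predicted odds for a category (covariate terms omitted for brevity)."""
    return math.exp(alpha[category] + beta0)

# Choosing gamma = alpha["Large"] makes "Large" the reference category:
# its coefficient becomes 0, and every predicted odds is unchanged.
gamma = alpha["Large"]
alpha_ref = {k: v - gamma for k, v in alpha.items()}
beta0_ref = beta0 + gamma

for cat in alpha:
    assert math.isclose(odds(cat, alpha, beta0), odds(cat, alpha_ref, beta0_ref))
```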

The log odds for a hypothetical individual falling into all the base categories therefore equals β0 plus a bunch of terms associated with all other "covariates"--the non-categorical variables:

$$\operatorname{Odds}(\text{Base category}) = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p).$$

No terms associated with any categorical variables appear here. (I have slightly changed the notation at this point: the betas βi now are the coefficients only of the covariates, while the full model includes the alphas αj for the various categories.)

Comparing odds

Let us compare odds. Suppose a hypothetical individual is a

male patient aged 80–89 with a white cataract, no fundal view, and a small pupil being operated on by a specialist registrar, ...

Associated with this patient (let's call him Charlie) are estimated coefficients for each category: α80-89 for his age group, αmale for being male, and so on. Wherever his attribute is the base for its category, the coefficient is zero by convention, as we have seen. Because this is a linear model, the coefficients add. Thus, to the base log odds given above, the log odds for this patient are obtained by adding in

$$\alpha_{80\text{--}89} + \alpha_{\text{male}} + \alpha_{\text{no Glaucoma}} + \cdots + \alpha_{\text{specialist registrar}}.$$

This is precisely the amount by which the log odds of this patient vary from the base. To convert from log odds, undo the logarithm and recall that this turns addition into multiplication. Therefore, the base odds must be multiplied by

$$\exp(\alpha_{80\text{--}89}) \exp(\alpha_{\text{male}}) \exp(\alpha_{\text{no Glaucoma}}) \cdots \exp(\alpha_{\text{specialist registrar}}).$$

These are the numbers given in the table under "Adjusted OR" (adjusted odds ratio). (It is called "adjusted" because the covariates $x_1, \ldots, x_p$ were included in the model. They play no role in any of our calculations, as you will see. It is called a "ratio" because it is precisely the amount by which the base odds must be multiplied to produce the patient's predicted odds: see the first paragraph of this post.) In order in the table, they are $\exp(\alpha_{80\text{--}89}) = 1.58$, $\exp(\alpha_{\text{male}}) = 1.28$, $\exp(\alpha_{\text{no Glaucoma}}) = 1.00$, and so on. According to the article, their product works out to $34.5$. Therefore

$$\operatorname{Odds}(\text{Charlie}) = 34.5 \times \operatorname{Odds}(\text{Base}).$$

(Notice that the base categories all have odds ratios of $1.00 = \exp(0)$, because including $1$ in the product leaves it unchanged. That's how you can spot the base categories in the table.)

Restating the results as probabilities

Finally, let us convert this result to probabilities. We were told the baseline predicted probability is 0.736%=0.00736. Therefore, using the formulas relating odds and probabilities derived at the outset, we may compute

$$\operatorname{Odds}(\text{Base}) = \frac{0.00736}{1 - 0.00736} = 0.00741.$$

Consequently Charlie's odds are

$$\operatorname{Odds}(\text{Charlie}) = 34.5 \times 0.00741 = 0.256.$$

Finally, converting this back to probabilities gives

$$\Pr(Y(\text{Charlie}) = 1) = 1 - \frac{1}{1 + 0.256} = 0.204.$$
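The whole calculation can be reproduced in a few lines; the adjusted ORs and the baseline probability below are the values quoted from the paper's Table 1:

```python
# Adjusted odds ratios for Charlie's categories, as quoted from Table 1.
adjusted_ors = [1.28, 1.58, 2.99, 2.46, 1.45, 1.60]

composite_or = 1.0
for r in adjusted_ors:
    composite_or *= r                      # product of ORs, about 34.5

base_prob = 0.00736                        # baseline predicted probability
base_odds = base_prob / (1 - base_prob)    # about 0.00741
charlie_odds = composite_or * base_odds    # about 0.256
charlie_prob = charlie_odds / (1 + charlie_odds)
print(round(charlie_prob, 3))  # 0.204
```

This confirms the chain of the answer: multiply the base odds (not the base probability) by the composite odds ratio, then convert the resulting odds back to a probability.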

whuber: getting in front of my computer after a very tiring previous day and finding this extraordinary response from you is simply brilliant. You have helped me a lot in a very tight situation. Many thanks. (somehow @ whuber won't show up...)
mahonya