Why is Newton's method, when used to optimize logistic regression, called iteratively reweighted least squares?
It isn't clear to me, because the logistic loss and the least squares loss are completely different things.
Answers:
Summary: GLMs are fit via Fisher scoring which, as Dimitriy V. Masterov notes, is Newton-Raphson with the expected Hessian instead (i.e. we use an estimate of the Fisher information rather than the observed information). If we use the canonical link function, it turns out that the observed Hessian equals the expected Hessian, so Newton-Raphson and Fisher scoring are the same in that case. Either way, we'll see that Fisher scoring is actually fitting a weighted least squares linear model, and the coefficient estimates from this converge* on a maximizer of the logistic regression likelihood. Aside from reducing the fitting of a logistic regression to an already-solved problem, we also get the benefit of being able to use linear regression diagnostics on the final WLS fit to learn about our logistic regression.
I'm going to keep this focused on logistic regression, but for a more general perspective on maximum likelihood in GLMs I recommend section 15.3 of this chapter, which goes through this material and derives IRLS in a more general setting (I believe it comes from John Fox's Applied Regression Analysis and Generalized Linear Models).
*see the comments on convergence at the end
We'll be fitting our GLM by iterating something of the form
$$b^{(m+1)} = b^{(m)} - J^{-1} \nabla \ell\left(b^{(m)}\right),$$
where $\ell$ is the log likelihood and $J$ is either the observed or the expected Hessian of the log likelihood.

Our link function is a function $g$ mapping the conditional mean $\mu_i = E(y_i \mid x_i)$ to our linear predictor, so our model for the mean is $g(\mu_i) = x_i^T b$. Let $h$ be the inverse link function mapping the linear predictor to the mean.

For logistic regression we have a Bernoulli likelihood with independent observations, so
$$\ell(b; y) = \sum_{i=1}^n \left[ y_i \log h(x_i^T b) + (1 - y_i) \log\left(1 - h(x_i^T b)\right) \right].$$
Taking derivatives,
$$\frac{\partial \ell}{\partial b_j} = \sum_{i=1}^n \frac{h'(x_i^T b)\, x_{ij}}{h(x_i^T b)\left(1 - h(x_i^T b)\right)} \left(y_i - h(x_i^T b)\right).$$

Now, suppose we use the canonical link function $g_c = \operatorname{logit}$. Then $g_c^{-1}(x) := h_c(x) = \frac{1}{1 + e^{-x}}$, so $h_c' = h_c \cdot (1 - h_c)$, which means the score simplifies to
$$\frac{\partial \ell}{\partial b_j} = \sum_{i=1}^n x_{ij} \left(y_i - h_c(x_i^T b)\right),$$
so that $\nabla \ell(b; y) = X^T (y - \hat{y})$. Furthermore, still using $h_c$,
$$\frac{\partial^2 \ell}{\partial b_k\, \partial b_j} = -\sum_{i=1}^n x_{ij}\, x_{ik}\, h_c'(x_i^T b).$$

Let
$$W = \operatorname{diag}\left(h_c'(x_1^T b), \dots, h_c'(x_n^T b)\right).$$
Then we have
$$H = -X^T W X,$$
and note how this no longer contains any $y_i$, so $E(H) = H$ (we are viewing this as a function of $b$, so the only random thing is $y$ itself). Thus we have shown that Fisher scoring is equivalent to Newton-Raphson when we use the canonical link in logistic regression. Also, by virtue of $h_c(x_i^T b) \in (0, 1)$, $-X^T W X$ is negative definite. Now take a working estimate $b^{(m)}$ and write $W_m$ and $\hat{y}_m$ for the weights and fitted means evaluated at it. The Newton-Raphson (equivalently, Fisher scoring) update is
$$b^{(m+1)} = b^{(m)} + \left(X^T W_m X\right)^{-1} X^T (y - \hat{y}_m),$$
and letting $z_m = W_m^{-1}(y - \hat{y}_m)$ this becomes
$$b^{(m+1)} = b^{(m)} + \left(X^T W_m X\right)^{-1} X^T W_m z_m,$$
i.e. $b^{(m)}$ plus the coefficients of a weighted least squares regression of the working residual $z_m$ on $X$ with weights $W_m$. Since the weights and working residual are recomputed at every iteration, this is "iteratively reweighted least squares".
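To make the algebra-to-code correspondence explicit, here is a minimal sketch of one such update written with explicit matrix operations instead of lm(); one_step_canonical and b.cur are illustrative names of my own, and the sketch assumes the same x, y, and inverse link hc() that are set up in the R check below.

# one Fisher scoring / Newton-Raphson step under the canonical (logit) link:
# b_new = b + (X' W X)^{-1} X' (y - yhat)  with  W = diag(yhat * (1 - yhat))
one_step_canonical <- function(x, y, b.cur, hc) {
  eta   <- as.vector(x %*% b.cur)   # linear predictor
  y.hat <- hc(eta)                  # fitted means
  w     <- y.hat * (1 - y.hat)      # diagonal of W
  as.vector(b.cur + solve(t(x) %*% (w * x), t(x) %*% (y - y.hat)))
}
# iterating this map to convergence gives the same fit as the lm()-based loop below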
Checking this in R:
set.seed(123)
p <- 5
n <- 500
x <- matrix(rnorm(n * p), n, p)
betas <- runif(p, -2, 2)
hc <- function(x) 1 /(1 + exp(-x)) # inverse canonical link
p.true <- hc(x %*% betas)
y <- rbinom(n, 1, p.true)
# fitting with our procedure
my_IRLS_canonical <- function(x, y, b.init, hc, tol = 1e-8) {
  change <- Inf
  b.old <- b.init
  while (change > tol) {
    eta <- x %*% b.old                 # linear predictor
    y.hat <- hc(eta)                   # fitted means
    h.prime_eta <- y.hat * (1 - y.hat)
    z <- (y - y.hat) / h.prime_eta     # working residual z = W^{-1}(y - yhat)
    b.new <- b.old + lm(z ~ x - 1, weights = h.prime_eta)$coef  # WLS regression
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}
my_IRLS_canonical(x, y, rep(1,p), hc)
# x1 x2 x3 x4 x5
# -1.1149687 2.1897992 1.0271298 0.8702975 -1.2074851
glm(y ~ x - 1, family=binomial())$coef
# x1 x2 x3 x4 x5
# -1.1149687 2.1897992 1.0271298 0.8702975 -1.2074851
and they agree.
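As a quick aside before moving to non-canonical links, the point in the summary about linear regression diagnostics can be made concrete by re-running the final WLS at the converged coefficients, this time with the working response $X b + z$ so that the WLS coefficients reproduce the IRLS fit. This is only a sketch; b.hat, wr, and final.wls are illustrative names, not part of the procedure above.

b.hat <- my_IRLS_canonical(x, y, rep(1, p), hc)
eta   <- as.vector(x %*% b.hat)
y.hat <- hc(eta)
w     <- y.hat * (1 - y.hat)     # final IRLS weights
wr    <- eta + (y - y.hat) / w   # working response X b + z
final.wls <- lm(wr ~ x - 1, weights = w)
round(cbind(irls = as.vector(b.hat), wls = as.vector(coef(final.wls))), 6)  # should match
head(hatvalues(final.wls))       # ordinary WLS leverages etc. are now available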
Now, if we are not using the canonical link we do not get the simplification $h' = h \cdot (1 - h)$, so the score and Hessian become more complicated and, in general, $E(H) \neq H$; this is where we use the expected Hessian in our Fisher scoring.

Here's how this will go: we already worked out the general $\frac{\partial \ell}{\partial b_j}$ above, so the Hessian will be the main difficulty. We need
$$\frac{\partial^2 \ell}{\partial b_k\, \partial b_j} = \sum_{i=1}^n x_{ij}\, \frac{\partial}{\partial b_k} \left[\frac{h'(x_i^T b)}{h(x_i^T b)\left(1 - h(x_i^T b)\right)} \left(y_i - h(x_i^T b)\right)\right].$$

Via the linearity of expectation, all we need to do to get $E(H)$ is replace each occurrence of $y_i$ with its mean under our model, which is $\mu_i = h(x_i^T b)$. Each term in the summand that comes from differentiating the leading factor therefore contains a factor of the form $\mu_i - h(x_i^T b) = 0$, so only the term from differentiating $y_i - h(x_i^T b)$ survives, and
$$E\left(\frac{\partial^2 \ell}{\partial b_k\, \partial b_j}\right) = -\sum_{i=1}^n x_{ij}\, x_{ik}\, \frac{h'(x_i^T b)^2}{h(x_i^T b)\left(1 - h(x_i^T b)\right)}.$$

Now let
$$W^* = \operatorname{diag}\left(\frac{h'(x_1^T b)^2}{h(x_1^T b)\left(1 - h(x_1^T b)\right)}, \dots, \frac{h'(x_n^T b)^2}{h(x_n^T b)\left(1 - h(x_n^T b)\right)}\right)$$
and let $z^*$ have entries $z_i^* = \frac{y_i - h(x_i^T b)}{h'(x_i^T b)}$, so that the general score can be written $\nabla \ell(b; y) = X^T W^* z^*$.

We have
$$E(H) = -X^T W^* X.$$

All together we are iterating
$$b^{(m+1)} = b^{(m)} + \left(X^T W_m^* X\right)^{-1} X^T W_m^* z_m^*,$$
which is again a weighted least squares regression of the working residual $z_m^*$ on $X$, now with the weights $W_m^*$ recomputed at each step.

I've written it out this way to emphasize the connection to Newton-Raphson, but frequently people will factor the updates so that each new point $b^{(m+1)}$ is itself the WLS solution, rather than a WLS solution added to the current point $b^{(m)}$. If we wanted to do this, we can do the following:
$$b^{(m+1)} = \left(X^T W_m^* X\right)^{-1} X^T W_m^* \left(X b^{(m)} + z_m^*\right),$$
so with the working response defined as $X b^{(m)} + z_m^*$, each $b^{(m+1)}$ is itself the coefficient vector of a WLS regression of that working response on $X$ with weights $W_m^*$.
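For instance, here is a minimal sketch of that factored variant (my_IRLS_factored is just an illustrative name of mine); algebraically it produces the same iterates as the additive form, and the probit check below sticks with the additive version.

my_IRLS_factored <- function(x, y, b.init, h, h.prime, tol = 1e-8) {
  change <- Inf
  b.old <- b.init
  while (change > tol) {
    eta         <- as.vector(x %*% b.old)                  # linear predictor
    y.hat       <- h(eta)                                  # fitted means
    h.prime_eta <- h.prime(eta)
    w_star      <- h.prime_eta^2 / (y.hat * (1 - y.hat))   # expected-Hessian weights
    wr          <- eta + (y - y.hat) / h.prime_eta         # working response X b + z*
    b.new  <- lm(wr ~ x - 1, weights = w_star)$coef        # new point is the WLS fit itself
    change <- sqrt(sum((b.new - b.old)^2))
    b.old  <- b.new
  }
  b.new
}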
Let's confirm that this works by using it to perform a probit regression on the same simulated data as before (and this is not the canonical link, so we need this more general form of IRLS).
my_IRLS_general <- function(x, y, b.init, h, h.prime, tol = 1e-8) {
  change <- Inf
  b.old <- b.init
  while (change > tol) {
    eta <- x %*% b.old                               # linear predictor
    y.hat <- h(eta)                                  # fitted means
    h.prime_eta <- h.prime(eta)
    w_star <- h.prime_eta^2 / (y.hat * (1 - y.hat))  # expected-Hessian weights
    z_star <- (y - y.hat) / h.prime_eta              # working residual
    b.new <- b.old + lm(z_star ~ x - 1, weights = w_star)$coef  # WLS
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}
# probit inverse link and derivative
h_probit <- function(x) pnorm(x, 0, 1)
h.prime_probit <- function(x) dnorm(x, 0, 1)
my_IRLS_general(x, y, rep(0,p), h_probit, h.prime_probit)
# x1 x2 x3 x4 x5
# -0.6456508 1.2520266 0.5820856 0.4982678 -0.6768585
glm(y~x-1, family=binomial(link="probit"))$coef
# x1 x2 x3 x4 x5
# -0.6456490 1.2520241 0.5820835 0.4982663 -0.6768581
and again the two agree.
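As one more sanity check, note that R's glm() itself fits by iteratively reweighted least squares (see ?glm.fit), and you can watch its iterations via the trace option of glm.control; the exact printout format may vary by R version.

# print the deviance at each IWLS iteration of R's own fitting routine
glm(y ~ x - 1, family = binomial(link = "probit"),
    control = glm.control(trace = TRUE))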
Finally, a few quick comments on convergence (I'll keep this brief as this is getting really long and I'm no expert at optimization). Even though theoretically each $-X^T W^* X$ is negative definite, bad initial conditions can still prevent this algorithm from converging. In the probit example above, changing the initial conditions to b.init=rep(1,p) results in a failure to converge, and that doesn't even look like a suspicious initial condition. If you step through the IRLS procedure with that initialization and these simulated data, by the second time through the loop there are some $\hat{y}_i$ that round to exactly $0$ or $1$, and so the weights become undefined. If we're using the canonical link in the algorithm I gave, we won't ever be dividing by $\hat{y}_i (1 - \hat{y}_i)$ to get undefined weights, but if we've got a situation where some $\hat{y}_i$ are approaching $0$ or $1$, such as in the case of perfect separation, then we'll still get non-convergence as the gradient dies without us ever reaching anything.
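One simple numerical safeguard, not part of the algorithm above but a common patch, is to clamp the fitted means strictly inside $(0, 1)$ before forming the weights; clamp_means is an illustrative name and eps an arbitrary small tolerance. This keeps the weights finite but of course does nothing about the underlying separation problem just described.

# keep fitted means away from 0 and 1 so h'(eta)^2 / (y.hat * (1 - y.hat))
# never divides by zero; eps is an arbitrary choice
clamp_means <- function(y.hat, eps = 1e-10) pmin(pmax(y.hat, eps), 1 - eps)
# inside the while loop one would then use, e.g.:
#   y.hat <- clamp_means(h(eta))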