Why is Newton's method for logistic regression optimization called iteratively reweighted least squares?




It seems unclear to me, because the logistic loss and the least squares loss are completely different things.


I don't think they are the same. IRLS is Newton-Raphson with the expected Hessian rather than the observed Hessian.
Dimitriy V. Masterov

@DimitriyV.Masterov Thanks, could you tell me more about the expected Hessian vs. the observed one? Also, what do you think of this explanation?
Haitao Du

Answers:



Summary: GLMs are fit via Fisher scoring which, as Dimitriy V. Masterov notes, is Newton-Raphson with the expected Hessian instead (i.e., we use an estimate of the Fisher information rather than the observed information). If we are using the canonical link function, it turns out that the observed Hessian equals the expected Hessian, so Newton-Raphson and Fisher scoring are the same in that case. Either way, we'll see that Fisher scoring is actually fitting a weighted least squares linear model, and the coefficient estimates from this converge* on a maximum of the logistic regression likelihood. Aside from reducing the fitting of a logistic regression to an already-solved problem, we also get the benefit of being able to use linear regression diagnostics on the final WLS fit to learn about our logistic regression.

I'm going to keep this focused on logistic regression, but for a more general perspective on maximum likelihood in GLMs I recommend section 15.3 of this chapter, which goes through this and derives IRLS in a more general setting (I believe it's from John Fox's Applied Regression Analysis and Generalized Linear Models).

*see the comments on convergence at the end


The likelihood and score function

We will fit our GLM by iterating something of the form

$$b^{(m+1)} = b^{(m)} - J^{(m)\,-1} \nabla \ell\left(b^{(m)}\right)$$

where $\ell$ is the log likelihood and $J^{(m)}$ will be either the observed or the expected Hessian of the log likelihood.

Our link function is a function $g$ that maps the conditional mean $\mu_i = E(y_i \mid x_i)$ to our linear predictor, so our model for the mean is $g(\mu_i) = x_i^T \beta$. Let $h$ be the inverse link function mapping the linear predictor to the mean.
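For concreteness, a minimal sketch in R of such a link/inverse-link pair for the logit case used below (the names g_logit and h_logit are just illustrative; base R already provides these as qlogis and plogis):

# logit link g: mean -> linear predictor, and its inverse h: linear predictor -> mean
g_logit <- function(mu)  log(mu / (1 - mu))    # same as qlogis(mu)
h_logit <- function(eta) 1 / (1 + exp(-eta))   # same as plogis(eta)

mu <- 0.7
all.equal(h_logit(g_logit(mu)), mu)  # TRUE: h really is the inverse of g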

For a logistic regression we have a Bernoulli likelihood with independent observations, so

$$\ell(b; y) = \sum_{i=1}^n y_i \log h(x_i^T b) + (1 - y_i) \log\left(1 - h(x_i^T b)\right).$$
Taking derivatives,

$$\frac{\partial \ell}{\partial b_j} = \sum_{i=1}^n \frac{y_i}{h(x_i^T b)} h'(x_i^T b) x_{ij} - \frac{1 - y_i}{1 - h(x_i^T b)} h'(x_i^T b) x_{ij}$$

$$= \sum_{i=1}^n x_{ij} h'(x_i^T b) \left( \frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right)$$

$$= \sum_i \frac{x_{ij} h'(x_i^T b)}{h(x_i^T b)\left(1 - h(x_i^T b)\right)} \left( y_i - h(x_i^T b) \right).$$
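As a quick sanity check of this score formula (my own illustration with made-up names, not part of the original answer), we can compare the closed-form gradient above to a central finite-difference gradient of the Bernoulli log likelihood:

set.seed(1)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)
b <- runif(p, -1, 1)
h       <- function(eta) 1 / (1 + exp(-eta))   # inverse link
h.prime <- function(eta) h(eta) * (1 - h(eta)) # its derivative
y <- rbinom(n, 1, h(X %*% b))

loglik <- function(b) sum(y * log(h(X %*% b)) + (1 - y) * log(1 - h(X %*% b)))

# closed-form score from the last line of the derivation above
eta <- as.vector(X %*% b)
score_analytic <- as.vector(
  t(X) %*% (h.prime(eta) / (h(eta) * (1 - h(eta))) * (y - h(eta)))
)

# central finite differences of the log likelihood
eps <- 1e-6
score_numeric <- sapply(seq_len(p), function(j) {
  e <- replace(rep(0, p), j, eps)
  (loglik(b + e) - loglik(b - e)) / (2 * eps)
})

all.equal(score_analytic, score_numeric, tolerance = 1e-5)  # TRUE (up to rounding error)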

Using the canonical link

Now suppose we are using the canonical link function $g_c = \operatorname{logit}$. Then $g_c^{-1}(x) := h_c(x) = \frac{1}{1 + e^{-x}}$, so $h_c' = h_c(1 - h_c)$, which means the gradient simplifies to

$$\frac{\partial \ell}{\partial b_j} = \sum_i x_{ij}\left(y_i - h_c(x_i^T b)\right)$$

so

$$\nabla \ell(b; y) = X^T\left(y - \hat{y}\right).$$

Furthermore, still using $h_c$,

$$\frac{\partial^2 \ell}{\partial b_k \partial b_j} = -\sum_i x_{ij} \frac{\partial}{\partial b_k} h_c(x_i^T b) = -\sum_i x_{ij} x_{ik} \left[h_c(x_i^T b)\left(1 - h_c(x_i^T b)\right)\right].$$

Let

$$W = \operatorname{diag}\left(h_c(x_1^T b)\left(1 - h_c(x_1^T b)\right), \ldots, h_c(x_n^T b)\left(1 - h_c(x_n^T b)\right)\right) = \operatorname{diag}\left(\hat{y}_1(1 - \hat{y}_1), \ldots, \hat{y}_n(1 - \hat{y}_n)\right).$$

Then we have

$$H = -X^T W X$$

and note how this no longer contains any $y_i$, so $E(H) = H$ (we're viewing $H$ as a function of $b$, so the only random thing is $y$ itself). Thus we've shown that Fisher scoring is equivalent to Newton-Raphson when we use the canonical link in logistic regression. Also, by virtue of $\hat{y}_i \in (0, 1)$, $-X^T W X$ will always be strictly negative definite, although numerically if $\hat{y}_i$ gets too close to $0$ or $1$ then we may have weights round to $0$, which can make $H$ negative semidefinite and therefore computationally singular.
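A small numerical illustration of this (my own sketch with made-up names, not from the original answer): we can build $H = -X^T W X$ for the logit link at an arbitrary $b$ and check its eigenvalues; note that $y$ never enters the computation, which is exactly the $E(H) = H$ point.

set.seed(1)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)        # toy design matrix
h <- function(eta) 1 / (1 + exp(-eta)) # logit inverse link
b <- rep(0.5, p)                       # an arbitrary coefficient value

y.hat <- as.vector(h(X %*% b))
W <- y.hat * (1 - y.hat)               # diagonal of W
H <- -t(X) %*% (W * X)                 # H = -X^T W X; no y anywhere
max(eigen(H, symmetric = TRUE)$values) < 0  # TRUE: strictly negative definite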

Now define the working response $z = W^{-1}(y - \hat{y})$. Then we have

$$\nabla \ell = X^T(y - \hat{y}) = X^T W z.$$

All together, this means we can optimize the log likelihood by iterating

$$b^{(m+1)} = b^{(m)} + \left(X^T W^{(m)} X\right)^{-1} X^T W^{(m)} z^{(m)}$$

and $\left(X^T W^{(m)} X\right)^{-1} X^T W^{(m)} z^{(m)}$ is exactly $\hat{\beta}$ for a weighted least squares regression of $z^{(m)}$ on $X$.

Checking this in R:

set.seed(123)
p <- 5
n <- 500
x <- matrix(rnorm(n * p), n, p)
betas <- runif(p, -2, 2)
hc <- function(x) 1 /(1 + exp(-x)) # inverse canonical link
p.true <- hc(x %*% betas)
y <- rbinom(n, 1, p.true)

# fitting with our procedure
my_IRLS_canonical <- function(x, y, b.init, hc, tol=1e-8) {
  change <- Inf
  b.old <- b.init
  while(change > tol) {
    eta <- x %*% b.old  # linear predictor
    y.hat <- hc(eta)
    h.prime_eta <- y.hat * (1 - y.hat)
    z <- (y - y.hat) / h.prime_eta

    b.new <- b.old + lm(z ~ x - 1, weights = h.prime_eta)$coef  # WLS regression
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

my_IRLS_canonical(x, y, rep(1,p), hc)
# x1         x2         x3         x4         x5 
# -1.1149687  2.1897992  1.0271298  0.8702975 -1.2074851

glm(y ~ x - 1, family=binomial())$coef
# x1         x2         x3         x4         x5 
# -1.1149687  2.1897992  1.0271298  0.8702975 -1.2074851 

They agree.


Non-canonical link functions

For a non-canonical link we no longer have the simplification $\frac{h'}{h(1 - h)} = 1$, so the Hessian $H$ genuinely depends on $y$, and this is where we see a real difference by using $E(H)$ in our Fisher scoring.

Here's how this will go: we already worked out the general $\nabla \ell$, so the Hessian will be the main difficulty. We need

$$\frac{\partial^2 \ell}{\partial b_k \partial b_j} = \sum_i x_{ij} \frac{\partial}{\partial b_k} \left[ h'(x_i^T b) \left( \frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right) \right]$$

$$= \sum_i x_{ij} x_{ik} \left[ h''(x_i^T b) \left( \frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2 \left( \frac{y_i}{h(x_i^T b)^2} + \frac{1 - y_i}{\left(1 - h(x_i^T b)\right)^2} \right) \right]$$

Via the linearity of expectation, all we need to do to get $E(H)$ is replace each occurrence of $y_i$ with its mean under our model, which is $\mu_i = h(x_i^T \beta)$. Each term in the summand will therefore contain a factor of the form

$$h''(x_i^T b) \left( \frac{h(x_i^T \beta)}{h(x_i^T b)} - \frac{1 - h(x_i^T \beta)}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2 \left( \frac{h(x_i^T \beta)}{h(x_i^T b)^2} + \frac{1 - h(x_i^T \beta)}{\left(1 - h(x_i^T b)\right)^2} \right).$$
But to actually do our optimization we'll need to estimate $\beta$, and at step $m$, $b^{(m)}$ is the best guess we have. This means that this will reduce to
$$h''(x_i^T b) \left( \frac{h(x_i^T b)}{h(x_i^T b)} - \frac{1 - h(x_i^T b)}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2 \left( \frac{h(x_i^T b)}{h(x_i^T b)^2} + \frac{1 - h(x_i^T b)}{\left(1 - h(x_i^T b)\right)^2} \right)$$

$$= -h'(x_i^T b)^2 \left( \frac{1}{h(x_i^T b)} + \frac{1}{1 - h(x_i^T b)} \right)$$

$$= -\frac{h'(x_i^T b)^2}{h(x_i^T b)\left(1 - h(x_i^T b)\right)}.$$
This means we will use $J$ with

$$J_{jk} = -\sum_i x_{ij} x_{ik} \frac{h'(x_i^T b)^2}{h(x_i^T b)\left(1 - h(x_i^T b)\right)}.$$

Now let

$$W^* = \operatorname{diag}\left(\frac{h'(x_1^T b)^2}{h(x_1^T b)\left(1 - h(x_1^T b)\right)}, \ldots, \frac{h'(x_n^T b)^2}{h(x_n^T b)\left(1 - h(x_n^T b)\right)}\right)$$

and note how under the canonical link $h_c' = h_c(1 - h_c)$ reduces $W^*$ to the $W$ from the previous section. This lets us write

$$J = -X^T W^* X$$

except this is now $\hat{E}(H)$ rather than necessarily being $H$ itself, so it can differ from Newton-Raphson. For all $i$ we have $W^*_{ii} > 0$, so aside from numerical issues $J$ will be negative definite.
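To see this difference concretely, here is a small sketch (my own, with made-up names, not from the original answer) comparing the observed Hessian from the general second-derivative formula above with the expected Hessian $J = -X^T W^* X$ for the probit link, where $h = \Phi$, $h' = \phi$, and $h''(t) = -t\,\phi(t)$:

set.seed(1)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)
b <- runif(p, -1, 1)
h   <- function(t) pnorm(t)       # probit inverse link
hp  <- function(t) dnorm(t)       # h'
hpp <- function(t) -t * dnorm(t)  # h''
y <- rbinom(n, 1, h(X %*% b))
eta <- as.vector(X %*% b)

# observed Hessian: per-observation factor from the second-derivative formula above
obs <- hpp(eta) * (y / h(eta) - (1 - y) / (1 - h(eta))) -
       hp(eta)^2 * (y / h(eta)^2 + (1 - y) / (1 - h(eta))^2)
H_obs <- t(X) %*% (obs * X)

# expected (estimated) Hessian J = -X^T W* X
w_star <- hp(eta)^2 / (h(eta) * (1 - h(eta)))
J <- -t(X) %*% (w_star * X)

max(abs(H_obs - J))  # nonzero: with a non-canonical link the two genuinely differ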

We have

$$\frac{\partial \ell}{\partial b_j} = \sum_i x_{ij} \frac{h'(x_i^T b)}{h(x_i^T b)\left(1 - h(x_i^T b)\right)} \left(y_i - h(x_i^T b)\right)$$

so letting our new working response be $z^* = D^{-1}(y - \hat{y})$ with $D = \operatorname{diag}\left(h'(x_1^T b), \ldots, h'(x_n^T b)\right)$, we have $\nabla \ell = X^T W^* z^*$.

All together we are iterating

$$b^{(m+1)} = b^{(m)} + \left(X^T W^{*(m)} X\right)^{-1} X^T W^{*(m)} z^{*(m)}$$
so this is still a sequence of WLS regressions except now it's not necessarily Newton-Raphson.

I've written it out this way to emphasize the connection to Newton-Raphson, but frequently people will factor the updates so that each new point $b^{(m+1)}$ is itself the WLS solution, rather than a WLS solution added to the current point $b^{(m)}$. If we wanted to do this, we can do the following:

$$b^{(m+1)} = b^{(m)} + \left(X^T W^{*(m)} X\right)^{-1} X^T W^{*(m)} z^{*(m)}$$

$$= \left(X^T W^{*(m)} X\right)^{-1} \left(X^T W^{*(m)} X b^{(m)} + X^T W^{*(m)} z^{*(m)}\right)$$

$$= \left(X^T W^{*(m)} X\right)^{-1} X^T W^{*(m)} \left(X b^{(m)} + z^{*(m)}\right)$$
so if we're going this way you'll see the working response take the form $\eta^{(m)} + D^{(m)-1}\left(y - \hat{y}^{(m)}\right)$, but it's the same thing.

Let's confirm that this works by using it to perform a probit regression on the same simulated data as before (and this is not the canonical link, so we need this more general form of IRLS).

my_IRLS_general <- function(x, y, b.init, h, h.prime, tol=1e-8) {
  change <- Inf
  b.old <- b.init
  while(change > tol) {
    eta <- x %*% b.old  # linear predictor
    y.hat <- h(eta)
    h.prime_eta <- h.prime(eta)
    w_star <- h.prime_eta^2 / (y.hat * (1 - y.hat))
    z_star <- (y - y.hat) / h.prime_eta

    b.new <- b.old + lm(z_star ~ x - 1, weights = w_star)$coef  # WLS

    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

# probit inverse link and derivative
h_probit <- function(x) pnorm(x, 0, 1)
h.prime_probit <- function(x) dnorm(x, 0, 1)

my_IRLS_general(x, y, rep(0,p), h_probit, h.prime_probit)
# x1         x2         x3         x4         x5 
# -0.6456508  1.2520266  0.5820856  0.4982678 -0.6768585 

glm(y~x-1, family=binomial(link="probit"))$coef
# x1         x2         x3         x4         x5 
# -0.6456490  1.2520241  0.5820835  0.4982663 -0.6768581 

and again the two agree.
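For completeness, here is a minimal sketch of the factored form of the update described earlier, where each $b^{(m+1)}$ is itself a WLS fit on the working response $\eta^{(m)} + D^{(m)-1}(y - \hat{y}^{(m)})$; it reuses x, y, p, h_probit, and h.prime_probit from above, and the function name is my own:

my_IRLS_factored <- function(x, y, b.init, h, h.prime, tol = 1e-8) {
  change <- Inf
  b.old <- b.init
  while (change > tol) {
    eta <- as.vector(x %*% b.old)   # linear predictor
    y.hat <- h(eta)
    h.prime_eta <- h.prime(eta)
    w_star <- h.prime_eta^2 / (y.hat * (1 - y.hat))
    z_work <- eta + (y - y.hat) / h.prime_eta   # eta^(m) + D^{-1}(y - y.hat)

    b.new <- lm(z_work ~ x - 1, weights = w_star)$coef  # the WLS fit *is* the new b
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

my_IRLS_factored(x, y, rep(0, p), h_probit, h.prime_probit)
# should agree with my_IRLS_general() and glm(..., link = "probit") above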


Comments on convergence

Finally, a few quick comments on convergence (I'll keep this brief as this is getting really long and I'm no expert at optimization). Even though theoretically each $J^{(m)}$ is negative definite, bad initial conditions can still prevent this algorithm from converging. In the probit example above, changing the initial conditions to b.init = rep(1, p) makes the algorithm fail, and that doesn't even look like a suspicious initial condition. If you step through the IRLS procedure with that initialization and these simulated data, by the second time through the loop there are some $\hat{y}_i$ that round to exactly $1$, so the weights become undefined. If we're using the canonical link in the algorithm I gave, we won't ever be dividing by $\hat{y}_i(1 - \hat{y}_i)$ to get undefined weights, but if we've got a situation where some $\hat{y}_i$ are approaching $0$ or $1$, such as in the case of perfect separation, then we'll still get non-convergence as the gradient dies without us reaching anything.
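As a tiny illustration of the perfect-separation issue (my own toy example, not from the original answer):

# y is perfectly separated by the sign of x.sep
x.sep <- c(-3, -2, -1, 1, 2, 3)
y.sep <- c( 0,  0,  0, 1, 1, 1)
glm(y.sep ~ x.sep, family = binomial())
# glm() typically warns that fitted probabilities numerically 0 or 1 occurred
# (and/or that the algorithm did not converge): the likelihood keeps increasing
# as the slope grows, y.hat piles up at 0 and 1, and the IRLS weights
# y.hat * (1 - y.hat) die out.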


+1. I love how thorough your answers often are.
amoeba says Reinstate Monica

You stated "the coefficient estimates from this converge on a maximum of the logistic regression likelihood." Is that necessarily so, from any initial values?
Mark L. Stone

@MarkL.Stone ah I was being too casual there, didn't mean to offend the optimization people :) I'll add some more details (and would appreciate your thoughts on them when I do)
JLD

Any chance you watched the link I posted? It seems that video talks from a machine learning perspective, just optimizing the logistic loss, without discussing the Hessian expectation?
Haitao Du

@hxd1011 In that pdf I linked to (link again: sagepub.com/sites/default/files/upm-binaries/…), on page 24 the author goes into the theory and explains what exactly makes a link function canonical. I found that pdf extremely helpful when I first came across this (although it took me a while to get through).
jld