R: glm function with family = "binomial" and the weights specification



I am very confused about how weights interact with family = "binomial" in glm. In my understanding, the likelihood for a glm with family = "binomial" is specified as follows:

f(y) = \binom{n}{ny} p^{ny} (1-p)^{n(1-y)} = \exp\left( n \left[ y \log \frac{p}{1-p} + \log(1-p) \right] + \log \binom{n}{ny} \right),

where y is the observed proportion of successes and n is the known number of trials.

In my understanding, the success probability p is parametrized by some linear coefficients \beta as p = p(\beta), and the glm function with family = "binomial" searches for:

\arg\max_\beta \sum_i \log f(y_i).
This optimization problem can then be simplified as:

\arg\max_\beta \sum_i \log f(y_i) = \arg\max_\beta \sum_i \left\{ n_i \left[ y_i \log \frac{p(\beta)}{1-p(\beta)} + \log(1-p(\beta)) \right] + \log \binom{n_i}{n_i y_i} \right\} = \arg\max_\beta \sum_i n_i \left[ y_i \log \frac{p(\beta)}{1-p(\beta)} + \log(1-p(\beta)) \right],

since the binomial-coefficient term does not depend on \beta.

Therefore, if we let \tilde{n}_i = n_i \times c for all i = 1, \dots, N for some constant c, it must also be true that:

\arg\max_\beta \sum_i \log f(y_i) = \arg\max_\beta \sum_i \tilde{n}_i \left[ y_i \log \frac{p(\beta)}{1-p(\beta)} + \log(1-p(\beta)) \right].

From this, I thought that scaling the number of trials n_i by a constant, with the observed proportions of success y_i held fixed, would not affect the maximum likelihood estimate of \beta.
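For an intercept-only model this invariance can even be checked in closed form. The following sketch (in Python, not glm's actual algorithm) uses the fact that the weighted Bernoulli log-likelihood \sum_i w_i [y_i \log p + (1-y_i)\log(1-p)] is maximized at \hat{p} = \sum_i w_i y_i / \sum_i w_i, so a common factor in the weights cancels:

```python
import numpy as np

# Illustrative closed-form check, not what glm.fit does internally:
# for an intercept-only binomial GLM with trial counts w_i and observed
# success proportions y_i, the weighted log-likelihood is maximized at
# p_hat = sum(w*y)/sum(w); multiplying every w_i by a constant c leaves
# p_hat, and hence beta_hat = logit(p_hat), unchanged.
y = np.array([1.0, 0.0, 0.0, 0.0])  # observed proportions of success
w = np.arange(1.0, 5.0)             # trial counts 1, 2, 3, 4

def intercept_mle(y, w):
    p_hat = np.sum(w * y) / np.sum(w)
    return np.log(p_hat / (1.0 - p_hat))  # logit of the weighted mean

print(intercept_mle(y, w))         # -2.1972... = log(0.1/0.9)
print(intercept_mle(y, w * 1000))  # identical: the constant cancels
```

So in exact arithmetic both glm calls below should return the same intercept, -2.197.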

The glm help file says:

 "For a binomial GLM prior weights are used to give the number of trials 
  when the response is the proportion of successes" 

So, with the proportion of successes as the response, I expected that scaling the weights would not affect the estimated \beta. However, the following two calls return different coefficient values:

 Y <- c(1,0,0,0) ## proportion of observed success
 w <- 1:length(Y) ## weight= the number of trials
 glm(Y~1,weights=w,family=binomial)

This yields:

 Call:  glm(formula = Y ~ 1, family = "binomial", weights = w)

 Coefficients:
 (Intercept)  
      -2.197     

whereas if I multiply all the weights by 1000, the estimated coefficient is different:

 glm(Y~1,weights=w*1000,family=binomial)

 Call:  glm(formula = Y ~ 1, family = binomial, weights = w * 1000)

 Coefficients:
 (Intercept)  
    -3.153e+15  

I have seen many other such examples, even with only moderate scaling of the weights. What is going on here?


For what it's worth, the weights argument ends up in two places inside the glm.fit function (in glm.R), which is what does the work in R: 1) in the deviance residuals, via the C function binomial_dev_resids (in family.c), and 2) in the IWLS step, via Cdqrls (in lm.c). I don't know enough C to be of more help tracing the logic.
Shadowtalker

See the answers here.
2015

@ssdecontrol I am reading glm.fit in the link you gave me, but I cannot find where the C function "binomial_dev_resids" is called in glm.fit. Would you mind pointing it out?
FairyOnIce 2015

@ssdecontrol Oh, sorry, I think I see it now. Each "family" is a list, and one of its elements is "dev.resids". Typing binomial at the R console shows the definition of the binomial object, and it has the line: dev.resids <- function(y, mu, wt) .Call(C_binomial_dev_resids, y, mu, wt)
FairyOnIce

Answers:



Your example is merely triggering rounding error in R. Large weights do not behave well in glm. Indeed, scaling w by almost any smaller number, such as 100, produces the same estimates as the unscaled w.

If you want more reliable behavior with the weights argument, try the svyglm function from the survey package.

See here:

    > svyglm(Y~1, design=svydesign(ids=~1, weights=~w, data=data.frame(w=w*1000, Y=Y)), family=binomial)
Independent Sampling design (with replacement)
svydesign(ids = ~1, weights = ~w, data = data.frame(w = w * 1000, 
    Y = Y))

Call:  svyglm(formula = Y ~ 1, design = svydesign(ids = ~1, weights = ~w2, 
    data = data.frame(w2 = w * 1000, Y = Y)), family = binomial)

Coefficients:
(Intercept)  
     -2.197  

Degrees of Freedom: 3 Total (i.e. Null);  3 Residual
Null Deviance:      2.601 
Residual Deviance: 2.601    AIC: 2.843


I think it comes down to the initial values used by glm.fit, which are set by family$initialize, and which make the method diverge. As far as I know, glm.fit solves the problem by forming a QR decomposition of W^{1/2} X, where X is the design matrix and W is a diagonal matrix of working weights (computed with the square-root entries described here). That is, it uses a Newton-Raphson method.
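Concretely, one IWLS update of this kind can be written as follows (a sketch, where V is the family's variance function, \mu' = d\mu/d\eta, and w_i are the prior weights):

z = \eta + \frac{y - \mu}{\mu'(\eta)}, \qquad W = \operatorname{diag}\left( \frac{w_i \, \mu'(\eta_i)^2}{V(\mu_i)} \right), \qquad \hat\beta \leftarrow (X^\top W X)^{-1} X^\top W z.

These are the quantities z and sqrt of the diagonal of W that appear in the simplified R code further down.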

The relevant $initialize code is:

if (NCOL(y) == 1) {
    if (is.factor(y)) 
        y <- y != levels(y)[1L]
    n <- rep.int(1, nobs)
    y[weights == 0] <- 0
    if (any(y < 0 | y > 1)) 
        stop("y values must be 0 <= y <= 1")
    mustart <- (weights * y + 0.5)/(weights + 1)
    m <- weights * y
    if (any(abs(m - round(m)) > 0.001)) 
        warning("non-integer #successes in a binomial glm!")
}
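
To see why large weights push the starting values to the extremes, here is a small sketch (in Python, assuming the mustart formula from $initialize above) of mustart = (w*y + 0.5)/(w + 1) and the corresponding starting linear predictor eta = logit(mustart):

```python
import numpy as np

# Sketch of binomial()$initialize's starting values: as w grows,
# mustart = (w*y + 0.5)/(w + 1) tends to y itself, so the starting
# eta = logit(mustart) is pushed far from zero.
y = np.array([1.0, 0.0, 0.0, 0.0])
w = np.arange(1.0, 5.0)  # weights 1, 2, 3, 4

def start_eta(y, w):
    mustart = (w * y + 0.5) / (w + 1.0)
    return np.log(mustart / (1.0 - mustart))  # logit link

print(start_eta(y, w))          # moderate starting values, |eta| < 2.2
print(start_eta(y, w * 1000.0)) # extreme: roughly 7.6 and -8.3 to -9.0
```

With the scaled weights these match the extreme eta values printed below (7.601402, -8.294300, ...), which is where Newton-Raphson starts from.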

Here is a simplified version of glm.fit which shows my point:

> #####
> # setup
> y <- matrix(c(1,0,0,0), ncol = 1)
> weights <- 1:nrow(y) * 1000
> nobs <- length(y)
> family <- binomial()
> X <- matrix(rep(1, nobs), ncol = 1) # design matrix used later
> 
> # set mu start as with family$initialize
> if (NCOL(y) == 1) {
+   n <- rep.int(1, nobs)
+   y[weights == 0] <- 0
+   mustart <- (weights * y + 0.5)/(weights + 1)
+   m <- weights * y
+   if (any(abs(m - round(m)) > 0.001)) 
+     warning("non-integer #successes in a binomial glm!")
+ }
> 
> mustart # starting value
             [,1]
[1,] 0.9995004995
[2,] 0.0002498751
[3,] 0.0001666111
[4,] 0.0001249688
> (eta <- family$linkfun(mustart))
          [,1]
[1,]  7.601402
[2,] -8.294300
[3,] -8.699681
[4,] -8.987322
> 
> #####
> # Start loop to fit
> mu <- family$linkinv(eta)
> mu_eta <- family$mu.eta(eta)
> z <- drop(eta + (y - mu) / mu_eta)
> w <- drop(sqrt(weights * mu_eta^2 / family$variance(mu = mu)))
> 
> # code is simpler here as (X^T W X) is a scalar
> X_w <- X * w
> (.coef <- drop(crossprod(X_w)^-1 * ((w * z) %*% X_w)))
[1] -5.098297
> (eta <- .coef * X)
          [,1]
[1,] -5.098297
[2,] -5.098297
[3,] -5.098297
[4,] -5.098297
> 
> # repeat a few times from "start loop to fit"

We can repeat the last part two more times to see how the Newton-Raphson method diverges:

> #####
> # Start loop to fit
> mu <- family$linkinv(eta)
> mu_eta <- family$mu.eta(eta)
> z <- drop(eta + (y - mu) / mu_eta)
> w <- drop(sqrt(weights * mu_eta^2 / family$variance(mu = mu)))
> 
> # code is simpler here as (X^T W X) is a scalar
> X_w <- X * w
> (.coef <- drop(crossprod(X_w)^-1 * ((w * z) %*% X_w)))
[1] 10.47049
> (eta <- .coef * X)
         [,1]
[1,] 10.47049
[2,] 10.47049
[3,] 10.47049
[4,] 10.47049
> 
> 
> #####
> # Start loop to fit
> mu <- family$linkinv(eta)
> mu_eta <- family$mu.eta(eta)
> z <- drop(eta + (y - mu) / mu_eta)
> w <- drop(sqrt(weights * mu_eta^2 / family$variance(mu = mu)))
> 
> # code is simpler here as (X^T W X) is a scalar
> X_w <- X * w
> (.coef <- drop(crossprod(X_w)^-1 * ((w * z) %*% X_w)))
[1] -31723.76
> (eta <- .coef * X)
          [,1]
[1,] -31723.76
[2,] -31723.76
[3,] -31723.76
[4,] -31723.76

This does not happen if you start with weights <- 1:nrow(y) or weights <- 1:nrow(y) * 100.

Note that you can avoid the divergence by setting the mustart argument. E.g., do:

> glm(Y ~ 1,weights = w * 1000, family = binomial, mustart = rep(0.5, 4))

Call:  glm(formula = Y ~ 1, family = binomial, weights = w * 1000, mustart = rep(0.5, 
    4))

Coefficients:
(Intercept)  
     -2.197  

Degrees of Freedom: 3 Total (i.e. Null);  3 Residual
Null Deviance:      6502 
Residual Deviance: 6502     AIC: 6504

I think the weights affect more than just the initialization parameters. In logistic regression, Newton-Raphson estimates the maximum likelihood, which exists and is unique when the data are not separated. Supplying the optimizer with different starting values will not arrive at different values, although it may take longer to get there.
AdamO '17

"Supplying the optimizer with different starting values will not arrive at different values..." Well, the Newton method does not diverge, and finds the unique maximum, in the last example where I set the initial values (see the example supplying the mustart argument). It seems more like a matter of a poor initial estimate.
Benjamin Christoffersen
Licensed under cc by-sa 3.0 with attribution required.