How does R handle missing values in lm?

32

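I'd like to regress a vector B against each of the columns in a matrix A. This is trivial if there are no missing data, but if matrix A contains missing values, then my regression against A is constrained to include only rows where all values are present (the default na.omit behavior). This produces incorrect results for columns with no missing data. I can regress the column matrix B against individual columns of the matrix A, but I have thousands of regressions to do, and this is slow and tedious. The na.exclude function seems to be designed for this case, but I can't make it work. What am I doing wrong here? Using R 2.13 on OSX, if it matters.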

A = matrix(1:20, nrow=10, ncol=2)
B = matrix(1:10, nrow=10, ncol=1)
dim(lm(A~B)$residuals)
# [1] 10 2 (the expected 10 residual values)

# Missing value in first column; now we have 9 residuals
A[1,1] = NA  
dim(lm(A~B)$residuals)
#[1]  9 2 (the expected 9 residuals, given na.omit() is the default)

# Call lm with na.exclude; still have 9 residuals
dim(lm(A~B, na.action=na.exclude)$residuals)
#[1]  9 2 (was hoping to get a 10x2 matrix with a missing value here)

A.ex = na.exclude(A)
dim(lm(A.ex~B)$residuals)
# Throws an error because dim(A.ex)==9,2
#Error in model.frame.default(formula = A.ex ~ B, drop.unused.levels = TRUE) : 
#  variable lengths differ (found for 'B')

r missing-data linear-model

— David Quigley

1

What do you mean by "I can calculate each row individually"?

— chl

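Sorry, I meant "I can regress the column matrix B against the columns in A individually", i.e. with one call to lm per column. Edited to reflect this.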

— David Quigley

1

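One-at-a-time calls to lm/regression are not a great way to go about doing regression (going by the definition of regression, which is to find the partial effect of each predictor on a response/outcome given the state of the other variables).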

— KarthikS, 2015

23

Edit: I misunderstood your question. There are two aspects:

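a) na.omit and na.exclude both do casewise deletion with respect to both predictors and criteria. They differ only in that, with na.exclude, extractor functions like residuals() or fitted() pad their output with NAs for the omitted cases, so the output has the same length as the input variables.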

> N    <- 20                               # generate some data
> y1   <- rnorm(N, 175, 7)                 # criterion 1
> y2   <- rnorm(N,  30, 8)                 # criterion 2
> x    <- 0.5*y1 - 0.3*y2 + rnorm(N, 0, 3) # predictor
> y1[c(1, 3,  5)] <- NA                    # some NA values
> y2[c(7, 9, 11)] <- NA                    # some other NA values
> Y    <- cbind(y1, y2)                    # matrix for multivariate regression
> fitO <- lm(Y ~ x, na.action=na.omit)     # fit with na.omit
> dim(residuals(fitO))                     # use extractor function
[1] 14  2

> fitE <- lm(Y ~ x, na.action=na.exclude)  # fit with na.exclude
> dim(residuals(fitE))                     # use extractor function -> = N
[1] 20  2

> dim(fitE$residuals)                      # access residuals directly
[1] 14  2
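To see where the padding ends up, a small deterministic sketch may help. The toy data below are made up for illustration (they are not the simulated data from the transcript above), and the sketch only assumes base R's lm(), na.action() and residuals().

```r
# Hypothetical toy data: two criteria with NAs in different rows
x  <- c(1, 2, 3, 4, 5, 6, 7, 8)
y1 <- c(2.1, NA, 6.2, 7.9, 10.1, 12.2, 13.8, 16.1)   # NA in row 2
y2 <- c(5.0, 4.1, 2.9, NA, 1.2, 0.1, -1.1, -1.9)     # NA in row 4
Y  <- cbind(y1, y2)

fitE <- lm(Y ~ x, na.action = na.exclude)

# Rows removed by casewise deletion: any row of Y containing an NA
dropped <- as.integer(na.action(fitE))
print(dropped)

# The extractor pads BOTH columns with NA at all dropped rows,
# even though each column is missing in only one of them
resE <- residuals(fitE)
print(which(is.na(resE[, 1])))
print(which(is.na(resE[, 2])))

# The stored component still holds only the complete cases
print(dim(fitE$residuals))
```

Both columns lose rows 2 and 4, which is the casewise behavior described in a).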

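b) The real issue is not the difference between na.omit and na.exclude: you don't seem to want casewise deletion that takes the criterion variables into account, which both of them do.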

> X <- model.matrix(fitE)                  # design matrix
> dim(X)                                   # casewise deletion -> only 14 complete cases
[1] 14  2

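The regression results depend on the matrix $X^{+} = (X' X)^{-1} X'$ (the pseudoinverse of the design matrix $X$, giving the coefficients $\hat{\beta} = X^{+} Y$) and on the hat matrix $H = X X^{+}$ (giving the fitted values $\hat{Y} = H Y$). If you don't want casewise deletion, you need a different design matrix $X$ for each column of $Y$, so there is no way around fitting a separate regression for each criterion. You can try to avoid the overhead of lm() by doing something along the lines of the following: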

> Xf <- model.matrix(~ x)                    # full design matrix (all cases)
# function: manually calculate coefficients and fitted values for single criterion y
> getFit <- function(y) {
+     idx   <- !is.na(y)                     # throw away NAs
+     Xsvd  <- svd(Xf[idx , ])               # SVD decomposition of X
+     # get X+ but note: there might be better ways
+     Xplus <- tcrossprod(Xsvd$v %*% diag(Xsvd$d^(-2)) %*% t(Xsvd$v), Xf[idx, ])
+     list(coefs=(Xplus %*% y[idx]), yhat=(Xf[idx, ] %*% Xplus %*% y[idx]))
+ }

> res <- apply(Y, 2, getFit)    # get fits for each column of Y
> res$y1$coefs
                   [,1]
(Intercept) 113.9398761
x             0.7601234

> res$y2$coefs
                 [,1]
(Intercept) 91.580505
x           -0.805897

> coefficients(lm(y1 ~ x))      # compare with separate results from lm()
(Intercept)           x 
113.9398761   0.7601234 

> coefficients(lm(y2 ~ x))
(Intercept)           x 
  91.580505   -0.805897

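Note that there might be numerically better ways to calculate $X^{+}$ and $H$; you could look at a $QR$ decomposition instead. The SVD approach is explained here on SE. I have not timed the above approach with large matrices $Y$ against actually using lm().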
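As a rough sketch of the QR alternative mentioned above: the helper name getFitQR and the toy data are hypothetical, and this is one possible way to do it with base R's qr(), qr.coef() and qr.fitted(), not a tuned implementation. Each column of Y gets its own design matrix, built from the rows where that column is observed.

```r
# Hypothetical toy data: two criteria with NAs in different rows
x  <- c(1, 2, 3, 4, 5, 6, 7, 8)
y1 <- c(2.1, NA, 6.2, 7.9, 10.1, 12.2, 13.8, 16.1)
y2 <- c(5.0, 4.1, 2.9, NA, 1.2, 0.1, -1.1, -1.9)
Y  <- cbind(y1, y2)

Xf <- model.matrix(~ x)                  # full design matrix (all cases)

# Fit a single criterion y by QR, using only the rows where y is observed
getFitQR <- function(y) {
    idx  <- !is.na(y)                    # rows available for this column
    Xqr  <- qr(Xf[idx, , drop = FALSE])  # QR decomposition of the reduced X
    yhat <- rep(NA_real_, length(y))     # pad fitted values to full length
    yhat[idx] <- qr.fitted(Xqr, y[idx])
    list(coefs = qr.coef(Xqr, y[idx]), yhat = yhat)
}

# One fit per column; as.data.frame() keeps the column names y1, y2
resQR <- lapply(as.data.frame(Y), getFitQR)

print(resQR$y1$coefs)
print(coefficients(lm(y1 ~ x)))          # should agree with the line above
```

Here y1 is fitted on 7 rows and y2 on a different set of 7 rows, so neither column loses cases because of NAs in the other.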

— caracal

This makes sense given my understanding of how na.exclude should work. However, if you call > X.both = cbind(X1, X2) and then > dim(lm(X.both ~ Y, na.action=na.exclude)$residuals), you still get 94 residuals, instead of 97 and 97.

— David Quigley

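This is an improvement, but if you look at residuals(lm(X.both ~ Y, na.action=na.exclude)), you'll see that each column has six missing values, even though the missing values in column 1 of X.both come from different samples than those in column 2. So na.exclude preserves the shape of the residuals matrix, but under the hood R is apparently regressing only on the rows in which every column of X.both is present. There may be a good statistical reason for this, but for my application it's a problem.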

— David Quigley

@David I had misunderstood your question. I think I now see your point, and I have edited my answer to address it.

— caracal

5

I can think of two ways. One is to combine the data, apply na.exclude, and then separate the data again:

A = matrix(1:20, nrow=10, ncol=2)
colnames(A) <- paste("A",1:ncol(A),sep="")

B = matrix(1:10, nrow=10, ncol=1)
colnames(B) <- paste("B",1:ncol(B),sep="")

C <- cbind(A,B)

C[1,1] <- NA
C.ex <- na.exclude(C)

# split the NA-filtered matrix (C.ex, not C) back into its parts
A.ex <- C.ex[,colnames(A)]
B.ex <- C.ex[,colnames(B)]

lm(A.ex~B.ex)

The other way is to use the data argument and create a formula.

Cd <- data.frame(C)
fr <- formula(paste("cbind(",paste(colnames(A),collapse=","),")~",paste(colnames(B),collapse="+"),sep=""))

lm(fr,data=Cd)

Cd[1,1] <-NA

lm(fr,data=Cd,na.action=na.exclude)

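If you are doing a lot of regressions, the first way should be faster, since less background magic is performed. If you only need coefficients and residuals, though, I suggest using lsfit, which is much faster than lm. The second way is a bit nicer, but on my laptop calling summary on the resulting regression throws an error. I will try to find out whether this is a bug.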
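A minimal sketch of the lsfit suggestion, applied one column at a time so that each column keeps all of its own observed rows. The helper name fitCol and the noisy toy values are hypothetical, and whether this beats lm on thousands of columns has not been timed here.

```r
# Hypothetical toy data in the spirit of the question: A1 has one NA, A2 has none
A1 <- c(NA, 2.2, 2.9, 4.1, 5.3, 5.8, 7.1, 8.2, 8.8, 10.1)
A2 <- c(11.2, 11.9, 13.1, 14.0, 15.2, 15.9, 17.1, 17.8, 19.2, 20.1)
A  <- cbind(A1, A2)
B  <- as.numeric(1:10)

# Regress one column y of A on B with lsfit(), dropping only y's own NAs
fitCol <- function(y) {
    idx <- !is.na(y)                       # rows observed for this column
    fit <- lsfit(B[idx], y[idx])           # intercept is included by default
    res <- rep(NA_real_, length(y))        # pad residuals back to full length
    res[idx] <- fit$residuals
    list(coefs = fit$coefficients, residuals = res)
}

fits <- lapply(as.data.frame(A), fitCol)

# Number of points used per column: 9 for A1, 10 for A2
print(sapply(fits, function(f) sum(!is.na(f$residuals))))
print(fits$A2$coefs)
```

The price is one lsfit call per column, but A2 is no longer penalized for the missing value in A1.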

— mpiktas

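Thanks, but lm(A.ex~B.ex) in your code fits 9 points for A1 (correct) and 9 points for A2 (unwanted). There are 10 measured points for both B1 and A2; I'm throwing out a point in the regression of B1 against A2 because the corresponding point is missing in A1. If that's just the way it works I can accept that, but it's not what I'm trying to get R to do.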

— David Quigley

@David, oh, it seems I misunderstood your question. I'll post a fix later.

— mpiktas

1

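The following example shows how to produce predictions and residuals that conform to the original data frame, using the "na.action=na.exclude" option in lm() to specify that NAs should be placed in the residual and prediction vectors wherever the original data frame had missing values. It also shows how to choose whether predictions cover only observations where both the explanatory and dependent variables are complete (i.e., strictly in-sample predictions), or all observations where the explanatory variables are complete, so that an Xb prediction is possible (i.e., including out-of-sample predictions for observations with complete explanatory variables but a missing dependent variable).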

I use cbind to add the predicted and residual variables to the original dataset.

## Set up data with a linear model
N <- 10
NXmissing <- 2 
X <- runif(N, 0, 10)
Y <- 6 + 2*X + rnorm(N, 0, 1)
## Put in missing values (missing X, missing Y, missing both)
X[ sample(1:N , NXmissing) ] <- NA
Y[ sample(which(is.na(X)), 1)]  <- NA
Y[ sample(which(!is.na(X)), 1)]  <- NA
(my.df <- data.frame(X,Y))

## Run the regression with na.action specified to na.exclude
## This puts NA's in the residual and prediction vectors
my.lm  <- lm( Y ~ X, na.action=na.exclude, data=my.df)

## Predict outcome for observations with complete both explanatory and
## outcome variables, i.e. observations included in the regression
my.predict.insample  <- predict(my.lm)

## Predict outcome for observations with complete explanatory
## variables.  The newdata= option specifies the dataset on which
## to apply the coefficients
my.predict.inandout  <- predict(my.lm,newdata=my.df)

## Predict residuals 
my.residuals  <- residuals(my.lm)

## Make sure that it binds correctly
(my.new.df  <- cbind(my.df,my.predict.insample,my.predict.inandout,my.residuals))

## or in one fell swoop

(my.new.df  <- cbind(my.df,yhat=predict(my.lm),yhato=predict(my.lm,newdata=my.df),uhat=residuals(my.lm)))
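Because the example above draws its missing positions at random, a fixed-data sketch may make the difference between the two predict() calls easier to check. The values below are made up for illustration; only the positions of the NAs matter.

```r
# Hypothetical fixed data: row 2 has X but no Y, row 3 has Y but no X,
# row 7 is missing both
my.df <- data.frame(X = c(1, 2, NA, 4, 5, 6, NA, 8),
                    Y = c(8.1, NA, 12.2, 13.9, 16.1, 17.8, NA, 22.2))

my.lm <- lm(Y ~ X, na.action = na.exclude, data = my.df)

insample <- predict(my.lm)                   # rows used in the fit only
inandout <- predict(my.lm, newdata = my.df)  # every row with a complete X

# In-sample predictions are NA wherever X or Y is missing
print(which(is.na(insample)))

# With newdata=, only rows with a missing X stay NA, so row 2 gets a prediction
print(which(is.na(inandout)))

# Residuals line up with the original rows as well
print(length(residuals(my.lm)))
```

Row 2 is the out-of-sample case: it has no residual and no in-sample prediction, but it does get a fitted value once newdata= is supplied.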

— Michael Ash