Is there a way to use cross-validation for variable/feature selection in R?

10

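I have a dataset with about 70 variables that I would like to cut down. What I want to do is use CV to find the most useful variables in the following way.

1) Randomly select, say, 20 variables.

2) Use stepwise / LASSO / lars / etc. to choose the most important variables.

3) Repeat ~50x and see which variables are selected (not eliminated) most frequently.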
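The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated, backward stepwise selection via base R's step() stands in for "stepwise / LASSO / lars", and the names dat, drawn, selected and rate are made up for the example.

```r
set.seed(1)

# Simulated stand-in data (an assumption): 100 rows, 70 predictors V1..V70,
# and a continuous outcome y that truly depends on V1 and V2 only.
n <- 100
p <- 70
dat <- as.data.frame(matrix(rnorm(n * p), nrow = n))
dat$y <- 2 * dat$V1 - 1.5 * dat$V2 + rnorm(n)

n_reps     <- 50   # step 3: number of repetitions
n_subset   <- 20   # step 1: size of each random subset
predictors <- paste0("V", 1:p)

drawn    <- setNames(integer(p), predictors)  # times each variable was drawn
selected <- setNames(integer(p), predictors)  # times it survived selection

for (r in seq_len(n_reps)) {
  # Step 1: draw a random subset of candidate variables
  vars <- sample(predictors, n_subset)

  # Step 2: backward stepwise selection (by AIC) within that subset
  full <- lm(reformulate(vars, response = "y"), data = dat)
  kept <- attr(terms(step(full, trace = 0)), "term.labels")

  drawn[vars]    <- drawn[vars] + 1L
  selected[kept] <- selected[kept] + 1L
}

# Step 3: selection rate among the repetitions in which a variable was drawn
rate <- ifelse(drawn > 0, selected / drawn, NA)
head(sort(rate, decreasing = TRUE), 10)
```

Dividing by the number of times a variable was drawn matters here, because with random subsets not every variable gets the same number of chances to be selected.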

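This is along the lines of what a randomForest would do, but the rfVarSel package seems to work only for factors/classification, and I need to predict a continuous dependent variable.

I am using R, so any suggestions would ideally be implemented there.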

— Screech Owl

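Are all the features important? How many samples do you have? If I understand the problem correctly, you could try some variant of boosting: repeatedly select a subset of the samples, fit all the variables to them, and see which variables pop up most often.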

— Ofelia

1

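I think your procedure is unlikely to improve on the LASSO, whose implementations in R (e.g. glmnet and penalized) offer built-in cross-validation (for example cv.glmnet in glmnet) to find the "optimal" regularization parameter. One thing you might consider is repeating the LASSO search for this parameter several times, to cope with the potentially large variance of cross-validation (repeated CV). Of course, no algorithm can beat your subject-specific prior knowledge.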
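A minimal sketch of the repeated-CV idea, assuming the glmnet package is installed. The data are simulated, and summarizing the repeats by the geometric mean of lambda.min is just one possible choice made for this example, not something prescribed by glmnet.

```r
library(glmnet)
set.seed(1)

# Simulated stand-in data (an assumption): 70 predictors, 2 of them relevant
n <- 100
p <- 70
x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("V", 1:p)))
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)

# Repeat 10-fold CV several times. Each call draws new random folds, so the
# selected lambda.min varies from run to run.
n_repeats  <- 20
lambda_min <- replicate(n_repeats, cv.glmnet(x, y, nfolds = 10)$lambda.min)
range(lambda_min)   # spread caused by the random fold assignment alone

# One possible summary (an assumption): geometric mean of the selected lambdas
best_lambda <- exp(mean(log(lambda_min)))

# Fit the LASSO path on all data and list the predictors that have nonzero
# coefficients at the chosen penalty
beta   <- as.matrix(coef(glmnet(x, y), s = best_lambda))
chosen <- setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")
chosen
```

Looking at range(lambda_min) before settling on a value gives a quick impression of how much a single cross-validation run can be trusted.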

— miura

9

I believe what you describe is already implemented in the caret package. Look at the rfe function or the vignette here: http://cran.r-project.org/web/packages/caret/vignettes/caretSelection.pdf
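For orientation, a call to rfe might look like the sketch below. It assumes the caret package is installed, uses simulated data, and picks linear-model ranking (lmFuncs) with 10-fold CV and an arbitrary set of candidate subset sizes purely for illustration.

```r
library(caret)
set.seed(1)

# Simulated stand-in data (an assumption): 70 predictors, 2 of them relevant
n <- 100
p <- 70
x <- as.data.frame(matrix(rnorm(n * p), nrow = n))
y <- 2 * x$V1 - 1.5 * x$V2 + rnorm(n)

# Recursive feature elimination with linear models, resampled by 10-fold CV
ctrl    <- rfeControl(functions = lmFuncs, method = "cv", number = 10)
rfe_fit <- rfe(x, y, sizes = c(2, 5, 10, 20), rfeControl = ctrl)

rfe_fit              # resampled performance for each candidate subset size
predictors(rfe_fit)  # variables in the subset size that rfe judged best
```

The feature ranking is recomputed inside each resample, so the performance estimates account for the selection step itself.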

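That said, why do you need to reduce the number of features? Going from 70 to 20 isn't really an order-of-magnitude decrease. I would think you'd need more than 70 features before you could hold a firm prior belief that some of them truly don't matter. But then again, I suppose that's where a subjective prior comes in.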

— Shea Parkes

5

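There is no reason why variable selection frequency should provide any information that you don't already get from the apparent importance of the variables in the initial model. It is essentially a replay of initial statistical significance. You are also adding a new level of arbitrariness when you try to decide on a cutoff for selection frequency. Resampling-based variable selection is also badly damaged by collinearity, in addition to the other problems.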
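The collinearity point can be explored with a small simulation. The sketch below is an illustration only: the data are simulated, x2 is constructed as a near copy of x1, and bootstrap resampling with backward step() stands in for whatever resampled selection procedure is actually used.

```r
set.seed(1)

# Simulated data (an assumption): x1 and x2 are almost collinear and both
# affect y, while x3 is pure noise
n   <- 100
x1  <- rnorm(n)
x2  <- x1 + rnorm(n, sd = 0.1)
x3  <- rnorm(n)
y   <- x1 + x2 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# Bootstrap the rows, rerun backward stepwise selection on each resample,
# and record which variables survive
n_boot <- 200
picked <- replicate(n_boot, {
  boot <- dat[sample(n, replace = TRUE), ]
  fit  <- step(lm(y ~ x1 + x2 + x3, data = boot), trace = 0)
  c("x1", "x2", "x3") %in% attr(terms(fit), "term.labels")
})
rownames(picked) <- c("x1", "x2", "x3")

rowMeans(picked)                       # selection frequency per variable
mean(picked["x1", ] & picked["x2", ])  # how often both are kept together
```

Comparing the individual frequencies of x1 and x2 with how often they are retained together shows how much of the "importance" of each depends on whether its near duplicate happened to stay in the model.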

— Frank Harrell

2

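I have revised my answer from earlier today. I have now generated some example data on which to run the code. Others have rightly suggested that you look into the caret package, and I agree. In some instances, however, you may find it necessary to write your own code. Below, I have tried to demonstrate how to use the sample() function in R to randomly assign observations to cross-validation folds. I also use for loops to perform variable pre-selection (using univariate linear regression with a lenient p value cutoff of 0.1) and model building (using stepwise regression) on the ten training sets. You can then write your own code to apply the resulting models to the validation folds. Hope this helps!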

################################################################################
## Load the MASS library, which contains the "stepAIC" function for performing
## stepwise regression, to be used later in this script
library(MASS)
################################################################################


################################################################################
## Generate example data, with 100 observations (rows), 70 variables (columns 1
## to 70), and a continuous dependent variable (column 71)
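## rnorm() is vectorized, so the whole table can be drawn in a single call.
## as.data.frame() names the columns V1 to V71 automatically.
Data <- as.data.frame(matrix(rnorm(100 * 71), nrow = 100, ncol = 71))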

names(Data)[71] <- "Dependent"
################################################################################


################################################################################
## Create ten folds for cross-validation. Each observation in your data will
## randomly be assigned to one of ten folds.
Data$Fold <- sample(c(rep(1:10,10)))

## Each fold will have the same number of observations assigned to it. You can
## double check this by typing the following:
table(Data$Fold)

## Note: If the number of observations is not a multiple of ten (say 105
## instead of 100), you could instead write:
## Data$Fold <- sample(rep(1:10, length.out = nrow(Data)))
################################################################################


################################################################################
## I like to use a "for loop" for cross-validation. Here, prior to beginning my
## "for loop", I will define the variables I plan to use in it. You have to do
## this first or R will give you an error code.
fit <- NULL
stepw <- NULL
training <- NULL
testing <- NULL
Preselection <- NULL
Selected <- NULL
variables <- NULL
################################################################################


################################################################################
## Now we can begin the ten-fold cross-validation. First, we open the "for loop"
for (CV in 1:10) {

## Now we define your training and testing folds. I like to store these data in
## a list, so at the end of the script, if I want to, I can go back and look at
## the observations in each individual fold
training[[CV]] <- Data[Data$Fold != CV, ]
testing[[CV]]  <- Data[Data$Fold == CV, ]

## We can preselect variables by analyzing each variable separately using
## univariate linear regression and then ranking them by p value. We run one
## regression for each of our 70 variables (columns V1 to V70), on the training
## fold only, and keep the p value of the slope coefficient.
pValue <- sapply(1:70, function(i) {
  summary(lm(training[[CV]]$Dependent ~ training[[CV]][, i]))$coefficients[2, 4]
})

## We store the variable number and the coefficient p value in our object
## called "Preselection".
Preselection[[CV]] <- data.frame(Variable = 1:70, pValue = pValue)

## Now we will make note of those variables whose p values were at most 0.1.
Selected[[CV]] <- Preselection[[CV]][Preselection[[CV]]$pValue <= 0.1, ]
row.names(Selected[[CV]]) <- NULL

## Fit a model using the pre-selected variables to the training fold.
## First we save the variable names as the right-hand side of a formula. If no
## variable passed the cutoff, we fall back to an intercept-only model.
if (nrow(Selected[[CV]]) > 0) {
  variables[[CV]] <- paste0("V", Selected[[CV]]$Variable, collapse = " + ")
} else {
  variables[[CV]] <- "1"
}

## Now we can use this string as the independent variables list in our model.
## Referring to columns by name (rather than as training[[CV]]$V1) means that
## predict() can later be given the testing fold as new data.
form <- as.formula(paste("Dependent ~", variables[[CV]]))

## We can build a model using all of the pre-selected variables
fit[[CV]] <- lm(form, data = training[[CV]])

## Then we can build new models using stepwise removal of these variables using
## the MASS package
stepw[[CV]] <- stepAIC(fit[[CV]], direction = "both")

## End for loop
}

## Now you have your ten training and validation sets saved as training[[CV]]
## and testing[[CV]]. You also have results from your univariate pre-selection
## analyses saved as Preselection[[CV]]. Those variables that had p values less
## than 0.1 are saved in Selected[[CV]]. Models built using these variables are
## saved in fit[[CV]]. Reduced versions of these models (by stepwise selection)
## are saved in stepw[[CV]].

## Now you might consider using the predict.lm function from the stats package
## to apply your ten models to their corresponding validation folds. You then
## could look at the performance of the ten models and average their performance
## statistics together to get an overall idea of how well your data predict the
## outcome.
################################################################################
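The last step described in the comments, applying each fold's model to its held-out fold, could look roughly like the self-contained sketch below. It is an illustration under assumptions: the data are simulated again under the hypothetical name demo_data, and a fixed placeholder model stands in for the stepwise model that the full script builds on each training fold.

```r
set.seed(1)

# Simulated stand-in data (an assumption) with the same layout as the script:
# predictors V1..V70, outcome "Dependent", and a Fold column with values 1..10
n <- 100
demo_data <- as.data.frame(matrix(rnorm(n * 70), nrow = n))
demo_data$Dependent <- 2 * demo_data$V1 - 1.5 * demo_data$V2 + rnorm(n)
demo_data$Fold <- sample(rep(1:10, length.out = n))

rmse <- numeric(10)
for (CV in 1:10) {
  train <- demo_data[demo_data$Fold != CV, ]
  test  <- demo_data[demo_data$Fold == CV, ]

  # Placeholder model: in the full script this would be the stepwise model
  # built from the variables pre-selected on this training fold
  model <- lm(Dependent ~ V1 + V2 + V3, data = train)

  # Predict the held-out fold and record the root mean squared error
  pred     <- predict(model, newdata = test)
  rmse[CV] <- sqrt(mean((test$Dependent - pred)^2))
}

rmse        # error on each held-out fold
mean(rmse)  # cross-validated estimate of prediction error
```

For the estimate to be honest, every data-dependent step (pre-selection, stepwise selection, fitting) has to be redone inside the loop using the training fold only, as in the script above.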

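Before performing cross-validation, it is important that you read about its proper use. These two references offer excellent discussions of cross-validation: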

Simon RM, Subramanian J, Li MC, Menezes S. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data. Brief Bioinform. 2011 May;12(3):203-14. Epub 2011 Feb 15. http://bib.oxfordjournals.org/content/12/3/203.long
Richard Simon, Michael D. Radmacher, Kevin Dobbin, and Lisa M. Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification. JNCI J Natl Cancer Inst (2003) 95 (1): 14-18. http://jnci.oxfordjournals.org/content/95/1/14.long

These papers are geared towards biostatisticians, but they would be useful for anyone.

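Also, always keep in mind that stepwise regression is dangerous (although using cross-validation should help alleviate overfitting). A good discussion of stepwise regression is available here: http://www.stata.com/support/faqs/stat/stepwise.html.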

Let me know if you have any additional questions!

— Alexander

0

I found something nice over here: http://cran.r-project.org/web/packages/Causata/vignettes/Causata-vignette.pdf

Try this when using the glmnet package:

# extract nonzero coefficients
# (cv.glmnet.obj is assumed to be a fitted object returned by cv.glmnet(x, y))
coefs.all <- as.matrix(coef(cv.glmnet.obj, s="lambda.min"))
idx <- as.vector(abs(coefs.all) > 0)
coefs.nonzero <- as.matrix(coefs.all[idx])
rownames(coefs.nonzero) <- rownames(coefs.all)[idx]

— Simon Niels