caret glmnet vs. cv.glmnet

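There seems to be a lot of confusion about how searching for the optimal lambda with glmnet inside caret compares with using cv.glmnet for the same task.

Many questions have been asked, for example:

Using `caret` to cross-validate `glmnet`

but no answer has been given, possibly because the questions were not reproducible. Following the first question, I give a very similar example that raises the same problem: why are the estimated lambdas so different?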

library(caret)
library(glmnet)
set.seed(849)
training <- twoClassSim(50, linearVars = 2)
set.seed(849)
testing <- twoClassSim(500, linearVars = 2)
trainX <- training[, -ncol(training)]
testX <- testing[, -ncol(testing)]
trainY <- training$Class

# Using glmnet to directly perform CV
set.seed(849)
cvob1=cv.glmnet(x=as.matrix(trainX), y=trainY,family="binomial",alpha=1, 
                type.measure="auc", nfolds = 3, lambda = seq(0.001,0.1,by = 0.001),
                standardize=FALSE)

cbind(cvob1$lambda,cvob1$cvm)

# best parameter
cvob1$lambda.min

# best coefficient
coef(cvob1, s = "lambda.min")


# Using caret to perform CV
cctrl1 <- trainControl(method="cv", number=3, returnResamp="all",classProbs=TRUE,summaryFunction=twoClassSummary)
set.seed(849)
test_class_cv_model <- train(trainX, trainY, method = "glmnet", trControl = cctrl1,metric = "ROC",
                             tuneGrid = expand.grid(alpha = 1,lambda = seq(0.001,0.1,by = 0.001)))


test_class_cv_model 

# best parameter
test_class_cv_model$bestTune

# best coefficient
coef(test_class_cv_model$finalModel, test_class_cv_model$bestTune$lambda)

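To sum up, the optimal lambdas are:

0.055 by using cv.glmnet()
0.001 by using train()

I know that using standardize=FALSE in cv.glmnet() is not advisable, but I really want to compare the two methods under the same conditions. My main guess is that the way each fold is sampled might be the issue, but I use the same seed and the results are still very different.

So I am really stuck on why the two approaches differ so much when they should be quite similar. I hope the community has some idea of what is going on here.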

r caret glmnet

— Jogi

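I see two issues here. First, your training set is too small relative to your testing set. Normally, we want a training set that is at least comparable in size to the testing set. Note also that for cross-validation you are not using the testing set at all, because the algorithm essentially creates the test sets for you out of the "training set". So you would be better off using more of the data as your initial training set.

Second, 3 folds is too few for your CV to be reliable. Typically, 5 to 10 folds are recommended (nfolds = 5 for cv.glmnet and number=5 for caret). With these changes, I got the same lambda value from both methods and almost identical estimates: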

set.seed(849)
training <- twoClassSim(500, linearVars = 2)
set.seed(849)
testing <- twoClassSim(50, linearVars = 2)
trainX <- training[, -ncol(training)]
testX <- testing[, -ncol(testing)]
trainY <- training$Class

# Using glmnet to directly perform CV
set.seed(849)
cvob1=cv.glmnet(x=as.matrix(trainX), y=trainY,family="binomial",alpha=1, 
                type.measure="auc", nfolds = 5, lambda = seq(0.001,0.1,by = 0.001),
                standardize=FALSE)

cbind(cvob1$lambda,cvob1$cvm)

# best parameter
cvob1$lambda.min

# best coefficient
coef(cvob1, s = "lambda.min")


# Using caret to perform CV
cctrl1 <- trainControl(method="cv", number=5, returnResamp="all",
                       classProbs=TRUE, summaryFunction=twoClassSummary)
set.seed(849)
test_class_cv_model <- train(trainX, trainY, method = "glmnet", 
                             trControl = cctrl1,metric = "ROC",
                             tuneGrid = expand.grid(alpha = 1,
                                                    lambda = seq(0.001,0.1,by = 0.001)))

test_class_cv_model 

# best parameter
test_class_cv_model$bestTune

# best coefficient
coef(test_class_cv_model$finalModel, test_class_cv_model$bestTune$lambda)

Results:

> cvob1$lambda.min
[1] 0.001

> coef(cvob1, s = "lambda.min")
8 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept) -0.781015706
TwoFactor1  -1.793387005
TwoFactor2   1.850588656
Linear1      0.009341356
Linear2     -1.213777391
Nonlinear1   1.158009360
Nonlinear2   0.609911748
Nonlinear3   0.246029667

> test_class_cv_model$bestTune
  alpha lambda
1     1  0.001

> coef(test_class_cv_model$finalModel, test_class_cv_model$bestTune$lambda)
8 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept) -0.845792624
TwoFactor1  -1.786976586
TwoFactor2   1.844767690
Linear1      0.008308165
Linear2     -1.212285068
Nonlinear1   1.159933335
Nonlinear2   0.676803555
Nonlinear3   0.309947442
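
One further point may be worth checking, since the question assumes that the same seed gives the same folds. Calling set.seed() with the same value does not by itself give both functions identical fold assignments: cv.glmnet() draws its fold labels with sample(), whereas caret builds class-stratified folds with createFolds(). In addition, glmnet standardizes predictors by default, so train() only matches standardize=FALSE if that argument is forwarded to it. The following is a minimal sketch, not taken from the original thread, that builds one set of folds and hands it to both functions. The names fold_list, foldid, lambda_grid, cv_fit, ctrl and caret_fit are illustrative, and it assumes that extra arguments given to train() are passed on to glmnet().

```r
library(caret)
library(glmnet)

set.seed(849)
training <- twoClassSim(500, linearVars = 2)
trainX <- as.matrix(training[, -ncol(training)])
trainY <- training$Class

# Build one set of stratified folds; each list element holds the held-out rows
set.seed(849)
fold_list <- createFolds(trainY, k = 5, list = TRUE, returnTrain = FALSE)

# cv.glmnet() expects one fold label per observation
foldid <- integer(length(trainY))
for (k in seq_along(fold_list)) foldid[fold_list[[k]]] <- k

lambda_grid <- seq(0.001, 0.1, by = 0.001)

# cv.glmnet() with the shared folds (nfolds is implied by foldid)
cv_fit <- cv.glmnet(trainX, trainY, family = "binomial", alpha = 1,
                    type.measure = "auc", foldid = foldid,
                    lambda = lambda_grid, standardize = FALSE)

# caret with the same folds: index = training rows, indexOut = held-out rows
train_rows <- lapply(fold_list, function(held_out) setdiff(seq_along(trainY), held_out))
ctrl <- trainControl(method = "cv", number = 5,
                     index = train_rows, indexOut = fold_list,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# standardize = FALSE is assumed to be forwarded to glmnet() through train()'s "..."
caret_fit <- train(trainX, trainY, method = "glmnet", metric = "ROC",
                   trControl = ctrl,
                   tuneGrid = expand.grid(alpha = 1, lambda = lambda_grid),
                   standardize = FALSE)

cv_fit$lambda.min
caret_fit$bestTune
```

With shared folds and matching standardization, any remaining gap between the two selected lambdas would have to come from how each package fits and scores the models rather than from the resampling.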

— StAtS

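Thank you very much for your answer. It makes perfect sense to me. Since I am new to CV, I did not take into account a) the sample size and b) the number of folds.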

— Jogi

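Thanks for the post! So if I got it right, one usually splits the data set into a large training set and a smaller test set (= holdout) and performs k-fold CV on the training set. Finally, one validates on the test set using the results of the CV, right?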

— Jogi

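@Jogi That is the way to do it. If you do not need further validation, you can also just use the whole data set for CV, since CV already selects the best parameters based on the model's average performance across the held-out folds.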

— StAtS
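
The workflow discussed in these comments (split off a holdout, tune by k-fold CV on the training part, then score once on the holdout) can be sketched as follows. This is only an illustration under assumptions: the 80/20 split, the sample size of 550, and the names dat, in_train, train_df, test_df, fit and holdout are not from the thread.

```r
library(caret)
library(glmnet)

set.seed(849)
dat <- twoClassSim(550, linearVars = 2)

# Stratified split: roughly 80% for training, the rest kept as a holdout
in_train <- createDataPartition(dat$Class, p = 0.8, list = FALSE)
train_df <- dat[in_train, ]
test_df  <- dat[-in_train, ]

# 5-fold CV on the training part only, selecting lambda by ROC AUC
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
fit <- train(Class ~ ., data = train_df, method = "glmnet",
             metric = "ROC", trControl = ctrl,
             tuneGrid = expand.grid(alpha = 1,
                                    lambda = seq(0.001, 0.1, by = 0.001)))

# Score the tuned model once on the untouched holdout
holdout <- data.frame(obs  = test_df$Class,
                      pred = predict(fit, newdata = test_df),
                      predict(fit, newdata = test_df, type = "prob"))
twoClassSummary(holdout, lev = levels(test_df$Class))
```

The holdout score is computed only once, after tuning, so it is not used to choose lambda.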