I have data with class imbalance and I want to tune the hyperparameters of the boosted trees using xgboost.
Questions
- Is there an equivalent of gridsearchcv or randomsearchcv for xgboost?
- If not, what is the recommended approach for tuning the parameters of xgboost?
Is it xgboost(max.depth) or xgb.train(max_depth)? Does xgboost use the dot and the underscore inconsistently for parameters in different places, or are they converted?
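For reference, a minimal sketch for checking the two spellings empirically, using the agaricus sample data that ships with the xgboost R package (a sketch only, assuming a reasonably recent CRAN release; the two fits are compared by their predictions):
library(xgboost)
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
# the same model specified with the dotted and the underscored spelling
fit_dot <- xgb.train(params = list(objective = "binary:logistic", max.depth = 2),
                     data = dtrain, nrounds = 5, verbose = 0)
fit_underscore <- xgb.train(params = list(objective = "binary:logistic", max_depth = 2),
                            data = dtrain, nrounds = 5, verbose = 0)
# if the two spellings are treated identically, the predictions should agree
all.equal(predict(fit_dot, dtrain), predict(fit_underscore, dtrain))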
Answers:
Since the interface to xgboost in caret has recently changed, here is a script that provides a fully commented walkthrough of using caret to tune the xgboost hyperparameters.
For this, I will use the training data from the Kaggle competition "Give Me Some Credit".
Fitting an xgboost model
In this section, we:
- fit an xgboost model with arbitrary hyperparameters
- evaluate the loss (AUC-ROC) using cross-validation (xgb.cv)
Here is some code to do this.
library(caret)
library(xgboost)
library(readr)
library(dplyr)
library(tidyr)
# load in the training data
df_train = read_csv("04-GiveMeSomeCredit/Data/cs-training.csv") %>%
na.omit() %>% # listwise deletion
select(-`[EMPTY]`) %>%
mutate(SeriousDlqin2yrs = factor(SeriousDlqin2yrs, # factor variable for classification
labels = c("Failure", "Success")))
# xgboost fitting with arbitrary parameters
xgb_params_1 = list(
objective = "binary:logistic", # binary classification
eta = 0.01, # learning rate
max.depth = 3, # max tree depth
eval_metric = "auc" # evaluation/loss metric
)
# fit the model with the arbitrary parameters specified above
xgb_1 = xgboost(data = as.matrix(df_train %>%
select(-SeriousDlqin2yrs)),
label = df_train$SeriousDlqin2yrs,
params = xgb_params_1,
nrounds = 100, # max number of trees to build
verbose = TRUE,
print.every.n = 1,
early.stop.round = 10 # stop if no improvement within 10 trees
)
# cross-validate xgboost to get the accurate measure of error
xgb_cv_1 = xgb.cv(params = xgb_params_1,
data = as.matrix(df_train %>%
select(-SeriousDlqin2yrs)),
label = df_train$SeriousDlqin2yrs,
nrounds = 100,
nfold = 5, # number of folds in K-fold
prediction = TRUE, # return the prediction using the final model
showsd = TRUE, # standard deviation of loss across folds
stratified = TRUE, # sample is unbalanced; use stratified sampling
verbose = TRUE,
print.every.n = 1,
early.stop.round = 10
)
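Before plotting, the best iteration and its mean test AUC can be read straight off the cross-validation object. A minimal sketch, assuming the older xgboost return format used in this answer (a data.table stored in $dt; newer versions expose the same information as $evaluation_log with slightly different column names):
best_iter <- which.max(xgb_cv_1$dt$test.auc.mean)   # iteration with the highest mean test AUC
best_auc  <- xgb_cv_1$dt$test.auc.mean[best_iter]
cat("best iteration:", best_iter, "- mean test AUC:", best_auc, "\n")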
# plot the AUC for the training and testing samples
xgb_cv_1$dt %>%
select(-contains("std")) %>%
mutate(IterationNum = 1:n()) %>%
gather(TestOrTrain, AUC, -IterationNum) %>%
ggplot(aes(x = IterationNum, y = AUC, group = TestOrTrain, color = TestOrTrain)) +
geom_line() +
theme_bw()
Here is what the testing versus training AUC looks like:
[Plot: training and testing AUC by boosting iteration]
For the hyperparameter search, we perform the following steps:
- create a data.frame with the unique combinations of parameters for which we want models trained
- specify the cross-validation control parameters
- train and evaluate a model for each parameter combination
Here is some code that shows how to do this.
# set up the cross-validated hyper-parameter search
xgb_grid_1 = expand.grid(
nrounds = 1000,
eta = c(0.01, 0.001, 0.0001),
max_depth = c(2, 4, 6, 8, 10),
gamma = 1
)
# pack the training control parameters
xgb_trcontrol_1 = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE,
returnData = FALSE,
returnResamp = "all", # save losses across all models
classProbs = TRUE, # set to TRUE for AUC to be computed
summaryFunction = twoClassSummary,
allowParallel = TRUE
)
# train the model for each parameter combination in the grid,
# using CV to evaluate
xgb_train_1 = train(
x = as.matrix(df_train %>%
select(-SeriousDlqin2yrs)),
y = as.factor(df_train$SeriousDlqin2yrs),
trControl = xgb_trcontrol_1,
tuneGrid = xgb_grid_1,
method = "xgbTree"
)
# scatter plot of the AUC against max_depth and eta
ggplot(xgb_train_1$results, aes(x = as.factor(eta), y = max_depth, size = ROC, color = ROC)) +
geom_point() +
theme_bw() +
scale_size_continuous(guide = "none")
Finally, you can create a bubble plot of the AUC over the variations of eta and max_depth:
[Plot: bubble chart of AUC against eta and max_depth]
scale_pos_weight is used for imbalanced classification. Could you provide details on how to use it? Thanks!
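For what it is worth, the xgboost parameter documentation suggests setting scale_pos_weight to the ratio of negative to positive cases. A minimal sketch with the classic xgboost() interface used elsewhere in this thread, where x (numeric feature matrix) and y (0/1 label vector) are hypothetical placeholders:
# heuristic from the xgboost docs: sum(negative instances) / sum(positive instances)
spw <- sum(y == 0) / sum(y == 1)
xgb_imbalanced <- xgboost(data = x, label = y,
                          params = list(objective = "binary:logistic",
                                        eval_metric = "auc",
                                        scale_pos_weight = spw),
                          nrounds = 100, verbose = 0)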
The caret package has incorporated xgboost.
cv.ctrl <- trainControl(method = "repeatedcv", repeats = 1,number = 3,
#summaryFunction = twoClassSummary,
classProbs = TRUE,
allowParallel=T)
xgb.grid <- expand.grid(nrounds = 1000,
eta = c(0.01,0.05,0.1),
max_depth = c(2,4,6,8,10,14)
)
set.seed(45)
xgb_tune <-train(formula,
data=train,
method="xgbTree",
trControl=cv.ctrl,
tuneGrid=xgb.grid,
verbose=T,
metric="Kappa",
nthread =3
)
Sample output
eXtreme Gradient Boosting
32218 samples
41 predictor
2 classes: 'N', 'Y'
No pre-processing
Resampling: Cross-Validated (3 fold, repeated 1 times)
Summary of sample sizes: 21479, 21479, 21478
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.9324911 0.1094426 0.0009742774 0.008972911
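To inspect the winning combination and the full resampling table, the standard accessors on a caret train object can be used; a short sketch:
xgb_tune$bestTune      # winning combination of nrounds / max_depth / eta
head(xgb_tune$results) # resampled Accuracy / Kappa for each grid point
plot(xgb_tune)         # tuning profile across the grid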
One drawback I see is that other parameters of xgboost, such as subsample, are currently not supported by caret.
Edit
Gamma, colsample_bytree, min_child_weight and subsample etc. can now (June 2017) be tuned directly using caret. Just add them in the grid portion of the above code to make it work. Thanks to usεr11852 for highlighting it in the comments.
caret now (Feb 2017) supports the additional parameters gamma, colsample_bytree, min_child_weight and subsample. (So, effectively, you can tune everything, given the time.)
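Since caret's "xgbTree" method expects a value for every one of its tuning parameters in tuneGrid, the expanded grid would look roughly like the sketch below (the candidate values are arbitrary and only illustrate the idea; xgb_grid_full is a hypothetical name):
xgb_grid_full <- expand.grid(
  nrounds = 1000,
  eta = c(0.01, 0.05, 0.1),
  max_depth = c(2, 4, 6, 8, 10),
  gamma = c(0, 1),
  colsample_bytree = c(0.5, 0.8, 1),
  min_child_weight = c(1, 5),
  subsample = c(0.5, 0.75, 1)
)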
I know this is an old question, but I use a different method from the ones above. I use the BayesianOptimization function from the rBayesianOptimization package to find the optimal parameters. To do this, you first create cross-validation folds, then create a function, xgb.cv.bayes, that takes as arguments the boosting hyper-parameters you want to vary. In this example I am tuning max.depth, min_child_weight, subsample, colsample_bytree and gamma. You then call xgb.cv in that function, with the hyper-parameters set from the input arguments of xgb.cv.bayes. Then you call BayesianOptimization with xgb.cv.bayes and the desired ranges of the boosting hyper-parameters. init_points is the number of initial models with hyper-parameters drawn randomly from the specified ranges, and n_iter is the number of rounds of models run after the initial points. The function outputs all boosting parameters and the test AUC.
library(xgboost)                # xgb.cv, xgb.DMatrix
library(rBayesianOptimization)  # KFold, BayesianOptimization
cv_folds <- KFold(as.matrix(df.train[,target.var]), nfolds = 5,
stratified = TRUE, seed = 50)
xgb.cv.bayes <- function(max.depth, min_child_weight, subsample, colsample_bytree, gamma){
cv <- xgb.cv(params = list(booster = 'gbtree', eta = 0.05,
max_depth = max.depth,
min_child_weight = min_child_weight,
subsample = subsample,
colsample_bytree = colsample_bytree,
gamma = gamma,
lambda = 1, alpha = 0,
objective = 'binary:logistic',
eval_metric = 'auc'),
data = data.matrix(df.train[,-target.var]),
label = as.matrix(df.train[, target.var]),
nround = 500, folds = cv_folds, prediction = TRUE,
showsd = TRUE, early.stop.round = 5, maximize = TRUE,
verbose = 0
)
list(Score = cv$dt[, max(test.auc.mean)],
Pred = cv$pred)
}
xgb.bayes.model <- BayesianOptimization(
xgb.cv.bayes,
bounds = list(max.depth = c(2L, 12L),
min_child_weight = c(1L, 10L),
subsample = c(0.5, 1),
colsample_bytree = c(0.1, 0.4),
gamma = c(0, 10)
),
init_grid_dt = NULL,
init_points = 10, # number of random points to start search
n_iter = 20, # number of iterations after initial random points are set
acq = 'ucb', kappa = 2.576, eps = 0.0, verbose = TRUE
)
This is an older question, but I thought I would share how I tune xgboost parameters. I originally thought I would use caret for this, but recently found an issue handling all of the parameters as well as missing values. I also considered writing an iterating loop through different combinations of parameters, but wanted it to run in parallel, and that would have required too much time. Using gridSearch from the NMOF package gave the best of both worlds (all parameters as well as parallel processing). Here is example code for binary classification (it works on Windows and Linux):
# packages used below (assumes a feature data frame `train` and a 0/1 vector `trainLabel`)
library(xgboost)   # xgb.cv, xgboost, xgb.DMatrix
library(NMOF)      # gridSearch
library(parallel)  # makeCluster, detectCores, clusterExport, clusterEvalQ, stopCluster
library(plyr)      # ldply
# xgboost task parameters
nrounds <- 1000
folds <- 10
obj <- 'binary:logistic'
eval <- 'logloss'
# Parameter grid to search
params <- list(
eval_metric = eval,
objective = obj,
eta = c(0.1,0.01),
max_depth = c(4,6,8,10),
max_delta_step = c(0,1),
subsample = 1,
scale_pos_weight = 1
)
# Table to track performance from each worker node
res <- data.frame()
# Simple cross validated xgboost training function (returning minimum error for grid search)
xgbCV <- function (params) {
fit <- xgb.cv(
data = data.matrix(train),
label = trainLabel,
param =params,
missing = NA,
nfold = folds,
prediction = FALSE,
early.stop.round = 50,
maximize = FALSE,
nrounds = nrounds
)
rounds <- nrow(fit)
metric = paste('test.',eval,'.mean',sep='')
idx <- which.min(fit[,fit[[metric]]])
val <- fit[idx,][[metric]]
res <<- rbind(res,c(idx,val,rounds))
colnames(res) <<- c('idx','val','rounds')
return(val)
}
# Find minimal testing error in parallel
cl <- makeCluster(round(detectCores()/2))
clusterExport(cl, c("xgb.cv",'train','trainLabel','nrounds','res','eval','folds'))
sol <- gridSearch(
fun = xgbCV,
levels = params,
method = 'snow',
cl = cl,
keepNames = TRUE,
asList = TRUE
)
# Combine all model results
comb=clusterEvalQ(cl,res)
results <- ldply(comb,data.frame)
stopCluster(cl)
# Train model given solution above
params <- c(sol$minlevels,objective = obj, eval_metric = eval)
xgbModel <- xgboost(
data = xgb.DMatrix(data.matrix(train),missing=NaN, label = trainLabel),
param = params,
nrounds = results[which.min(results[,2]),1]
)
print(params)
print(results)
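A brief usage sketch for generating predictions from the final model, assuming a hold-out set named test with the same columns as train (the name is a placeholder):
# predicted probabilities of the positive class under binary:logistic
pred <- predict(xgbModel, xgb.DMatrix(data.matrix(test), missing = NaN))
head(pred)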