How to use R gbm with distribution = "adaboost"?

9

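The documentation states that R gbm with distribution = "adaboost" can be used for 0-1 classification problems. Consider the following code fragment: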

gbm_algorithm <- gbm(y ~ ., data = train_dataset, distribution = "adaboost", n.trees = 5000)
gbm_predicted <- predict(gbm_algorithm, test_dataset, n.trees = 5000)

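The predict.gbm documentation says:

Returns a vector of predictions. By default the predictions are on the scale of f(x).

However, it is not clear what that scale is for the case of distribution = "adaboost".

Could anyone help interpret the values returned by predict.gbm and suggest how to convert them to 0-1 output?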

r gbm

— Alexey Lakhno

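This question appears to be only about how to interpret R output, not about the underlying statistical issues (although that doesn't make it a bad Q). As such, it would be better asked, and probably answered, on Stack Overflow rather than here. Please don't cross-post (that is strongly discouraged); if you want your Q migrated sooner, flag it for moderator attention.

— gung - Reinstate Monica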

4

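@gung This seems like a legitimate statistical question to me. The GBM package gives the deviance used for adaboost, but it isn't clear to me either what f(x) is or how to back-transform to a probability scale (perhaps one has to use Platt scaling). cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf

— B_Miner, 2012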

11

The adaboost method gives predictions on a logit-type scale (half the log-odds). You can convert them to probabilities between 0 and 1:

gbm_predicted<-plogis(2*gbm_predicted)

Note the 2* inside plogis.
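
The factor of 2 can be motivated by the usual population-minimizer argument for exponential loss. The sketch below assumes the 0-1 response is recoded as 2y - 1 (so it takes values -1 and 1) and writes p(x) for P(y = 1 | x); that notation is introduced here and is not from the thread.

```latex
\begin{align*}
\mathbb{E}\!\left[e^{-(2y-1)f(x)} \,\middle|\, x\right]
  &= p(x)\,e^{-f(x)} + \bigl(1 - p(x)\bigr)\,e^{f(x)} \\
\frac{\partial}{\partial f}:\quad
  -p(x)\,e^{-f} + \bigl(1 - p(x)\bigr)\,e^{f} &= 0
  \;\Longrightarrow\; e^{2f} = \frac{p(x)}{1 - p(x)} \\
f^{*}(x) &= \tfrac{1}{2}\,\log\frac{p(x)}{1 - p(x)} \\
p(x) &= \frac{1}{1 + e^{-2 f^{*}(x)}} = \operatorname{plogis}\bigl(2 f^{*}(x)\bigr)
\end{align*}
```

So the fitted f(x) estimates half the log-odds, which is why it is doubled before the logistic function is applied.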

— 拉贡

10

You can also get the probabilities directly from the predict.gbm function:

predict(gbm_algorithm, test_dataset, n.trees = 5000, type = 'response')
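
Both routes return probabilities rather than class labels, so hard 0-1 output still needs a cutoff. Below is a minimal sketch in base R. The scores in `f_hat` are made-up stand-ins for link-scale `predict` output, and the 0.5 cutoff is an assumption that should be tuned to the problem at hand.

```r
# Hypothetical link-scale scores, standing in for
# predict(gbm_algorithm, test_dataset, n.trees = 5000)
f_hat <- c(-1.2, -0.1, 0.3, 2.0)

# Map the scores to probabilities with the adaboost inverse link
p_hat <- plogis(2 * f_hat)

# Threshold at an assumed cutoff of 0.5 to get 0-1 labels
cutoff <- 0.5
y_hat  <- as.integer(p_hat > cutoff)
```

With real data, `p_hat` could instead come straight from the `type = 'response'` call, and the thresholding step stays the same.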

— Edwin

3

The adaboost link function is described here. This example walks through the computation in detail:

library(gbm);
set.seed(123);
n          <- 1000;
sim.df     <- data.frame(x.1 = sample(0:1, n, replace=TRUE), 
                         x.2 = sample(0:1, n,    replace=TRUE));
prob.array <- c(0.9, 0.7, 0.2, 0.8);
sim.df$y   <- rbinom(n, size = 1, prob=prob.array[1+sim.df$x.1+2*sim.df$x.2])
n.trees    <- 10;
shrinkage  <- 0.01;

gbmFit <- gbm(
  formula           = y~.,
  distribution      = "bernoulli",
  data              = sim.df,
  n.trees           = n.trees,
  interaction.depth = 2,
  n.minobsinnode    = 2,
  shrinkage         = shrinkage,
  bag.fraction      = 0.5,
  cv.folds          = 0,
  # verbose         = FALSE
  n.cores           = 1
);

sim.df$logods  <- predict(gbmFit, sim.df, n.trees = n.trees);
sim.df$prob    <- predict(gbmFit, sim.df, n.trees = n.trees, type = 'response');
sim.df$prob.2  <- plogis(predict(gbmFit, sim.df, n.trees = n.trees));
sim.df$logloss <- sim.df$y*log(sim.df$prob) + (1-sim.df$y)*log(1-sim.df$prob);


gbmFit <- gbm(
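  # name the predictors explicitly: by now sim.df also holds the columns
  # added after the first fit, which y~. would pick up as features
  formula           = y ~ x.1 + x.2,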
  distribution      = "adaboost",
  data              = sim.df,
  n.trees           = n.trees,
  interaction.depth = 2,
  n.minobsinnode    = 2,
  shrinkage         = shrinkage,
  bag.fraction      = 0.5,
  cv.folds          = 0,
  # verbose         = FALSE
  n.cores           = 1
);

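sim.df$exp.scale  <- predict(gbmFit, sim.df, n.trees = n.trees);
sim.df$ada.resp   <- predict(gbmFit, sim.df, n.trees = n.trees, type = 'response');
sim.df$ada.resp.2 <- plogis(2*predict(gbmFit, sim.df, n.trees = n.trees));
# exponential loss uses the response recoded to -1/1, i.e. 2*y - 1 (per the gbm vignette);
# the sign is flipped to match the logloss column above (higher is better)
sim.df$ada.error  <- -exp(-(2*sim.df$y - 1) * sim.df$exp.scale);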

sim.df[1:20,]

— Liran Katzir

I can't make the edit myself because the change is too small: df$y should be sim.df$y.

— Ric