Random forest regression does not predict higher than the training data

12

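I've noticed that when building random forest regression models, at least in R, the predicted value never exceeds the maximum value of the target variable seen in the training data. As an example, see the code below. I'm building a regression model to predict mpg from the mtcars data. I fit both an OLS model and a random forest model, and use them to predict mpg for a hypothetical car that should have very good fuel economy. OLS predicts a high mpg, as expected, but the random forest does not. I've noticed this in more complex models too. Why is this?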

> library(datasets)
> library(randomForest)
> 
> data(mtcars)
> max(mtcars$mpg)
[1] 33.9
> 
> set.seed(2)
> fit1 <- lm(mpg~., data=mtcars) #OLS fit
> fit2 <- randomForest(mpg~., data=mtcars) #random forest fit
> 
> #Hypothetical car that should have very high mpg
> hypCar <- data.frame(cyl=4, disp=50, hp=40, drat=5.5, wt=1, qsec=24, vs=1, am=1, gear=4, carb=1)
> 
> predict(fit1, hypCar) #OLS predicts higher mpg than max(mtcars$mpg)
      1 
37.2441 
> predict(fit2, hypCar) #RF does not predict higher mpg than max(mtcars$mpg)
       1 
30.78899

r random-forest

— Gaurav Bansal

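Is it common for people to refer to linear regression as OLS? I've always thought of OLS as an estimation method rather than a model.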

— 浩业

1

I believe OLS is the default estimation method for linear regression, at least in R.

— Gaurav Bansal

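For regression trees and random forests, the prediction is the average of the training targets in the corresponding leaf node. It therefore cannot be larger than the largest target value in the training data.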

— 杰森

1

I agree, but at least three other users have already answered it.

— HelloWorld

12

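As already mentioned in previous answers, regression trees and random forests for regression do not produce the predictions you would expect for data points outside the range of the training data, because they cannot extrapolate (well). A regression tree consists of a hierarchy of nodes, where each internal node specifies a test to carry out on an attribute value and each leaf (terminal) node specifies a rule for computing the predicted output. In your case, the test observation flows down each tree to a leaf node stating, e.g., "if x > 335, then y = 15", and the random forest then averages these leaf values across its trees.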
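To make the leaf rule concrete, here is a minimal sketch with a single regression tree. It assumes the `rpart` package (not used elsewhere in this thread) and an illustrative one-predictor model `mpg ~ hp`; the names `tree`, `leaf_means` and `new_car` are mine. It checks that the value stored in each leaf is the mean of the training targets that fall into it, and that far-out-of-range inputs simply land in an outermost leaf.

```r
library(rpart)

data(mtcars)

# Fit one regression tree on a single predictor (illustrative choice)
tree <- rpart(mpg ~ hp, data = mtcars, method = "anova")

# tree$where maps each training row to the row of tree$frame (its leaf)
leaf_id <- tree$where

# Mean of the training target inside each leaf
leaf_means <- tapply(mtcars$mpg, leaf_id, mean)

# Value stored in the corresponding leaves of the fitted tree
stored <- tree$frame$yval[as.integer(names(leaf_means))]
print(cbind(leaf_mean = leaf_means, stored_value = stored))

# Inputs far outside the observed hp range still get a leaf mean
new_car <- data.frame(hp = c(1, 1000))
pred <- predict(tree, new_car)
print(pred)
print(range(mtcars$mpg))
```

Because every prediction is one of these leaf means, it has to stay inside `range(mtcars$mpg)` no matter how extreme the input is.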

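Here is an R script that visualizes the situation with both a random forest and linear regression. In the random forest's case, the predictions are constant for test points that lie below the lowest training x-value or above the highest training x-value.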

library(datasets)
library(randomForest)
library(ggplot2)
library(ggthemes)

# Import mtcars (Motor Trend Car Road Tests) dataset
data(mtcars)

# Define training data
train_data = data.frame(
    x = mtcars$hp,  # Gross horsepower
    y = mtcars$qsec)  # 1/4 mile time

# Train random forest model for regression
random_forest <- randomForest(x = matrix(train_data$x),
                              y = matrix(train_data$y), ntree = 20)
# Train linear regression model using ordinary least squares (OLS) estimator
linear_regr <- lm(y ~ x, train_data)

# Create testing data
test_data = data.frame(x = seq(0, 400))

# Predict targets for testing data points
test_data$y_predicted_rf <- predict(random_forest, matrix(test_data$x)) 
test_data$y_predicted_linreg <- predict(linear_regr, test_data)

# Visualize
ggplot2::ggplot() + 
    # Training data points
    ggplot2::geom_point(data = train_data, size = 2,
                        ggplot2::aes(x = x, y = y, color = "Training data")) +
    # Random forest predictions
    ggplot2::geom_line(data = test_data, size = 2, alpha = 0.7,
                       ggplot2::aes(x = x, y = y_predicted_rf,
                                    color = "Predicted with random forest")) +
    # Linear regression predictions
    ggplot2::geom_line(data = test_data, size = 2, alpha = 0.7,
                       ggplot2::aes(x = x, y = y_predicted_linreg,
                                    color = "Predicted with linear regression")) +
    # Hide legend title, change legend location and add axis labels
    ggplot2::theme(legend.title = element_blank(),
                   legend.position = "bottom") + labs(y = "1/4 mile time",
                                                      x = "Gross horsepower") +
    ggthemes::scale_colour_colorblind()
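If you prefer a numeric check to a plot, the following sketch tests the "constant outside the training range" claim directly. It is a stand-alone variant of the setup above, not the same objects: the seed is arbitrary, and the names `rf`, `x_test`, `below` and `above` are mine.

```r
library(randomForest)

data(mtcars)
set.seed(1)  # arbitrary seed, only for reproducibility

# Same variables as in the script: hp as predictor, qsec as target
rf <- randomForest(x = data.frame(hp = mtcars$hp), y = mtcars$qsec,
                   ntree = 20)

x_test <- seq(0, 400)
pred <- predict(rf, data.frame(hp = x_test))

# Predictions for inputs below the smallest / above the largest training hp
below <- pred[x_test < min(mtcars$hp)]
above <- pred[x_test > max(mtcars$hp)]

# Each side collapses to a single value: every such point follows the
# same path to the outermost leaf in every tree
print(length(unique(below)))
print(length(unique(above)))
```

Both counts should be 1, because no split threshold can lie outside the observed hp values, so all points beyond the range are routed identically.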

— Tuomastik

16

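Unlike OLS, a random forest cannot extrapolate. The reason is simple: a random forest's prediction is the average of the results obtained from its individual trees, and each tree outputs the mean of the training samples in the terminal node (leaf) that the observation reaches. The result cannot fall outside the range of the training targets, because an average always lies within the range of the values it is computed from.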

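In other words, an average cannot be larger (or smaller) than every one of its samples, and random forest regression is built entirely on averaging.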
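You can check this on the model from the question. The sketch below refits it and uses the `predict.all = TRUE` option of `predict()` for `randomForest` objects to look at every tree's individual prediction for the hypothetical car. The small tolerance `tol` is my own guard against floating-point rounding, not part of the method.

```r
library(randomForest)

data(mtcars)
set.seed(2)  # seed from the question; the conclusion does not depend on it
fit <- randomForest(mpg ~ ., data = mtcars)

hypCar <- data.frame(cyl = 4, disp = 50, hp = 40, drat = 5.5, wt = 1,
                     qsec = 24, vs = 1, am = 1, gear = 4, carb = 1)

# predict.all = TRUE also returns the prediction of each individual tree
p <- predict(fit, hypCar, predict.all = TRUE)

y_range <- range(mtcars$mpg)
tol <- 1e-8  # guard against floating-point rounding

# Every tree predicts a leaf mean, so each one lies inside the target range
trees_in_range <- all(p$individual >= y_range[1] - tol &
                      p$individual <= y_range[2] + tol)
print(trees_in_range)

# The forest prediction is the mean over trees, so it is bounded as well
print(isTRUE(all.equal(unname(p$aggregate), mean(p$individual))))
print(unname(p$aggregate) <= y_range[2] + tol)
```

All three checks should print `TRUE`: averages of leaf means stay between the smallest and largest training mpg.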

— 萤火虫

11

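Decision trees and random forests cannot extrapolate beyond the training data. OLS can, but such predictions should be treated with caution, because the pattern it identified may not continue outside the observed range.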
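As a small illustration of why caution is needed, here is a sketch with a deliberately simple model, `mpg ~ hp`, which is my choice for the example and not the model from the question. The value hp = 500 is a hypothetical input well above anything in mtcars.

```r
data(mtcars)

# Simple one-predictor OLS fit (illustrative model)
fit <- lm(mpg ~ hp, data = mtcars)

# Largest horsepower actually observed in the training data
print(max(mtcars$hp))

# Hypothetical car far outside the observed range
new_car <- data.frame(hp = 500)
pred <- predict(fit, new_car)
print(pred)
```

The fitted line keeps falling past the last observation, so the prediction for this car is a negative mpg, which is physically impossible. A random forest would instead return roughly the average mpg of the highest-horsepower training cars, which is bounded but not necessarily right either.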

— 弗罗斯特