Random forest probabilistic prediction vs. majority vote

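Scikit-learn seems to use probabilistic prediction instead of majority vote as its model aggregation technique, without explaining why (1.9.2.1. Random Forests).

Is there a clear explanation for why? Also, is there a good paper or review article on the various model aggregation techniques that can be used for random forest bagging?

Thanks!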

— username

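If you're proficient in Python, questions like this are always best answered by looking at the code.

RandomForestClassifier.predict, at least in the current version 0.16.1, predicts the class with the highest probability estimate, as given by predict_proba. (this line)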
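As a quick check of that behaviour, here is a minimal sketch. It assumes a recent scikit-learn and uses a synthetic dataset from make_classification; neither the data nor the variable names come from the original post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, chosen only for illustration.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# predict should pick the class whose averaged probability is largest.
proba = clf.predict_proba(X)
from_proba = clf.classes_[np.argmax(proba, axis=1)]
same_as_predict = np.array_equal(from_proba, clf.predict(X))
print(same_as_predict)
```

If predict really is the argmax of predict_proba, same_as_predict should be True.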

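The documentation for predict_proba says:

The predicted class probabilities of an input sample is computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf.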
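The averaging described in that docstring can be reproduced by hand. This is only a sketch, again assuming a recent scikit-learn and made-up synthetic data; it averages each tree's predict_proba through the public estimators_ attribute and compares the result with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, chosen only for illustration.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the per-tree class probabilities by hand:
# shape (n_trees, n_samples, n_classes) -> (n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_mean = tree_probas.mean(axis=0)

matches = np.allclose(manual_mean, forest.predict_proba(X))
print(matches)
```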

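The difference from the original method is probably just so that predict gives predictions consistent with predict_proba. The result is sometimes called "soft voting", as opposed to the "hard" majority vote used in the original Breiman paper. I couldn't find a proper comparison of the performance of the two methods in a quick search, but both seem fairly reasonable in this situation.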
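To make the distinction concrete, here is a toy illustration with invented numbers (three hypothetical trees, two classes, one sample) in which the two schemes disagree: one very confident tree outweighs two mildly confident ones under soft voting, but is outvoted under hard voting.

```python
from collections import Counter

# Hypothetical leaf probabilities from three trees for one sample: (P(A), P(B)).
tree_probas = [(0.9, 0.1), (0.4, 0.6), (0.4, 0.6)]
classes = ("A", "B")

# Hard voting: each tree casts one vote for its most probable class.
votes = [classes[p.index(max(p))] for p in tree_probas]
hard = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities, then take the most probable class.
mean_proba = [sum(col) / len(tree_probas) for col in zip(*tree_probas)]
soft = classes[mean_proba.index(max(mean_proba))]

print(hard, soft)  # B A
```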

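The predict documentation is at best quite misleading; I've submitted a pull request to fix it.

If you want majority-vote prediction instead, here's a function to do it. Call it as predict_majvote(clf, X) rather than clf.predict(X). (Based on predict_proba; only lightly tested, but I think it should work.)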

import numpy as np
from scipy.stats import mode
# Note: the private import paths below match scikit-learn 0.16.x;
# they have been moved or removed in later releases.
from sklearn.ensemble.forest import _partition_estimators, _parallel_helper
from sklearn.tree._tree import DTYPE
from sklearn.externals.joblib import Parallel, delayed
from sklearn.utils import check_array
from sklearn.utils.validation import check_is_fitted

def predict_majvote(forest, X):
    """Predict class for X.

    Uses majority voting, rather than the soft voting scheme
    used by RandomForestClassifier.predict.

    Parameters
    ----------
    forest : fitted RandomForestClassifier
        The forest whose trees cast the votes.
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    Returns
    -------
    y : array of shape = [n_samples] or [n_samples, n_outputs]
        The predicted classes.
    """
    check_is_fitted(forest, 'n_outputs_')

    # Check data
    X = check_array(X, dtype=DTYPE, accept_sparse="csr")

    # Assign chunk of trees to jobs
    n_jobs, n_trees, starts = _partition_estimators(forest.n_estimators,
                                                    forest.n_jobs)

    # Parallel loop
    all_preds = Parallel(n_jobs=n_jobs, verbose=forest.verbose,
                         backend="threading")(
        delayed(_parallel_helper)(e, 'predict', X, check_input=False)
        for e in forest.estimators_)

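    # Reduce: take the most common prediction across trees for each sample.
    modes, counts = mode(all_preds, axis=0)
    # Trees predict encoded class indices as floats; drop any leading axis
    # that mode keeps and cast to int so the values can index classes_.
    modes = modes.reshape(all_preds[0].shape).astype(int)

    if forest.n_outputs_ == 1:
        return forest.classes_.take(modes, axis=0)
    else:
        n_samples = all_preds[0].shape[0]
        # In the multi-output case classes_ is a list, one array per output.
        preds = np.zeros((n_samples, forest.n_outputs_),
                         dtype=forest.classes_[0].dtype)
        for k in range(forest.n_outputs_):
            preds[:, k] = forest.classes_[k].take(modes[:, k], axis=0)
        return preds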

On the dumb synthetic case I tried, the predictions agreed with the predict method every time.
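For readers on a newer scikit-learn, where the private helpers imported above are no longer available, a rough single-output equivalent can be sketched with the public API only. The name predict_majvote_simple and the synthetic data are illustrative assumptions, not part of the original answer. It takes each tree's vote as the argmax of that tree's predict_proba, whose columns are assumed to line up with forest.classes_ because the forest averages them directly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def predict_majvote_simple(forest, X):
    """Hard-vote prediction for a fitted single-output forest (illustrative)."""
    X = np.asarray(X)
    n_classes = len(forest.classes_)
    # One vote per tree and sample, as a class index: (n_trees, n_samples).
    votes = np.stack([tree.predict_proba(X).argmax(axis=1)
                      for tree in forest.estimators_])
    # Count votes per class for every sample: (n_classes, n_samples).
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    # Ties go to the lowest class index.
    return forest.classes_[counts.argmax(axis=0)]

# Synthetic data, chosen only for illustration.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

hard = predict_majvote_simple(forest, X)
agreement = np.mean(hard == forest.predict(X))
print(agreement)
```

The printed value is the fraction of samples on which hard and soft voting agree; it is not guaranteed to be exactly 1, since the two schemes can differ on close calls.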

— Dougal

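Great answer, Dougal! Thanks for taking the time to explain this so carefully. Please consider heading over to Stack Overflow and answering this question there as well.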

— 2015

There is also a paper, here, that addresses probabilistic prediction.

— user1745038, 2015