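Implementation of nested cross-validation

I'm trying to figure out whether my understanding of nested cross-validation is correct, so I wrote this toy example to check: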

import operator
import numpy as np
from sklearn import ensemble
from sklearn import model_selection
from sklearn.datasets import load_boston

# set random state
state = 1

# load boston dataset
# (load_boston was removed in scikit-learn 1.2, so this needs an older release)
boston = load_boston()

X = boston.data
y = boston.target

outer_scores = []

# outer cross-validation
outer = model_selection.KFold(n_splits=3, shuffle=True, random_state=state)
for fold, (train_index_outer, test_index_outer) in enumerate(outer.split(X)):
    X_train_outer, X_test_outer = X[train_index_outer], X[test_index_outer]
    y_train_outer, y_test_outer = y[train_index_outer], y[test_index_outer]

    inner_mean_scores = []

    # define explored parameter space.
    # procedure below should be equal to GridSearchCV
    tuned_parameter = [1000, 1100, 1200]
    for param in tuned_parameter:

        inner_scores = []

        # inner cross-validation
        inner = model_selection.KFold(n_splits=3, shuffle=True, random_state=state)
        for train_index_inner, test_index_inner in inner.split(X_train_outer):
            # split the training data of outer CV
            X_train_inner, X_test_inner = X_train_outer[train_index_inner], X_train_outer[test_index_inner]
            y_train_inner, y_test_inner = y_train_outer[train_index_inner], y_train_outer[test_index_inner]

            # fit extremely randomized trees regressor to training data of inner CV
            clf = ensemble.ExtraTreesRegressor(n_estimators=param, n_jobs=-1, random_state=1)
            clf.fit(X_train_inner, y_train_inner)
            # score() returns R^2 for a regressor, so higher is better
            inner_scores.append(clf.score(X_test_inner, y_test_inner))

        # calculate mean score for inner folds
        inner_mean_scores.append(np.mean(inner_scores))

    # get maximum score index
    index, value = max(enumerate(inner_mean_scores), key=operator.itemgetter(1))

    print('Best parameter of %i fold: %i' % (fold + 1, tuned_parameter[index]))

    # fit the selected model to the training set of outer CV
    # for prediction error estimation
    clf2 = ensemble.ExtraTreesRegressor(n_estimators=tuned_parameter[index], n_jobs=-1, random_state=1)
    clf2.fit(X_train_outer, y_train_outer)
    outer_scores.append(clf2.score(X_test_outer, y_test_outer))

# show the prediction error estimate produced by nested CV
print('Unbiased prediction error: %.4f' % (np.mean(outer_scores)))

# finally, fit the selected model to the whole dataset
clf3 = ensemble.ExtraTreesRegressor(n_estimators=tuned_parameter[index], n_jobs=-1, random_state=1)
clf3.fit(X, y)

Any thoughts are appreciated.

cross-validation python scikit-learn

— abudis

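Could you also provide a plain-text version of your understanding of cross-validation, for those who don't read Python?

— gung - Reinstate Monica

scikit-learn's own version: scikit-learn.org/stable/auto_examples/model_selection/…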

— ayorgo
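
For readers who prefer the built-in tools, here is a minimal sketch of how nested CV can be written with scikit-learn itself (not necessarily what the linked example shows). It assumes a scikit-learn release that has sklearn.model_selection. The synthetic data from make_regression and the small n_estimators grid are stand-ins chosen so that it runs quickly; they are not the question's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# synthetic stand-in data (assumption: not the Boston data from the question)
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=1)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)

# inner loop: the grid search picks n_estimators on each outer training set
search = GridSearchCV(
    ExtraTreesRegressor(random_state=1),
    param_grid={"n_estimators": [10, 20, 40]},
    cv=inner_cv,
)

# outer loop: scores the whole "search, then refit" procedure on held-out folds
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print("Nested CV R^2: %.4f" % nested_scores.mean())
```

The point of passing the GridSearchCV object to cross_val_score is that the hyperparameter search is repeated inside every outer training fold, so the outer test fold never influences the choice.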

Answers:

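Oops, the code is wrong, but in a very subtle way!

a) Splitting the training set into an inner training set and an inner test set is fine.

b) The problem is the last two lines, which reflect a subtle misunderstanding of the purpose of nested cross-validation. The purpose of nested CV is not to select the parameters, but to obtain an unbiased estimate of the expected accuracy of your algorithm, in this case ensemble.ExtraTreesRegressor on this data, with the best hyperparameters, whatever they may be.

And this is what your code correctly computes, up to the line:

    print('Unbiased prediction error: %.4f' % (np.mean(outer_scores)))

It uses nested CV to compute an unbiased estimate of the model's performance. But notice that each pass of the outer loop may generate a different best hyperparameter, as you knew when you wrote the line:

    print('Best parameter of %i fold: %i' % (fold + 1, tuned_parameter[index]))

So now you need a standard CV loop, using folds over the whole dataset, to select the final best hyperparameter: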

# reuses np, operator, ensemble, X, y and state from the question's code
from sklearn import model_selection

tuned_parameter = [1000, 1100, 1200]
mean_scores = []
for param in tuned_parameter:

    scores = []

    # normal cross-validation
    kfolds = model_selection.KFold(n_splits=3, shuffle=True, random_state=state)
    for train_index, test_index in kfolds.split(X):
        # split the training data
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # fit extremely randomized trees regressor to training data
        clf2_5 = ensemble.ExtraTreesRegressor(n_estimators=param, n_jobs=-1, random_state=1)
        clf2_5.fit(X_train, y_train)
        scores.append(clf2_5.score(X_test, y_test))

    # calculate mean score for folds
    mean_scores.append(np.mean(scores))

# get maximum score index
index, value = max(enumerate(mean_scores), key=operator.itemgetter(1))

print('Best parameter : %i' % (tuned_parameter[index]))

This is your code, but with the references to inner removed.

Now the best parameter is tuned_parameter[index], and you can learn the final classifier clf3 as in your code.

— Jacques Wainer
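
The final step described in the answer above (a plain CV loop over the whole dataset, followed by a refit) can also be sketched with GridSearchCV, which refits the best setting on all the data when refit=True. This is an illustrative equivalent, not the answerer's code. The synthetic data and the small grid are assumptions made to keep it fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold

# synthetic stand-in data (assumption: not the Boston data from the question)
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=1)

# small illustrative grid; the question used [1000, 1100, 1200]
param_grid = {"n_estimators": [10, 20, 40]}
cv = KFold(n_splits=3, shuffle=True, random_state=1)

# ordinary (non-nested) CV over the full dataset selects the hyperparameter,
# then refit=True trains one final model on all of X, y with that setting
search = GridSearchCV(ExtraTreesRegressor(random_state=1), param_grid, cv=cv, refit=True)
search.fit(X, y)

print("Best parameter:", search.best_params_["n_estimators"])
final_model = search.best_estimator_  # plays the role of clf3
```

Note that search.best_score_ from this step is not the unbiased estimate; that number still has to come from the nested loop.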

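Thanks! I did consider that I might select different best parameters in different folds, but I didn't know how to choose the best one. stats.stackexchange.com/questions/65128/… - in the answer there, it is mentioned that selecting the best model out of the outer k models is actually not advisable. Maybe I'm still misunderstanding something, but I thought the idea of the inner CV loop is to select the best-performing model, and the idea of the outer CV loop is to estimate its performance. Could you provide the full modified code?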

— 2015

OK, I think I get it. I'd still like to see the full modified code, though. Thanks.

— 2015

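I'm confused by Jacques Wainer's answer, and I think it's worth clarifying. Is Wainer suggesting that the standard CV loop should follow the code provided in the original question, or that it should replace the initial "inner" part of the code? Thanks.

The standard CV loop follows the nested CV loop.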

— Jacques Wainer

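The first part computes an unbiased estimate of the error. If you are testing many different algorithms, you should perform only the first part for each of them, then select the algorithm with the lowest error, and only for that one perform the second part to select the hyperparameters. If you are set on using a single algorithm, the first part matters less, unless you want to tell your boss or client that your best prediction of the classifier's future error is x; in that case you must compute x using the first part, the nested CV.

— Jacques Wainer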

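To summarize Jacques' answer:

Nested CV is required for an unbiased error estimate of a model. We can compare the scores of different models in this way. Then, using that information, we can run a separate K-fold CV loop to tune the parameters of the selected model.

— Sharan Naribole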
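
The two-part workflow summarized above can be sketched as follows. This is only an illustration under stated assumptions: the second candidate (Ridge), both parameter grids, and the synthetic data are hypothetical choices that do not appear in the thread.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# synthetic stand-in data (assumption: not the Boston data from the question)
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=1)

# candidate algorithms with their grids (both are illustrative assumptions)
candidates = {
    "extra_trees": (ExtraTreesRegressor(random_state=1), {"n_estimators": [10, 20, 40]}),
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=3, shuffle=True, random_state=2)

# part 1: nested CV gives one performance estimate (mean R^2) per algorithm
nested_means = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=inner_cv)
    nested_means[name] = cross_val_score(search, X, y, cv=outer_cv).mean()

best_name = max(nested_means, key=nested_means.get)

# part 2: ordinary CV on the full data tunes only the winning algorithm
estimator, grid = candidates[best_name]
final_search = GridSearchCV(estimator, grid, cv=inner_cv).fit(X, y)
print(best_name, nested_means[best_name], final_search.best_params_)
```

Here nested_means[best_name] is the number you would report, while final_search.best_estimator_ is the model you would deploy.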