如何分割数据集以进行交叉验证,学习曲线和最终评估?
分割数据集的合适策略是什么? 我要求反馈对以下方法(不是像个别参数test_size或n_iter,但如果我用X,y,X_train,y_train,X_test,和y_test适当的,如果顺序是有道理的): (从scikit-learn文档扩展此示例) 1.加载数据集 from sklearn.datasets import load_digits digits = load_digits() X, y = digits.data, digits.target 2.分为训练和测试集(例如80/20) from sklearn.cross_validation import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) 3.选择估算器 from sklearn.svm import SVC estimator = SVC(kernel='linear') 4.选择交叉验证迭代器 from sklearn.cross_validation import ShuffleSplit cv = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2, random_state=0) 5.调整超参数 …