Parameter "stratify" from method "train_test_split" (scikit-learn)

94

I am trying to use train_test_split from the scikit-learn package, but I am having trouble with the stratify parameter. Here is the code:

from sklearn import cross_validation, datasets 

iris = datasets.load_iris()  # load the dataset before indexing into it
X = iris.data[:,:2]
y = iris.target

cross_validation.train_test_split(X,y,stratify=y)

However, I keep getting the following error:

raise TypeError("Invalid parameters passed: %s" % str(options))
TypeError: Invalid parameters passed: {'stratify': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])}

Does anyone know what is going on? Below is the function documentation.

[...]

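stratify : array-like or None (default is None)

If not None, data is split in a stratified fashion, using this as the labels array.

New in version 0.17: stratify splitting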

[...]

— Daneel Olivaw

No, it is all resolved.

— Daneel Olivaw

57

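Scikit-Learn is just telling you that it does not recognise the argument "stratify", not that you are using it incorrectly. This is because the parameter was added in version 0.17, as indicated in the documentation you quoted.

So you just need to update Scikit-Learn.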
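
A quick way to confirm this is to compare the installed version against 0.17. The following is only a sketch: `supports_stratify` is a hypothetical helper written for illustration, not part of scikit-learn, and it relies on the standard-library `importlib.metadata` module (Python 3.8+).

```python
import re
from importlib import metadata


def supports_stratify(version):
    """Return True if a scikit-learn version string is 0.17 or newer.

    Hypothetical helper: only the leading "major.minor" part is compared.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 17)


# Look up the installed distribution without importing sklearn itself.
try:
    installed = metadata.version("scikit-learn")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("scikit-learn is not installed")
else:
    print(installed, "supports stratify:", supports_stratify(installed))
```

If the check reports an older version, upgrading through whichever package manager installed scikit-learn (pip or conda) should make the parameter available.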

— Borja

Even though I have version 0.21.2 of scikit-learn, I get the same error. scikit-learn 0.21.2 py37h2a6a0b8_0 conda-forge

— Kareem Jeiroudi

325

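The stratify parameter makes the split so that the proportion of values in each resulting sample matches the proportion of values in the array passed to stratify.

For example, if the variable y is a binary categorical variable with values 0 and 1, and 25% of the values are zeros and 75% are ones, stratify=y will make sure that your random split also has 25% zeros and 75% ones.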
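
To make the idea concrete, here is a minimal pure-Python sketch of a stratified split. It is an illustration of the principle only, not scikit-learn's implementation; `stratified_split` and its parameters are names made up for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that each class keeps its proportion.

    Illustrative sketch: indices are grouped by class, shuffled within each
    class, and the same fraction of every class is sent to the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 25% zeros and 75% ones, as in the example above.
y = [0] * 25 + [1] * 75
train_idx, test_idx = stratified_split(y, test_fraction=0.2)

train_counts = Counter(y[i] for i in train_idx)
test_counts = Counter(y[i] for i in test_idx)
print(train_counts)  # 20 zeros and 60 ones
print(test_counts)   # 5 zeros and 15 ones
```

Both subsets keep the 25% / 75% ratio exactly, whatever the seed, because the split is made class by class rather than over the whole array at once.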

— Fazzolini

117

This does not really answer the question, but it is very useful for understanding how it works. Thanks a lot.

— Reed Jessen

6

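I still have a hard time understanding why this stratification is necessary: if there is class imbalance in the data, won't it be preserved on average when the data is split randomly?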

— Holger Brandl

14

@HolgerBrandl It will be preserved on average; with stratify, it is guaranteed to be preserved.

— Yonatan

7

@HolgerBrandl With a very small or very imbalanced dataset, it is quite possible that a random split will eliminate a class entirely from one of the splits.
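
The size of that risk can be worked out directly. The sketch below assumes a plain, non-stratified split that picks the test set uniformly at random; the dataset sizes are made up for illustration.

```python
from math import comb


def prob_class_missing_from_test(n_total, n_minority, n_test):
    """Probability that a uniformly random test set of size n_test
    contains none of the n_minority minority-class samples."""
    return comb(n_total - n_minority, n_test) / comb(n_total, n_test)


# Hypothetical tiny dataset: 20 samples, only 2 in the minority class,
# and a test set of 5 samples.
p = prob_class_missing_from_test(n_total=20, n_minority=2, n_test=5)
print(f"{p:.3f}")  # 0.553
```

Under these assumptions, more than half of all random splits leave the test set without a single minority-class sample.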

— cddt

1

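@HolgerBrandl Good question! Maybe we could add that, first, you must split into training and test sets using stratify. Second, to correct the imbalance you will eventually need to oversample or undersample the training set. Many Sklearn classifiers have a parameter called class_weight that you can set to "balanced". Finally, for an imbalanced dataset you could also use a metric that is more appropriate than accuracy. Try F1 or the area under the ROC curve.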
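
For reference, the "balanced" setting weights each class by n_samples / (n_classes * class_count), which is the heuristic given in the scikit-learn documentation. The sketch below reproduces that formula in plain Python; `balanced_class_weights` is a hypothetical helper for illustration, not a scikit-learn function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# A training set that kept the 25% / 75% ratio after a stratified split.
y_train = [0] * 20 + [1] * 60
weights = balanced_class_weights(y_train)
print(weights)  # the rarer class 0 gets weight 2.0, class 1 about 0.67
```

The rarer class gets the larger weight, so errors on it count for more during training.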

— Claude COULOMBE

62

For my future self who comes here via Google:

train_test_split is now in model_selection, hence:

from sklearn.model_selection import train_test_split

# given:
# features: xs
# ground truth: ys

x_train, x_test, y_train, y_test = train_test_split(xs, ys,
                                                    test_size=0.33,
                                                    random_state=0,
                                                    stratify=ys)

is the way to use it. Setting random_state is desirable for reproducibility.

— Martin Thoma

This should be the answer :) Thanks

— SwimBikeRun

15

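In this context, stratification means that the train_test_split method returns training and test subsets that have the same proportions of class labels as the input dataset.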

— X. Wang

3

Try running this code, it "just works":

from sklearn import cross_validation, datasets 

iris = datasets.load_iris()

X = iris.data[:,:2]
y = iris.target

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X,y,train_size=.8, stratify=y)

y_test

array([0, 0, 0, 0, 2, 2, 1, 0, 1, 2, 2, 0, 0, 1, 0, 1, 1, 2, 1, 2, 0, 2, 2,
       1, 2, 1, 1, 0, 2, 1])
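
On current scikit-learn releases the sklearn.cross_validation module no longer exists (it was deprecated in 0.18 and removed in 0.20), so the snippet above only runs on old versions. A rough equivalent for newer releases, assuming scikit-learn and NumPy are installed, could look like this; the random_state value is an arbitrary choice added for reproducibility.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# Same split as above, using the model_selection module.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

# Count how many samples of each class ended up in each subset.
print(np.bincount(y_train))  # [40 40 40]
print(np.bincount(y_test))   # [10 10 10]
```

Iris has 50 samples per class, so an 80/20 stratified split keeps the three classes equally represented in both subsets, matching the ten samples per class visible in the y_test output above.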

— Sergey Bushmanov

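@user5767535 As you can see, it works on my Ubuntu machine with sklearn version '0.17' and the Anaconda distribution for Python 3.5. I can only suggest checking once more that you entered the code correctly and updating your software.

— Sergey Bushmanov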

2

@user5767535 BTW, "New in version 0.17: stratify splitting" makes me almost sure that you have to update your sklearn...

— Sergey Bushmanov, 2016