Train/test/validation set split in Sklearn

58

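How can I randomly split a data matrix and the corresponding label vector into X_train, X_test, X_val, y_train, y_test, y_val with Sklearn? As far as I know, sklearn.cross_validation.train_test_split can only split into two sets, not three...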

machine-learning scikit-learn

— Hendrik
source

79

You can simply use sklearn.model_selection.train_test_split twice. First split into train and test, then split train again into validation and train. Something like this:

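from sklearn.model_selection import train_test_split

# First split: hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Second split: hold out 20% of the remaining 80% (16% of the total) for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=1)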

— hh32
source

1

Yes, this works of course, but I was hoping for something more elegant ;) Never mind, I accept this answer.

— Hendrik

1

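I wanted to add that if you want to use the validation set to search for the best hyper-parameters, you can do the following after the split: gist.github.com/albertotb/1bad123363b186267e3aeaa26610b54b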
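One way to run such a search in scikit-learn against a fixed validation set is PredefinedSplit. The following is only a sketch and not necessarily what the linked gist does; the synthetic data, the LogisticRegression model and the C grid are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Synthetic stand-in data (assumption): 200 samples.
X, y = make_classification(n_samples=200, random_state=0)

# 60/20/20 split obtained with two calls to train_test_split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

# -1 marks rows that are always used for training;
# 0 marks the rows of the single validation fold.
test_fold = np.concatenate([
    np.full(len(X_train), -1),
    np.zeros(len(X_val), dtype=int),
])
X_search = np.concatenate([X_train, X_val])
y_search = np.concatenate([y_train, y_val])

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # hypothetical grid
    cv=PredefinedSplit(test_fold),
)
search.fit(X_search, y_search)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy on the untouched test set
```

Note that with the default refit=True, the best model is refit on the training and validation rows together before it is scored on the test set.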

— skd

12

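So what is the final train, test, validation ratio in this example? The second train_test_split is applied to the 80% training portion of the previous 80/20 split, so your val is 20% of 80%. The split proportions aren't very straightforward this way.

— Monica Heddneck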

1

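I agree with @Monica Heddneck that the resulting split of 64% train, 16% validation and 20% test could be clearer. It's an annoying inference you have to make with this solution.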
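The inference can be spelled out with a little arithmetic. This is a minimal sketch; the helper name chained_split_fractions is hypothetical and only illustrates how two chained test_size values translate into fractions of the whole dataset.

```python
def chained_split_fractions(test_size, val_size_of_remainder):
    """Return (train, val, test) as fractions of the whole dataset
    when a second split is applied to what remains after the first."""
    test = test_size
    remainder = 1.0 - test_size               # left after the test split
    val = remainder * val_size_of_remainder   # second split acts on the remainder
    train = remainder - val
    return train, val, test


train, val, test = chained_split_fractions(0.2, 0.2)
print(f"train={train:.2f}, val={val:.2f}, test={test:.2f}")
# train=0.64, val=0.16, test=0.20
```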

— Perry

32

There is a great answer to this question on SO that uses numpy and pandas.

The command (see the answer for the discussion):

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

produces a 60%, 20%, 20% split for the training, validation and test sets.
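Depending on the pandas version, passing a DataFrame directly to np.split may emit a deprecation warning. A sketch of the same 60/20/20 split using plain iloc slicing is shown here; the toy DataFrame and the random_state value are assumptions added for illustration and reproducibility.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df.
df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) % 2})

# Shuffle all rows once.
shuffled = df.sample(frac=1, random_state=0)

# Cut points at 60% and 80% of the rows, as in the one-liner.
n = len(shuffled)
i_train, i_val = int(0.6 * n), int(0.8 * n)

train = shuffled.iloc[:i_train]
validate = shuffled.iloc[i_train:i_val]
test = shuffled.iloc[i_val:]

print(len(train), len(validate), len(test))  # 6 2 2
```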

— 0_0
source

2

I can see that .6 means 60%... but what does the .8 mean?

— Tom Hale

1

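@TomHale np.split will split at 60% of the length of the shuffled array, then at 80% of the length (which is an additional 20% of the data), leaving the remaining 20% of the data. This follows from the definition of the function. You can test/play with: x = np.arange(10.0), followed by np.split(x, [ int(len(x)*0.6), int(len(x)*0.8)])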
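Written out as a runnable snippet, the suggested experiment looks like this (the variable names cut_points, train, validate and test are only illustrative):

```python
import numpy as np

x = np.arange(10.0)

# The list holds the indices at which to cut, not the sizes of the pieces.
cut_points = [int(len(x) * 0.6), int(len(x) * 0.8)]  # [6, 8]

train, validate, test = np.split(x, cut_points)

print(train)     # [0. 1. 2. 3. 4. 5.]
print(validate)  # [6. 7.]
print(test)      # [8. 9.]
```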

— 0_0

3

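Most often you will not split the data just once. In a first step you split your data into a training and a test set. Subsequently you perform a parameter search that incorporates more complex splits, such as cross-validation with a 'split k-fold' or 'leave-one-out (LOO)' algorithm.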
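A minimal sketch of that workflow, assuming synthetic data and a LogisticRegression model purely for illustration: hold out a test set first, then cross-validate on the training portion only, with k-fold and with leave-one-out.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    KFold, LeaveOneOut, cross_val_score, train_test_split)

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=100, random_state=0)

# Step 1: hold out a test set that the cross-validation never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

model = LogisticRegression(max_iter=1000)

# Step 2: cross-validate on the training portion only.
kfold_scores = cross_val_score(
    model, X_train, y_train,
    cv=KFold(n_splits=5, shuffle=True, random_state=1))
loo_scores = cross_val_score(model, X_train, y_train, cv=LeaveOneOut())

print(len(kfold_scores), len(loo_scores))  # 5 80
print(kfold_scores.mean(), loo_scores.mean())

# Step 3: fit on the full training set and evaluate once on the test set.
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

In a real parameter search the cross-validation scores would be compared across candidate settings, and the test set would be used only for the final evaluation.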

— JLT
source

3

You can use train_test_split twice. I think this is the most straightforward approach.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

This way, the train, val and test sets will be 60%, 20% and 20% of the dataset respectively.
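The resulting sizes can be checked on a toy dataset. In this sketch the 100-row arrays are an assumption made so that the percentages are easy to read off.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 samples with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# 20% of 100 rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# 25% of the remaining 80 rows (20% of the total) go to the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```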

— Jung
source

2

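The best answer above does not mention that splitting twice with train_test_split without adjusting the partition sizes will not give the partition you initially intended: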

x_train, x_remain = train_test_split(x, test_size=(val_size + test_size))

The fractions of the validation and test sets within x_remain then change and can be computed as

new_test_size = np.around(test_size / (val_size + test_size), 2)
# To preserve (new_test_size + new_val_size) = 1.0 
new_val_size = 1.0 - new_test_size

x_val, x_test = train_test_split(x_remain, test_size=new_test_size)

In this case, all the initial partition sizes are preserved.

— Ametov
source

1

Here's another approach (it assumes an equal three-way split):

# randomly shuffle the dataframe
df = df.reindex(np.random.permutation(df.index))

# how many records is one-third of the entire dataframe
third = int(len(df) / 3)

# Training set (the top third from the entire dataframe)
train = df[:third]

# Testing set (top half of the remainder two third of the dataframe)
test = df[third:][:third]

# Validation set (bottom one third)
valid = df[-third:]

This could be made more concise, but I kept it verbose for explanation purposes.

— Vishal
source

0

Given train_frac=0.8, this function creates an 80% / 10% / 10% split:

import sklearn.model_selection

def data_split(examples, labels, train_frac, random_state=None):
    ''' https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
    param examples:   Data to be split
    param labels:     Labels corresponding to the examples
    param train_frac: Ratio of train set to whole dataset

    Randomly split dataset, based on these ratios:
        'train': train_frac
        'valid': (1-train_frac) / 2
        'test':  (1-train_frac) / 2

    Eg: passing train_frac=0.8 gives a 80% / 10% / 10% split
    '''

    assert train_frac >= 0 and train_frac <= 1, "Invalid training set fraction"

    X_train, X_tmp, Y_train, Y_tmp = sklearn.model_selection.train_test_split(
                                        examples, labels, train_size=train_frac, random_state=random_state)

    X_val, X_test, Y_val, Y_test   = sklearn.model_selection.train_test_split(
                                        X_tmp, Y_tmp, train_size=0.5, random_state=random_state)

    return X_train, X_val, X_test,  Y_train, Y_val, Y_test

— Tom Hale
source

0

Adding to @hh32's answer, while respecting any predefined proportions such as (75, 15, 10):

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

# train is now 75% of the entire data set
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)

# test is now 10% of the initial data set
# validation is now 15% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio)) 

print(x_train, x_val, x_test)

— Andrei Florea
source

0

An extension of @hh32's answer that preserves the ratios.

# Defines ratios, w.r.t. whole dataset.
ratio_train = 0.8
ratio_val = 0.1
ratio_test = 0.1

# Produces test split.
x_remaining, x_test, y_remaining, y_test = train_test_split(
    x, y, test_size=ratio_test)

# Adjusts val ratio, w.r.t. remaining dataset.
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining

# Produces train and val splits.
x_train, x_val, y_train, y_val = train_test_split(
    x_remaining, y_remaining, test_size=ratio_val_adjusted)

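Since the remaining dataset is reduced after the first split, new ratios with respect to the reduced dataset must be calculated by solving the equation: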

$R_{remaining} \cdot R_{new} = R_{old}$
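
As a worked instance of this equation, using the ratios from the code above (validation is $R_{old} = 0.1$ of the whole dataset, and $R_{remaining} = 1 - 0.1 = 0.9$ is left after the test split):

$R_{new} = \frac{R_{old}}{R_{remaining}} = \frac{0.1}{0.9} \approx 0.111$

so passing roughly 0.111 as test_size in the second split yields a validation set of about 10% of the original data.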

— Jorge Barrios
source