How to split/partition a dataset into training and test datasets, e.g. for cross validation?

99

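What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in Matlab.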

— 埃里克

125

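If you want to split the dataset once into two parts, you can use numpy.random.shuffle (which shuffles in place), or numpy.random.permutation if you need to keep track of the indices: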

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

or

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]

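There are many ways to repeatedly partition the same dataset for cross validation. One strategy is to resample from the dataset with replacement: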

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]

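Finally, sklearn contains several cross-validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some feature, for example to make sure that the proportion of positive and negative examples is the same in the training and test sets.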

— 柏克
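As an illustration of the sklearn cross-validation methods mentioned in this answer, here is a minimal sketch using KFold and StratifiedKFold from sklearn.model_selection. The toy arrays X and y and the fold count of 5 are assumptions made up for the example, not part of the answer.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data (an assumption for illustration): 20 samples, 2 balanced classes
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Plain k-fold: every sample lands in exactly one test fold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_test_sizes = [len(test_idx) for _, test_idx in kf.split(X)]

# Stratified k-fold: each test fold keeps the 50/50 class ratio of y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]

print(kfold_test_sizes)
print(positives_per_fold)
```

With 20 samples and 5 folds, each test fold holds 4 samples, and the stratified variant puts 2 positives in every fold.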

13

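Thanks for these solutions. But doesn't the last method, using randint, have a good chance of giving the same indices for both the test and training sets?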

— ggauravr

3

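The second solution is a valid answer, while the first and third are not. For the first solution, shuffling the dataset is not always an option; in many cases you have to keep the order of the data inputs. And the third one can very well produce the same indices for test and training (as pointed out by @ggauravr).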

— pedram bashiri

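You should not resample for your cross-validation set. The whole idea is that the CV set has never been seen by your algorithm. The training and test sets are used to fit the data, so of course you will get good results if you include those in your CV set. I want to upvote this answer because the second solution is what I needed, but this answer has problems.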

— RubberDuck
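The concern raised in these comments can be checked directly. The sketch below compares sampling with replacement against slicing a single permutation; the seeded generator and the sizes (100 rows, 80/20) are assumptions chosen to mirror the answer's example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, used here only for reproducibility
n = 100

# Sampling with replacement (same idea as numpy.random.randint in the answer)
train_r = rng.integers(n, size=80)
test_r = rng.integers(n, size=20)
shared_with_replacement = np.intersect1d(train_r, test_r)

# Slicing one permutation: the two index sets are disjoint by construction
perm = rng.permutation(n)
train_p, test_p = perm[:80], perm[80:]
shared_permutation = np.intersect1d(train_p, test_p)

print(len(np.unique(train_r)), "distinct training rows out of 80 draws")
print(len(shared_with_replacement), "rows appear in both sets (with replacement)")
print(len(shared_permutation), "rows appear in both sets (permutation)")
```

The permutation-based split never shares a row between the two sets, while the with-replacement draws repeat rows within the training set and usually leak some into the test set as well.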

55

Another option is to use scikit-learn. As described in scikit's wiki, you can just use the following instructions:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way you can keep the labels in sync with the data you are splitting into training and test sets.

— 保罗·马尔瓦尔

1

This is a very practical answer, because it handles both the training set and the labels realistically.

— chinnychinchin

38

Just a note. In case you want train, test, AND validation sets, you can do this:

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

These parameters will give 70% to training, and 15% each to the test and validation sets. Hope this helps.

— 灰白色
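To make the 70/15/15 arithmetic concrete, here is a small sketch of the same two-step split on placeholder data. The random arrays, the 100-sample size, and the random_state values are assumptions for illustration, and the held-back part is named x_rest here only to keep the two steps apart.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption): 100 samples, 4 features, binary labels
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# First cut: 70% train, 30% held back
x_train, x_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
# Second cut: split the held-back 30% evenly into test and validation
x_test, x_val, y_test, y_val = train_test_split(x_rest, y_rest, test_size=0.5, random_state=0)

print(len(x_train), len(x_test), len(x_val))
```

The second call works on 30 samples, so test_size=0.5 there corresponds to 15% of the full dataset.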

5

You should add this to your code: from sklearn.cross_validation import train_test_split, to make it clear which module you are using

— Radix

Does this have to be random?

— liang

That is, is it possible to split according to the given order of X and y?

— liang

1

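@liang No, it doesn't have to be random. You could just say the train, test, and validation set sizes will be fractions a, b, and c of the total dataset size. Let's say a=0.7, b=0.15, c=0.15, and d = dataset, N=len(dataset), then x_train = dataset[0:int(a*N)], x_test = dataset[int(a*N):int((a+b)*N)], and x_val = dataset[int((a+b)*N):].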

— offwhitelotus

1

Deprecated: stackoverflow.com/a/34844352/4237080, use from sklearn.model_selection import train_test_split

— briennakh
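The order-preserving split described in the comments above can be sketched as a small helper. The function name ordered_split and the 100-element toy array are hypothetical, introduced only for this example; the slicing follows the formula given in the comment.

```python
import numpy as np

def ordered_split(dataset, a=0.7, b=0.15):
    """Split `dataset` in its given order into train/test/validation parts.

    `a` and `b` are the train and test fractions; the remainder is validation.
    """
    n = len(dataset)
    train_end = int(a * n)
    test_end = int((a + b) * n)
    return dataset[:train_end], dataset[train_end:test_end], dataset[test_end:]

data = np.arange(100)  # toy data (an assumption)
x_train, x_test, x_val = ordered_split(data)
print(len(x_train), len(x_test), len(x_val))
```

Because the three slices share their cut points, every element ends up in exactly one part and the original order is kept.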

14

Since the sklearn.cross_validation module was deprecated, you can use:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

— 马海耶

5

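You may also consider a stratified split into training and test sets. A stratified split also generates the training and test sets randomly, but in such a way that the original class proportions are preserved. This makes the training and test sets better reflect the properties of the original dataset.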

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

y = np.array([1,1,2,2,3,3])
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]

— Apogentus
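For comparison, scikit-learn's train_test_split accepts a stratify argument that aims at the same class-preserving behaviour. The sketch below is not the answer's method, just a library alternative; the toy labels (three classes, four samples each) and random_state are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (an assumption): three classes with 4 samples each
y = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y asks for the class proportions of y in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# Count how many samples of classes 1, 2, 3 landed in each part
print(np.bincount(y_train)[1:], np.bincount(y_test)[1:])
```

Unlike get_train_test_inds, this returns the split arrays themselves rather than boolean masks.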

Thank you! The naming is somewhat misleading: value_inds are truly indices, but the output is not indices, only masks.

— greenoldman

1

I wrote a function for my own project to do this (it doesn't use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into equal sized chunks and returns them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomized, just shuffle the list before passing it in.

— 科林
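A short usage sketch of the shuffle-then-partition idea, assuming a list of ten items and five chunks. The partition function is rewritten compactly here so the example is self-contained; the fixed seed is only there to make the run repeatable.

```python
import random

def partition(seq, chunks):
    """Split seq into `chunks` interleaved lists of near-equal size."""
    return [list(seq[i::chunks]) for i in range(chunks)]

items = list(range(10))
random.seed(0)          # fixed seed, used here only to make the run repeatable
random.shuffle(items)   # shuffles in place, so the chunks come out randomized
folds = partition(items, 5)
print(folds)
```

Each of the five chunks can then serve once as the test fold while the other four are combined for training.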

0

Here is code to split the data into n = 5 folds in a stratified manner

# X = data array
# y = class labels
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

— 普拉珊斯

0

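Thanks pberkes for your answer. I just modified it to avoid (1) replacement while sampling and (2) duplicated instances occurring in both training and testing: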

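# Option A: draw 80% of the row indices without replacement
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
# Option B (equivalent): take the first 80% of a random permutation
training_idx = np.random.permutation(np.arange(X.shape[0]))[:int(np.round(X.shape[0] * 0.8))]
# Every row not selected for training goes to the test set
test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)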

— 扎兰

0

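After doing some reading and taking into account the (many..) different ways of splitting the data into train and test sets, I decided to time them!

I used 4 different methods (none of them uses the sklearn library, which I'm sure would give the best results, given that it is well-designed and tested code):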

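1. shuffle the whole matrix arr and then split the data into train and test
2. shuffle the indices and then assign them to x and y to split the data
3. same as method 2, but in a more efficient way
4. use a pandas dataframe to split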

Method 3 won by far with the shortest time, followed by method 1, while methods 2 and 4 turned out to be really inefficient.

The code for the 4 different methods I timed:

import numpy as np
arr = np.random.rand(100, 3)
X = arr[:,:2]
Y = arr[:,2]
spl = 0.7
N = len(arr)
sample = int(spl*N)

#%% Method 1:  shuffle the whole matrix arr and then split
np.random.shuffle(arr)
x_train, x_test, y_train, y_test = X[:sample,:], X[sample:, :], Y[:sample, ], Y[sample:,]

#%% Method 2: pick the training indices, then apply them to X and Y
train_idx = np.random.choice(N, sample, replace=False)  # replace=False avoids duplicate rows
Xtrain = X[train_idx]
Ytrain = Y[train_idx]

test_idx = [idx for idx in range(N) if idx not in train_idx]
Xtest = X[test_idx]
Ytest = Y[test_idx]

#%% Method 3: shuffle indices without a for loop
idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test, y_train, y_test = X[train_idx,:], X[test_idx,:], Y[train_idx,], Y[test_idx,]

#%% Method 4: using pandas dataframe to split
import pandas as pd
df = pd.read_csv(file_path, header=None) # Some csv file (I used some file with 3 columns)

train = df.sample(frac=0.7, random_state=200)
test = df.drop(train.index)

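As for the times, the minimum execution time out of 3 repetitions of 1000 loops was:

Method 1: 0.35883826200006297 seconds
Method 2: 1.7157016959999964 seconds
Method 3: 1.7876616719995582 seconds
Method 4: 0.07562861499991413 seconds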

Hope that's helpful!

— 图腾
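The answer does not show how the timings were taken. One way to measure "the minimum of 3 repetitions of 1000 loops" is timeit.repeat, sketched below for the shuffle-based and permutation-based approaches. The function names are hypothetical, and the copy inside split_by_shuffle is an addition so that every call starts from the same data, which adds a little overhead of its own.

```python
import timeit
import numpy as np

arr = np.random.rand(100, 3)
sample = int(0.7 * len(arr))

def split_by_shuffle():
    # Method 1 style: shuffle a copy of the rows, then slice
    data = arr.copy()
    np.random.shuffle(data)
    return data[:sample], data[sample:]

def split_by_permutation():
    # Method 3 style: permute the row indices, then index
    idx = np.random.permutation(len(arr))
    return arr[idx[:sample]], arr[idx[sample:]]

# Minimum over 3 repetitions of 1000 calls each
timings = {
    f.__name__: min(timeit.repeat(f, repeat=3, number=1000))
    for f in (split_by_shuffle, split_by_permutation)
}
print(timings)
```

Absolute numbers depend on the machine and on the array size, so they are only comparable within one run.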

0

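Likely you will need not only to split into train and test, but also to cross-validate to make sure your model generalizes. Here I am assuming 70% training data, 20% validation and 10% holdout/test data.

Check out np.split:

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in

ary[:2] ary[2:3] ary[3:]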

t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))])

— 先生
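The same np.split idea also works on a plain NumPy array instead of a DataFrame. A minimal sketch, assuming a toy array of 100 rows and a seeded generator for repeatability:

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.arange(200).reshape(100, 2)   # toy array (an assumption): 100 rows

shuffled = rng.permutation(data)        # shuffles rows (axis 0) and returns a copy
n = len(shuffled)
# Cut points at 70% and 90% of the rows: train / validation / holdout
train, val, holdout = np.split(shuffled, [int(0.7 * n), int(0.9 * n)])
print(train.shape, val.shape, holdout.shape)
```

The two cut points produce three consecutive, non-overlapping blocks of the shuffled rows.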

0

Split into train, test, and validation sets

import numpy as np

x = np.expand_dims(np.arange(100), -1)
print(x)

n = x.shape[0]
indices = np.random.permutation(n)
# 90% train, 5% test, 5% validation; the three slices must not overlap
training_idx = indices[:int(n * .9)]
test_idx = indices[int(n * .9):int(n * .95)]
val_idx = indices[int(n * .95):]

training, test, val = x[training_idx, :], x[test_idx, :], x[val_idx, :]
print(training, test, val)

— 拉杰（Rajat Subhra）Bhowmick