Answers:
I would just use numpy's randn:
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
And just to see that this has worked:
In [15]: len(test)
Out[15]: 21
In [16]: len(train)
Out[16]: 79
Comparing rand with < 0.8 makes sense because it returns random numbers uniformly distributed between 0 and 1.
Could you explain what exactly is happening in In [12], In [13], and In [14]? I want to understand the Python code itself here.
np.random.rand(len(df)) is an array of size len(df) with random and uniformly distributed float values in the range [0, 1]. The < 0.8 applies the comparison element-wise and stores the result in a boolean array: values < 0.8 become True and values >= 0.8 become False.
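A minimal sketch of that element-wise behaviour (the array size 5 is arbitrary):

import numpy as np

probs = np.random.rand(5)  # five uniform floats in [0, 1)
msk = probs < 0.8          # element-wise comparison -> boolean array of the same size
print(msk)                 # e.g. [ True False  True  True  True]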
scikit-learn's train_test_split is a good one.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
kf = KFold(n_splits=folds)
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
See a full example here: quantstart.com/articles/…
Note that the import is now from sklearn.model_selection import train_test_split; the old from sklearn.cross_validation import train_test_split is deprecated.
Pandas' random sample will also work:
train=df.sample(frac=0.8,random_state=200) #random state is a seed value
test=df.drop(train.index)
What is the random_state arg doing?
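In short, random_state seeds pandas' sampler so the same rows are drawn on every run; a quick demonstration (the toy frame is illustrative, the seed 200 comes from the snippet above):

import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'a': np.arange(10)})
s1 = toy_df.sample(frac=0.8, random_state=200)
s2 = toy_df.sample(frac=0.8, random_state=200)
assert s1.equals(s2)  # same seed -> identical sample on every run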
If you need the test set to be shuffled as well, that limitation is easily addressed, as shown at stackoverflow.com/questions/29576430/shuffle-dataframe-rows: test=df.drop(train.index).sample(frac=1.0)
I would use scikit-learn's own train_test_split, and generate it from the index:
from sklearn.model_selection import train_test_split
y = df.pop('output')
X = df
X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train
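X.iloc[X_test] # likewise, return dataframe test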
The cross_validation module is now deprecated: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators is different from that of this module. This module will be removed in 0.20.
There are many ways to create train/test and even validation samples.
Case 1: train_test_split, the classic way, without any options:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
Case 2: a very small dataset (fewer than 500 rows): use cross-validation to get a result for every row. At the end, you will have one prediction for each row of your available training set.
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor

kf = KFold(n_splits=10, shuffle=True, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)
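As an aside, scikit-learn's cross_val_predict wraps this whole loop and returns one prediction per row of X, aligned with the original row order; a minimal sketch under the same setup (X, y, and the fold/estimator settings above):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

kf = KFold(n_splits=10, shuffle=True, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0)
y_hat_all = cross_val_predict(reg, X, y, cv=kf)  # one prediction per row of X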
Case 3a: an unbalanced dataset for classification purposes. Following case 1, here is the equivalent solution:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)
Case 3b: an unbalanced dataset for classification purposes. Following case 2, here is the equivalent solution:
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestRegressor

kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)
Case 4: you need to create a train/test/validation set on big data to tune hyperparameters (60% train, 20% test, and 20% validation).
from sklearn.model_selection import train_test_split

X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.4)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y_test_val, test_size=0.5)
You may also consider a stratified division into training and testing sets. A stratified division also generates training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.
import numpy as np

def get_train_test_inds(y, train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and
    testing sets are preserved (stratified sampling).
    '''
    y = np.array(y)
    train_inds = np.zeros(len(y), dtype=bool)
    test_inds = np.zeros(len(y), dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y == value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion * len(value_inds))
        train_inds[value_inds[:n]] = True
        test_inds[value_inds[n:]] = True
    return train_inds, test_inds
df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
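A quick usage sketch (the 'label' column name is a hypothetical stand-in for your class column):

train_inds, test_inds = get_train_test_inds(df['label'], train_proportion=0.7)
train_df, test_df = df[train_inds], df[test_inds]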
If you need to split your data with respect to a label column in your data set, you can use this:
import pandas as pd

def split_to_train_test(df, label_column, train_frac=0.8):
    train_df, test_df = pd.DataFrame(), pd.DataFrame()
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
        train_df = pd.concat([train_df, lbl_train_df])  # DataFrame.append was removed in pandas 2.0
        test_df = pd.concat([test_df, lbl_test_df])
    return train_df, test_df
And using it:
train, test = split_to_train_test(data, 'class', 0.7)
You can also pass random_state to the df.sample call if you want to control the split randomness or use some global random seed.
import pandas as pd
from sklearn.model_selection import train_test_split
datafile_name = 'path_to_data_file'
data = pd.read_csv(datafile_name)
target_attribute = data['column_name']
X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.8)
You can make use of ~ (the tilde operator) to exclude the rows sampled via df.sample(), letting pandas alone handle the sampling and the filtering of indexes, to obtain the two sets.
train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]
This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would sometimes be 79, sometimes 81, etc.).
def make_sets(data_df, test_portion):
    import random as rnd
    tot_ix = range(len(data_df))
    test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    train_ix = list(set(tot_ix) ^ set(test_ix))
    test_df = data_df.iloc[test_ix]
    train_df = data_df.iloc[train_ix]
    return train_df, test_df

train_df, test_df = make_sets(data_df, 0.2)
test_df.head()
Just select a range of rows from df, like this:
row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]
This assumes df is already in random order; explaining how df is (or should be) shuffled before the snippet is applied would improve the answer.
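If df does have a meaningful order, a minimal shuffle before slicing (random_state=42 is an arbitrary seed):

df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)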
There are many great answers above, so I just want to add one more example, in case you want to specify the exact number of samples for the train and test sets using just the numpy library.
# set the random seed for the reproducibility
np.random.seed(17)
# e.g. number of samples for the training set is 1000
n_train = 1000
# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)
# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]
train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]
test_data = data_df.iloc[test_ids]
test_labels = labels_df.iloc[test_ids]
To split into more than two classes, such as train, test, and validation, one can do:
probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs >= 0.7) & (probs < 0.85)
validation_mask = probs >= 0.85

df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validation_mask]
This will put approximately 70% of the data in training, 15% in test, and 15% in validation.
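A quick sanity check on those proportions (using df and the frames defined above):

for name, part in [('train', df_training), ('test', df_test), ('validation', df_validation)]:
    print(name, round(len(part) / len(df), 2))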
You need to convert the pandas DataFrame into a numpy array and then convert the numpy array back into a DataFrame:
import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)
How about this? df is my dataframe:
import math

total_size = len(df)
train_size = math.floor(0.66 * total_size)  # 2/3 of my dataset

# training dataset
train = df.head(train_size)

# test dataset
test = df.tail(len(df) - train_size)
shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF =df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]
Since msk is of dtype bool, df[msk], df.iloc[msk] and df.loc[msk] always return the same result.
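A tiny demonstration of that equivalence (the toy frame is illustrative):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3]})
msk = np.array([True, False, True])
assert toy[msk].equals(toy.loc[msk])   # boolean mask with [] and .loc agree
assert toy[msk].equals(toy.iloc[msk])  # .iloc accepts a boolean array too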