160

I have just started using R, and I'm not sure how to incorporate my dataset into the following sample code:

sample(x, size, replace = FALSE, prob = NULL)

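I have a dataset that I need to split into a training set (75%) and a testing set (25%). I'm not sure what I'm supposed to pass as x and size. Is x the dataset file, and is size the number of samples I have?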

r sample

— Susie Humby

1

x can be the index (row or column numbers, say) of your data. size can be 0.75*nrow(data). Try sample(1:10, 4, replace = FALSE, prob = NULL) to see what it does.

— harkmug
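
To make the comment above concrete, here is a minimal base-R sketch of what x and size could look like for a 75/25 split. The 8-row data frame and its column names are made up for illustration; they stand in for the asker's dataset.

```r
# Hypothetical 8-row data frame standing in for the asker's dataset.
set.seed(42)                      # make the draw reproducible
data <- data.frame(id = 1:8, value = rnorm(8))

# x: the row numbers to draw from; size: how many rows go into training.
size <- floor(0.75 * nrow(data))  # 6 of 8 rows
train_rows <- sample(seq_len(nrow(data)), size = size, replace = FALSE)

train <- data[train_rows, ]       # rows that were drawn
test  <- data[-train_rows, ]      # all remaining rows

nrow(train)  # 6
nrow(test)   # 2
```

So x is not the file itself but the set of row positions, and size is the number of rows wanted in the training set.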

255

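There are numerous approaches to data partitioning. For a more complete approach, take a look at the createDataPartition function in the caTools package.

Here is a simple example: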

data(mtcars)

## 75% of the sample size
smp_size <- floor(0.75 * nrow(mtcars))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)

train <- mtcars[train_ind, ]
test <- mtcars[-train_ind, ]

— dickoa

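I'm a little confused: what guarantees that this code returns non-overlapping test and train data frames? It seems to work, don't get me wrong. I'm just having trouble understanding how subtracting the indices leads to unique observations. For instance, if you had a df with 10 rows and one column containing 1,2,3,4,5,6,7,8,9,10, and you followed this code, what prevents train from having index 4 and test from having -6 -> 10 - 6 = 4 as well?

— goldisfine, 2014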

1

Thanks. I tried mtcars[!train_ind], and while it didn't fail, it didn't work as expected. How could I do this using !?

— user989762

@user989762 ! is used for logical (TRUE/FALSE) values, not for indices. If you want to subset using !, try something like mtcars[!seq_len(nrow(mtcars)) %in% train_ind, ] (not tested).

— dickoa

1

@VedaadShakib When you use "-", it omits every index in train_ind from your data. Have a look at adv-r.had.co.nz/Subsetting.html. Hope it helps.
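
The question above about negative indices can be checked on a tiny vector. This is only an illustrative sketch with made-up values: a negative index is not arithmetic on the row count, it simply drops the listed positions, so the kept and dropped positions cannot overlap.

```r
x <- c(10, 20, 30, 40, 50)
train_ind <- c(2, 5)

x[train_ind]    # 20 50     (positions 2 and 5 are kept)
x[-train_ind]   # 10 30 40  (positions 2 and 5 are dropped)

# Logical equivalent of the negative index, using ! on a TRUE/FALSE vector
in_train <- seq_along(x) %in% train_ind
x[!in_train]    # 10 30 40

# The training positions and the remaining positions never overlap
length(intersect(train_ind, setdiff(seq_along(x), train_ind)))  # 0
```

The same reasoning applies to the rows of a data frame: df[train_ind, ] and df[-train_ind, ] partition the rows.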

1

Isn't createDataPartition in caret, not caTools?

— J. Mini

93

It can be done easily with:

set.seed(101) # Set Seed so that same sample can be reproduced in future also
# Now Selecting 75% of data as sample from total 'n' rows of the data  
sample <- sample.int(n = nrow(data), size = floor(.75*nrow(data)), replace = F)
train <- data[sample, ]
test  <- data[-sample, ]

Or by using the caTools package:

require(caTools)
set.seed(101) 
sample = sample.split(data$anycolumn, SplitRatio = .75)
train = subset(data, sample == TRUE)
test  = subset(data, sample == FALSE)

— 情报局

4

I recently took a course at MIT, and they used the caTools approach throughout. Thanks.

— Chetan Sharma

1

sample = sample.split(data[,1], SplitRatio = .75) should remove the need to name a column.

— Benjamin Ziepert

33

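I would use dplyr for this; it makes it super simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating sets but also for traceability during your project. Add it if the data doesn't contain one already.

library(dplyr)   # provides %>%, sample_frac() and anti_join()
mtcars$id <- 1:nrow(mtcars)
train <- mtcars %>% dplyr::sample_frac(.75)
test  <- dplyr::anti_join(mtcars, train, by = 'id')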

— Edwin

28

This is almost the same code, but nicer looking:

bound <- floor((nrow(df)/4)*3)         #define % of training and test set

df <- df[sample(nrow(df)), ]           #sample rows 
df.train <- df[1:bound, ]              #get training set
df.test <- df[(bound+1):nrow(df), ]    #get test set

— 卡特琳娜

Yes! Looks good!

— MeenakshiSundharam

23

library(caret)
intrain<-createDataPartition(y=sub_train$classe,p=0.7,list=FALSE)
training<-m_train[intrain,]
testing<-m_train[-intrain,]

— Pradnya Chavan

3

While a code-only answer is an answer, it is better to provide some explanation.

— C8H10N4O2

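What is m_train? I think you meant sub_train, the original data.frame. The revised code should therefore be training <- sub_train[intrain, ] and testing <- sub_train[-intrain, ]. I wonder why no one spotted this major issue with your answer in the last five years!

— mnm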

21

I will split 'a' into train (70%) and test (30%):

    a # original data frame
    library(dplyr)
    train<-sample_frac(a, 0.7)
    sid<-as.numeric(rownames(train)) # because rownames() returns character
    test<-a[-sid,]

Done.

— 郑贤宇

4

You need to load the dplyr package: require(dplyr)

— TheMI, 2016

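This answer helped me, but I needed to tweak it to get the expected results. As is, the dataset 'train' has row names equal to sequential integers (1, 2, 3, 4, ...), whereas you want sid to hold the row numbers from the original dataset 'a', which will not be sequential because the rows are selected at random. So it is necessary to create the id variable on 'a' first.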

— Scott Murff '16

row.names(mtcars) <- NULL; train <- dplyr::sample_frac(mtcars, 0.5); test <- mtcars[-as.numeric(row.names(train)), ] # I did this with my data; the original code doesn't work if your row names are already set to numbers

— Christopher John
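
The comments above point at a pitfall: turning row names into numbers only works when the row names are the original row positions. The following is a hedged base-R sketch (no dplyr, so it does not depend on how any dplyr version treats row names) that carries an explicit id column instead. The column name row_id is hypothetical.

```r
data(mtcars)
set.seed(1)

# mtcars uses car names as row names, so they cannot be converted to row numbers
suppressWarnings(anyNA(as.numeric(rownames(mtcars))))  # TRUE

# Safer: carry an explicit id column (hypothetical name `row_id`)
a <- mtcars
a$row_id <- seq_len(nrow(a))

train <- a[sample(nrow(a), size = floor(0.7 * nrow(a))), ]
test  <- a[!a$row_id %in% train$row_id, ]   # everything not drawn for training

nrow(train) + nrow(test) == nrow(a)  # TRUE
```

Matching on an id column keeps the split correct no matter what the row names look like after sampling.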

16

My solution is basically the same as dickoa's, but a little easier to interpret:

data(mtcars)
n = nrow(mtcars)
trainIndex = sample(1:n, size = round(0.7*n), replace=FALSE)
train = mtcars[trainIndex ,]
test = mtcars[-trainIndex ,]

— Alex

What is the variable swiss?

— billmccord

7

A shorter and simpler way, using the awesome dplyr library:

library(dplyr)
set.seed(275) #to get repeatable data

data.train <- sample_frac(Default, 0.7)

train_index <- as.numeric(rownames(data.train))
data.test <- Default[-train_index, ]

— Shayan Amani

1

Did you mean Default[-train_index, ] for the last line?

— Matt

5

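If you type:

?sample

it will open a help page explaining what the parameters of the sample function mean.

I am not an expert, but here is some code I have: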

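data <- data.frame(matrix(rnorm(400), nrow=100))
# randomly assign each row to one of four equally sized groups
splitdata <- split(data[1:nrow(data),],sample(rep(1:4,as.integer(nrow(data)/4))))
test <- splitdata[[4]]    # one group (25%) held out for testing
train <- rbind(splitdata[[1]],splitdata[[2]],splitdata[[3]])

This gives you 75% train and 25% test.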

— 用户名

5

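After looking through all the different methods posted here, I didn't see anyone use TRUE/FALSE to select and unselect data. So I thought I would share a method that uses that technique.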

n = nrow(dataset)
split = sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.75, 0.25))

training = dataset[split, ]
testing = dataset[!split, ]

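Explanation

There are multiple ways of selecting data in R. Most commonly, people use positive and negative indices to select and unselect, respectively. However, the same result can be achieved by using TRUE/FALSE to select and unselect.

Consider the following example.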

# let's explore ways to select every other element
data = c(1, 2, 3, 4, 5)


# using positive indices to select wanted elements
data[c(1, 3, 5)]
[1] 1 3 5

# using negative indices to remove unwanted elements
data[c(-2, -4)]
[1] 1 3 5

# using booleans to select wanted elements
data[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 1 3 5

# R recycles the TRUE/FALSE vector if it is not the correct dimension
data[c(TRUE, FALSE)]
[1] 1 3 5

— Joe
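
One point worth noting about the TRUE/FALSE approach above: drawing the mask with prob = c(0.75, 0.25) gives a split that is only approximately 75/25. If an exact split size matters, a logical mask can also be built from a fixed-size sample of row positions. The sketch below uses the built-in iris data as a stand-in for `dataset`; the names split_prob and split_exact are made up for illustration.

```r
set.seed(7)
dataset <- iris                 # stand-in for the answer's `dataset`
n <- nrow(dataset)              # 150 rows

# Probabilistic mask: each row is TRUE with probability 0.75,
# so the number of training rows varies with the random draw.
split_prob <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.75, 0.25))

# Exact mask: TRUE for exactly floor(0.75 * n) randomly chosen rows.
split_exact <- seq_len(n) %in% sample(n, size = floor(0.75 * n))

training <- dataset[split_exact, ]
testing  <- dataset[!split_exact, ]

sum(split_exact)   # 112
nrow(testing)      # 38
```

Either mask works with dataset[split, ] and dataset[!split, ]; the choice depends on whether a fixed training-set size is required.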

4

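My solution shuffles the rows, then takes the first 75% of the rows as train and the remaining 25% as test. Super simple!

row_count <- nrow(orders_pivotted)
shuffled_rows <- sample(row_count)
train_count <- floor(row_count * 0.75)
train <- orders_pivotted[head(shuffled_rows, train_count), ]
# take every remaining row, so none is dropped when row_count is not a multiple of 4
test <- orders_pivotted[tail(shuffled_rows, row_count - train_count), ]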

— 约翰尼五世

4

May I suggest using the rsample package:

library(rsample)
# choosing 75% of the data to be the training data
data_split <- initial_split(data, prop = .75)
# extracting training data and test data as two separate data frames
data_train <- training(data_split)
data_test  <- testing(data_split)

3

The scorecard package has a useful function for this, where you can specify the ratio and the seed:

library(scorecard)

dt_list <- split_df(mtcars, ratio = 0.75, seed = 66)

The test and train data are stored in a list and can be accessed by calling dt_list$train and dt_list$test.

— 佳美

2

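Below is a function that creates a list of sub-samples of the same size. This is not exactly what you wanted, but it might prove useful for others. In my case, I used it to build multiple classification trees on smaller samples to test for overfitting: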

df_split <- function (df, number){
  sizedf      <- length(df[,1])
  bound       <- sizedf/number
  list        <- list() 
  for (i in 1:number){
    list[i] <- list(df[((i*bound+1)-bound):(i*bound),])
  }
  return(list)
}

Example:

x <- matrix(c(1:10), ncol=1)
x
# [,1]
# [1,]    1
# [2,]    2
# [3,]    3
# [4,]    4
# [5,]    5
# [6,]    6
# [7,]    7
# [8,]    8
# [9,]    9
#[10,]   10

x.split <- df_split(x,5)
x.split
# [[1]]
# [1] 1 2

# [[2]]
# [1] 3 4

# [[3]]
# [1] 5 6

# [[4]]
# [1] 7 8

# [[5]]
# [1] 9 10

— Yohan Obadia

2

Sample code using the caTools package in R is as follows:

library(caTools)
# data: your data frame
split = sample.split(data$DependentColumnName, SplitRatio = 0.6)
training_set = subset(data, split == TRUE)
test_set = subset(data, split == FALSE)

— Yash Sharma

2

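Use base R. The function runif generates uniformly distributed values between 0 and 1. For a given cutoff value (train.size in the example below), approximately the same percentage of random records will always fall below the cutoff.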

data(mtcars)
set.seed(123)

#desired proportion of records in training set
train.size<-.7
#true/false vector of values above/below the cutoff above
train.ind<-runif(nrow(mtcars))<train.size

#train
train.df<-mtcars[train.ind,]


#test
test.df<-mtcars[!train.ind,]

— 康斯坦丁·明古林

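This would be a much better answer if it showed the extra couple of lines that actually create the training and test sets (which newcomers often struggle with).

— Gregor Thomas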

2

Assuming df is your data frame, and that you want 75% for training and 25% for testing:

all <- 1:nrow(df)
train_i <- sort(sample(all, round(nrow(df)*0.75,digits = 0),replace=FALSE))
test_i <- all[-train_i]

Then create the train and test data frames:

df_train <- df[train_i,]
df_test <- df[test_i,]

— 科伦丁

1

require(caTools)

set.seed(101)            #This is used to create same samples everytime

split1=sample.split(data$anycol,SplitRatio=2/3)

train=subset(data,split1==TRUE)

test=subset(data,split1==FALSE)

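The sample.split() function returns a logical vector (here split1) in which roughly 2/3 of the entries are TRUE and the rest are FALSE; it does not add a column to the data frame. Rows where split1 is TRUE are copied to train, and the other rows are copied to the test data frame.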

— 阿比舍克

1

I came across this one; it can help too.

library(mlbench)   # provides the Sonar data set
data(Sonar)
set.seed(12)
data = Sonar[sample(nrow(Sonar)),] # reshuffles the data
bound = floor(0.7 * nrow(data))
df_train = data[1:bound,]
df_test = data[(bound+1):nrow(data),]

— 用户名

1

We can divide the data in a particular ratio; here it is 80% in the train set and 20% in the test set.

ind <- sample(2, nrow(dataName), replace = T, prob = c(0.8,0.2))
train <- dataName[ind==1, ]
test <- dataName[ind==2, ]

— Adarsh Pawar

0

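Beware of using sample for splitting if you need reproducible results. If your data changes even slightly, the split will vary even if you use set.seed. For example, imagine the sorted list of IDs in your data is all the numbers between 1 and 10. If you dropped just one observation, say 4, sampling by position would yield a different result, because 5 to 10 have now all moved places.

An alternative is to use a hash function to map the IDs to pseudo-random numbers and then sample on the modulus of these numbers. This sample is more stable, because assignment is now determined by the hash of each observation rather than by its relative position.

For example: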

require(openssl)  # for md5
require(data.table)  # for the demo data

set.seed(1)  # this won't help `sample`

population <- as.character(1e5:(1e6-1))  # some made up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
sample2 <- sample1[-sample(N, 1)]  # randomly drop one observation from sample1

# samples are all but identical
sample1
sample2
nrow(merge(sample1, sample2))

[1] 9999

# row splitting yields very different test sets, even though we've set the seed
test <- sample(N-1, N/2, replace = F)

test1 <- sample1[test, .(id)]
test2 <- sample2[test, .(id)]
nrow(test1)

[1] 5000

nrow(merge(test1, test2))

[1] 2653

# to fix that, we can use a hash function and sample on the first hex digit of each hash

md5_bit_mod <- function(x, m = 2L) {
  # Inputs: 
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}

# hash splitting preserves the similarity, because the assignment of test/train 
# is determined by the hash of each obs., and not by its relative location in the data
# which may change 
test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

[1] 5057

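The sample size is not exactly 5000 because assignment is probabilistic, but this shouldn't be a problem in large samples, thanks to the law of large numbers.

See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and /crypto/20742/statistical-properties-of-hash-functions-when-calculating-modulo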

— dzeltzer

Added as a separate question: stackoverflow.com/questions/52769681/…

— dzeltzer

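I want to develop auto.arima models from multiple time series, and I want to use 1 year of data, 3 years of data, 5, 7, ... (in two-year steps) from each series to build the model, then test it on the remaining test set. How do I do the subsetting so that the fitted model has what I want? I appreciate your help.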

— Stackuser

0

set.seed(123)
# use nrow(), not length(): length() of a data frame is its number of columns
llwork<-sample(1:nrow(mydata),round(0.75*nrow(mydata),digits=0))
wmydata<-mydata[llwork, ]
tmydata<-mydata[-llwork, ]

— Xavier Jiménez Albán

-2

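There is a very simple way to select a number of rows using R's index for rows and columns. This lets you split the data set cleanly given a number of rows, say the first 80% of your data.

In R all rows and columns are indexed, so DataSetName[1,1] is the value assigned to the first column and first row of "DataSetName". I can select rows using [x, ] and columns using [, x].

For example: if I have a data set conveniently named "data" with 100 rows, I can view the first 80 rows using

View(data[1:80, ])

In the same way, I can select these rows and subset them using:

train = data[1:80, ]

test = data[81:100, ]

Now I have my data split into two parts without the possibility of resampling. Quick and easy.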

— Dan Butorovich

1

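While it is true that data can be split that way, it is not advised. Some datasets are ordered by a variable you are not aware of. So it is better to sample which rows will be used for training instead of taking the first n rows.

— user5029763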

1

If you shuffle the data before separating it into the test and training sets, your suggestion works.

— 哈迪吉
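
The two comments above can be combined into a short base-R sketch: shuffle the rows first, then cut by position. The 100-row data frame and the name n_train are made up for illustration.

```r
set.seed(2024)
data <- data.frame(x = 1:100, y = rnorm(100))   # hypothetical 100-row data set

shuffled <- data[sample(nrow(data)), ]          # put the rows in random order
n_train  <- floor(0.8 * nrow(shuffled))         # 80 rows for training

train <- shuffled[seq_len(n_train), ]           # first 80 shuffled rows
test  <- shuffled[(n_train + 1):nrow(shuffled), ]  # remaining 20 rows
```

Because the order is randomized before the cut, any hidden ordering in the original data no longer determines which rows end up in each set.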

How to split data into training/testing sets using the sample function
