将一列中以逗号分隔的字符串拆分为单独的行


109

我有一个数据框,如下所示:

data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                        "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
                        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                        "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

如您所见,该director列中的某些条目是多个名称,以逗号分隔。我想将这些条目拆分为单独的行,同时保持另一列的值。例如,上面数据框中的第一行应分为两行,在该director列中各有一个名称,在该列中有一个“ A” AB


2
只是想问一个明显的问题:您应该在互联网上发布这些数据吗?
里卡多·萨波特塔

1
他们“不是所有的B电影”。似乎足够无害。
马修·伦德伯格

24
所有这些人都是奥斯卡金像奖的提名人,我几乎认为这不是秘密=)
RoyalTS 2012年

Answers:


79

这个老问题经常被用作欺骗对象(标记为r-faq)。截止到今天,已经提供了6种不同的方法进行了三次回答,但是缺乏基准作为指导,其中哪种方法是最快的1

基准测试解决方案包括

使用该microbenchmark软件包,对6种不同大小的数据帧总共进行了8种不同方法的基准测试(请参见下面的代码)。

OP提供的样本数据仅包含20行。为了创建更大的数据帧,将这20行简单重复1、10、100、1000、10000和100000次,这样问题大小最多可达200​​万行。

基准结果

在此处输入图片说明

基准测试结果表明,对于足够大的数据帧,所有data.table方法都比其他任何方法都快。对于具有约5000行以上的数据帧,Jaap的data.table方法2及其变体DT3是最快的,幅度比最慢的方法快。

值得注意的是,这两种tidyverse方法的时间和splistackshape解决方案是如此相似,以至于很难消除图表中的曲线。在所有数据帧大小中,它们是基准测试方法中最慢的。

对于较小的数据帧,Matt的基本R解决方案和data.table方法4似乎比其他方法具有更少的开销。

director <- 
  c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", 
    "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
    "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
    "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
    "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
    "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
    "Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", 
        "A", "B", "B", "B", "B", "B", "B", "A")

library(data.table)
library(magrittr)

为问题规模的基准运行定义函数 n

run_mb <- function(n) {
  # compute number of benchmark runs depending on problem size `n`
  mb_times <- scales::squish(10000L / n , c(3L, 100L)) 
  cat(n, " ", mb_times, "\n")
  # create data
  DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
  DT <- as.data.table(DF)
  # start benchmarks
  microbenchmark::microbenchmark(
    matt_mod = {
      s <- strsplit(as.character(DF$director), ',')
      data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
    jaap_DT1 = {
      DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]},
    jaap_DT2 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE), 
         by = .(AB, director)][,.(director = V1, AB)]},
    jaap_dplyr = {
      DF %>% 
        dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
        tidyr::unnest(director)},
    jaap_tidyr = {
      tidyr::separate_rows(DF, director, sep = ",")},
    cSplit = {
      splitstackshape::cSplit(DF, "director", ",", direction = "long")},
    DT3 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][, director := NULL][
           , setnames(.SD, "V1", "director")]},
    DT4 = {
      DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), 
         by = .(AB)]},
    times = mb_times
  )
}

针对不同问题规模运行基准测试

# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)

准备绘图数据

mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]

创建图表

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
  geom_point() + geom_smooth(se = FALSE) + 
  scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + 
  xlab("number of rows") + ylab("median of execution time [ms]") +
  ggtitle("microbenchmark results") + theme_bw()

会话信息和软件包版本(节选)

devtools::session_info()
#Session info
# version  R version 3.3.2 (2016-10-31)
# system   x86_64, mingw32
#Packages
# data.table      * 1.10.4  2017-02-01 CRAN (R 3.3.2)
# dplyr             0.5.0   2016-06-24 CRAN (R 3.3.1)
# forcats           0.2.0   2017-01-23 CRAN (R 3.3.2)
# ggplot2         * 2.2.1   2016-12-30 CRAN (R 3.3.2)
# magrittr        * 1.5     2014-11-22 CRAN (R 3.3.0)
# microbenchmark    1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales            0.4.1   2016-11-09 CRAN (R 3.3.2)
# splitstackshape   1.4.2   2014-10-23 CRAN (R 3.3.3)
# tidyr             0.6.1   2017-01-10 CRAN (R 3.3.2)

1 我的好奇心被这种热烈的评论 激怒了!快几个数量级!一个问题tidyverse答案,该问题已作为该问题的副本被关闭。


真好!看起来cSplit和sepeparate_rows有改进的余地(这是专门为此目的而设计的)。顺便说一句,cSplit还采用了fixed = arg,并且是一个基于data.table的包,因此最好为其指定DT而不是DF。同样,我不认为从factor到char的转换属于基准测试(因为开始时应该是char)。我检查了一下,这些变化都没有对结果产生任何定性的影响。
法兰克(Frank)

1
@Frank感谢您提出的改进基准并检查对结果的影响的建议。下一个版本的发布后做一个更新时,会挑选这件事data.tabledplyr
乌韦

我认为这些方法是不可比的,至少在所有情况下都不可比,因为数据表方法仅生成具有“选定”列的表,而dplyr生成具有所有列的结果(包括未参与分析且没有在功能中写下他们的名字)。
Ferroao

5
@Ferroao错了,data.tables方法会在适当位置修改“表”,所有列均会保留,当然,如果您未进行适当的修改,则只会得到您要求的内容的过滤副本。简而言之,data.table方法不是生成结果数据集,而是更新数据集,这是data.table和dplyr之间的真正区别。
Tensibai

1
真的很好比较!也许您可以在执行操作时添加matt_modjaap_dplyrstrsplit fixed=TRUE。就像其他人一样,这将对时间产生影响。由于R 4.0.0,在创建时的默认data.frame值为stringsAsFactors = FALSE,因此as.character可以删除。
GKi

94

几种选择:

1)两种方式

library(data.table)
# method 1 (preferred)
setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]
# method 2
setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
         ][,.(director = V1, AB)]

2) / 组合:

library(dplyr)
library(tidyr)
v %>% 
  mutate(director = strsplit(as.character(director), ",")) %>%
  unnest(director)

3)与 only:使用tidyr 0.5.0(及更高版本),您还可以使用separate_rows

separate_rows(v, director, sep = ",")

您可以使用该convert = TRUE参数将数字自动转换为数字列。

4)以R为基数:

# if 'director' is a character-column:
stack(setNames(strsplit(df$director,','), df$AB))

# if 'director' is a factor-column:
stack(setNames(strsplit(as.character(df$director),','), df$AB))

有没有办法一次对多个列执行此操作?例如3列,每列都有用“;”分隔的字符串 每列具有相同数量的字符串。即data.table(id= "X21", a = "chr1;chr1;chr1", b="123;133;134",c="234;254;268")成为data.table(id = c("X21","X21",X21"), a=c("chr1","chr1","chr1"), b=c("123","133","134"), c=c("234","254","268"))
Reilstein '19

1
哇,刚刚意识到它已经可以同时用于多个列了-太神奇了!
Reilstein '19

@Reilstein您可以分享如何针对多个专栏进行调整吗?我有相同的用例,但不确定如何去做。
Moon_Watcher

1
上面答案中的@Moon_Watcher方法1已经适用于多列,这是我认为的惊人之处。 setDT(dt)[,lapply(.SD, function(x) unlist(tstrsplit(x, ";",fixed=TRUE))), by = ID]对我有用。
Reilstein

51

命名您的原始data.frame v,我们有:

> s <- strsplit(as.character(v$director), ',')
> data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))
                      director AB
1                 Aaron Blaise  A
2                   Bob Walker  A
3               Akira Kurosawa  B
4               Alan J. Pakula  A
5                  Alan Parker  A
6           Alejandro Amenabar  B
7  Alejandro Gonzalez Inarritu  B
8  Alejandro Gonzalez Inarritu  B
9             Benicio Del Toro  B
10 Alejandro González Iñárritu  A
11                 Alex Proyas  B
12              Alexander Hall  A
13              Alfonso Cuaron  B
14            Alfred Hitchcock  A
15              Anatole Litvak  A
16              Andrew Adamson  B
17                 Marilyn Fox  B
18              Andrew Dominik  B
19              Andrew Stanton  B
20              Andrew Stanton  B
21                 Lee Unkrich  B
22              Angelina Jolie  B
23              John Stevenson  B
24               Anne Fontaine  B
25              Anthony Harvey  A

请注意使用rep来构建新的AB列。在此,sapply返回每个原始行中名称的数量。


1
我想知道`AB = rep(v $ AB,unlist(sapply(s,FUN = length)))`可能比晦涩难懂vapply吗?有什么vapply更适合这里的吗?
IRTFM 2013年

7
如今sapply(s, length)可以用代替lengths(s)
Rich Scriven

31

晚了,但是另一个通用的替代方法是使用cSplit我的“ splitstackshape”程序包中的一个direction参数。设置它以"long"获得您指定的结果:

library(splitstackshape)
head(cSplit(mydf, "director", ",", direction = "long"))
#              director AB
# 1:       Aaron Blaise  A
# 2:         Bob Walker  A
# 3:     Akira Kurosawa  B
# 4:     Alan J. Pakula  A
# 5:        Alan Parker  A
# 6: Alejandro Amenabar  B

2
devtools::install_github("yikeshu0611/onetree")

library(onetree)

dd=spread_byonecolumn(data=mydata,bycolumn="director",joint=",")

head(dd)
            director AB
1       Aaron Blaise  A
2         Bob Walker  A
3     Akira Kurosawa  B
4     Alan J. Pakula  A
5        Alan Parker  A
6 Alejandro Amenabar  B

0

使用所得的另一基准strsplit基部可以目前被推荐给分割用逗号分隔的字符串在列到不同的行,因为它是最快的在宽范围的尺寸:

s <- strsplit(v$director, ",", fixed=TRUE)
s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))

请注意,使用fixed=TRUE时间对计时有重大影响。

曲线显示了行数上的计算时间

比较方法:

met <- alist(base = {s <- strsplit(v$director, ",") #Matthew Lundberg
   s <- data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))}
 , baseLength = {s <- strsplit(v$director, ",") #Rich Scriven
   s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))}
 , baseLeFix = {s <- strsplit(v$director, ",", fixed=TRUE)
   s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))}
 , cSplit = s <- cSplit(v, "director", ",", direction = "long") #A5C1D2H2I1M1N2O1R2T1
 , dt = s <- setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, "," #Jaap
   , fixed=TRUE))), by = AB][!is.na(director)]
#, dt2 = s <- setDT(v)[, strsplit(director, "," #Jaap #Only Unique
#  , fixed=TRUE), by = .(AB, director)][,.(director = V1, AB)]
 , dplyr = {s <- v %>%  #Jaap
    mutate(director = strsplit(director, ",", fixed=TRUE)) %>%
    unnest(director)}
 , tidyr = s <- separate_rows(v, director, sep = ",") #Jaap
 , stack = s <- stack(setNames(strsplit(v$director, ",", fixed=TRUE), v$AB)) #Jaap
#, dt3 = {s <- setDT(v)[, strsplit(director, ",", fixed=TRUE), #Uwe #Only Unique
#  by = .(AB, director)][, director := NULL][, setnames(.SD, "V1", "director")]}
 , dt4 = {s <- setDT(v)[, .(director = unlist(strsplit(director, "," #Uwe
   , fixed = TRUE))), by = .(AB)]}
 , dt5 = {s <- vT[, .(director = unlist(strsplit(director, "," #Uwe
   , fixed = TRUE))), by = .(AB)]}
   )

库:

library(microbenchmark)
library(splitstackshape) #cSplit
library(data.table) #dt, dt2, dt3, dt4
#setDTthreads(1) #Looks like it has here minor effect
library(dplyr) #dplyr
library(tidyr) #dplyr, tidyr

数据:

v0 <- data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                        "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
                        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                        "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

计算和计时结果:

n <- 10^(0:5)
x <- lapply(n, function(n) {v <- v0[rep(seq_len(nrow(v0)), n),]
  vT <- setDT(v)
  ti <- min(100, max(3, 1e4/n))
  microbenchmark(list = met, times = ti, control=list(order="block"))})

y <- do.call(cbind, lapply(x, function(y) aggregate(time ~ expr, y, median)))
y <- cbind(y[1], y[-1][c(TRUE, FALSE)])
y[-1] <- y[-1] / 1e6 #ms
names(y)[-1] <- paste("n:", n * nrow(v0))
y #Time in ms
#         expr     n: 20    n: 200    n: 2000   n: 20000   n: 2e+05   n: 2e+06
#1        base 0.2989945 0.6002820  4.8751170  46.270246  455.89578  4508.1646
#2  baseLength 0.2754675 0.5278900  3.8066300  37.131410  442.96475  3066.8275
#3   baseLeFix 0.2160340 0.2424550  0.6674545   4.745179   52.11997   555.8610
#4      cSplit 1.7350820 2.5329525 11.6978975  99.060448 1053.53698 11338.9942
#5          dt 0.7777790 0.8420540  1.6112620   8.724586  114.22840  1037.9405
#6       dplyr 6.2425970 7.9942780 35.1920280 334.924354 4589.99796 38187.5967
#7       tidyr 4.0323765 4.5933730 14.7568235 119.790239 1294.26959 11764.1592
#8       stack 0.2931135 0.4672095  2.2264155  22.426373  289.44488  2145.8174
#9         dt4 0.5822910 0.6414900  1.2214470   6.816942   70.20041   787.9639
#10        dt5 0.5015235 0.5621240  1.1329110   6.625901   82.80803   636.1899

注意,方法如

(v <- rbind(v0[1:2,], v0[1,]))
#                 director AB
#1 Aaron Blaise,Bob Walker  A
#2          Akira Kurosawa  B
#3 Aaron Blaise,Bob Walker  A

setDT(v)[, strsplit(director, "," #Jaap #Only Unique
  , fixed=TRUE), by = .(AB, director)][,.(director = V1, AB)]
#         director AB
#1:   Aaron Blaise  A
#2:     Bob Walker  A
#3: Akira Kurosawa  B

返回strsplitunique 董事,并可能与可比

tmp <- unique(v)
s <- strsplit(tmp$director, ",", fixed=TRUE)
s <- data.frame(director=unlist(s), AB=rep(tmp$AB, lengths(s)))

但是据我了解,这不是必需的。

By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.