Use a value from the previous row in an R data.table calculation

81

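I want to create a new column in a data.table, calculated from the current value of one column and the previous value of another. Is it possible to access previous rows?

For example: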

> DT <- data.table(A=1:5, B=1:5*10, C=1:5*100)
> DT
   A  B   C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
> DT[, D := C + BPreviousRow] # What is the correct code here?

The correct answer should be

> DT
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540

r data.table

— Korone

I usually set a key on my data.tables: DT <- data.table(A=..., key = "A")

— PatrickT

103

With shift() implemented in v1.9.6, this is quite straightforward.

DT[ , D := C + shift(B, 1L, type="lag")]
# or equivalently, in this case,
DT[ , D := C + shift(B)]

From NEWS:

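New function shift() implements fast lead/lag of vector, list, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead". It enables very convenient usage along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.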
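As a hedged illustration of the grouped usage quoted above, the sketch below lags values within groups rather than across the whole table. The `id` column, its labels, and the `B_lag`/`C_lag` names are assumptions made up for this example; they are not part of the original question.

```r
library(data.table)

# Hypothetical data: B and C as in the question, plus an assumed grouping column `id`
DT <- data.table(id = c("a", "a", "a", "b", "b"), B = 1:5 * 10, C = 1:5 * 100)

# Lag B within each group, so the first row of every group gets NA
DT[, D := C + shift(B), by = id]

# Lag several columns at once; the new column names are illustrative only
cols <- c("B_lag", "C_lag")
DT[, (cols) := shift(.SD, 1L), by = id, .SDcols = c("B", "C")]
```

Without `by = id`, the first row of group "b" would pick up the last B value of group "a" instead of NA.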

See the edit history for previous answers.

— Arun

Does that .N hold the current row number or something? Sorry to ask here, but I can't seem to find it in the help files...

— SlowLearner 2013

7

@SlowLearner: You might also find .I useful, which holds the row indices for the rows in the current group.

— Steve Lianoglou
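To make the difference between the two symbols concrete, here is a small sketch on made-up data (the column names `g`, `v`, and the three new columns are assumptions for illustration). Within `by`, `.N` is the number of rows in the current group and `.I` holds the positions of those rows in the full table.

```r
library(data.table)

# Hypothetical table with an assumed grouping column `g`
DT <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5 * 10)

DT[, `:=`(
  table_row    = .I,          # row index in the whole table
  group_rows   = .N,          # number of rows in the current group
  row_in_group = seq_len(.N)  # position of the row inside its group
), by = g]
```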

7

Use seq_len(.N-1) instead of 1:(.N-1). This avoids the problems associated with 1:0.

— mnel
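The 1:0 problem shows up when a table or group has a single row. A minimal sketch in plain R, with `n` standing in for .N:

```r
n <- 1L  # stand-in for .N when there is only one row

bad  <- 1:(n - 1)        # counts DOWN from 1 to 0, giving two indices
good <- seq_len(n - 1)   # gives an empty index vector

length(bad)   # 2
length(good)  # 0
```

Indexing with `bad` would select the first element where nothing should be selected.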

1

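+1 for the .SD example. I was trying to use an lapply and getting funky results. This is much simpler.

— MichaelChirico 2015

Where can I find an updated pdf with all this new information? The official 1.9.4 vignettes and webinars don't include it. The Rmd 1.9.5 vignettes are awkward to read and don't include it either.

— skan 2015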

43

Using dplyr you could do:

mutate(DT, D = lag(B) + C)

Which gives:

#   A  B   C   D
#1: 1 10 100  NA
#2: 2 20 200 210
#3: 3 30 300 320
#4: 4 40 400 430
#5: 5 50 500 540

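— Steven Beaupré

22

Several people have answered the specific question. See the code below for a general-purpose function that may be helpful in situations like this. Rather than just getting the previous row, you can go as many rows into the "past" or "future" as you like.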

rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}

# Create column D by adding column C and the value from the previous row of column B:
DT[, D := C + rowShift(B,-1)]

# Get the Old Faithful eruption length from two events ago, and three events in the future:
as.data.table(faithful)[1:5,list(eruptLengthCurrent=eruptions,
                                 eruptLengthTwoPrior=rowShift(eruptions,-2), 
                                 eruptLengthThreeFuture=rowShift(eruptions,3))]
##   eruptLengthCurrent eruptLengthTwoPrior eruptLengthThreeFuture
##1:              3.600                  NA                  2.283
##2:              1.800                  NA                  4.533
##3:              3.333               3.600                     NA
##4:              2.283               1.800                     NA
##5:              4.533               3.333                     NA

— dnlbrky

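This is a great answer. I'm annoyed that I've already upvoted the other answers, because this one is far more general. In fact, I'm going to use it in my geneorama package (if you don't mind).

— geneorama

Sure, go ahead. I was hoping to find some free time and submit it as a pull request to the data.table package, but ...

— dnlbrky 2014

A similar function called shift has been added to data.table as of version 1.9.5. See the updated answer from @Arun.

— dnlbrky 2015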
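As a rough check of how the two functions line up, the sketch below compares rowShift() from the answer above with data.table's shift() on a short vector (the values are the first five eruption lengths shown earlier). The mapping of sign to `type` is my reading of the two functions, not something stated in the thread.

```r
library(data.table)

# rowShift() as defined in the answer above
rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r < 1] <- NA
  return(x[r])
}

x <- c(3.600, 1.800, 3.333, 2.283, 4.533)

# A negative shiftLen looks back, which corresponds to type = "lag"
lag_same  <- isTRUE(all.equal(rowShift(x, -2), shift(x, 2L, type = "lag")))

# A positive shiftLen looks ahead, which corresponds to type = "lead"
lead_same <- isTRUE(all.equal(rowShift(x, 3), shift(x, 3L, type = "lead")))
```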

12

Based on @Steve Lianoglou's comment above, why not just:

DT[, D:= C + c(NA, B[.I - 1]) ]
#    A  B   C   D
# 1: 1 10 100  NA
# 2: 2 20 200 210
# 3: 3 30 300 320
# 4: 4 40 400 430
# 5: 5 50 500 540

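and avoid using seq_len or head or any other function.

— Gary Weissman

2

Nice, but this won't work if you want to find the previous value within a group.

— Matthew

1

@Matthew you're right. If subsetting by group, I would replace .I with seq_len(.N)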

— Gary Weissman
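A sketch of that grouped variant, on made-up data with an assumed grouping column `g`. Inside `by`, B holds only the current group's values, so indices must run within the group, which seq_len(.N) provides and .I does not.

```r
library(data.table)

# Hypothetical data: B and C as in the question, plus an assumed grouping column `g`
DT <- data.table(g = c("x", "x", "x", "y", "y"), B = 1:5 * 10, C = 1:5 * 100)

# seq_len(.N) - 1 yields 0, 1, ..., .N - 1; the 0 index is dropped by R,
# leaving the first .N - 1 values of B, which c(NA, ...) pads back to .N
DT[, D := C + c(NA, B[seq_len(.N) - 1]), by = g]
```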

9

Following Arun's solution, similar results can be obtained without referring to .N

> DT[, D := C + c(NA, head(B, -1))][]
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540

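— Ryogi

Is there a reason to prefer one method over the other? Or is it simply an aesthetic difference?

— Korone

I think that in this case (i.e. where .N is readily available) it is mostly an aesthetic choice. I am not aware of any important difference.

— Ryogi

1

I added a padding argument, changed some names, and called it shift. https://github.com/geneorama/geneorama/blob/master/R/shift.R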

— geneorama

1

Thanks a lot for your note. I'll be on the lookout for it, and will very likely use it and deprecate my own version.

— 2015

1

Here is my intuitive solution:

#create data frame
df <- data.frame(A=1:5, B=seq(10,50,10), C=seq(100,500, 100))
#subtract the shift from num rows
shift  <- 1 #in this case the shift is 1
invshift <- nrow(df) - shift
#Now create the new column, padding the first `shift` rows with NA
df$D <- c(rep(NA, shift), head(df$B, invshift)+tail(df$C, invshift))

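Here invshift, the number of rows minus 1, is 4. nrow(df) gives you the number of rows in a data frame. Similarly, if you want to take values from further back, subtract 2, 3, etc. from nrow, and put the matching number of NAs at the beginning.

— Abdullah Al Mahmud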

-2

It can be done in a loop.

# Create the column D
DT$D <- 0
# for every row in DT
for (i in seq_len(nrow(DT))) {
  if(i==1) {
    #using NA at first line
    DT[i,4] <- NA
  } else {
    #D = C + BPreviousRow
    DT[i,4] <- DT[i,3] + DT[(i-1), 2]   
  }
}

With a for loop, you can even use the previous row's value of this new column, DT[(i-1), 4]

— Rafael Braga
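A hedged sketch of that last point: each row of D below depends on the previous row of D itself, which a single vectorised lag cannot express. The recurrence D[i] = C[i] + D[i-1] is an assumption chosen for illustration, and data.table's set() is used here in place of the DT[i, 4] <- ... assignment from the answer.

```r
library(data.table)

DT <- data.table(A = 1:5, B = 1:5 * 10, C = 1:5 * 100)

# Start with an all-NA numeric column, then seed the first row
DT[, D := NA_real_]
set(DT, i = 1L, j = "D", value = DT$C[1L])

# Every later row uses the value just written to the row above it
for (i in seq_len(nrow(DT))[-1L]) {
  set(DT, i = i, j = "D", value = DT$C[i] + DT$D[i - 1L])
}
```

For this particular recurrence the result equals cumsum(DT$C), so the loop is only worth it when no such closed form exists.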