Add a column to a data.frame

115

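I have the data.frame below. I want to add a column that classifies my data according to column 1 (h_no), so that the first series of h_no 1,2,3,4 is class 1, the second series of h_no (1 to 7) is class 2, and so on, as shown in the last column.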

h_no  h_freq  h_freqsq
1     0.09091 0.008264628 1
2     0.00000 0.000000000 1
3     0.04545 0.002065702 1
4     0.00000 0.000000000 1  
1     0.13636 0.018594050 2
2     0.00000 0.000000000 2
3     0.00000 0.000000000 2
4     0.04545 0.002065702 2
5     0.31818 0.101238512 2
6     0.00000 0.000000000 2
7     0.50000 0.250000000 2 
1     0.13636 0.018594050 3 
2     0.09091 0.008264628 3
3     0.40909 0.167354628 3
4     0.04545 0.002065702 3

r dataframe

— Susanne Dreisigacker

155

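You can add a column to your data using various techniques. The quotes below come from the "Details" section of the relevant help text, [[.data.frame.

Data frames can be indexed in several modes. When [ and [[ are used with a single vector index (x[i] or x[[i]]), they index the data frame as if it were a list.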

my.dataframe["new.col"] <- a.vector
my.dataframe[["new.col"]] <- a.vector

The data.frame method for $ treats x as a list

my.dataframe$new.col <- a.vector

When [ and [[ are used with two indices (x[i, j] and x[[i, j]]) they act like indexing a matrix

my.dataframe[ , "new.col"] <- a.vector

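The data.frame method treats a single index as a column index, so if you do not specify whether you are working with columns or rows, R assumes you mean columns.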
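To see that the four assignment forms above give the same result, here is a minimal sketch. The names df and v are placeholders made up for this illustration, not part of the question's data.

```r
# Hypothetical example data; df and v are placeholder names
df <- data.frame(id = 1:3)
v <- c(10, 20, 30)

df["a"] <- v       # list-style, single bracket
df[["b"]] <- v     # list-style, double bracket
df$c <- v          # list-style via $
df[, "d"] <- v     # matrix-style, explicit column position

# all four new columns hold the same values
print(df)
```

Which form to use is mostly a matter of taste; the comma form makes it explicit that a column is meant.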

For your example, this should work:

# make some fake data
your.df <- data.frame(no = c(1:4, 1:7, 1:5), h_freq = runif(16), h_freqsq = runif(16))

# find the row where each sequence starts (no == 1)
from <- which(your.df$no == 1)
to <- c((from-1)[-1], nrow(your.df)) # up to which point the sequence runs

# generate a sequence (len) and based on its length, repeat a consecutive number len times
get.seq <- mapply(from, to, seq_along(from), FUN = function(x, y, z) {
            len <- length(seq(from = x[1], to = y[1]))
            return(rep(z, times = len))
         })

# when we unlist, we get a vector, which we append to the original data.frame
your.df$group <- unlist(get.seq)
# since this is designating a group, it makes sense to make it a factor
your.df$group <- as.factor(your.df$group)


   no     h_freq   h_freqsq group
1   1 0.40998238 0.06463876     1
2   2 0.98086928 0.33093795     1
3   3 0.28908651 0.74077119     1
4   4 0.10476768 0.56784786     1
5   1 0.75478995 0.60479945     2
6   2 0.26974011 0.95231761     2
7   3 0.53676266 0.74370154     2
8   4 0.99784066 0.37499294     2
9   5 0.89771767 0.83467805     2
10  6 0.05363139 0.32066178     2
11  7 0.71741529 0.84572717     2
12  1 0.10654430 0.32917711     3
13  2 0.41971959 0.87155514     3
14  3 0.32432646 0.65789294     3
15  4 0.77896780 0.27599187     3
16  5 0.06100008 0.55399326     3
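The same from/to bookkeeping can probably be written without mapply, because rep() accepts a vector of counts. This is a sketch on a bare vector with the same shape as the fake no column above; the variable names are illustrative only.

```r
# Toy column with the same shape as the fake data above (random columns omitted)
no <- c(1:4, 1:7, 1:5)

from <- which(no == 1)               # start row of each series: 1, 5, 12
to <- c(from[-1] - 1, length(no))    # end row of each series: 4, 11, 16

# rep() takes one count per value, so each group id is repeated for its series length
group <- rep(seq_along(from), times = to - from + 1)
group
```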

— Roman Luštrik

What's the difference between the last two methods of adding a column?

— huon

2

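@huon-dbaupp The method with the comma is explicit and also works on matrices, while the last one works only on data.frames. If no comma is supplied, R assumes you mean columns.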

— Roman Luštrik, 2015
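A small sketch of the difference described in this comment, using a made-up matrix m and its data frame counterpart d (both names are illustrative):

```r
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))
d <- as.data.frame(m)

m[, "b"]   # matrix: the whole column b
m[2]       # matrix: a single index picks the 2nd element, not a column

d[, "b"]   # data frame: column b as a vector
d["b"]     # data frame: a single index picks column b (as a one-column data frame)
```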

12

Easily: your data frame is A

b <- A[,1]
b <- b==1
b <- cumsum(b)

Then you get the column b.

— username

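Nice and short. I would just change the last step so that, instead of b <- cumsum(b), the result is added directly as a column to the original data frame, something like A$groups <- cumsum(b).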

— A5C1D2H2I1M1N2O1R2T1

cumsum(b) will give you a vector of length 3, or am I missing something?

— Roman Luštrik '12

@RomanLuštrik, see dbaupp's solution, which explains how cumsum works in this case.

— A5C1D2H2I1M1N2O1R2T1

2

@RomanLuštrik, this solution can be nicely rewritten on a single line. Using the your.df data, you can simply do your.df$group = cumsum(your.df[, 1]==1) to get your new group column.

— A5C1D2H2I1M1N2O1R2T1
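To make the length question from this thread concrete, here is a sketch that runs the cumsum idea on the h_no values from the question. The variable names starts and groups are illustrative.

```r
h_no <- c(1:4, 1:7, 1:4)    # the h_no column from the question
starts <- h_no == 1         # TRUE at the first row of each series
groups <- cumsum(starts)    # running count of series starts

length(groups)              # 15: one value per row, not one per group
groups                      # 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
```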

7

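If I understand the question correctly, you want to detect when h_no doesn't increase and then increment class. (I'm going to walk through how I solved this problem, and there is a self-contained function at the end.)

Working

We only care about the h_no column for the moment, so we can extract that from the data frame: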

> h_no <- data$h_no

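We want to detect when h_no doesn't go up, which we can do by working out when the difference between successive elements is either negative or zero. R provides the diff function, which gives us the vector of differences: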

> d.h_no <- diff(h_no)
> d.h_no
 [1]  1  1  1 -3  1  1  1  1  1  1 -6  1  1  1

Once we have that, it is a simple matter to find the ones that are non-positive:

> nonpos <- d.h_no <= 0
> nonpos
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[13] FALSE FALSE

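In R, TRUE and FALSE are basically the same as 1 and 0, so if we take the cumulative sum of nonpos, it will increase by 1 in (almost) the appropriate spots. The cumsum function (which is basically the opposite of diff) can do this.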

> cumsum(nonpos)
 [1] 0 0 0 1 1 1 1 1 1 1 2 2 2 2

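However, there are two problems: the numbers are one too small, and we are missing the first element (there should be four in the first class).

The first problem is simply solved: 1+cumsum(nonpos). The second just requires adding a 1 to the front of the vector, since the first element is always in class 1: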

 > classes <- c(1, 1 + cumsum(nonpos))
 > classes
  [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3

Now we can attach it back onto our data frame with cbind (by using the class= syntax, we can give the column the heading class):

 > data_w_classes <- cbind(data, class=classes)

And data_w_classes now contains the result.

Final result

We can compress the lines together and wrap it all up into a function to make it easier to use:

classify <- function(data) {
   cbind(data, class=c(1, 1 + cumsum(diff(data$h_no) <= 0)))
}

Or, since it makes sense for the class to be a factor:

classify <- function(data) {
   cbind(data, class=factor(c(1, 1 + cumsum(diff(data$h_no) <= 0))))
}

You can use either function like this:

> classified <- classify(data) # doesn't overwrite data
> data <- classify(data) # data now has the "class" column

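(This way of solving the problem is good because it avoids explicit iteration, which is generally recommended for R, and avoids generating lots of intermediate vectors and lists. It is also kind of neat that it can be written on one line. :) )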
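Detecting a non-increase with diff is not quite the same as looking for h_no == 1. A sketch on a hypothetical h_no vector, whose second series starts at 2 rather than 1 (this case does not occur in the question's data), shows where the two would differ:

```r
# Hypothetical h_no: the second series starts at 2 instead of 1
h_no <- c(1, 2, 3, 2, 3, 4, 1, 2)

by_reset <- c(1, 1 + cumsum(diff(h_no) <= 0))  # new class whenever h_no fails to increase
by_ones  <- cumsum(h_no == 1)                  # new class only where h_no equals 1

by_reset  # 1 1 1 2 2 2 3 3
by_ones   # 1 1 1 1 1 1 2 2
```

For data where every series starts at 1, as in the question, both give the same classes.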

— huon

2

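In addition to Roman's answer, something like this might be even simpler. Note that I haven't tested it because I do not have access to R right now.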

# Note that I use a global variable here
# normally not advisable, but I liked the
# use here to make the code shorter
index <- 0
new_column <- sapply(df$h_no, function(x) {
  # <<- updates the global counter; a plain = would only create a local copy
  if (x == 1) index <<- index + 1
  return(index)
})

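The function iterates over the values in h_no and always returns the category that the current value belongs to. If a value of 1 is detected, we increase the global variable index and continue.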
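If the global variable is a concern, the running counter can instead live in a closure. This is a sketch of that swapped-in variant, not the answer's own code; make_counter is a hypothetical helper name.

```r
# make_counter is a hypothetical helper: it returns a function with its own private counter
make_counter <- function() {
  index <- 0                          # lives in this function's environment, not globally
  function(x) {
    if (x == 1) index <<- index + 1   # <<- modifies the enclosing index
    index
  }
}

h_no <- c(1:4, 1:7, 1:4)              # h_no values from the question
new_column <- sapply(h_no, make_counter())
new_column
```

Each call to make_counter() starts a fresh count, so running the code twice does not carry state over.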

— Paul Hiemstra

I like the hack with the global variable. Nice :P

— Roman Luštrik '12

2

I believe that using "cbind" is the simplest way to add a column to a data frame in R. Below is an example:

    myDf = data.frame(index=seq(1,10,1), Val=seq(1,10,1))
    newCol= seq(2,20,2)
    myDf = cbind(myDf,newCol)
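One detail worth knowing with cbind is where the new column's name comes from. A short sketch, using a smaller made-up myDf and the illustrative column name doubled:

```r
myDf <- data.frame(index = 1:3)
newCol <- c(2, 4, 6)

names(cbind(myDf, newCol))            # "index" "newCol": name taken from the variable
names(cbind(myDf, doubled = newCol))  # "index" "doubled": name given explicitly
```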

— Emanuele Catania

1

Data.frame[,'h_new_column'] <- as.integer(Data.frame[,'h_no'], breaks=c(1, 4, 7))

— username

0

Approach based on identifying the number of groups (x in mapply) and their lengths (y in mapply)

mytb<-read.table(text="h_no  h_freq  h_freqsq group
1     0.09091 0.008264628 1
2     0.00000 0.000000000 1
3     0.04545 0.002065702 1
4     0.00000 0.000000000 1  
1     0.13636 0.018594050 2
2     0.00000 0.000000000 2
3     0.00000 0.000000000 2
4     0.04545 0.002065702 2
5     0.31818 0.101238512 2
6     0.00000 0.000000000 2
7     0.50000 0.250000000 2 
1     0.13636 0.018594050 3 
2     0.09091 0.008264628 3
3     0.40909 0.167354628 3
4     0.04545 0.002065702 3", header=T, stringsAsFactors=F)
mytb$group<-NULL

positionsof1s<-which(mytb$h_no==1)  # exact match; grep(1, ...) would also match 10, 11, ...

mytb$newgroup<-unlist(mapply(function(x,y) 
  rep(x,y),                      # repeat group number x, y times
  x= seq_along(positionsof1s),   # x is the group number, 1 to number of groups = g1:g3
  y= c( diff(positionsof1s),     # y is the length of each group up to the penultimate (g1, g2) = 4, 7
        nrow(mytb)-              # this line and the following give the length of the last group (g3)
          (positionsof1s[length(positionsof1s )]-1 )  # number of rows - (start of last group - 1)
      ) ) )
mytb
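The special case for the last group can likely be avoided by appending one position past the end before taking differences. This sketch works on a bare vector with the question's h_no values; starts and lens are illustrative names.

```r
h_no <- c(1:4, 1:7, 1:4)                    # h_no values from the question
starts <- which(h_no == 1)                  # 1, 5, 12
lens <- diff(c(starts, length(h_no) + 1))   # 4, 7, 4: appending n + 1 covers the last group too
newgroup <- rep(seq_along(starts), times = lens)
newgroup
```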

— Ferroao