Answers:
Using aggregate:
aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
Category x
1 First 30
2 Second 5
3 Third 34
In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be combined via cbind:
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
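The truncated call above can be illustrated with a complete one. This is only a sketch: the Metric2 column and its values are made up for illustration and are not part of the original data.

```r
# Same example data as used in this answer, plus a hypothetical second metric
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)  # invented values, for illustration only
)

# cbind() builds a matrix of the columns to aggregate; naming the arguments
# gives the result columns readable names instead of V1, V2
res <- aggregate(cbind(Frequency = x$Frequency, Metric2 = x$Metric2),
                 by = list(Category = x$Category), FUN = sum)
print(res)
```

Because cbind produces a single matrix, this only makes sense when all the metrics share one data type (here, numeric).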
(embedding @thelatemail's comment) aggregate also has a formula interface:
aggregate(Frequency ~ Category, x, sum)
Or, if you want to aggregate multiple columns, you can use the . notation (this works for one column too):
aggregate(. ~ Category, x, sum)
Or tapply:
tapply(x$Frequency, x$Category, FUN=sum)
First Second Third
30 5 34
Using this data:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
You can also use the dplyr package for this:
library(dplyr)
x %>%
group_by(Category) %>%
summarise(Frequency = sum(Frequency))
#Source: local data frame [3 x 2]
#
# Category Frequency
#1 First 30
#2 Second 5
#3 Third 34
Or, for multiple summary columns (this works with one column too):
x %>%
group_by(Category) %>%
summarise_all(funs(sum))
Here are some more examples of how to summarise data by group with dplyr functions, using the built-in dataset mtcars:
# several summary columns with arbitrary names
mtcars %>%
group_by(cyl, gear) %>% # multiple group columns
summarise(max_hp = max(hp), mean_mpg = mean(mpg)) # multiple summary columns
# summarise all columns except grouping columns using "sum"
mtcars %>%
group_by(cyl) %>%
summarise_all(sum)
# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>%
group_by(cyl) %>%
summarise_all(funs(sum, mean))
# multiple grouping columns
mtcars %>%
group_by(cyl, gear) %>%
summarise_all(funs(sum, mean))
# summarise specific variables, not all
mtcars %>%
group_by(cyl, gear) %>%
summarise_at(vars(qsec, mpg, wt), funs(sum, mean))
# summarise specific variables (numeric columns except grouping columns)
mtcars %>%
group_by(gear) %>%
summarise_if(is.numeric, funs(mean))
For more information, including the %>% operator, see the introduction to dplyr.
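A note on funs(), the argument of summarise_all and related functions (summarise_at, summarise_if): it has been deprecated since dplyr 0.8.0. In current versions, pass a function directly or a named list such as list(sum = sum, mean = mean) instead.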
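As a hedged sketch of what the scoped-verb examples look like without funs(), the following uses a named list of functions and, alternatively, across(). It assumes dplyr 1.0.0 or later is installed; the object names by_cyl and by_cyl2 are only illustrative.

```r
library(dplyr)

# A named list of functions takes the place of funs(sum, mean)
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise_at(vars(mpg, wt), list(sum = sum, mean = mean))

# The same summary written with across()
by_cyl2 <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, wt), list(sum = sum, mean = mean)))

print(by_cyl)
print(by_cyl2)
```

Both forms name the output columns after the variable and the list entry (for example mpg_sum and wt_mean), though the column order differs between the two.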
The answer provided by rcs is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative:
library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
# Category V1
# 1: First 30
# 2: Second 5
# 3: Third 34
system.time(data[, sum(Frequency), by = Category] )
# user system elapsed
# 0.008 0.001 0.009
Let's compare that with the same operation using a data.frame and the aggregate call above:
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user system elapsed
# 0.008 0.000 0.015
And if you want to keep the column name, this is the syntax:
data[,list(Frequency=sum(Frequency)),by=Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
The difference becomes more noticeable with larger datasets, as the code below demonstrates:
# 300000 rows; both columns are given the same length so nothing has to be recycled
data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(300000))
system.time( data[,sum(Frequency),by=Category] )
# user system elapsed
# 0.055 0.004 0.059
# 300000 rows; both columns are given the same length so nothing has to be recycled
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(300000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user system elapsed
# 0.287 0.010 0.296
For multiple aggregations, you can combine lapply and .SD as follows:
data[, lapply(.SD, sum), by = Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
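Regarding data[, sum(Frequency), by = Category]: if you want the number of rows per group rather than the total, you can use .N in place of the sum() call, as in data[, .N, by = Category]. Note that .N counts rows; it does not add up Frequency.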
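To make the difference between .N and sum() concrete, here is a small sketch on the same example data. It assumes the data.table package is installed; the column names Rows and Total are chosen only for this illustration.

```r
library(data.table)

data <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# .N is the number of rows in each group; sum(Frequency) is the group total
res <- data[, .(Rows = .N, Total = sum(Frequency)), by = Category]
print(res)
```

The two columns only coincide when every Frequency value is 1, so .N is a replacement for sum() only when the goal is a row count.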
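Although I have recently become a convert to dplyr for most operations of this kind, the sqldf package is still really nice (and, IMHO, more readable) for some things.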
Here is an example of how this question can be answered with sqldf:
library(sqldf)
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
sqldf("select
Category
,sum(Frequency) as Frequency
from x
group by
Category")
## Category Frequency
## 1 First 30
## 2 Second 5
## 3 Third 34
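I find ave very helpful (and efficient) when you need to apply different aggregation functions to different columns (and you must or want to stick to base R).
For example, given this input: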
DF <-
data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
Categ2=factor(c('X','Y','X','X','X','Y','Y')),
Samples=c(1,2,4,3,5,6,7),
Freq=c(10,30,45,55,80,65,50))
> DF
Categ1 Categ2 Samples Freq
1 A X 1 10
2 A Y 2 30
3 B X 4 45
4 B X 3 55
5 A X 5 80
6 B Y 6 65
7 A Y 7 50
We want to group by Categ1 and Categ2 and compute the sum of Samples and the mean of Freq.
Here is a possible solution using ave:
# create a copy of DF (only the grouping columns)
DF2 <- DF[,c('Categ1','Categ2')]
# add sum of Samples by Categ1,Categ2 to DF2
# (ave repeats the sum of the group for each row in the same group)
DF2$GroupTotSamples <- ave(DF$Samples,DF2,FUN=sum)
# add mean of Freq by Categ1,Categ2 to DF2
# (ave repeats the mean of the group for each row in the same group)
DF2$GroupAvgFreq <- ave(DF$Freq,DF2,FUN=mean)
# remove the duplicates (keep only one row for each group)
DF2 <- DF2[!duplicated(DF2),]
Result:
> DF2
Categ1 Categ2 GroupTotSamples GroupAvgFreq
1 A X 6 45
2 A Y 9 40
3 B X 7 50
6 B Y 6 65
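To see why the duplicated() step is needed, it may help to look at what ave returns on its own. The sketch below passes the two grouping factors directly rather than as a data frame; the variable name per_row is only illustrative.

```r
DF <- data.frame(Categ1  = factor(c('A','A','B','B','A','B','A')),
                 Categ2  = factor(c('X','Y','X','X','X','Y','Y')),
                 Samples = c(1,2,4,3,5,6,7),
                 Freq    = c(10,30,45,55,80,65,50))

# ave() returns one value per input row: each row receives the sum of its group,
# so the result has the same length as the input rather than one entry per group
per_row <- ave(DF$Samples, DF$Categ1, DF$Categ2, FUN = sum)
print(per_row)
```

Because the output stays aligned with the original rows, it can be assigned straight into a new column, and the group-level table is obtained afterwards by dropping the duplicated rows.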
You can use the function group.sum from the package Rfast.
Category <- Rfast::as_integer(Category,result.sort=FALSE) # convert character to numeric. R's as.numeric produce NAs.
result <- Rfast::group.sum(Frequency,Category)
names(result) <- Rfast::Sort(unique(Category))
# 30 5 34
Rfast has many group functions, and group.sum is one of them.
Using cast instead of recast (note that 'Frequency' is now 'value'):
df <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
, value = c(10,15,5,2,14,20,3))
install.packages("reshape")
library(reshape)
result<-cast(df, Category ~ . ,fun.aggregate=sum)
To get:
Category (all)
First 30
Second 5
Third 34
rowsum.
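For reference, a minimal sketch of base R's rowsum() applied to the example data from the first answer. How the original poster intended to use it is not shown in the text, so this is only one plausible reading.

```r
x <- data.frame(Category = factor(c("First", "First", "First", "Second",
                                    "Third", "Third", "Second")),
                Frequency = c(10, 15, 5, 2, 14, 20, 3))

# rowsum() adds up the values within each level of the grouping variable
# and returns a one-column matrix with the group labels as row names
res <- rowsum(x$Frequency, group = x$Category)
print(res)
```

Unlike aggregate, the result is a matrix rather than a data frame, so the group labels live in rownames(res) rather than in a column.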