How can I aggregate a week of by-minute data into hourly values?

15

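How would you get hourly values over a daily period and show the results for 12 "hosts" in the same plot? That is, I'd like to plot what a 24-hour period looks like, using a week's worth of data. The eventual goal is to compare two sets of this data, from before and after samples.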

            dates         Host CPUIOWait CPUUser CPUSys
1 2011-02-11 23:55:12     db       0      14      8
2 2011-02-11 23:55:10     app1     0       6      1
3 2011-02-11 23:55:09     app2     0       4      1

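I've been able to run xyplot(CPUUser ~ dates | Host) with good effect. However, rather than showing every date in the week, I'd like the x-axis to be the hours of the day.

Trying to get this data into an xts object results in errors such as "order.by requires an appropriate time-based object".

Here is the str() of the data frame: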

'data.frame':   19720 obs. of  5 variables:
$ dates    : POSIXct, format: "2011-02-11 23:55:12" "2011-02-11 23:55:10" ...
$ Host     : Factor w/ 14 levels "app1","app2",..: 9 7 5 4 3 10 6 8 2 1 ...  
$ CPUIOWait: int  0 0 0 0 0 0 0 0 0 0 ...
$ CPUUser  : int  14 6 4 4 3 10 4 3 4 4 ...
$ CPUSys   : int  8 1 1 1 1 3 1 1 1 1 ...

Update: just FYI, I decided to go with box plots to show the median and the "outliers".

Essentially:

Data$hour <- as.POSIXlt(Data$dates)$hour  # extract hour of the day
boxplot(Data$CPUUser ~ Data$hour)         # for a subset with one host or for all hosts
xyplot(Data$CPUUser ~ Data$hour | Data$Host, panel=panel.bwplot, horizontal=FALSE)
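For anyone who wants the numbers behind those boxes, here is a minimal base R sketch. The data frame below is a made-up stand-in (one host, one reading per minute for two days), not the real data; only the column names dates and CPUUser are taken from the question.

```r
# Hypothetical stand-in for the real data: one reading per minute for 2 days
set.seed(1)
dates <- seq(as.POSIXct("2011-02-10 00:00:00", tz = "GMT"),
             by = "min", length.out = 2 * 24 * 60)
Data  <- data.frame(dates = dates, CPUUser = rpois(length(dates), 5))

# Hour of day (0-23), independent of the calendar date
Data$hour <- as.POSIXlt(Data$dates)$hour

# Median per hour of day: the centre line of each box
med <- tapply(Data$CPUUser, Data$hour, median)

# Values that boxplot() would draw as outlier points, per hour of day
out <- tapply(Data$CPUUser, Data$hour, function(v) boxplot.stats(v)$out)
```

Here med has one entry per hour of the day, and out is a list holding the flagged values for each hour.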

Thanks

r time-series aggregation

— Scott Hoffman

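I'm guessing you get those errors from xts() because the dates column is a factor.

— Joshua Ulrich

I'm really new to R. I created the dates column with the strptime function. The original data came from read.csv.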
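In case it helps later readers: a small sketch of how a timestamp column that arrives from read.csv as text or a factor might be turned into a time-based class. The raw vector and the format string are assumptions based on the sample rows in the question.

```r
# Hypothetical raw column, as read.csv might deliver it (text or factor)
raw <- factor(c("2011-02-11 23:55:12", "2011-02-11 23:55:10"))

# strptime() returns POSIXlt; as.POSIXct() converts it to the compact
# numeric-based class that is usually stored in a data frame column
lt <- strptime(as.character(raw), format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
ct <- as.POSIXct(lt)

class(raw)  # "factor"
class(lt)   # "POSIXlt" "POSIXt"
class(ct)   # "POSIXct" "POSIXt"
```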

— Scott Hoffman

1

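Let's see the str() of your data.frame.

— Roman Luštrik

@Roman Thanks for the str() function; I wasn't aware of it. After getting rid of the Factor column I can produce an xts object, i.e. x <- xts(d[,3:5], order.by=d[,1]). I can then aggregate it by hour, which shortens the data from 19720 observations to 480. I'm not sure that gets me where I want to be, but I think I'm closer now.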

— Scott Hoffman

14

Here is one approach that uses cut() to create the appropriate hourly factor and ddply() from the plyr library to calculate the means.

library(lattice)
library(plyr)

## Create a record and some random data for every 5 seconds 
## over two days for two hosts.
dates <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
             as.POSIXct("2011-01-02 23:59:55", tz = "GMT"),
             by = 5)
hosts <- c(rep("host1", length(dates)), rep("host2", 
           length(dates)))
x1    <- sample(0:20, 2*length(dates), replace = TRUE)
x2    <- rpois(2*length(dates), 2)
Data  <- data.frame(dates = dates, hosts = hosts, x1 = x1, 
                    x2 = x2)

## Calculate the mean for every hour using cut() to define 
## the factors and ddply() to calculate the means. 
## getmeans() is applied for each unique combination of the
## hosts and hour factors.
getmeans  <- function(Df) c(x1 = mean(Df$x1), 
                            x2 = mean(Df$x2))
Data$hour <- cut(Data$dates, breaks = "hour")
Means <- ddply(Data, .(hosts, hour), getmeans)
Means$hour <- as.POSIXct(Means$hour, tz = "GMT")

## A plot for each host.
xyplot(x1 ~ hour | hosts, data = Means, type = "o",
       scales = list(x = list(relation = "free", rot = 90)))

— Jason Morgan

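Thanks... I think I may need to rephrase the question, or ask a new one. Looking at this question, stats.stackexchange.com/questions/980/…, I now think that getting the means isn't what I'm after.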

— Scott Hoffman

@JWM Can you explain how the getmeans function works, and why not just use the mean or colMeans functions?

— Scott Hoffman

1

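The ddply() function cuts the original data set into subsets defined by hosts and hour. It then passes each subset to getmeans() as a data.frame. For your task, colMeans() would probably work fine, but you would likely need to drop the columns you don't want first. The benefit of using ddply() this way is that you can calculate any statistic you may be interested in, for example sd(), range(), etc.

— Jason Morgan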
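To make the split-then-apply idea concrete, here is a rough base R analogue. It is not ddply() itself; it uses split() and lapply() on made-up data, and getstats() is a hypothetical helper that returns several statistics per subset.

```r
# Made-up data: two hosts, one reading per minute for three hours
set.seed(42)
dates <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
             as.POSIXct("2011-01-01 02:59:00", tz = "GMT"), by = "min")
Data  <- data.frame(dates = rep(dates, 2),
                    hosts = rep(c("host1", "host2"), each = length(dates)),
                    x1    = sample(0:20, 2 * length(dates), replace = TRUE))
Data$hour <- cut(Data$dates, breaks = "hour")

# Hypothetical helper: receives one subset, returns one summary row
getstats <- function(Df) data.frame(hosts = Df$hosts[1], hour = Df$hour[1],
                                    mean = mean(Df$x1), sd = sd(Df$x1),
                                    min = min(Df$x1), max = max(Df$x1))

# Split by host and hour, summarise each piece, then stack the rows
pieces <- split(Data, list(Data$hosts, Data$hour), drop = TRUE)
Stats  <- do.call(rbind, lapply(pieces, getstats))
```

Each element of pieces is a small data.frame for one host and one hour, which is roughly what getmeans() receives in the answer above.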

6

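Aggregation also works without zoo (shown here with random data for 2 variables over 3 days and 4 hosts, similar to JWM's data). I assume you have data from all hosts for every hour.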

nHosts <- 4  # number of hosts
dates  <- seq(as.POSIXct("2011-01-01 00:00:00"),
              as.POSIXct("2011-01-03 23:59:30"), by=30)
hosts  <- factor(sample(1:nHosts, length(dates), replace=TRUE),
                 labels=paste("host", 1:nHosts, sep=""))
x1     <- sample(0:20, length(dates), replace=TRUE)  # data from 1st variable
x2     <- rpois(length(dates), 2)                    # data from 2nd variable
Data   <- data.frame(dates=dates, hosts=hosts, x1=x1, x2=x2)

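I'm not sure whether you want to average within each individual hour only, or within each hour of the day across all days. I'll do both.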

Data$hFac <- droplevels(cut(Data$dates, breaks="hour"))
Data$hour <- as.POSIXlt(dates)$hour  # extract hour of the day

# average both variables over days within each hour and host
# formula notation was introduced in R 2.12.0 I think
res1 <- aggregate(cbind(x1, x2) ~ hour + hosts, data=Data, FUN=mean)
# only average both variables within each hour and host
res2 <- aggregate(cbind(x1, x2) ~ hFac + hosts, data=Data, FUN=mean)

The results look like this:

> head(res1)
  hour hosts        x1       x2
1    0 host1  9.578431 2.049020
2    1 host1 10.200000 2.200000
3    2 host1 10.423077 2.153846
4    3 host1 10.241758 1.879121
5    4 host1  8.574713 2.011494
6    5 host1  9.670588 2.070588

> head(res2)
                 hFac hosts        x1       x2
1 2011-01-01 00:00:00 host1  9.192308 2.307692
2 2011-01-01 01:00:00 host1 10.677419 2.064516
3 2011-01-01 02:00:00 host1 11.041667 1.875000
4 2011-01-01 03:00:00 host1 10.448276 1.965517
5 2011-01-01 04:00:00 host1  8.555556 2.074074
6 2011-01-01 05:00:00 host1  8.809524 2.095238
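A quick way to see the difference between the two groupings, sketched on the same three-day, 30-second time grid (with the time zone fixed to GMT here as an assumption, so the counts do not depend on the local zone):

```r
dates <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
             as.POSIXct("2011-01-03 23:59:30", tz = "GMT"), by = 30)

hFac <- droplevels(cut(dates, breaks = "hour"))  # one level per calendar hour
hour <- as.POSIXlt(dates)$hour                   # hour of the day, 0-23

nlevels(hFac)         # 72 groups: 3 days x 24 hours
length(unique(hour))  # 24 groups: each pools the same hour from all 3 days
```

So hFac keeps the days separate, while hour folds the whole period onto a single 24-hour cycle.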

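I also don't completely understand the type of graph you want. Here is a bare-bones version of a plot, for the first variable only, with a separate line for each host.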

# using the data that is averaged over days as well
res1L <- split(subset(res1, select="x1"), res1$hosts)
mat1  <- do.call(cbind, res1L)
colnames(mat1) <- levels(hosts)
rownames(mat1) <- 0:23
matplot(mat1, main="x1 per hour, avg. over days", xaxt="n", type="o", pch=16, lty=1)
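# matplot() plots the rows at x = 1..24, so label position 1 as hour 0, etc.
axis(side=1, at=seq(1, 24, by=2), labels=seq(0, 23, by=2))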
legend(x="topleft", legend=colnames(mat1), col=1:nHosts, lty=1)

The same plot for the data that is averaged within each individual hour only.

res2L <- split(subset(res2, select="x1"), res2$hosts)
mat2  <- do.call(cbind, res2L)
colnames(mat2) <- levels(hosts)
rownames(mat2) <- levels(Data$hFac)
matplot(mat2, main="x1 per hour", type="o", pch=16, lty=1)
legend(x="topleft", legend=colnames(mat2), col=1:nHosts, lty=1)

— caracal

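Nice response. There is a lot here I'm not familiar with, so I'll need to play around with it. Looking at my data with your approach, though, I think I also need to show the highs in the data. Thanks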
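One possible way to keep the highs next to the averages, sketched with aggregate() on made-up data shaped like the answer's example. The names avg, high and res are only illustrative, and max is just one reading of "highs"; a high quantile would follow the same pattern.

```r
# Made-up data: two hosts, one reading every 30 seconds for two days
set.seed(7)
dates <- seq(as.POSIXct("2011-01-01 00:00:00", tz = "GMT"),
             as.POSIXct("2011-01-02 23:59:30", tz = "GMT"), by = 30)
Data  <- data.frame(dates = dates,
                    hosts = factor(sample(c("host1", "host2"), length(dates),
                                          replace = TRUE)),
                    x1    = sample(0:20, length(dates), replace = TRUE))
Data$hour <- as.POSIXlt(Data$dates)$hour

# Mean and maximum per hour of day and host, computed separately
avg  <- aggregate(x1 ~ hour + hosts, data = Data, FUN = mean)
high <- aggregate(x1 ~ hour + hosts, data = Data, FUN = max)

# Join the two summaries; the shared x1 column gets the given suffixes
res <- merge(avg, high, by = c("hour", "hosts"), suffixes = c(".mean", ".max"))
```

The result has the columns hour, hosts, x1.mean and x1.max, so the averages and the peaks can be drawn in the same plot.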

— Scott Hoffman

2

You may want to check out the aggregate.zoo function from the zoo package: http://cran.r-project.org/web/packages/zoo/zoo.pdf

Charlie

— Charlie

Can you help me understand why I get NAs when I run this?

— Scott Hoffman

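Hi Scott, I haven't actually used the aggregate.zoo function, although I have used the zoo package. Are you sure your object is a zoo object in the first place? The documentation I pointed to should help you.

— Charlie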