我想提出一个(标准)初步分析,以消除以下主要影响:(a)用户之间的差异;(b)所有用户对变更的典型响应;以及(c)从一个时间段到下一个时间段的典型差异。
一种简单的方法(但绝不是最好的方法)是对数据执行几次“中值抛光”迭代以清除用户中位数和时间段中位数,然后随时间平滑残差。确定变化很大的平滑度:它们是您要在图形中强调的用户。
因为这些是计数数据,所以最好使用平方根重新表达它们。
作为可能结果的一个示例,这是一个模拟的60周数据集,其中包含240位用户,这些用户通常每周执行10到20次操作。在第40周后,所有用户都发生了更改。其中三个“被告知”对更改做出负面反应。左图显示了原始数据:一段时间内用户(按用户区分颜色)的操作计数。正如问题中所断言的,这是一团糟。右图以与以前相同的颜色显示了此EDA的结果,并自动识别并突出显示了响应异常的用户。标识(尽管有些特殊)是完整且正确的(在此示例中)。
这是R
产生这些数据并进行分析的代码。可以通过几种方式进行改进,包括
尽管如此,测试表明该解决方案适用于12-240或更多的广泛用户。
n.users <- 240 # Number of users (here limited to 657, the number of colors)
n.periods <- 60 # Number of time periods
i.break <- 40 # Period after which change occurs
n.outliers <- 3 # Number of greatly changed users
window <- 1/5 # Temporal smoothing window, fraction of total period
response.all <- 1.1 # Overall response to the change
threshold <- 2 # Outlier detection threshold
# Create a simulated dataset
set.seed(17)
base <- exp(rnorm(n.users, log(10), 1/2))
response <- c(rbeta(n.users - n.outliers, 9, 1),
rbeta(n.outliers, 5, 45)) * response.all
actual <- cbind(base %o% rep(1, i.break),
base * response %o% rep(response.all, n.periods-i.break))
observed <- matrix(rpois(n.users * n.periods, actual), nrow=n.users)
# ---------------------------- The analysis begins here ----------------------------#
# Plot the raw data as lines
set.seed(17)
colors = sample(colors(), n.users) # (Use a different method when n.users > 657)
par(mfrow=c(1,2))
plot(c(1,n.periods), c(min(observed), max(observed)), type="n",
xlab="Time period", ylab="Number of actions", main="Raw data")
i <- 0
apply(observed, 1, function(a) {i <<- i+1; lines(a, col=colors[i])})
abline(v = i.break, col="Gray") # Mark the last period before a change
# Analyze the data by time period and user by sweeping out medians and smoothing
x <- sqrt(observed + 1/6) # Re-express the counts
mean.per.period <- apply(x, 2, median)
residuals <- sweep(x, 2, mean.per.period)
mean.per.user <- apply(residuals, 1, median)
residuals <- sweep(residuals, 1, mean.per.user)
smooth <- apply(residuals, 1, lowess, f=window) # Smooth the residuals
smooth.y <- sapply(smooth, function(s) s$y) # Extract the smoothed values
ends <- ceiling(window * n.periods / 4) # Prepare to drop near-end values
range <- apply(smooth.y[-(1:ends), ], 2, function(x) max(x) - min(x))
# Mark the apparent outlying users
thick <- rep(1, n.users)
thick[outliers <- which(range >= threshold * median(range))] <- 3
type <- ifelse(thick==1, 3, 1)
cat(outliers) # Print the outlier identifiers (ideally, the last `n.outliers`)
# Plot the residuals
plot(c(1,n.periods), c(min(smooth.y), max(smooth.y)), type="n",
xlab="Time period", ylab="Smoothed residual root", main="Residuals")
i <- 0
tmp <- lapply(smooth,
function(a) {i <<- i+1; lines(a, lwd=thick[i], lty=type[i], col=colors[i])})
abline(v = i.break, col="Gray")