@Ronald的答案是最好的,它广泛适用于许多类似的问题(例如,体重和年龄之间的关系,男女之间在统计学上有显着差异吗?)。但是,我将添加另一个解决方案,该解决方案虽然不那么定量(不提供p值),但却可以很好地图形化显示差异。
编辑:根据这个问题,看起来predict.lm
,用于ggplot2
计算置信区间的函数不会计算回归曲线周围的同时置信带,而只会计算点状置信带。最后的频段不是评估两个拟合线性模型是否在统计上不同的正确方法,或者不是以其他方式表示它们是否可以与同一真实模型兼容的合适频段。因此,它们不是回答您问题的正确曲线。由于显然没有内置R可以同时获得置信带(奇怪!),所以我编写了自己的函数。这里是:
simultaneous_CBs <- function(linear_model, newdata, level = 0.95){
# Working-Hotelling 1 – α confidence bands for the model linear_model
# at points newdata with α = 1 - level
# summary of regression model
lm_summary <- summary(linear_model)
# degrees of freedom
p <- lm_summary$df[1]
# residual degrees of freedom
nmp <-lm_summary$df[2]
# F-distribution
Fvalue <- qf(level,p,nmp)
# multiplier
W <- sqrt(p*Fvalue)
# confidence intervals for the mean response at the new points
CI <- predict(linear_model, newdata, se.fit = TRUE, interval = "confidence",
level = level)
# mean value at new points
Y_h <- CI$fit[,1]
# Working-Hotelling 1 – α confidence bands
LB <- Y_h - W*CI$se.fit
UB <- Y_h + W*CI$se.fit
sim_CB <- data.frame(LowerBound = LB, Mean = Y_h, UpperBound = UB)
}
library(dplyr)
# sample datasets
setosa <- iris %>% filter(Species == "setosa") %>% select(Sepal.Length, Sepal.Width, Species)
virginica <- iris %>% filter(Species == "virginica") %>% select(Sepal.Length, Sepal.Width, Species)
# compute simultaneous confidence bands
# 1. compute linear models
Model <- as.formula(Sepal.Width ~ poly(Sepal.Length,2))
fit1 <- lm(Model, data = setosa)
fit2 <- lm(Model, data = virginica)
# 2. compute new prediction points
npoints <- 100
newdata1 <- with(setosa, data.frame(Sepal.Length =
seq(min(Sepal.Length), max(Sepal.Length), len = npoints )))
newdata2 <- with(virginica, data.frame(Sepal.Length =
seq(min(Sepal.Length), max(Sepal.Length), len = npoints)))
# 3. simultaneous confidence bands
mylevel = 0.95
cc1 <- simultaneous_CBs(fit1, newdata1, level = mylevel)
cc1 <- cc1 %>% mutate(Species = "setosa", Sepal.Length = newdata1$Sepal.Length)
cc2 <- simultaneous_CBs(fit2, newdata2, level = mylevel)
cc2 <- cc2 %>% mutate(Species = "virginica", Sepal.Length = newdata2$Sepal.Length)
# combine datasets
mydata <- rbind(setosa, virginica)
mycc <- rbind(cc1, cc2)
mycc <- mycc %>% rename(Sepal.Width = Mean)
# plot both simultaneous confidence bands and pointwise confidence
# bands, to show the difference
library(ggplot2)
# prepare a plot using dataframe mydata, mapping sepal Length to x,
# sepal width to y, and grouping the data by species
p <- ggplot(data = mydata, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
# add data points
geom_point() +
# add quadratic regression with orthogonal polynomials and 95% pointwise
# confidence intervals
geom_smooth(method ="lm", formula = y ~ poly(x,2)) +
# add 95% simultaneous confidence bands
geom_ribbon(data = mycc, aes(x = Sepal.Length, color = NULL, fill = Species, ymin = LowerBound, ymax = UpperBound),alpha = 0.5)
print(p)
内带是默认情况下通过geom_smooth
以下公式计算的:这些是回归曲线周围的逐点 95%置信带。外部的半透明带(感谢图形提示@Roland)是同时 95%置信带。如您所见,它们比预期的要大。来自两条曲线的同时置信带不重叠的事实可以被认为是两个模型之间的差异具有统计学意义的事实。
当然,对于具有有效p值的假设检验,必须遵循@Roland方法,但是该图形方法可以视为探索性数据分析。此外,该情节可以给我们一些其他的想法。显然,两个数据集的模型在统计上是不同的。但是看起来两个1级模型几乎可以像两个二次模型一样拟合数据。我们可以轻松地检验这个假设:
fit_deg1 <- lm(data = mydata, Sepal.Width ~ Species*poly(Sepal.Length,1))
fit_deg2 <- lm(data = mydata, Sepal.Width ~ Species*poly(Sepal.Length,2))
anova(fit_deg1, fit_deg2)
# Analysis of Variance Table
# Model 1: Sepal.Width ~ Species * poly(Sepal.Length, 1)
# Model 2: Sepal.Width ~ Species * poly(Sepal.Length, 2)
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 96 7.1895
# 2 94 7.1143 2 0.075221 0.4969 0.61
1级模型和2级模型之间的差异不明显,因此我们也可以对每个数据集使用两个线性回归。