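I am working with a small dataset (21 observations) and have the following normal QQ plot in R:
Seeing that the plot does not support normality, what can I infer about the underlying distribution? It seems to me that a distribution more skewed to the right would be a better fit, is that right? Also, what other conclusions can we draw from the data?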
Answers:
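If the values lie along a line, the distribution has the same shape (up to location and scale) as the theoretical distribution we have assumed.
Local behaviour: With the sorted sample values on the y-axis and the (approximate) expected quantiles on the x-axis, we can look at how the values in some section of the plot depart locally from the overall linear trend. That tells us whether the values in that section are more or less concentrated than the theoretical distribution would suggest:
As we see, less concentrated points increase more rapidly, and more concentrated points increase less rapidly, than an overall linear relation would suggest. In the extreme cases this corresponds to a gap in the density of the sample (shown as a near-vertical jump) or a spike of constant values (values aligned horizontally). This lets us spot a heavier or lighter tail, and hence skewness greater or smaller than that of the theoretical distribution, and so on.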
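To make the construction concrete, here is a minimal sketch of how the points of a normal QQ plot can be built by hand. The exponential sample and the seed are assumptions chosen purely for illustration; they are not the asker's data. The local slope between neighbouring points is large where the sample is sparse and small where it is dense.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 21
y <- sort(rexp(n))               # hypothetical right-skewed sample (illustration only)
theo <- qnorm(ppoints(n))        # approximate expected standard normal quantiles

# Slope between neighbouring points: sample gap divided by theoretical gap.
# Sparse regions of the sample give steep segments, dense regions give flat ones.
local_slope <- diff(y) / diff(theo)

if (interactive()) {
  plot(theo, y, xlab = "Theoretical quantiles", ylab = "Sample quantiles",
       main = "Normal QQ plot built by hand")
  qqline(y)                      # reference line through the quartiles
}
```

The `theo` values are the same x-coordinates that `qqnorm(y)` would use, so the hand-built plot should match the built-in one.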
Overall appearance:
Here's what QQ plots look like on average (for particular choices of distribution):
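As a rough stand-in for such a display, the idealised (noise-free) shapes can be drawn by plotting one distribution's quantile function against the normal quantile function. The four distributions below are my own choices for illustration and are not necessarily the ones used in the original figure.

```r
p <- ppoints(200)                # evenly spread probabilities in (0, 1)
z <- qnorm(p)                    # theoretical normal quantiles (x-axis)

# Population quantiles of a few illustrative distributions (y-axis).
shapes <- list(
  "Right skew (exponential)"        = qexp(p),
  "Left skew (negated exponential)" = -qexp(1 - p),
  "Heavy tails (t, 3 df)"           = qt(p, df = 3),
  "Light tails (uniform)"           = qunif(p)
)

if (interactive()) {
  op <- par(mfrow = c(2, 2))
  for (nm in names(shapes)) {
    plot(z, shapes[[nm]], type = "l", main = nm,
         xlab = "Theoretical quantiles", ylab = "Distribution quantiles")
  }
  par(op)
}

# For the right-skewed case the curve is convex: its slope keeps increasing.
slope_exp <- diff(shapes[["Right skew (exponential)"]]) / diff(z)
```

Right skew bends upward (convex), left skew bends downward (concave), heavy tails give an S shape that is steep at both ends, and light tails give an S shape that is flat at both ends.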
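But randomness tends to obscure things, especially with small samples:
Note that at n = 21 the results may be much more variable than shown there: I generated several such sets of six plots and chose a "nice" set where you can see the shape in all six plots at the same time. Sometimes straight relationships look curved, curved relationships look straight, heavy tails just look skewed, and so on. With such small samples, the situation is often much less clear:
It's possible to discern more features than those (discreteness, for example), but at n = 21 even such basic features may be hard to spot, and we shouldn't try to "over-interpret" every little wiggle. As sample sizes become larger, the plots generally "stabilize" and the features become more clearly interpretable rather than representing noise. [With some very heavy-tailed distributions, the rare large outlier might prevent the picture from stabilizing nicely even at quite large sample sizes.]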
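One way to get a feel for this variability is to simulate it. The sketch below draws many truly normal samples and records how straight each QQ plot is, measured by the correlation between sorted sample values and theoretical quantiles. That correlation is my own choice of summary, and the sample sizes 21 and 500 and the seed are assumptions for illustration.

```r
set.seed(42)                     # arbitrary seed, only for reproducibility

# Straightness of a normal QQ plot for one simulated normal sample of size n.
qq_cor <- function(n) {
  y <- sort(rnorm(n))
  cor(y, qnorm(ppoints(n)))
}

small <- replicate(1000, qq_cor(21))    # samples the size of the asker's data
large <- replicate(1000, qq_cor(500))   # much larger samples, for comparison

summary(small)
summary(large)

# Six QQ plots of genuinely normal data at n = 21, to see how wiggly they can be.
if (interactive()) {
  op <- par(mfrow = c(2, 3))
  for (i in 1:6) {
    y <- rnorm(21)
    qqnorm(y)
    qqline(y)
  }
  par(op)
}
```

Even though every sample here really is normal, the n = 21 plots vary far more in straightness than the n = 500 ones, which is why small wiggles should not be over-read.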
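You may also find the suggestion here useful when trying to decide how much you should worry about a particular amount of curvature or wiggliness.
A more suitable guide for interpretation in general would also include displays at both smaller and larger sample sizes.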
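I made a Shiny app to help interpret normal QQ plots. Try this link.
In this app, you can adjust the skewness, tailedness (kurtosis) and modality of the data and see how the histogram and the QQ plot change. Conversely, you can start from the pattern of a given QQ plot and then check what the skewness, tailedness and so on should be.
For further details, see the documentation therein.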
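I realized that I don't have enough free space to host this app online. As requested, I will provide all three chunks of code here: sample.R, server.R and ui.R. Those who are interested in running this app can load these files into RStudio and run it on their own PC.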
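For readers who prefer the console to the RStudio button, here is a minimal sketch of launching the app once the three files are saved in one folder. The helper names `app_ready` and `run_qq_app` and the folder name `qq-app` are hypothetical; `shiny::runApp()` is the standard Shiny entry point for a directory holding `server.R` and `ui.R`.

```r
# The three files the app needs, all expected in the same directory.
app_files <- c("sample.R", "server.R", "ui.R")

# TRUE if every required file is present in app_dir (hypothetical helper).
app_ready <- function(app_dir) {
  all(file.exists(file.path(app_dir, app_files)))
}

# Launch the app after checking the files (hypothetical helper).
# Requires the 'shiny' and 'ggplot2' packages to be installed.
run_qq_app <- function(app_dir) {
  stopifnot(app_ready(app_dir))
  shiny::runApp(app_dir)
}

# Usage, assuming the files were saved in a folder called "qq-app":
# run_qq_app("qq-app")
```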
The sample.R file:
# Compute the positive part of a real number x, which is $\max(x, 0)$.
positive_part <- function(x) {ifelse(x > 0, x, 0)}
# This function generates n data points from some unimodal population.
# Input: ----------------------------------------------------
# n: sample size;
# mu: the mode of the population, default value is 0.
# skewness: the parameter that reflects the skewness of the distribution, note it is not
# the exact skewness defined in statistics textbook, the default value is 0.
# tailedness: the parameter that reflects the tailedness of the distribution, note it is
# not the exact kurtosis defined in textbook, the default value is 0.
# When all arguments take their default values, the data will be generated from standard
# normal distribution.
random_sample <- function(n, mu = 0, skewness = 0, tailedness = 0){
  sigma = 1
  # The sampling scheme resembles rejection sampling. At each step a batch of data points is
  # proposed, and each point is rejected or accepted based on weights determined by the
  # skewness and tailedness inputs.
  reject_skewness <- function(x){
    scale = 1
    # If `skewness` > 0 (right-skewed data), small values of x are rejected with higher
    # probability.
    # plogis(a) equals exp(a) / (1 + exp(a)) but does not return NaN when exp(a) overflows.
    plogis(-scale * skewness * x)
  }
  reject_tailedness <- function(x){
    scale = 1
    # If `tailedness` < 0 (light-tailed data), values of x that are large in absolute value
    # are rejected with higher probability.
    plogis(-scale * tailedness * abs(x))
  }
  # w is another layer of control over the tailedness: the higher w is, the more heavily
  # tailed the data will be.
  w = positive_part((1 - exp(-0.5 * tailedness)))/(1 + exp(-0.5 * tailedness))
  filter <- function(x){
    # A proposed data point is accepted only if it satisfies the following condition, which
    # is how the skewness and tailedness of the data are controlled. (For example, proposed
    # data points are rejected more frequently for higher skewness or tailedness.)
    accept <- runif(length(x)) > reject_tailedness(x) * reject_skewness(x)
    x[accept]
  }
  result <- filter(mu + sigma * ((1 - w) * rnorm(n) + w * rt(n, 5)))
  # Keep generating data points until the length of the data vector reaches n.
  while (length(result) < n) {
    result <- c(result, filter(mu + sigma * ((1 - w) * rnorm(n) + w * rt(n, 5))))
  }
  result[1:n]
}
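multimodal <- function(n, Mu, skewness = 0, tailedness = 0) {
  # Deal with the bimodal case: for each observation, pick one of the modes in `Mu` with
  # equal probability, then add unimodal noise around it.
  mumu <- as.numeric(Mu %*% rmultinom(n, 1, rep(1, length(Mu))))
  mumu + random_sample(n, skewness = skewness, tailedness = tailedness)
}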
The server.R file:
library(shiny)
# Need 'ggplot2' package to get a better aesthetic effect.
library(ggplot2)
# The 'sample.R' source code is used to generate data to be plotted, based on the input skewness,
# tailedness and modality. For more information, see the source code in 'sample.R' code.
source("sample.R")
shinyServer(function(input, output) {
  # We generate 10000 data points from the distribution which reflects the specification of skewness,
  # tailedness and modality.
  n = 10000
  # 'scale' is a parameter that controls the skewness and tailedness.
  scale = 1000
  # The `reactive` function is a trick to accelerate the app: it lets us generate the data only
  # once for both plots. The generated sample is stored in the `data` object to be called later.
  data <- reactive({
    # For `Unimodal` choice, we fix the mode at 0.
    if (input$modality == "Unimodal") {mu = 0}
    # For `Bimodal` choice, we fix the two modes at -2 and 2.
    if (input$modality == "Bimodal") {mu = c(-2, 2)}
    # Details are explained in the `sample.R` file.
    sample1 <- multimodal(n, mu, skewness = scale * input$skewness, tailedness = scale * input$kurtosis)
    data.frame(x = sample1)})
  output$histogram <- renderPlot({
    # Plot the histogram. `after_stat(density)` replaces the deprecated `..density..` notation
    # (requires ggplot2 >= 3.3.0).
    ggplot(data(), aes(x = x)) +
      geom_histogram(aes(y = after_stat(density)), binwidth = .5, colour = "black", fill = "white") +
      xlim(-6, 6) +
      # Overlay the density curve.
      geom_density(alpha = .5, fill = "blue") + ggtitle("Histogram of Data") +
      theme(plot.title = element_text(lineheight = .8, face = "bold"))
  })
  output$qqplot <- renderPlot({
    # Plot the QQ plot.
    ggplot(data(), aes(sample = x)) + stat_qq() + ggtitle("QQplot of Data") +
      theme(plot.title = element_text(lineheight = .8, face = "bold"))
  })
})
Finally, the ui.R file:
library(shiny)
# Define UI for application that helps students interpret the pattern of (normal) QQ plots.
# By using this app, we can show students the different patterns of QQ plots (and the histograms,
# for completeness) for different type of data distributions. For example, left skewed heavy tailed
# data, etc.
# This app can be (and is encouraged to be) used in a reversed way, namely, show the QQ plot to the
# students first, then tell them based on the pattern of the QQ plot, the data is right skewed, bimodal,
# heavy-tailed, etc.
shinyUI(fluidPage(
  # Application title
  titlePanel("Interpreting Normal QQ Plots"),
  sidebarLayout(
    sidebarPanel(
      # The first slider controls the skewness of the input data. "-1" indicates the most
      # left-skewed case while "1" indicates the most right-skewed case.
      sliderInput("skewness", "Skewness", min = -1, max = 1, value = 0, step = 0.1, ticks = FALSE),
      # The second slider controls the tailedness of the input data. "-1" indicates the most
      # light-tailed case while "1" indicates the most heavy-tailed case.
      sliderInput("kurtosis", "Tailedness", min = -1, max = 1, value = 0, step = 0.1, ticks = FALSE),
      # This select box lets the user choose the number of modes of the data. Two options are
      # provided: "Unimodal" and "Bimodal".
      selectInput("modality", label = "Modality",
                  choices = c("Unimodal" = "Unimodal", "Bimodal" = "Bimodal"),
                  selected = "Unimodal"),
      br(),
      # The following helper information is shown on the user interface to help users
      # understand the sliders.
      helpText(p("The skewness of data is controlled by moving the", strong("Skewness"), "slider,",
                 "the left side means left skewed while the right side means right skewed."),
               p("The tailedness of data is controlled by moving the", strong("Tailedness"), "slider,",
                 "the left side means light tailed while the right side means heavy tailed."),
               p("The modality of data is controlled by selecting the modality from", strong("Modality"),
                 "select box.")
      )
    ),
    # The main panel outputs two plots. One plot is the histogram of the data (with the nonparametric
    # density curve overlaid). To get a better visualization, we restrict the range of the x-axis to
    # -6 to 6, so part of the data will not be shown when heavy-tailed input is chosen. The other plot
    # is the QQ plot of the data. By convention, the x-axis shows the theoretical quantiles of the
    # standard normal distribution and the y-axis shows the sample quantiles of the data.
    mainPanel(
      plotOutput("histogram"),
      plotOutput("qqplot")
    )
  )
)
)
A very helpful (and intuitive) explanation is given by Prof. Philippe Rigollet in the MIT MOOC course 18.650 Statistics for Applications, Fall 2016. See the video at 45 minutes:
https://www.youtube.com/watch?v=vMaKx9fmJHE
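I roughly reproduced his diagram and keep it in my notes because I find it very useful.
In example 1, in the top-left plot, we see that in the right tail the empirical (or sample) quantile is less than the theoretical quantile:
Qe < Qt
This can be interpreted using the probability density functions. For the same probability level, the empirical quantile lies to the left of the theoretical quantile, which means that the right tail of the empirical distribution is "lighter" than the right tail of the theoretical distribution, i.e. it falls off to values close to zero faster.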
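A small numeric check of the Qe < Qt pattern, as a sketch: I use a uniform distribution rescaled to mean 0 and standard deviation 1 as a stand-in for a light-tailed "empirical" distribution. That choice, and the probability levels, are my own assumptions and are not taken from the lecture.

```r
alpha <- c(0.95, 0.99, 0.999)            # probability levels far in the right tail

Qt <- qnorm(alpha)                       # theoretical (standard normal) quantiles

# Quantiles of Uniform(0, 1) rescaled to mean 0 and sd 1 (mean 1/2, variance 1/12).
Qe <- (qunif(alpha) - 0.5) / sqrt(1 / 12)

data.frame(alpha, Qe, Qt, lighter_right_tail = Qe < Qt)
```

At each of these levels the light-tailed quantile sits to the left of the normal one, so on a normal QQ plot the right end of the point cloud falls below the reference line.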
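Since this thread is regarded as the definitive "how to interpret the normal q-q plot" post, I would like to point readers to a nice, precise mathematical relationship between the normal q-q plot and the excess kurtosis statistic.
Here it is: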
https://stats.stackexchange.com/a/354076/102879
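A brief (and oversimplified) summary is as follows (see the link for more precise mathematical statements): you can actually see excess kurtosis in the normal q-q plot as the average distance between the data quantiles and the corresponding theoretical normal quantiles, weighted by the distance from the data to the mean. Thus, when the values in the tails of the q-q plot generally deviate greatly from the expected normal values in the extreme directions, you have positive excess kurtosis.
Because kurtosis is the average of these deviations weighted by distances from the mean, the values near the center of the q-q plot have little impact on kurtosis. Hence, excess kurtosis is not related to the center of the distribution, where the "peak" is. Rather, excess kurtosis is almost entirely determined by how the tails of the data distribution compare with those of the normal distribution.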
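The point about the center can be illustrated numerically. The sketch below uses the usual moment-based sample kurtosis (the mean of the fourth powers of the z-scores), not the exact weighted-distance identity from the linked answer, and a simulated t sample with 5 degrees of freedom that I chose only as an example of heavy-tailed data.

```r
set.seed(7)                               # arbitrary seed, only for reproducibility
x <- rt(1e5, df = 5)                      # hypothetical heavy-tailed sample

# z-scores using the divide-by-n standard deviation, so that mean(z^2) is 1.
z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))

kurt   <- mean(z^4)                       # moment-based sample kurtosis
excess <- kurt - 3                        # excess kurtosis relative to the normal

# Share of the kurtosis sum that comes from observations within one sd of the mean.
centre <- abs(z) <= 1
share_centre <- sum(z[centre]^4) / sum(z^4)

c(excess = excess, share_centre = share_centre, prop_in_centre = mean(centre))
```

Each central observation contributes at most 1 to the sum of fourth powers, so the share coming from the center can never exceed 1 / kurtosis, even though most of the observations lie there.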