Explain the quantile() function in R

Question 1

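I've been mystified by R's quantile function all day.

I have an intuitive notion of how quantiles work, and an M.S. in stats, but boy oh boy, the documentation for it confuses me.

From the docs: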

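Q[i](p) = (1 - gamma) x[j] + gamma x[j+1],

I'm with it so far. For a type i quantile, it's an interpolation between x[j] and x[j+1], based on some mysterious constant gamma.

where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.

So, how do I calculate j? And m?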

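For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):

p(k) = (k - alpha) / (n - alpha - beta + 1), where alpha and beta are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.

Now I'm really lost. p, which was a constant before, is now apparently a function.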

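So for Type 7 quantiles, the default...

Type 7

p(k) = (k - 1) / (n - 1). In this case, p(k) = mode[F(x[k])]. This is used by S.

Anyone want to help me out? In particular, I'm confused by the notation of p being both a function and a constant, by what the heck m is, and by how to calculate j for some particular p.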

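I hope that, based on the answers here, we can submit some revised documentation that better explains what is going on.

The source is in quantile.R, or type: quantile.default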

Question 2

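You're understandably confused. That documentation is terrible. I had to go back to the paper it's based on (Hyndman, R.J.; Fan, Y. (November 1996). "Sample Quantiles in Statistical Packages". American Statistician 50 (4): 361–365. doi:10.2307/2684934) to understand it. Let's start with the first problem.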

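where 1 <= i <= 9, (j-m)/n <= p < (j-m+1)/n, x[j] is the jth order statistic, n is the sample size, and m is a constant determined by the sample quantile type. Here gamma depends on the fractional part of g = np+m-j.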

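The first part comes straight from the paper, but what the documentation writers omitted is that j = int(pn+m), that is, pn+m rounded down to an integer. This means Q[i](p) depends only on the two order statistics closest to being a fraction p of the way through the (sorted) observations. (For those who, like me, are unfamiliar with the term, the "order statistics" of a series of observations are simply the sorted series.)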

Also, that last sentence is just wrong. It should read

Here gamma depends on the fractional part of np+m, g = np+m-j

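As for m, that's straightforward. m depends on which of the 9 algorithms was chosen. So just as Q[i] is the quantile function for type i, m should be read as m[i]. For algorithms 1 and 2, m is 0; for algorithm 3, m is -1/2; and for the others, m is defined in the next part.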
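As a rough sketch of how j, g and m fit together for the discontinuous types, the following rebuilds types 1 to 3 by hand and compares them with the built-in function. The name q_discontinuous is made up for illustration, the gamma rules are copied from the type 1 to 3 branch of the quantile.default source, and the sketch assumes p is in the interior so that 1 <= j < n (no edge handling).

```r
# Hypothetical helper: types 1-3 from j, g and m (interior p only, no edge cases)
q_discontinuous <- function(x, p, type) {
  x <- sort(x)                          # x[j] is now the j-th order statistic
  n <- length(x)
  m <- if (type == 3) -0.5 else 0       # m[i]: 0 for types 1 and 2, -1/2 for type 3
  j <- floor(n * p + m)                 # j = int(np + m)
  g <- n * p + m - j                    # fractional part of np + m
  gamma <- switch(type,
    as.numeric(g > 0),                  # type 1: 0 when g == 0, else 1
    (as.numeric(g > 0) + 1) / 2,        # type 2: 0.5 when g == 0, else 1
    as.numeric(g > 0 | j %% 2 == 1))    # type 3: 0 only when g == 0 and j is even
  (1 - gamma) * x[j] + gamma * x[j + 1]
}

x <- c(10, 20, 30, 40, 50, 60)
p <- c(0.25, 0.5, 0.75)
for (type in 1:3) {
  print(rbind(by_hand  = q_discontinuous(x, p, type),
              built_in = quantile(x, p, type = type, names = FALSE)))
}
```

The values of p and n here were chosen so that np + m is exact in floating point; the real implementation adds a small fuzz term to guard against rounding.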

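For the continuous sample quantile types (4 through 9), the sample quantiles can be obtained by linear interpolation between the kth order statistic and p(k):

p(k) = (k - alpha) / (n - alpha - beta + 1), where alpha and beta are constants determined by the type. Further, m = alpha + p(1 - alpha - beta), and gamma = g.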

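This is really confusing. What the documentation calls p(k) is not the same as the p from before. p(k) is the plotting position. In the paper, the authors write it as p_k, which helps, especially since in the expression for m the p is the original p: m = alpha + p * (1 - alpha - beta). Conceptually, for algorithms 4-9, the points (p_k, x[k]) are interpolated to get the solution (p, Q[i](p)). The algorithms differ only in their choice of p_k.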
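To make the "interpolate the points (p_k, x[k])" picture concrete, here is a small sketch that does exactly that with approx() and checks the result against the built-in types 4 to 9. The names plotting_params and q_by_plotting_position are invented for this example; the (alpha, beta) pairs are the ones listed in the quantile.default source, and rule = 2 is an assumption chosen to mimic how R clamps to the smallest and largest observation outside the range of the p_k.

```r
# (alpha, beta) per continuous type, as listed in the quantile.default source
plotting_params <- list(`4` = c(0, 1),     `5` = c(0.5, 0.5), `6` = c(0, 0),
                        `7` = c(1, 1),     `8` = c(1/3, 1/3), `9` = c(3/8, 3/8))

# Hypothetical helper: linear interpolation through the points (p_k, x[k])
q_by_plotting_position <- function(x, p, type) {
  ab  <- plotting_params[[as.character(type)]]
  n   <- length(x)
  k   <- seq_len(n)
  p_k <- (k - ab[1]) / (n - ab[1] - ab[2] + 1)   # plotting positions
  # rule = 2: outside [p_1, p_n] return the nearest order statistic
  approx(p_k, sort(x), xout = p, rule = 2)$y
}

x <- c(2.1, 3.4, 5.0, 7.7, 9.3, 12.8, 20.5)
p <- c(0.1, 0.33, 0.5, 0.9)
for (type in 4:9) {
  print(all.equal(q_by_plotting_position(x, p, type),
                  quantile(x, p, type = type, names = FALSE)))
}
```

Working through the interpolation weight between p_j and p_(j+1) gives gamma = np + m - j with m = alpha + p(1 - alpha - beta), which is where the documentation's expression for m comes from.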

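As for the last bit, R is just stating what S uses.

The original paper gives a list of 6 "desirable properties for a sample quantile" function and states a preference for #8, which satisfies all but one of them. #5 satisfies all of them, but the authors don't like it on other grounds (it's more phenomenological than derived from principles). #2 is what non-stats geeks like me would think of as the quantile, and it is what's described on Wikipedia.

BTW, in response to another answer here: Mathematica does things significantly differently. I think I understand the mapping. While Mathematica's approach is easier to understand, (a) it's easier to shoot yourself in the foot with nonsensical parameters, and (b) it can't do R's algorithm #2. (MathWorld's Quantile page states that Mathematica can't do #2, but gives a simpler generalization of all the other algorithms in terms of four parameters.)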

Question 3

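There are various ways of computing quantiles when you only have a vector of observations and no known CDF.

Consider the question of what to do when your observations don't fall exactly on the quantiles.

The "types" just determine how to do that. So the methods say, "use a linear interpolation between the k-th order statistic and p(k)".

So, what's p(k)? One guy says, "well, I like to use k/n". Another guy says, "I like to use (k-1)/(n-1)", and so on. Each of these methods has different properties that make it better suited to one problem or another.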

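The \alpha's and \beta's are just ways to parameterize the functions p. In one case they're 1 and 1; in another, they're both 3/8. I don't think the p's are ever a constant in the documentation. They just don't always show the dependency explicitly.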
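As a quick illustration of how alpha and beta parameterize p(k), the sketch below evaluates the plotting positions for a sample of size 5. The function name plotting_position is made up for this example; the (alpha, beta) pairs are the ones that appear in the quantile.default source for types 4, 7 and 9.

```r
# Hypothetical helper: p(k) = (k - alpha) / (n - alpha - beta + 1)
plotting_position <- function(k, n, a, b) (k - a) / (n - a - b + 1)

n <- 5
k <- seq_len(n)
plotting_position(k, n, a = 0,   b = 1)    # reduces to k / n            (type 4)
plotting_position(k, n, a = 1,   b = 1)    # reduces to (k - 1)/(n - 1)  (type 7, the default)
plotting_position(k, n, a = 3/8, b = 3/8)  # (k - 3/8) / (n + 1/4)       (type 9)
```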

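See what happens with the different types when you put in vectors like 1:5 and 1:6.

(Also note that even if your observations fall exactly on the quantiles, certain types will still use linear interpolation.)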
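One way to run that experiment is to tabulate all nine types side by side. This is only a convenience sketch; compare_types and the choice of probabilities are not from the original answer.

```r
types <- 1:9
probs <- c(0.25, 0.5, 0.75)

# Hypothetical helper: one column per type, one row per probability
compare_types <- function(x) {
  out <- sapply(types, function(t) quantile(x, probs = probs, type = t, names = FALSE))
  dimnames(out) <- list(paste0("p=", probs), paste0("type", types))
  out
}

compare_types(1:5)  # odd sample size
compare_types(1:6)  # even sample size
```

Comparing the two tables row by row shows which types agree on the median and which start to differ once the target falls between two observations.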

Question 4

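I believe the R help documentation is clear after the revisions noted in @RobHyndman's comment, but I found it a bit overwhelming. I am posting this answer in case it helps someone move quickly through the options and their assumptions.

To get a grip on quantile(x, probs=probs), I wanted to check out the source code. This too was trickier than I expected in R, so I actually just grabbed it from a GitHub repo that looked recent enough to run with. I was interested in the default (type 7) behavior, so I annotated that part, but didn't do the same for each option.

You can see how the "type 7" method interpolates, step by step, both in the code itself and in a few lines I added to print some important values as it goes.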

quantile.default <-function(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE
         , type = 7, ...){
    if(is.factor(x)) { #worry about non-numeric data
        if(!is.ordered(x) || ! type %in% c(1L, 3L))
            stop("factors are not allowed")
        lx <- levels(x)
    } else lx <- NULL
    if (na.rm){
        x <- x[!is.na(x)]
    } else if (anyNA(x)){
        stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
        }
    eps <- 100*.Machine$double.eps #this is to deal with rounding things sensibly
    if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
        stop("'probs' outside [0,1]")

    #####################################
    # here is where terms really used in default type==7 situation get defined

    n <- length(x) #how many observations are in sample?

    if(na.p <- any(!p.ok)) { # set aside NA & NaN
        o.pr <- probs
        probs <- probs[p.ok]
        probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
    }

    np <- length(probs) #how many quantiles are you computing?

    if (n > 0 && np > 0) { #have positive observations and # quantiles to compute
        if(type == 7) { # be completely back-compatible

            index <- 1 + (n - 1) * probs #this gives the order statistic of the quantiles
            lo <- floor(index)  #this is the observed order statistic just below each quantile
            hi <- ceiling(index) #above
            x <- sort(x, partial = unique(c(lo, hi))) #the partial thing is to reduce time to sort, 
            #and it only guarantees that sorting is "right" at these order statistics, important for large vectors 
            #ties are not broken and tied elements just stay in their original order
            qs <- x[lo] #the values associated with the "floor" order statistics
            i <- which(index > lo) #which of the order statistics for the quantiles do not land on an order statistic for an observed value

            #this is the difference between the order statistic and the available ranks, i think
            h <- (index - lo)[i] # > 0  by construction 
            ##      qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
            ##      qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
            qs[i] <- (1 - h) * qs[i] + h * x[hi[i]] # This is the interpolation step: weight the lower value by (1 - h) and the upper value by h.
            # h is the arithmetic difference between the desired order statistic and the rank just below it
            # interpolation only occurs if the desired order statistic is not observed, e.g. the .5 quantile is the actual observed median if n is odd.
            # This means a more extreme largest observation (such as the 99 in x below) doesn't matter when computing the .75 quantile


            ###################################
            # print all of these things

            cat("floor pos=", c(lo))
            cat("\nceiling pos=", c(hi))
            cat("\nfloor values= ", c(x[lo]))
            cat( "\nwhich floors not targets? ", c(i))
            cat("\ninterpolate between ", c(x[lo[i]]), ";", c(x[hi[i]]))
            cat( "\nadjustment values= ", c(h))
            cat("\nquantile estimates:")

        } else {
            if (type <= 3) { ## Types 1, 2 and 3 are discontinuous sample qs.
                nppm <- if (type == 3) {
                    n * probs - .5 # n * probs + m; m = -0.5
                } else {
                    n * probs # m = 0
                }

                j <- floor(nppm)
                h <- switch(type,
                            (nppm > j),     # type 1
                            ((nppm > j) + 1)/2, # type 2
                            (nppm != j) | ((j %% 2L) == 1L)) # type 3

            } else {
                ## Types 4 through 9 are continuous sample qs.
                switch(type - 3,
                       {a <- 0; b <- 1},    # type 4
                       a <- b <- 0.5,   # type 5
                       a <- b <- 0,     # type 6
                       a <- b <- 1,     # type 7 (unused here)
                       a <- b <- 1 / 3, # type 8
                       a <- b <- 3 / 8) # type 9
                ## need to watch for rounding errors here
                fuzz <- 4 * .Machine$double.eps
                nppm <- a + probs * (n + 1 - a - b) # n*probs + m
                j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
                h <- nppm - j

                if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
            }

            ## everything below must run for types 1-3 as well as 4-9,
            ## otherwise qs is never assigned for the discontinuous types
            x <- sort(x, partial =
                          unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n))
            )
            x <- c(x[1L], x[1L], x, x[n], x[n])
            ## h can be zero or one (types 1 to 3), and infinities matter
            ####        qs <- (1 - h) * x[j + 2] + h * x[j + 3]
            ## also h*x might be invalid ... e.g. Dates and ordered factors
            qs <- x[j+2L]
            qs[h == 1] <- x[j+3L][h == 1]
            other <- (0 < h) & (h < 1)
            if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]

        }
    } else {
        qs <- rep(NA_real_, np)}

    if(is.character(lx)){
        qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)}
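    if(names && np > 0L) {
        # format_perc is an unexported helper in the stats namespace, so this
        # standalone copy of the function has to reach it with :::
        names(qs) <- stats:::format_perc(probs)
    }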
    if(na.p) { # do this more elegantly (?!)
        o.pr[p.ok] <- qs
        names(o.pr) <- rep("", length(o.pr)) # suppress <NA> names
        names(o.pr)[p.ok] <- names(qs)
        o.pr
    } else qs
}

####################

# fake data
x<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
y<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,9)
z<-c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7)

#quantiles "of interest"
probs<-c(0.5, 0.75, 0.95, 0.975)

# a tiny bit of illustrative behavior
quantile.default(x,probs=probs, names=FALSE)
quantile.default(y,probs=probs, names=FALSE) #only difference is the .975 quantile, since that is driven by the highest 2 observations
quantile.default(z,probs=probs, names=FALSE) # This shifts everything b/c now none of the quantiles fall on an observation (and of course the distribution changed...)... but
#the .75 quantile is still 5.0 b/c the observations just above and below the order statistic for that quantile are still 5. However, it got there for a different reason.

#how does rescaling affect quantile estimates?
sqrt(quantile.default(x^2, probs=probs, names=F))
exp(quantile.default(log(x), probs=probs, names=F))
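For readers who just want the type 7 arithmetic without the full function, here is a stripped-down sketch of the same steps (index, floor, ceiling, interpolate) applied to the x vector above and compared with the built-in quantile(). The variable names xs and by_hand are only for this illustration.

```r
x <- c(1,2,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,7,99)
probs <- c(0.5, 0.75, 0.95, 0.975)

n     <- length(x)
xs    <- sort(x)
index <- 1 + (n - 1) * probs   # fractional rank targeted by each probability
lo    <- floor(index)          # rank just below the target
hi    <- ceiling(index)        # rank just above the target
h     <- index - lo            # how far the target sits between the two ranks

by_hand <- (1 - h) * xs[lo] + h * xs[hi]
by_hand
all.equal(by_hand, quantile(x, probs, names = FALSE))
```

With n = 25 the targets are ranks 13, 19, 23.8 and 24.4, so only the last two involve interpolation, and only the .975 quantile touches the outlying value 99.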