Update: With the benefit of a few years' hindsight, I have written a more concise treatment of essentially the same material in answer to a similar question.
How to Construct a Confidence Region
Let's begin with a general method for constructing a confidence region. It can be applied to a single parameter, to yield a confidence interval or set of intervals; and it can be applied to two or more parameters, to yield higher-dimensional confidence regions.
We assert that the observed statistic D was drawn from a distribution with parameter θ, namely the sampling distribution s(d|θ) over possible statistics d, and we seek a confidence region for θ within the set Θ of its possible values. Define a highest density region (HDR): the h-HDR of a PDF is the smallest subset of its domain that supports probability h. Denote the h-HDR of s(d|ψ) by Hψ, for any ψ∈Θ. Then the h confidence region for θ, given the data D, is the set CD = {ϕ : D∈Hϕ}. A typical value of h is 0.95.
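As a concrete illustration of these definitions, here is a minimal numerical sketch, using a binomial sampling distribution for d (number of successes in n trials) with parameter θ = p. The helper names `hdr` and `confidence_region` are my own, not from the text; the HDR of a discrete distribution is built greedily by taking outcomes in order of decreasing probability.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial sampling distribution s(d|p) for d = k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hdr(n, p, h=0.95):
    """h-HDR of s(d|p): the smallest set of outcomes supporting
    probability >= h, built by taking outcomes in order of
    decreasing probability."""
    outcomes = sorted(range(n + 1), key=lambda k: -binom_pmf(k, n, p))
    chosen, mass = set(), 0.0
    for k in outcomes:
        chosen.add(k)
        mass += binom_pmf(k, n, p)
        if mass >= h:
            break
    return chosen

def confidence_region(D, n, h=0.95, grid=201):
    """C_D = {phi : D in H_phi}, scanning phi over a grid on [0, 1]."""
    return [i / (grid - 1) for i in range(grid)
            if D in hdr(n, i / (grid - 1), h)]

# Observed statistic: D = 7 successes in n = 20 trials.
region = confidence_region(D=7, n=20, h=0.95)
print(min(region), max(region))  # endpoints of the 0.95 confidence region
```

The grid scan makes the defining membership test D∈Hϕ explicit: each candidate ϕ is kept or rejected purely by whether the observed D falls in that ϕ's HDR.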
Frequentist Interpretation
From the preceding definition of a confidence region, it follows that
d∈Hψ⟷ψ∈Cd
with Cd = {ϕ : d∈Hϕ}. Now imagine a large number of (imaginary) observations {Di}, taken in circumstances similar to D, i.e. they are samples from s(d|θ). Since Hθ supports probability mass h of the PDF s(d|θ), we have P(Di∈Hθ) = h for all i. Therefore the fraction of {Di} for which Di∈Hθ is h; and so, by the equivalence above, the fraction of {Di} for which θ∈CDi is also h.
This, then, is what the frequentist's claim of an h confidence region for θ amounts to:
Take a large number of imaginary observations {Di} from the sampling distribution s(d|θ) that gave rise to the observed statistic D. Then θ lies within a fraction h of the analogous but imaginary confidence regions {CDi}.
The confidence region CD therefore makes no claim about the probability that θ lies anywhere! The reason is simply that there is nothing in the formulation that allows us to speak of a probability distribution over θ. The interpretation is just elaborate superstructure, which does not improve the base. The base is only s(d|θ) and D, in which θ does not appear as a distributed quantity, and there is no information we can use to address that. There are basically two ways to obtain a distribution over θ:
- Assign a distribution directly from the information at hand: p(θ|I).
- Relate θ to another distributed quantity: p(θ|I) = ∫p(θx|I)dx = ∫p(θ|xI)p(x|I)dx.
In both cases, θ must appear on the left somewhere. Frequentists can use neither method, because they both require a heretical prior.
Bayesian View
The most that a Bayesian can make of the h confidence region CD, given without qualification, is simply the direct interpretation: that it is the set of ϕ for which D falls in the h-HDR Hϕ of the sampling distribution s(d|ϕ). It does not necessarily tell us much about θ, and here's why.
The probability that θ∈CD, given D and the background information I, is:
P(θ∈CD|DI) = ∫CD p(θ|DI) dθ = ∫CD p(D|θI) p(θ|I) / p(D|I) dθ
Notice that, unlike in the frequentist interpretation, we have immediately demanded a distribution over θ. The background information I tells us, as before, that the sampling distribution is s(d|θ):
P(θ∈CD|DI) = ∫CD s(D|θ) p(θ|I) / p(D|I) dθ = ∫CD s(D|θ) p(θ|I) dθ / p(D|I)
i.e. P(θ∈CD|DI) = ∫CD s(D|θ) p(θ|I) dθ / ∫ s(D|θ) p(θ|I) dθ
Now this expression does not in general evaluate to h, which is to say the h confidence region CD does not always contain θ with probability h. In fact it can differ starkly from h. There are, however, many common situations in which it does evaluate to h, which is why confidence regions often accord with our probabilistic intuitions.
For example, suppose that the prior joint PDF of d and θ is symmetric, in that pd,θ(d,θ|I) = pd,θ(θ,d|I). (Clearly this involves an assumption that the PDF ranges over the same domain in d and θ.) Then, if the prior is p(θ|I) = f(θ), we have s(D|θ)f(θ) = s(θ|D)f(D). Hence
P(θ∈CD|DI) = ∫CD s(θ|D) dθ / ∫ s(θ|D) dθ
i.e. P(θ∈CD|DI) = ∫CD s(θ|D) dθ
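This equality can be checked numerically. A sketch, under the illustrative assumption of Normal(θ, 1) sampling with a flat prior, in which case s(θ|D) is Normal(D, 1) and CD is the symmetric interval D ± z:

```python
from statistics import NormalDist

# Numerical check of P(theta in C_D | D I) = integral over C_D of s(theta|D),
# assuming Normal(theta, 1) sampling and a flat prior (so the posterior
# density equals s(theta|D) = Normal(D, 1)).
D, h = 2.5, 0.95
z = NormalDist().inv_cdf(0.5 + h / 2)    # C_D = [D - z, D + z]
post = NormalDist(mu=D, sigma=1.0)       # s(theta|D) under the flat prior

prob = post.cdf(D + z) - post.cdf(D - z) # posterior mass over C_D
print(round(prob, 4))  # 0.95
```

The posterior mass over the confidence region comes out to h exactly, anticipating the general result derived below for symmetric sampling distributions with uniform priors.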
From the definition of an HDR we know that for any ψ∈Θ
∫Hψ s(d|ψ) dd = h
and therefore that ∫HD s(d|D) dd = h
or equivalently ∫HD s(θ|D) dθ = h
Therefore, given that s(d|θ)f(θ)=s(θ|d)f(d), CD=HD implies P(θ∈CD|DI)=h. The antecedent satisfies
CD=HD⟷∀ψ[ψ∈CD↔ψ∈HD]
Applying the equivalence near the top:
CD=HD⟷∀ψ[D∈Hψ↔ψ∈HD]
Thus, the confidence region CD contains θ with probability h if for all possible values ψ of θ, the h-HDR of s(d|ψ) contains D if and only if the h-HDR of s(d|D) contains ψ.
Now the symmetric relation D∈Hψ↔ψ∈HD is satisfied for all ψ when s(ψ+δ|ψ)=s(D−δ|D) for all δ that span the support of s(d|D) and s(d|ψ). We can therefore form the following argument:
- s(d|θ)f(θ)=s(θ|d)f(d) (premise)
- ∀ψ∀δ[s(ψ+δ|ψ)=s(D−δ|D)] (premise)
- ∀ψ∀δ[s(ψ+δ|ψ)=s(D−δ|D)]⟶∀ψ[D∈Hψ↔ψ∈HD]
- ∴∀ψ[D∈Hψ↔ψ∈HD]
- ∀ψ[D∈Hψ↔ψ∈HD]⟶CD=HD
- ∴CD=HD
- [s(d|θ)f(θ)=s(θ|d)f(d)∧CD=HD]⟶P(θ∈CD|DI)=h
- ∴P(θ∈CD|DI)=h
Let's apply the argument to a confidence interval on the mean of a 1-D normal distribution (μ,σ), given a sample mean x¯ from n measurements. We have θ=μ and d=x¯, so that the sampling distribution is
s(d|θ) = √n/(σ√(2π)) · exp(−n(d−θ)²/(2σ²))
Suppose also that we know nothing about θ before taking the data (except that it's a location parameter) and therefore assign a uniform prior: f(θ)=k. Clearly we now have s(d|θ)f(θ)=s(θ|d)f(d), so the first premise is satisfied. Let s(d|θ)=g((d−θ)2). (i.e. It can be written in that form.) Then
s(ψ+δ|ψ) = g((ψ+δ−ψ)²) = g(δ²)
and s(D−δ|D) = g((D−δ−D)²) = g(δ²)
so that ∀ψ∀δ[s(ψ+δ|ψ) = s(D−δ|D)]
whereupon the second premise is satisfied. Both premises being true, the eight-point argument leads us to conclude that the probability that θ lies in the confidence interval CD is h!
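A sketch of this worked example, constructing CD in two ways: directly from its definition, by scanning candidate θ values whose h-HDR of s(d|θ) contains x̄, and as the h-HDR HD of the flat-prior posterior Normal(x̄, σ/√n). The numbers n, σ, x̄ are arbitrary illustrative values.

```python
from statistics import NormalDist

# Worked example: h confidence interval on a normal mean, sigma known,
# d = sample mean xbar, so s(d|theta) = Normal(theta, sigma/sqrt(n)).
n, sigma, xbar, h = 25, 2.0, 10.4, 0.95
se = sigma / n ** 0.5
z = NormalDist().inv_cdf(0.5 + h / 2)

# (1) C_D by definition: theta is kept iff xbar falls in H_theta,
#     the symmetric interval theta +/- z*se.
grid = [xbar - 4 * se + i * 8 * se / 4000 for i in range(4001)]
C_D = [t for t in grid if abs(xbar - t) <= z * se]

# (2) H_D: the h-HDR of the flat-prior posterior Normal(xbar, se).
H_D = (xbar - z * se, xbar + z * se)

print(round(min(C_D), 3), round(max(C_D), 3))
print(round(H_D[0], 3), round(H_D[1], 3))
# endpoints agree to within the grid spacing: C_D = H_D
```

The two constructions give the same interval, which is the content of the conclusion CD = HD: here the "confidence interval" and the flat-prior credible interval coincide.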
We therefore have an amusing irony:
- The frequentist who assigns the h confidence interval cannot say that P(θ∈CD)=h, no matter how innocently uniform θ looks before incorporating the data.
- The Bayesian who would not assign an h confidence interval in that way knows anyhow that P(θ∈CD|DI)=h.
Final Remarks
We have identified conditions (i.e. the two premises) under which the h confidence region does indeed yield probability h that θ∈CD. A frequentist will baulk at the first premise, because it involves a prior on θ, and this sort of deal-breaker is inescapable on the route to a probability. But for a Bayesian, it is acceptable---nay, essential. These conditions are sufficient but not necessary, so there are many other circumstances under which the Bayesian P(θ∈CD|DI) equals h. Equally though, there are many circumstances in which P(θ∈CD|DI)≠h, especially when the prior information is significant.
We have applied a Bayesian analysis just as a consistent Bayesian would, given the information at hand, including statistics D. But a Bayesian, if he possibly can, will apply his methods to the raw measurements instead---to the {xi}, rather than x¯. Oftentimes, collapsing the raw data into summary statistics D destroys information in the data; and then the summary statistics are incapable of speaking as eloquently as the original data about the parameters θ.