Answers:
Many frequentist confidence intervals (CIs) are based on the likelihood function. If the prior distribution is truly noninformative, the Bayesian posterior carries essentially the same information as the likelihood function. Consequently, in practice, a Bayesian probability interval (or credible interval) can be numerically very similar to a frequentist confidence interval. [Of course, even when the two are numerically similar, there remain philosophical differences in interpretation between frequentist and Bayesian interval estimates.]
Here is a simple example, estimating a binomial success probability $\theta$. Suppose we have $n = 100$ observations (trials) with $x = 73$ successes.
Frequentist: The traditional Wald interval uses the point estimate $\hat\theta = x/n = 73/100 = 0.73$, and a 95% CI is of the form $\hat\theta \pm 1.96\sqrt{\hat\theta(1-\hat\theta)/n}$.
n = 100; x = 73; th.w = x/n; pm = c(-1,1)
ci.w = th.w + pm*1.96*sqrt(th.w*(1-th.w)/n); ci.w
[1] 0.6429839 0.8170161
This form of CI assumes that the relevant binomial distributions can be approximated by normal ones and that the margin of error is well approximated by $1.96\sqrt{\hat\theta(1-\hat\theta)/n}$. Particularly for small $n$, these assumptions need not be true. [The cases where $x = 0$ or $x = n$ are especially problematic.]
Agresti-Coull: The Agresti-Coull interval uses the point estimate $\tilde\theta = (x+2)/\tilde n$, where $\tilde n = n + 4$. Then a 95% CI is of the form $\tilde\theta \pm 1.96\sqrt{\tilde\theta(1-\tilde\theta)/\tilde n}$.
n.a = n + 4; th.a = (x + 2)/n.a
ci.a = th.a + pm*1.96*sqrt(th.a*(1-th.a)/n.a); ci.a
[1] 0.6349681 0.8073396
Bayesian: With a flat $\mathsf{Unif}(0,1) = \mathsf{Beta}(1,1)$ prior, the likelihood function is proportional to $\theta^x(1-\theta)^{n-x} = \theta^{73}(1-\theta)^{27}$. Multiplying the kernels of the prior and the likelihood, we have the kernel of the posterior distribution, which is $\mathsf{Beta}(74, 28)$.
Then a 95% Bayesian interval estimate uses quantiles 0.025 and 0.975 of the posterior distribution to get $(0.6354, 0.8072)$. When the prior distribution is 'flat' or 'noninformative', the numerical difference between the Bayesian probability interval and the Agresti-Coull confidence interval is slight.
qbeta(c(.025, .975), 74, 28)
[1] 0.6353758 0.8072313
Notes: (a) In this situation, some Bayesians prefer the noninformative Jeffreys prior $\mathsf{Beta}(\tfrac12, \tfrac12)$. (b) For confidence levels other than 95%, the Agresti-Coull CI uses a slightly different point estimate. (c) For data other than binomial, there may be no available 'flat' prior, but one can choose a prior with a huge variance (small precision) that carries very little information. (d) For more discussion of Agresti-Coull CIs, graphs of coverage probabilities, and some references, perhaps also see this Q & A.
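For note (a), a quick sketch (mine, not from the answer above) shows that the Jeffreys-prior interval is also numerically close in this example; the posterior under $\mathsf{Beta}(\tfrac12, \tfrac12)$ is $\mathsf{Beta}(x + \tfrac12,\ n - x + \tfrac12)$:
# posterior under the Jeffreys prior is Beta(73.5, 27.5) for x = 73, n = 100
qbeta(c(.025, .975), 73.5, 27.5)   # close to both intervals above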
BruceET's answer is excellent but pretty long, so here's a quick practical summary:
While one can solve for a prior that yields a credible interval that equals the frequentist confidence interval, it is important to realize how narrow the scope of application is. The entire discussion is assuming that the sample size was fixed and is not a random variable. It assumes that there was only one look at the data, and that sequential inference was not done. It assumes there was only one dependent variable and no other parameters were of interest. Where there are multiplicities, the Bayesian and frequentist intervals diverge (Bayesian posterior probabilities are in forward-time predictive mode and don't need to consider "how we got here", thus have no way or need to adjust for multiple looks). In addition, in the frequentist world the interpretation of confidence intervals is extremely strange and has confused many a student and caused some frequentist statisticians to become Bayesian.
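As a quick illustration of the multiple-looks point (a simulation sketch of my own, not part of the summary above): under a frequentist analysis, testing repeatedly as data accumulate inflates the type I error far beyond the nominal 5%, which is why frequentist intervals must be adjusted for sequential inference while posterior probabilities need no such adjustment.
# simulate 10000 null experiments, each tested after every observation
# from n = 20 up to n = 100; reject if any interim z-statistic exceeds 1.96
set.seed(123)
inflated = replicate(10000, {
  z = cumsum(rnorm(100)) / sqrt(1:100)   # running z-statistics under H0
  any(abs(z[20:100]) > 1.96)             # did any look cross the 5% boundary?
})
mean(inflated)   # far above 0.05, even though each single look is a valid 5% test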
The likelihood function, and the associated confidence interval, are not the same (concept) as a Bayesian posterior probability constructed with a prior that specifies a uniform distribution.
In parts 1 and 2 of this answer it is argued why likelihood should not be viewed as a Bayesian posterior probability based on a flat prior.
In part 3 an example is given where the confidence interval and the credible interval differ widely. It is also pointed out how this discrepancy arises.
Probabilities transform in a particular way. If we know the probability distribution $f_X(x)$, then we also know the distribution $f_Y(y)$ of the variable $Y$ defined by $y = g(x)$ for any monotonic function $g$, according to the transformation rule: $$f_Y(y) = f_X\big(g^{-1}(y)\big)\,\left|\frac{d\,g^{-1}(y)}{dy}\right|$$
If you transform a variable, then the mean and the mode may shift due to this change of the distribution function. That means, in general, $E[Y] \neq g(E[X])$ and $\operatorname{mode}(Y) \neq g(\operatorname{mode}(X))$.
The likelihood function does not transform in this way. This is the contrast between the likelihood function and the posterior probability: the (maximum of the) likelihood function remains at the same point when you transform the variable.
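A small R sketch (my own, reusing the binomial example from the first answer: $x = 73$ successes in $n = 100$ trials) makes the contrast concrete: transforming the parameter moves the posterior mode, but not the maximum of the likelihood.
# posterior of theta under a flat prior is Beta(74, 28), with mode 73/100 = 0.73
# transform phi = theta^2; the density of phi picks up a Jacobian 1/(2*sqrt(phi))
post.phi = function(phi) dbeta(sqrt(phi), 74, 28) / (2*sqrt(phi))
optimize(post.phi, c(0.01, 0.99), maximum = TRUE)$maximum  # not equal to 0.73^2
# the likelihood of phi needs no Jacobian; its maximum simply transforms along:
lik.phi = function(phi) dbinom(73, 100, sqrt(phi))
optimize(lik.phi, c(0.01, 0.99), maximum = TRUE)$maximum   # = 0.73^2 = 0.5329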
Related:
The flat prior is ambiguous. It depends on the form of the particular statistic.
For instance, if $\theta$ is uniformly distributed (e.g. $\theta \sim \mathsf{U}(0,1)$), then $\theta^2$ is not a uniformly distributed variable.
There is no single flat prior that you can relate the likelihood function to. It is different when you define the flat prior for $\theta$ or for some transformed variable like $\theta^2$. For the likelihood this dependency does not exist (a short sketch after the next point illustrates the ambiguity).
The boundaries of probability intervals (credible intervals) will be different when you transform the variable (for likelihood functions this is not the case). E.g. for some parameter $a$ and a monotonic transformation $f(a)$ (e.g. the logarithm) you get equivalent likelihood intervals: $$a_{\min} < a < a_{\max} \quad\Longleftrightarrow\quad f(a_{\min}) < f(a) < f(a_{\max})$$
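A one-line sketch (mine) of the flat-prior ambiguity: draws that are 'flat' on one scale are not flat on a transformed scale.
theta = runif(1e5)    # a 'flat' prior on theta
hist(theta^2)         # the implied prior on theta^2 is far from flat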
Suppose you sample a variable $x$ from a population with (unknown) parameter $\theta$, which itself (the population with parameter $\theta$) is sampled from a super-population (with possibly varying values for $\theta$).
One can make an inverse statement, trying to infer what the original $\theta$ may have been, based on observing some values $x$ for the variable.
The confidence interval does not use information from a prior like the credible interval does (confidence is not a probability).
Regardless of the prior distribution (uniform or not), the x%-confidence interval will contain the true parameter in $x\%$ of the cases (confidence intervals refer to the success rate, the type I error, of the method, not of a particular case).
In the case of the credible interval this concept ($x\%$ of the time that the interval contains the true parameter) is not even applicable, but we may interpret it in a frequentist sense, and then we observe that the credible interval will contain the true parameter $x\%$ of the time only when the (uniform) prior correctly describes the super-population of parameters that we may encounter. The interval may effectively be performing higher or lower than the x% (not that this matters, since the Bayesian approach answers different questions, but it is just to note the difference).
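The contrast can be checked by simulation. Below is a small sketch (my own construction, using exponential data so that both intervals have closed forms; the Gamma(2, 2) super-population is an arbitrary choice that deliberately differs from the flat prior):
set.seed(1)
n = 10
covers = replicate(10000, {
  lambda = rgamma(1, shape = 2, rate = 2)          # draw the true parameter
  xbar = mean(rexp(n, rate = lambda))              # observe a sample mean
  ci = qchisq(c(.025, .975), 2*n) / (2*n*xbar)     # exact 95% confidence interval
  cred = qgamma(c(.025, .975), shape = n+1, rate = n*xbar)  # flat-prior credible interval
  c(ci[1] < lambda & lambda < ci[2],
    cred[1] < lambda & lambda < cred[2])
})
rowMeans(covers)  # the CI covers ~95% regardless of the super-population;
                  # the flat-prior credible interval need not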
In the example below we examine the likelihood function for the exponential distribution as a function of the rate parameter $\lambda$, the sample mean $\bar{x}$, and sample size $n$: $$\mathcal{L}(\lambda, \bar{x}, n) = \frac{(n\lambda)^n}{(n-1)!}\,\bar{x}^{\,n-1}\, e^{-\lambda n \bar{x}}$$
this function expresses the probability density to observe (for a given $\lambda$ and $n$) a sample mean between $\bar{x}$ and $\bar{x}+d\bar{x}$.
note: the rate parameter $\lambda$ goes from $0$ to $\infty$ (unlike the OP's 'request' for a parameter that goes from $0$ to $1$). The prior in this case will be an improper prior. The principles, however, do not change. I am using this perspective for easier illustration: distributions with parameters between $0$ and $1$ are often discrete distributions (difficult to draw as continuous lines) or a beta distribution (difficult to calculate).
The image below illustrates this likelihood function (the blue colored map) for a given sample size, and also draws the boundaries for the 95% intervals (both confidence and credible).
The boundaries are created by obtaining the (one-dimensional) cumulative distribution function. But this integration/cumulation can be done in two directions.
The difference between the intervals occurs because the 5% areas are made in different ways.
The 95% confidence interval contains values of $\lambda$ for which the observed $\bar{x}$ would occur in at least 95% of the cases. In this way, whatever the true value of $\lambda$, we would make a wrong judgement in at most 5% of the cases.
For any $\lambda$, 2.5% of the weight of the likelihood function lies north and 2.5% south of the boundaries (moving in the $\bar{x}$ direction).
The 95% credible interval contains values of $\lambda$ which are most likely to have caused the observed $\bar{x}$ (given a flat prior).
Even when the observed result $\bar{x}$ is less than 5% likely for a given $\lambda$, that particular $\lambda$ may be inside the credible interval. In this particular example, higher values of $\lambda$ are 'preferred' by the credible interval.
For any $\bar{x}$, 2.5% of the weight of the likelihood function lies west and 2.5% east of the boundaries (moving in the $\lambda$ direction).
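To make the two directions concrete, here is a small numeric sketch (my own; the values n = 5 and observed mean xbar = 1 are arbitrary). The confidence bounds come from integrating in the $\bar{x}$-direction via the exact pivot $2n\lambda\bar{x} \sim \chi^2_{2n}$; the credible bounds come from integrating the flat-prior posterior, which is $\mathsf{Gamma}(n+1,\ n\bar{x})$, in the $\lambda$-direction:
n = 5; xbar = 1
qchisq(c(.025, .975), df = 2*n) / (2*n*xbar)       # 95% confidence interval for lambda
qgamma(c(.025, .975), shape = n+1, rate = n*xbar)  # 95% credible interval (flat prior)
# the credible interval sits at higher lambda values, as described above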
A case where the confidence interval and the credible interval (based on an improper prior) coincide is the estimation of the mean of a Gaussian distributed variable (the distribution is illustrated here: https://stats.stackexchange.com/a/351333/164061 ).
An obvious case where the confidence interval and the credible interval do not coincide is illustrated here (https://stats.stackexchange.com/a/369909/164061). The confidence interval for that case may have one or even both of its (upper/lower) bounds at infinity.
This is not generally true, but it may seem so because of the most frequently considered special cases.
Consider $X_1, X_2 \overset{iid}{\sim} \mathsf{U}(\theta - \tfrac12,\ \theta + \tfrac12)$. The interval $\big(\min(X_1, X_2),\ \max(X_1, X_2)\big)$ is a 50% confidence interval for $\theta$, albeit not one that anyone with any common sense would use. It does not coincide with the 50% credible interval from the posterior from a flat prior.
Fisher's technique of conditioning on an ancillary statistic does in this case yield a confidence interval that coincides with that credible interval.
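A quick simulation (a sketch of mine, under the uniform example as written above) confirms the 50% coverage of that interval:
set.seed(1)
theta = 0   # arbitrary true value
hits = replicate(100000, {
  x = runif(2, theta - 0.5, theta + 0.5)
  min(x) < theta & theta < max(x)
})
mean(hits)   # approximately 0.5: a valid, if useless, 50% confidence interval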
From my reading, I thought this statement is true asymptotically, i.e. for large sample size, and if one uses an uninformative prior.
A simple numerical example would seem to confirm this: the 90% profile maximum likelihood intervals and 90% credible intervals of an ML binomial GLM and a Bayesian binomial GLM are indeed virtually identical for n=1000, though the discrepancy would become larger for small n:
# simulate some data
set.seed(123)
n = 1000 # sample size
x1 = rnorm(n) # two continuous covariates
x2 = rnorm(n)
z = 0.1 + 2*x1 + 3*x2 # predicted values on logit scale
y = rbinom(n,1,plogis(z)) # bernoulli response variable
d = data.frame(y=y, x1=x1, x2=x2)
# fit a regular GLM and calculate 90% confidence intervals
glmfit = glm(y ~ x1 + x2, family = "binomial", data = d)
library(MASS)
# coefficients and 90% profile confidence intervals :
round(cbind(coef(glmfit), confint(glmfit, level=0.9)), 2)
# 5 % 95 %
# (Intercept) 0.00 -0.18 0.17
# x1 2.04 1.77 2.34
# x2 3.42 3.05 3.81
# fit a Bayesian GLM using rstanarm
library(rstanarm)
t_prior = student_t(df = 3, location = 0, scale = 100) # we set scale to large value to specify an uninformative prior
bfit1 = stan_glm(y ~ x1 + x2, data = d,
family = binomial(link = "logit"),
prior = t_prior, prior_intercept = t_prior,
chains = 1, cores = 4, seed = 123, iter = 10000)
# coefficients and 90% credible intervals :
round(cbind(coef(bfit1), posterior_interval(bfit1, prob = 0.9)), 2)
# 5% 95%
# (Intercept) -0.01 -0.18 0.17
# x1 2.06 1.79 2.37
# x2 3.45 3.07 3.85
# fit a Bayesian GLM using brms
library(brms)
priors = c(
prior(student_t(3, 0, 100), class = "Intercept"),
prior(student_t(3, 0, 100), class = "b")
)
bfit2 = brm(
y ~ x1 + x2,
data = d,
prior = priors,
family = "bernoulli",
seed = 123
)
# coefficients and 90% credible intervals :
summary(bfit2, prob=0.9)
# Population-Level Effects:
# Estimate Est.Error l-90% CI u-90% CI Eff.Sample Rhat
# Intercept -0.01 0.11 -0.18 0.18 2595 1.00
# x1 2.06 0.17 1.79 2.35 2492 1.00
# x2 3.45 0.23 3.07 3.83 2594 1.00
# fit a Bayesian GLM using arm
library(arm)
# we set prior.scale to Inf to specify an uninformative prior
bfit3 = bayesglm(y ~ x1 + x2, family = "binomial", data = d, prior.scale = Inf)
sims = coef(sim(bfit3, n.sims=1000000))
# coefficients and 90% credible intervals :
round(cbind(coef(bfit3), t(apply(sims, 2, function (col) quantile(col,c(.05, .95))))),2)
# 5% 95%
# (Intercept) 0.00 -0.18 0.17
# x1 2.04 1.76 2.33
# x2 3.42 3.03 3.80
As you can see in the example above, for n=1000 the 90% profile confidence intervals of a binomial GLM are virtually identical to the 90% credible intervals of a Bayesian binomial GLM (the differences are also within the bounds of using different seeds and different numbers of iterations in the Bayesian fits; an exact equivalence can also not be obtained, since specifying a 100% uninformative prior is not possible with rstanarm or brms).