Example where the likelihood principle "really" matters?


20

Is there an example where two different defensible tests with proportional likelihoods would lead one to markedly different (and equally defensible) inferences, for instance, where the p-values are an order of magnitude apart, but the power against the alternatives is similar?

All the examples I have seen are very silly: comparing a binomial with a negative binomial, where the p-value of the first is 7% and of the second 3%, which are "different" only insofar as one makes binary decisions against an arbitrary significance threshold such as 5% (which, by the way, is a pretty low standard of inference) and does not even bother to look at the power. If I change the threshold to 1%, for instance, both lead to the same conclusion.
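
For concreteness, the textbook numbers behind this comparison (assuming the standard setup: 9 heads and 3 tails observed, testing p = 1/2 against p > 1/2) can be reproduced in a couple of lines of R:

# Binomial design: n = 12 flips fixed, 9 heads observed
1 - pbinom(8, size = 12, prob = 0.5)   # P(X >= 9) = 0.073
# Negative binomial design: flip until the 3rd tail, 9 heads observed
1 - pnbinom(8, size = 3, prob = 0.5)   # P(>= 9 heads before 3rd tail) = 0.033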

I have never seen an example where it leads to markedly different and defensible inferences. Is there such an example?

I ask because I have seen so much ink spent on this topic, as if the Likelihood Principle were something fundamental in the foundations of statistical inference. But if the best available examples are silly ones like the example above, the principle seems completely inconsequential.

So I am looking for a very compelling example where, if one does not follow the LP, the weight of evidence would overwhelmingly point in one direction given one test but, in a different test with proportional likelihood, would overwhelmingly point in the opposite direction, with both conclusions looking sensible.

Ideally, one could demonstrate that we can have arbitrarily far apart, yet sensible, answers, such as tests with p = 0.1 versus p = 10^-10, with proportional likelihoods and equivalent power to detect the same alternative.

PS: Bruce's answer does not address the question at all.


5
In significance testing, one can always change the decision by changing the threshold. Could you therefore explain what you mean by "markedly", "silly", or "compelling"? BTW, you seem to be reading the Wikipedia article.
whuber

2
Welcome to CV, @statslearner. Could you give an example of one or more specific methods of inference that do not use the likelihood principle, which you would like contrasted?
Alexis

1
Ideally, @whuber, I would like to see that you can construct arbitrarily different answers, something like p = 0.5 vs p = 10^-5 if you want to use p-values, with both computations still seeming defensible.
statslearner2

3
I cannot follow that comment, because p = 10^-5 makes no sense. In any case, have you considered simply changing the numbers given in the Wikipedia example?
whuber

6
The significant difference, in practical terms, is the handling of stopping rules: under the LP they do not matter, while outside the LP they do. See Berger & Wolpert (1987) for details.
Xi'an

Answers:


7

Consider a hypothetical situation where a point null hypothesis is true, but one keeps sampling until p < 0.05 (this will always happen sooner or later, i.e., it happens with probability 1) and then decides to stop the trial and reject the null. This is an admittedly extreme stopping rule, but consider it for the sake of the argument.

This moronic procedure will have a 100% Type I error rate, yet there is nothing wrong with it according to the likelihood principle.

I would say this counts as "really" mattering. You can, of course, choose any α in this argument. Bayesians can use a fixed cutoff on the Bayes factor if they prefer; the same logic applies. The main lesson here is that you cannot adhere to the LP *and* have an error-rate guarantee. There is no free lunch.
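
As a rough illustration (a simulation sketch of my own, capped at n_max observations per run, so the observed rate is only a lower bound on the eventual 100%):

# Sample N(0,1) data (so the point null mu = 0 is true) and monitor a
# two-sided z-test after every observation from n = 10 onwards, stopping
# as soon as p < 0.05. By the law of the iterated logarithm the boundary
# is eventually crossed with probability 1; the cap keeps the run finite.
one_run <- function(n_max = 5000) {
  s <- cumsum(rnorm(n_max))          # running sums of the null data
  n <- 10:n_max                      # interim looks
  z <- s[n] / sqrt(n)                # z-statistic at each look
  any(2 * pnorm(-abs(z)) < 0.05)     # TRUE if we ever stop and "reject"
}
set.seed(1)
mean(replicate(2000, one_run()))     # far above the nominal 5%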


4
I was thinking of this example as well, but I did not mention it because it is indeed quite trivial. Yet it is, in fact, what happens in practice in an indirect and informal way.
Sextus Empiricus

1
What are the two statistics and their likelihoods in your example? In the neg. binomial vs. binomial case: 1) statistic 1 is the number of trials until 3 heads, and its likelihood is negative binomial; 2) statistic 2 is the number of heads in n trials, and its likelihood is binomial. In your example I do not see what the two statistics are, nor whether they have proportional likelihoods.
statslearner2

1
In your example the statistic would be something like "the number of trials until p < 0.05", and I doubt it is proportional to the binomial, so I am not sure your example works, amoeba.
statslearner2

1
I don't think the likelihood principle says "there is nothing wrong with it." The likelihood principle can still filter out bad procedures: the fact that a procedure does not violate the likelihood principle is different from that procedure being endorsed by the likelihood principle. A Bayesian analysis of this sequential testing problem (which of course complies with the likelihood principle) has very good properties, in that it does not realize the "moronic" procedure you describe.
guy

3
@amoeba Consider testing θ = 0 under the null against θ ∼ N(0, τ^-1) under the alternative, with Yi ∼ N(θ, 1). It is easy to show that the log of the Bayes factor is roughly (1/2)[log(τ/n) + Z_n^2], where Z_n is the usual z-test statistic. Rejecting when the Bayes factor exceeds 1 then amounts to |Z_n| > O(√(log n)). Under the null, this is not guaranteed to happen in a sequential testing setting (see the law of the iterated logarithm); hence the Bayesian method does not fall victim to the problem you describe.
guy

4

Disclaimer: I believe this answer is at the core of the entire debate, so it is worth discussing, but I have not fully explored the issue. As such, I welcome corrections, refinements, and comments.

The most important aspect concerns sequentially collected data. Suppose, for example, that you observe binary outcomes and you see 10 successes and 5 failures. The likelihood principle says that you should come to the same conclusion about the probability of success, regardless of whether you collected data until you had 10 successes (negative binomial) or ran 15 trials, of which 10 were successes (binomial).

Why is this important?

Because according to the likelihood principle (or at least, a certain interpretation of it), it is perfectly fine to let the data influence when you stop collecting data, without having to alter your inference tools.

Conflict with sequential methods

The idea of using the data to decide when to stop collecting data, without altering our inference tools, is completely antithetical to traditional sequential analysis methods. The classic example is the methodology of clinical trials. To reduce potential exposure to harmful treatments, the data are often analyzed at intermediate times before the trial is complete. If the trial is not finished, but the researchers already have enough data to conclude that the treatment is effective or harmful, medical ethics tells us the trial should be stopped: if the treatment is effective, it is ethical to stop the trial and begin making it available to non-trial patients; if it is harmful, it is ethical to stop so that trial patients are no longer exposed to a harmful treatment.

The problem is that we have now started performing multiple comparisons, so we inflate the Type I error rate unless we adjust our methods to account for them. This is not quite the same as the traditional multiple comparison problem, since it is really multiple partial comparisons (for instance, if we analyze the data once when 50% of it is collected and once when 100% is collected, the two samples are obviously not independent!), but in general the more comparisons we make, the more we must raise our criteria for rejecting the null hypothesis in order to preserve the Type I error rate: planning more comparisons requires more evidence to reject the null.

This places clinical researchers in a dilemma: do you check your data frequently, but then raise the evidence required to reject the null, or do you check your data infrequently, maximizing your power but possibly failing to act optimally with regard to medical ethics (e.g., delaying the product to market, or exposing patients unnecessarily long to a harmful treatment)?

It is my (perhaps mistaken) understanding that the likelihood principle appears to tell us that no matter how many times we check the data, we should make the same inference. This essentially says that all the methodology of sequential trial design is unnecessary: just use the likelihood principle and stop once you have collected enough data to reach a conclusion. Since you do not need to alter your inference methods to adjust for the number of analyses you have planned, there is no dilemma of trading off the number of checks against power. Bam, the entire field of sequential analysis is solved (according to this interpretation).

Personally, what I find very confusing about this is a fact that is well known in the sequential design field, yet fairly subtle: the likelihood of the final test statistic is greatly altered by the stopping rule; essentially, the stopping rule increases the probability in a discontinuous manner at the stopping points. [The original answer includes a plot of this distortion, not reproduced here.] The dashed line is the PDF of the final test statistic under the null if the data are analyzed only once all of them are collected, while the solid line gives the distribution of the test statistic under the null if the data are checked 4 times under a given rule.
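
Since the plot is not reproduced here, a small simulation sketch (my own, with an arbitrary 4-look rule at a fixed ±1.96 boundary) shows the same distortion numerically:

# Final z-statistic under a naive 4-look rule (stop early whenever
# |z| > 1.96) versus a single fixed-n analysis; the null is true throughout.
set.seed(1)
looks <- c(25, 50, 75, 100)
final_z <- function() {
  x <- rnorm(max(looks))                   # all potential observations
  for (n in looks) {
    z <- sum(x[1:n]) / sqrt(n)             # interim z-statistic
    if (abs(z) > 1.96 || n == max(looks)) return(z)
  }
}
z_seq <- replicate(1e4, final_z())
mean(abs(z_seq) > 1.96)    # Type I error well above the nominal 0.05
hist(z_seq, breaks = 100)  # visibly distorted, with spikes past +-1.96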

That being said, it is my understanding that the likelihood principle seems to imply we can throw out everything we know about frequentist sequential design and ignore how many times we analyze our data. Clearly, the implications of this, especially for the field of clinical design, are enormous. However, I have not wrapped my head around how one justifies ignoring the way stopping rules change the likelihood of the final statistic.

Some light discussion can be found here, mostly in the final slides.


2
+1. I find it conceptually easier to consider a hypothetical situation where a point null hypothesis is true, but one keeps sampling until p < 0.05 (this will always happen sooner or later, i.e., it happens with probability 1) and then decides to stop the trial. This moronic procedure will have a 100% Type I error rate, even though it complies with the LP.
amoeba says Reinstate Monica

@amoeba: I agree that your example is much simpler (+1). The intent of my answer was to emphasize why there is a debate at all. I think the answer is that if the implications and interpretation of the LP were correct, it would mean clinical trials would no longer have to choose between maximal power and unnecessary exposure, which would be an absolutely huge gain. It would also generally free researchers from having to guess an appropriate sample size in advance, greatly increasing the utility of statistical testing.
Cliff AB
Cliff AB

Well, I think the whole framework of frequentist testing is inconsistent with the LP, and that is just how it is. If you want error-rate guarantees, use frequentist testing; it happens to be inconsistent with the LP. See also Lindley's paradox and all that. Well, tough. I used to get excited about these things, but not anymore. There is no free lunch; one has to make some choices. Note that many Bayesian procedures violate the LP too.
amoeba says Reinstate Monica

"the likelihood of the final test statistic is greatly altered by the stopping rule": the pdf changes, and the likelihood changes as well (but only by a constant), yet you may still end up with the same likelihood function up to a constant of proportionality. For instance, the binomial and negative binomial distributions for k successes in n trials both have likelihood L(p | n, k) ∝ p^k (1−p)^(n−k).
Sextus Empiricus

3

An outline of LR tests for exponential data.

Let X1, X2, …, Xn be a random sample from Exp(rate = λ), so that E(Xi) = μ = 1/λ. For x > 0, the density function is f(x) = λe^(−λx) and the CDF is F(x) = 1 − e^(−λx).

1. Test statistic is the sample minimum.

Let V = X(1) = min(X1, …, Xn). Then V ∼ Exp(nλ). As an outline of the proof,

P(V > v) = P(X1 > v, …, Xn > v) = [e^(−λv)]^n = e^(−nλv),
so that P(V ≤ v) = 1 − e^(−nλv), for v > 0.

Suppose we test H0: μ ≤ μ0 against Ha: μ > μ0 at level α = 5%, rejecting for large V, that is, for V > c, where P(V > c | μ = μ0) = 0.05.

For the specific case in which n = 100 and μ0 = 10 (λ0 = 0.1), we have exponential rate nλ0 = n/μ0 = 100/10 = 10, so that c = 0.2996 from R, where the exponential distribution is parameterized by the rate.

 qexp(.95, 10)
 [1] 0.2995732
 1 - pexp(0.2996, 10)
 [1] 0.04998662
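
As a quick sanity check (an editorial addition, not part of the original answer), simulation agrees with V ∼ Exp(nλ):

# Simulate the minimum of n = 100 draws from Exp(rate = 0.1) and compare
# the rejection rate at c = 0.2996 with the nominal 5% level.
set.seed(1)
v <- replicate(1e5, min(rexp(100, rate = 0.1)))
mean(v > 0.2996)   # approximately 0.05, matching 1 - pexp(0.2996, 10)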

Accordingly, the power against the alternative μa=100 (rate n/μa=1) is about 74%.

1 - pexp(0.2996, 1)
[1] 0.7411146

2. Test statistic is the sample mean.

Oxford U. class notes (second page) show that the likelihood ratio test of H0: μ ≤ μ0 against Ha: μ > μ0 at the 5% level of significance rejects for X̄ > c, where P(X̄ > c | μ = μ0) = 0.05. Furthermore, one can show using moment generating functions that X̄ ∼ Gamma(n, nλ).

For the specific case in which n = 100 and μ0 = 10 (λ0 = 0.1), we have X̄ ∼ Gamma(100, 10), so that c = 11.7.

qgamma(.95, 100, 10)
[1] 11.69971
1 - pgamma(11.7, 100, 10)
[1] 0.04997338

Accordingly, power against the alternative μa=14 is about 95.6%.

1 - pgamma(11.7, 100, 100/14)
[1] 0.9562513

Clearly, for purposes of testing hypotheses about the exponential mean μ, the information in the sufficient statistic X¯ is much greater than the information in the sample minimum.


I don't think this addresses the question at all. Are the two likelihoods proportional? You first need to show that the likelihoods of the two experiments are proportional, otherwise the likelihood principle does not apply. Second, in this example the two tests lead to the same conclusion, so it is even more underwhelming than the binomial versus negative binomial example.
statslearner2

I just checked the document: the likelihoods are not proportional, since the first likelihood has v in the exponent and the other has the sum of the xi. Thus the likelihood principle should not apply here, and it is fine, according to the likelihood principle, for the two tests to lead to different conclusions.
statslearner2

2
Bruce, just to clarify what the likelihood principle states: it says that if you have two experiments where the likelihoods differ only by a constant, then you should derive the same conclusions from them. This happens in the binomial versus negative binomial case, where they differ only in the binomial coefficient (a constant). Your example shows two tests whose likelihoods do not differ only by a constant, so the LP does not apply.
statslearner2

@statslearner2 The likelihood function for the observed sample X1, …, Xn is:
f(x1, …, xn) = ∏_{i=1}^{n} λe^(−λxi)
This is the same whether you select the minimum or the mean as the criterion for performing the test. The violation that occurs here can be seen as the type in which the definition of "extreme cases" differs and the integration to compute the p-value is done differently.
Sextus Empiricus

3

Violation by different pdf functions f(x,θ) and g(x,θ)

This case is an example of "violation" because the probability density functions f(x, θ) and g(x, θ) are intrinsically different. Even when f and g differ, they may still relate to the likelihood principle, because at a fixed measurement x they give the same function of θ up to scaling. The difference opens up a possibility for "violations".


The coin flip with or without optional stopping rule

The coin flip with or without an optional stopping rule is a typical example. The pdf is binomial or negative binomial, which are different pdfs that lead to different calculations of p-values and confidence intervals, yet they lead to the same likelihood function for a fixed sample/measurement (up to scaling).

f_NegativeBinomial(n | k, p) = C(n−1, k−1) p^k (1−p)^(n−k)
f_Binomial(k | n, p) = C(n, k) p^k (1−p)^(n−k)
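
One can check in R that these two likelihoods differ only by a constant in p (a small sketch using, say, k = 3 successes in n = 12 trials):

# dnbinom counts failures before the k-th success, so 9 failures here;
# the ratio C(11,2)/C(12,3) = 55/220 = 0.25 does not depend on p.
p <- seq(0.1, 0.9, by = 0.1)
dnbinom(9, size = 3, prob = p) / dbinom(3, size = 12, prob = p)   # all 0.25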


More extreme example

Consider some measurement of X which is distributed as

L(θ | x) = f(x | θ) =
  0                          if x < 0
  a                          if 0 ≤ x < 1
  (1−a) θ e^(−θ(x−1))        if x ≥ 1

where a is some known parameter that depends on the type of experiment, and θ is some parameter that may be unknown and could be inferred from the measurement x.

For any given x and a, the likelihood function is proportional to the same function, which is independent of a:

  • If x < 1, then L(θ | x) ∝ 1
  • If x ≥ 1, then L(θ | x) ∝ θ e^(−θ(x−1))

But, albeit with the same likelihood function, the p-value can vary widely depending on the experiment (i.e., on the value of a). For instance, when you measure x = 2 and test H0: θ = 1 against Ha: θ < 1, the p-value is

P(X > 2 | θ = 1) = (1 − a) exp(−1)
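
To make the dependence on a concrete (an editorial sketch), the same observed x = 2 gives the same likelihood function but very different p-values across experiments:

# p-value P(X > 2 | theta = 1) = (1 - a) * exp(-1), same observed x = 2:
p_value <- function(a) (1 - a) * exp(-1)
p_value(c(0, 0.5, 0.9, 0.99))   # 0.368, 0.184, 0.037, 0.004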


Intuition: The reason for violation in these cases is that p-values and hypothesis tests are not solely based on the likelihood function for the particular observed value x.

The p-value is not calculated from the likelihood L(θ | x) with x fixed, but from the pdf f(x | θ) with θ fixed, which is a different slice. Confidence intervals, p-values, and hypothesis tests are different things from the information in likelihood ratios.

p-values are not really evidence: the p-value relates to the Type I error, which is a measure that refers to an ensemble of measurements rather than to a single measurement. This Type I error or p-value is not the same as the "evidential meaning" of Birnbaum's "foundations of statistical evidence". This relates a great deal to the problems with p-values and with scientists searching for outcomes that are merely statistically significant rather than for important effects.

Do we need examples where inferences are markedly different? The extreme case above is a contrived example. Such a case, or anything with a similarly extreme difference, does of course not occur easily in practice. More often the difference will be small, as in the cases you refer to as silly.

To ask for examples where the likelihood principle "really matters", or where two different inferences lead to extremely different results, is a bit of a loaded question, at least when the intention behind the question relates to some philosophical argument. It is loaded because it presupposes that principles that matter should lead to extremely different results. In many practical cases, however, the differences are small (p-values differing by less than an order of magnitude). I believe it is not strange for two different, but both plausible, methods to produce more or less similar results. I would not consider the likelihood principle "less violated" when the differences are only small.


Regarding Case 1: I think choosing a different test statistic can (should?) be seen as changing the likelihood function.
amoeba says Reinstate Monica

2
@MartijnWeterings Yes, it is choosing a different test statistic; what matters is the likelihood of the statistic, not of the data. Otherwise I could take a sequence of 100 flips and compute several statistics: the number of runs of heads, the number of alternations between heads and tails. None of this violates the LP.
statslearner2

You need to pick two statistics that have proportional likelihoods, such as the number of trials until 3 successes, or the number of successes in n trials, etc.
statslearner2

1

Here is an example adapted from Statistical decision theory and Bayesian analysis by James O. Berger (Second edition page 29).

Say that two species of wasps can be distinguished by the number of notches on the wings (call this x) and by the number of black rings around the abdomen (call this y). The distribution of the characters in the two species (labelled H0 and H1) are as follows:

[Table not reproduced here; adapted from Statistical decision theory and Bayesian analysis by James O. Berger.]

Say that we find a specimen with 1 notch on the wings and 1 ring around the abdomen. The weight of evidence is 100 times bigger in favor of H1 against H0 for both characters.

Now if someone wanted to set up a test of H0 at the 5% level, the decision rule would be, for the first character, "accept H0 if there is 1 notch on the wing, otherwise reject it", and for the second character, "accept H0 if there are 3 rings around the abdomen, otherwise reject it". There are many other possibilities, but these are the most powerful tests at this level. Yet they lead to different conclusions for the two characters.


Note: one could of course set up a test with the rule “accept H0 if there are 1 or 3 rings around the abdomen, otherwise reject it”. The question is whether we prefer a test at 5% level with type II risk 0, or a test at 4.9% level with type II risk 0.00001. The difference is so small that we would probably not care, but as I understand it, this is the core of the argument for the likelihood principle: it is not a good idea to make the result depend on something that seems irrelevant.


The likelihood functions are proportional, and yet the p-value of x = 1 is 0.95, and that of y = 1 is 0.001 (assuming that we reject H0 with events of the form y ≤ α). It is obvious from the structure of the table that I could have chosen any number smaller than 0.001. Also, the Type II risk of the rejection is 0, so it looks like there is nothing "wrong" here.

Still, I admit that this example is somewhat contrived and not completely honest because it plays with the difficulty of arranging tests with discrete data. One could find equivalent examples with continuous data but they would be even more contrived. I agree with the OP that the likelihood principle has almost no practical value; I interpret it as a principle to guarantee some consistency within the theory.
