My personal appraisal of his arguments:
- Here he talks about using p as evidence for the Null, whereas his thesis is that p can't be used as evidence against the Null. So, I think this argument is largely irrelevant.
- I think this is a misunderstanding. Fisherian p testing is strongly rooted in Popper's Critical Rationalism, which states that you cannot support a theory but only criticize it. In that sense there is only a single hypothesis (the Null), and you simply check whether your data are in accordance with it.
- I disagree here. It depends on the test statistic, but p is usually a monotone transformation of an effect size that speaks against the Null: the larger the effect, the smaller the p value, all other things being equal (see the first sketch after this list). Of course, this ordering is no longer valid when comparing p values across different data sets or hypotheses.
- I am not sure I completely understand this statement, but from what I can gather this is less a problem with p than with people using it wrongly. p was intended to have a long-run frequency interpretation, and that is a feature, not a bug (the second sketch after this list illustrates that reading). But you cannot blame p for people taking a single p value as proof of their hypothesis, or for people publishing only results with p < .05.
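
To illustrate the effect-size point, here is a minimal sketch; the one-sample z-test with known sigma, the sample size, and the effect sizes are my own choices for illustration, not from his text:

```python
# Minimal sketch (assumptions mine): a one-sample z-test with known sigma = 1
# and fixed n, showing that p is a monotone decreasing transformation of the
# observed effect size, all other things being equal.
from scipy.stats import norm

n = 30                                       # hypothetical sample size
for effect in [0.0, 0.1, 0.2, 0.3, 0.5]:     # hypothetical observed standardized effects
    z = effect * n ** 0.5                    # test statistic: sqrt(n) * (xbar - mu0) / sigma
    p = 2 * norm.sf(abs(z))                  # two-sided p value under H0: mu = mu0
    print(f"effect = {effect:.1f}  ->  p = {p:.4f}")
```

With n held fixed, the printed p values shrink as the observed effect grows, which is all the bullet above claims.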
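And a second minimal sketch for the long-run frequency reading; the simulation setup (normal data, n = 20, 10,000 replications) is again assumed by me:

```python
# Minimal sketch (assumptions mine): when the Null is true, p is uniformly
# distributed, so in the long run about 5% of p values fall below .05.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
n_reps, n = 10_000, 20
pvals = np.empty(n_reps)
for i in range(n_reps):
    x = rng.normal(loc=0.0, scale=1.0, size=n)      # data generated under H0: mu = 0
    pvals[i] = ttest_1samp(x, popmean=0.0).pvalue   # two-sided one-sample t-test
print("share of p < .05:", (pvals < 0.05).mean())   # close to 0.05
```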
His suggestion of using the likelihood ratio as a measure of evidence is in my opinion a good one (although the idea of a Bayes factor is more general), but the context in which he brings it up is a bit peculiar: First, he leaves the grounds of Fisherian testing, where there is no alternative hypothesis from which to calculate a likelihood ratio; yet p as evidence against the Null is Fisherian, so he conflates Fisher with Neyman-Pearson. Second, most test statistics we use are (functions of) the likelihood ratio, and in that case p is itself a transformation of the likelihood ratio. As Cosma Shalizi puts it:
> among all tests of a given size s, the one with the smallest miss probability, or highest power, has the form "say 'signal' if q(x)/p(x) > t(s), otherwise say 'noise'," and that the threshold t varies inversely with s. The quantity q(x)/p(x) is the likelihood ratio; the Neyman-Pearson lemma says that to maximize power, we should say "signal" if it is sufficiently more likely than noise.
Here $q(x)$ is the density under the state "signal" and $p(x)$ the density under the state "noise". The measure of "sufficiently likely" would here be $P\left(q(X)/p(X) > t_{\mathrm{obs}} \mid H_0\right)$, which is exactly p. Note that in correct Neyman-Pearson testing $t_{\mathrm{obs}}$ is replaced by a fixed threshold $t(s)$ such that $P\left(q(X)/p(X) > t(s) \mid H_0\right) = \alpha$.
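
To make this concrete, here is a minimal sketch with a toy simple-vs-simple setup I made up (noise $X \sim N(0,1)$, signal $X \sim N(1,1)$, a single observation): the likelihood ratio is monotone in x, so the p value computed from the likelihood ratio is just the usual tail probability of x under the Null, and the Neyman-Pearson rule merely fixes the cutoff $t(s)$ in advance so that this Null tail probability equals $\alpha$.

```python
# Minimal sketch (toy setup, assumptions mine): noise X ~ N(0, 1), signal X ~ N(1, 1),
# a single observation. The likelihood ratio q(x)/p(x) = exp(x - 1/2) is monotone
# in x, so thresholding the likelihood ratio is the same as thresholding x itself.
from scipy.stats import norm

def likelihood_ratio(x):
    return norm.pdf(x, loc=1.0) / norm.pdf(x, loc=0.0)   # q(x) / p(x)

x_obs = 1.8                                # hypothetical observed value
lr_obs = likelihood_ratio(x_obs)

# Fisherian reading: p = P(q(X)/p(X) > lr_obs | H0) = P(X > x_obs | H0)
p_value = norm.sf(x_obs)                   # tail probability under the noise model

# Neyman-Pearson reading: fix alpha first, then derive the cutoff t(alpha)
alpha = 0.05
x_cut = norm.isf(alpha)                    # P(X > x_cut | H0) = alpha
t_alpha = likelihood_ratio(x_cut)          # corresponding likelihood-ratio threshold t(s)

print(f"p value         = {p_value:.4f}")
print(f"reject by p     : {p_value < alpha}")
print(f"reject by NP LR : {lr_obs > t_alpha}")   # identical decision
```

In this monotone-likelihood-ratio toy case the two rules reject in exactly the same situations; the difference is only whether the threshold is fixed in advance (Neyman-Pearson) or the tail probability of the observed statistic is reported (Fisher).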