Can a meta-analysis of studies which are all "not statistically significant" lead to a "significant" conclusion?



A meta-analysis includes a number of studies, all of which reported P-values greater than 0.05. Is it possible for the overall meta-analysis to report a P-value less than 0.05? Under what circumstances?

(I am pretty sure the answer is yes, but I would like a reference or an explanation.)


I don't know much about meta-analysis, but my impression is that it doesn't involve any hypothesis testing, just estimation of a population effect, in which case there is no notion of significance to speak of.
Kodiologist '16

Well, at the end of the day, a meta-analysis is just a weighted average. And you can certainly set up a hypothesis test for that weighted average. See, e.g., Borenstein, Michael, et al. "A basic introduction to fixed-effect and random-effects models for meta-analysis." Research Synthesis Methods 1.2 (2010): 97-111.
boscovich

The other answers are good too, but here is a simple case: two studies are significant at p = 0.9 but not at p = 0.95. The chance that two independent studies would both show p >= 0.9 is only 0.01, so your meta-analysis could show significance at p = 0.99.
Barrycarter

In the limit: no single measurement can provide sufficient evidence for a (non-trivial) hypothesis to give it a small p-value, but a sufficiently large collection of measurements can.
Eric Towers

A p-value by itself indicates neither "statistically significant" nor insignificant results. What are we to understand by a significant conclusion? Is it the conclusion of a meta-analysis?
Subhash C. Davar

Answers:



In theory, yes...

The results of individual studies may be insignificant, but viewed together, the combined results may be significant.

In theory you can proceed by treating the result $y_i$ of study $i$ like any other random variable.

Let $y_i$ be some random variable (e.g. the estimate from study $i$). Then if the $y_i$ are independent and $E[y_i] = \mu$, you can consistently estimate the mean with:

$$\hat{\mu} = \frac{1}{n} \sum_i y_i$$

Adding more assumptions, let $\sigma_i^2$ be the variance of estimate $y_i$. Then you can efficiently estimate $\mu$ with inverse variance weighting:

$$\hat{\mu} = \sum_i w_i y_i, \qquad w_i = \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2}$$

In either of these cases, $\hat{\mu}$ may be statistically significant at some confidence level even if none of the individual estimates are.
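The inverse-variance weighting above can be sketched in a few lines of Python. All study estimates and standard errors below are made-up numbers chosen for illustration: each study is individually non-significant under a two-sided z-test at $\alpha = 0.05$, yet the pooled estimate is significant.

```python
from statistics import NormalDist

# Hypothetical studies: estimates y_i and their standard errors sigma_i
estimates = [0.30, 0.25, 0.35, 0.28, 0.32]
ses       = [0.18, 0.20, 0.21, 0.17, 0.19]

norm = NormalDist()
two_sided_p = lambda z: 2 * (1 - norm.cdf(abs(z)))

# Individual p-values: every one of them exceeds 0.05
indiv_p = [two_sided_p(y / se) for y, se in zip(estimates, ses)]

# Inverse-variance weights w_i = (1/sigma_i^2) / sum_j (1/sigma_j^2)
inv_var = [1 / se**2 for se in ses]
weights = [iv / sum(inv_var) for iv in inv_var]
mu_hat = sum(w * y for w, y in zip(weights, estimates))
se_pooled = (1 / sum(inv_var)) ** 0.5   # SE of the weighted mean
pooled_p = two_sided_p(mu_hat / se_pooled)

print([round(p, 3) for p in indiv_p])   # each > 0.05
print(round(mu_hat, 3), round(pooled_p, 5))   # pooled p < 0.05
```

Because the pooled standard error shrinks roughly like $1/\sqrt{n}$ while the estimates reinforce each other, the combined z-statistic clears the significance threshold that each study missed.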

But there can be big problems, issues to be aware of...

  1. If $E[y_i] \neq \mu$, then the meta-analysis may not converge to $\mu$ (i.e. the mean of the meta-analysis is an inconsistent estimator).

     For example, if there's a bias against publishing negative results, then this simple meta-analysis may be horribly inconsistent and biased! It would be like estimating the probability that a coin flip lands tails by observing only the flips that did not land tails!

  2. $y_i$ and $y_j$ may not be independent. For example, if two studies $i$ and $j$ were based on the same data, then treating $y_i$ and $y_j$ as independent in the meta-analysis may vastly underestimate the standard errors and overstate statistical significance. Your estimates would still be consistent, but the standard errors need to reasonably account for cross-correlation across studies.

  3. Combining (1) and (2) can be especially bad.

     For example, a meta-analysis that averages polls together tends to be more accurate than any individual poll. But averaging polls together is still vulnerable to correlated error. Something that has come up in past elections is that young exit-poll workers may tend to interview other young people rather than old people. If all the exit polls make the same error, then you may think you have a good estimate when you don't (the exit polls are correlated because they use the same methodology, and this methodology generates the same error).

Undoubtedly people more familiar with meta-analysis can come up with better examples, more nuanced issues, more sophisticated estimation techniques, etc..., but this touches on some of the most basic theory and some of the bigger problems to watch out for. If different studies make independent random errors, then meta-analysis can be incredibly powerful. If the errors are systematic across studies (e.g. everyone undercounts older voters etc...), then the average of the studies will also be off. If you underestimate how correlated the studies are or how correlated the errors are, then you effectively overestimate your aggregate sample size and underestimate your standard errors.

There are also all kinds of practical issues of consistent definitions etc...
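Point (2) can be illustrated with a rough Monte Carlo sketch. The correlation structure (a single common shock shared by all studies) and all parameter values below are assumptions chosen for illustration: per-study variance is 1 in both scenarios, so the only difference is the cross-study correlation.

```python
import random
from statistics import mean, pstdev

random.seed(0)

# n "studies" per meta-analysis; in the correlated case each estimate is
# common shock + idiosyncratic noise, so Cov(y_i, y_j) = rho for i != j,
# while Var(y_i) = 1 in both cases.
n, reps, rho = 10, 20000, 0.5

def mean_of_studies(correlated):
    common = random.gauss(0, rho ** 0.5) if correlated else 0.0
    idio_sd = (1 - rho) ** 0.5 if correlated else 1.0
    return mean(common + random.gauss(0, idio_sd) for _ in range(n))

naive_se = 1 / n ** 0.5   # SE assuming independence
indep_sd = pstdev([mean_of_studies(False) for _ in range(reps)])
corr_sd  = pstdev([mean_of_studies(True)  for _ in range(reps)])

print(round(naive_se, 3))   # ~0.316
print(round(indep_sd, 3))   # close to naive_se
print(round(corr_sd, 3))    # much larger: sqrt(rho + (1-rho)/n) ~ 0.74
```

With positive correlation, the true standard deviation of the average is $\sqrt{\rho + (1-\rho)/n}$ rather than $1/\sqrt{n}$: the naive standard error badly understates the actual sampling variability, which is exactly the overstated-significance problem described above.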


I criticized a meta-analysis for ignoring dependence between effect sizes (i.e., many effect sizes were based on the same participants, but treated as independent). The authors said it was no big deal, since we are only interested in moderators anyway. I pointed to what you say here: treating them as independent "may vastly underestimate standard errors and overstate statistical significance." Is there a proof/simulation study showing why this is the case? I have lots of references saying correlated errors mean SEs are underestimated... but I don't know why?
马克·怀特

@MarkWhite The basic idea is no more complicated than $\mathrm{Var}\left(\frac{1}{n}\sum_i X_i\right) = \frac{1}{n^2}\left(\sum_i \mathrm{Var}(X_i) + \sum_{i \neq j} \mathrm{Cov}(X_i, X_j)\right)$. If $\mathrm{Var}(X_i) = \sigma^2$ for all $i$ and $\mathrm{Cov}(X_i, X_j) = 0$ for $i \neq j$, your standard error is $\frac{\sigma}{\sqrt{n}}$. On the other hand, if the covariance terms are positive and large, the standard error will be larger.
马修·冈恩

@MarkWhite I'm not a meta-analysis expert, and I honestly don't know what a good source is for how one should do a modern meta-analysis. Conceptually, repeated analysis of the same data is certainly useful (as are deep studies of some subject), but it's not the same as reproducing a finding on new, independent subjects.
马修·冈恩

Ah, that is to say: an effect size's total variance comes from (a) its own variance and (b) its covariance with other effect sizes. If the covariances are 0, then the standard error estimate is fine; but if it covaries with other effect sizes, that needs to be taken into account, and ignoring it means we underestimate the variance. It's like the variance is made of two parts, A and B, and ignoring the dependence assumes the B part is 0, no?
马克·怀特

Also, this seems like a great source (especially Box 2): nature.com/neuro/journal/v17/n4/pdf/nn.3648.pdf
马克·怀特


Yes. Suppose you have $N$ p-values from $N$ independent studies.

Fisher's test

(Edit - in response to @mdewey's helpful comment below, it is important to distinguish between different meta tests. I spell out the case of another meta test mentioned by mdewey below.)

The classical Fisher meta test (see Fisher (1932), "Statistical Methods for Research Workers") uses the statistic

$$F = -2 \sum_{i=1}^N \ln(p_i),$$

which has a $\chi^2_{2N}$ null distribution, since $-2\ln(U) \sim \chi^2_2$ for a uniform r.v. $U$.

Let $\chi^2_{2N}(1-\alpha)$ denote the $(1-\alpha)$-quantile of the null distribution.

Suppose all p-values are equal to $c$, where, possibly, $c > \alpha$. Then, $F = -2N\ln(c)$ and $F > \chi^2_{2N}(1-\alpha)$ when

$$c < \exp\left(-\frac{\chi^2_{2N}(1-\alpha)}{2N}\right)$$
For example, for α=0.05 and N=20, the individual p-values only need to be less than
> exp(-qchisq(0.95, df = 40)/40)
[1] 0.2480904

Of course, what the meta statistic tests is "only" the "aggregate" null that all individual nulls are true, which is to be rejected as soon as only one of the N nulls is false.
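The R computation above can be cross-checked with a stdlib-only Python sketch of Fisher's test. For even degrees of freedom $2N$ the chi-squared survival function has a closed form (a Poisson partial sum), so no statistics library is needed; the values $N = 20$ and $p_i = 0.2$ follow the text.

```python
from math import log, exp, factorial

# N studies, each individually non-significant with p = 0.2
N = 20
pvals = [0.2] * N

# Fisher's statistic: F ~ chi^2 with 2N df under the aggregate null
F = -2 * sum(log(p) for p in pvals)

# Survival function of chi^2 with even df = 2N:
# P(X > x) = exp(-x/2) * sum_{k=0}^{N-1} (x/2)^k / k!
combined_p = exp(-F / 2) * sum((F / 2) ** k / factorial(k) for k in range(N))

print(round(F, 2))           # 64.38, above qchisq(0.95, df = 40) = 55.76
print(round(combined_p, 4))  # well below 0.05
```

Twenty p-values of 0.2 thus reject the aggregate null comfortably, even though each is far from significant on its own.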

EDIT:

Here is a plot of the "admissible" p-values against $N$, which confirms that $c$ grows in $N$, although it seems to level off at $c \approx 0.36$.

[Plot: admissible p-value $c$ against $N$]

I found an upper bound for the quantiles of the $\chi^2$ distribution,

$$\chi^2_{2N}(1-\alpha) \leq 2N + 2\log(1/\alpha) + 2\sqrt{2N\log(1/\alpha)},$$

here, suggesting that $\chi^2_{2N}(1-\alpha) = O(N)$, so that $\exp\left(-\frac{\chi^2_{2N}(1-\alpha)}{2N}\right)$ is bounded from above by $\exp(-1)$ as $N \to \infty$. As $\exp(-1) \approx 0.3679$, this bound seems reasonably sharp.

Inverse Normal test (Stouffer et al., 1949)

The test statistic is given by

$$Z = \frac{1}{\sqrt{N}} \sum_{i=1}^N \Phi^{-1}(p_i),$$

with $\Phi^{-1}$ the standard normal quantile function. The test rejects for large negative values, viz., if $Z < -1.645$ at $\alpha = 0.05$. Hence, for $p_i = c$, we have $Z = \sqrt{N}\,\Phi^{-1}(c)$. When $c < 0.5$, $\Phi^{-1}(c) < 0$ and hence $Z \to_p -\infty$ as $N \to \infty$. If $c \geq 0.5$, $Z$ will take values in the acceptance region for any $N$. Hence, a common p-value less than 0.5 is sufficient to produce a rejection of the meta test as $N \to \infty$.

More specifically, $Z < -1.645$ if $c < \Phi(-1.645/\sqrt{N})$, which tends to $\Phi(0) = 0.5$ from below as $N \to \infty$.
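The drift of the Stouffer statistic into the rejection region can be seen in a minimal Python sketch; the common p-value $c = 0.4$ and the sample sizes are arbitrary illustrative choices.

```python
from statistics import NormalDist

norm = NormalDist()

def stouffer_z(pvals):
    # Inverse normal (Stouffer) statistic: sum of probits over sqrt(N)
    return sum(norm.inv_cdf(p) for p in pvals) / len(pvals) ** 0.5

c = 0.4   # each study individually non-significant
for N in (5, 50, 500):
    Z = stouffer_z([c] * N)
    print(N, round(Z, 2), Z < -1.645)   # rejects once N is large enough
```

Since $\Phi^{-1}(0.4) \approx -0.253$, $Z = -0.253\sqrt{N}$, which crosses $-1.645$ between $N = 5$ and $N = 50$ here.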


+1 and wow! did not expect there to be an upper bound at all, let alone 1/e.
amoeba says Reinstate Monica

Thanks :-). I had not expected one either before I saw the plot...
Christoph Hanck

Interestingly the method due to Fisher is the only one of the commonly used methods which has this property. For most of the others what you call $F$ increases with $N$ if $c > 0.5$ and decreases otherwise. That applies to Stouffer's method and Edgington's method as well as methods based on logits and on the mean of p. The various methods which are special cases of Wilkinson's method (minimum p, maximum p, etc) have different properties again.
mdewey

@mdewey, that is interesting indeed, I just picked Fisher's test purely because it came to my mind first. That said, by "only one", do you mean the specific bound 1/e? Your comments, that I try to spell out in my edit, suggest to me that Stouffer's method also has an upper bound, that turns out to be 0.5?
Christoph Hanck

I am not going to have time to go into this for another week but I think if you have ten studies with p=0.9 you get an overall p as close to unity as makes no difference. There may be a one- versus two-sided issue here. If you want to look at more material I have a draft of extra stuff to go into my R package <code>metap</code> here which you are free to use to expand your answer if you wish.
mdewey


The answer to this depends on what method you use for combining p-values. Other answers have considered some of these but here I focus on one method for which the answer to the original question is no.

The minimum p method, also known as Tippett's method, is usually described in terms of a rejection at the α level of the null hypothesis. Define

$$p_{[1]} \leq p_{[2]} \leq \dots \leq p_{[k]}$$
for the k studies. Tippett's method then evaluates whether
$$p_{[1]} < 1 - (1 - \alpha)^{1/k}$$

It is easy to see that, since the $k$th root of a number less than unity is closer to unity, $(1-\alpha)^{1/k} > 1 - \alpha$ and hence the right-hand side is less than $\alpha$, so the overall result will be non-significant unless $p_{[1]}$ is already less than $\alpha$.

It is possible to work out the critical value; for example, if we have ten primary studies each with a p-value of 0.05, so as close to significant as can be, then the overall p-value is 0.40. The method can be seen as a special case of Wilkinson's method, which uses $p_{[r]}$ for $1 \leq r \leq k$, and in fact for this particular set of primary studies even $r = 2$ is not significant ($p = 0.09$).
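These numbers can be checked in a few lines of Python; the binomial computation for the Wilkinson $r = 2$ case is my reconstruction of how the quoted $p = 0.09$ arises.

```python
from math import comb

k, alpha = 10, 0.05

# Tippett's critical value for the smallest p-value
crit = 1 - (1 - alpha) ** (1 / k)
print(round(crit, 4))        # 0.0051: far below alpha

# Combined p-value when the smallest of ten p-values is 0.05:
# P(min of k uniforms <= p_min) = 1 - (1 - p_min)^k
p_min = 0.05
overall = 1 - (1 - p_min) ** k
print(round(overall, 2))     # 0.40: nowhere near significant

# Wilkinson with r = 2: P(at least 2 of 10 uniform p-values <= 0.05)
p_r2 = 1 - sum(comb(k, j) * p_min**j * (1 - p_min)**(k - j) for j in range(2))
print(round(p_r2, 2))        # 0.09, matching the text
```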

L H C Tippett's method is described in his book The Methods of Statistics, 1931 (1st ed), and Wilkinson's method is here in an article, "A statistical consideration in psychological research".


Thanks. But note that most meta-analysis methods combine effect sizes (accounting for any difference in sample size), and do not combine P values.
Harvey Motulsky

@HarveyMotulsky agreed, combining p-values is a last resort but the OP did tag his question with the combining-p-values tag so I responded in that spirit
mdewey

I think that your answer is correct.
Subhash C. Davar
Licensed under cc by-sa 3.0 with attribution required.