Example of a strong correlation coefficient with a high p-value


21

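I was wondering, is it possible to have a very strong correlation coefficient (say 0.9 or higher) together with a high p-value (say 0.25 or higher)?

Here is an example of a low correlation coefficient with a high p-value: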

set.seed(10)
y <- rnorm(100)
x <- rnorm(100)+.1*y
cor.test(x,y)

cor = 0.03908927, p = 0.6994

High correlation coefficient, low p-value:

y <- rnorm(100)
x <- rnorm(100)+2*y
cor.test(x,y)

cor = 0.8807809, p = 2.2e-16

Low correlation coefficient, low p-value:

y <- rnorm(100000)
x <- rnorm(100000)+.1*y
cor.test(x,y)

cor = 0.1035018, p = 2.2e-16

High correlation coefficient, high p-value: ???

Answers:


36

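The bottom line

The sample correlation coefficient needed to reject the hypothesis that the true (Pearson) correlation coefficient is zero becomes small quite fast as the sample size increases. So, in general, no, you cannot simultaneously have a large (in magnitude) correlation coefficient and a large p-value.

The top line (details)

The test used for the Pearson correlation coefficient in the R function cor.test is a very slightly modified version of the method I discuss below.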

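Suppose $(X_1,Y_1),(X_2,Y_2),\ldots,(X_n,Y_n)$ are iid bivariate normal random vectors with correlation $\rho$. We want to test the null hypothesis that $\rho = 0$ versus $\rho \neq 0$. Let $r$ be the sample correlation coefficient. Using standard linear-regression theory, it is not hard to show that the test statistic

$$T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

has a $t_{n-2}$ distribution under the null hypothesis. For large $n$, the $t_{n-2}$ distribution approaches the standard normal, so $T^2$ is approximately chi-squared distributed with one degree of freedom. (Under the assumptions we have made, $T^2 \sim F_{1,n-2}$ in actuality, but the $\chi^2_1$ approximation makes clearer what is going on, I think.)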
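As a quick check of the statistic above, the following R sketch computes T directly from a sample correlation and compares it with what cor.test reports. The simulated data, the seed, and the variable names are illustrative assumptions, not part of the original answer.

```r
# Illustrative data (assumed): 30 independent standard-normal pairs
set.seed(1)
n <- 30
x <- rnorm(n)
y <- rnorm(n)

r <- cor(x, y)

# T = r * sqrt(n - 2) / sqrt(1 - r^2), t-distributed with n - 2 df under the null
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

ct <- cor.test(x, y)

# Both comparisons should be TRUE up to floating-point error
all.equal(unname(ct$statistic), t_manual)
all.equal(ct$p.value, p_manual)
```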

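So,

$$P\left(\frac{r^2}{1-r^2}(n-2) \geq q_{1-\alpha}\right) \approx \alpha,$$

where $q_{1-\alpha}$ is the $(1-\alpha)$ quantile of a chi-squared distribution with one degree of freedom.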
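The probability statement above can be checked by simulation. The sketch below draws independent normal samples (so the null is true) and counts how often the statistic exceeds the chi-squared quantile. The sample size, number of replications, and seed are arbitrary choices made for illustration.

```r
# Monte Carlo check (illustrative settings): under rho = 0, how often does
# (n - 2) * r^2 / (1 - r^2) exceed the chi-squared quantile q_{1 - alpha}?
set.seed(42)
n     <- 50
alpha <- 0.05
nsim  <- 5000
q     <- qchisq(1 - alpha, df = 1)

stat <- replicate(nsim, {
  r <- cor(rnorm(n), rnorm(n))   # independent samples, so the null is true
  (n - 2) * r^2 / (1 - r^2)
})

# Should be close to alpha; the chi-squared approximation is
# slightly liberal at moderate n
reject_rate <- mean(stat >= q)
reject_rate
```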

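Now, note that $r^2/(1-r^2)$ is increasing as $r^2$ increases. Rearranging the quantity in the probability statement, we have that for all

$$|r| \geq \frac{1}{\sqrt{1+(n-2)/q_{1-\alpha}}}$$

we will reject the null hypothesis at level $\alpha$. Clearly the right-hand side decreases with $n$.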
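To make the threshold concrete, here is a small R sketch that evaluates it for a few sample sizes. The helper names are hypothetical. `r_crit_chisq` is the chi-squared approximation from the text, and `r_crit_t` is the same rearrangement using the squared $t_{n-2}$ quantile instead, which is an addition for comparison.

```r
# Smallest |r| that rejects H0: rho = 0 at level alpha, for sample size n.
# r_crit_chisq uses the chi-squared approximation from the text;
# r_crit_t uses the squared t_{n-2} quantile (hypothetical helper names).
r_crit_chisq <- function(n, alpha = 0.05) {
  q <- qchisq(1 - alpha, df = 1)
  1 / sqrt(1 + (n - 2) / q)
}

r_crit_t <- function(n, alpha = 0.05) {
  q <- qt(1 - alpha / 2, df = n - 2)^2
  1 / sqrt(1 + (n - 2) / q)
}

n <- c(5, 10, 30, 100, 1000)
round(data.frame(n = n,
                 approx = r_crit_chisq(n),
                 exact  = r_crit_t(n)), 3)
```

The two thresholds differ most at small n, where the approximation is weakest, and converge as n grows.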

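A plot

Here is a plot of the rejection region of $|r|$ as a function of the sample size. So, for example, when the sample size exceeds 100, the (absolute) correlation need only be about 0.2 to reject the null at the $\alpha = 0.05$ level.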

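A simulation

We can do a simple simulation to generate a pair of zero-mean vectors with an exact correlation coefficient. Here is the code. From this we can look at the output of cor.test.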

k <- 100
n <- 4*k

# Correlation that gives an approximate p-value of 0.05
# Change 0.05 to some other desired p-value to get a different curve
pval <- 0.05
qval <- qchisq(pval,1,lower.tail=FALSE)
rho  <- 1/sqrt(1+(n-2)/qval)

# Zero-mean orthogonal basis vectors
b1 <- rep(c(1,-1),n/2)
b2 <- rep(c(1,1,-1,-1),n/4)

# Construct x and y vectors with mean zero and an empirical
# correlation of *exactly* rho
x <- b1
y <- rho * b1 + sqrt(1-rho^2) * b2

# Do test
ctst <- cor.test(x,y)
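The construction above can be wrapped in a function and checked directly. This is a self-contained sketch: `exact_cor_pair` is a hypothetical helper name, and it assumes n is a multiple of 4 so that the two sign vectors are zero-mean and orthogonal.

```r
# Hypothetical helper: build x, y of length n (a multiple of 4) whose sample
# correlation equals rho, using two orthogonal zero-mean sign vectors.
exact_cor_pair <- function(n, rho) {
  stopifnot(n %% 4 == 0, abs(rho) <= 1)
  b1 <- rep(c(1, -1), n / 2)
  b2 <- rep(c(1, 1, -1, -1), n / 4)
  list(x = b1, y = rho * b1 + sqrt(1 - rho^2) * b2)
}

n    <- 400
qval <- qchisq(0.05, df = 1, lower.tail = FALSE)
rho  <- 1 / sqrt(1 + (n - 2) / qval)

xy   <- exact_cor_pair(n, rho)
ctst <- cor.test(xy$x, xy$y)

# The estimate should match the target rho; the p-value should be near 0.05
c(target = rho, estimate = unname(ctst$estimate), p.value = ctst$p.value)
```

The p-value comes out close to, but not exactly, 0.05 because cor.test uses the t distribution with n - 2 degrees of freedom, while rho was chosen from the chi-squared approximation.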

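As requested in the comments, here is the code to reproduce the plot. It can be run immediately after the code above (and uses some of the variables defined there).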

png("cortest.png", height=600, width=600)
m  <- 3:1000
yy <- 1/sqrt(1+(m-2)/qval)
plot(m, yy, type="l", lwd=3, ylim=c(0,1),
     xlab="sample size", ylab="correlation")
polygon( c(m[1],m,rev(m)[1]), c(1,yy,1), col="lightblue2", border=NA)
lines(m,yy,lwd=2)
text(500, 0.5, "p < 0.05", cex=1.5 )
dev.off()

1
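So, what is the bottom line? I think you are saying that, unless the sample size is small, a high correlation value implies a low p-value, but I think it would help to spell that out explicitly.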
DW

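The p-value decreases monotonically with the sample size. I will figure out how to make a more explicit statement to this effect and put it in an appropriate place. Thanks again for your constructive feedback.
cardinal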

@cardinal, could you please post the source code for the plot you generated?
aL3xa 2011

@DW, I have tried to address your concerns. If you see ways it can be improved, please let me know.
cardinal

1
@aL3xa: I have added the plotting code I used. Hope this helps.
cardinal


11

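You can get a high estimate of the correlation coefficient together with a high p-value only if the sample size is very small. I was going to provide an illustration, but Aaron has just done that!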
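A minimal sketch of such an illustration, assuming the same t-based test that cor.test performs. The helper `p_from_r` is a hypothetical name (not from any package) that turns a sample correlation and a sample size into a two-sided p-value.

```r
# Two-sided p-value for H0: rho = 0, given sample correlation r and size n.
# Hypothetical helper; it mirrors the t test with n - 2 degrees of freedom.
p_from_r <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

n_values <- 3:10
p_values <- p_from_r(0.9, n_values)
round(data.frame(n = n_values, p = p_values), 4)
```

With r = 0.9, only n = 3 gives a p-value above 0.25; by n = 5 the p-value is already below 0.05.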


9

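I believe that, by the Fisher R-Z transform, the hyperbolic arctangent of the sample correlation is approximately normal with mean zero and standard error $1/\sqrt{n-3}$ under the null. So to get, for example, a sample correlation $\hat\rho > 0$ with a fixed p-value, $p$, you would need

$$p = 2 - 2\Phi\left(\operatorname{atanh}(\hat\rho)\sqrt{n-3}\right),$$

where $\Phi$ is the CDF of the standard normal, and you are performing a two-sided test for the null $H_0: \rho = 0$.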
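For reference, solving that relation for n takes three steps. This is only a rearrangement of the formula above, with $\Phi^{-1}$ denoting the standard normal quantile function:

```latex
\begin{aligned}
p &= 2 - 2\,\Phi\!\left(\operatorname{atanh}(\hat\rho)\sqrt{n-3}\right) \\
\Phi\!\left(\operatorname{atanh}(\hat\rho)\sqrt{n-3}\right) &= 1 - \tfrac{p}{2} \\
\operatorname{atanh}(\hat\rho)\sqrt{n-3} &= \Phi^{-1}\!\left(1 - \tfrac{p}{2}\right) \\
n &= 3 + \left(\frac{\Phi^{-1}\!\left(1 - p/2\right)}{\operatorname{atanh}(\hat\rho)}\right)^{2}
\end{aligned}
```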

You can turn this into a function which gives the required $n$ for a fixed $\hat\rho$ and $p$. In R:

 # Get n for a given sample correlation and p-value
 # (two-sided test of zero correlation)
 n.size <- function(rho.hat, p.val) {
   3 + (qnorm(1 - 0.5 * p.val) / atanh(rho.hat))^2
 }

Running this for $\hat\rho = 0.5$ and $p = 0.2$ gives:

print(n.size(0.5,0.2))

[1] 8.443062

So your sample size should be around 8. Playing around with this function should give you some idea of the relationship between $n$, $p$ and $\hat\rho$.
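One way to play around with it is to tabulate the required n over a small grid. The sketch below redefines the same formula under a hypothetical name, `n_for_p`, so that it runs on its own; the grid values are arbitrary.

```r
# Approximate sample size at which a sample correlation rho_hat has two-sided
# p-value p_val, via the Fisher z approximation (hypothetical helper name).
n_for_p <- function(rho_hat, p_val) {
  3 + (qnorm(1 - p_val / 2) / atanh(rho_hat))^2
}

rho_grid <- c(0.3, 0.5, 0.7, 0.9)
p_grid   <- c(0.05, 0.10, 0.25)

# Rows: sample correlation; columns: p-value
n_table <- outer(rho_grid, p_grid, n_for_p)
dimnames(n_table) <- list(rho_hat = rho_grid, p = p_grid)
round(n_table, 1)
```

For rho_hat = 0.9 and p = 0.25 the approximation gives n below 4, consistent with the point that this combination needs a tiny sample. The normal approximation is rough at such small n, so these values are only a guide.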


1

Yes. A p-value depends on the sample size, so a small sample can give this.

Say the true effect size was very small, and you draw a small sample. By luck, you get a few data points with very high correlation. The p-value will be high, as it should be. The correlation is high but it's not a very dependable result.

The sample correlation from R's cor() is the best estimate of the correlation (given the sample). The p-value does NOT measure the strength of the correlation. It measures how likely it is that a sample correlation at least this extreme would arise if there were actually no effect, given the size of the sample.

Another way to see this: If you have the same effect size, but get more samples, the p-value always goes to zero.
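A short sketch of this point, reading "same effect size" as a fixed sample correlation (an assumption made for the illustration). It uses the t statistic with n - 2 degrees of freedom, the same test cor.test applies; r = 0.1 and the sample sizes are arbitrary.

```r
# p-value of the t test for H0: rho = 0 at a fixed sample correlation r = 0.1
# (illustrative value), evaluated at growing sample sizes.
r <- 0.1
n <- 10^(1:5)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p <- 2 * pt(-abs(t_stat), df = n - 2)
data.frame(n = n, p = signif(p, 3))
```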

(If you want to more closely integrate the notions of estimated effect size and confidence about the estimate, it may be better to use confidence intervals; or, use Bayesian techniques.)


"small sample" here is basically so small as to be pointless, basically any sample size greater than 4 will reject the null at α=0.05 for correlations greater than 0.9: x <- seq(0,4); y <- seq(0,4) + rnorm(5); cor.test(x,y).
naught101
Licensed under cc by-sa 3.0 with attribution required.