Example of a strong correlation coefficient with a high p-value

21

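I was wondering: is it possible to have a very strong correlation coefficient (say 0.9 or higher) together with a high p-value (say 0.25 or higher)?

Here is an example with a low correlation coefficient and a high p-value: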

set.seed(10)
y <- rnorm(100)
x <- rnorm(100)+.1*y
cor.test(x,y)

cor = 0.03908927, p = 0.6994

High correlation coefficient, low p-value:

y <- rnorm(100)
x <- rnorm(100)+2*y
cor.test(x,y)

cor = 0.8807809, p = 2.2e-16

Low correlation coefficient, low p-value:

y <- rnorm(100000)
x <- rnorm(100000)+.1*y
cor.test(x,y)

cor = 0.1035018, p = 2.2e-16

High correlation coefficient, high p-value: ???

r hypothesis-testing correlation

— Zach

36

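Bottom line

The sample correlation coefficient needed to reject the hypothesis that the true (Pearson) correlation coefficient is zero becomes small quite quickly as the sample size increases. So, in general, no, you cannot simultaneously have a large (in magnitude) correlation coefficient and a large $p$-value.

Top line (Details)

The test used for the Pearson correlation coefficient in the R function cor.test is a very slightly modified version of the method I discuss below.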

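Suppose $(X_1,Y_1), (X_2,Y_2),\ldots,(X_n,Y_n)$ are iid bivariate normal random vectors with correlation $\rho$. We want to test the null hypothesis that $\rho = 0$ versus $\rho \neq 0$. Let $r$ be the sample correlation coefficient. Using standard linear-regression theory, it is not hard to show that the test statistic

$T = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}$

has a $t_{n-2}$ distribution under the null hypothesis. For large $n$, the $t_{n-2}$ distribution approaches the standard normal. Hence $T^2$ is approximately chi-squared distributed with one degree of freedom. (Under the assumptions we have made, $T^2 \sim F_{1,n-2}$ in actuality, but the $\chi^2_1$ approximation makes it clearer what is going on, I think.)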
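As a rough check of the statistic above, the following sketch computes $T$ by hand for simulated data and compares the resulting two-sided p-value with what cor.test reports. The seed, the sample size, and the variable names are illustrative assumptions and are not part of the original answer.

```r
# Illustrative data; the seed, sample size and slope are arbitrary assumptions
set.seed(1)
n <- 30
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)

r <- cor(x, y)                                 # sample Pearson correlation
T_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)      # test statistic T from the text
p_manual <- 2 * pt(-abs(T_stat), df = n - 2)   # two-sided p-value under t_{n-2}

# Compare the hand computation with cor.test
ct <- cor.test(x, y)
print(c(T_manual = T_stat, T_cortest = unname(ct$statistic)))
print(c(p_manual = p_manual, p_cortest = ct$p.value))
```

The two rows should agree up to floating-point error, since for the default Pearson method cor.test bases its p-value on this same $t_{n-2}$ statistic.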

Therefore,

$\mathbb P\left(\frac{r^2}{1-r^2} (n-2) \geq q_{1-\alpha} \right) \approx \alpha \>,$

where $q_{1-\alpha}$ is the $(1-\alpha)$ quantile of a chi-squared distribution with one degree of freedom.

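Now, note that $r^2/(1-r^2)$ is increasing as $r^2$ increases. Rearranging the quantity in the probability statement, we have that for all

$|r| \geq \frac{1}{\sqrt{1+(n-2)/q_{1-\alpha}}}$

we will reject the null hypothesis at level $\alpha$. Clearly the right-hand side decreases with $n$.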
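To make the threshold concrete, here is a small sketch that evaluates the bound for a few sample sizes. The function names `r_crit_chisq` and `r_crit_t` are hypothetical helpers introduced only for this illustration; the second one replaces the chi-squared quantile with the squared $t_{n-2}$ quantile, which corresponds to the exact $F_{1,n-2}$ statement mentioned earlier.

```r
# Smallest |r| that rejects rho = 0 at level alpha for sample size n,
# using the chi-squared approximation from the text
r_crit_chisq <- function(n, alpha = 0.05) {
  q <- qchisq(1 - alpha, df = 1)
  1 / sqrt(1 + (n - 2) / q)
}

# Same bound, but with the squared t_{n-2} quantile (i.e. T^2 ~ F_{1, n-2})
r_crit_t <- function(n, alpha = 0.05) {
  q <- qt(1 - alpha / 2, df = n - 2)^2
  1 / sqrt(1 + (n - 2) / q)
}

n <- c(5, 10, 30, 100, 1000)
print(data.frame(n = n,
                 chisq_approx = round(r_crit_chisq(n), 3),
                 exact_t      = round(r_crit_t(n), 3)))
```

The two columns differ noticeably only for very small $n$, where the approximation understates the correlation needed for rejection.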

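A plot

Here is a plot of the rejection region of $|r|$ as a function of the sample size. So, for example, when the sample size exceeds 100, the (absolute) correlation need only be about 0.2 to reject the null at the $\alpha = 0.05$ level.

A simulation

We can do a simple simulation to generate a pair of zero-mean vectors with an exact correlation coefficient. Below is the code. From this we can look at the output of cor.test.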

k <- 100
n <- 4*k

# Correlation that gives an approximate p-value of 0.05
# Change 0.05 to some other desired p-value to get a different curve
pval <- 0.05
qval <- qchisq(pval,1,lower.tail=FALSE)
rho  <- 1/sqrt(1+(n-2)/qval)

# Zero-mean orthogonal basis vectors
b1 <- rep(c(1,-1),n/2)
b2 <- rep(c(1,1,-1,-1),n/4)

# Construct x and y vectors with mean zero and an empirical
# correlation of *exactly* rho
x <- b1
y <- rho * b1 + sqrt(1-rho^2) * b2

# Do test
ctst <- cor.test(x,y)
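One way to confirm that the construction behaves as described is to inspect the fitted object. The sketch below repeats the same setup in self-contained form and prints the target correlation next to the estimate and p-value; `estimate` and `p.value` are components of the object that cor.test returns.

```r
# Same construction as above, repeated so this block runs on its own
n <- 400
qval <- qchisq(0.05, 1, lower.tail = FALSE)
rho  <- 1 / sqrt(1 + (n - 2) / qval)

# Zero-mean orthogonal basis vectors of equal length
b1 <- rep(c(1, -1), n / 2)
b2 <- rep(c(1, 1, -1, -1), n / 4)

x <- b1
y <- rho * b1 + sqrt(1 - rho^2) * b2

ctst <- cor.test(x, y)
print(c(target   = rho,
        estimate = unname(ctst$estimate),
        p.value  = ctst$p.value))
```

The estimate should match the target up to floating-point error. The p-value should land close to, but not exactly at, 0.05, because cor.test uses the $t_{n-2}$ distribution and not the chi-squared approximation.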

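As requested in the comments, here is the code to reproduce the plot. It can be run immediately after the code above (and uses some of the variables defined there).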

png("cortest.png", height=600, width=600)
m  <- 3:1000
yy <- 1/sqrt(1+(m-2)/qval)
plot(m, yy, type="l", lwd=3, ylim=c(0,1),
     xlab="sample size", ylab="correlation")
polygon( c(m[1],m,rev(m)[1]), c(1,yy,1), col="lightblue2", border=NA)
lines(m,yy,lwd=2)
text(500, 0.5, "p < 0.05", cex=1.5 )
dev.off()

— cardinal

1

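So what is the bottom line? I think you are saying that, unless the sample size is small, a high correlation value implies a low p-value, but I think it would help to state that explicitly.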

— DW

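The $p$-value is monotonically decreasing in the sample size. I will figure out how to make a more explicit statement to this effect and put it in an appropriate place. Thanks again for your constructive feedback.

— cardinal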

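@cardinal, could you please post the source code for the plot you generated?

— aL3xa, 2011

@DW, I have tried to address your concerns. If you see anything that could be improved, please let me know.

— cardinal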

1

@aL3xa: I have added the plotting code I used. Hope this helps.

— cardinal

17

cor.test(c(1,2,3),c(1,2,2))

cor = 0.866, p = 0.333

— Aaron - Reinstate Monica

6

@Zach: Since cardinal and shabbychef took the time to give complete answers, please feel free to reconsider.

— Aaron - Reinstate Monica

11

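A high estimate of the correlation coefficient together with a high p-value can only occur when the sample size is very small. I was going to provide an illustration, but Aaron has just done that!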
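To see how small "very small" is, the sketch below holds the sample correlation fixed at 0.9 and computes the two-sided p-value of the usual $t$ test with $n-2$ degrees of freedom for a range of sample sizes. The helper name `p_from_r` is hypothetical and introduced only for this illustration.

```r
# Two-sided p-value for a sample correlation r at sample size n,
# based on the t statistic with n - 2 degrees of freedom
p_from_r <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 3:10
print(data.frame(n = n, p = signif(p_from_r(0.9, n), 3)))
```

In this setting only $n = 3$ keeps the p-value above 0.25, and the p-value drops below 0.05 from $n = 5$ onward.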

— onestop

9

I believe that, via the Fisher R-Z transform, the hyperbolic arctangent of the sample correlation is approximately normal under the null, with mean zero and standard error $1 / \sqrt{n-3}$. So to get, for example, a sample correlation $\hat{\rho} > 0$ with a fixed p-value, $p$, you would need

$p = 2 - 2 \Phi\left(\operatorname{atanh}(\hat{\rho})\sqrt{n-3}\right),$

where $\Phi$ is the CDF of the standard normal, and you are performing a two-sided test for the null $H_0: \rho = 0$.

You can turn this into a function which gives the required $n$ for a fixed $\hat{\rho}$ and $p$. In R:

# Get n for a given sample correlation and p-value
# (two-sided test of zero correlation, Fisher z approximation)
n.size <- function(rho.hat, p.val) {
  3 + (qnorm(1 - 0.5 * p.val) / atanh(rho.hat))^2
}

Running this for $\hat{\rho} = 0.5$ and $p = 0.2$ gives:

print(n.size(0.5,0.2))

[1] 8.443062

So your sample size should be around 8. Playing around with this function should give you some idea of the relationship between $n$, $p$ and $\hat{\rho}$.
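The relationship can also be read in the other direction. The sketch below computes the approximate p-value from the Fisher transform for given $\hat{\rho}$ and $n$, and sets it next to the p-value from the $t$ statistic with $n-2$ degrees of freedom, which is exact under the bivariate normal assumption. The helper names `p_fisher` and `p_t` are hypothetical and not part of the answer above.

```r
# Approximate two-sided p-value from the Fisher z transform
p_fisher <- function(rho.hat, n) {
  2 - 2 * pnorm(atanh(abs(rho.hat)) * sqrt(n - 3))
}

# Two-sided p-value from the t statistic with n - 2 degrees of freedom
p_t <- function(rho.hat, n) {
  t_stat <- rho.hat * sqrt(n - 2) / sqrt(1 - rho.hat^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- c(5, 8, 20, 100)
print(data.frame(n      = n,
                 fisher = signif(p_fisher(0.5, n), 3),
                 t_test = signif(p_t(0.5, n), 3)))
```

Plugging the non-integer sample size 8.443062 from above into `p_fisher(0.5, 8.443062)` recovers a p-value of about 0.2, which is a quick consistency check on the inversion.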

— shabbychef

1

Yes. A p-value depends on the sample size, so a small sample can give this.

Say the true effect size was very small, and you draw a small sample. By luck, you get a few data points with very high correlation. The p-value will be high, as it should be. The correlation is high but it's not a very dependable result.

The sample correlation from R's cor() will tell you the best estimate of the correlation (given the sample). The p-value does NOT measure the strength of correlation. It measures how likely a correlation at least this strong would be to arise if there were actually no effect, given the size of the sample.

Another way to see this: If you have the same effect size, but get more samples, the p-value always goes to zero.

(If you want to more closely integrate the notions of estimated effect size and confidence about the estimate, it may be better to use confidence intervals; or, use Bayesian techniques.)
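As a sketch of the confidence-interval view, the code below computes an approximate 95% interval for a fixed sample correlation of 0.9 at several sample sizes, using the Fisher z transform with standard error $1/\sqrt{n-3}$. The function name `ci_fisher` and the chosen sample sizes are assumptions made for illustration.

```r
# Approximate confidence interval for a correlation via the Fisher z transform
ci_fisher <- function(r, n, level = 0.95) {
  z    <- atanh(r)
  half <- qnorm(1 - (1 - level) / 2) / sqrt(n - 3)
  tanh(c(lower = z - half, upper = z + half))
}

for (n in c(5, 20, 100, 1000)) {
  ci <- ci_fisher(0.9, n)
  cat(sprintf("n = %4d: [%.3f, %.3f]\n", n, ci["lower"], ci["upper"]))
}
```

With five observations the interval reaches almost down to zero, while with a thousand it is tight around 0.9. This makes the point that the same estimate can be very unreliable or very reliable depending on the sample size.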

— Brendan OConnor

"small sample" here is basically so small as to be pointless, basically any sample size greater than 4 will reject the null at

$\alpha=0.05$ for correlations greater than 0.9: x <- seq(0,4); y <- seq(0,4) + rnorm(5); cor.test(x,y).

— naught101