The short answer is that your conjecture is true if and only if the intra-class correlation in the data is positive. Empirically, most clustered datasets show positive intra-class correlation most of the time, which means that in practice your conjecture is usually correct. But if the intra-class correlation is 0, then the two cases you mention are equally informative. And if the intra-class correlation is negative, then it is actually less informative to take fewer measurements on more subjects; we would actually prefer (as far as reducing the variance of the parameter estimate goes) to take all our measurements on a single subject.
Statistically, we can think about this from two perspectives: the random-effects (or mixed) model that you mention in your question, or a marginal model, which ends up being a bit more informative here.
Random-effects (mixed) model
Suppose we have a set of $n$ subjects, each of whom we measure $m$ times. Then a simple random-effects model for the $j$th measurement from the $i$th subject might be
$$y_{ij} = \beta + u_i + e_{ij},$$
where $\beta$ is the fixed intercept, $u_i$ is the random subject effect (with variance $\sigma^2_u$), $e_{ij}$ is the observation-level error term (with variance $\sigma^2_e$), and the last two random terms are independent.
In this model, $\beta$ represents the population mean, and with a balanced dataset (i.e., an equal number of measurements per subject), our best estimate of it is simply the sample mean. So if we take "more information" to mean a smaller variance of this estimate, then basically we want to know how the variance of the sample mean depends on $n$ and $m$. With a bit of algebra we can work out that
$$\begin{aligned}
\operatorname{var}\left(\frac{1}{nm}\sum_i\sum_j y_{ij}\right) &= \operatorname{var}\left(\frac{1}{nm}\sum_i\sum_j \beta + u_i + e_{ij}\right) \\
&= \frac{1}{n^2m^2}\operatorname{var}\left(\sum_i\sum_j u_i + \sum_i\sum_j e_{ij}\right) \\
&= \frac{1}{n^2m^2}\left(m^2\sum_i \operatorname{var}(u_i) + \sum_i\sum_j \operatorname{var}(e_{ij})\right) \\
&= \frac{1}{n^2m^2}\left(nm^2\sigma^2_u + nm\sigma^2_e\right) \\
&= \frac{\sigma^2_u}{n} + \frac{\sigma^2_e}{nm}.
\end{aligned}$$
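We can sanity-check this formula with a quick simulation. The sketch below draws repeated datasets from the random-effects model (the parameter values are arbitrary illustrative choices) and compares the empirical variance of the sample mean to $\sigma^2_u/n + \sigma^2_e/(nm)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 4                  # subjects, measurements per subject
sigma2_u, sigma2_e = 2.0, 1.0  # illustrative variance components
beta = 10.0

means = []
for _ in range(20000):
    u = rng.normal(0.0, np.sqrt(sigma2_u), size=n)       # subject effects
    e = rng.normal(0.0, np.sqrt(sigma2_e), size=(n, m))  # observation-level errors
    y = beta + u[:, None] + e                            # y_ij = beta + u_i + e_ij
    means.append(y.mean())

empirical = np.var(means)
theoretical = sigma2_u / n + sigma2_e / (n * m)
print(empirical, theoretical)  # the two should agree closely
```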
Examining this expression, we can see that whenever there is any subject variance (i.e., $\sigma^2_u > 0$), increasing the number of subjects ($n$) will make both of these terms smaller, while increasing the number of measurements per subject ($m$) will only make the second term smaller. (For a practical implication of this for designing multi-site replication projects, see this blog post I wrote a while ago.)
Now you wanted to know what happens when we increase or decrease $m$ or $n$ while holding constant the total number of observations. So for that we consider $nm$ to be a constant, so that the whole variance expression just looks like
$$\frac{\sigma^2_u}{n} + \text{constant},$$
which is as small as possible when $n$ is as large as possible (up to a maximum of $n = nm$, in which case $m = 1$, meaning we take a single measurement from each subject).
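To make this concrete, here is a small numeric sketch that evaluates $\sigma^2_u/n + \sigma^2_e/(nm)$ for every way of allocating a fixed total of $nm = 100$ observations (the variance-component values are illustrative, not from any real dataset):

```python
# Plugging a fixed total of nm = 100 observations into the variance formula
# sigma2_u/n + sigma2_e/(n*m); the variance components here are illustrative.
sigma2_u, sigma2_e = 2.0, 1.0
total = 100
variances = {}
for n in (1, 10, 20, 50, 100):
    m = total // n
    variances[(n, m)] = sigma2_u / n + sigma2_e / (n * m)
    print(f"n={n:3d}, m={m:3d}: var(sample mean) = {variances[(n, m)]:.3f}")
```

As the formula predicts, the variance shrinks monotonically as we move observations toward more subjects, bottoming out at $n = 100$, $m = 1$.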
My short answer referred to the intra-class correlation, so where does that fit in? In this simple random-effects model the intra-class correlation is
$$\rho = \frac{\sigma^2_u}{\sigma^2_u + \sigma^2_e}$$
(sketch of a derivation here). So we can write the variance equation above as
$$\operatorname{var}\left(\frac{1}{nm}\sum_i\sum_j y_{ij}\right) = \frac{\sigma^2_u}{n} + \frac{\sigma^2_e}{nm} = \left(\frac{\rho}{n} + \frac{1-\rho}{nm}\right)\left(\sigma^2_u + \sigma^2_e\right)$$
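If you want to convince yourself that the two forms really are equal, a one-screen numeric check works (the variance components and design sizes below are arbitrary illustrative values):

```python
# Numeric sanity check that the two forms of the variance agree:
# sigma2_u/n + sigma2_e/(n*m) == (rho/n + (1-rho)/(n*m)) * (sigma2_u + sigma2_e)
sigma2_u, sigma2_e, n, m = 2.0, 1.0, 7, 3
rho = sigma2_u / (sigma2_u + sigma2_e)  # intra-class correlation
lhs = sigma2_u / n + sigma2_e / (n * m)
rhs = (rho / n + (1 - rho) / (n * m)) * (sigma2_u + sigma2_e)
print(lhs, rhs)  # identical up to floating-point error
```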
This doesn't really add any insight to what we already saw above, but it does make us wonder: since the intra-class correlation is a bona fide correlation coefficient, and correlation coefficients can be negative, what would happen (and what would it mean) if the intra-class correlation were negative?
In the context of the random-effects model, a negative intra-class correlation doesn't really make sense, because it implies that the subject variance $\sigma^2_u$ is somehow negative (as we can see from the $\rho$ equation above, and as explained here and here)... but variances can't be negative! But this doesn't mean that the concept of a negative intra-class correlation doesn't make sense; it just means that the random-effects model doesn't have any way to express this concept, which is a failure of the model, not of the concept. To express this concept adequately we need to consider the marginal model.
Marginal model
For this same dataset we could consider a so-called marginal model of $y_{ij}$,
$$y_{ij} = \beta + e^*_{ij},$$
where basically we've pushed the random subject effect $u_i$ from before into the error term $e_{ij}$ so that we have $e^*_{ij} = u_i + e_{ij}$. In the random-effects model we considered the two random terms $u_i$ and $e_{ij}$ to be i.i.d., but in the marginal model we instead consider $e^*_{ij}$ to follow a block-diagonal covariance matrix $C$ like
$$C = \sigma^2\begin{bmatrix}R & 0 & \cdots & 0\\ 0 & R & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & R\end{bmatrix},\quad R = \begin{bmatrix}1 & \rho & \cdots & \rho\\ \rho & 1 & \cdots & \rho\\ \vdots & \vdots & \ddots & \vdots\\ \rho & \rho & \cdots & 1\end{bmatrix}$$
In words, this means that under the marginal model we simply consider $\rho$ to be the expected correlation between two $e^*$s from the same subject (we assume the correlation across subjects is 0). When $\rho$ is positive, two observations drawn from the same subject tend to be more similar (closer together), on average, than two observations drawn randomly from the dataset while ignoring the clustering due to subjects. When $\rho$ is *negative*, two observations drawn from the same subject tend to be *less* similar (further apart), on average, than two observations drawn completely at random. (More information about this interpretation in the question/answers here.)
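As a sketch of how this covariance structure looks in code, $C$ can be built as a Kronecker product $\sigma^2 (I_n \otimes R)$ with a compound-symmetric $R$; the values of $n$, $m$, $\sigma^2$, and $\rho$ below are arbitrary illustrative choices:

```python
import numpy as np

# Construct the block-diagonal marginal covariance matrix C = sigma^2 * (I_n ⊗ R),
# where R is the m x m compound-symmetry (exchangeable) correlation matrix.
n, m = 3, 4
sigma2, rho = 1.5, 0.3
R = (1 - rho) * np.eye(m) + rho * np.ones((m, m))  # 1 on the diagonal, rho elsewhere
C = sigma2 * np.kron(np.eye(n), R)                 # n diagonal blocks of sigma^2 * R
print(C.shape)                     # (12, 12)
print(C[0, 0], C[0, 1], C[0, m])   # sigma^2, sigma^2 * rho, 0 (different subjects)
```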
So now when we look at the equation for the variance of the sample mean under the marginal model, we have
$$\begin{aligned}
\operatorname{var}\left(\frac{1}{nm}\sum_i\sum_j y_{ij}\right) &= \operatorname{var}\left(\frac{1}{nm}\sum_i\sum_j \beta + e^*_{ij}\right) \\
&= \frac{1}{n^2m^2}\operatorname{var}\left(\sum_i\sum_j e^*_{ij}\right) \\
&= \frac{1}{n^2m^2}\left(n\left(m\sigma^2 + (m^2 - m)\rho\sigma^2\right)\right) \\
&= \frac{\sigma^2\left(1 + (m-1)\rho\right)}{nm} \\
&= \left(\frac{\rho}{n} + \frac{1-\rho}{nm}\right)\sigma^2,
\end{aligned}$$
which is the same variance expression we derived above for the random-effects model, just with $\sigma^2_e + \sigma^2_u = \sigma^2$, which is consistent with our note above that $e^*_{ij} = u_i + e_{ij}$. The advantage of this (statistically equivalent) perspective is that here we can think about a negative intra-class correlation without needing to invoke any weird concepts like a negative subject variance. Negative intra-class correlations just fit naturally in this framework.
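To see the marginal model accommodate a negative $\rho$ directly, the sketch below simulates from a multivariate normal with the block-diagonal covariance $C$ and checks the variance of the sample mean against $\sigma^2(1 + (m-1)\rho)/(nm)$; all parameter values are illustrative:

```python
import numpy as np

# Simulate from the marginal model with a *negative* intra-class correlation
# and compare the empirical variance of the sample mean to the formula
# sigma^2 * (1 + (m-1)*rho) / (n*m).
rng = np.random.default_rng(1)
n, m = 40, 2
sigma2, rho = 1.0, -0.8                  # rho = -0.8 is legal because m = 2
R = (1 - rho) * np.eye(m) + rho * np.ones((m, m))
C = sigma2 * np.kron(np.eye(n), R)       # block-diagonal marginal covariance
mean_vec = np.zeros(n * m)               # take beta = 0 without loss of generality

samples = rng.multivariate_normal(mean_vec, C, size=20000)
empirical = samples.mean(axis=1).var()
theoretical = sigma2 * (1 + (m - 1) * rho) / (n * m)
print(empirical, theoretical)
```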
(BTW, just a quick aside to point out that the second-to-last line of the derivation above implies that we must have $\rho \ge -1/(m-1)$, or else the whole equation is negative, but variances can't be negative! So there is a lower bound on the intra-class correlation that depends on how many measurements we have per cluster. For $m = 2$ (i.e., we measure each subject twice), the intra-class correlation can go all the way down to $\rho = -1$; for $m = 3$ it can only go down to $\rho = -1/2$; and so on. Fun fact!)
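The same lower bound shows up in the eigenvalues of the compound-symmetry block $R$, which are $1 + (m-1)\rho$ (once) and $1 - \rho$ (with multiplicity $m-1$): $R$ is a valid correlation matrix only when these are all nonnegative. A quick numeric check:

```python
import numpy as np

# R is a valid (positive semidefinite) correlation matrix exactly when
# rho >= -1/(m-1); check via its smallest eigenvalue.
def min_eigenvalue(m, rho):
    R = (1 - rho) * np.eye(m) + rho * np.ones((m, m))
    return np.linalg.eigvalsh(R).min()

print(min_eigenvalue(2, -1.0))   # ~0: rho = -1 is attainable when m = 2
print(min_eigenvalue(3, -0.5))   # ~0: rho = -1/2 is the floor when m = 3
print(min_eigenvalue(3, -0.6))   # < 0: below the floor, not a valid R
```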
So finally, once again considering the total number of observations $nm$ to be a constant, we see that the second-to-last line of the derivation above just looks like
$$\left(1 + (m-1)\rho\right) \times \text{positive constant}.$$
So when $\rho > 0$, having $m$ as small as possible (so that we take fewer measurements of more subjects--in the limit, 1 measurement of each subject) makes the variance of the estimate as small as possible. But when $\rho < 0$, we actually want $m$ to be as *large* as possible (so that, in the limit, we take all $nm$ measurements from a single subject) in order to make the variance as small as possible. And when $\rho = 0$, the variance of the estimate is just a constant, so our allocation of $m$ and $n$ doesn't matter.
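The three cases can be verified numerically by minimizing $\left(\rho/n + (1-\rho)/(nm)\right)\sigma^2$ over every allocation of a fixed total; $\sigma^2$ drops out of the comparison, and the $\rho$ values below are illustrative:

```python
# Which allocation (n, m) of nm = 12 total observations minimizes
# (rho/n + (1-rho)/(n*m)) * sigma^2?  (sigma^2 cancels in the comparison.)
TOTAL = 12
ALLOCATIONS = [(n, TOTAL // n) for n in (1, 2, 3, 4, 6, 12)]

def best_allocation(rho):
    return min(ALLOCATIONS, key=lambda nm: rho / nm[0] + (1 - rho) / (nm[0] * nm[1]))

print(best_allocation(0.5))        # (12, 1): one measurement per subject
print(best_allocation(-1.0 / 11))  # (1, 12): all measurements on one subject
# At rho = 0 every allocation gives the same variance, 1/12.
```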