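How to calculate the pooled variance of two or more groups given known group variances, means, and sample sizes?

Suppose there are $m+n$ elements split into two groups (of sizes $m$ and $n$). The variance of the first group is $\sigma_m^2$ and the variance of the second group is $\sigma^2_n$. The elements themselves are assumed to be unknown, but I know the means $\mu_m$ and $\mu_n$.

Is there a way to calculate the combined variance $\sigma^2_{(m+n)}$?

The variance doesn't have to be unbiased, so the denominator is $(m+n)$ and not $(m+n-1)$.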

variance pooling

— user1809989

When you say the means and variances of these groups, are they parameters or sample values? If they are sample means/variances, you shouldn't use $\mu$ and $\sigma$...

— Jonathan Christensen, 2012

I was just using the symbols as a representation. Otherwise it would have been hard to explain my question.

— user1809989

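For sample values we usually use Latin letters (e.g. $m$ and $s$). Greek letters are usually reserved for parameters. Using the "correct" (expected) symbols will help you communicate more clearly.

— Jonathan Christensen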

No worries, I'll keep that in mind from now on! Cheers

— user1809989

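@Jonathan Because this is not a question about samples or estimation, one can legitimately take the view that $\mu$ and $\sigma^2$ are the true mean and variance of the empirical distribution of a batch of data, which justifies the conventional use of Greek letters rather than Latin letters to refer to them.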

— ub

Answers:

Use the definitions of the mean

$\mu_{1:n} = \frac{1}{n}\sum_{i=1}^n x_i$

and the sample variance

$\sigma_{1:n}^2 = \frac{1}{n}\sum_{i=1}^n \left(x_i - \mu_{1:n}\right)^2 = \frac{n-1}{n}\left(\frac{1}{n-1}\sum_{i=1}^n \left(x_i - \mu_{1:n}\right)^2\right)$

(the last term in parentheses is the unbiased variance estimator that statistical software usually computes by default) to find the sum of squares of all the data $x_i$. Let's order the indexes $i$ so that $i=1,\ldots,n$ designates elements of the first group and $i=n+1,\ldots,n+m$ designates elements of the second group. Break that sum of squares up by group and re-express the two pieces in terms of the variances and means of the subsets of the data:

$\begin{aligned} (m+n)(\sigma^2_{1:m+n} + \mu_{1:m+n}^2) &= \sum_{i=1}^{n+m} x_i^2 \\ &= \sum_{i=1}^n x_i^2 + \sum_{i=n+1}^{n+m} x_i^2 \\ &= n(\sigma^2_{1:n} + \mu_{1:n}^2) + m(\sigma^2_{1+n:m+n} + \mu_{1+n:m+n}^2). \end{aligned}$

Solving this algebraically for $\sigma^2_{1:m+n}$ in terms of the other (known) quantities yields

$\sigma^2_{1:m+n} = \frac{n(\sigma^2_{1:n} + \mu_{1:n}^2) + m(\sigma^2_{1+n:m+n} + \mu_{1+n:m+n}^2)}{m+n} - \mu^2_{1:m+n}.$

Of course, using the same approach, $\mu_{1:m+n} = (n\mu_{1:n} + m\mu_{1+n:m+n})/(m+n)$ can be expressed in terms of the group means, too.
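As a rough illustration, the two formulas above can be sketched in Python (not the R used elsewhere in this thread). The function name `combine_two_groups` and the sample data are made up for this example; the variances are assumed to be the divide-by-group-size kind, as in the question.

```python
from statistics import fmean, pvariance


def combine_two_groups(n, mean_n, var_n, m, mean_m, var_m):
    """Return (mean, variance) of the union of two groups.

    All variances use the group size as denominator (not size - 1).
    """
    total = n + m
    # Combined mean: size-weighted average of the group means.
    mean_all = (n * mean_n + m * mean_m) / total
    # Sum of squares of all the data, rebuilt group by group.
    sum_sq = n * (var_n + mean_n**2) + m * (var_m + mean_m**2)
    var_all = sum_sq / total - mean_all**2
    return mean_all, var_all


# Hypothetical data, used only to compare against a direct computation.
a = [2.0, 4.0, 4.0, 5.0]
b = [1.0, 7.0, 9.0]
mean_all, var_all = combine_two_groups(
    len(a), fmean(a), pvariance(a), len(b), fmean(b), pvariance(b)
)
print(mean_all, var_all)               # from group summaries only
print(fmean(a + b), pvariance(a + b))  # from the raw data
```

The two printed lines should agree up to floating-point rounding.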

An anonymous contributor points out that when the group means are equal (so that $\mu_{1:n}=\mu_{1+n:m+n}=\mu_{1:m+n}$ ), the solution for $\sigma^2_{1:m+n}$ is a weighted mean of the group variances, with the group sizes as weights.
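A short check of that special case, writing $\mu$ for the common value of the three means (a symbol introduced only for this step): substituting into the formula for $\sigma^2_{1:m+n}$ gives

$\sigma^2_{1:m+n} = \frac{n(\sigma^2_{1:n} + \mu^2) + m(\sigma^2_{1+n:m+n} + \mu^2)}{m+n} - \mu^2 = \frac{n\,\sigma^2_{1:n} + m\,\sigma^2_{1+n:m+n}}{m+n},$

because the $\mu^2$ terms inside the fraction add up to $(n+m)\mu^2/(m+n) = \mu^2$ and cancel against the subtracted $\mu^2$.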

— whuber

The "homework" tag doesn't mean the question is elementary or stupid: it's used for self-study questions that can even include research-level queries. It distinguishes routine, more or less context-free questions (of the sort that might ordinarily grace the math forum) from specific applied questions.

— whuber

I cannot understand your first passage: $n(\sigma^2+\mu^2) = \sum (x - \mu)^2 + n\mu^2 \stackrel{?}{=} \sum x^2$. In particular I get $\sum [(x-\mu)^2+\mu^2] = \sum [x^2-2x\mu]$, which requires $\mu = 0$. Am I missing something? Could you please explain this?

— DarioP

@Dario $\sum(x-\mu)^2+n\mu^2=(\sum x^2 - 2\mu\sum x + n \mu^2)+n\mu^2 = \sum x^2 - 2n\mu^2 + 2n\mu^2 = \sum x^2.$

— whuber

Oh yes, I made a stupid sign mistake in my derivation. Now it's clear, thanks!!

— DarioP

I guess this can be extended to an arbitrary number of samples as long as you have the mean and variance for each. Calculating pooled (biased) standard deviation in R is simply sqrt(weighted.mean(u^2 + rho^2, n) - weighted.mean(u, n)^2) where n, u and rho are equal-length vectors. E.g. n=c(10, 14, 9) for three samples.

— Jonas Lindeløv
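The one-liner in the comment above can be approximated in Python as follows. This is only a sketch: `pooled_sd` and the three example groups are hypothetical, and the group standard deviations are assumed to be the biased (divide-by-n) ones, matching the formula in the answer.

```python
from statistics import fmean, pstdev


def pooled_sd(sizes, means, sds):
    """Biased (divide-by-N) SD of the union of any number of groups.

    `sds` must hold the divide-by-n standard deviation of each group.
    """
    total = sum(sizes)
    # Size-weighted mean of (mean^2 + sd^2) is the mean of x^2 over all data.
    mean_sq = sum(n * (u**2 + s**2) for n, u, s in zip(sizes, means, sds)) / total
    grand_mean = sum(n * u for n, u in zip(sizes, means)) / total
    # Guard against a tiny negative value caused by floating-point rounding.
    return max(mean_sq - grand_mean**2, 0.0) ** 0.5


# Hypothetical groups, used only to compare with the raw-data result.
groups = [[1.0, 2.0, 3.0], [10.0, 12.0], [5.0, 5.0, 6.0, 8.0]]
sizes = [len(g) for g in groups]
means = [fmean(g) for g in groups]
sds = [pstdev(g) for g in groups]

everything = [v for g in groups for v in g]
print(pooled_sd(sizes, means, sds), pstdev(everything))
```

If only the usual divide-by-(n-1) standard deviations are available, each one would first need to be rescaled by $\sqrt{(n-1)/n}$ before being passed in.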

I'm going to use standard notation for sample means and sample variances in this answer, rather than the notation used in the question. Using standard notation, another formula for the pooled sample variance of two groups can be found in O'Neill (2014) (Result 1):

$s_\text{pooled}^2 = \frac{1}{n_1+n_2-1} \Bigg[ (n_1-1) s_1^2 + (n_2-1) s_2^2 + \frac{n_1 n_2}{n_1+n_2} (\bar{x}_1 - \bar{x}_2)^2 \Bigg].$

This formula works directly with the underlying sample means and sample variances of the two subgroups, and does not require intermediate calculation of the pooled sample mean. (Proof of result in linked paper.)

— Reinstate Monica
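A minimal Python sketch of the formula quoted in the answer above, assuming $s_1^2$ and $s_2^2$ are the usual divide-by-(n-1) sample variances. The function name `pooled_sample_variance` and the data are invented for illustration.

```python
from statistics import fmean, variance


def pooled_sample_variance(n1, mean1, s1_sq, n2, mean2, s2_sq):
    """Sample variance (denominator n1 + n2 - 1) of two groups combined."""
    # Within-group sums of squared deviations.
    within = (n1 - 1) * s1_sq + (n2 - 1) * s2_sq
    # Extra spread contributed by the gap between the two group means.
    between = n1 * n2 / (n1 + n2) * (mean1 - mean2) ** 2
    return (within + between) / (n1 + n2 - 1)


# Hypothetical data, used only to compare with a direct computation.
a = [2.0, 4.0, 4.0, 5.0]
b = [1.0, 7.0, 9.0]
result = pooled_sample_variance(
    len(a), fmean(a), variance(a), len(b), fmean(b), variance(b)
)
print(result, variance(a + b))
```

Note that the grand mean never appears: only the difference between the two group means is needed.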


Yes, given the mean, sample count, and variance or standard deviation of each of two or more groups of samples, you can exactly calculate the variance or standard deviation of the combined group.

This web page describes how to do it, and why it works; it also includes source code in Perl: http://www.burtonsys.com/climate/composite_standard_deviations.html

BTW, contrary to the answer given above,

$n(\sigma^2 + \mu^2) \ne \sum_{i=1}^n x_i^2$

See for yourself, e.g., in R:

> x = rnorm(10,5,2)
> x
 [1] 6.515139 8.273285 2.879483 3.624233 6.199610 3.683164 4.921028 8.084591
 [9] 2.974520 6.049962
> mean(x)
[1] 5.320502
> sd(x)
[1] 2.007519
> sum(x**2)
[1] 319.3486
> 10 * (mean(x)**2 + sd(x)**2)
[1] 323.3787

— Dave Burton

it's because you forgot the n-1 factor, e.g. try with n*(mean(x)**2+sd(x)**2/(n)*(n-1))

— user603

user603, what on earth are you talking about?

— Dave Burton

Dave, mathematics is a more reliable teacher than software. In this case R computes the unbiased estimate of the standard deviation rather than the standard deviation of the set of numbers. For instance, sd(c(-1,1)) returns 1.414214 rather than 1. Your example needs to use sqrt(9/10)*sd(x) in place of sd(x). Interpreting "$\sigma$" as the SD of the data and "$\mu$" as the mean of the data, your BTW remark is wrong. A program demonstrating this is n <- 10; x <- rnorm(n,5,2); m <- mean(x); s <- sd(x) * sqrt((n-1)/n); m2 <- sum(x^2); c(lhs=n * (m^2 + s^2), rhs=m2)

— whuber
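The same check can be sketched in Python, reusing the ten values printed in the R session above (rounded as shown there). It assumes Python's `statistics.pstdev` (divide by n) and `statistics.stdev` (divide by n-1, like R's `sd`) as stand-ins for the two kinds of standard deviation.

```python
from statistics import fmean, pstdev, stdev

# The ten values from the R session, as printed there.
x = [6.515139, 8.273285, 2.879483, 3.624233, 6.199610,
     3.683164, 4.921028, 8.084591, 2.974520, 6.049962]
n = len(x)
m = fmean(x)
sum_sq = sum(v * v for v in x)

lhs_population = n * (m**2 + pstdev(x) ** 2)  # divide-by-n SD
lhs_sample = n * (m**2 + stdev(x) ** 2)       # divide-by-(n-1) SD

# The first two numbers should match; the third should be larger.
print(sum_sq, lhs_population, lhs_sample)
# The two SDs of (-1, 1) differ by the factor sqrt((n-1)/n).
print(stdev([-1, 1]), pstdev([-1, 1]))
```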