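I'm running an experiment in which I gather (independent) samples in parallel. I compute the variance of each group of samples, and now I want to combine them all to find the total variance of all the samples.
I'm having a hard time finding a derivation for this because I'm not sure of the terminology. I think of it as a partition of one RV.
So, I want to find $Var(X)$ from $Var(X_1)$, $Var(X_2)$, ..., and $Var(X_g)$, where $X = [X_1, X_2, \dots, X_g]$.
EDIT: The partitions are not the same size/cardinality, but the sum of the partition sizes equals the number of samples in the overall sample set.
EDIT 2: There is a formula for a parallel computation here, but it only covers the case of a partition into two sets, not $g$ sets.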
Answers:
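If all the sub-samples have the same sample size, the formula is very simple. If you have $g$ sub-samples of size $k$ (for a total of $gk$ samples), then the variance of the combined sample depends on the mean $E_j$ and variance $V_j$ of each sub-sample:
$$Var(X_1,\ldots,X_{gk}) = \frac{k-1}{gk-1}\left(\sum_{j=1}^g V_j + \frac{k(g-1)}{k-1}\,Var(E_j)\right),$$
where $Var(E_j)$ denotes the variance of the sample means.
Demonstration in R: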
> x <- rnorm(100)
> g <- gl(10,10)
> mns <- tapply(x, g, mean)
> vs <- tapply(x, g, var)
> 9/99*(sum(vs) + 10*var(mns))
[1] 1.033749
> var(x)
[1] 1.033749
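For readers without R, the same check can be sketched in Python using only the standard library. The function name `pooled_variance_equal` and the random data are hypothetical, chosen only for illustration; the sketch assumes at least two groups, all of the same size k >= 2.

```python
import random
import statistics


def pooled_variance_equal(groups):
    """Variance of the combined sample from per-group means and variances.

    Assumes g >= 2 groups that all have the same size k >= 2.
    """
    g = len(groups)
    k = len(groups[0])
    if any(len(grp) != k for grp in groups):
        raise ValueError("all groups must have the same size")
    means = [statistics.mean(grp) for grp in groups]
    variances = [statistics.variance(grp) for grp in groups]  # sample variance (divides by k - 1)
    var_of_means = statistics.variance(means)  # sample variance of the g means
    return (k - 1) / (g * k - 1) * (sum(variances) + k * (g - 1) / (k - 1) * var_of_means)


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(100)]
groups = [x[i:i + 10] for i in range(0, 100, 10)]  # 10 groups of 10 values

combined = pooled_variance_equal(groups)
direct = statistics.variance(x)
print(combined, direct)  # the two values should agree up to rounding error
```

With g = 10 and k = 10 the two coefficients reduce to 9/99 and 10, which is the expression used in the R session.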
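If the sample sizes are not equal, the formula is not as nice.
EDIT: formula for unequal sample sizes
If there are $g$ sub-samples, each with $k_j$, $j = 1, \ldots, g$ elements, for a total of $n = \sum k_j$ values, then
$$Var(X_1,\ldots,X_n) = \frac{1}{n-1}\left(\sum_{j=1}^g (k_j-1)\,V_j + \sum_{j=1}^g k_j\,(\bar{X}_j - \bar{X})^2\right),$$
where $\bar{X}_j$ is the mean of sub-sample $j$ and $\bar{X} = \left(\sum_{j=1}^g k_j \bar{X}_j\right)/n$ is the weighted average of all the means (which equals the mean of all values).
Again, a demonstration: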
> k <- rpois(10, lambda=10)
> n <- sum(k)
> g <- factor(rep(1:10, k))
> x <- rnorm(n)
> mns <- tapply(x, g, mean)
> vs <- tapply(x, g, var)
> 1/(n-1)*(sum((k-1)*vs) + sum(k*(mns-weighted.mean(mns,k))^2))
[1] 1.108966
> var(x)
[1] 1.108966
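As a small hand-checkable instance of the unequal-size formula, take two hypothetical parts, $\{1, 2, 3\}$ and $\{4, 6\}$ (made-up numbers, not taken from the R session):

```latex
\begin{align*}
k_1 &= 3, \quad \bar{X}_1 = 2, \quad V_1 = 1, \qquad
k_2 = 2, \quad \bar{X}_2 = 5, \quad V_2 = 2, \qquad n = 5 \\
\bar{X} &= \frac{3 \cdot 2 + 2 \cdot 5}{5} = 3.2 \\
Var(X_1, \ldots, X_5)
  &= \frac{1}{4}\left[(2 \cdot 1 + 1 \cdot 2) + \left(3\,(2 - 3.2)^2 + 2\,(5 - 3.2)^2\right)\right] \\
  &= \frac{1}{4}\,(4 + 4.32 + 6.48) = 3.7
\end{align*}
```

Computing the sample variance of 1, 2, 3, 4, 6 directly gives the same value: the squared deviations from 3.2 sum to 14.8, and 14.8 / 4 = 3.7.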
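By the way, both formulas are easy to derive by writing the desired variance as the scaled sum of $(X_{ji}-\bar{X})^2$ over all values $X_{ji}$ (value $i$ of sub-sample $j$), then introducing the sub-sample mean $\bar{X}_j$: $[(X_{ji}-\bar{X}_j)-(\bar{X}-\bar{X}_j)]^2$, using the square of difference formula, and simplifying.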
This is simply an add-on to the answer of aniko with a rough sketch of the derivation and some python code, so all credits go to aniko.
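Let $X_j$ be one of $g$ parts of the data, where the number of elements in part $j$ is $k_j$. We define the mean and the variance of each part to be
$$E_j = \frac{1}{k_j}\sum_{i=1}^{k_j} X_{ji}, \qquad V_j = \frac{1}{k_j-1}\sum_{i=1}^{k_j} (X_{ji}-E_j)^2.$$
With $n = \sum_{j=1}^g k_j$ and the overall mean $E = \frac{1}{n}\sum_{j=1}^g k_j E_j$, the variance of the total data set is
$$Var(X) = \frac{1}{n-1}\sum_{j=1}^g\sum_{i=1}^{k_j} (X_{ji}-E)^2 = \frac{1}{n-1}\sum_{j=1}^g\sum_{i=1}^{k_j} \left[(X_{ji}-E_j)-(E-E_j)\right]^2 = \frac{1}{n-1}\sum_{j=1}^g \left[(k_j-1)\,V_j + k_j\,(E_j-E)^2\right],$$
because the cross terms vanish: $\sum_{i=1}^{k_j}(X_{ji}-E_j) = 0$ for every part. If every part has the same size $k$, this reduces to the simpler formula for equal sample sizes in aniko's answer.
The following Python function works for arrays that have been split along the first dimension and implements the "more complex" formula for differently sized parts.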
import numpy as np

def combine(averages, variances, counts, size=None):
    """
    Combine averages and variances to one single average and variance.

    # Arguments
        averages: List of averages for each part.
        variances: List of (sample, ddof=1) variances for each part.
        counts: List of number of elements in each part.
        size: Total number of elements in all of the parts.
            Defaults to the sum of counts.
    # Returns
        average: Average over all parts.
        variance: Variance over all parts.
    """
    # accept plain lists as well as arrays
    averages = np.asarray(averages, dtype=float)
    variances = np.asarray(variances, dtype=float)
    counts = np.asarray(counts)
    average = np.average(averages, weights=counts)
    if size is None:
        size = np.sum(counts)
    else:
        # necessary for correct variance in case of multidimensional arrays:
        # rescale counts along the first dimension to element counts per part
        counts = counts * size // np.sum(counts, dtype='int')
    squares = (counts - 1) * variances + counts * (averages - average)**2
    return average, np.sum(squares) / (size - 1)
It can be used as follows:
# sizes k_j and n
ks = np.random.poisson(10, 10)
n = np.sum(ks)
# create data
x = np.random.randn(n, 20)
parts = np.split(x, np.cumsum(ks[:-1]))
# compute statistics on parts
ms = [np.mean(p) for p in parts]
vs = [np.var(p, ddof=1) for p in parts]
# combine and compare
combined = combine(ms, vs, ks, x.size)
numpied = np.mean(x), np.var(x, ddof=1)
distance = np.abs(np.array(combined) - np.array(numpied))
print('combined --- mean:{: .9f} - var:{: .9f}'.format(*combined))
print('numpied --- mean:{: .9f} - var:{: .9f}'.format(*numpied))
print('distance --- mean:{: .5e} - var:{: .5e}'.format(*distance))
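The question's second edit mentions a formula that only merges two sets. One way to use a two-set rule for $g$ sets is to fold it over the parts one at a time (or in a tree, when the parts live on different workers). The sketch below assumes the commonly used pairwise update on (count, mean, sum of squared deviations) triples; it is not necessarily the formula the question linked to, and the helper names `summarize` and `merge` are hypothetical.

```python
from functools import reduce
import statistics


def summarize(part):
    """Return (count, mean, M2), where M2 is the sum of squared deviations from the mean."""
    n = len(part)
    mean = statistics.fmean(part)
    m2 = sum((v - mean) ** 2 for v in part)
    return n, mean, m2


def merge(a, b):
    """Merge two (count, mean, M2) summaries into the summary of their union."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


# made-up parts of different sizes
parts = [[1.0, 2.0, 3.0], [4.0, 6.0], [5.0, 7.0, 9.0, 11.0]]
n, mean, m2 = reduce(merge, map(summarize, parts))
variance = m2 / (n - 1)  # sample variance of all values combined

flat = [v for part in parts for v in part]
print(mean, statistics.fmean(flat))
print(variance, statistics.variance(flat))
```

Carrying the sum of squared deviations instead of the per-part sample variance means a part with a single element causes no division by zero; the per-part variance can still be recovered as M2 / (count - 1) when the count is at least 2.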