Calculating the new standard deviation from the old standard deviation after a change in the dataset


16

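I have an array of $n$ real values, which has mean $\mu_{old}$ and standard deviation $\sigma_{old}$. If an element $x_i$ of the array is replaced by another element $x_j$, then the new mean will be

$$\mu_{new} = \mu_{old} + \frac{x_j - x_i}{n}$$

The advantage of this approach is that it requires a constant amount of computation regardless of the value of $n$. Is there any way to calculate $\sigma_{new}$ from $\sigma_{old}$, in the same way that $\mu_{new}$ is calculated from $\mu_{old}$?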
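A minimal sketch of the constant-time mean update described in the question, written in Python. The function name `replace_mean` and the sample data are illustrative assumptions, not something taken from the question.

```python
def replace_mean(mean_old, n, x_old, x_new):
    """Mean of an n-element array after one element x_old is replaced by x_new."""
    return mean_old + (x_new - x_old) / n

# Illustrative data (assumed): replace the second element by 10.0.
data = [2.0, 4.0, 4.0, 5.0]
mean_old = sum(data) / len(data)
mean_new = replace_mean(mean_old, len(data), data[1], 10.0)

# Cross-check against a full recomputation.
data[1] = 10.0
assert abs(mean_new - sum(data) / len(data)) < 1e-12
```

The update touches only the old mean, the array length and the two elements involved, so its cost does not grow with $n$.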


Is this homework? A very similar task was posed in our mathematical statistics course...
krlmlr 2012

2
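@user946850: No, it's not homework. I am working on my thesis on evolutionary algorithms, and I want to use the standard deviation as a measure of population diversity. I am just looking for a more efficient solution.
user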

1
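The SD is the square root of the variance, which is just the mean squared value (adjusted by a multiple of the squared mean, which you already know how to update). Therefore, the same methods used to compute a running mean can be applied without any fundamental change to compute a running variance. In fact, much more sophisticated statistics can be computed online using the same ideas: see, for example, the threads at stats.stackexchange.com/questions/6920 and stats.stackexchange.com/questions/23481.
whuber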

1
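@whuber: This is mentioned in the Wikipedia article on variance, but with a note on the catastrophic cancellation (or loss of significance) that may occur. Is this overrated, or a real problem for the running variance?
krlmlr 2012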

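That's a great question. If you accumulate the variance naively, without centering the values beforehand, you can indeed get into trouble. The problem occurs when the numbers are huge but their variance is small. For example, consider a series of accurate measurements of the speed of light in m/s, such as 299792458.145, 299792457.883, 299792457.998, ...: their variance, which is around 0.01, is so small compared to their squares (around $10^{17}$) that a careless calculation (even in double precision) would result in zero variance: all significant digits would disappear.
whuber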
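The cancellation problem with the speed-of-light measurements can be reproduced with a short Python sketch. The function names are illustrative assumptions; both functions compute the population variance (dividing by $n$), once from uncentered sums and once after centering.

```python
def naive_variance(xs):
    """Population variance as mean(x^2) - mean(x)^2, with no centering."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean * mean

def centered_variance(xs):
    """Population variance from deviations about the mean (two passes)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

speeds = [299792458.145, 299792457.883, 299792457.998]
print(naive_variance(speeds))     # dominated by rounding error
print(centered_variance(speeds))  # roughly 0.0115
```

The squares are near $9 \times 10^{16}$, where adjacent double-precision values are 16 apart, so the uncentered difference cannot resolve a variance of about 0.01.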

Answers:


7

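The section "Algorithms for calculating variance" of a Wikipedia article shows how to compute the variance when elements are added to your observations. (Recall that the standard deviation is the square root of the variance.) Assume that you append $x_{n+1}$ to your array; then

$$\sigma_{new}^2 = \sigma_{old}^2 + (x_{n+1} - \mu_{new})(x_{n+1} - \mu_{old}).$$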

EDIT: The above formula seems to be wrong, see the comments.

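Now, replacing an element means adding one observation and removing another; both can be computed with the formula above. However, keep in mind that problems of numerical stability may arise; the quoted article also proposes numerically stable variants.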

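To derive the formula yourself, compute $(n-1)(\sigma_{new}^2 - \sigma_{old}^2)$ using the definition of the sample variance, and substitute the formula you gave for $\mu_{new}$ where appropriate. This gives you $\sigma_{new}^2 - \sigma_{old}^2$ in the end, and thus a formula for $\sigma_{new}$ given $\sigma_{old}$ and $\mu_{old}$. In my notation, I assume you replace the element $x_n$ by $x_n'$:

$$\begin{aligned}
\sigma^2 &= (n-1)^{-1} \sum_k (x_k - \mu)^2 \\
(n-1)(\sigma_{new}^2 - \sigma_{old}^2) &= \sum_{k=1}^{n-1} \left( (x_k - \mu_{new})^2 - (x_k - \mu_{old})^2 \right) + \left( (x_n' - \mu_{new})^2 - (x_n - \mu_{old})^2 \right) \\
&= \sum_{k=1}^{n-1} \left( (x_k - \mu_{old} - n^{-1}(x_n' - x_n))^2 - (x_k - \mu_{old})^2 \right) + \left( (x_n' - \mu_{old} - n^{-1}(x_n' - x_n))^2 - (x_n - \mu_{old})^2 \right)
\end{aligned}$$

The $x_k$ in the sum transform into something dependent on $\mu_{old}$, but you'll have to work the equation a little bit more to derive a neat result. This should give you the general idea.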
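One way to finish the algebra, offered as a sketch rather than as the answerer's own result: write $\delta = x_n' - x_n$ (a symbol introduced here), so that $\mu_{new} = \mu_{old} + \delta/n$, and use the fact that the deviations from $\mu_{old}$ sum to zero, which gives $\sum_{k=1}^{n-1}(x_k - \mu_{old}) = -(x_n - \mu_{old})$.

```latex
\begin{aligned}
\sum_{k=1}^{n-1}\left[\left(x_k-\mu_{old}-\tfrac{\delta}{n}\right)^2-(x_k-\mu_{old})^2\right]
  &= \frac{2\delta}{n}(x_n-\mu_{old}) + \frac{(n-1)\,\delta^2}{n^2} \\
\left(x_n'-\mu_{old}-\tfrac{\delta}{n}\right)^2-(x_n-\mu_{old})^2
  &= \frac{2(n-1)\,\delta}{n}(x_n-\mu_{old}) + \frac{(n-1)^2\,\delta^2}{n^2} \\
(n-1)\left(\sigma_{new}^2-\sigma_{old}^2\right)
  &= 2\delta\,(x_n-\mu_{old}) + \frac{n-1}{n}\,\delta^2 \\
  &= (x_n'-x_n)\left(x_n'-\mu_{new}+x_n-\mu_{old}\right)
\end{aligned}
```

The last line needs only the old and new element, the old and new mean, and $n$, so the update is constant-time.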


The first formula you gave does not seem correct. It implies that whenever $x_{n+1}$ is smaller or larger than both the new and the old mean, the variance increases, which does not make sense: it may increase or decrease depending on the distribution.
Emmet B

@EmmetB: Yes, you're right; this should probably be $\sigma_{new}^2 = \frac{n-1}{n}\sigma_{old}^2 + \frac{1}{n}(x_{n+1} - \mu_{new})(x_{n+1} - \mu_{old})$. Unfortunately, this renders void my whole discussion from there, but I'm leaving it for historic purposes. Feel free to edit, though.
krlmlr
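A quick numerical check of the corrected append formula, as a Python sketch. It assumes $\sigma^2$ is the sample variance (divisor $n-1$) of the first $n \ge 2$ values; the function name `append_variance` and the data are illustrative.

```python
from statistics import mean, variance

def append_variance(var_old, mean_old, n, x):
    """Sample variance and mean of n + 1 values, from those of the first n (n >= 2)."""
    mean_new = mean_old + (x - mean_old) / (n + 1)
    var_new = (n - 1) / n * var_old + (x - mean_new) * (x - mean_old) / n
    return var_new, mean_new

data = [2.0, 4.0, 4.0, 5.0]
var_new, mean_new = append_variance(variance(data), mean(data), len(data), 10.0)

# Cross-check against a full recomputation on the extended data.
assert abs(var_new - variance(data + [10.0])) < 1e-12
```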

4

Based on what I think I'm reading in the linked Wikipedia article, you can maintain a "running" standard deviation:

real sum = 0;
int count = 0;
real S = 0;   // running sum of squared deviations from the mean

// Adds x to the running totals and returns the sample standard deviation so far.
real GetRunningStandardDeviation(ref real sum, ref int count, ref real S, real x)
{
   if (count >= 1)
   {
       real oldMean = sum / count;
       sum = sum + x;
       count = count + 1;
       real newMean = sum / count;

       S = S + (x-oldMean)*(x-newMean);
   }
   else
   {
       // first observation: no spread yet
       sum = x;
       count = 1;
       S = 0;
   }

   // estimated variance = S / (count-1)
   // estimated standard deviation = sqrt(variance)
   if (count > 1)
      return sqrt(S / (count-1));
   else
      return 0;
}

The article does not maintain a separate running sum and count; it keeps a single running mean instead. Since I already keep a count (for statistical purposes) in what I'm doing today, it is more useful for me to calculate the means each time.
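For comparison, here is a Python sketch of the variant that keeps a single running mean rather than a running sum. The class name `RunningStats` and its attribute names are illustrative assumptions; `m2` plays the same role as `S` in the code above.

```python
import math

class RunningStats:
    """Welford-style accumulator: count, running mean, and sum of squared deviations."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0

    def push(self, x):
        """Add one observation and update the mean and m2 in constant time."""
        self.count += 1
        delta = x - self.mean               # deviation from the old mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)  # (x - old mean) * (x - new mean)

    def stdev(self):
        """Sample standard deviation, or 0.0 with fewer than two observations."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 5.0, 10.0]:
    stats.push(value)
print(stats.mean, stats.stdev())
```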


0

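Given the original $\bar{x}$, $s$, and $n$, as well as the change of a given element $x_n$ to $x_n'$, I believe your new standard deviation $s'$ will be the square root of

$$s^2 + \frac{1}{n-1}\left(2n\,\Delta\bar{x}\,(x_n - \bar{x}) + n(n-1)(\Delta\bar{x})^2\right),$$
where $\Delta\bar{x} = \bar{x}' - \bar{x}$, with $\bar{x}'$ denoting the new mean.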

Maybe there is a snazzier way of writing it?

I checked this against a small test case and it seemed to work.
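A small test case of this kind can be sketched in Python. The function name `replace_stdev` and the data are illustrative assumptions; the clamp to zero before the square root is an added guard against rounding and is not part of the formula.

```python
import math
from statistics import mean, stdev

def replace_stdev(s_old, mean_old, n, x_old, x_new):
    """Sample SD and mean after replacing x_old by x_new, in constant time (n >= 2)."""
    delta_mean = (x_new - x_old) / n  # change in the mean
    var_new = s_old ** 2 + (
        2 * n * delta_mean * (x_old - mean_old) + n * (n - 1) * delta_mean ** 2
    ) / (n - 1)
    return math.sqrt(max(var_new, 0.0)), mean_old + delta_mean

data = [2.0, 4.0, 4.0, 5.0]
s_new, mean_new = replace_stdev(stdev(data), mean(data), len(data), data[1], 10.0)

# Cross-check against a full recomputation after the replacement.
data[1] = 10.0
assert math.isclose(s_new, stdev(data), rel_tol=1e-9)
```

Since the update subtracts nearly equal quantities when the spread is small relative to the values, occasionally recomputing the standard deviation from scratch is a reasonable safeguard against accumulated rounding error.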


1
@john / whistling in the Dark: I liked your answer; it seems to work properly on my small dataset. Is there any mathematical foundation or reference for it? Could you kindly help?
Alok Chowdhury

The question was all @Whistling in the Dark, I just cleaned it up for the site. You should pose a new question referencing the question and answer here. And also you should upvote this answer if you feel that way.
John
Licensed under cc by-sa 3.0 with attribution required.