Why isn't variance defined as the difference between every value following each other?


19

This may be a simple question for many, but here it is:

Why isn't variance defined as the difference between every value following each other, instead of the difference to the average of the values?

To me this would be the more logical choice, so I assume I'm obviously overlooking some disadvantages. Thanks

EDIT:

Let me rephrase as clearly as possible. This is what I mean:

  1. Assume you have a range of numbers, in order: 1, 2, 3, 4, 5
  2. Compute and sum up the (absolute) differences between each value and the next (successively, between each subsequent value, not pairwise) (without using the average).
  3. Divide by the number of differences
  4. (Follow-up: would the answer be different if the numbers were unordered)

-> What are the disadvantages of this approach compared to the standard formula for variance?
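The four steps above can be sketched in a few lines of Python (my own throwaway code; the function names are made up for illustration), side by side with the standard population variance:

```python
def successive_diff_measure(values):
    """Steps 2-3: mean absolute difference between each value and the next."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)

def variance(values):
    """Standard population variance: mean squared distance from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

data = [1, 2, 3, 4, 5]
print(successive_diff_measure(data))  # 1.0
print(variance(data))                 # 2.0

# Step 4 of the question: reordering changes the proposed measure,
# but leaves the variance untouched.
shuffled = [5, 1, 4, 2, 3]
print(successive_diff_measure(shuffled))  # 2.5
print(variance(shuffled))                 # 2.0
```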


1
You might also be interested in reading about autocorrelation (e.g. stats.stackexchange.com/questions/185521/…).
Tim

2
@user2305193 whuber's answer is right, but his formula makes use of the squared distances between an ordering of the data and the average over all orderings. A neat trick, but the process of finding the variance in the way you indicated is exactly what I attempted in my answer, and I demonstrated that it does not do a good job. Trying to clear up the confusion.
Greenparker

1
For fun, look up the Allan Variance.
hobbs '16

Put differently: I guess it's because you don't square the differences (and don't take the square root afterwards) but take absolute values, so it should really be "why don't we calculate the standard deviation this way" instead of "why don't we calculate the variance this way". But I'll take a break now
user2305193

Answers:


27

The most obvious reason is that there is often no time sequence in the values. So if you jumble the data, it makes no difference in the information conveyed by the data. If we follow your method, then every time you jumble the data you get a different sample variance.

From a theoretical standpoint, the sample variance estimates the true variance of a random variable $X$. The true variance of $X$ is

$$E\left[(X - E[X])^2\right].$$

Here, $E$ denotes the expectation or "average". Thus the variance is defined as the average squared distance between the variable and its mean. When you look at this definition, there is no "time order" here, since there is no data yet. It is simply a property of the random variable.

When you collect iid data from this distribution, you have realizations $x_1, x_2, \ldots, x_n$. The best way to estimate the expectation is to take sample averages. The key here is that we have iid data, and thus there is no ordering to the data. The sample $x_1, x_2, \ldots, x_n$ is the same as the sample $x_2, x_5, x_1, \ldots, x_n$.

编辑

The sample variance measures one specific kind of dispersion in the sample, the one that measures the average distance from the mean. There are other kinds of dispersion, such as the range of the data and the interquantile range.

Even if you sort your values in ascending order, that does not change the characteristics of the sample. The sample (data) you obtain are realizations of a variable. Computing the sample variance is akin to understanding how much dispersion there is in the variable. So, for example, if you sample 20 people and compute their heights, you obtain 20 "realizations" of the random variable $X = $ height. Now the sample variance is supposed to measure the overall variability in individual heights. If you sort the data as

$$100, 110, 123, 124, \ldots,$$

that does not change the information in the sample.

Let's look at another example. Say you have 100 observations from a random variable arranged in this way:

$$1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, \ldots, 100.$$

Then the average subsequent distance is 1 unit, and thus by your method the variance is 1.

The way to interpret "variance" or "dispersion" is to understand what range of values the data is likely in. In this case you would get a range of .99 units, which of course does not represent the variation well.

If instead of averaging you simply sum the subsequent differences, your "variance" is 99. That, of course, does not represent the variability in the sample either, because 99 gives you the range of the data, not a sense of the variability.
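The 1-to-100 example can be checked numerically; this is my own throwaway snippet, not part of the answer:

```python
# The ordered series 1, 2, ..., 100 from the example above.
data = list(range(1, 101))

# Successive absolute differences: 99 gaps, each of size 1.
diffs = [abs(b - a) for a, b in zip(data, data[1:])]
print(sum(diffs) / len(diffs))  # 1.0  (averaging over the 99 differences)
print(sum(diffs) / len(data))   # 0.99 (dividing by n instead, as discussed in the comments)
print(sum(diffs))               # 99   (the un-averaged sum is just the range)

# The ordinary population variance is far larger.
m = sum(data) / len(data)
print(sum((x - m) ** 2 for x in data) / len(data))  # 833.25
```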


1
In the last paragraph you got through to me, haha. Thanks for this awesome answer; I wish I had enough rep to upvote it, please, people, do it for me ;-) Accepted!!!
user2305193

Follow-up: what I really meant (yes, sorry, I only realized the right question after reading your answer) was that you sum up the differences and divide by the number of samples. In your last example that would be 99/100. Could you elaborate on that one, for the complete surprise?
user2305193 '16

@user2305193 Right, I said the average is 1 unit, which is incorrect. It should be 0.99 units. Changed it.
Greenparker

More on the 1-100 series: the variance of 1-100 is 841.7 and the standard deviation is 29.01 (source). So it is indeed very different.
user2305193

31

It is defined that way!

Here is the algebra. Let the values be $\mathbf{x} = (x_1, x_2, \ldots, x_n)$. Denote by $F$ the empirical distribution function of these values (which means each $x_i$ contributes a probability of $1/n$ at the value $x_i$), and let $X$ and $Y$ be independent random variables with distribution $F$. By virtue of basic properties of variance (namely, it is a quadratic form), as well as the definition of $F$ and the fact that $X$ and $Y$ have the same mean,

$$\operatorname{Var}(\mathbf{x}) = \operatorname{Var}(X) = \frac{1}{2}\bigl(\operatorname{Var}(X) + \operatorname{Var}(Y)\bigr) = \frac{1}{2}\operatorname{Var}(X - Y) = \frac{1}{2}\Bigl(E\bigl((X-Y)^2\bigr) - E(X-Y)^2\Bigr) = E\Bigl(\frac{1}{2}(X-Y)^2\Bigr) - 0 = \frac{1}{n^2}\sum_{i,j}\frac{1}{2}\bigl(x_i - x_j\bigr)^2.$$

This formula does not depend on the way $\mathbf{x}$ is ordered: it uses all possible pairs of the components, comparing them via half their squared differences. It can, however, be related to an average over all possible orderings (the group $S(n)$ of all $n!$ permutations of the indices $1, 2, \ldots, n$). Namely,

$$\operatorname{Var}(\mathbf{x}) = \frac{1}{n^2}\sum_{i,j}\frac{1}{2}\bigl(x_i - x_j\bigr)^2 = \frac{1}{n!}\sum_{\sigma \in S(n)} \frac{1}{n}\sum_{i=1}^{n-1}\frac{1}{2}\bigl(x_{\sigma(i)} - x_{\sigma(i+1)}\bigr)^2.$$

The inner sum takes the reordered values $x_{\sigma(1)}, x_{\sigma(2)}, \ldots, x_{\sigma(n)}$ and sums the (halved) squared differences between all $n-1$ successive pairs. The division by $n$ essentially averages these successive squared differences. It computes what is known as the lag-1 semivariance. The outer summation does this for all possible orderings.


These two equivalent algebraic views of the standard variance formula give new insight into what the variance means. The semivariance is an inverse measure of the serial covariance of a sequence: the covariance is high (and the numbers are positively correlated) when the semivariance is low, and vice versa. The variance of an unordered dataset, then, is a kind of average of all possible semivariances obtainable under arbitrary reorderings.
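As a sanity check (my own snippet, not part of the answer), both identities above can be verified numerically on a small dataset:

```python
# Verify: Var(x) = (1/n^2) * sum_{i,j} (x_i - x_j)^2 / 2, and the same value
# as the average lag-1 semivariance over all n! orderings.
from itertools import permutations

x = [2.0, 7.0, 1.0, 8.0]
n = len(x)

# Ordinary (population, 1/n) variance.
m = sum(x) / n
var = sum((v - m) ** 2 for v in x) / n

# All-pairs half squared differences.
pair_form = sum(0.5 * (a - b) ** 2 for a in x for b in x) / n ** 2

# Average of the lag-1 semivariance over all orderings.
semis = [
    sum(0.5 * (p[i] - p[i + 1]) ** 2 for i in range(n - 1)) / n
    for p in permutations(x)
]
avg_semi = sum(semis) / len(semis)

print(var, pair_form, avg_semi)  # all three agree
```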


1
@Mur1lo On the contrary: I believe this derivation is correct. Apply the formula to some data and see!
whuber

1
I think Mur1lo may have been talking not about the correctness of the variance formula, but about the apparently direct passage from expectations of random variables to functions of sample quantities.
Glen_b -Reinstate Monica

1
@Glen_b But that is exactly what the empirical distribution function lets us do. That's the point of this approach.
whuber

3
Yes, that's clear to me; I was trying to point out where the confusion seemed to lay. Sorry to be vague. Hopefully it's clearer now why it only appears* to be a problem. *(this why I used the word "apparent" earlier, to emphasize it was just the out-of-context appearance of that step that was likely to be the cause of the confusion)
Glen_b -Reinstate Monica

2
@Mur1lo The only thing I have done in any of these equations is to apply definitions. There is no passing from expectations to "sample quantities". (In particular, no sample of F has been posited or used.) Thus I am unable to identify what the apparent problem is, nor suggest an alternative explanation. If you could expand on your concern then I might be able to respond.
whuber

11

Just a complement to the other answers: the variance can be computed from the squared differences between all pairs of terms:

$$\operatorname{Var}(X) = \frac{1}{2n^2}\sum_{i}^{n}\sum_{j}^{n}(x_i - x_j)^2 = \frac{1}{2n^2}\sum_{i}^{n}\sum_{j}^{n}(x_i - \bar{x} - x_j + \bar{x})^2 = \frac{1}{2n^2}\sum_{i}^{n}\sum_{j}^{n}\bigl((x_i - \bar{x}) - (x_j - \bar{x})\bigr)^2 = \frac{1}{n}\sum_{i}^{n}(x_i - \bar{x})^2$$

I think this is the closest to the OP proposition. Remember the variance is a measure of dispersion of every observation at once, not only between "neighboring" numbers in the set.


UPDATE

Using your example: X=1,2,3,4,5. We know the variance is Var(X)=2.

With your proposed method Var(X)=1, so we know beforehand taking the differences between neighbors as variance doesn't add up. What I meant was taking every possible difference squared then summed:

$$\operatorname{Var}(X) = \frac{1}{2 \cdot 5^2}\Bigl[(5-1)^2+(5-2)^2+(5-3)^2+(5-4)^2+(5-5)^2 \\
\qquad +\,(4-1)^2+(4-2)^2+(4-3)^2+(4-4)^2+(4-5)^2 \\
\qquad +\,(3-1)^2+(3-2)^2+(3-3)^2+(3-4)^2+(3-5)^2 \\
\qquad +\,(2-1)^2+(2-2)^2+(2-3)^2+(2-4)^2+(2-5)^2 \\
\qquad +\,(1-1)^2+(1-2)^2+(1-3)^2+(1-4)^2+(1-5)^2\Bigr] \\
= \frac{100}{50} = 2$$
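For anyone who wants to check the arithmetic, here is a one-line version (my own snippet):

```python
# Sum of squared differences over all ordered pairs of 1..5, divided by 2*n^2.
X = [1, 2, 3, 4, 5]
n = len(X)
pairwise = sum((a - b) ** 2 for a in X for b in X) / (2 * n ** 2)
print(pairwise)  # 2.0
```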

Now I'm seriously confused guys
user2305193

@user2305193 In your question, did you mean every pairwise difference or did you mean the difference between a value and the next in a sequence? Could you please clarify?
Firebug

2
@Mur1lo no one is though, I have no idea what you're referring to.
Firebug

2
@Mur1lo This is a general question, and I answered it generally. Variance is a computable parameter, which can be estimated from samples. This question isn't about estimation though. Also we are talking about discrete sets, not about continuous distributions.
Firebug

1
You showed how to estimate the variance by its U-statistic, and that's fine. The problem is when you write Var("upper case" X) = things involving "lower case" x: you are mixing the two different notions of parameter and estimator.
Mur1lo

6

Others have answered about the usefulness of variance defined as usual. Anyway, we just have two legitimate definitions of different things: the usual definition of variance, and your definition.

Then, the main question is why the first one is called variance and not yours. That is just a matter of convention. Until 1918 you could have invented anything you wanted and called it "variance", but in 1918 Fisher used that name for what is still called variance, and if you want to define anything else you will need to find another name for it.

The other question is whether the thing you defined might be useful for anything. Others have pointed out its problems as a measure of dispersion, but it's up to you to find applications for it. Maybe you will find applications so useful that in a century your measure is more famous than the variance.


I know every definition is up to the people deciding on it; I was really looking for help with the upsides and downsides of each approach. Usually there is a good reason for people converging on a definition, and as I suspected, I didn't see why straight away.
user2305193

1
Fisher introduced variance as a term in 1918 but the idea is older.
Nick Cox

As far as I know, Fisher was the first one to use the name "variance" for variance. That's why I say that before 1918 you could have used "variance" to name anything else you had invented.
Pere

3

@GreenParker answer is more complete, but an intuitive example might be useful to illustrate the drawback to your approach.

In your question, you seem to assume that the order in which realisations of a random variable appear matters. However, it is easy to think of examples in which it doesn't.

Consider the example of the height of individuals in a population. The order in which individuals are measured is irrelevant to both the mean height in the population and the variance (how spread out those values are around the mean).

Your method would seem odd applied to such a case.


2

Although there are many good answers to this question, I believe some important points were left out, and since this question raises a really interesting point I would like to provide yet another point of view.

Why isn't variance defined as the difference between every value following    
each other instead of the difference to the average of the values?

The first thing to keep in mind is that the variance is a particular kind of parameter, not a certain type of calculation. There is a rigorous mathematical definition of what a parameter is, but for the time being we can think of them as mathematical operations on the distribution of a random variable. For example, if $X$ is a random variable with distribution function $F_X$, then its mean $\mu_X$, which is also a parameter, is:

$$\mu_X = \int_{-\infty}^{+\infty} x \, dF_X(x)$$

and the variance of $X$, $\sigma_X^2$, is:

$$\sigma_X^2 = \int_{-\infty}^{+\infty} (x - \mu_X)^2 \, dF_X(x)$$

The role of estimation in statistics is to provide, from a set of realizations of a r.v., a good approximation for the parameters of interest.

What I wanted to show is that there is a big difference between the concept of a parameter (the variance, for this particular question) and the statistic we use to estimate it.

Why isn't the variance calculated this way?

So we want to estimate the variance of a random variable $X$ from a set of independent realizations of it, let's say $\mathbf{x} = \{x_1, \ldots, x_n\}$. The way you propose doing it is by computing the absolute values of the successive differences, summing, and taking the mean:

$$\psi(\mathbf{x}) = \frac{1}{n}\sum_{i=2}^{n}|x_i - x_{i-1}|$$

and the usual statistic is:

$$S^2(\mathbf{x}) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2,$$

where $\bar{x}$ is the sample mean.

When comparing two estimators of a parameter, the usual criterion for the best one is that it have minimal mean squared error (MSE), and an important property of the MSE is that it can be decomposed into two components:

MSE = (estimator bias)² + estimator variance.

Using this criterion, the usual statistic $S^2$ has some advantages over the one you suggest.

  • First, it is an unbiased estimator of the variance, while your statistic is not unbiased.

  • Another important point is that if we are working with the normal distribution, then $S^2$ is the best unbiased estimator of $\sigma^2$, in the sense that it has the smallest variance among all unbiased estimators and thus minimizes the MSE.

When normality is assumed, as is the case in many applications, S2 is the natural choice when you want to estimate the variance.
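To see the difference concretely, here is a small Monte Carlo sketch (my own code, with arbitrary sample sizes) comparing the average value of $S^2$ and of $\psi$ over repeated normal samples whose true variance is $\sigma^2 = 4$:

```python
import random
import statistics

random.seed(0)
sigma = 2.0   # true standard deviation, so sigma^2 = 4
n = 30        # sample size (arbitrary choice)
reps = 10000  # number of simulated samples (arbitrary choice)

mean_s2 = 0.0
mean_psi = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    mean_s2 += statistics.variance(xs)          # S^2, the usual 1/(n-1) estimator
    diffs = [abs(xs[i] - xs[i - 1]) for i in range(1, n)]
    mean_psi += sum(diffs) / n                  # psi, the proposed statistic
mean_s2 /= reps
mean_psi /= reps

print(mean_s2)   # close to 4: S^2 is unbiased for sigma^2
print(mean_psi)  # nowhere near 4: psi does not even target sigma^2
```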


3
Everything in this answer is well explained, correct, and interesting. However, introducing the "usual statistic" as an estimator confuses the issue, because the question is not about estimation, nor about bias, nor about the distinction between 1/n and 1/(n−1). That confusion might be at the root of your comments on several other answers in this thread.
whuber


1

Lots of good answers here, but I'll add a few.

  1. The way it is defined now has proven useful. For example, normal distributions appear all the time in data, and a normal distribution is defined by its mean and variance. Edit: as @whuber pointed out in a comment, there are various other ways to specify a normal distribution. But none of them, as far as I'm aware, deal with pairs of points in sequence.
  2. Variance as normally defined gives you a measure of how spread out the data are. For example, let's say you have a lot of data points with a mean of zero, but when you look at them you see that the data are mostly either around -1 or around 1. Your variance would be about 1. However, under your measure you would get a total of zero. Which one is more useful? Well, it depends, but it's not clear to me that a measure of zero for its "variance" would make sense.
  3. It lets you do other stuff. Just an example, in my stats class we saw a video about comparing pitchers (in baseball) over time. As I remember it, pitchers appeared to be getting worse since the proportion of pitches that were hit (or were home-runs) was going up. One reason is that batters were getting better. This made it hard to compare pitchers over time. However, they could use the z-score of the pitchers to compare them over time.
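Point 2 can be sketched numerically (my own example numbers, assuming the data are sorted so the "neighboring" differences are taken in order):

```python
# Data clustered at -1 and 1: variance about 1, but the successive-difference
# measure is nearly zero because only one adjacent gap is nonzero.
data = sorted([-1.0] * 50 + [1.0] * 50)

m = sum(data) / len(data)
variance = sum((x - m) ** 2 for x in data) / len(data)
print(variance)  # 1.0

diffs = [abs(b - a) for a, b in zip(data, data[1:])]
print(sum(diffs) / len(diffs))  # ~0.02: a single jump of 2 averaged over 99 gaps
```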

Nonetheless, as @Pere said, your metric might prove itself very useful in the future.


1
A normal distribution can also be determined by its mean and fourth central moment, for that matter -- or by means of many other pairs of moments. The variance is not special in that way.
whuber

@whuber interesting. I'll admit I didn't realize that. Nonetheless, unless I'm mistaken, all the moments are "variance like" in that they are based on distances from a certain point, as opposed to dealing with pairs of points in sequence. But I'll edit my answer to make note of what you said.
roundsquare

1
Could you explain the sense in which you mean "deal with pairs of points in sequence"? That's not a part of any standard definition of a moment. Note, too, that all the absolute moments around the mean--which includes all even moments around the mean--give a "measure of how spread out the data" are. One could, therefore, construct an analog of the Z-score with them. Thus, none of your three points appears to differentiate the variance from any absolute central moment.
whuber

@whuber yeah. The original question posited a 4 step sequence where you sort the points, take the differences between each point and the next point, and then average these. That's what I referred to as "deal[ing] with pairs of points in sequence". So you are right, none of the three points I gave distinguishes variance from any absolute central moment - they are meant to distinguish variance (and, I suppose, all absolute central moments) from the procedure described in the original question.
roundsquare
Licensed under cc by-sa 3.0 with attribution required.