Can the standard deviation of non-negative data exceed the mean?

15

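I have some triangulated 3D meshes. The statistics for the triangle areas are:

Min 0.000
Max 2341.141
Mean 56.317
Std dev 98.720

So, when the numbers work out like the above, does that say anything particularly useful about the standard deviation, or suggest that there are bugs in calculating it? The areas are certainly far from being normally distributed.

As someone mentioned in one of the responses below, what really surprised me was that it took only one SD below the mean for the numbers to go negative and thus out of the legal domain.

Thanks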

distributions mean standard-deviation

— Andy Dent
source

4

In the dataset $\{2,2,2,202\}$ the sample standard deviation is $100$, while the mean is $52$, which is pretty close to what you observe.

— Whuber

5

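For an example familiar to some, the average result of someone playing blackjack might be negative $25, but with a standard deviation of $100 (numbers for illustration). Such a large coefficient of variation makes it easier for someone to be fooled into thinking they are doing better than they really are.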

— Michael McGowan

The follow-up question is quite informative, too: it places bounds on the SD of a set of (nonnegative) data given the mean.

— ub

9

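There is nothing that says the standard deviation has to be less than or greater than the mean. Given a set of data, you can keep the mean the same but change the standard deviation to an arbitrary degree by adding/subtracting a positive number appropriately.

Using @whuber's example dataset from his comment to the question: {2, 2, 2, 202}. As stated by @whuber, the mean is 52 and the standard deviation is 100.

Now, perturb each element of the data as follows: {22, 22, 22, 142}. The mean is still 52, but the standard deviation is 60.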
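A quick way to check both datasets in R (a minimal sketch; the variable names `original` and `perturbed` are illustrative and not part of the answer). Note that `sd()` uses the n - 1 denominator.

```r
# Dataset from @whuber's comment and the perturbed version from this answer
original  <- c(2, 2, 2, 202)
perturbed <- c(22, 22, 22, 142)

# Both sets are nonnegative and share the same mean
mean(original)   # 52
mean(perturbed)  # 52

# sd() computes the sample standard deviation (divides by n - 1)
sd(original)     # 100
sd(perturbed)    # 60

# The element-wise changes sum to zero, which is why the mean is unchanged
perturbed - original       # 20 20 20 -60
sum(perturbed - original)  # 0
```

Pulling the small values up and the large value down by offsetting amounts shrinks the spread without moving the mean.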

— varty
source

1

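If you add to each element, you change the location parameter, i.e. the mean. You change the dispersion (i.e. the standard deviation) by multiplying by a scale factor (assuming your mean is zero).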

— Dirk Eddelbuettel

@DirkEddelbuettel You are correct. I fixed the answer and provided an example for clarity.

— varty

2

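I don't follow the example. The new dataset is clearly not derived from the original one by "adding or subtracting a positive number" from each of the original values.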

— ub

3

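I can't edit it, because I don't know what you're trying to say. If you can add separate values arbitrarily to each of the numbers in a dataset, then you're merely changing one set of $n$ values into a completely different set of $n$ values. I don't see how that is relevant to the question or even to your opening paragraph. I think anybody would grant that such changes can alter the mean and SD, but that doesn't tell us why the SD of a set of nonnegative data can be any positive multiple of its mean.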

— ub

2

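You're right: the quoted assertion is mine and does not appear in your reply. (It happens to be correct and relevant, though. :-) One point I'm trying to get across is that the mere ability to change an SD while keeping the mean the same does not answer the question. How much can the SD be changed (while keeping all data nonnegative)? The other point I have tried to make is that your example does not illustrate a general, predictable process for making such changes to the data. This makes it appear arbitrary, which is not much help.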

— ub

9

Of course, these are independent parameters. You can set up simple explorations in R (or another tool you may prefer).

R> set.seed(42)     # fix RNG
R> x <- rnorm(1000) # one thousand N(0,1)
R> mean(x)          # and mean is near zero
[1] -0.0258244
R> sd(x)            # sd is near one
[1] 1.00252
R> sd(x * 100)      # scale to std.dev of 100
[1] 100.252
R>

Likewise, you standardize the data you are looking at by subtracting the mean and dividing by the standard deviation.
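The standardization step can be sketched the same way. This is only an illustration: the mean and sd passed to `rnorm()` are arbitrary values chosen for the example.

```r
set.seed(42)                          # fix RNG, as in the transcript above
x <- rnorm(1000, mean = 56, sd = 98)  # arbitrary illustrative values

# Standardize: subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)  # zero up to floating-point error
sd(z)    # one up to floating-point error

# scale() performs the same centering and scaling in one call
z2 <- as.numeric(scale(x))
all.equal(z, z2)
```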

Edit: Following @whuber's idea, here is one of infinitely many datasets that come close to your four measurements:

R> data <- c(0, 2341.141, rep(52, 545))
R> data.frame(min=min(data), max=max(data), sd=sd(data), mean=mean(data))
  min     max      sd    mean
1   0 2341.14 97.9059 56.0898
R>

— Dirk Eddelbuettel
source

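I am not sure I understand your point. They are not exactly independent, since one can change the mean by perturbing one data point and thereby change the standard deviation as well. Did I misinterpret something?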

— varty

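Noting that triangle areas cannot be negative (as confirmed by the minimum value quoted in the question), one would hope for an example consisting solely of nonnegative numbers.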

— ub

(+1) Re the edit: Try using 536 replications of 52.15 :-).
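The suggestion can be tried by swapping the two numbers into the answer's snippet (a sketch; compare the printed values with the question's mean of 56.317 and SD of 98.720):

```r
# 536 copies of 52.15 plus the observed minimum and maximum
data <- c(0, 2341.141, rep(52.15, 536))
data.frame(min = min(data), max = max(data), sd = sd(data), mean = mean(data))
```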

— ub

Nice on the 536 reps. Should have done a binary search :)

— Dirk Eddelbuettel

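@Dirk "these are independent parameters": consider the case where $X$ is Bernoulli. The variance and the mean are not independent: $var(X)=p(1-p)$. Consider a random variable $100>X>0$; the maximum possible variance is $(50)^2$. Now if you force the mean to be equal to one (i.e., lower than $50$), the maximum variance cannot be greater than $99/100*(1)^2+(1/100)*99^2$. Aren't there more examples of bounded variables in nature than of Gaussian ones?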
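The numbers in this comment can be checked directly. The sketch below assumes the extremal two-point distribution, which puts all probability on the endpoints 0 and 100 (so the bound is approached rather than attained when the inequalities are strict); `max_var` is a hypothetical helper name.

```r
# Largest variance of a variable on [0, upper] with mean m:
# put probability p = m / upper at `upper` and 1 - p at 0.
max_var <- function(m, upper) {
  p <- m / upper
  second_moment <- p * upper^2  # E[X^2] for the two-point distribution
  second_moment - m^2           # Var = E[X^2] - (E[X])^2 = m * (upper - m)
}

max_var(50, 100)  # 2500, i.e. (50)^2
max_var(1, 100)   # 99

# The expression quoted in the comment gives the same value for mean 1
99/100 * 1^2 + 1/100 * 99^2  # 99
```

With the mean forced down to 1, the SD can still be close to sqrt(99), nearly ten times the mean.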

— Robin Girard 2011

7

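I am not sure why @Andy is surprised at this result, but I know he is not alone. Nor am I sure what the normality of the data has to do with the fact that the sd is higher than the mean. It is quite simple to generate a normally distributed dataset where this is the case; indeed, the standard normal has a mean of 0 and an sd of 1. It would be hard to get a normally distributed dataset of all positive values with sd > mean; indeed, it ought not to be possible (but it depends on the sample size and which test of normality you use... with a very small sample, odd things happen).

However, once you remove the stipulation of normality, as @Andy did, there is no reason why the sd should be larger or smaller than the mean, even for all positive values. A single outlier will do this. For example,

x <- runif(100, 1, 200)
x <- c(x, 2000)

gives a mean of 113 and an sd of 198 (depending on the seed, of course).

But the bigger question is why this surprises people.

I don't teach statistics, but I wonder what it is about the way statistics is taught that makes this notion common.

— Peter Flom - Reinstate Monica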
source

I have never studied statistics, just a couple of units of engineering math and that was thirty years ago. Other people at work, who I thought understood the domain better, have been talking about representing bad data by "number of std devs away from the mean". So, it's more about "how std dev is commonly mentioned" than "taught" :-)

— Andy Dent

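@Andy Being a large number of SDs away from the mean just means that the variable is not significantly different from zero. Then it depends on the context (what the random variable means), but in some cases you may want to remove those variables?

— Robin Girard 2011

@Peter See my comment to Dirk; it might explain the "surprise" in some cases. Actually, I have taught statistics for some time and I have never come across the surprise you are talking about. Anyway, I prefer students who are surprised by everything; I'm pretty sure that is a good epistemological position (better than the absolutely-no-surprise position :) ).

— Robin Girard

@AndyDent To me, "bad" data means data that is recorded incorrectly. Data that are far from the mean are outliers. For example, suppose you are measuring people's heights. If you measure me and record my height as 7'5" instead of 5'7", that's bad data. If you measure Yao Ming and record his height as 7'5", that's an outlier but not bad data, regardless of the fact that it is very far from the mean (something like 6 sds).

— Peter Flom - Reinstate Monica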

@Peter Flom, In our case, we have outliers which we want to get rid of because they represent triangles that will cause algorithmic problems processing the mesh. They may even be "bad data" in your sense if they were created by faulty scanning devices or conversion from other formats :-) Other shapes may have outliers which are legitimately a long way from the mean but don't represent a problem. One of the more interesting things about this data is we have "bad data" at both ends but the small ones are not far from the mean.

— Andy Dent

6

Just adding a generic point that, from a calculus perspective, $\int x f(x) \text{d}x$ and $\int x^2 f(x) \text{d}x$ are related by Jensen's inequality, assuming both integrals exist,

$\int x^2 f(x) \text{d}x \ge \left\{ \int x f(x) \text{d}x \right\}^2\,.$

Given this general inequality, nothing prevents the variance from getting arbitrarily large. Witness the Student's t distribution with $\nu$ degrees of freedom, $X \sim \mathfrak{T}(\nu,\mu,\sigma)$, and take $Y=|X|$, whose second moment is the same as the second moment of $X$,

$\mathbb{E}[|X|^2] = \frac{\nu}{\nu-2}\sigma^2 + \mu^2,$

when $\nu>2$. So it goes to infinity when $\nu$ goes down to $2$, while the mean of $Y$ remains finite as long as $\nu>1$.
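Spelling out the last step, using only the quantities defined above and assuming $\sigma>0$: the variance of $Y$ is its second moment minus its squared mean, and the mean stays finite near $\nu=2$, so the variance (and hence the SD) of the nonnegative variable $Y$ diverges while its mean does not.

```latex
\operatorname{Var}(Y) = \mathbb{E}[Y^2] - \left(\mathbb{E}[Y]\right)^2
  = \frac{\nu}{\nu-2}\sigma^2 + \mu^2 - \left(\mathbb{E}|X|\right)^2
  \longrightarrow \infty \quad \text{as } \nu \downarrow 2 .
```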

— Xi'an
source

1

Please note the explicit restriction to nonnegative values in the question.

— whuber

The Student example gets easily translated into the absolute-value-of-a-Student's-t-distribution example...

— Xi'an

1

But that changes the mean, of course :-). The question concerns the relationship between the SD and the mean (see its title). I am not saying you're wrong; I'm just (implicitly) suggesting that your reply could, with little work, more directly address the question.

— whuber

@whuber: ok, I edited the above to consider the absolute value (I also derived the mean of the absolute value but <a href="ceremade.dauphine.fr/~xian/meanabs.pdf">it is rather ungainly</a>...)

— Xi'an

3

Perhaps the OP is surprised that the mean - 1 S.D. is a negative number (especially where the minimum is 0).

Here are two examples that may clarify.

Suppose you have a class of 20 first graders, where 18 are 6 years old, 1 is 5, and 1 is 7. Now add in the 49-year-old teacher. The average age is about 8.0, while the sample standard deviation is about 9.4.

You might be thinking: the one-standard-deviation range for this class runs from about -1.3 to 17.4 years. You might be surprised that this range includes a negative age, which seems unreasonable.

You don't have to worry about the negative age (or the 3D plots extending less than the minimum of 0.0). Intuitively, you still have about two-thirds of the data within 1 S.D. of the mean. (You actually have 95% of the data within 2 S.D. of the mean.)
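A sketch for reproducing the classroom figures in R. It assumes the ages exactly as described above; `sd()` uses the n - 1 denominator, so a calculation with the n denominator gives a slightly smaller value.

```r
# Ages as described: 18 six-year-olds, one 5, one 7, plus the 49-year-old teacher
ages <- c(rep(6, 18), 5, 7, 49)

m <- mean(ages)
s <- sd(ages)          # sample standard deviation (n - 1 denominator)
c(mean = m, sd = s)

# One-SD interval around the mean; the lower end is negative
c(lower = m - s, upper = m + s)

# Share of the class within 1 and 2 SDs of the mean
mean(abs(ages - m) <= s)
mean(abs(ages - m) <= 2 * s)
```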

When the data takes on a non-normal distribution, you will see surprising results like this.

Second example. In his book, Fooled by Randomness, Nassim Taleb sets up the thought experiment of a blindfolded archer shooting at a wall of infinite length. The archer can shoot between +90 degrees and -90 degrees.

Every once in a while, the archer will shoot the arrow parallel to the wall, and it will never hit. Consider how far the arrow misses the target as the distribution of numbers. The standard deviation for this scenario would be infinite.
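A rough simulation of the archer. It assumes the archer stands one unit away from the wall, aims at the nearest point on it, and picks the angle uniformly; these details are assumptions for illustration and are not taken from the book.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Angle of each shot, uniform between -90 and +90 degrees (in radians)
theta <- runif(1e5, min = -pi / 2, max = pi / 2)

# Where the arrow lands along the wall, for an archer one unit away:
# tan(theta) of a uniform angle follows a standard Cauchy distribution
hit <- tan(theta)

# Sample SD over growing prefixes of the data: it does not settle down
n <- c(1e2, 1e3, 1e4, 1e5)
data.frame(n = n, sd = sapply(n, function(k) sd(hit[seq_len(k)])))
```

Every simulated shot lands at a finite position, yet the sample SD keeps jumping as more shots are added, which is the practical symptom of an infinite population SD.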

— rajah9
source

The rule about 2/3 of the data within 1 SD of the mean is for normal data. But the classroom data is clearly non-normal (even if it passes some test for normality because of small sample size). Taleb's example is terrible. It's an example of poor operationalization of a variable. Taken as is, both the mean and the SD would be infinite. But that's nonsense. "How far the arrow misses" - to me, that's a distance. The arrow, no matter how it is fired, will land somewhere. Measure the distance from there to the target. No more infinity.

— Peter Flom - Reinstate Monica

1

Yup, the OP was sufficiently surprised the first time I saw mean - 1 SD went negative that I wrote a whole new set of unit tests using data from Excel to confirm at least my algorithm was calculating the same values. Because Excel just has to be an authoritative source, right?

— Andy Dent

@Peter The 2/3 rule (part of a 68-95-99.7% rule) is good for a huge variety of datasets, many of them non-normal and even for moderately skewed ones. (The rule is quite good for symmetric datasets.) The non-finiteness of the SD and mean are not "nonsense." Taleb's example is one of the few non-contrived situations where the Cauchy distribution clearly governs the data-generation process. The infiniteness of the SD does not derive from the possibility of missing the wall but from the distribution of actual hits.

— whuber

1

@whuber I was aware of your first point, which is a good one. I disagree about your second point re Taleb. It seems to me like another contrived example.

— Peter Flom - Reinstate Monica

3

A gamma random variable $X$ with density

$f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} I_{(0,\infty)}(x) \, ,$

with $\alpha,\beta>0$, is almost surely positive. Choose any mean $m>0$ and any standard deviation $s>0$. As long as they are positive, it does not matter if $m>s$ or $m<s$. Putting $\alpha=m^2/s^2$ and $\beta=m/s^2$, the mean and standard deviation of $X$ are $\mathbb{E}[X]=\alpha/\beta=m$ and $\sqrt{\mathbb{Var}[X]}=\sqrt{\alpha/\beta^2}=s$. With a big enough sample from the distribution of $X$, by the SLLN, the sample mean and sample standard deviation will be close to $m$ and $s$. You can play with R to get a feeling about this. Here are examples with $m>s$ and $m<s$.

> m <- 10
> s <- 1
> x <- rgamma(10000, shape = m^2/s^2, rate = m/s^2)
> mean(x)
[1] 10.01113
> sd(x)
[1] 1.002632

> m <- 1
> s <- 10
> x <- rgamma(10000, shape = m^2/s^2, rate = m/s^2)
> mean(x)
[1] 1.050675
> sd(x)
[1] 10.1139

— Zen
source

1

As pointed out in the other answers, the mean $\bar{x}$ and standard deviation $\sigma_x$ are essentially unrelated in that it is not necessary for the standard deviation to be smaller than the mean. However, if the data are nonnegative, taking on values in $[0,c]$ , say, then, for large data sets (where the distinction between dividing by $n$ or by $n-1$ does not matter very much), the following inequality holds:

$\sigma_x \leq \sqrt{\bar{x}(c-\bar{x})} \leq \frac{c}{2}$

and so if $\bar{x} > c/2$, we can be sure that $\sigma_x$ will be smaller. Indeed, since $\sigma_x = c/2$ only for an extremal distribution (half the data have value $0$ and the other half value $c$), $\sigma_x < \bar{x}$ can hold in some cases when $\bar{x} < c/2$ as well. If the data are measurements of some physical quantity that is nonnegative (e.g. area) and have an empirical distribution that is a good fit to a normal distribution, then $\sigma_x$ will be considerably smaller than $\min\{\bar{x}, c - \bar{x}\}$ since the fitted normal distribution should assign negligibly small probability to the events $\{X < 0\}$ and $\{X > c\}$.
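As a sanity check, the inequality can be applied to the summary statistics quoted in the question. This sketch takes the largest observed area as the upper limit $c$ (an assumption; any valid upper limit would do), and the variable names are illustrative.

```r
# Summary statistics quoted in the question
x_bar <- 56.317     # mean
s     <- 98.720     # standard deviation
c_max <- 2341.141   # largest observed value, used here as the upper limit c

bound <- sqrt(x_bar * (c_max - x_bar))
c(sd = s, bound = bound, half_range = c_max / 2)

# The reported SD respects the chain sd <= bound <= c / 2
s <= bound && bound <= c_max / 2  # TRUE
```

The reported SD sits well inside the bound, so the figures are at least mutually consistent; this does not by itself prove the SD was computed correctly.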

— Dilip Sarwate
source

4

I don't think the question is whether the dataset is normal; its non-normality is stipulated. The question concerns whether there might have been some error made in computing the standard deviation, because the OP is surprised that even in this obviously non-normal dataset the SD is much larger than the mean. If an error was not made, what can one conclude from such a large coefficient of variation?

— whuber

9

Any answer or comment that claims the mean and sd of a dataset are unrelated is plainly incorrect, because both are functions of the same data and both will change whenever a single one of the data values is changed. This remark does bear some echoes of a similar sounding statement that is true (but not terribly relevant to the current question); namely, that the sample mean and sample sd of data drawn independently from a normal distribution are independent (in the probabilistic sense).

— whuber

1

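What you seem to have implicitly in mind is a prediction interval that would bound the occurrence of new observations. The catch is that you must postulate a statistical distribution compatible with the fact that your observations (triangle areas) must remain non-negative. The normal won't help, but the log-normal might be just fine. In practical terms, take the log of the observed areas, calculate the mean and standard deviation, form a prediction interval using the normal distribution, and finally evaluate the exponential of the lower and upper limits. The transformed prediction interval won't be symmetric around the mean and is guaranteed not to go below zero. This is what I think the OP actually had in mind.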
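The recipe above might look like this in R. It is a minimal sketch on simulated stand-in data, not the OP's mesh; the `rlnorm()` parameters are arbitrary, and the interval ignores the uncertainty in the estimated mean and sd.

```r
set.seed(123)  # arbitrary seed

# Hypothetical stand-in for the triangle areas: strictly positive, right-skewed.
# Real data containing exact zeros (the question's minimum is 0.000) would need
# handling first, because log(0) is -Inf.
areas <- rlnorm(1000, meanlog = 3, sdlog = 1.5)

log_areas <- log(areas)
m <- mean(log_areas)
s <- sd(log_areas)

# Approximate 95% prediction interval on the log scale (normal quantiles)
z <- qnorm(0.975)
interval_log <- c(lower = m - z * s, upper = m + z * s)

# Back-transform: limits are positive and not symmetric around mean(areas)
interval <- exp(interval_log)
interval
mean(areas)
```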

— Felipe G. Nievinski
source

0

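Felipe Nievinski points to a real issue here. It makes no sense to talk in normal-distribution terms when the distribution is clearly not normal. All-positive values with a relatively small mean and a relatively large standard deviation cannot have a normal distribution. So the task is to figure out what sort of distribution fits the situation. The original post suggests that a normal distribution (or some such) was clearly in mind; otherwise negative numbers would not come up. Log-normal, Rayleigh, Weibull soon come to mind... I don't know, but I wonder what might be best in a case like this?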

— fred3
source