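I have some triangulated 3D meshes. The statistics for the triangle areas are:
- Min 0.000
- Max 2341.141
- Mean 56.317
- Std dev 98.720
So, when the figures work out like the above, does that say anything particularly useful about the standard deviation, or does it suggest a bug in how it was calculated? The areas are certainly far from normally distributed.
And, as someone mentioned in one of the responses below, what really surprised me was that it takes only one standard deviation below the mean for the numbers to go negative and thus out of the legal range.
Thanks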
Answers:
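Nothing says that the standard deviation has to be smaller or larger than the mean. Given a set of data, you can keep the mean the same but change the standard deviation to an arbitrary degree by adding or subtracting positive numbers appropriately.
Using @whuber's example data set from his comment on the question: {2, 2, 2, 202}. As @whuber states, the mean is 52 and the standard deviation is 100.
Now perturb each element of the data as follows: {22, 22, 22, 142}. The mean is still 52, but the standard deviation is 60.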
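To check these figures yourself, here is a minimal sketch in R. The variable names `x` and `y` are only illustrative, and note that `sd()` uses the n - 1 denominator.

```r
# Two data sets with the same mean but different spread.
x <- c(2, 2, 2, 202)
y <- c(22, 22, 22, 142)

mean(x)  # 52
sd(x)    # 100
mean(y)  # 52
sd(y)    # 60
```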
Of course, these are independent parameters. You can set up simple explorations in R (or whatever other tool you prefer).
R> set.seed(42) # fix RNG
R> x <- rnorm(1000) # one thousand N(0,1)
R> mean(x) # and mean is near zero
[1] -0.0258244
R> sd(x) # sd is near one
[1] 1.00252
R> sd(x * 100) # scale to std.dev of 100
[1] 100.252
R>
Similarly, you standardize the data you are looking at by subtracting the mean and dividing by the standard deviation.
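As a small sketch of that standardization, assuming the data sit in a numeric vector (here a made-up, skewed `x`), subtracting the mean and dividing by the standard deviation leaves mean 0 and sd 1, whatever the original values were:

```r
set.seed(42)
x <- rexp(1000, rate = 1 / 50)   # illustrative skewed, nonnegative data

z <- (x - mean(x)) / sd(x)       # standardize by hand
mean(z)                          # numerically zero
sd(z)                            # 1

# Base R's scale() performs the same centering and scaling.
all.equal(z, as.numeric(scale(x)))
```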
Edit: Following @whuber's idea, here is one of infinitely many data sets that come close to your four measurements:
R> data <- c(0, 2341.141, rep(52, 545))
R> data.frame(min=min(data), max=max(data), sd=sd(data), mean=mean(data))
  min     max      sd    mean
1   0 2341.14 97.9059 56.0898
R>
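I am not sure why @Andy is surprised at this result, but I know he is not alone. Nor am I sure what the normality of the data has to do with the sd being higher than the mean. It is quite simple to generate a normally distributed data set where this is the case; indeed, the standard normal has a mean of 0 and an sd of 1. It would be hard to get a normally distributed data set of all positive values with sd > mean; indeed, it ought not to be possible (though it depends on the sample size and on which test of normality you use; with a very small sample, odd things happen).
However, once you remove the stipulation of normality, as @Andy did, there is no reason why the sd should be larger or smaller than the mean, even when all values are positive. A single outlier will do this. For example,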
x <- runif(100, 1, 200)
x <- c(x, 2000)
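gives a mean of 113 and an sd of 198 (depending on the seed, of course).
But the bigger question is why this surprises people.
I don't teach statistics, but I wonder what it is about the way statistics is taught that makes this notion so common.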
Just adding a generic point that, from a calculus perspective,
Perhaps the OP is surprised that the mean - 1 S.D. is a negative number (especially where the minimum is 0).
Here are two examples that may clarify.
Suppose you have a class of 20 first graders, where 18 are 6 years old, 1 is 5, and 1 is 7. Now add in the 49-year-old teacher. The average age is about 8.0, while the standard deviation is about 9.4.
You might be thinking: the one-standard-deviation range for this class runs from about -1.4 to 17.4 years. You might be surprised that the range includes a negative age, which seems unreasonable.
You don't have to worry about the negative age (or the 3D plots extending less than the minimum of 0.0). Intuitively, you still have about two-thirds of the data within 1 S.D. of the mean. (You actually have 95% of the data within 2 S.D. of the mean.)
When the data takes on a non-normal distribution, you will see surprising results like this.
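The classroom example is easy to reproduce. The sketch below uses the ages exactly as described (the vector name `ages` is just for illustration) and the sample standard deviation from `sd()`; the figures should come out close to those quoted above, with small differences down to rounding.

```r
ages <- c(rep(6, 18), 5, 7, 49)   # 20 first graders plus the teacher

m <- mean(ages)
s <- sd(ages)                     # sample sd, n - 1 denominator
c(mean = m, sd = s, lower = m - s, upper = m + s)

# Share of the group within one and two standard deviations of the mean.
within1 <- mean(abs(ages - m) <= s)
within2 <- mean(abs(ages - m) <= 2 * s)
c(within1 = within1, within2 = within2)
```

The lower limit is negative even though no age is, because the single large value inflates the standard deviation without moving the bulk of the data.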
Second example. In his book, Fooled by Randomness, Nassim Taleb sets up the thought experiment of a blindfolded archer shooting at a wall of infinite length. The archer can shoot between +90 degrees and -90 degrees.
Every once in a while, the archer will shoot the arrow parallel to the wall, and it will never hit. Consider how far the arrow misses the target as the distribution of numbers. The standard deviation for this scenario would be infinite.
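One way to see this numerically is a rough simulation. It rests on two assumptions that are mine rather than part of the thought experiment as stated: the firing angle is uniform between -90 and +90 degrees, and the archer stands one unit from the wall. Under those assumptions the landing point is tan(angle), which follows a standard Cauchy distribution, and the sample standard deviation does not settle down as the sample grows.

```r
set.seed(1)
n <- 1e5
angle <- runif(n, -pi / 2, pi / 2)   # firing angle in radians (-90 to +90 degrees)
miss <- tan(angle)                   # landing point along the wall, target at 0

# Sample sd over growing prefixes of the same simulated shots.
sizes <- c(1e2, 1e3, 1e4, 1e5)
running_sd <- sapply(sizes, function(k) sd(miss[1:k]))
data.frame(n = sizes, sd = running_sd)
```

Rerunning with other seeds typically gives wildly different values, which is the practical face of an infinite standard deviation.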
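A gamma random variable with density $f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$ for $x > 0$ is positive, with mean $\alpha/\beta$ and standard deviation $\sqrt{\alpha}/\beta$. Choosing shape $\alpha = m^2/s^2$ and rate $\beta = m/s^2$ gives mean $m$ and standard deviation $s$ for any $m, s > 0$.
You can play with R to get a feeling about this. Here are examples with $m > s$ and $m < s$.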
> m <- 10
> s <- 1
> x <- rgamma(10000, shape = m^2/s^2, rate = m/s^2)
> mean(x)
[1] 10.01113
> sd(x)
[1] 1.002632
> m <- 1
> s <- 10
> x <- rgamma(10000, shape = m^2/s^2, rate = m/s^2)
> mean(x)
[1] 1.050675
> sd(x)
[1] 10.1139
As pointed out in the other answers, the mean and standard deviation are essentially unrelated, in that the standard deviation need not be smaller than the mean. However, if the data are nonnegative, taking on values in $[0, c]$, say, then for large data sets (where the distinction between dividing by $n$ or by $n-1$ does not matter very much), the following inequality holds:
$$\sigma \le \sqrt{\bar{x}\,(c - \bar{x})} \le \frac{c}{2}.$$
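A bound of this kind can be checked numerically. The sketch below assumes the inequality meant is sd <= sqrt(mean * (c - mean)) <= c / 2 for data in [0, c], uses the divide-by-n standard deviation (for which the bound is exact), and takes c to be the observed maximum. The helper `pop_sd` and the simulated data are illustrative only.

```r
# Population-style sd (divide by n), for which the bound is exact.
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

set.seed(1)
x <- rexp(1000, rate = 1 / 50)        # illustrative nonnegative data
c_max <- max(x)                       # take c as the observed maximum
bound <- sqrt(mean(x) * (c_max - mean(x)))
c(sd = pop_sd(x), bound = bound, half_c = c_max / 2)

# The same check on the summary figures from the question.
q_mean <- 56.317; q_sd <- 98.720; q_max <- 2341.141
q_bound <- sqrt(q_mean * (q_max - q_mean))
q_sd <= q_bound                       # TRUE
```

So the reported standard deviation, although larger than the mean, is well within what nonnegative data with that mean and maximum allow.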
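What you seem to have in mind implicitly is a prediction interval that would bound the occurrence of new observations. The catch is that you must postulate a statistical distribution compliant with the fact that your observations (triangle areas) must remain non-negative. The normal won't help, but the log-normal might be just fine. In practical terms: take the log of the observed areas, calculate the mean and standard deviation, form a prediction interval using the normal distribution, and finally evaluate the exponential of the lower and upper limits. The transformed prediction interval won't be symmetric around the mean, and it is guaranteed not to go below zero. This is what I think the OP actually had in mind.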
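A minimal sketch of that recipe in R follows. The data are simulated stand-ins (the real areas are not available here), the names are hypothetical, and the interval uses plain normal quantiles, ignoring the extra uncertainty from estimating the mean and sd. It also assumes all areas are strictly positive; zero-area triangles, like the minimum of 0.000 in the question, would need to be handled separately because log(0) is -Inf.

```r
set.seed(1)
areas <- rlnorm(500, meanlog = 3, sdlog = 1.2)   # hypothetical positive areas

log_areas <- log(areas)
m <- mean(log_areas)
s <- sd(log_areas)

# Approximate 95% interval on the log scale, then back-transform.
z <- qnorm(0.975)
interval <- exp(c(lower = m - z * s, upper = m + z * s))
interval

# Compare with the naive mean +/- 1.96 sd on the raw scale,
# whose lower limit can fall below zero.
naive <- mean(areas) + c(lower = -1, upper = 1) * z * sd(areas)
naive
```

The back-transformed limits are both positive and sit asymmetrically around the mean of the raw areas, which matches the behaviour described above.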