How can I estimate the probability that a random member of one population is "better" than a random member of another population?


15

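Suppose I have samples from two distinct populations. If I measure how long each member takes to complete a task, I can easily estimate the mean and variance of each population.

If I now hypothesise a random pairing of one individual from each population, can I estimate the probability that the first is faster than the second?

I do have a concrete example: the measurements are timings of me cycling from A to B, and the populations represent the different routes I could take. I'm trying to work out the probability that picking route A for my next ride will be faster than picking route B. When I actually do the ride, I get another data point for my sample set :).

I'm aware that this is a horribly simplistic way of tackling the problem, not least because on any given day the wind is more likely to affect my time than anything else, so please tell me if you think I'm asking the wrong question...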


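This can be done with a simple binomial test, and @Macro has a good answer. However, the sampling itself is an issue: is there anything that might influence your decision to take Route A or Route B? In particular, would you tend to save Route A for days with dry roads, a favorable wind, and dinner waiting? :) Be careful about anything that could affect outliers in the set or bias the samples in some way. For instance, try to set your sampling plan in advance, allowing for any necessary variations (e.g., for safety).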
Iterator

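Another issue to consider: suppose the two routes have very similar means and neither has the edge in the probability of being faster. For example, one always takes either 10 or 20 minutes, while the other always takes 15 minutes. You may find it better to penalize greater uncertainty (e.g., the standard deviation), or to favor the uncertainty because it may give a chance of finishing under some time threshold. Your question is fine as is; I'm just suggesting a future refinement.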
Iterator
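A minimal R sketch of the example in the comment above. It assumes the 10-or-20-minute route takes either time with equal probability (the comment does not say), and the names `a_times`, `a_probs`, `b_time`, and `p_within` are made up for illustration. It shows that the two routes tie on mean time and on the chance of being faster, so the choice comes down to the spread or to a deadline.

```r
# Hypothetical routes from the comment: A takes 10 or 20 minutes
# (assumed equally likely), B always takes 15 minutes.
a_times <- c(10, 20)
a_probs <- c(0.5, 0.5)   # assumption: the comment gives no probabilities
b_time  <- 15

mean_a     <- sum(a_times * a_probs)                  # expected time on A
sd_a       <- sqrt(sum(a_probs * (a_times - mean_a)^2))  # spread of A
p_a_faster <- sum(a_probs[a_times < b_time])          # P(A beats B)

# Chance of arriving within a deadline (threshold in minutes, hypothetical)
p_within <- function(threshold) {
  c(A = sum(a_probs[a_times <= threshold]),
    B = as.numeric(b_time <= threshold))
}

c(mean_a = mean_a, sd_a = sd_a, p_a_faster = p_a_faster)
p_within(12)   # deadline shorter than B's fixed time
p_within(15)   # deadline equal to B's fixed time
```

With a tight deadline the variable route is the only one with any chance of making it; with a looser deadline the constant route is the safe choice.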

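The statistical question is fine, but if you want to work out which route is more likely to be faster, I suggest measuring the length of the routes. If the terrain is not hilly, the shorter route will always be faster.
mpiktas 2011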

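If wind is an important factor and wind speed is correlated across the two routes, then it seems one would need information about the dependence between A and B to answer the question accurately. For that you would need bivariate data, and it is hard to ride both routes at the same time. You could enlist other people to help you collect data, but then you would need to account for rider-to-rider variation. In the case where A and B are independent, the answers below are fine.

In other words, if I were trying to decide which route to take, and one went through a tunnel while the other crossed a field in a howling wind, I might well choose the tunnel, even if its average performance were terrible.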
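To illustrate why the dependence matters, here is a hedged R sketch. It assumes a bivariate Normal model for the two routes' times with a correlation `rho`, which none of the posts specify. Under that assumption the variance of the difference is $\sigma_x^2 + \sigma_y^2 - 2\rho\sigma_x\sigma_y$, so the probability changes with `rho`. The function name and the summary values are illustrative only.

```r
# P(Y > X) under an assumed bivariate Normal model with correlation rho.
# rho = 0 recovers the independent case.
prob_y_slower_cor <- function(mu_x, mu_y, sd_x, sd_y, rho = 0) {
  sd_diff <- sqrt(sd_x^2 + sd_y^2 - 2 * rho * sd_x * sd_y)
  pnorm((mu_y - mu_x) / sd_diff)
}

# Illustrative summary values only (not measured data)
rhos  <- c(0, 0.3, 0.6, 0.9)
probs <- sapply(rhos, function(r) prob_y_slower_cor(30, 36, 6, 8, rho = r))
data.frame(rho = rhos, p_y_slower = probs)
```

When both routes suffer on the same windy days (positive `rho`), the difference between them varies less, so the route with the lower mean wins more often than the independent calculation suggests.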

Answers:


12

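Let the two means be $\mu_x$ and $\mu_y$ and their standard deviations be $\sigma_x$ and $\sigma_y$, respectively. The difference in timings between two rides ($Y - X$) therefore has mean $\mu_y - \mu_x$ and, provided the two times are independent, standard deviation $\sqrt{\sigma_x^2 + \sigma_y^2}$. The standardized difference (the "z score") is

$$z = \frac{\mu_y - \mu_x}{\sqrt{\sigma_x^2 + \sigma_y^2}}.$$

Unless your ride times have strange distributions, the chance that ride $Y$ takes longer than ride $X$ is approximately the Normal cumulative distribution, $\Phi$, evaluated at $z$.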
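As a rough sketch of this estimate in R, assuming the two routes' times are independent: the function name `prob_y_slower` and the ride times below are made up for illustration, not taken from the question.

```r
# Normal-approximation estimate of P(Y > X) from two vectors of ride times.
# Assumes the two routes' times are independent.
prob_y_slower <- function(x, y) {
  z <- (mean(y) - mean(x)) / sqrt(var(x) + var(y))
  pnorm(z)
}

# Made-up ride times in minutes, purely for illustration
x <- c(28, 31, 25, 36, 30, 27, 33)
y <- c(35, 41, 30, 38, 44, 29, 36)
prob_y_slower(x, y)
```

Swapping the arguments gives the complementary probability, since the z score only changes sign.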

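Computation

You can work out this probability on one of your rides, because you already have estimates of $\mu_x$ and so on :-). For this purpose it's easy to memorize a few key values of $\Phi$: $\Phi(0) = 0.5 = 1/2$, $\Phi(-1) \approx 0.16 \approx 1/6$, $\Phi(-2) \approx 0.022 \approx 1/40$, and $\Phi(-3) \approx 0.0013 \approx 1/750$. (The approximation may be poor for $|z|$ much larger than $2$, but knowing $\Phi(-3)$ helps with interpolation.) In conjunction with $\Phi(z) = 1 - \Phi(-z)$ and a little interpolation, you can quickly estimate the probability to one significant figure, which is more than precise enough given the nature of the problem and the data.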
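The memorized values can be checked against R's `pnorm`. The sketch below also shows one way to mimic the mental interpolation with `approx`. The helper `phi_mental` is a made-up name, and linear interpolation between integer z values is simply the shortcut described above, not a recommended numerical method.

```r
k <- 0:3
data.frame(
  z         = -k,
  exact     = pnorm(-k),
  shorthand = c(1/2, 1/6, 1/40, 1/750)
)

# Mental-arithmetic version: linear interpolation between the memorized
# values, using Phi(z) = 1 - Phi(-z) to fill in the positive side.
phi_mental <- function(z) {
  knots <- c(0.0013, 0.022, 0.16, 0.5, 0.84, 0.978, 0.9987)  # Phi(-3), ..., Phi(3)
  approx(-3:3, knots, xout = z)$y
}

phi_mental(0.6)   # interpolated value
pnorm(0.6)        # exact value for comparison
```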

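Suppose route $X$ takes 30 minutes with a standard deviation of 6 minutes, and route $Y$ takes 36 minutes with a standard deviation of 8 minutes. With enough data covering a wide range of conditions, the histograms of your data might eventually approximate these:

[Figure: two histograms]

(These are the probability density functions of Gamma(25, 30/25) and Gamma(20, 36/20) variables. Observe that they are decidedly skewed to the right, as one would expect for ride times.)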

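Then

$$\mu_x = 30, \quad \mu_y = 36, \quad \sigma_x = 6, \quad \sigma_y = 8,$$

whence

$$z = \frac{36 - 30}{\sqrt{6^2 + 8^2}} = 0.6.$$

We have

$$\Phi(0) = 0.5; \quad \Phi(1) = 1 - \Phi(-1) \approx 1 - 0.16 = 0.84.$$

We therefore estimate that the answer lies 0.6 of the way between 0.5 and 0.84: 0.5 + 0.6 * (0.84 - 0.5) = approximately 0.70. (The correct but overly precise value for the Normal distribution is 0.73.)

There is about a 70% chance that route $Y$ will take longer than route $X$. Doing this calculation in your head will take your mind off the next hill. :-)

(The correct probability for the histograms shown is 72%, even though neither of them is Normal: this illustrates the scope and utility of the Normal approximation for the difference in trip times.)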
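The three figures quoted in this example (about 0.70 by mental interpolation, 0.73 from the Normal distribution, 72% for the Gamma densities) can be reproduced with a short R sketch. It assumes that Gamma(25, 30/25) and Gamma(20, 36/20) are shape and scale parameters, which is the reading that matches the stated means and standard deviations, and that the two ride times are independent.

```r
mu_x <- 30; sd_x <- 6
mu_y <- 36; sd_y <- 8

z        <- (mu_y - mu_x) / sqrt(sd_x^2 + sd_y^2)  # standardized difference
p_mental <- 0.5 + z * (0.84 - 0.5)                 # interpolation done in the head
p_normal <- pnorm(z)                               # Normal approximation

# Exact P(Y > X) for independent Gamma ride times:
# integrate P(X < y) against the density of Y.
p_gamma <- integrate(
  function(y) pgamma(y, shape = 25, scale = 30 / 25) *
              dgamma(y, shape = 20, scale = 36 / 20),
  lower = 0, upper = Inf
)$value

round(c(mental = p_mental, normal = p_normal, gamma = p_gamma), 2)
```

The numerical integral stands in for whatever method produced the 72% figure, which the answer does not state.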


P(X>Y)

@Macro: If you can reduce the data to summary statistics for the Q of interest, you can store less data... just a thought.
Iterator

P(X>Y), while @whuber is considering the difference in the mean times, which isn't the same. It isn't too hard to construct a case where option Y is shorter than option X 60% of the time, but the mean for Y is greater than the mean for X.
Iterator
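One such case, as a small R sketch with made-up numbers: X always takes 10 minutes, while Y is usually a little quicker but occasionally much slower.

```r
# Hypothetical discrete example: X always takes 10 minutes;
# Y takes 9 minutes 60% of the time and 20 minutes otherwise.
x_time  <- 10
y_times <- c(9, 20)
y_probs <- c(0.6, 0.4)

p_y_shorter <- sum(y_probs[y_times < x_time])   # chance Y beats X
mean_y      <- sum(y_times * y_probs)           # expected time on Y
c(p_y_shorter = p_y_shorter, mean_x = x_time, mean_y = mean_y)
```

Here Y wins 60% of individual comparisons, yet its mean of 13.4 minutes is larger than X's 10 minutes.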

FWIW: @whuber is describing Student's t-test for the difference in means between two samples with different standard deviations.
Iterator

1
Thanks, @whuber, this is the answer to the question I'd been trying to ask :).
Andrew Aylett

6

My instinctive approach may not be the most statistically sophisticated, but you may find it to be more fun :)

I would get a decent-sized sheet of graph paper, and divide up the columns into time blocks. Depending on how long your rides are - are we talking about a mean time of 5 minutes or an hour - you might use different sized blocks. Let's say each column is a block of two minutes. Pick a color for route A and a different color for route B, and after each ride, make a dot in the appropriate column. If there's already a dot of that color, move up one row. In other words, this would be a histogram in absolute numbers.

Then, you would be building a fun histogram with each ride you take, and can visually see the difference between the two routes.
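The graph-paper histogram can also be sketched in R. The ride times below are simulated, not real data. The Gamma parameters are borrowed from the other answer purely to obtain right-skewed times, and the two-minute column width follows the example above.

```r
# Simulated ride times in minutes, for illustration only (not real data).
set.seed(1)
route_a <- rgamma(40, shape = 25, scale = 30 / 25)
route_b <- rgamma(40, shape = 20, scale = 36 / 20)

# Two-minute columns spanning both routes, as on the graph paper
lo     <- 2 * floor(min(route_a, route_b) / 2)
hi     <- 2 * ceiling(max(route_a, route_b) / 2)
breaks <- seq(lo, hi, by = 2)

# Number of rides falling in each column, per route
count_a <- hist(route_a, breaks = breaks, plot = FALSE)$counts
count_b <- hist(route_b, breaks = breaks, plot = FALSE)$counts

# Side-by-side bars: one colour per route, height = number of rides
barplot(rbind(A = count_a, B = count_b), beside = TRUE,
        names.arg = head(breaks, -1), col = c("steelblue", "orange"),
        xlab = "Ride time (start of 2-minute block)", ylab = "Rides",
        legend.text = TRUE)
```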

My sense based on my own experience as a bike commuter (not verified through quantification) is that the times will not be normally distributed - they would have a positive skew, or in other words a long tail of upper-end times. My typical time is not that much longer than my shortest possible time, but every now and then I seem to hit all the red lights, and there's a much higher upper-end. Your experience may be different. That's why I think the histogram approach might be better, so you can observe the shape of the distribution yourself.

PS: I don't have enough rep to comment in this forum, but I love whuber's answer! He addresses my concern about skewness pretty effectively with a sample analysis. And I like the idea of calculating in your head to keep your mind off the next hill :)


1
+1 For creativity. Actually, your idea is on the path toward practical utility. It would be quite a bit more interesting to use one of the biking tracking sites (I forget which one now, but do add, if you know) to track segment times. If the OP were to come back to CV or StackOverflow with a question about plotting segment time and get a density associated with it, it would be a fabulous statistical exercise - GIS, statistical visualization, and density functions, oh my! :)
Iterator

1
I have used Google MyTracks on my phone to track biking segments. I find that the phone is not great at it as it tends to be a power-suck on a device not optimized for it. Garmin (and others) make GPS devices specifically targeted at runners and bikers to track time spent on routes and provide neat charts in an online interface. I don't use a dedicated GPS device myself, but some of my friends use them to share routes on facebook.
Jonathan

1
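Here is an example of what a Garmin device produces. The problem with the charts is that they have already been heavily preprocessed, smoothed, and so on. There is also no convenient way to import the data into R, for example. But as a dedicated device it does its job admirably, and I cannot imagine running or cycling without it.
mpiktas 2011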

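+1 Note that hitting red lights will not create much skew (unless they are timed): overall, they usually just add some Gaussian noise to the time distribution. (Calculating its variance is another mental exercise you can do on the next hill.) Actually, the skew comes from non-Gaussian variation in a few important factors that govern the whole ride: the weather, how you're feeling, who you're riding with, and the occasional accident/detour/traffic jam, etc.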
whuber

Now that I think about it some more, another very important factor is the time of day. The traffic lights act very differently at peak traffic times - much longer greens for the higher-traffic road. In off-peak times, the lights tend to cycle quickly, defaulting to green for the high-traffic road, but quickly changing when I press the crossing button or a car activates the sensor.
Jonathan

5

Suppose the two data sets are X and Y. Randomly sample one person from each population, giving you x,y. Record a '1' if x>y and 0 otherwise. Repeat this many times (say, 10000) and the mean of these indicators will give you an estimate of P(Xi>Yj) where i,j are randomly selected subjects from the two populations, respectively. In R, the code would go something like:

#X, Y are the two data sets
ii = rep(0,10000)
for(k in 1:10000)
{
   x1 = sample(X,1)
   y1 = sample(Y,1)
   ii[k] = (x1>y1) 
}

# this is an estimate of P(X>Y)
mean(ii)

This is a good answer, but you could simplify it by removing the for loop: let x1 = sample(X, 10000, replace = TRUE) and y1 = sample(Y, 10000, replace = TRUE) and then calculate mean(x1 > y1) along with mean(x1 == y1) - to get a sense of the # of times the values are equal.
Iterator

Thanks. I knew the loop was unnecessary but I wanted the logic underlying the approach to be abundantly clear. Your code would certainly produce the same results.
Macro
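For reference, the loop-free version suggested in the comments might look like the sketch below. The data vectors are made up for illustration. The `outer` comparison is an addition that is not part of the original answer: it compares every pair once and so gives the proportion that the simulation is estimating, without Monte Carlo noise.

```r
# Made-up ride times in minutes, purely for illustration
X <- c(28, 31, 25, 36, 30, 27, 33)
Y <- c(35, 41, 30, 38, 44, 29, 36)

set.seed(42)
n  <- 10000
x1 <- sample(X, n, replace = TRUE)
y1 <- sample(Y, n, replace = TRUE)
p_sim  <- mean(x1 > y1)    # Monte Carlo estimate of P(X > Y)
p_ties <- mean(x1 == y1)   # how often the two draws are equal

# The same quantity without simulation noise: compare every pair once
p_exact <- mean(outer(X, Y, ">"))

c(simulated = p_sim, exact = p_exact, ties = p_ties)
```

Both versions estimate the probability for the observed samples only, so they are only as good as the samples are representative of future rides.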
Licensed under cc by-sa 3.0 with attribution required.