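See the Python script below, written for Python 2.
The answer was inspired by David C's answer.
My final answer is the probability of finding at least five Jacobs in one class, Jacob being the most likely name according to the 2006 file in the "National data" set at https://www.ssa.gov/oact/babynames/limits.html.
The probability is computed from a binomial distribution, with the relative frequency of Jacob as the success probability.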
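Before the script itself, the binomial step can be illustrated without SciPy. This is only a minimal sketch, assuming Python 3.8+ for `math.comb`; the function name `prob_at_least` and the value of `p` are hypothetical and are not taken from the SSA data.

```python
from math import comb


def prob_at_least(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p).

    Sums the upper tail directly, which avoids the rounding loss of
    computing 1 - P(X <= k - 1) when the result is very small.
    """
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Hypothetical share of one name among all births (an assumption, not SSA data).
p = 0.006
for class_size in (25, 50, 100):
    print(class_size, prob_at_least(5, class_size, p))
```

The script that follows does the same calculation with `scipy.stats.binom` on the real counts.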
import pandas as pd
from scipy.stats import binom
data = pd.read_csv(r"yob2006.txt", header=None, names=["Name", "Sex", "Count"])
# count of children in the dataset:
sumCount = data.Count.sum()
# do calculation for every name:
for i, row in data.iterrows():
    # relative count of each name, interpreted as its probability of occurrence
    data.loc[i, "probability"] = data.loc[i, "Count"]/float(sumCount)
    # probability of five or more children with that name in a class of size n=25, 50 or 100
    data.loc[i, "atleast5_class25"] = 1 - binom.cdf(4,25,data.loc[i, "probability"])
    data.loc[i, "atleast5_class50"] = 1 - binom.cdf(4,50,data.loc[i, "probability"])
    data.loc[i, "atleast5_class100"] = 1 - binom.cdf(4,100,data.loc[i, "probability"])
maxP25 = data["atleast5_class25"].max()
maxP50 = data["atleast5_class50"].max()
maxP100 = data["atleast5_class100"].max()
print ("""Max. probability for at least five kids with same name out of 25: {:.2} for name {}"""
       .format(maxP25, data.loc[data.atleast5_class25==maxP25,"Name"].values[0]))
print
print ("""Max. probability for at least five kids with same name out of 50: {:.2} for name {}, of course."""
       .format(maxP50, data.loc[data.atleast5_class50==maxP50,"Name"].values[0]))
print
print ("""Max. probability for at least five kids with same name out of 100: {:.2} for name {}, of course."""
       .format(maxP100, data.loc[data.atleast5_class100==maxP100,"Name"].values[0]))
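Max. probability for at least five kids with same name out of 25: 4.7e-07 for name Jacob
Max. probability for at least five kids with same name out of 50: 1.6e-05 for name Jacob, of course.
Max. probability for at least five kids with same name out of 100: 0.00045 for name Jacob, of course.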
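The row-by-row loop in the script can probably be replaced by column-wise operations, which should be much faster on a file with tens of thousands of names. The sketch below is one possible way to do that, not the original method: the helper name `add_tail_probabilities` and the small `demo` table are made up for illustration, and `binom.sf(k - 1, n, p)` is used as the equivalent of `1 - binom.cdf(k - 1, n, p)`.

```python
import pandas as pd
from scipy.stats import binom


def add_tail_probabilities(data, class_sizes=(25, 50, 100), k=5):
    """Add per-name probabilities of at least k same-named children per class size."""
    out = data.copy()
    # relative count of each name, interpreted as its probability of occurrence
    out["probability"] = out["Count"] / float(out["Count"].sum())
    for n in class_sizes:
        # binom.sf(k - 1, n, p) is P(X >= k), evaluated for the whole column at once
        out["atleast%d_class%d" % (k, n)] = binom.sf(k - 1, n, out["probability"].values)
    return out


# Tiny made-up table in the same Name/Sex/Count layout (not real SSA counts).
demo = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Sex": ["M", "F", "M"], "Count": [600, 300, 100]}
)
result = add_tail_probabilities(demo)
best = result.loc[result["atleast5_class25"].idxmax(), "Name"]
print(best)
```

With the real file, `demo` would be replaced by the `data` frame read from `yob2006.txt`.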
This differs from David C's result by a factor of ten. Thanks. (My answer does not account for all names, which should be discussed.)
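One possible way to take every name into account is to sum the per-name tail probabilities. By the union bound that sum is only an upper bound on the probability that some name occurs at least five times in a class, not the exact value. The sketch below assumes Python 3.8+ and independent draws per child; the function names and the `shares` values are hypothetical, not SSA data.

```python
from math import comb


def tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def any_name_upper_bound(shares, n, k=5):
    """Union bound on P(some name occurs at least k times among n children)."""
    return min(1.0, sum(tail(k, n, p) for p in shares))


# Hypothetical name shares (assumed values, not SSA data); they need not sum
# to 1 because rarer names are simply left out here.
shares = [0.006, 0.005, 0.005, 0.004]
top_only = tail(5, 25, max(shares))
bound = any_name_upper_bound(shares, 25)
print(top_only, bound)
```

The bound is never smaller than the single-name value, so looking only at the most common name understates the chance of seeing any five children with the same name.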