Estimating n in the coupon collector's problem


14

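In a variation on the coupon collector's problem, you do not know the number of coupons and must determine it from data. I will refer to this as the fortune cookie problem:

Given an unknown number $n$ of distinct fortune cookie messages, estimate $n$ by sampling cookies one at a time and counting how many times each fortune appears. Also determine the number of samples needed to obtain a desired confidence interval on this estimate.

Basically, I need an algorithm that samples just enough data to reach a given confidence interval, say $n \pm 5$ with 95% confidence. For simplicity, we can assume that all fortunes appear with equal probability/frequency, but this is not true of the more general problem, so solutions to that are welcome too.

This seems similar to the German tank problem, but here the fortune cookies are not labeled sequentially, so there is no ordering.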


1
Do we know whether the messages are equally frequent?
Glen_b -Reinstate Monica 2014

Edited the question: yes.
-goweon

2
Can you write down the likelihood function?
2014

2
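People who study wildlife capture, tag, and release animals. Later they infer the size of the population from how often they recapture already-tagged animals. It sounds like your problem is mathematically equivalent to theirs.
Emil Friedman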

Answers:


6

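For the equal probability/frequency case, this approach may work for you.

Let $K$ be the total sample size, $N$ the number of different items observed, $N_1$ the number of items seen exactly once, $N_2$ the number of items seen exactly twice, $A = N_1\left(1 - \frac{N_1}{K}\right) + 2N_2$, and $\hat{Q} = \frac{N_1}{K}$.

Then an approximate 95% confidence interval on the total population size $n$ is given by

$$\hat{n}_{\text{Lower}} = \frac{1}{1 - \hat{Q} + \frac{1.96\sqrt{A}}{K}}$$

$$\hat{n}_{\text{Upper}} = \frac{1}{1 - \hat{Q} - \frac{1.96\sqrt{A}}{K}}$$

When implementing, you may need to adjust these depending on your data.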
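As a rough illustration of the formulas above, the following R sketch computes $\hat{Q}$, $A$, and the two bounds from a vector of per-item counts. The function name `coverage_interval` and its interface are hypothetical. The `lower` and `upper` values follow the formulas exactly as written. The `*_scaled` values are an assumption that is not part of the answer: with equally likely items, the coverage $1 - \hat{Q}$ estimates $N/n$, so multiplying by $N$ is one possible adjustment to put the bounds on the scale of $n$.

```r
# Hypothetical helper (name and interface are illustrative, not from the answer).
# `counts` holds how many times each observed item was seen, e.g. table(sample).
coverage_interval <- function(counts, z = 1.96) {
  counts <- counts[counts > 0]
  K  <- sum(counts)                  # total sample size
  N  <- length(counts)               # number of distinct items observed
  N1 <- sum(counts == 1)             # items seen exactly once
  N2 <- sum(counts == 2)             # items seen exactly twice
  Q  <- N1 / K                       # Q-hat
  A  <- N1 * (1 - N1 / K) + 2 * N2
  half <- z * sqrt(A) / K            # term added to / subtracted from 1 - Q
  lower <- 1 / (1 - Q + half)
  # the upper bound is undefined when its denominator is not positive
  upper <- if (1 - Q - half > 0) 1 / (1 - Q - half) else Inf
  list(K = K, N = N, N1 = N1, N2 = N2, Q = Q, A = A,
       lower = lower, upper = upper,
       # ASSUMPTION (not in the answer): scale by N to express bounds in units of n
       lower_scaled = N * lower, upper_scaled = N * upper)
}

# Simulated data for illustration: 50 equally likely fortunes, 120 cookies sampled
set.seed(1)
draws <- sample.int(50, size = 120, replace = TRUE)
str(coverage_interval(as.vector(table(draws))))
```

For small samples with many singletons the denominator of the upper bound can become non-positive, in which case this sketch reports `Inf` rather than a negative number.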

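The method is due to Good and Turing. A reference with the confidence interval is Esty, Warren W. (1983), "A Normal Limit Law for a Nonparametric Estimator of the Coverage of a Random Sample", Ann. Statist., Volume 11, Number 3, 905-912.

For the more general problem, Bunge has produced free software that generates several estimates. Search for his name together with the word CatchAll.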


1
I took the liberty of adding the Esty reference. Please double-check that it is the one you meant.
Glen_b -Reinstate Monica

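@soakley Is it possible to get bounds (probably less precise ones) if we only know $K$ (the sample size) and $N$ (the number of unique items seen)? That is, we have no information about $N_1$ and $N_2$.
Basj

I don't know of a way to do it with just $K$ and $N$.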
soakley

2

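I don't know whether this helps, but this is the problem of obtaining $k$ distinct balls in $n$ trials, with replacement, from an urn containing $m$ differently labelled balls. According to this page (in French), if $X_n$ is the random variable counting the number of different balls drawn, its probability function is:

$$P(X_n = k) = \binom{m}{k} \sum_{i=0}^{k} (-1)^{k-i} \binom{k}{i} \left(\frac{i}{m}\right)^n$$

Then you can use a maximum likelihood estimator.

Another formula, with proof, is given here for solving the occupancy problem.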
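To make the suggestion concrete, here is a minimal R sketch that evaluates the probability function above and picks the $m$ that maximises it over a grid. The names `occupancy_pmf` and `mle_grid` and the cap `m_max` are hypothetical choices for illustration. The alternating sum can lose precision for large $n$ and $k$, so this is only meant for small examples.

```r
# P(X_n = k): exactly k distinct balls in n draws from m equally likely balls
occupancy_pmf <- function(k, n, m) {
  if (k > m || k > n) return(0)      # impossible outcomes
  i <- 0:k
  choose(m, k) * sum((-1)^(k - i) * choose(k, i) * (i / m)^n)
}

# Hypothetical grid-search MLE of m for observed k and n; m_max is an arbitrary cap
mle_grid <- function(k, n, m_max = 200) {
  m <- k:m_max
  lik <- sapply(m, function(mm) occupancy_pmf(k, n, mm))
  m[which.max(lik)]
}

# Sanity check: the probabilities over all possible k should sum to 1
sum(sapply(1:5, occupancy_pmf, n = 5, m = 4))

# Estimate m after seeing 7 distinct messages in 10 cookies
mle_grid(k = 7, n = 10)
```

If the maximiser lands on `m_max`, the cap is too small for the data and should be raised.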


2

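Likelihood function and probability

In an answer to a question about the reverse birthday problem, Cody Maughan gave a solution for the likelihood function.

When we draw $k$ different fortune cookies in $n$ draws (where every fortune cookie type is equally likely to appear in a draw), the likelihood function for the number of fortune cookie types $m$ can be expressed as:

$$\begin{aligned}
\mathcal{L}(m \mid k,n) = m^{-n}\frac{m!}{(m-k)!} &\propto P(k \mid m,n) \\
&= m^{-n}\frac{m!}{(m-k)!} \cdot \underbrace{S(n,k)}_{\text{Stirling number of the 2nd kind}} \\
&= m^{-n}\frac{m!}{(m-k)!} \cdot \frac{1}{k!}\sum_{i=0}^{k} (-1)^{i}\binom{k}{i}(k-i)^n \\
&= \binom{m}{k}\sum_{i=0}^{k} (-1)^{i}\binom{k}{i}\left(\frac{k-i}{m}\right)^n
\end{aligned}$$

For a derivation of the probability on the right-hand side, see the occupancy problem. This has been described before on this site by Ben. The expression is similar to the one in Sylvain's answer.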

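Maximum likelihood estimate

We can compute first-order and second-order approximations of the location of the maximum of the likelihood function:

$$m_1 \approx \frac{\binom{n}{2}}{n-k}$$

$$m_2 \approx \frac{\binom{n}{2} + \sqrt{\binom{n}{2}^2 - 4(n-k)\binom{n}{3}}}{2(n-k)}$$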
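The two approximations can be compared with a brute-force maximiser of $m^{-n}\frac{m!}{(m-k)!}$ over the integers. The sketch below is only an illustration: the helper names (`loglik`, `mle_exact`, `m1`, `m2`) and the search cap `m_max` are hypothetical, and the example values $n=200$, $k=150$ are chosen arbitrarily.

```r
# Log-likelihood of m up to an additive constant: log(m^-n * m! / (m - k)!)
loglik <- function(m, n, k) -n * log(m) + lfactorial(m) - lfactorial(m - k)

# Hypothetical brute-force maximiser over the integers k..m_max (m_max is an arbitrary cap)
mle_exact <- function(n, k, m_max = 1e5) {
  stopifnot(k < n)                 # for k == n the likelihood keeps increasing in m
  m <- k:m_max
  m[which.max(loglik(m, n, k))]
}

# First-order approximation
m1 <- function(n, k) choose(n, 2) / (n - k)

# Second-order approximation; returns NA when the square root is not real
m2 <- function(n, k) {
  disc <- choose(n, 2)^2 - 4 * (n - k) * choose(n, 3)
  if (disc < 0) return(NA_real_)
  (choose(n, 2) + sqrt(disc)) / (2 * (n - k))
}

n <- 200; k <- 150
c(grid = mle_exact(n, k), first_order = m1(n, k), second_order = m2(n, k))
```

When $k$ is small relative to $n$, the discriminant in the second-order formula can be negative, which is why the sketch returns `NA` in that case instead of raising a warning.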

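Likelihood interval

(Note: this is not the same as a confidence interval; see: The basic logic of constructing a confidence interval.)

This remains an open problem for me. I am not yet sure how to deal with the expression $m^{-n}\frac{m!}{(m-k)!}$.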
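Lacking a closed form, one brute-force option is to evaluate the expression on a grid of integer $m$ and keep every value whose likelihood is at least some fraction of the maximum. The sketch below does this in R. The helper name `likelihood_interval`, the cap `m_max`, and the cutoff of 0.15 (roughly $e^{-1.92}$) are all assumptions made for illustration, not something the answer specifies.

```r
# Log-likelihood of m up to an additive constant: log(m^-n * m! / (m - k)!)
loglik <- function(m, n, k) -n * log(m) + lfactorial(m) - lfactorial(m - k)

# Hypothetical helper: range of integer m whose relative likelihood is >= cutoff
likelihood_interval <- function(n, k, cutoff = 0.15, m_max = 1e5) {
  stopifnot(k < n)                      # otherwise the likelihood has no finite maximum
  m   <- k:m_max
  ll  <- loglik(m, n, k)
  rel <- exp(ll - max(ll))              # likelihood relative to its maximum
  keep <- m[rel >= cutoff]
  c(lower = min(keep), mle = m[which.max(ll)], upper = max(keep))
}

likelihood_interval(n = 200, k = 150)
```

If the reported upper end equals `m_max`, the interval has been truncated by the cap and the grid should be extended.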

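Confidence interval

For the confidence interval we can use a normal approximation. Ben's answer gives the following mean and variance:

$$E[K] = m\left(1 - \left(1 - \frac{1}{m}\right)^n\right)$$
$$V[K] = m\left((m-1)\left(1 - \frac{2}{m}\right)^n + \left(1 - \frac{1}{m}\right)^n - m\left(1 - \frac{1}{m}\right)^{2n}\right)$$

Say, for a given sample size $n=200$ and observed number of unique cookies $k$, the 95% boundaries $E[K] \pm 1.96\sqrt{V[K]}$ look like: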

[Figure: confidence interval boundaries]

In the image above the curves for the interval have been drawn by expressing the lines as a function of the population size m and sample size n (so the x-axis is the dependent variable in drawing these curves).

The difficulty is to invert this and obtain the interval values for a given observed value $k$. It can be done computationally, but there might be a more direct function.
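A minimal sketch of the computational route, assuming a simple grid search is acceptable: for an observed $k$, keep every $m$ for which $k$ falls inside $E[K] \pm 1.96\sqrt{V[K]}$ and report the smallest and largest such $m$. The helper names (`mean_K`, `var_K`, `invert_normal_ci`) and the cap `m_max` are hypothetical.

```r
# Mean and variance of the number of distinct types K, as quoted from Ben's answer
mean_K <- function(m, n) m * (1 - (1 - 1/m)^n)
var_K  <- function(m, n) m * ((m - 1) * (1 - 2/m)^n + (1 - 1/m)^n - m * (1 - 1/m)^(2 * n))

# Hypothetical helper: invert the normal-approximation band on a grid of m values
invert_normal_ci <- function(k, n, z = 1.96, m_max = 5000) {
  m <- k:m_max
  s <- sqrt(pmax(var_K(m, n), 0))     # guard against tiny negative rounding errors
  inside <- abs(k - mean_K(m, n)) <= z * s
  if (!any(inside)) return(c(lower = NA, upper = NA))
  c(lower = min(m[inside]), upper = max(m[inside]))
}

invert_normal_ci(k = 150, n = 200)
```

If the upper end equals `m_max`, the band has not closed within the grid and the cap should be raised; this tends to happen when $k$ is close to $n$.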

In the image I have also added Clopper-Pearson confidence intervals, based on a direct computation of the cumulative distribution from all the probabilities $P(k|m,n)$. (I did this in R, where I needed the Strlng2 function from the CryptRndTest package, which is an asymptotic approximation of the logarithm of the Stirling number of the second kind.) You can see that the boundaries coincide reasonably well, so the normal approximation performs well in this case.

# function to compute Probability
library("CryptRndTest")
P5 <- function(m,n,k) {
  exp(-n*log(m)+lfactorial(m)-lfactorial(m-k)+Strlng2(n,k))
}
P5 <- Vectorize(P5)

# function for expected value 
m4 <- function(m,n) {
  m*(1-(1-1/m)^n)
}

# function for variance
v4 <- function(m,n) {
  m*((m-1)*(1-2/m)^n+(1-1/m)^n-m*(1-1/m)^(2*n))
}


# compute 95% boundaries based on Clopper-Pearson intervals
# first a distribution is computed
# then the 2.5% and 97.5% boundaries of the cumulative values are located
simDist <- function(m,n,p=0.05) {
  k <- 1:min(n,m)
  dist <- P5(m,n,k)
  dist[is.na(dist)] <- 0
  dist[dist == Inf] <- 0
  c(max(which(cumsum(dist)<p/2))+1,
       min(which(cumsum(dist)>1-p/2))-1)
}


# some values for the example
n <- 200
m <- 1:5000
k <- 1:n

# compute the Clopper-Pearson intervals
res <- sapply(m, FUN = function(x) {simDist(x,n)})


# plot the maximum likelihood estimate
plot(m4(m,n),m,
     log="", ylab="estimated population size m", xlab = "observed uniques k",
     xlim =c(1,200),ylim =c(1,5000),
     pch=21,col=1,bg=1,cex=0.7, type = "l", yaxt = "n")
axis(2, at = c(0,2500,5000))

# add lines for confidence intervals based on normal approximation
lines(m4(m,n)+1.96*sqrt(v4(m,n)),m, lty=2)
lines(m4(m,n)-1.96*sqrt(v4(m,n)),m, lty=2)
# add lines for confidence intervals based on Clopper-Pearson
lines(res[1,],m,col=3,lty=2)
lines(res[2,],m,col=3,lty=2)

# add legend
legend(0,5100,
       c("MLE","95% interval\n(Normal Approximation)\n","95% interval\n(Clopper-Pearson)\n")
       , lty=c(1,2,2), col=c(1,1,3),cex=0.7,
       box.col = rgb(0,0,0,0))

For the case of unequal probabilities, you can approximate the number of cookies of each particular type as independent Binomial/Poisson distributed variables and describe whether each type is filled (seen at least once) or not as a Bernoulli variable. Then add together the means and variances of those variables. I guess that this is also how Ben derived/approximated the expectation and variance.

A problem is how to describe these different probabilities. You cannot do this explicitly, since you do not know the number of cookies.
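The Bernoulli bookkeeping described above can be sketched as follows, under the strong assumption that the vector of type probabilities `p` is known (which, as noted, is exactly what is missing in practice). The function name `unique_moments` and the example probabilities are hypothetical. In the equal-probability check the means agree, while the independence-based variance comes out larger than the fixed-$n$ formula; the independence holds exactly only when the total number of draws is itself Poisson distributed.

```r
# Hypothetical helper: p is an assumed-known vector of type probabilities (sums to 1)
unique_moments <- function(p, n) {
  q <- 1 - (1 - p)^n                  # P(type i is seen at least once in n draws)
  c(mean = sum(q),                    # exact, by linearity of expectation
    var_indep = sum(q * (1 - q)))     # variance if the indicators were independent
}

# Equal-probability check against the mean and variance quoted from Ben's answer
m <- 330; n <- 200
exact_mean <- m * (1 - (1 - 1/m)^n)
exact_var  <- m * ((m - 1) * (1 - 2/m)^n + (1 - 1/m)^n - m * (1 - 1/m)^(2 * n))
rbind(approx = unique_moments(rep(1/m, m), n),
      exact  = c(exact_mean, exact_var))

# A skewed, purely illustrative set of probabilities
p <- (1:50) / sum(1:50)
unique_moments(p, n = 100)
```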
Sextus Empiricus
Licensed under cc by-sa 3.0 with attribution required.