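Likelihood function and probability
In an answer to a question about the reverse birthday problem, Cody Maughan gave a solution for the likelihood function.
When we draw $k$ different fortune cookies in $n$ draws (where each fortune cookie type has the same probability of appearing in a draw), the likelihood function for the number of cookie types $m$ can be expressed as: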
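$$
\begin{aligned}
\mathcal{L}(m \mid k,n) = m^{-n}\frac{m!}{(m-k)!} \propto P(k \mid m,n)
&= m^{-n}\frac{m!}{(m-k)!}\cdot\underbrace{S(n,k)}_{\text{Stirling number of the 2nd kind}} \\
&= m^{-n}\frac{m!}{(m-k)!}\cdot\frac{1}{k!}\sum_{i=0}^{k}(-1)^i\binom{k}{i}(k-i)^n \\
&= \binom{m}{k}\sum_{i=0}^{k}(-1)^i\binom{k}{i}\left(\frac{k-i}{m}\right)^n
\end{aligned}
$$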
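For a derivation of the probability on the right-hand side, see the occupancy problem, which Ben has described previously on this site. The expression is similar to the one in Sylvain's answer.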
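As a quick sanity check on the last form of the probability, the inclusion-exclusion sum can be evaluated directly in base R. This is only a sketch for small $n$: the function name `p_distinct` and the example values $m=20$, $n=10$ are illustrative assumptions, and the alternating sum loses precision for large $n$.

```r
# P(k | m, n): probability of seeing exactly k distinct types
# in n equiprobable draws from m types (inclusion-exclusion form).
# Hypothetical helper; only numerically reliable for small n.
p_distinct <- function(k, m, n) {
  if (k < 1 || k > m || k > n) return(0)
  i <- 0:k
  choose(m, k) * sum((-1)^i * choose(k, i) * ((k - i) / m)^n)
}

# Illustrative values: m = 20 types, n = 10 draws
probs <- sapply(1:10, p_distinct, m = 20, n = 10)
sum(probs)  # should equal 1 up to floating-point error
```

For $k=1$ the sum collapses to $m \cdot m^{-n}$, which gives an easy check by hand.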
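Maximum likelihood estimate
We can compute first- and second-order approximations of the location of the maximum of the likelihood function: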
$$m_1 \approx \frac{\binom{n}{2}}{n-k}$$
$$m_2 \approx \frac{\binom{n}{2}+\sqrt{\binom{n}{2}^2-4(n-k)\binom{n}{3}}}{2(n-k)}$$
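To get a feel for how these two approximations behave, they can be compared with a brute-force maximization of the log-likelihood $-n\log m + \log m! - \log (m-k)!$ over integer $m$. The sketch below assumes illustrative values $n=200$, $k=150$ and a search grid up to 5000; the function and variable names are hypothetical.

```r
# Log-likelihood of m, up to a constant that does not depend on m
log_lik <- function(m, k, n) -n * log(m) + lfactorial(m) - lfactorial(m - k)

# First- and second-order approximations from the formulas above
# (assumes k < n and a non-negative term under the square root)
mle_approx <- function(k, n) {
  a  <- n - k
  b  <- choose(n, 2)
  c3 <- choose(n, 3)
  c(m1 = b / a, m2 = (b + sqrt(b^2 - 4 * a * c3)) / (2 * a))
}

# Illustrative values, not taken from any data set
n_ex <- 200
k_ex <- 150
approx_ex <- mle_approx(k_ex, n_ex)

# Brute-force maximum over an integer grid (grid limit is an assumption)
m_grid <- k_ex:5000
m_hat  <- m_grid[which.max(log_lik(m_grid, k_ex, n_ex))]

approx_ex
m_hat
```

The grid search is only meaningful for $k<n$; when $k=n$ the likelihood keeps increasing with $m$ and there is no finite maximum.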
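Likelihood interval
(Note that this is not the same as a confidence interval; see: The basic logic of constructing a confidence interval.)
This remains an open problem for me. I am not yet sure how to deal with the expression $m^{-n}\frac{m!}{(m-k)!}$.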
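Without a closed form, one brute-force option is to evaluate the relative likelihood on a grid of $m$ and keep the values above a cutoff. The sketch below is not a solution to the open problem; the cutoff of 1/8, the grid limit, and the name `lik_interval` are assumptions chosen for illustration.

```r
# Log-likelihood of m, up to a constant that does not depend on m
log_lik <- function(m, k, n) -n * log(m) + lfactorial(m) - lfactorial(m - k)

# Hypothetical helper: range of integer m whose likelihood relative to the
# maximum on the grid is at least `cutoff` (1/8 is an arbitrary choice here)
lik_interval <- function(k, n, cutoff = 1/8, m_max = 1e5) {
  m  <- k:m_max
  ll <- log_lik(m, k, n)
  range(m[ll - max(ll) >= log(cutoff)])
}

# Illustrative values: n = 200 draws, k = 150 distinct types observed
li_ex <- lik_interval(k = 150, n = 200)
li_ex
```

If the upper end of the result equals `m_max`, the grid was too short and the interval is truncated.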
Confidence interval
For the confidence interval we can use a normal approximation. In Ben's answer the following mean and variance are given:
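$$E[K] = m\left(1-\left(1-\frac{1}{m}\right)^n\right)$$
$$V[K] = m\left((m-1)\left(1-\frac{2}{m}\right)^n+\left(1-\frac{1}{m}\right)^n-m\left(1-\frac{1}{m}\right)^{2n}\right)$$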
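The mean and variance can be checked against a small simulation. This is a sketch under assumed values ($m=300$, $n=200$, 5000 replications, fixed seed); the names `expected_k` and `var_k` are hypothetical.

```r
# Mean and variance of the number of distinct types K, as given above
expected_k <- function(m, n) m * (1 - (1 - 1/m)^n)
var_k <- function(m, n) {
  m * ((m - 1) * (1 - 2/m)^n + (1 - 1/m)^n - m * (1 - 1/m)^(2 * n))
}

# Simulate: draw n items uniformly from m types, count the distinct ones
set.seed(1)
m_sim <- 300
n_sim <- 200
k_sim <- replicate(
  5000,
  length(unique(sample.int(m_sim, n_sim, replace = TRUE)))
)

# Simulated versus theoretical moments
c(mean_sim = mean(k_sim), mean_formula = expected_k(m_sim, n_sim))
c(var_sim  = var(k_sim),  var_formula  = var_k(m_sim, n_sim))
```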
Say, for a given sample size $n=200$ and observed number of unique cookies $k$, the 95% boundaries $E[K] \pm 1.96\sqrt{V[K]}$ look like:
In the image above the curves for the interval have been drawn by expressing the lines as a function of the population size m and sample size n (so the x-axis is the dependent variable in drawing these curves).
The difficulty is to invert this and obtain the interval boundaries for $m$ given an observed value $k$. It can be done computationally, but there might be a more direct function.
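A computational inversion can be sketched as a grid search: for an observed $k$, collect every $m$ whose 95% normal band contains $k$ and report the range. The grid limit of 5000, the example $k=150$, and the name `normal_ci_m` are assumptions; the approach relies on the bands being monotone in $m$ over the grid.

```r
# Mean and variance of the number of distinct types K
expected_k <- function(m, n) m * (1 - (1 - 1/m)^n)
var_k <- function(m, n) {
  m * ((m - 1) * (1 - 2/m)^n + (1 - 1/m)^n - m * (1 - 1/m)^(2 * n))
}

# Hypothetical helper: range of m (on a grid) for which the observed k
# lies inside E[K] +/- z * sqrt(V[K])
normal_ci_m <- function(k, n, m_max = 5000, z = 1.96) {
  m  <- k:m_max                      # m cannot be smaller than k
  mu <- expected_k(m, n)
  s  <- sqrt(pmax(var_k(m, n), 0))   # guard against tiny negative rounding
  range(m[which(k >= mu - z * s & k <= mu + z * s)])
}

# Illustrative values: n = 200 draws, k = 150 distinct types observed
ci_150 <- normal_ci_m(k = 150, n = 200)
ci_150
```

An upper limit equal to `m_max` would indicate that the grid is too short for that $k$, which happens when $k$ is close to $n$.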
In the image I have also added Clopper-Pearson confidence intervals based on a direct computation of the cumulative distribution from all the probabilities $P(k \mid m,n)$. I did this in R, where I needed the Strlng2 function from the CryptRndTest package, which is an asymptotic approximation of the logarithm of the Stirling number of the second kind. You can see that the boundaries coincide reasonably well, so the normal approximation performs well in this case.
# function to compute Probability
library("CryptRndTest")
P5 <- function(m,n,k) {
  exp(-n*log(m)+lfactorial(m)-lfactorial(m-k)+Strlng2(n,k))
}
P5 <- Vectorize(P5)
# function for expected value
m4 <- function(m,n) {
  m*(1-(1-1/m)^n)
}
# function for variance
v4 <- function(m,n) {
  m*((m-1)*(1-2/m)^n+(1-1/m)^n-m*(1-1/m)^(2*n))
}
# compute 95% boundaries based on Clopper-Pearson intervals
# first the distribution P(k|m,n) is computed for all possible k
# then the 2.5% and 97.5% boundaries of the cumulative values are located
simDist <- function(m,n,p=0.05) {
  k <- 1:min(n,m)
  dist <- P5(m,n,k)
  # treat numerically undefined or overflowing probabilities as zero
  dist[is.na(dist)] <- 0
  dist[dist == Inf] <- 0
  c(max(which(cumsum(dist)<p/2))+1,
    min(which(cumsum(dist)>1-p/2))-1)
}
# some values for the example
n <- 200
m <- 1:5000
k <- 1:n
# compute the Clopper-Pearson intervals
res <- sapply(m, FUN = function(x) {simDist(x,n)})
# plot the maximum likelihood estimate
plot(m4(m,n),m,
     log="", ylab="estimated population size m", xlab = "observed uniques k",
     xlim =c(1,200),ylim =c(1,5000),
     pch=21,col=1,bg=1,cex=0.7, type = "l", yaxt = "n")
axis(2, at = c(0,2500,5000))
# add lines for confidence intervals based on normal approximation
lines(m4(m,n)+1.96*sqrt(v4(m,n)),m, lty=2)
lines(m4(m,n)-1.96*sqrt(v4(m,n)),m, lty=2)
# add lines for confidence intervals based on Clopper-Pearson
lines(res[1,],m,col=3,lty=2)
lines(res[2,],m,col=3,lty=2)
# add legend
legend(0,5100,
       c("MLE","95% interval\n(Normal Approximation)\n","95% interval\n(Clopper-Pearson)\n"),
       lty=c(1,2,2), col=c(1,1,3),cex=0.7,
       box.col = rgb(0,0,0,0))
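As a cross-check on the Strlng2-based probabilities, the exact distribution of $K$ can also be built in base R with a one-step recurrence: after each draw the number of distinct types either stays at $j$ (probability $j/m$) or grows to $j+1$ (probability $(m-j)/m$). This is a different technique from the one used in the code above, offered as a sketch; the names `occupancy_dist` and `exact_bounds` are hypothetical, and the quantile convention in `exact_bounds` is simpler than the one in `simDist`.

```r
# Exact distribution of the number of distinct types K after n draws
# from m equiprobable types, via a draw-by-draw recurrence.
occupancy_dist <- function(m, n) {
  kmax <- min(m, n)
  p <- c(1, rep(0, kmax))   # p[j + 1] = P(K = j); start with K = 0
  for (t in seq_len(n)) {
    j    <- 0:kmax
    stay <- p * j / m       # the draw repeats an already-seen type
    # the draw is a new type: mass moves from K = j - 1 to K = j
    move <- c(0, p[-(kmax + 1)] * (m - (0:(kmax - 1))) / m)
    p <- stay + move
  }
  p[-1]                     # drop K = 0, so element k is P(K = k)
}

# Hypothetical helper: lower and upper p/2 quantiles of K for given m and n
exact_bounds <- function(m, n, p = 0.05) {
  cdf <- cumsum(occupancy_dist(m, n))
  c(lower = min(which(cdf >= p / 2)),
    upper = min(which(cdf >= 1 - p / 2)))
}

# Illustrative check: the probabilities sum to 1 and reproduce E[K]
d_ex <- occupancy_dist(20, 10)
sum(d_ex)
sum(seq_along(d_ex) * d_ex)   # compare with 20 * (1 - (1 - 1/20)^10)
```

Because every step only multiplies and adds non-negative numbers, this avoids the cancellation that affects the alternating-sum formula for large $n$.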