Is the observed information matrix a consistent estimator of the expected information matrix?


16

I am trying to prove that the observed information matrix evaluated at the weakly consistent maximum likelihood estimator (MLE) is a weakly consistent estimator of the expected information matrix. This is a widely cited result, but nobody gives a reference or a proof (I have exhausted, I think, the first 20 pages of Google results and my statistics textbooks)!

Using a weakly consistent sequence of MLEs, I thought I could use the weak law of large numbers (WLLN) and the continuous mapping theorem to obtain the desired result. However, I now believe the continuous mapping theorem cannot be used here; instead, I think a uniform law of large numbers (ULLN) is needed. Does anyone know of a reference that proves this? I have an attempt at using the ULLN, but omit it for now for brevity.

I apologise for the length of this question, but notation has to be introduced. The notation is set out below (my proof attempt is at the end).

Assume we have an i.i.d. sample of random variables $\{Y_1, \ldots, Y_N\}$ with density $f(\tilde{Y}|\theta)$, where $\theta \in \Theta \subseteq \mathbb{R}^k$ (here $\tilde{Y}$ is just a generic random variable with the same density as any member of the sample). The vector $Y = (Y_1, \ldots, Y_N)^T$ is the vector of all the sample vectors, where $Y_i \in \mathbb{R}^n$ for all $i = 1, \ldots, N$. The true parameter value of the density is $\theta_0$, and $\hat{\theta}_N(Y)$ is the weakly consistent maximum likelihood estimator (MLE) of $\theta_0$. Subject to regularity conditions, the Fisher information matrix can be written as

$$I(\theta) = -E_\theta\big[H_\theta(\log f(\tilde{Y}|\theta))\big]$$

where $H_\theta$ is the Hessian matrix. The sample equivalent is

$$I_N(\theta) = \sum_{i=1}^N I_{y_i}(\theta),$$

where $I_{y_i}(\theta) = -E_\theta\big[H_\theta(\log f(Y_i|\theta))\big]$. The observed information matrix is

$$J(\theta) = -H_\theta(\log f(y|\theta))$$

(some people require the matrix to be evaluated at $\hat{\theta}$, but some do not). The sample observed information matrix is

$$J_N(\theta) = \sum_{i=1}^N J_{y_i}(\theta)$$

where $J_{y_i}(\theta) = -H_\theta(\log f(y_i|\theta))$.
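To make the notation concrete, here is a small worked example (my own illustrative choice, not essential to the question): take $f(y|\theta)$ to be the Poisson($\theta$) pmf, $f(y|\theta) = e^{-\theta}\theta^y/y!$, so $\log f(y|\theta) = -\theta + y\log\theta - \log y!$. Then

$$J(y,\theta) = -H_\theta(\log f(y|\theta)) = \frac{y}{\theta^2}, \qquad I(\theta) = E_\theta\!\left[\frac{Y}{\theta^2}\right] = \frac{1}{\theta},$$

which shows that the observed information depends on the data while the expected information does not.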

I can prove convergence in probability of $N^{-1}J_N(\theta)$ to $I(\theta)$ at a fixed $\theta$, but not of $N^{-1}J_N(\hat{\theta}_N(Y))$ to $I(\theta_0)$. Here is my proof so far:

Now $(J_N(\theta))_{rs} = \sum_{i=1}^N \big({-H_\theta(\log f(Y_i|\theta))}\big)_{rs}$ is element $(r,s)$ of $J_N(\theta)$, for any $r,s = 1, \ldots, k$. If the sample is i.i.d., then by the weak law of large numbers (WLLN), the average of these summands converges in probability to $E_\theta\big[\big({-H_\theta(\log f(Y_1|\theta))}\big)_{rs}\big] = (I_{Y_1}(\theta))_{rs} = (I(\theta))_{rs}$. Thus $N^{-1}(J_N(\theta))_{rs} \xrightarrow{P} (I(\theta))_{rs}$ for all $r,s = 1, \ldots, k$, and so $N^{-1}J_N(\theta) \xrightarrow{P} I(\theta)$. Unfortunately we cannot simply conclude $N^{-1}J_N(\hat{\theta}_N(Y)) \xrightarrow{P} I(\theta_0)$ by using the continuous mapping theorem, since $N^{-1}J_N(\cdot)$ is not the same function as $I(\cdot)$.
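For intuition only (not a proof), here is a quick simulation sketch in the Poisson($\theta$) example above; the model choice and the code are my own, and they merely check numerically that both $N^{-1}J_N(\theta_0)$ and the plug-in quantity $N^{-1}J_N(\hat{\theta}_N)$ appear to approach $I(\theta_0) = 1/\theta_0$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0  # true parameter of the illustrative Poisson model

for N in (10, 100, 1_000, 10_000, 100_000):
    y = rng.poisson(theta0, size=N)
    theta_hat = y.mean()                  # MLE of the Poisson mean
    at_theta0 = y.sum() / theta0**2 / N   # N^{-1} J_N(theta_0) = ybar / theta_0^2
    at_mle = y.sum() / theta_hat**2 / N   # N^{-1} J_N(theta_hat) = 1 / ybar
    print(f"N={N:>6}  N^-1 J_N(theta_0)={at_theta0:.4f}  "
          f"N^-1 J_N(theta_hat)={at_mle:.4f}  I(theta_0)={1/theta0:.4f}")
```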

Any help on this would be greatly appreciated.



Does my answer below address your question?
Dapz

1
@Dapz Please accept my sincerest apologies for not replying to you until now - I made the mistake of assuming nobody would answer. Thank-you for your answer below - I have upvoted it since I can see it is most useful, however I need to spend a little time considering it. Thank-you for your time, and I will reply to your post below soon.
dandar

Answers:


7

I guess directly establishing some sort of uniform law of large numbers is one possible approach.

Here is another.

We want to show that $\frac{J_N(\theta_{MLE})}{N} \xrightarrow{P} I(\theta)$.

(As you said, we have by the WLLN that $\frac{J_N(\theta)}{N} \xrightarrow{P} I(\theta)$. But this doesn't directly help us.)

One possible strategy is to show that

$$\left|I(\theta) - \frac{J_N(\theta)}{N}\right| \xrightarrow{P} 0$$

and

$$\left|\frac{J_N(\theta_{MLE})}{N} - \frac{J_N(\theta)}{N}\right| \xrightarrow{P} 0.$$

If both of these results are true, then we can combine them (by the triangle inequality) to get

$$\left|I(\theta) - \frac{J_N(\theta_{MLE})}{N}\right| \xrightarrow{P} 0,$$

which is exactly what we want to show.

The first equation follows from the weak law of large numbers.

The second almost follows from the continuous mapping theorem, but unfortunately the function $g(\cdot)$ that we want to apply the CMT to changes with $N$: our $g$ is really $g_N(\theta) := \frac{J_N(\theta)}{N}$. So we cannot use the CMT.

(Comment: If you examine the proof of the CMT on Wikipedia, notice that the set $B_\delta$ they define in their proof now also depends on $N$ in our setting. We essentially need some sort of equicontinuity at $\theta$ over our functions $g_N(\theta)$.)

Fortunately, if you assume that the family $G = \{g_N : N = 1, 2, \ldots\}$ is stochastically equicontinuous at $\theta$, then it immediately follows that, for $\theta_{MLE} \xrightarrow{P} \theta$,

$$\left|g_N(\theta_{MLE}) - g_N(\theta)\right| \xrightarrow{P} 0.$$

(See here: http://www.cs.berkeley.edu/~jordan/courses/210B-spring07/lectures/stat210b_lecture_12.pdf for a definition of stochastic equicontinuity at θ, and a proof of the above fact.)

Therefore, assuming that G is SE at θ, your desired result holds true and the empirical Fisher information converges to the population Fisher information.

Now, the key question of course is, what sort of conditions do you need to impose on $G$ to get SE? It looks like one way to do this is to establish a Lipschitz condition on the entire class of functions $G$ (see here: http://econ.duke.edu/uploads/media_items/uniform-convergence-and-stochastic-equicontinuity.original.pdf ).
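To get a numerical feel for the decomposition above, here is a small sketch (my own, using a hypothetical Poisson($\theta$) model with $J_N(\theta) = \sum_i Y_i/\theta^2$ and $I(\theta) = 1/\theta$, not something taken from this answer); it just checks that both terms, the WLLN term and the equicontinuity term, shrink as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 2.0            # true parameter of the hypothetical Poisson model
I_theta = 1.0 / theta  # expected information for Poisson(theta)

for N in (100, 1_000, 10_000, 100_000):
    y = rng.poisson(theta, size=N)
    theta_mle = y.mean()                  # MLE
    g_at_theta = y.mean() / theta**2      # g_N(theta)     = J_N(theta) / N
    g_at_mle = y.mean() / theta_mle**2    # g_N(theta_MLE) = J_N(theta_MLE) / N
    term1 = abs(I_theta - g_at_theta)     # |I(theta) - J_N(theta)/N|          (WLLN term)
    term2 = abs(g_at_mle - g_at_theta)    # |J_N(theta_MLE)/N - J_N(theta)/N|  (SE term)
    print(f"N={N:>6}  WLLN term={term1:.5f}  equicontinuity term={term2:.5f}")
```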


1

The answer above using stochastic equicontinuity works very well, but here I am answering my own question by using a uniform law of large numbers to show that the observed information matrix is a strongly consistent estimator of the expected information matrix, i.e. $N^{-1}J_N(\hat{\theta}_N(Y)) \xrightarrow{a.s.} I(\theta_0)$, if we plug in a strongly consistent sequence of estimators. I hope it is correct in all details.

We will use $I_N = \{1, 2, \ldots, N\}$ as an index set, and let us temporarily adopt the notation $J(\tilde{Y},\theta) := J(\theta)$ in order to be explicit about the dependence of $J(\theta)$ on the random vector $\tilde{Y}$. We shall also work elementwise with $(J(\tilde{Y},\theta))_{rs}$ and $(J_N(\theta))_{rs} = \sum_{i=1}^N (J(Y_i,\theta))_{rs}$, $r,s = 1, \ldots, k$, for this discussion. The function $(J(\cdot,\theta))_{rs}$ is real-valued on the set $\mathbb{R}^n \times \Theta$, and we will suppose that it is Lebesgue measurable for every $\theta \in \Theta$. A uniform (strong) law of large numbers defines a set of conditions under which

$$\sup_{\theta\in\Theta}\left|N^{-1}(J_N(\theta))_{rs} - E_\theta\big[(J(Y_1,\theta))_{rs}\big]\right| = \sup_{\theta\in\Theta}\left|N^{-1}\sum_{i=1}^N (J(Y_i,\theta))_{rs} - (I(\theta))_{rs}\right| \xrightarrow{a.s.} 0 \qquad (1)$$

The conditions that must be satisfied in order that (1) holds are: (a) $\Theta$ is a compact set; (b) $(J(\tilde{Y},\theta))_{rs}$ is a continuous function on $\Theta$ with probability 1; (c) for each $\theta \in \Theta$, $(J(\tilde{Y},\theta))_{rs}$ is dominated by a function $h(\tilde{Y})$, i.e. $|(J(\tilde{Y},\theta))_{rs}| < h(\tilde{Y})$; and (d) for each $\theta \in \Theta$, $E_\theta[h(\tilde{Y})] < \infty$. These conditions come from Jennrich (1969, Theorem 2).
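As a quick sanity check on what (a)-(d) require, consider a hypothetical Poisson($\theta$) example (my own illustrative choice, not from the post) with $\Theta = [a, b] \subset (0, \infty)$ compact: there $(J(\tilde{Y},\theta)) = \tilde{Y}/\theta^2$, which is continuous in $\theta$ with probability 1, is dominated by $h(\tilde{Y}) = \tilde{Y}/a^2$, and $E_\theta[h(\tilde{Y})] = \theta/a^2 < \infty$ for every $\theta \in \Theta$, so all four conditions hold.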

Now for any $y_i \in \mathbb{R}^n$, $i \in I_N$, and any $\theta \in S \subseteq \Theta$, the following inequality obviously holds:

$$\left|N^{-1}\sum_{i=1}^N (J(y_i,\theta))_{rs} - (I(\theta))_{rs}\right| \leq \sup_{\theta\in S}\left|N^{-1}\sum_{i=1}^N (J(y_i,\theta))_{rs} - (I(\theta))_{rs}\right|. \qquad (2)$$

Suppose that $\{\hat{\theta}_N(Y)\}$ is a strongly consistent sequence of estimators for $\theta_0$, and let $\Theta_{N_1} = B_{\delta_{N_1}}(\theta_0) \subseteq K \subseteq \Theta$ be an open ball in $\mathbb{R}^k$ with radius $\delta_{N_1} \to 0$ as $N_1 \to \infty$, and suppose $K$ is compact. Then since $\hat{\theta}_N(Y) \in \Theta_{N_1}$ for $N$ sufficiently large, we have $P\big[\lim_{N\to\infty}\{\hat{\theta}_N(Y) \in \Theta_{N_1}\}\big] = 1$ for sufficiently large $N$. Together with (2) this implies

$$P\left[\lim_{N\to\infty}\left\{\left|N^{-1}\sum_{i=1}^N \big(J(Y_i,\hat{\theta}_N(Y))\big)_{rs} - \big(I(\hat{\theta}_N(Y))\big)_{rs}\right| \leq \sup_{\theta\in\Theta_{N_1}}\left|N^{-1}\sum_{i=1}^N (J(Y_i,\theta))_{rs} - (I(\theta))_{rs}\right|\right\}\right] = 1. \qquad (3)$$

Now $\Theta_{N_1} \subseteq \Theta$ implies that conditions (a)-(d) of Jennrich (1969, Theorem 2) apply to $\Theta_{N_1}$. Thus (1) and (3) imply

$$P\left[\lim_{N\to\infty}\left\{\left|N^{-1}\sum_{i=1}^N \big(J(Y_i,\hat{\theta}_N(Y))\big)_{rs} - \big(I(\hat{\theta}_N(Y))\big)_{rs}\right| = 0\right\}\right] = 1. \qquad (4)$$

Since $\big(I(\hat{\theta}_N(Y))\big)_{rs} \xrightarrow{a.s.} \big(I(\theta_0)\big)_{rs}$, (4) implies that $N^{-1}\big(J_N(\hat{\theta}_N(Y))\big)_{rs} \xrightarrow{a.s.} \big(I(\theta_0)\big)_{rs}$. Note that (3) holds however small $\Theta_{N_1}$ is, and so the result in (4) is independent of the choice of $N_1$, other than that $N_1$ must be chosen such that $\Theta_{N_1} \subseteq \Theta$. This result holds for all $r,s = 1, \ldots, k$, and so in terms of matrices we have $N^{-1}J_N(\hat{\theta}_N(Y)) \xrightarrow{a.s.} I(\theta_0)$.
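For a numerical feel for the uniform convergence in (1), here is a simulation sketch (my own, using the hypothetical Poisson($\theta$) example from earlier, so $N^{-1}\sum_i J(Y_i,\theta) = \bar{Y}/\theta^2$). Note that the centering term $E[(J(Y_1,\theta))_{rs}]$ is computed with $Y_1$ drawn under the data-generating parameter $\theta_0$, i.e. it equals $\theta_0/\theta^2$ here; the approximate supremum over a grid on a compact $\Theta$ shrinks as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 2.0                              # data-generating parameter
theta_grid = np.linspace(0.5, 5.0, 451)   # grid approximating a compact Theta = [0.5, 5]

for N in (100, 1_000, 10_000, 100_000, 1_000_000):
    y = rng.poisson(theta0, size=N)
    empirical = y.mean() / theta_grid**2  # N^{-1} sum_i J(Y_i, theta) = ybar / theta^2
    centering = theta0 / theta_grid**2    # E[J(Y_1, theta)] with Y_1 ~ Poisson(theta0)
    sup_gap = np.abs(empirical - centering).max()
    print(f"N={N:>7}  approx sup_theta |N^-1 J_N(theta) - E[J(Y_1,theta)]| = {sup_gap:.6f}")
```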
