What kind of information is Fisher information?


29

Suppose we have a random variable $X \sim f(x|\theta)$. If $\theta_0$ is the true parameter, the likelihood function should be maximized and its derivative should be zero. This is the basic principle behind the maximum likelihood estimator.

As I understand it, the Fisher information is defined as

$$I(\theta) = \operatorname{E}\left[\left(\frac{\partial}{\partial\theta} \log f(X|\theta)\right)^{2}\right]$$

Thus, if $\theta_0$ is the true parameter, $I(\theta) = 0$. But if $\theta_0$ is not the true parameter, then we will have a larger Fisher information.

My questions

  1. Does Fisher information measure the "error" of the MLE? In other words, doesn't the existence of positive Fisher information imply that my MLE can't be ideal?
  2. How does this definition of "information" differ from the one Shannon used? Why do we call it information?

Why do you write $\operatorname{E}_\theta$? The expectation is over values of $X$ distributed as if they came from your distribution with parameter $\theta$.
Neil G

3
Likewise, the Fisher information $I(\theta)$ is not zero at the true parameter.
Neil G

$E(S)$ is zero (i.e., the expectation of the score function), but, as Neil G wrote, the Fisher information $V(S)$ is not (generally) zero.
Tal Galili
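A quick simulation sketch of the point made in these comments (my own illustration, not from the thread), assuming a normal model $X \sim N(\theta_0, 1)$ in Python/NumPy: at the true parameter the score averages to zero, while its variance, the Fisher information, does not.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0                                # true parameter (assumed for the sketch)
x = rng.normal(theta0, 1.0, size=100_000)   # observations X ~ N(theta0, 1)

# Score of a single observation for N(theta, 1):
# d/dtheta log f(x|theta) = x - theta, evaluated at the true theta0.
score = x - theta0

print(score.mean())         # ~ 0 : the expected score is zero at theta0
print((score ** 2).mean())  # ~ 1 : the Fisher information E[score^2] is not zero
```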

Answers:


15

Trying to complement the other answers... What kind of information is Fisher information? Start with the log-likelihood function
$$\ell(\theta) = \log f(x;\theta)$$
as a function of $\theta$ for $\theta \in \Theta$, the parameter space. Assuming some regularity conditions we do not discuss here, we have $\operatorname{E}_\theta \frac{\partial}{\partial\theta}\ell(\theta) = \operatorname{E}_\theta \dot{\ell}(\theta) = 0$ (we write derivatives with respect to the parameter as dots here). The variance is the Fisher information
$$I(\theta) = \operatorname{E}_\theta \big(\dot{\ell}(\theta)\big)^{2} = -\operatorname{E}_\theta \ddot{\ell}(\theta),$$
the last formula showing that it is the (negative) curvature of the log-likelihood function. One often finds the maximum likelihood estimator (mle) of $\theta$ by solving the likelihood equation $\dot{\ell}(\theta) = 0$. When the Fisher information, as the variance of the score $\dot{\ell}(\theta)$, is large, the solution of that equation will be very sensitive to the data, which gives hope for high precision of the mle. That is confirmed, at least asymptotically: the asymptotic variance of the mle is the inverse of the Fisher information.
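As a small check of that last claim (my own sketch, not part of the original answer), a Monte Carlo experiment with an exponential model, where the mle of the rate is $\hat\lambda = 1/\bar{x}$ and the Fisher information for a sample of size $n$ is $I(\lambda) = n/\lambda^{2}$:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, n_rep = 2.0, 200, 20_000             # true rate, sample size, repetitions

# Exponential(rate=lam) samples; the mle of the rate is 1/xbar.
x = rng.exponential(scale=1.0 / lam, size=(n_rep, n))
mle = 1.0 / x.mean(axis=1)

print(mle.var())      # sampling variance of the mle across repetitions (close to 0.02)
print(lam**2 / n)     # inverse Fisher information: lam^2 / n = 0.02
```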

How do we interpret this? $\ell(\theta)$ is the likelihood information about the parameter $\theta$ from the sample. This can really only be interpreted in a relative sense, as when we use it to compare the plausibilities of two distinct possible parameter values $\theta_0$ and $\theta_1$ through the likelihood ratio $\ell(\theta_0) - \ell(\theta_1)$. The rate of change of the log-likelihood is the score function $\dot{\ell}(\theta)$: it tells us how fast the likelihood changes, and its variance $I(\theta)$ tells us how much this varies from sample to sample at a given parameter value, say $\theta_0$. The equation (which is really surprising!)
$$I(\theta) = -\operatorname{E}_\theta \ddot{\ell}(\theta)$$
tells us there is a relationship (an equality) between the variability of the information (likelihood) for a given parameter value $\theta_0$ and the curvature of the likelihood function at that parameter value. This is a surprising relationship between the variability (variance) of the score statistic $\dot{\ell}(\theta)\big|_{\theta=\theta_0}$ and the expected change in the likelihood when we vary the parameter $\theta$ in some interval around $\theta_0$ (for the same data). This is really both strange, surprising and powerful!
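That equality between the variance of the score and the expected curvature can also be checked numerically. Here is a small sketch of my own (not from the answer), assuming a Poisson($\theta_0$) model, for which both sides equal $1/\theta_0$ per observation:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, n_rep = 4.0, 500_000

x = rng.poisson(theta0, size=n_rep)

# Poisson(theta): log f(x|theta) = x*log(theta) - theta - log(x!)
score = x / theta0 - 1.0        # first derivative of the log-likelihood at theta0
curvature = -x / theta0**2      # second derivative of the log-likelihood at theta0

print(score.var())          # ~ 0.25 = 1/theta0  (variance of the score)
print(-curvature.mean())    # ~ 0.25 = 1/theta0  (negative expected curvature)
```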

So what is the likelihood function? We usually think of the statistical model $\{f(x;\theta),\ \theta \in \Theta\}$ as a family of probability distributions for the data $x$, indexed by the parameter $\theta$, some element of the parameter space $\Theta$. We think of this model as true if there exists some value $\theta_0 \in \Theta$ such that the data $x$ actually have the probability distribution $f(x;\theta_0)$. So we obtain a statistical model by imbedding the true data-generating probability distribution $f(x;\theta_0)$ in a family of probability distributions. But it is clear that such an imbedding can be done in many different ways, each such imbedding will be a "true" model, and they will give different likelihood functions. And without such an imbedding, there is no likelihood function. It seems that we really do need some help, some principles for how to choose an imbedding wisely!

So what does this mean? It means that the choice of likelihood function tells us how we would expect the data to change if the truth changed a little bit. But this cannot really be verified by the data, as the data only give information about the model function $f(x;\theta_0)$ which actually generated the data, and nothing about all the other elements of the chosen model. In this way we see that the choice of likelihood function is similar to the choice of a prior in Bayesian analysis: it injects non-data information into the analysis. Let us look at this in a simple (somewhat artificial) example, and look at the effect of imbedding $f(x;\theta_0)$ in a model in different ways.

Let us assume that $X_1, \dotsc, X_n$ are iid $N(\mu=10, \sigma^2=1)$. So that is the true, data-generating distribution. Now let us imbed this in a model in two different ways, model A and model B:
$$A\colon\ X_1, \dotsc, X_n \text{ iid } N(\mu, \sigma^2=1),\ \mu \in \mathbb{R} \qquad\qquad B\colon\ X_1, \dotsc, X_n \text{ iid } N(\mu, \mu/10),\ \mu > 0$$
One can check that the two coincide for $\mu = 10$.

The log-likelihood functions become
$$\ell_A(\mu) = -\frac{n}{2}\log(2\pi) - \frac12\sum_i (x_i-\mu)^2 \qquad \ell_B(\mu) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\mu/10) - \frac{10}{2\mu}\sum_i (x_i-\mu)^2$$
The score functions (derivatives with respect to $\mu$) are
$$\dot{\ell}_A(\mu) = n(\bar{x}-\mu) \qquad \dot{\ell}_B(\mu) = -\frac{n}{2\mu} + \frac{5}{\mu^2}\sum_i x_i^2 - 5n$$
and the curvatures
$$\ddot{\ell}_A(\mu) = -n \qquad \ddot{\ell}_B(\mu) = \frac{n}{2\mu^2} - \frac{10}{\mu^3}\sum_i x_i^2$$
so the Fisher information really does depend on the imbedding. Now we calculate the Fisher information at the true value $\mu=10$, using $\operatorname{E} X_i^2 = 1 + 10^2 = 101$:
$$I_A(\mu=10) = n, \qquad I_B(\mu=10) = -\operatorname{E}\,\ddot{\ell}_B(10) = n\left(-\frac{1}{200} + \frac{101}{100}\right) = n\cdot\frac{201}{200} > n$$
so the Fisher information about the parameter is somewhat larger in model B.
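As a sanity check of these two values (my own sketch, not part of the original answer), one can simulate data from the true $N(10,1)$ distribution and average the observed curvatures $-\ddot{\ell}_A$ and $-\ddot{\ell}_B$ at $\mu = 10$:

```python
import numpy as np

rng = np.random.default_rng(3)
mu0, n, n_rep = 10.0, 50, 20_000

x = rng.normal(mu0, 1.0, size=(n_rep, n))   # samples from the true N(10, 1)
sum_x2 = (x ** 2).sum(axis=1)

# Observed information (negative second derivative) for each sample, at mu = 10.
obs_info_A = np.full(n_rep, float(n))                     # -l''_A(mu) = n
obs_info_B = -(n / (2 * mu0**2) - 10 * sum_x2 / mu0**3)   # -l''_B(mu)

print(obs_info_A.mean() / n)   # 1.0
print(obs_info_B.mean() / n)   # ~ 1.005 = 201/200
```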

This illustrates that, in some sense, the Fisher information tells us how fast the information from the data about the parameter would change if the governing parameter changed in the way postulated by the imbedding in the model family. The explanation for the higher information in model B is that model family B postulates that if the expectation increases, then the variance increases too. So, under model B, the sample variance will also carry information about μ, which it will not do under model A.

Also, this example illustrates that we really do need some theory for helping us in how to construct model families.


1
great explanation. Why do you say $\operatorname{E}_\theta \dot{\ell}(\theta)=0$? it's a function of θ - isn't it 0 only when evaluated at the true parameter θ0?
ihadanny

1
Yes, what you say is true, @ihadanny: it is zero when evaluated at the true parameter value.
kjetil b halvorsen

Thanks again @kjetil - so just one more question: is the surprising relationship between the variance of the score and the curvature of the likelihood true for every θ? or only in the neighborhood of the true parameter θ0?
ihadanny

Again, that relationship is true for the true parameter value. But for that to be of much help, there must be continuity, so that it is approximately true in some neighborhood, since we will use it at the estimated value θ^, not only at the true (unknown) value.
kjetil b halvorsen

so, the relationship holds for the true parameter θ0, it almost holds for θmle since we assume that it's in the neighborhood of θ0, but for a general θ1 it does not hold, right?
ihadanny

31

Let's think in terms of the negative log-likelihood function. The negative score is its gradient with respect to the parameter value. At the true parameter, the score is zero. Otherwise, it gives the direction towards the minimum (or, if the negative log-likelihood is non-convex, a saddle point or a local minimum or maximum).

The Fisher information measures the curvature of the negative log-likelihood around θ when the data actually follow the distribution with parameter θ. In other words, it tells you how much wiggling the parameter would affect your log-likelihood.
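A small numerical illustration of that "wiggling" reading (my own sketch, not from this answer), assuming $X_1,\dotsc,X_n \sim N(\theta_0, 1)$: the expected drop in log-likelihood when the parameter is moved a distance $\delta$ away from the truth is approximately $\tfrac12 I(\theta_0)\,\delta^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
theta0, n, n_rep, delta = 0.0, 100, 50_000, 0.3

x = rng.normal(theta0, 1.0, size=(n_rep, n))

def loglik(theta, x):
    # log-likelihood of N(theta, 1), dropping the additive constant
    return -0.5 * ((x - theta) ** 2).sum(axis=1)

drop = loglik(theta0, x) - loglik(theta0 + delta, x)  # loss from wiggling by delta

print(drop.mean())          # ~ 4.5
print(0.5 * n * delta**2)   # (1/2) * I(theta0) * delta^2 = 4.5, since I(theta0) = n
```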

Consider that you had a big model with millions of parameters. And you had a small thumb drive on which to store your model. How should you prioritize how many bits of each parameter to store? The right answer is to allocate bits according to the Fisher information (Rissanen wrote about this). If the Fisher information of a parameter is zero, that parameter doesn't matter.

We call it "information" because the Fisher information measures how much this parameter tells us about the data.


A colloquial way to think about it is this: Suppose the parameters are driving a car, and the data is in the back seat correcting the driver. The annoyingness of the data is the Fisher information. If the data lets the driver drive, the Fisher information is zero; if the data is constantly making corrections, it's big. In this sense, the Fisher information is the amount of information going from the data to the parameters.

Consider what happens if you make the steering wheel more sensitive. This is equivalent to a reparametrization. In that case, the data doesn't want to be so loud for fear of the car oversteering. This kind of reparametrization decreases the Fisher information.


20

Complementary to @NeilG's nice answer (+1) and to address your specific questions:

  1. I would say it counts the "precision" rather than the "error" itself.

Remember that the negative of the Hessian of the log-likelihood evaluated at the ML estimates is the observed Fisher information. The estimated standard errors are the square roots of the diagonal elements of the inverse of the observed Fisher information matrix. Stemming from this, the Fisher information here is the trace of the Fisher information matrix. Given that the Fisher information matrix I is a Hermitian positive-semidefinite matrix, its diagonal entries Ij,j are real and non-negative; as a direct consequence its trace tr(I) must be positive. This means that, according to your assertion, you could only ever have "non-ideal" estimators. So no, positive Fisher information is not related to how ideal your MLE is.
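A sketch of that recipe in code (my own illustration, not from the original answer), assuming a two-parameter normal model $N(\mu, \sigma^2)$ with its closed-form ML estimates and a simple finite-difference Hessian: compute the observed Fisher information as the negative Hessian of the log-likelihood at the ML estimates, invert it, and read the standard errors off the diagonal.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=3.0, scale=2.0, size=500)
n = x.size

def loglik(params):
    # log-likelihood of N(mu, sigma^2) for the sample x
    mu, sigma = params
    return (-n * np.log(sigma) - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.sum((x - mu) ** 2) / sigma**2)

# Closed-form ML estimates for the normal model: sample mean and (biased) sd.
mle = np.array([x.mean(), x.std(ddof=0)])

def hessian(f, p, h=1e-4):
    # Central finite-difference Hessian of f at the point p.
    k = len(p)
    H = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (f(p + e_i + e_j) - f(p + e_i - e_j)
                       - f(p - e_i + e_j) + f(p - e_i - e_j)) / (4 * h**2)
    return H

obs_info = -hessian(loglik, mle)    # observed Fisher information matrix
cov = np.linalg.inv(obs_info)       # approximate covariance of the estimates
se = np.sqrt(np.diag(cov))          # standard errors of (mu_hat, sigma_hat)

print(se)                                      # estimated standard errors
print(2.0 / np.sqrt(n), 2.0 / np.sqrt(2 * n))  # theory: sigma/sqrt(n), sigma/sqrt(2n)
```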

  2. The definition differs in the way we interpret the notion of information in the two cases. Having said that, the two measurements are closely related.

The inverse of the Fisher information is the minimum variance of an unbiased estimator (Cramér–Rao bound). In that sense the information matrix indicates how much information about the estimated coefficients is contained in the data. On the other hand, the Shannon entropy was taken from thermodynamics. It relates the information content of a particular value of a variable as -p·log2(p), where p is the probability of the variable taking on that value. Both are measurements of how "informative" a variable is. In the first case, though, you judge this information in terms of precision, while in the second case in terms of disorder; different sides, same coin! :D
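As a concrete instance of the Cramér–Rao statement above (a worked check of my own, for the normal mean with known variance):

$$X_1,\dotsc,X_n \overset{\text{iid}}{\sim} N(\mu,\sigma^2) \;\Longrightarrow\; I(\mu) = \frac{n}{\sigma^2}, \qquad \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n} = I(\mu)^{-1},$$

so the sample mean is unbiased and its variance equals the inverse Fisher information: it attains the bound.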

To recap: The inverse of the Fisher information matrix I, evaluated at the ML estimator values, is the asymptotic or approximate covariance matrix. As these ML estimator values lie at a local minimum of the negative log-likelihood, graphically the Fisher information shows how deep that minimum is and how much wiggle room you have around it. I found this paper by Lutwak et al. on Extensions of Fisher information and Stam's inequality an informative read on this matter. The Wikipedia articles on the Fisher information metric and on the Jensen–Shannon divergence are also good to get you started.
