Suppose we have a random variable $X \sim f(x\mid\theta)$. If $\theta_0$ were the true parameter, the likelihood function should be maximized and its derivative equal to zero. This is the basic principle behind the maximum likelihood estimator.

As I understand it, the Fisher information is defined as
$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\right)^2\right].$$
Thus, if $\theta_0$ is the true parameter, $I(\theta) = 0$. But if $\theta$ is not the true parameter, then we will have a larger amount of Fisher information.
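(A quick numerical sketch, not part of the original question, using an illustrative Normal-mean model: at the true parameter the *average* score is close to zero, but the average *squared* score, i.e. the Fisher information, stays strictly positive.)

```python
import numpy as np

# Illustrative sketch (assumed example): X ~ N(theta0, 1) with true theta0 = 2.
# The score of one observation is d/dtheta log f(x | theta) = x - theta.
rng = np.random.default_rng(0)
theta0 = 2.0
x = rng.normal(loc=theta0, scale=1.0, size=100_000)

score = x - theta0          # score of each observation at the true parameter
print(score.mean())         # ~ 0: the expected score vanishes at theta0
print((score ** 2).mean())  # ~ 1: the Fisher information I(theta0) = 1 > 0
```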
My questions:
- Does the Fisher information measure the "error" of a given MLE? In other words, doesn't the existence of positive Fisher information imply that my MLE cannot be ideal?
- How does this definition of "information" differ from the one Shannon uses? Why do we call it information?
Answers:
Trying to complement the other answers... What kind of information is Fisher information? Start with the log-likelihood function $\ell(\theta) = \log f(x;\theta)$ as a function of $\theta$ for $\theta \in \Theta$, the parameter space. Assuming some regularity conditions we do not discuss here, we have $E\,\frac{\partial}{\partial\theta}\ell(\theta) = E_\theta\,\dot{\ell}(\theta) = 0$ (derivatives with respect to the parameter are written with dots), and the variance of the score is the Fisher information,
$$I(\theta) = E_\theta\big(\dot{\ell}(\theta)\big)^2 = -E_\theta\,\ddot{\ell}(\theta),$$
the last expression showing that it is also the (negative expected) curvature of the log-likelihood function.

How can we interpret this? $\ell(\theta)$ is the likelihood information about the parameter $\theta$ contained in the sample. This can really only be interpreted in a relative sense, as when we use it to compare the plausibilities of two distinct parameter values through the difference $\ell(\theta_0) - \ell(\theta_1)$, as in a likelihood ratio test. The rate of change of the log-likelihood is the score function $\dot{\ell}(\theta)$, which tells us how fast the likelihood changes, while its variance $I(\theta)$ tells us how much this varies from sample to sample at a given parameter value, say $\theta_0$. The equation (which is really surprising!)
$$I(\theta_0) = -E_{\theta_0}\,\ddot{\ell}(\theta_0)$$
tells us there is a relationship (an equality) between the variability of this information for a given parameter value $\theta_0$ and the curvature of the likelihood function at that same parameter value: a surprising relationship between the variability (variance) of the score statistic $\dot{\ell}(\theta)\big|_{\theta=\theta_0}$ and the expected change in the likelihood as we vary the parameter around $\theta_0$.
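As a small check of that identity (a sketch of my own, assuming the simple model $X_1,\dots,X_n$ iid $N(\theta_0, 1)$, which is not from the original answer), the variance of the score across repeated samples matches the negative expected second derivative of the log-likelihood:

```python
import numpy as np

# Sketch with an assumed model: X_1,...,X_n iid N(theta0, 1), theta0 = 2.
# Score of a sample:    d/dtheta l(theta) = sum(x_i - theta)
# Second derivative:    d^2/dtheta^2 l(theta) = -n   (constant for this model)
rng = np.random.default_rng(1)
theta0, n, reps = 2.0, 50, 20_000

samples = rng.normal(theta0, 1.0, size=(reps, n))
scores = (samples - theta0).sum(axis=1)   # score at the true parameter, one per sample

print(scores.var())   # ~ n = 50, the variance of the score
print(n)              # -E[second derivative] = n, the same quantity
```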
So what is the likelihood function? We usually think of a statistical model $\{f(x;\theta),\ \theta\in\Theta\}$ as a family of probability distributions for the data $x$, indexed by a parameter $\theta$, some element of the parameter space $\Theta$. We think of this model as true if there exists some value $\theta_0\in\Theta$ such that the data $x$ actually have the probability distribution $f(x;\theta_0)$. So we obtain a statistical model by imbedding the true data-generating probability distribution in a family of probability distributions. But it is clear that such an imbedding can be done in many different ways; each such imbedding will be a "true" model, and they will give different likelihood functions. And without such an imbedding, there is no likelihood function. It seems that we really do need some help, some principles for how to choose an imbedding wisely!
So what does this mean? It means that the choice of likelihood function tells us how we would expect the data to change if the truth changed a little bit. But this cannot really be verified by the data, since the data only give information about the true model function $f(x;\theta_0)$ that actually generated the data, and nothing about all the other elements of the chosen model. In this way we see that the choice of likelihood function is similar to the choice of a prior in Bayesian analysis: it injects non-data information into the analysis. Let us look at this in a simple (somewhat artificial) example, and see the effect of imbedding $f(x;\theta_0)$ in a model in different ways.
Let us assume that $X_1, \dots, X_n$ are iid $N(\mu = 10,\ \sigma^2 = 1)$. So that is the true, data-generating distribution. Now let us imbed this in a model in two different ways, model A and model B:

A: $X_1, \dots, X_n$ iid $N(\mu,\ \sigma^2 = 1)$, $\mu \in \mathbb{R}$;
B: $X_1, \dots, X_n$ iid $N(\mu,\ \mu/10)$, $\mu > 0$;

one can check that both coincide with the true distribution for $\mu = 10$.

The log-likelihood functions become
$$\ell_A(\mu) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_i (x_i - \mu)^2,$$
$$\ell_B(\mu) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\mu/10) - \frac{10}{2\mu}\sum_i (x_i - \mu)^2.$$
Computing the Fisher information at the true value $\mu = 10$ gives
$$I_A(\mu = 10) = n, \qquad I_B(\mu = 10) = n\left(1 + \tfrac{1}{200}\right) > n,$$
so the Fisher information about the parameter is somewhat larger in model B.
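Here is a small simulation sketch (my addition, relying on the model-B form $N(\mu,\mu/10)$ as written above, with hand-derived per-observation scores) that checks those two numbers by estimating the per-observation score variance under each imbedding:

```python
import numpy as np

# Sketch of the A-vs-B comparison above. True data: X_i iid N(10, 1).
# Per-observation scores at mu:
#   model A, N(mu, 1):     score_A(x) = x - mu
#   model B, N(mu, mu/10): score_B(x) = -1/(2*mu) + 10*(x-mu)/mu + 5*(x-mu)**2/mu**2
rng = np.random.default_rng(2)
mu0 = 10.0
x = rng.normal(mu0, 1.0, size=1_000_000)

score_A = x - mu0
score_B = -1.0 / (2 * mu0) + 10 * (x - mu0) / mu0 + 5 * (x - mu0) ** 2 / mu0 ** 2

# Per-observation Fisher information = variance of the score at the true value
print(score_A.var())   # ~ 1.000  -> I_A(10) = n
print(score_B.var())   # ~ 1.005  -> I_B(10) = n * (1 + 1/200) > I_A(10)
```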
This illustrates that, in some sense, the Fisher information tells us how fast the information from the data about the parameter would have changed if the governing parameter changed in the way postulated by the imbedding in a model family. The explanation of higher information in model B is that our model family B postulates that if the expectation would have increased, then the variance too would have increased. So that, under model B, the sample variance will also carry information about $\mu$, which it will not do under model A.
Also, this example illustrates that we really do need some theory for helping us in how to construct model families.
Let's think in terms of the negative log-likelihood function $-\ell(\theta)$. The negative score is its gradient with respect to the parameter value. At the true parameter, the score is zero. Otherwise, it gives the direction towards the minimum (or, in the case of a non-convex $-\ell$, towards a saddle point or a local minimum or maximum).

The Fisher information measures the curvature of $-\ell(\theta)$ around $\theta$ if the data actually follow the model distribution at $\theta$. In other words, it tells you how much wiggling the parameter would affect your log-likelihood.
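As a concrete sketch of "curvature = sensitivity to wiggling" (my own illustration, assuming a Normal-mean model where the Fisher information is $n/\sigma^2$): moving the parameter a small distance $\delta$ away from the MLE raises the negative log-likelihood by roughly $\tfrac{1}{2} I(\theta)\,\delta^2$, so larger Fisher information means a sharper penalty for wiggling.

```python
import numpy as np

# Sketch: negative log-likelihood of N(theta, sigma^2) data as theta is wiggled.
# Fisher information here is n / sigma^2, so smaller sigma => sharper curvature.
def neg_loglik(theta, x, sigma):
    return 0.5 * np.sum((x - theta) ** 2) / sigma**2 + 0.5 * len(x) * np.log(2 * np.pi * sigma**2)

rng = np.random.default_rng(3)
n, theta0 = 200, 0.0
for sigma in (0.5, 2.0):                   # high- vs low-information settings
    x = rng.normal(theta0, sigma, size=n)
    mle = x.mean()
    wiggle = neg_loglik(mle + 0.1, x, sigma) - neg_loglik(mle, x, sigma)
    print(sigma, n / sigma**2, wiggle)     # wiggle = 0.5 * (n/sigma^2) * 0.1**2
```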
Consider that you had a big model with millions of parameters and a small thumb drive on which to store it. How should you prioritize how many bits of each parameter to store? The right answer is to allocate bits according to the Fisher information (Rissanen wrote about this). If the Fisher information of a parameter is zero, that parameter doesn't matter.
We call it "information" because the Fisher information measures how much this parameter tells us about the data.
A colloquial way to think about it is this: Suppose the parameters are driving a car, and the data is in the back seat correcting the driver. The annoyingness of the data is the Fisher information. If the data lets the driver drive, the Fisher information is zero; if the data is constantly making corrections, it's big. In this sense, the Fisher information is the amount of information going from the data to the parameters.
Consider what happens if you make the steering wheel more sensitive. This is equivalent to a reparametrization. In that case, the data doesn't want to be so loud for fear of the car oversteering. This kind of reparametrization decreases the Fisher information.
Complementary to @NeilG's nice answer (+1) and to address your specific questions:
Remember that the negative Hessian of the log-likelihood evaluated at the ML estimates is the observed Fisher information. The estimated standard errors are the square roots of the diagonal elements of the inverse of the observed Fisher information matrix. Stemming from this, the (scalar) Fisher information referred to here is the trace of the Fisher information matrix. Given that the Fisher information matrix is a Hermitian positive-semidefinite matrix, its diagonal entries are real and non-negative; as a direct consequence, its trace must be non-negative (and positive in any non-degenerate problem). This means that, by your assertion, you could only ever have "non-ideal" estimators. So no, a positive Fisher information is not related to how ideal your MLE is.
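A minimal sketch of that recipe (my own example, using a one-parameter Poisson model rather than anything from the original post): the observed information is the negative second derivative of the log-likelihood at the MLE, and the square root of its inverse is the usual standard error.

```python
import numpy as np

# Sketch: observed Fisher information and standard error for a Poisson(lambda) sample.
# log-likelihood: l(lam) = sum(x)*log(lam) - n*lam + const,  MLE: lam_hat = mean(x)
rng = np.random.default_rng(4)
x = rng.poisson(lam=3.0, size=500)
n, lam_hat = len(x), x.mean()

# Observed information = -d^2 l / d lam^2 at lam_hat = sum(x) / lam_hat^2 = n / lam_hat
obs_info = x.sum() / lam_hat**2
se = np.sqrt(1.0 / obs_info)      # standard error = sqrt of inverse observed information

print(lam_hat, obs_info, se)      # se is approximately sqrt(lam_hat / n)
```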
The inverse of the Fisher information is the minimum variance of an unbiased estimator (the Cramér–Rao bound). In that sense the information matrix indicates how much information about the estimated coefficients is contained in the data. By contrast, the Shannon entropy was taken from thermodynamics. It relates the information content of a particular value of a variable as $-\log(p)$, where $p$ is the probability of the variable taking on that value. Both are measures of how "informative" a variable is. In the first case, though, you judge this information in terms of precision, while in the second case in terms of disorder; different sides, same coin! :D
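To make the "different sides, same coin" remark concrete (an illustration of my own, using a Bernoulli($p$) variable): its Fisher information about $p$ is $1/(p(1-p))$, while its Shannon entropy is $-p\log p - (1-p)\log(1-p)$. The Fisher information is smallest at $p=1/2$, exactly where the entropy peaks.

```python
import numpy as np

# Sketch: Fisher information vs Shannon entropy for a Bernoulli(p) variable.
#   Fisher information about p:  I(p) = 1 / (p * (1 - p))
#   Shannon entropy of X:        H(p) = -p*log(p) - (1-p)*log(1-p)
p = np.linspace(0.01, 0.99, 99)
fisher = 1.0 / (p * (1 - p))
entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)

for pi, fi, hi in zip(p[::24], fisher[::24], entropy[::24]):
    print(f"p={pi:.2f}  Fisher={fi:6.2f}  entropy={hi:.3f}")
# Fisher information is smallest at p = 0.5 (largest Cramér–Rao variance bound),
# while Shannon entropy is largest there (the outcome is most uncertain).
```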
To recap: the inverse of the Fisher information matrix evaluated at the ML estimates is the asymptotic or approximate covariance matrix. As these ML estimates are found at a local minimum of the negative log-likelihood, graphically the Fisher information shows how deep that minimum is and how much wiggle room you have around it. I found the paper by Lutwak et al. on Extensions of Fisher information and Stam's inequality an informative read on this matter. The Wikipedia articles on the Fisher Information Metric and on Jensen–Shannon divergence are also good to get you started.