Why is entropy maximized when the probability distribution is uniform?


32

I know entropy is a measure of the randomness of a process/variable, and it can be defined as follows. For a random variable $X \in$ set $A$: $H(X) = \sum_{x_i \in A} -p(x_i)\log\big(p(x_i)\big)$. In the book on entropy and information theory by MacKay, he provides this statement in Chapter 2:

Entropy is maximized if $p$ is uniform.

Intuitively, I am able to understand this: if all the data points in a set $A$ are picked with equal probability $1/m$ ($m$ being the cardinality of $A$), then the randomness, or the entropy, increases. But if we know that some points in $A$ will occur with higher probability than others (say, in the case of a normal distribution, where the maximum concentration of data points is around the mean, within a small standard-deviation region around it), then the randomness, or the entropy, should decrease.

But is there any mathematical proof of this? Like, with the equation for $H(X)$, I differentiate it with respect to $p(x)$ and set it to 0, or something like that.
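
For a quick numerical illustration of that intuition, here is a minimal Python sketch (assuming numpy and scipy are available; the peaked distribution chosen here, a binomial, just stands in for a discretized normal):

```python
import numpy as np
from scipy.stats import entropy, binom

m = 10  # cardinality of the set A

# Uniform distribution: every point of A is picked with probability 1/m.
p_uniform = np.full(m, 1.0 / m)

# A peaked distribution on the same m points (Binomial(m-1, 0.5)),
# concentrating its mass around the mean like a discretized normal.
p_peaked = binom.pmf(np.arange(m), m - 1, 0.5)

print(entropy(p_uniform, base=2))  # log2(10) ~ 3.32 bits, the maximum
print(entropy(p_peaked, base=2))   # strictly smaller
```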

On a side note, is there any connection between the entropy that occurs in information theory and the entropy calculations in chemistry (thermodynamics)?


2
This question is answered (in passing) at stats.stackexchange.com/a/49174/919.
whuber

I am quite confused by another statement in Christopher Bishop's book, which says that "for a single real variable, the distribution that maximizes the entropy is the Gaussian." It also states that "the multivariate distribution with maximum entropy, for a given covariance, is a Gaussian." How can this statement be valid? Isn't the entropy of the uniform distribution always the maximum?
user76170

6
The maximization is always performed subject to constraints on the possible solutions. When the constraint is that all probability must vanish beyond predefined limits, the maximum-entropy solution is uniform. When instead the constraints are that the expectation and the variance must equal predefined values, the ME solution is Gaussian. The statements you quote must have been made within particular contexts where these constraints were stated, or at least implicitly understood.
whuber

2
I should probably also mention that the word "entropy" means something slightly different in the Gaussian setting than in the original question here, because we are then discussing the entropy of continuous distributions. This "differential entropy" is a different animal from the entropy of discrete distributions. The chief difference is that the differential entropy is not invariant under a change of variables.
whuber
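
For instance, here is a minimal Python sketch of that non-invariance (assuming scipy is available; the uniform widths are arbitrary):

```python
from scipy.stats import uniform

# Differential entropy of Uniform(0, w) is log(w): it changes under the
# rescaling X -> 2X, unlike the discrete entropy of a relabelled pmf.
for w in (1.0, 2.0):
    print(uniform(loc=0, scale=w).entropy())  # 0.0 for w=1, log(2) ~ 0.693 for w=2
```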

So does that mean maximization is always relative to constraints? What if there are no constraints? I mean, can't there be a question like this: which probability distribution has the maximum entropy?
user76170

Answers:


25

Heuristically, the probability density function on $\{x_1, x_2, \dots, x_n\}$ with maximum entropy turns out to be the one that corresponds to the least amount of knowledge of $\{x_1, x_2, \dots, x_n\}$, in other words the uniform distribution.

Now, for a more formal proof, consider the following:

A probability density function on $\{x_1, x_2, \dots, x_n\}$ is a set of nonnegative real numbers $p_1, \dots, p_n$ that add up to 1. Entropy is a continuous function of the $n$-tuples $(p_1, \dots, p_n)$, and these points lie in a compact subset of $\mathbb{R}^n$, so there is an $n$-tuple where entropy is maximized. We want to show this occurs at $(1/n, \dots, 1/n)$ and nowhere else.

Suppose the $p_j$ are not all equal, say $p_1 < p_2$. (Clearly $n \neq 1$.) We will find a new probability density with higher entropy. It then follows, since entropy is maximized at some $n$-tuple, that entropy is uniquely maximized at the $n$-tuple with $p_i = 1/n$ for all $i$.

Since $p_1 < p_2$, for small positive $\varepsilon$ we have $p_1 + \varepsilon < p_2 - \varepsilon$. The entropy of $\{p_1+\varepsilon, p_2-\varepsilon, p_3, \dots, p_n\}$ minus the entropy of $\{p_1, p_2, p_3, \dots, p_n\}$ equals

$$-p_1\log\left(\frac{p_1+\varepsilon}{p_1}\right) - \varepsilon\log(p_1+\varepsilon) - p_2\log\left(\frac{p_2-\varepsilon}{p_2}\right) + \varepsilon\log(p_2-\varepsilon)$$

To complete the proof, we want to show this is positive for small enough $\varepsilon$. Rewrite the above expression as

$$-p_1\log\left(1+\frac{\varepsilon}{p_1}\right) - \varepsilon\left(\log p_1 + \log\left(1+\frac{\varepsilon}{p_1}\right)\right) - p_2\log\left(1-\frac{\varepsilon}{p_2}\right) + \varepsilon\left(\log p_2 + \log\left(1-\frac{\varepsilon}{p_2}\right)\right)$$

Recalling that $\log(1+x) = x + O(x^2)$ for small $x$, the above expression is

$$-\varepsilon - \varepsilon\log p_1 + \varepsilon + \varepsilon\log p_2 + O(\varepsilon^2) = \varepsilon\log(p_2/p_1) + O(\varepsilon^2),$$

which is positive when $\varepsilon$ is small enough, since $p_1 < p_2$.
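
A quick numerical check of this perturbation argument (a minimal Python sketch, assuming numpy; the starting distribution and $\varepsilon$ are arbitrary):

```python
import numpy as np

def H(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

p = np.array([0.1, 0.4, 0.2, 0.3])  # not uniform, with p1 < p2
eps = 1e-3

# Move eps of probability mass from the larger p2 to the smaller p1.
q = p.copy()
q[0] += eps
q[1] -= eps

print(H(q) - H(p))                # positive: the entropy went up
print(eps * np.log(p[1] / p[0]))  # ~ eps*log(p2/p1), the leading-order gain
```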

A less rigorous proof is the following:

First, consider the following lemma:

Let $p(x)$ and $q(x)$ be continuous probability density functions on an interval $I$ of the real numbers, with $p \geq 0$ and $q > 0$ on $I$. We have

$$-\int_I p\log p\,dx \leq -\int_I p\log q\,dx$$

if both integrals exist. Moreover, there is equality if and only if $p(x) = q(x)$ for all $x$.

Now, let $p$ be any probability density function on $\{x_1, \dots, x_n\}$, with $p_i = p(x_i)$. Letting $q_i = 1/n$ for all $i$,

$$-\sum_{i=1}^n p_i\log q_i = \sum_{i=1}^n p_i\log n = \log n,$$

which is the entropy of $q$. Therefore our lemma says $h(p) \leq h(q)$, with equality if and only if $p$ is uniform.
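
A quick numerical check of the bound $h(p) \leq h(q) = \log n$ (a minimal Python sketch, assuming numpy; the random distributions are just examples):

```python
import numpy as np

def H(p):
    return -np.sum(p * np.log(p))

n = 8
rng = np.random.default_rng(0)

# Random distributions on n points never beat the uniform one.
for _ in range(5):
    p = rng.dirichlet(np.ones(n))  # a random probability vector of length n
    print(H(p) <= np.log(n))       # True every time

print(H(np.full(n, 1.0 / n)), np.log(n))  # equal: the bound is attained by the uniform
```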

In addition, Wikipedia has a brief discussion of this as well: wiki


11
I appreciate the effort to present an elementary (calculus-free) proof. A rigorous one-line demonstration is available via the weighted AM-GM inequality, by noting that $\exp(H) = \prod\left(\tfrac{1}{p_i}\right)^{p_i} \leq \sum p_i\,\tfrac{1}{p_i} = n$, with equality holding if and only if all the $p_i$ are equal.
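
A quick numerical check of that one-liner (a minimal Python sketch, assuming numpy; the probability vector is arbitrary):

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])
H = -np.sum(p * np.log(p))

geo = np.prod((1.0 / p) ** p)   # weighted geometric mean of the 1/p_i, equals exp(H)
arith = np.sum(p * (1.0 / p))   # weighted arithmetic mean of the 1/p_i, equals n

print(np.isclose(geo, np.exp(H)))  # True
print(geo <= arith, arith)         # True 3.0
```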

I don't understand how $\sum_{i=1}^n p_i \log n$ can be equal to $\log n$.
user1603472

4
@user1603472 do you mean $\sum_{i=1}^n p_i\log n = \log n$? It's because $\sum_{i=1}^n p_i\log n = \log n\sum_{i=1}^n p_i = \log n \times 1$.
HBeel

@Roland I pulled the $\log n$ outside of the sum since it does not depend on $i$. Then the sum is equal to 1 because $p_1, \dots, p_n$ are the densities of a probability mass function.
HBeel

Same explanation with more details can be found here: math.uconn.edu/~kconrad/blurbs/analysis/entropypost.pdf
Roland

14

Entropy in physics and information theory are not unrelated. They're more different than the name suggests, yet there's clearly a link between them. The purpose of the entropy metric is to measure the amount of information. See my answer with graphs here to show how entropy changes from a uniform distribution to a humped one.

The reason why entropy is maximized for a uniform distribution is that it was designed that way! Yes, we're constructing a measure for the lack of information, so we want to assign its highest value to the least informative distribution.

Example. I asked you, "Dude, where's my car?" Your answer is "it's somewhere in the USA between the Atlantic and Pacific Oceans." This is an example of the uniform distribution. My car could be anywhere in the USA. I didn't get much information from this answer.

However, if you told me "I saw your car one hour ago on Route 66 heading from Washington, DC", this is not a uniform distribution anymore. The car's more likely to be within 60 miles of DC than anywhere near Los Angeles. There's clearly more information here.

Hence, our measure must have high entropy for the first answer and a lower one for the second. The uniform must be the least informative distribution; it's basically the "I've no idea" answer.
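
To see the same effect numerically, here is a minimal Python sketch (assuming numpy and scipy; the bump widths are arbitrary) that moves from an essentially uniform distribution to an increasingly humped one:

```python
import numpy as np
from scipy.stats import entropy

n = 50
x = np.arange(n)

# The flatter the distribution, the higher the entropy; the more humped, the lower.
for width in (1000.0, 20.0, 5.0, 1.0):
    p = np.exp(-0.5 * ((x - n / 2) / width) ** 2)
    p /= p.sum()
    print(f"width={width:7.1f}:  H = {entropy(p, base=2):.2f} bits")
# width=1000 is essentially uniform: H is close to log2(50) ~ 5.6 bits ("anywhere in the USA").
# width=1 concentrates the mass on a few points: H drops to about 2 bits ("near DC").
```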


7

The mathematical argument is based on Jensen's inequality for concave functions. That is, if $f(x)$ is a concave function on $[a,b]$ and $y_1, \dots, y_n$ are points in $[a,b]$, then:

$$n\,f\!\left(\frac{y_1+\dots+y_n}{n}\right) \geq f(y_1)+\dots+f(y_n)$$

Apply this to the concave function $f(x) = -x\log(x)$ and Jensen's inequality with $y_i = p(x_i)$, and you have the proof. Note that the $p(x_i)$ define a discrete probability distribution, so their sum is 1. What you get is $\log(n) \geq \sum_{i=1}^n -p(x_i)\log\big(p(x_i)\big)$, with equality for the uniform distribution.
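
A quick numerical check of this argument (a minimal Python sketch, assuming numpy; the distribution is arbitrary):

```python
import numpy as np

def f(x):
    # The concave function used in the argument.
    return -x * np.log(x)

p = np.array([0.1, 0.2, 0.3, 0.4])  # any discrete probability distribution
n = len(p)

lhs = n * f(p.mean())  # n * f((y_1 + ... + y_n)/n); the mean of the p(x_i) is 1/n
rhs = np.sum(f(p))     # f(y_1) + ... + f(y_n), i.e. the entropy of p

print(np.isclose(lhs, np.log(n)))  # True: n * f(1/n) = log(n)
print(lhs >= rhs)                  # True: log(n) >= H(p), equality iff p is uniform
```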


1
I actually find the Jensen's inequality proof to be a much deeper proof conceptually than the AM-GM one.
Casebash

4

On a side note, is there any connection between the entropy that occurs in information theory and the entropy calculations in chemistry (thermodynamics)?

Yes, there is! You can see the work of Jaynes and many others following his work (such as here and here, for instance).

But the main idea is that statistical mechanics (and other fields in science as well) can be viewed as the inference we do about the world.

As a further reading I'd recommend Ariel Caticha's book on this topic.


1

An intuitive explanation:

If we put more probability mass into one event of a random variable, we will have to take some away from other events. That one event will have less information content and more weight, and the others more information content and less weight. Therefore the entropy, being the expected information content, will go down, since the event with lower information content will be weighted more heavily.

As an extreme case, imagine one event getting a probability of almost one; then the other events will have a combined probability of almost zero, and the entropy will be very low.
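
A minimal Python sketch of this extreme case (assuming numpy and scipy; the number of events and the probabilities are arbitrary):

```python
import numpy as np
from scipy.stats import entropy

n = 4
# Put more and more probability mass on a single event, splitting the rest
# evenly over the remaining events, and watch the entropy fall.
for big in (0.25, 0.5, 0.9, 0.999):
    p = np.full(n, (1.0 - big) / (n - 1))
    p[0] = big
    print(f"P(event 1) = {big}:  H = {entropy(p, base=2):.3f} bits")
# big = 0.25 is the uniform case: H = log2(4) = 2 bits.
# big = 0.999 is the extreme case described above: H is close to 0.
```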


0

Main idea: take the partial derivative with respect to each $p_i$, set them all to zero, and solve the resulting system of linear equations.

As an example, take a finite number of probabilities $p_i$, $i = 1, \dots, n$. Denote $q = p_n = 1 - \sum_{i=1}^{n-1} p_i$, so that the free parameters are $p_1, \dots, p_{n-1}$.

$$H = -\sum_{i=1}^{n-1} p_i\log p_i - q\log q, \qquad H\ln 2 = -\sum_{i=1}^{n-1} p_i\ln p_i - q\ln q$$
$$\ln 2\,\frac{\partial H}{\partial p_i} = \ln\frac{q}{p_i} = 0$$

Then $q = p_i$ for every $i$, i.e., $p_1 = p_2 = \dots = p_n$.
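
Here is a minimal symbolic sketch of that calculation for $n = 3$ (assuming sympy is available; the variable names are just for illustration):

```python
import sympy as sp

# Two free parameters p1, p2, with q = p3 = 1 - p1 - p2.
p1, p2 = sp.symbols('p1 p2', positive=True)
q = 1 - p1 - p2

H = -(p1 * sp.log(p1) + p2 * sp.log(p2) + q * sp.log(q))  # entropy in nats

# Set both partial derivatives to zero and solve the system.
stationary = sp.solve([sp.diff(H, p1), sp.diff(H, p2)], [p1, p2], dict=True)
print(stationary)  # [{p1: 1/3, p2: 1/3}], so q = 1/3 too: the uniform distribution
```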


I am glad you pointed out this is the "main idea," because it's only a part of the analysis. The other part, which might not be intuitive and actually is a little trickier, is to verify that this gives a global maximum by studying the behavior of the entropy as one or more of the $p_i$ shrinks to zero.
whuber
Licensed under cc by-sa 3.0 with attribution required.