Motivation of the Expectation-Maximization algorithm

20

In the EM algorithm approach, we use Jensen's inequality to arrive at

$\log p(x|\theta) \geq \int \log p(z,x|\theta) p(z|x,\theta^{(k)}) dz - \int \log p(z|x,\theta) p(z|x,\theta^{(k)})dz$

and define $\theta^{(k+1)}$ by

$\theta^{(k+1)}=\arg \max_{\theta}\int \log p(z,x|\theta) p(z|x,\theta^{(k)}) dz$

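Everything I have read about EM just plops this down, but I have always felt uneasy about not having an explanation of why the EM algorithm arises naturally. I understand that the $\log$ likelihood is typically used in order to deal with addition instead of multiplication, but the appearance of $\log$ in the definition of $\theta^{(k+1)}$ feels unmotivated to me. Why should one consider $\log$ and not other monotonic functions? For various reasons I suspect that the "meaning" or "motivation" behind expectation maximization has some kind of explanation in terms of information theory and sufficient statistics. If there were such an explanation, that would be much more satisfying than just an abstract algorithm.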

mixture expectation-maximization

— 用户名

3

What is the expectation maximization algorithm?, Nature Biotechnology 26: 897-899 (2008), has a nice picture that illustrates how the algorithm works.

— chl

@chl: I have seen that article. The point I am asking about is that nowhere does it explain why a non-log approach cannot work.

— user782220 2013

10

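The EM algorithm has different interpretations and can arise in different forms in different applications.

It all starts with the likelihood function $p(x \vert \theta)$ or, equivalently, the log-likelihood function $\log p(x \vert \theta)$ that we would like to maximize. (We generally use the logarithm because it simplifies the calculation: it is strictly monotone, concave, and $\log(ab) = \log a + \log b$ .) In an ideal world, the value of $p$ depends only on the model parameter $\theta$ , so we can search through the space of $\theta$ and find one that maximizes $p$ .

However, in many interesting real-world applications things are more complicated, because not all the variables are observed. Yes, we might directly observe $x$ , but some other variables $z$ are unobserved. Because of the missing variables $z$ , we are in a kind of chicken-and-egg situation: without $z$ we cannot estimate the parameter $\theta$ , and without $\theta$ we cannot infer what the value of $z$ may be.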

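This is where the EM algorithm comes into play. We start with an initial guess of the model parameters $\theta$ and derive the expected values of the missing variables $z$ (i.e., the E step). Once we have the values of $z$ , we can maximize the likelihood with respect to the parameters $\theta$ (i.e., the M step, corresponding to the $\arg \max$ equation in the problem statement). With this $\theta$ we can derive new expected values of $z$ (another E step), and so on. In other words, in each step we assume that one of the two, $z$ or $\theta$ , is known. We repeat this iterative process until the likelihood can no longer be increased.

This is the EM algorithm in a nutshell. It is well known that the likelihood never decreases during this iterative EM process. But keep in mind that the EM algorithm does not guarantee a global optimum. That is, it might end up at a local optimum of the likelihood function.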
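The alternation described above can be sketched for a concrete case. The following is a minimal illustration, assuming a one-dimensional mixture of two Gaussians; the data, the initial guess, and all function names are made up for this example. It records the log-likelihood after every iteration so that the "never decreases" property can be inspected.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, weights, mus, variances):
    """log p(x | theta), summed over all data points."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)))
        for x in xs
    )

def em_step(xs, weights, mus, variances):
    """One E step followed by one M step; returns the updated parameters."""
    # E step: responsibilities resp[n][j] = p(z_n = j | x_n, theta_old)
    resp = []
    for x in xs:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M step: closed-form maximizer of the expected complete-data log-likelihood
    new_w, new_mu, new_var = [], [], []
    for j in range(len(weights)):
        nj = sum(r[j] for r in resp)
        mu_j = sum(r[j] * x for r, x in zip(resp, xs)) / nj
        var_j = sum(r[j] * (x - mu_j) ** 2 for r, x in zip(resp, xs)) / nj
        new_w.append(nj / len(xs))
        new_mu.append(mu_j)
        new_var.append(max(var_j, 1e-6))  # guard against a collapsing component
    return new_w, new_mu, new_var

random.seed(0)
# Synthetic data (an assumption of this sketch): two well-separated clusters.
xs = [random.gauss(-2.0, 1.0) for _ in range(200)] + [random.gauss(3.0, 0.5) for _ in range(200)]
theta = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # deliberately rough initial guess
history = [log_likelihood(xs, *theta)]
for _ in range(30):
    theta = em_step(xs, *theta)
    history.append(log_likelihood(xs, *theta))
```

Here `history` holds the log-likelihood before and after each iteration, and `theta` holds the final weights, means, and variances. A different initial guess may lead to a different local optimum, as noted above.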

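The appearance of $\log$ in the equation for $\theta^{(k+1)}$ is inevitable, because the function you would like to maximize here is written as a log-likelihood.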

— 微微

I don't see how this answers the question.

— broncoAbierto

9

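Likelihood vs. log-likelihood

As has already been said, the $\log$ is introduced in maximum likelihood simply because it is generally easier to optimize sums than products. The reason we don't consider other monotonic functions is that the logarithm is the unique function with the property of turning products into sums.

Another way to motivate the logarithm is the following: instead of maximizing the probability of the data under our model, we could equivalently try to minimize the Kullback-Leibler divergence between the data distribution, $p_\text{data}(x)$ , and the model distribution, $p(x \mid \theta)$ ,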

$D_\text{KL}[p_\text{data}(x) \mid\mid p(x \mid \theta)] = \int p_\text{data}(x) \log \frac{p_\text{data}(x)}{p(x \mid \theta)} \, dx = const - \int p_\text{data}(x)\log p(x \mid \theta) \, dx.$

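The first term on the right-hand side is constant in the parameters. If we have $N$ samples from the data distribution (our data points), we can approximate the second term with the average log-likelihood of the data,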

$\int p_\text{data}(x)\log p(x \mid \theta) \, dx \approx \frac{1}{N} \sum_n \log p(x_n \mid \theta).$
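As a quick numerical illustration of this approximation, here is a small sketch under assumptions that are not part of the answer: the data distribution is taken to be a standard normal, and the model is a unit-variance Gaussian with unknown mean $\mu$. In that case the integral has the closed form $-\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}(1 + \mu^2)$, which can be compared with the sample average.

```python
import math
import random

def log_model(x, mu):
    """log p(x | theta) for a unit-variance Gaussian model with mean mu."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

def exact_expected_log_lik(mu):
    """Closed form of the integral of p_data(x) log p(x | mu) dx for p_data = N(0, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (1.0 + mu ** 2)

random.seed(1)
# Samples from the assumed data distribution N(0, 1).
samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]

def average_log_lik(mu):
    """Monte Carlo estimate: (1/N) * sum_n log p(x_n | mu)."""
    return sum(log_model(x, mu) for x in samples) / len(samples)

candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
estimates = {mu: average_log_lik(mu) for mu in candidates}
# Maximizing the average log-likelihood is the same as minimizing the KL divergence.
best_mu = max(candidates, key=estimates.get)
```

Among the candidate means, the one with the highest average log-likelihood is the one closest to the data distribution in the KL sense.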

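An alternative view of EM

I am not sure this is the kind of explanation you are looking for, but I found the following view of expectation maximization much more enlightening than its motivation via Jensen's inequality (you can find a detailed description in Neal & Hinton (1998) or in Chapter 9.3 of Chris Bishop's PRML book).

It is not difficult to show that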

$\log p(x \mid \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz + D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)]$

for any $q(z \mid x)$ . If we call the first term on the right-hand side $F(q, \theta)$ , this implies that

$F(q, \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz = \log p(x \mid \theta) - D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)].$

Because the KL divergence is always non-negative, $F(q, \theta)$ is a lower bound on the log-likelihood for every fixed $q$ . Now, EM can be viewed as alternately maximizing $F$ with respect to $q$ and $\theta$ . In particular, by setting $q(z \mid x) = p(z \mid x, \theta)$ in the E-step, we drive the KL divergence on the right-hand side to zero and thus maximize $F$ with respect to $q$ .
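The decomposition can be checked numerically for a discrete latent variable. The sketch below is only an illustration: the joint table and the choice of $q$ are made-up numbers, and the function names are hypothetical. It verifies that $F$ plus the KL term equals the log-likelihood, and that the bound is tight when $q$ is the posterior.

```python
import math

# Hypothetical joint p(x, z | theta) for one fixed observation x and z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                      # p(x | theta)
posterior = [p / evidence for p in joint]  # p(z | x, theta)

def lower_bound(q):
    """F(q, theta) = sum_z q(z) log(p(x, z | theta) / q(z))."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

def kl(q, p):
    """KL[q || p] for discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q_arbitrary = [0.6, 0.3, 0.1]          # any distribution over z
gap = kl(q_arbitrary, posterior)       # the KL term in the decomposition
f_arbitrary = lower_bound(q_arbitrary)
f_posterior = lower_bound(posterior)   # E-step choice q = p(z | x, theta)
log_evidence = math.log(evidence)
```

With these numbers, `f_arbitrary + gap` matches `log_evidence`, `f_arbitrary` lies strictly below it, and `f_posterior` attains it.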

— Lucas

Thanks for the post! Though the given document doesn't say logarithm is the unique function turning products into sums. It says logarithm is the only function that fulfills all three listed properties at the same time.

— Weiwei

@Weiwei: Right, but the first condition mainly requires that the function is invertible. Of course, f(x) = 0 also implies f(x + y) = f(x)f(y), but this is an uninteresting case. The third condition asks that the derivative at 1 is 1, which is only true for the logarithm to base $e$ . Drop this constraint and you get logarithms to different bases, but still logarithms.

— Lucas

4

The paper that I found clarifying with respect to expectation-maximization is Bayesian K-Means as a "Maximization-Expectation" Algorithm (pdf) by Welling and Kurihara.

Suppose we have a probabilistic model $p(x,z,\theta)$ with $x$ observations, $z$ hidden random variables, and a total of $\theta$ parameters. We are given a dataset $D$ and are forced (by higher powers) to establish $p(z,\theta|D)$ .

1. Gibbs sampling

We can approximate $p(z,\theta|D)$ by sampling. Gibbs sampling gives $p(z,\theta|D)$ by alternating:

$\theta \sim p(\theta|z,D) \\ z \sim p(z|\theta,D)$

2. Variational Bayes

Instead, we can try to establish a distribution $q(\theta)$ and $q(z)$ and minimize the difference with the distribution we are after $p(\theta,z|D)$ . The difference between distributions has a convenient fancy name, the KL-divergence. To minimize $KL[q(\theta)q(z)||p(\theta,z|D)]$ we update:

$q(\theta) \propto \exp (E [\log p(\theta,z,D) ]_{q(z)} ) \\ q(z) \propto \exp (E [\log p(\theta,z,D) ]_{q(\theta)} )$

3. Expectation-Maximization

To come up with full-fledged probability distributions for both $z$ and $\theta$ might be considered extreme. Why don't we instead consider a point estimate for one of these and keep the other nice and nuanced. In EM the parameter $\theta$ is established as the one being unworthy of a full distribution, and set to its MAP (Maximum A Posteriori) value, $\theta^*$ .

$\theta^* = \underset{\theta}{\operatorname{argmax}} E [\log p(\theta,z,D) ]_{q(z)} \\ q(z) = p(z|\theta^*,D)$

Here $\theta^* \in \operatorname{argmax}$ would actually be a better notation: the argmax operator can return multiple values. But let's not nitpick. Compared to variational Bayes you see that correcting for the $\log$ by $\exp$ doesn't change the result, so that is not necessary anymore.

4. Maximization-Expectation

There is no reason to treat $z$ as a spoiled child. We can just as well use point estimates $z^*$ for our hidden variables and give the parameters $\theta$ the luxury of a full distribution.

$z^* = \underset{z}{\operatorname{argmax}} E [\log p(\theta,z,D) ]_{q(\theta)} \\ q(\theta) = p(\theta|z^*,D)$

If our hidden variables $z$ are indicator variables, we suddenly have a computationally cheap method to perform inference on the number of clusters. This is in other words: model selection (or automatic relevance detection or imagine another fancy name).

5. Iterated conditional modes

Of course, the poster child of approximate inference is to use point estimates for both the parameters $\theta$ and the hidden variables $z$ .

$\theta^* = \underset{\theta}{\operatorname{argmax}} p(\theta,z^*,D) \\ z^* = \underset{z}{\operatorname{argmax}} p(\theta^*,z,D)$
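To make the two alternating point estimates concrete, here is a minimal sketch under assumptions that go beyond the answer: a one-dimensional mixture with equal mixing weights, fixed unit variances, and a flat prior on the means. Under those assumptions $\log p(\theta, z, D)$ is, up to a constant, minus half the sum of squared distances to the assigned centers, so the two updates reduce to a k-means-style loop. The function name `icm_kmeans` and the data are hypothetical.

```python
import random

def icm_kmeans(xs, initial_centers, iterations=20):
    """Alternate point estimates: z* (nearest center), then theta* (cluster means)."""
    centers = list(initial_centers)
    labels, costs = [], []
    for _ in range(iterations):
        # z* = argmax_z p(theta*, z, D): assign each point to its nearest center
        labels = [min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2) for x in xs]
        # theta* = argmax_theta p(theta, z*, D): mean of the points in each cluster
        for j in range(len(centers)):
            members = [x for x, z in zip(xs, labels) if z == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = sum(members) / len(members)
        # -2 * log p(theta*, z*, D) up to a constant, under the stated assumptions
        costs.append(sum((x - centers[z]) ** 2 for x, z in zip(xs, labels)))
    return centers, labels, costs

random.seed(2)
# Synthetic data (an assumption of this sketch): two clusters around 0 and 6.
xs = [random.gauss(0.0, 1.0) for _ in range(100)] + [random.gauss(6.0, 1.0) for _ in range(100)]
centers, labels, costs = icm_kmeans(xs, initial_centers=[1.0, 2.0])
```

Each pass can only keep or lower the cost, which mirrors the coordinate-wise maximization of the joint in the two equations above.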

To see how Maximization-Expectation plays out I highly recommend the article. In my opinion, the strength of this article is however not the application to a $k$ -means alternative, but this lucid and concise exposition of approximation.

— Anne van Rossum

(+1) this is a beautiful summary of all methods.

— kedarps

4

There is a useful optimisation technique underlying the EM algorithm. However, it's usually expressed in the language of probability theory so it's hard to see that at the core is a method that has nothing to do with probability and expectation.

Consider the problem of maximising $g(x)=\sum_i\exp(f_i(x))$ (or equivalently $\log g(x)$ ) with respect to $x$ . If you write down an expression for $g'(x)$ and set it equal to zero you will often end up with a transcendental equation to solve. These can be nasty.

Now suppose that the $f_i$ play well together in the sense that linear combinations of them give you something easy to optimise. For example, if all of the $f_i(x)$ are quadratic in $x$ then a linear combination of the $f_i(x)$ will also be quadratic, and hence easy to optimise.

Given this supposition, it'd be cool if, in order to optimise $\log g(x)=\log \sum_i\exp(f_i(x))$ we could somehow shuffle the $\log$ past the $\sum$ so it could meet the $\exp$ s and eliminate them. Then the $f_i$ could play together. But we can't do that.

Let's do the next best thing. We'll make another function $h$ that is similar to $g$ . And we'll make it out of linear combinations of the $f_i$ .

Let's say $x_0$ is a guess for an optimal value. We'd like to improve this. Let's find another function $h$ that matches $g$ and its derivative at $x_0$ , i.e. $g(x_0)=h(x_0)$ and $g'(x_0)=h'(x_0)$ . If you plot a graph of $h$ in a small neighbourhood of $x_0$ it's going to look similar to $g$ .

You can show that

$g'(x)=\sum_i f_i'(x)\exp(f_i(x)).$

We want something that matches this at $x_0$ . There's a natural choice:

$h(x)=\mbox{constant}+\sum_i f_i(x)\exp(f_i(x_0)).$

You can see they match at $x=x_0$ . We get

$h'(x)=\sum_i f_i'(x)\exp(f_i(x_0)).$

As $x_0$ is a constant we have a simple linear combination of the $f_i$ whose derivative matches that of $g$ at $x_0$ . We just have to choose the constant in $h$ to make $g(x_0)=h(x_0)$ .

So starting with $x_0$ , we form $h(x)$ and optimise that. Because it's similar to $g(x)$ in the neighbourhood of $x_0$ we hope the optimum of $h$ is similar to the optimum of $g$ . Once you have a new estimate, construct the next $h$ and repeat.

I hope this has motivated the choice of $h$ . This is exactly the procedure that takes place in EM.

But there's one more important point. Using Jensen's inequality you can show that $h(x)\le g(x)$ . This means that when you optimise $h(x)$ you always get an $x$ that makes $g$ bigger compared to $g(x_0)$ . So even though $h$ was motivated by its local similarity to $g$ , it's safe to globally maximise $h$ at each iteration. The hope I mentioned above isn't required.

This also gives a clue to when to use EM: when linear combinations of the arguments to the $\exp$ function are easier to optimise. For example when they're quadratic - as happens when working with mixtures of Gaussians. This is particularly relevant to statistics where many of the standard distributions are from exponential families.
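The whole procedure can be tried on a toy problem. The sketch below assumes quadratic pieces $f_i(x) = c_i - (x - m_i)^2/2$ with made-up values for $m_i$ and $c_i$; all names are hypothetical. For this choice $h$ is a concave quadratic whose maximiser is a weighted mean of the $m_i$, so each iteration has a closed form.

```python
import math

# Hypothetical quadratic pieces f_i(x) = c_i - (x - m_i)^2 / 2.
centers = [-1.0, 0.5, 4.0]
offsets = [0.0, 0.3, 1.0]

def f(i, x):
    return offsets[i] - 0.5 * (x - centers[i]) ** 2

def g(x):
    """The function we actually want to maximise: sum_i exp(f_i(x))."""
    return sum(math.exp(f(i, x)) for i in range(len(centers)))

def h(x, x0):
    """Surrogate: linear combination of the f_i with weights exp(f_i(x0)),
    with the constant chosen so that h(x0) = g(x0)."""
    weights = [math.exp(f(i, x0)) for i in range(len(centers))]
    return g(x0) + sum(w * (f(i, x) - f(i, x0)) for i, w in enumerate(weights))

def argmax_h(x0):
    """Setting h'(x) = -sum_i w_i (x - m_i) to zero gives a weighted mean of the m_i."""
    weights = [math.exp(f(i, x0)) for i in range(len(centers))]
    return sum(w * m for w, m in zip(weights, centers)) / sum(weights)

x = 0.0             # initial guess x_0
trace = [g(x)]      # value of g at each iterate
for _ in range(50):
    x = argmax_h(x)  # globally maximise the current surrogate
    trace.append(g(x))
```

The values in `trace` never decrease, and evaluating `h(y, x0)` at any `y` stays at or below `g(y)`, in line with the bound stated above. Which local maximum of $g$ the iteration reaches depends on the starting point.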

— Dan Piponi

3

As you said, I will not go into technical details. There are quite a few very nice tutorials. One of my favourites is Andrew Ng's lecture notes. Take a look also at the references here.

EM is naturally motivated in mixture models and, more generally, in models with hidden factors. Take for example the case of Gaussian mixture models (GMM). Here we model the density of the observations as a weighted sum of $K$ Gaussians:

$p(x) = \sum_{i=1}^{K}\pi_{i} \mathcal{N}(x|\mu_{i}, \Sigma_{i})$

where $\pi_{i}$ is the probability that the sample $x$ was caused/generated by the ith component, $\mu_{i}$ is the mean of that component, and $\Sigma_{i}$ is its covariance matrix. The way to understand this expression is the following: each data sample has been generated/caused by one component, but we do not know which one. The approach is then to express the uncertainty in terms of probability ( $\pi_{i}$ represents the chances that the ith component can account for that sample), and take the weighted sum.

As a concrete example, imagine you want to cluster text documents. The idea is to assume that each document belongs to a topic (science, sports, ...) which you do not know beforehand. The possible topics are hidden variables. Then you are given a bunch of documents, and by counting n-grams or whatever features you extract, you want to find those clusters and see which cluster each document belongs to. EM is a procedure which attacks this problem step-wise: the expectation step attempts to improve the assignments of the samples it has achieved so far. In the maximization step you improve the parameters of the mixture, in other words, the form of the clusters.

The point is not using monotonic functions but convex functions (or concave ones, as with the logarithm). And the reason is Jensen's inequality, which ensures that the estimates of the EM algorithm will improve at every step.

— jpmuc