
Advanced machine learning

Expectation-Maximization algorithm


Alex Avdiushenko
February 20, 2024

KL divergence

Let's start with the well-known

Jensen's Inequality


If \( f(x) \) is a concave (upward-convex) function, then Jensen's inequality holds:

[Figure: a concave function]
\[ f(wx_1 + (1 - w)x_2) \geq wf(x_1) + (1 - w)f(x_2) \]

Or, in a more general case,

\[f(w_1x_1 + w_2x_2 + \dots + w_nx_n) \geq w_1f(x_1) + w_2f(x_2) + \dots + w_nf(x_n)\]

where \( \sum\limits_{i=1}^n w_i = 1 \) and \( w_i \geq 0 \).


It can also be proven that, for a concave \( f \) and any distribution \( p(x) \),

\[ f\left(\mathbb{E}_{p(x)} u(x)\right) \geq \mathbb{E}_{p(x)} f(u(x)) \]
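As a quick numerical sanity check (an illustration of mine, not from the slides), here is a Monte Carlo comparison of the two sides for the concave \( f = \log \); the choice of \( p(x) \) and \( u(x) \) below is arbitrary:

```python
# Minimal check of Jensen's inequality for the concave f = log:
# log E_p[u(x)] >= E_p[log u(x)].
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # samples from an arbitrary p(x)
u = 1.0 + x ** 2                               # an arbitrary positive function u(x)

lhs = np.log(np.mean(u))    # f(E_p[u(x)]), estimated by Monte Carlo
rhs = np.mean(np.log(u))    # E_p[f(u(x))]
print(lhs, ">=", rhs)       # the left-hand side always dominates
```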

Kullback-Leibler Divergence Between Probability Distributions

\[ KL(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx \]

Properties of Kullback-Leibler Divergence:

  • \( KL(p||q) \neq KL(q||p) \) — asymmetry
  • \( KL(p||q) \geq 0 \) — non-negativity
  • \( KL(p||q) = 0 \Leftrightarrow q(x) = p(x) \)

Proof of non-negativity of KL divergence (using Jensen's inequality for the concave \( \log \)):

\[ -KL(p||q) = -\int p(x) \log\left(\frac{p(x)}{q(x)}\right)dx = \int p(x) \log\left(\frac{q(x)}{p(x)}\right)dx \leq \\ \leq \log\left( \int p(x) \frac{q(x)}{p(x)}dx\right) = \log\left( \int q(x) dx\right) = \log 1 = 0 \]
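For discrete distributions the integral becomes a sum, and the properties above are easy to verify directly (a small assumed example, not from the slides):

```python
# KL divergence for discrete distributions: sum_i p_i * log(p_i / q_i).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

print(kl(p, q), kl(q, p))   # both positive and unequal: non-negativity and asymmetry
print(kl(p, p))             # 0.0: the divergence vanishes only when p = q
```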

Asymmetry of KL Divergence

Example: searching for a similar distribution

\( KL(? || ?) \) — where to place \( p \)?


Target \( p \) in the first argument: mass covering (the approximation \( q \) spreads out to cover all regions where \( p \) has mass)

\[ KL(p || q) = \int p(x) \log \frac{p(x)}{q(x)} dx \]


Target \( p \) in the second argument: mode collapsing (the approximation \( q \) concentrates on a single mode of \( p \))

\[ KL(q || p) = \int q(x) \log \frac{q(x)}{p(x)} dx \]
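The following sketch illustrates both behaviors numerically (my own example; the bimodal target and the grid search are assumptions, not part of the slides). A single Gaussian \( q \) is fitted to a two-mode mixture \( p \): minimizing \( KL(p||q) \) over Gaussians amounts to matching the mean and variance of \( p \), giving a wide mass-covering fit, while minimizing \( KL(q||p) \) locks onto one mode.

```python
# Forward vs reverse KL when approximating a bimodal p(x) with a single Gaussian q(x).
import numpy as np
from scipy.stats import norm

x = np.linspace(-12, 12, 4001)                             # integration grid
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)     # two well-separated modes

def kl(a, b):
    """Discretized KL(a||b) on the grid."""
    return float(np.sum(a * np.log(a / b)) * dx)

# KL(p||q): the optimal Gaussian matches the moments of p  -> mass covering
mu_f = np.sum(x * p) * dx
var_f = np.sum((x - mu_f) ** 2 * p) * dx

# KL(q||p): minimize numerically over (mu, sigma)          -> mode collapsing
best = min(
    ((kl(norm.pdf(x, m, s), p), m, s)
     for m in np.linspace(-5, 5, 101)
     for s in np.linspace(0.5, 4, 71)),
    key=lambda t: t[0],
)

print(f"KL(p||q) fit: mu={mu_f:.2f}, sigma={np.sqrt(var_f):.2f} (covers both modes)")
print(f"KL(q||p) fit: mu={best[1]:.2f}, sigma={best[2]:.2f} (sits on one mode)")
```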

EM-Algorithm: Motivating Example

How can we learn to reconstruct distributions?

Gaussian:

[Figure: a single Gaussian density]

Multiple Gaussians:

[Figure: a mixture of several Gaussians]

Unknown distribution:

[Figure: an arbitrary unknown density]

We can introduce the number (index) of the Gaussian that generated each point and reduce the problem to an already solved one (as in the joke about the electric kettle)!

Introducing Latent Variables

We want: $$p(X|\theta) \underset{\theta}{\to} \max \text{(incomplete likelihood)}$$

We can: $$p(X, Z|\theta) \underset{\theta}{\to} \max \text{(complete likelihood)}$$

In the previous example, $z_i$ is the index of the Gaussian from which the $i$-th object was drawn.

Variational Lower Bound

For a single observation $x$:

$$ \begin{align*} \log p(x|\theta) &= \int q(z) \log p(x|\theta) dz = \int q(z) \log \frac{p(x, z|\theta)}{p(z|x,\theta)} dz \\ &= \int q(z) \log \frac{p(x, z|\theta)q(z)}{p(z|x,\theta)q(z)} dz\\ &= \int q(z) \log \frac{p(x, z|\theta)}{q(z)} dz + \int q(z) \log \frac{q(z)}{p(z|x,\theta)} dz \geq\\ &\geq \int q(z) \log \frac{p(x, z|\theta)}{q(z)} dz = \mathscr{L}(q, \theta) \end{align*} $$

This is the variational lower bound: we simply discarded the non-negative $KL(q(z) || p(z|x,\theta))$. Furthermore, equality is reached at $q^* = p(z|x, \theta)$.


For dataset $X$:

$$ \log p(X| \theta) = \sum\limits_{i=1}^n \log p(x_i|\theta) \geq \sum\limits_{i=1}^n \mathscr{L}(q_i, \theta) \underset{q, \theta}{\to} \max $$

It is easy to maximize over $\theta$ if we know $q$, and vice versa.

EM-Algorithm

E-step (Expectation):

$$ \sum\limits_{i=1}^n \mathscr{L}(q_i, \theta) \underset{q_i}{\to} \max \Leftrightarrow q_i(z_i) = p(z_i|x_i, \theta) $$

M-step (Maximization):

$$ \begin{align*} \max\limits_\theta \sum\limits_{i=1}^n \mathscr{L}(q_i, \theta) &= \max\limits_\theta \sum\limits_{i=1}^n \int q_i(z_i) \log\frac{p(x_i, z_i| \theta)}{q_i(z_i)} dz_i \\ &= \max\limits_\theta \sum\limits_{i=1}^n E_{q_i(z_i)} \log p(x_i, z_i| \theta) \end{align*} $$

The last equality holds because the entropy term $-\int q_i(z_i) \log q_i(z_i)\, dz_i$ does not depend on $\theta$.

Gaussian Mixture Model


The joint probability of $x_i$ and $z_i$ given $\theta$ can be represented as:

$$ p(x_i, z_i|\theta) = p(x_i|z_i,\theta)p(z_i|\theta) = N(x_i|\mu_{z_i}, \sigma_{z_i}^2) \pi_{z_i} $$

Parameters: $\{\mu_j, \sigma_j, \pi_j\}$, where $\pi_j$ is the prior probability of the $j$-th Gaussian (i.e., of the hidden variable $z_i$ taking the value $j$)

Question 1: How do you find the number of mixture components?

  • We can simply enumerate candidate numbers of components and compare the resulting models (model selection)
  • We can take a different approach: introduce a prior distribution directly over all models of different complexity. This is the main idea of Bayesian nonparametric methods
  • See also the "Chinese Restaurant Process" and the "Indian Buffet Process" for possible solutions

EM-Algorithm for Gaussian Mixture Model

E-step: $\mu_k, \sigma_k^2, \pi_k \to q$

$$ q_i(z_i{=}j) = p(z_i{=}j|x_i, \theta) = \frac{p(x_i|z_i {=} j, \theta)p(z_i {=} j|\theta)}{p(x_i|\theta)} = \\ = \frac{p(x_i|z_i {=} j, \theta)p(z_i {=} j|\theta)}{\sum\limits_{k=1}^K p(x_i|z_i {=} k, \theta)p(z_i {=} k|\theta)} = \frac{N(x_i|\mu_j, \sigma_j^2) \pi_j}{\sum\limits_{k=1}^K N(x_i|\mu_k, \sigma_k^2) \pi_k} $$

M-step: $q \to \mu_k, \sigma_k^2, \pi_k$

$$ \begin{align*} \sum\limits_{i=1}^n \mathscr{L}(q,\theta) &= \sum\limits_{i=1}^n \sum\limits_{k=1}^K q_i(z_i=k) \log\frac{p(x_i|z_i = k, \theta)p(z_i = k|\theta)}{q_i(z_i=k)} \\ &= \sum\limits_{i=1}^n \sum\limits_{k=1}^K q_i(z_i=k) \log\frac{N(x_i|\mu_k, \sigma_k^2)\pi_k}{q_i(z_i=k)} \end{align*} $$

Next, as usual, we find the zeros of the derivative:

$$ \mu_j = \frac{\sum\limits_{i=1}^n q_i(z_i=j)x_i}{\sum\limits_{i=1}^n q_i(z_i=j)};\ \sigma_j^2 = \frac{\sum\limits_{i=1}^n q_i(z_i=j)(x_i-\mu_j)^2}{\sum\limits_{i=1}^n q_i(z_i=j)}\\ \pi_j = \frac{1}{n} \sum\limits_{i=1}^n q_i(z_i = j) $$
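A compact one-dimensional sketch of the whole loop, implementing exactly the E-step and M-step formulas above (the data, the initialization, and the fixed number of iterations are my own assumptions for illustration):

```python
# EM for a 1-D Gaussian mixture with K components.
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    mu = rng.choice(x, size=K, replace=False)     # naive initialization
    sigma2 = np.full(K, np.var(x))
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: q_i(z_i = j) proportional to N(x_i | mu_j, sigma_j^2) * pi_j
        dens = norm.pdf(x[:, None], mu[None, :], np.sqrt(sigma2)[None, :]) * pi[None, :]
        q = dens / dens.sum(axis=1, keepdims=True)            # shape (n, K)

        # M-step: weighted means, variances and mixing weights from the formulas above
        nk = q.sum(axis=0)                                    # effective counts
        mu = (q * x[:, None]).sum(axis=0) / nk
        sigma2 = (q * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk
        pi = nk / n
    return mu, sigma2, pi

# usage: a sample from two Gaussians
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
print(em_gmm_1d(x, K=2))
```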

Principal Component Analysis: Problem Formulation

$f_1(x), \dots, f_D(x)$ — original numerical features

$g_1(x), \dots, g_d(x)$ — new numerical features, $d \leq D$

Requirement: The old features should be restored as accurately as possible linearly from the new ones on the training set $X^n$:

$$ \hat f_j(x) = \sum\limits_{s=1}^d g_s(x) u_{js}, \quad j=1,\dots,D, \quad \forall x \in X $$ $$ \sum\limits_{i=1}^n \sum\limits_{j=1}^D (\hat f_j(x_i) - f_j(x_i))^2 \to \min\limits_{\{g_s(x_i)\}, \{u_{js}\}} $$

Matrix Notations

Old and new "object-feature" matrices:

$$ F_{n \times D} = \left[ \begin{array}{ccc} f_1(x_1) & \dots & f_D(x_1) \\ \dots & \dots & \dots \\ f_1(x_n) & \dots & f_D(x_n) \end{array} \right] \ ,\\[.5cm] G_{n \times d} = \left[ \begin{array}{ccc} g_1(x_1) & \dots & g_d(x_1) \\ \dots & \dots & \dots \\ g_1(x_n) & \dots & g_d(x_n) \end{array} \right] $$

The transformation of old features into new ones should be linear:

$$ U_{D \times d} = \left[ \begin{array}{ccc} u_{11} & \dots & u_{1d} \\ \dots & \dots & \dots \\ u_{D1} & \dots & u_{Dd} \end{array} \right], \hat F = GU^T \underset{\text{desired}}{\approx} F $$

To find: both the new features $G$ and the transformation $U$:

$$ \sum\limits_{i=1}^n \sum\limits_{j=1}^D (\hat f_j(x_i) - f_j(x_i))^2 = \| GU^T - F\|^2 \to \min\limits_{G,U} $$
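As is well known, this minimum is attained by the truncated singular value decomposition of $F$ (the Eckart–Young theorem). A minimal numpy check (the random matrix and $d = 3$ are assumed purely for illustration):

```python
# The best rank-d approximation of F in Frobenius norm is its truncated SVD.
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(100, 10))                     # n x D "object-feature" matrix
d = 3

L, S, Rt = np.linalg.svd(F, full_matrices=False)   # left vectors, values, right^T
G = L[:, :d] * S[:d]                               # n x d  -- the new features
U = Rt[:d].T                                       # D x d  -- the transformation
F_hat = G @ U.T                                    # rank-d reconstruction

print(np.linalg.norm(F_hat - F) ** 2)              # residual equals the sum of
print(np.sum(S[d:] ** 2))                          # the squared discarded singular values
```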

Probabilistic Interpretation of the Principal Component Analysis (PPCA)

$$ p(x,z|\theta) = p(x|z, \theta)p(z) = N(x|Wz + \mu, \sigma^2I)N(z|0, I), $$

where $x \in R^D$, $z \in R^d$ and $z \sim N(0, I)$

The parameters are $\{W, \mu, \sigma\}$, where $\mu \in R^D$, $W \in R^{D\times d}$

Note: constraints can also be imposed on the matrix $W$ via a prior distribution, so that the effective dimension $d$ is minimized during optimization.

EM-Algorithm for Probabilistic Principal Component Analysis

E-step (calculated analytically, since the distributions involved are conjugate):

$$ q_i(z_i) = p(z_i|x_i, \theta) = \frac{p(x_i|z_i, \theta)p(z_i)}{\int p(x_i|z_i, \theta)p(z_i) dz_i} = \\[.2cm] = N(z_i| (\sigma^2I + W^TW)^{-1}W^T (x_i - \mu), (I + \sigma^{-2}W^TW)^{-1}) $$

M-step:

$$ E_Z\log p(X, Z|\theta) = \sum\limits_{i=1}^n E_{z_i} \log p(x_i, z_i|\theta) = \\ \sum\limits_{i=1}^n E_{z_i} \log p(x_i| z_i, \theta) + const \underset{\theta}{\to} \max $$

M-step example: finding W

$$ \sum\limits_{i=1}^n E_{z_i} \log p(x_i| z_i, \theta) = \\ = \sum\limits_{i=1}^n E_{z_i} \left[-\frac{D}{2}\log2\pi-D\log\sigma-\frac{1}{2\sigma^2}\left((x_i - \mu)-Wz_i\right)^T\left((x_i - \mu)-Wz_i\right) \right] = \\ = const(\sigma) - \frac{1}{2\sigma^2} \sum\limits_{i=1}^n E_{z_i} \left[(x_i - \mu)^T(x_i - \mu) - 2(x_i - \mu)^T Wz_i + z_i^TW^T Wz_i \right] = \\ = const(\sigma) - \\ - \frac{1}{2\sigma^2} \sum\limits_{i=1}^n \left[(x_i - \mu)^T(x_i - \mu) - 2(x_i - \mu)^T W E_{z_i}z_i + tr(W^T WE_{z_i}z_iz_i^T) \right] $$

Next, differentiate with respect to $W$:

$$ - \frac{1}{2\sigma^2} \sum\limits_{i=1}^n \left[-2(x_i - \mu) E_{z_i} z_i^T + 2W E_{z_i} z_i z_i^T \right] = 0 \\ W^* = \left( \sum\limits_{i=1}^n (x_i - \mu) E_{z_i} z_i^T \right) \left( \sum\limits_{i=1}^n E_{z_i} z_i z_i^T \right)^{-1} $$

We can calculate the expectations since we know the distribution $q$ from the previous E-step.
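Putting the E-step posterior moments and this M-step together gives the following numpy sketch of the full PPCA EM loop (an assumed illustration: the $\sigma^2$ update is the standard PPCA one, stated here without the analogous derivation, and the toy data are my own):

```python
# EM for probabilistic PCA: x = W z + mu + noise, z ~ N(0, I), noise ~ N(0, sigma^2 I).
import numpy as np

def ppca_em(X, d, n_iter=50, seed=0):
    n, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    Xc = X - mu                                   # rows are (x_i - mu)^T
    W = rng.normal(size=(D, d))
    sigma2 = 1.0

    for _ in range(n_iter):
        # E-step: q(z_i) = N(M^{-1} W^T (x_i - mu), sigma^2 M^{-1}), M = W^T W + sigma^2 I
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(d))
        Ez = Xc @ W @ Minv                        # n x d, rows are E[z_i]^T
        Ezz = n * sigma2 * Minv + Ez.T @ Ez       # sum_i E[z_i z_i^T]

        # M-step: W* = (sum_i (x_i - mu) E[z_i]^T) (sum_i E[z_i z_i^T])^{-1}
        W = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        # standard PPCA update for sigma^2 (assumption: not derived in the slides)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum((Xc @ W) * Ez)
                  + np.trace(Ezz @ W.T @ W)) / (n * D)
    return W, mu, sigma2

# usage: a 2-dimensional latent structure embedded in 5 dimensions plus noise
rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 2))
A = rng.normal(size=(5, 2))
X = Z @ A.T + 0.1 * rng.normal(size=(500, 5))
W, mu, sigma2 = ppca_em(X, d=2)
print(sigma2)   # roughly the true noise variance 0.01
```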

Comparison of Runtime Estimates

When $n > D > d$:

  • Standard PCA: $O(nD^2)$
  • A single iteration of PPCA: $O(nDd)$

Usually, 10-100 iterations are needed for convergence, so the EM approach may still be faster than the standard method.

The case of missing features in Probabilistic Principal Component Analysis

$$ (X, Z) = X_{\text{obs}} \cup X_{\text{hid}}, \quad p(X_{\text{obs}}|\theta) \underset{\theta}{\to} \max $$

E-step

$$ q(X_{\text{hid}}) = p(X_{\text{hid}}|X_{\text{obs}}, \theta) $$

M-step

$$ E_{X_\text{hid}} \log p(X_{\text{obs}}, X_{\text{hid}} |\theta) \underset{\theta}{\to} \max $$

As a by-product, we also obtain suitable probability distributions for the hidden variables.

Summary

  • A common measure of dissimilarity between probability distributions is the Kullback-Leibler divergence (not a true distance: it is asymmetric)
  • The EM algorithm solves complex likelihood-maximization problems iteratively by splitting them into two alternating steps
  • Probabilistic Principal Component Analysis uses the EM algorithm