Let's start with well-known
If \( f(x) \) is a concave (upward-convex) function, then Jensen's inequality holds:
Or, in a more general case,
\[f(w_1x_1 + w_2x_2 + \dots + w_nx_n) \geq w_1f(x_1) + w_2f(x_2) + \dots + w_nf(x_n)\]where \( \sum\limits_{i=1}^n w_i = 1 \).
It can also be proven that
\[ f\left(\mathbb{E}_{p(x)} u(x)\right) \geq \mathbb{E}_{p(x)} f(u(x)) \]\[ KL(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx \]
Properties of Kullback-Leibler Divergence:
Proof of non-negativity of KL divergence:
\[ -KL(p||q) = -\int p(x) \log\left(\frac{p(x)}{q(x)}\right)dx = \int p(x) \log\left(\frac{q(x)}{p(x)}\right)dx \leq \\ \leq \log\left( \int p(x) \frac{q(x)}{p(x)}dx\right) = \log\left( \int q(x) dx\right) = \log 1 = 0 \]Example: searching for a similar Distribution
\( KL(? || ?) \) — where to place \( p \)?
First argument: mass covering
\( KL(p || q) = \int p(x) \log \frac{p(x)}{q(x)} dx \)
Second argument: mode collapsing
\[ KL(q || p) = \int q(x) \log \frac{q(x)}{p(x)} dx \]How can we learn to reconstruct distributions?
Gaussian:
Multiple Gaussians:
Unknown Distribution:
We can introduce the Gaussian number and reduce the problem to an already solved one (as in joke with electric kettle, yeah)!
We want: $$p(X|\theta) \underset{\theta}{\to} \max \text{(incomplete likelihood)}$$
We can: $$p(X, Z|\theta) \underset{\theta}{\to} \max \text{(complete likelihood)}$$
Let's say in the previous example, $z_i$ is the identifier of the Gaussian from which the $i$-th object is taken.
For a single observation $x$:
$$ \begin{align*} \log p(x|\theta) &= \int q(z) \log p(x|\theta) dz = \int q(z) \log \frac{p(x, z|\theta)}{p(z|x,\theta)} dz \\ &= \int q(z) \log \frac{p(x, z|\theta)q(z)}{p(z|x,\theta)q(z)} dz\\ &= \int q(z) \log \frac{p(x, z|\theta)}{q(z)} dz + \int q(z) \log \frac{q(z)}{p(z|x,\theta)} dz \geq\\ &\geq \int q(z) \log \frac{p(x, z|\theta)}{q(z)} dz = \mathscr{L}(q, \theta) \end{align*} $$This is the variational lower bound: we simply discarded the non-negative $KL(q(z) || p(z|x,\theta))$. Furthermore, equality is reached at $q^* = p(z| x, \theta)$
For dataset $X$:
$$ \log p(X| \theta) = \sum\limits_{i=1}^n \log p(x_i|\theta) \geq \sum\limits_{i=1}^n \mathscr{L}(q_i, \theta) \underset{q, \theta}{\to} \max $$It's easy to maximize for $\theta$ if we know $q$ and vice versa.
E-step (Expectation):
$$ \sum\limits_{i=1}^n \mathscr{L}(q_i, \theta) \underset{q_i}{\to} \max \Leftrightarrow q_i(z_i) = p(z_i|x_i, \theta) $$M-step (Maximization):
$$ \begin{align*} \max\limits_\theta \sum\limits_{i=1}^n \mathscr{L}(q_i, \theta) &= \max\limits_\theta \sum\limits_{i=1}^n \int q_i(z_i) \log\frac{p(x_i, z_i| \theta)}{q_i(z_i)} dz_i \\ &= \max\limits_\theta \sum\limits_{i=1}^n E_{q_i(z_i)} \log p(x_i, z_i| \theta) \end{align*} $$The joint probability of $x_i$ and $z_i$ given $\theta$ can be represented as:
$$ p(x_i, z_i|\theta) = p(x_i|z_i,\theta)p(z_i|\theta) = N(x_i|\mu_{z_i}, \sigma_{z_i}^2) \pi_{z_i} $$Parameters: $\{\mu_j, \sigma_j, \pi_j\}$, where $\pi_j$ is a priori probability of the $j$-th Gaussian (hidden variable $z_i$)
E-step: $\mu_k, \sigma_k^2, \pi_k \to q$
$$ q_i(z_i{=}j) = p(z_i{=}j|x_i, \theta) = \frac{p(x_i|z_i {=} j, \theta)p(z_i {=} j|\theta)}{p(x_i|\theta)} = \\ = \frac{p(x_i|z_i {=} j, \theta)p(z_i {=} j|\theta)}{\sum\limits_{k=1}^K p(x_i|z_i {=} k, \theta)p(z_i {=} k|\theta)} = \frac{N(x_i|\mu_j, \sigma_j^2) \pi_j}{\sum\limits_{k=1}^K N(x_i|\mu_k, \sigma_k^2) \pi_k} $$M-step: $q \to \mu_k, \sigma_k^2, \pi_k$
$$ \begin{align*} \sum\limits_{i=1}^n \mathscr{L}(q,\theta) &= \sum\limits_{i=1}^n \sum\limits_{k=1}^K q_i(z_i=k) \log\frac{p(x_i|z_i = k, \theta)p(z_i = k|\theta)}{q_i(z_i=k)} \\ &= \sum\limits_{i=1}^n \sum\limits_{k=1}^K q_i(z_i=k) \log\frac{N(x_i|\mu_k, \sigma_k^2)\pi_k}{q_i(z_i=k)} \end{align*} $$Next, as usual, we find the zeros of the derivative:
$$ \mu_j = \frac{\sum\limits_{i=1}^n q_i(z_i=j)x_i}{\sum\limits_{i=1}^n q_i(z_i=j)};\ \sigma_j^2 = \frac{\sum\limits_{i=1}^n q_i(z_i=j)(x_i-\mu_j)^2}{\sum\limits_{i=1}^n q_i(z_i=j)}\\ \pi_j = \frac{1}{n} \sum\limits_{i=1}^n q_i(z_i = j) $$$f_1(x), \dots, f_D(x)$ — original numerical features
$g_1(x), \dots, g_d(x)$ — new numerical features, $d \leq D$
Requirement: The old features should be restored as accurately as possible linearly from the new ones on the training set $X^n$:
$$ \hat f_j(x) = \sum\limits_{s=1}^d g_s(x) u_{js}, j=1,\dots,D , \forall x \in X $$ $$ \sum\limits_{i=1}^n \sum\limits_{j=1}^D (\hat f_j(x_i) - f_j(x_i))^2 \to \min\limits_{\{g_s(x_i)\}, \{u_{js}\}} $$Old and new "object-feature" matrices:
$$ F_{n \times D} = \left[ \begin{array}{ccc} f_1(x_1) & \dots & f_D(x_1) \\ \dots & \dots & \dots \\ f_1(x_n) & \dots & f_D(x_n) \end{array} \right] \ ,\\[.5cm] G_{n \times d} = \left[ \begin{array}{ccc} g_1(x_1) & \dots & g_d(x_1) \\ \dots & \dots & \dots \\ g_1(x_n) & \dots & g_d(x_n) \end{array} \right] $$The transformation of old features into new ones should be linear:
$$ U_{D \times d} = \left[ \begin{array}{ccc} u_{11} & \dots & u_{1d} \\ \dots & \dots & \dots \\ u_{D1} & \dots & u_{Dd} \end{array} \right], \hat F = GU^T \underset{\text{desired}}{\approx} F $$To find: both new features $G$, and transformation $U$:
$$ \sum\limits_{i=1}^n \sum\limits_{j=1}^D (\hat f_j(x_i) - f_j(x_i))^2 = \| GU^T - F\|^2 \to \min\limits_{G,U} $$where $x \in R^D$, $z \in R^d$ and $z \sim N(0, I)$
The parameters are ${W, \mu, \sigma}$, where $\mu \in R^D$, $W \in R^{D\times d}$
E-step (is calculated analytically, as conjugate distributions exist):
$$ q_i(z_i) = p(z_i|x_i, \theta) = \frac{p(x_i|z_i, \theta)p(z_i)}{\int p(x_i|z_i, \theta)p(z_i) dz_i} = \\[.2cm] = N(z_i| (\sigma^2I + W^TW)^{-1}W^T (x_i - \mu), (I + \sigma^{-2}W^TW)^{-1}) $$M-step:
$$ E_Z\log p(X, Z|\theta) = \sum\limits_{i=1}^n E_{z_i} \log p(x_i, z_i|\theta) = \\ \sum\limits_{i=1}^n E_{z_i} \log p(x_i| z_i, \theta) + const \underset{\theta}{\to} \max $$M-step example: finding W
$$ \sum\limits_{i=1}^n E_{z_i} \log p(x_i| z_i, \theta) = \\ = \sum\limits_{i=1}^n E_{z_i} \left[-\frac{D}{2}\log2\pi-D\log\sigma-\frac{1}{2\sigma^2}\left((x_i - \mu)-Wz_i\right)^T\left((x_i - \mu)-Wz_i\right) \right] = \\ = const(\sigma) - \frac{1}{2\sigma^2} \sum\limits_{i=1}^n E_{z_i} \left[(x_i - \mu)^T(x_i - \mu) - 2(x_i - \mu)^T Wz_i + z_i^TW^T Wz_i \right] = \\ = const(\sigma) - \\ - \frac{1}{2\sigma^2} \sum\limits_{i=1}^n \left[(x_i - \mu)^T(x_i - \mu) - 2(x_i - \mu)^T W E_{z_i}z_i + tr(W^T WE_{z_i}z_iz_i^T) \right] $$Next, differentiate with respect to $W$:
$$ - \frac{1}{2\sigma^2} \sum\limits_{i=1}^n \left[-2(x_i - \mu) E_{z_i} z_i^T + 2W E_{z_i} z_i z_i^T \right] = 0 \\ W^* = \left( \sum\limits_{i=1}^n (x_i - \mu) E_{z_i} z_i^T \right) \left( \sum\limits_{i=1}^n E_{z_i} z_i z_i^T \right)^{-1} $$We can calculate the expectations since we know the distribution $q$ from the previous E-step.
When $n > D > d$:
Usually, 10-100 iterations are needed for convergence, so it may be faster than the standard method
E-step
$$ q(X_{\text{hid}}) = p(X_{\text{hid}}|X_{\text{obs}}, \theta) $$M-step
$$ E_{X_\text{hid}} \log p(X_{\text{obs}}, X_{\text{hid}} |\theta) \underset{\theta}{\to} \max $$And we also obtain suitable probabilistic distributions for hidden variables