Given: a set of objects $X$ and a set of answers $Y$.
A discriminative model estimates the conditional distribution $p(y | x)$ directly.
A generative model estimates the joint distribution $p(x, y)$, from which the conditional follows: $$p(y | x) = \frac{p(x, y)}{p(x)} \propto p(x, y)$$
On the one hand, discriminative models cannot generate objects. On the other hand, they are much more expressive, because they make no assumptions about the structure of the feature space $X$.
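As a small illustration of how a generative model yields $p(y | x)$, the sketch below normalizes a toy joint table over $y$ for a fixed $x$. The table values and the helper name `posterior` are made up for this example.

```python
# Toy joint distribution p(x, y) over two binary variables (made-up numbers).
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.15, (1, 1): 0.45,
}

def posterior(joint, x):
    """Return p(y | x) by normalizing the joint over y for a fixed x."""
    unnormalized = {y: p for (xi, y), p in joint.items() if xi == x}
    evidence = sum(unnormalized.values())  # p(x) = sum over y of p(x, y)
    return {y: p / evidence for y, p in unnormalized.items()}

post_x1 = posterior(joint, 1)
print(post_x1)
```

A discriminative model would fit the values of `post_x1` directly and would never need the full table.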
Naive Bayes
$$ p(x, y) = p(y) \prod\limits_{i=1}^n p(x_i | y) $$
Autoregressive models (including Transformers)
$$ p(x_1, \dots, x_n) = p(x_1) p(x_2|x_1) \dots p(x_n | x_1, \dots, x_{n-1}) $$
Diffusion based
$$ x \to x_1 = x + \varepsilon_1 \to \dots \to x_n \sim N(0, \Sigma) $$
Variational autoencoders (VAE) are in these slides
Generative Adversarial Networks (GANs)
$$ z \sim N(0, \Sigma) \to \boxed{\text{Model}} \to x$$
Rumelhart, Hinton, Williams. Learning Internal Representations by Error Propagation. 1986.
David Charte et al. A practical tutorial on autoencoders for nonlinear feature fusion: taxonomy, models, software and guidelines. 2018.
Principal Component Analysis: $F = (x_1 \dots x_{\ell})^T$, $U^TU = I_m$, $G = FU$,
$$\|F - GU^T \|^2 = \sum\limits_{i = 1}^{\ell} \|{\color{orange}UU^T}x_i - x_i \|^2 \to \min\limits_{U}$$
The autoencoder generalizes principal component analysis: in PCA the encoder $z = U^T x$ and the decoder $\hat x = U z$ are both linear.
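To make the link to PCA concrete, here is a minimal NumPy sketch that treats PCA as a linear autoencoder with encoder $U^T x$ and decoder $U z$. The data, the dimensions, and the comparison with a random orthonormal matrix are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 objects with 5 features lying near a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
F = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

m = 2
# Rows of Vt are right singular vectors of F; the top m of them form U (5 x m).
_, _, Vt = np.linalg.svd(F, full_matrices=False)
U = Vt[:m].T

G = F @ U          # encoder: z_i = U^T x_i, stacked as rows of G
F_hat = G @ U.T    # decoder: x_hat_i = U z_i

pca_error = np.sum((F - F_hat) ** 2)

# Any other matrix with orthonormal columns reconstructs F no better.
Q, _ = np.linalg.qr(rng.normal(size=(5, m)))
random_error = np.sum((F - F @ Q @ Q.T) ** 2)

print(pca_error, random_error)
```

Replacing the two matrix products with nonlinear maps trained by gradient descent gives a general autoencoder.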
Reminder from linear models: if the loss function has a kink, it selects objects (as the hinge loss selects support vectors). If the regularizer has a kink, it selects features (as the $L_1$ penalty does).
D.Arpit et al. Why regularized auto-encoders learn sparse representation? 2015.
We build a generative model capable of generating new objects $x$ similar to the objects of the sample $X^\ell = \{x_1,\dots,x_\ell \}$.
$q_\alpha(z|x)$ is a probabilistic encoder with parameters $\alpha$
$p_\beta(\hat x|z)$ is a probabilistic decoder with parameters $\beta$
$$ \begin{align*} \mathscr{L}_{\text{VAE}}(\alpha, \beta) &= \sum\limits_{i=1}^\ell \log p(x_i) = \sum\limits_{i=1}^\ell \log \int q_{\alpha} (z | x_i) \frac{p_{\beta}(x_i|z) p(z)}{q_{\alpha} (z | x_i)} dz \\ &\geq \sum\limits_{i=1}^\ell \left( \int q_\alpha(z|x_i) \log p_\beta(x_i|z)dz - KL(q_\alpha(z|x_i)\| p( z)) \right) \to \max\limits_{\alpha, \beta} \end{align*} $$
where $p(z)$ is the prior distribution, usually $N(0, \sigma^2 I)$, and the lower bound follows from Jensen's inequality.
Reparametrization $q_\alpha (z|x_i):\ z = f(x_i, \alpha, \varepsilon),\ \varepsilon \sim N(0, I)$
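A minimal sketch of the reparametrization above, assuming a Gaussian encoder $q_\alpha(z|x) = N(\mu(x), \operatorname{diag}\sigma^2(x))$ and a standard normal prior $N(0, I)$. Neither choice is fixed by the slide. The linear `encode` function and its weight names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, alpha):
    """Hypothetical linear encoder: mean and log-variance of q(z | x)."""
    mu = alpha["W_mu"] @ x
    log_var = alpha["W_logvar"] @ x
    return mu, log_var

def reparametrize(mu, log_var, eps):
    """z = f(x, alpha, eps) = mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I))."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

alpha = {
    "W_mu": rng.normal(size=(2, 4)),
    "W_logvar": 0.1 * rng.normal(size=(2, 4)),
}
x = rng.normal(size=4)
mu, log_var = encode(x, alpha)

eps = rng.normal(size=(10000, 2))
z = reparametrize(mu, log_var, eps)  # 10000 samples of z, shape (10000, 2)

print(z.mean(axis=0), mu)
print(kl_to_standard_normal(mu, log_var))
```

Because the randomness lives only in `eps`, `z` is a deterministic, differentiable function of `alpha`, which is what lets gradients pass through the sampling step.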
Stochastic gradient method:
Generation of similar objects:
$$x \sim p_\beta(x|f({\color{orange}x_i}, \alpha, \varepsilon)), \varepsilon \sim N(0, I)$$
D.P.Kingma, M.Welling. Auto-encoding Variational Bayes. 2013.
C.Doersch. Tutorial on variational autoencoders. 2016.
Data: unlabeled $(x_i)_{i=1}^\ell$, labeled $(x_i, y_i)_{i=\ell+1}^{\ell + k}$
Joint training of the encoder, decoder, and predictive model (classification, regression, etc.):
$$ \sum\limits_{i=1}^\ell \mathscr{L}(g(f(x_i, \alpha), \beta), x_i) + \lambda \sum\limits_{i=\ell+1}^{\ell+k} \tilde{\mathscr{L}}(\hat y(f(x_i, \alpha), \gamma), y_i) \to \min\limits_{\alpha, \beta, \gamma} $$
Loss functions:
Dor Bank, Noam Koenigstein, Raja Giryes. Autoencoders. 2020.
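The joint objective of the semi-supervised autoencoder above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: squared error is used for both $\mathscr{L}$ and $\tilde{\mathscr{L}}$, the encoder is a single `tanh` layer, and the decoder and predictor are linear. None of these choices comes from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 2                                # feature and code dimensions (arbitrary)
X_unlabeled = rng.normal(size=(50, d))     # x_1 .. x_l
X_labeled = rng.normal(size=(10, d))       # x_{l+1} .. x_{l+k}
y_labeled = rng.normal(size=10)            # y_{l+1} .. y_{l+k}

def f(X, alpha):
    """Encoder: maps objects to codes."""
    return np.tanh(X @ alpha)

def g(Z, beta):
    """Decoder: maps codes back to the feature space."""
    return Z @ beta

def y_hat(Z, gamma):
    """Predictive model built on top of the code."""
    return Z @ gamma

def objective(alpha, beta, gamma, lam):
    """Reconstruction loss on unlabeled data plus lam times prediction loss."""
    recon = np.sum((g(f(X_unlabeled, alpha), beta) - X_unlabeled) ** 2)
    pred = np.sum((y_hat(f(X_labeled, alpha), gamma) - y_labeled) ** 2)
    return recon + lam * pred, recon, pred

alpha = 0.1 * rng.normal(size=(d, m))
beta = 0.1 * rng.normal(size=(m, d))
gamma = 0.1 * rng.normal(size=m)

total, recon, pred = objective(alpha, beta, gamma, lam=0.5)
print(total, recon, pred)
```

Both terms depend on `alpha`, so the labeled objects shape the code that the decoder must also reconstruct from.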
The generator $G(z)$ learns to generate objects $x$ from noise $z$. The discriminator $D(x)$ learns to distinguish them from real objects.
Chris Nicholson. A Beginner's Guide to Generative Adversarial Networks. 2019.
There is a sample of objects $\{x_i\}_{i=1}^m$ from $X$. We train
So the whole problem is $\min\limits_{\alpha, G} \max\limits_{\beta, D} L(\alpha, \beta)$
Ok, but how to train it?
It is not an easy question!
Plain SGD does not work here as is, so in the first publications the authors:
which is similar to the EM algorithm.
This also fails in practice because of vanishing gradients. The first naive workaround is to replace the generator loss:
$$ L_G = \sum\limits_{i=1}^m \ln(1 - D(G(z_i, {\color{orange}\alpha}), \beta)) \to L_G^\prime = -\sum\limits_{i=1}^m \ln(D(G(z_i, {\color{orange}\alpha}), \beta))$$
LSGAN is a type of GAN that solves a least-squares problem during training, which stabilizes the training process.
Mao et al. showed that LSGAN is more robust to architecture changes and suffers less from mode collapse.
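To see why replacing $L_G$ with $L_G^\prime$ helps against vanishing gradients, the sketch below compares the gradients of one term of each loss. It assumes the discriminator output is a sigmoid of a logit $a$, i.e. $D = \sigma(a)$. That parametrization is an assumption for illustration, not something the slides state.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da ln(1 - sigmoid(a)) = -sigmoid(a)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da [-ln sigmoid(a)] = -(1 - sigmoid(a))."""
    return -(1.0 - sigmoid(a))

# Negative logits correspond to fakes that the discriminator rejects confidently.
for a in (-8.0, -4.0, 0.0):
    print(
        f"D={sigmoid(a):.4f}  "
        f"saturating={grad_saturating(a):+.4f}  "
        f"non-saturating={grad_non_saturating(a):+.4f}"
    )
```

When the discriminator rejects a generated object with confidence (very negative `a`), the original loss passes almost no gradient back to the generator, while the modified loss keeps the gradient magnitude close to 1.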
WGAN is a variant of GAN that uses the Wasserstein distance to measure the difference between the distribution of the generated data and the distribution of the real data.
The Wasserstein distance is also known as the Earth Mover's Distance (EMD).
The EMD between probability distributions $P$ and $Q$ can be defined as an infimum over joint probabilities:
$$\text{EMD}(P,Q) = \inf\limits_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}\left[d(x, y)\right]$$
where $\Pi(P, Q)$ is the set of all joint distributions whose marginals are $P$ and $Q$. By Kantorovich-Rubinstein duality, this can also be expressed as:
$$\text{EMD}(P,Q) = \sup\limits_{\| f \|_L \leq 1} \left( \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{y \sim Q}[f(y)] \right)$$
where the supremum is taken over all 1-Lipschitz continuous functions, i.e. (for differentiable $f$) $\| \nabla f(x)\| \leq 1 \quad \forall x$.
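A small worked example of both forms of the EMD, under simplifying assumptions: the distributions are empirical samples of equal size on the real line with $d(x, y) = |x - y|$. In that case the optimal coupling matches sorted points. The sample values and the test function $f(x) = -x$ are chosen only for illustration.

```python
def emd_1d(xs, ys):
    """EMD between two equal-size empirical samples on the line, d(x, y) = |x - y|."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    # In 1-D the optimal transport plan pairs the i-th smallest points.
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def mean(values):
    return sum(values) / len(values)

p_sample = [0.0, 1.0, 3.0]
q_sample = [5.0, 6.0, 8.0]  # the same points shifted by 5

primal_value = emd_1d(p_sample, q_sample)

def f(x):
    """A 1-Lipschitz test function for the dual form."""
    return -x

dual_value = mean([f(x) for x in p_sample]) - mean([f(y) for y in q_sample])

print(primal_value, dual_value)
```

For a pure shift the linear function attains the supremum, so the two values agree up to rounding. In WGAN the critic network plays the role of `f` and has to be kept approximately 1-Lipschitz.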
For evaluation of generated images (or other objects) one usually uses one of two metrics:
It is almost impossible to use these metrics as losses, since they are estimated over a whole set of objects rather than a single one.
Papers are here: