Training sample: $X^\ell = (x_i, y_i)^\ell_{i=1}$, $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$
Linear regression model (weights $w \in \mathbb{R}^n$):
$a(x, w) = \left< w, x\right> = w^T x= \sum\limits_{j=1}^n w_jf_j(x)$
Squared loss function:
$\mathcal{L} (a, y) = (a - y)^2$
Training Method — Least Squares Method:
$Q(w) = \frac1\ell\sum\limits_{i=1}^\ell (a(x_i, w) - y_i)^2 \to \min\limits_w$
Testing on the test set $X^k = (\tilde{x}_i, \tilde{y}_i)_{i=1}^k$:
$\overline{Q}(w) = \frac1k\sum\limits_{i=1}^k (a(\tilde{x}_i, w) - \tilde{y}_i)^2$
Training sample: $X^\ell = (x_i, y_i)^\ell_{i=1}$, $x_i \in \mathbb{R}^n$, $\ \color{orange}{y_i \in \{-1, +1\}}$
Classification Model — Linear (weights $w \in \mathbb{R}^n$):
$a(x, w) = {\color{orange}\textrm{sign}} \left< w, x\right> = {\color{orange}\textrm{sign}}(\sum\limits_{j=1}^n w_jf_j(x))$Loss Function — Binary or its approximation:
$\mathcal{L} (a, y) = [ay < 0] = [\left< w, x\right> y < 0] \leq \overline{\mathcal{L}}(\left< w, x\right>, y)$Training Method — Minimization of Empirical Risk:
$Q(w) = \frac1\ell\sum\limits_{i=1}^\ell [\left< w, x_i\right> y_i < 0] \leq \frac1\ell\sum\limits_{i=1}^\ell \overline{\mathcal{L}} (\left< w, x_i\right>, y_i) \to \min\limits_w$Testing on the test set: $X^k = (\tilde{x}_i, \tilde{y}_i)_{i=1}^k$
$\overline{Q}(w) = \frac1k\sum\limits_{i=1}^k \color{orange}{[\left<\tilde{x}_i, w\right> \tilde{y}_i < 0]}$Separating classifier:
$a(x, w) = \mathrm{sign}\ g(x, w)$
$g(x, w)$ — separating (discriminant) function, $g(x, w) = 0$ — equation of the separating surface
$M_i(w) = g(x_i, w)y_i$ — margin of object $x_i$
$[M<0]$ — threshold loss function ( it is not continuous!)
$V(M) = (1 − M)_+$ — piecewise linear (Support Vector Machine)
$H(M) = (−M)_+$ — piecewise linear (Hebb's rule)
$L(M)=\log_2(1 + \exp(-M))$ — logarithmic (Logistic Regression)
$Q(M)=(1 − M)^2$ — quadratic
$S(M)=2(1+\exp(M))^{-1}$ — sigmoidal
$E(M)= \exp(-M)$ — exponential (let's remember for future AdaBoost)
Linear model of the McCulloch-Pitts neuron, 1943:
$a(x, w) = f(\left< w, x\right>) = f\left(\sum\limits_{j=1}^n w_j x_j - w_0 \right)$
Minimization of empirical risk (regression, classification):
$Q(w) = \frac{1}{\ell} \sum\limits_{i=1}^\ell \mathcal{L}_i(w) \to \min\limits_w$Numerical minimization by the gradient descent method:
$w^{(0)} = $ initial approximation
$w^{(t+1)} = w^{(t)} - h \nabla_w Q(w^{(t)})$
where $\nabla_w Q(w) = \left(\frac{\partial Q(w)}{\partial w_j} \right)^n_{j=0}$
$w^{(t+1)} = w^{(t)} - \frac{h}{l} \sum\limits_{i=1}^\ell \nabla_w \mathcal{L}_i(w^{(t)})$
where $h$ is the gradient step, also known as the learning rate
Idea of Speeding up Convergence: take $(x_i, y_i)$ one by one and immediately update the weight vector — stochastic gradient descent.
Two possible trajectories of gradient descent:
In version (a), the target function has "good behavior", which allows the point to smoothly move to the optimum by the gradient
In version (b), the gradient begins to oscillate as it falls into a narrow valley, thus converging more slowly
Input: sample $X^\ell$, learning rate $h$, decay rate $\lambda$
Output: weight vector $w$
Until the value $\overline{Q}$ and/or weights $w$ converge
Robbins, H., Monro S. A stochastic approximation method // Annals of Mathematical Statistics, 1951, 22 (3), p. 400—407.
Problem: Calculating the $Q$ estimate over the entire selection $x_1, \dots, x_\ell$ is much slower than the gradient step for a single object $x_i$
Solution: Use an approximate recursive formula. Arithmetic Mean:
$\overline{Q}_m = \frac1m \varepsilon_m + \frac1m \varepsilon_{m-1} + \frac1m \varepsilon_{m-2} + \dots$
$\overline{Q}_m = {\color{orange}\frac1m} \varepsilon_m + {\color{orange}\left(1 - \frac1m\right)} \overline{Q}_{m-1}$
Exponential Moving Average:
$\overline{Q}_m = \lambda \varepsilon_m + \left(1 - \lambda\right)\lambda\varepsilon_{m-1} + \left(1 - \lambda\right)^2\lambda\varepsilon_{m-2} + \dots$
$\overline{Q}_m = {\color{orange}\lambda} \varepsilon_m + {\color{orange}\left(1 - \lambda\right)} \overline{Q}_{m-1}$
The parameter $\lambda$ is the rate of forgetting the history of the series.
Input: sample $X^\ell$, learning rate $h$, decay rate $\lambda$. Output: weight vector $w$
Until the value of $\overline{Q}$ and/or weights $w$ converge
Schmidt M., Le Roux N., Bach F. Minimizing finite sums with the stochastic average gradient // arXiv.org, 2013.
This estimate of $w$ is optimal if
You will have to adjust the parameters $\mu_+, \mu_-$
$$h_t \to 0, \sum\limits_{t=1}^{\infty} h_t = \infty, \sum\limits_{t=1}^{\infty} h_t^2 < \infty,$$
in particular, you can set $h_t = 1/t$
$$\mathcal{L}_i(w - h\nabla \mathcal{L}_i(w)) \to \min\limits_h$$
allows you to find an adaptive step $h^*$. For a quadratic loss function, $h^* = \|x_i\|^{-2}$
Possible reasons for overfitting:
Manifestations of overfitting:
Main way to reduce overfitting:
Regularization (weight decay)
Penalty for increasing the norm of the weight vector:
$\mathcal{\tilde L}_i(w) = \mathcal{L}_i(w) + \frac{\tau}{2} \|w\|^2 = \mathcal{L}_i(w) + \frac{\tau}{2}\sum\limits_{j=1}^n w_j^2 \to \min\limits_w$
Gradient: $\nabla \mathcal{\tilde L}_i(w) = \mathcal{\tilde L}_i(w) + \tau w$
Modification of the gradient step:
$w = w{\color{orange}(1 - h\tau)} - h \nabla \mathcal{L}_i(w)$
Methods for selecting the regularization coefficient $\tau$:
Advantages:
Disadvantages:
Linear classifier with an arbitrary number of classes \(|Ү|\):
\(a(x) = \arg\max\limits_{y \in Y} < w_y, x>\), \(x, w_y \in \mathbb{R}^n\)The probability that the object \(x\) belongs to the class \(y\):
\(P(y|x,w) = \frac{\exp< w_y, x>}{\sum\limits_{z \in Y} \exp< w_z, x>} = \underset{y \in Y}{\mathrm{softmax}} < w_y, x>\)The \(\mathrm{softmax}: \mathbb{R}^Y \to \mathbb{R}^Y\) function converts any vector into a normalized vector of a discrete distribution.
Maximization of a likelihood (log-loss) with regularization is the same as minimization of approximated empirical risk:
\( L(w) = \sum\limits_{i=1}^\ell \log P(y_i|w, x_i) - \frac{\tau}{2} \sum\limits_{y \in Y} \|w_y\|^2 \to \max\limits_w \)Let $X \times Y$ be a probability space, and the model (i.e., its parameters $w$) is also generated by some probability distribution.
Bayes' theorem:
$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
$A \equiv w$, $B \equiv Y, X\ \Rightarrow$
$P(w|Y, X) = \frac{P(Y, X|w)P(w)}{P(Y, X)} = \frac{P(Y | X, w)P(X|w)P(w)}{P(Y | X)P(X)}$
$\boxed{P(w|Y,\ X) = \frac{P(Y|w, X)P(w)}{P(Y | X)}}$
Here: $P(w|Y, X)$ — posterior distribution of parameters
$P(Y|X, w)$ — likelihood, $P(w)$ — prior distribution of parameters
Maximum likelihood
$L(w) = \sum\limits_{i=1}^\ell {\color{orange}\log P(y_i|w, x_i)} \to \max\limits_w$
Minimization of the approximated empirical risk
$Q(w) = \sum\limits_{i=1}^\ell {\color{orange}\mathcal{L}(y_i, x_i, w)} \to \min\limits_w$
These two principles are equivalent if we set
$ -\log P(y_i|w, x_i) = \mathcal{L}(y_i, x_i, w)$
Model $P(y|x,w) \equiv $ Model $g(x,w)$ and $\mathcal{L}(M)$
\(P(y|x, w)\) — probabilistic data model
\(P(w; \gamma)\) — prior distribution of model parameters, \(\gamma\) — vector of hyperparameters;
Joint likelihood of data and model
\[P(X^\ell, w) = P(X^\ell | w)P(w;\gamma)\]Maximum a Posteriori Probability (MAP) principle:
\[ L(w) = \log P(X^\ell, w) = \sum\limits_{i=1}^\ell \log P(y_i|w, x_i) + \underbrace{\color{orange}{\log P(w; \gamma)}}_{\text{regularizer}} \to \max\limits_w \]Let the weights \(w_j\) be independent, \(Ew_j = 0\), \(Dw_j = С\)
Gaussian distribution and quadratic (\(L_2\)) regularizer:
$$P(w; С) = \frac{1}{(2\pi C)^{n/2}} \exp \left(-\frac{{\|w\|^2_2}}{2C}\right)$$ $${\|w\|^2_2 = \sum\limits_{j=1}^n w_j^2}$$ $$-\log P(w; С) = \frac{1}{2C}\|w\|^2_2 + const$$Laplace distribution and absolute (\(L_1\)) regularizer:
$$P(w; С) = \frac{1}{(2C)^{n}} \exp \left(-\frac{{\|w\|_1}}{C}\right)$$ $${\|w\|_1 = \sum\limits_{j=1}^n |w_j|}$$ $$-\log P(w; С) = \frac{1}{C}\|w\|_1 + const$$\(C\) — hyperparameter, \(\tau = \frac{1}{C}\) — regularization coefficient