 
Images: 25% $\to$ 3.5% errors versus 5% in humans | Text | Voice

Go, 2016 | StarCraft, 2019 | Protein structure, 2020

$x_j: X \to \mathbb{R}$, $j = 1, \dots, n$: numerical features
$$a(x, w) = f(\left< w, x\right> + b) = f\left(\sum\limits_{j=1}^n w_j x_j + b \right)$$
where $w_1, \dots, w_n \in \mathbb{R}$ are the feature weights and $b$ is the bias;
$f(z)$ is the activation function, for example $\text{sign}(z),\ \frac{1}{1+e^{-z}},\ (z)_+$
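A minimal sketch of this neuron model in plain Python, to make the formula concrete. The function names (`neuron`, `sign`, `sigmoid`, `relu`) and the sample numbers are illustrative assumptions, not part of the slides.

```python
import math

def sign(z):
    # sign(z): -1, 0 or +1
    return 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)

def sigmoid(z):
    # 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # (z)_+ = max(z, 0)
    return max(z, 0.0)

def neuron(x, w, b, f):
    """a(x, w) = f(<w, x> + b) for feature list x, weight list w, bias b."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return f(z)

# Hypothetical example values: z = 1*3 + (-2)*1 + 0.5 = 1.5
x = [3.0, 1.0]
w = [1.0, -2.0]
b = 0.5
for f in (sign, sigmoid, relu):
    print(f.__name__, neuron(x, w, b, f))
```

Only the activation `f` changes between the three calls; the weighted sum is the same.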
 
The functions AND, OR, NOT of binary variables $x^1$ and $x^2$:
$x^1 \wedge x^2 = [x^1 + x^2 - \frac{3}{2} > 0]$
$x^1 \vee x^2 = [x^1 + x^2 - \frac{1}{2} > 0]$
$\neg x^1 = [-x^1 + \frac{1}{2} > 0]$
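The three formulas can be checked by enumerating all inputs. This is a small sketch; `threshold_neuron` is a hypothetical helper name for a neuron with the activation $[z > 0]$.

```python
def threshold_neuron(x, w, b):
    """Return 1 if <w, x> + b > 0, else 0 (the bracket [ ... > 0])."""
    return int(sum(wj * xj for wj, xj in zip(w, x)) + b > 0)

def AND(x1, x2):
    return threshold_neuron((x1, x2), (1, 1), -1.5)

def OR(x1, x2):
    return threshold_neuron((x1, x2), (1, 1), -0.5)

def NOT(x1):
    return threshold_neuron((x1,), (-1,), 0.5)

# Print the full truth tables.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))
for x1 in (0, 1):
    print(x1, "NOT:", NOT(x1))
```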
                        
 
The function $x^1 \oplus x^2 = [x^1 \neq x^2]$ cannot be implemented by a single neuron. There are two ways to implement it:
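Two standard constructions are sketched below: a two-layer composition of the OR and AND neurons, and a single neuron with an added nonlinear feature $x^1 x^2$. That these are the two ways meant here is an assumption; the function names are illustrative.

```python
def step(z):
    # Threshold activation [z > 0].
    return int(z > 0)

def xor_two_layer(x1, x2):
    """XOR as a two-layer network: (x1 OR x2) AND NOT (x1 AND x2)."""
    h_or = step(x1 + x2 - 0.5)     # hidden neuron 1: OR
    h_and = step(x1 + x2 - 1.5)    # hidden neuron 2: AND
    return step(h_or - h_and - 0.5)

def xor_product_feature(x1, x2):
    """XOR as one neuron over the features (x1, x2, x1*x2)."""
    return step(x1 + x2 - 2 * x1 * x2 - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_two_layer(x1, x2), xor_product_feature(x1, x2))
```

Both variants make the problem linearly separable: the first by learning new features in a hidden layer, the second by supplying one by hand.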
 
A function $\sigma(z)$ is called sigmoidal if $\lim\limits_{z \to -\infty} \sigma(z) = 0$ and $\lim\limits_{z \to +\infty} \sigma(z) = 1$
If $\sigma(z)$ is a continuous sigmoidal function, then for any continuous function $f(x)$ on $[0,1]^n$ and any desired accuracy $\varepsilon > 0$ there exist a number of neurons $H$ and parameter values $w_h \in \mathbb{R}^n, b_h \in \mathbb{R}, \alpha_h \in \mathbb{R}$ such that the network with a single hidden layer $a(x) = \sum\limits_{h=1}^H \alpha_h \sigma\left(\left< x, w_h\right> + b_h\right)$ approximates $f(x)$ uniformly: $|a(x) - f(x)| < \varepsilon$ for all $x \in [0, 1]^n$
G. Cybenko. Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals, and Systems (MCSS) 2 (4): 303-314 (Dec 1, 1989)
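To give a feeling for the statement in the case $n = 1$, the sketch below builds a network of the form $a(x) = \sum_h \alpha_h \sigma(w_h x + b_h)$ by hand: very steep sigmoids act as steps, and the $\alpha_h$ are the jumps of a staircase that follows $f$. This is an illustrative construction, not the argument of the paper; the target $\sin(2\pi x)$, the steepness `K`, and all function names are assumptions.

```python
import math

def sigma(z):
    # Logistic sigmoid written via tanh, which does not overflow for large |z|.
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def staircase_network(f, H, K=1e4):
    """Return (alpha, w, b) with a(x) = sum_h alpha_h * sigma(w_h * x + b_h)."""
    centers = [(h + 0.5) / H for h in range(H)]
    # First neuron is ~1 everywhere on [0, 1]; it sets the starting level.
    alpha, w, b = [f(centers[0])], [K], [K]
    for h in range(1, H):
        # A steep sigmoid switching on at x = h / H adds the next jump.
        alpha.append(f(centers[h]) - f(centers[h - 1]))
        w.append(K)
        b.append(-K * h / H)
    return alpha, w, b

def a(x, params):
    alpha, w, b = params
    return sum(al * sigma(wh * x + bh) for al, wh, bh in zip(alpha, w, b))

def max_error(f, H, n_test=1001):
    """Largest |a(x) - f(x)| on a uniform test grid over [0, 1]."""
    params = staircase_network(f, H)
    xs = [i / (n_test - 1) for i in range(n_test)]
    return max(abs(a(x, params) - f(x)) for x in xs)

def f(x):
    return math.sin(2 * math.pi * x)

for H in (20, 200):
    print(H, max_error(f, H))
```

Increasing $H$ makes the staircase finer, so the error on the grid shrinks, in line with "any desired accuracy $\varepsilon$".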
Prediction $y_{pred} = x \cdot W + b$
 
 
In our example, the input space is 784-dimensional: $x \in \mathbb{R}^{784}$
Question: how do we find the best parameters, the weight matrix $W$ and the bias $b$?
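A short NumPy sketch of the prediction $y_{pred} = x \cdot W + b$ with explicit shapes. Only the input dimension 784 comes from the text; the number of classes (10) and the batch size (32) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 784          # from the text
n_classes = 10            # assumption: number of outputs
batch = 32                # assumption: number of objects processed at once

x = rng.normal(size=(batch, n_features))                   # one object per row
W = rng.normal(scale=0.01, size=(n_features, n_classes))   # weight matrix
b = np.zeros(n_classes)                                    # bias, broadcast over rows

y_pred = x @ W + b
print(y_pred.shape)  # (32, 10)
```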
If $y_{true, i} \in \mathbb{R}$ (that is, a linear regression task), then minimizing the sum of squared differences (the least squares method) has an analytical solution:
$$\hat{W} = (X^TX)^{-1}X^T y_{true}$$
where the rows of the matrix $X$ are the objects $x_i$. In the general case, the problem is solved numerically by minimizing a loss function, most often with stochastic gradient descent.
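The analytical formula can be tried on synthetic data. In the sketch below the data, the true weights `W_true`, and the noise level are made up for illustration, and the bias is omitted for brevity. The normal equations are solved with `np.linalg.solve` rather than an explicit inverse, and compared with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumption): y = X W_true + small noise.
n, d = 200, 3
X = rng.normal(size=(n, d))
W_true = np.array([2.0, -1.0, 0.5])
y_true = X @ W_true + 0.01 * rng.normal(size=n)

# W_hat = (X^T X)^{-1} X^T y, computed by solving (X^T X) W = X^T y.
W_hat = np.linalg.solve(X.T @ X, X.T @ y_true)

# The same least squares problem via the library routine.
W_lstsq, *_ = np.linalg.lstsq(X, y_true, rcond=None)

print(W_hat)
print(W_lstsq)
```

Solving the linear system avoids forming $(X^TX)^{-1}$ explicitly, which is usually the numerically safer choice.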
 
We transform the responses of the linear model into class probabilities:
$$ \begin{aligned} p(c=0| x) &= \frac{e^{y_0}}{e^{y_0}+e^{y_1}+\dots+e^{y_n}} = \frac{e^{y_0}}{\sum\limits_i e^{y_i}} \\ p(c=1| x) &= \frac{e^{y_1}}{e^{y_0}+e^{y_1}+\dots+e^{y_n}} = \frac{e^{y_1}}{\sum\limits_i e^{y_i}} \\ &\dots \end{aligned} $$
The loss is the negative log-likelihood of the true classes, which is the cross-entropy loss for the case of one-hot targets $y_i \in \{0, 1\}$. In our case:
$$ L(W, b) = - \sum\limits_j \log \frac{e^{(x_jW + b)_{y_j}}}{\sum\limits_i e^{(x_jW + b)_{i}}}$$
We find the minimum of this function by stochastic gradient descent:
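$$ W^{k+1} = W^{k} - \eta \frac{\partial L}{\partial W} \\ b^{k+1} = b^{k} - \eta \frac{\partial L}{\partial b} \quad\quad$$
In practice the gradient is estimated on a mini-batch of examples; averaging over the batch reduces the variance of the gradient estimate.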
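The softmax, the loss $L(W, b)$, and the update rule can be put together in a few lines of NumPy. This is a sketch under assumptions: the data are three synthetic 2-D clusters, the loss is averaged over the batch instead of summed (which only rescales $\eta$), and the batch size, step count, and function names are illustrative.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax; subtracting the row maximum avoids overflow.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grads(X, y, W, b):
    """Mean cross-entropy of softmax(XW + b) and its gradients dL/dW, dL/db."""
    n = X.shape[0]
    p = softmax(X @ W + b)
    loss = -np.log(p[np.arange(n), y]).mean()
    d = p.copy()
    d[np.arange(n), y] -= 1.0      # p - one_hot(y)
    d /= n
    return loss, X.T @ d, d.sum(axis=0)

rng = np.random.default_rng(0)

# Synthetic data (assumption): three Gaussian clusters, one per class.
centers = np.array([[0.0, 3.0], [3.0, -2.0], [-3.0, -2.0]])
y = rng.integers(0, 3, size=300)
X = centers[y] + rng.normal(size=(300, 2))

W = np.zeros((2, 3))
b = np.zeros(3)
eta, batch_size = 0.1, 32

loss_start, _, _ = loss_and_grads(X, y, W, b)
for k in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # mini-batch
    _, gW, gb = loss_and_grads(X[idx], y[idx], W, b)
    W -= eta * gW                  # W^{k+1} = W^k - eta * dL/dW
    b -= eta * gb                  # b^{k+1} = b^k - eta * dL/db
loss_end, _, _ = loss_and_grads(X, y, W, b)

print(loss_start, loss_end)
```

With zero initial weights every class gets probability $1/3$, so the starting loss equals $\log 3$; it should drop as training proceeds.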
Input: sample $X^\ell$, learning rate $\eta$, forgetting rate $\lambda$
Output: weight vector $w \equiv (w_{jh}, w_{hm})$
Repeat:
Until the value of $Q$ and/or the weights $w$ converge
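One common form of such a loop is sketched below for a two-layer network with weights $(w_{jh}, w_{hm})$. The body of the loop is an assumption: pick a random object, run a forward and a backward pass, take a gradient step, and update $Q$ as an exponential moving average with forgetting rate $\lambda$. The activations (tanh, sigmoid), the log-loss, the synthetic data, and the simplified stopping rule (a step limit or $Q$ below a tolerance) are also assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, y, w):
    """Forward pass for one object: returns loss, hidden outputs u, output p."""
    W_jh, b_h, W_hm, b_m = w
    u = np.tanh(x @ W_jh + b_h)                 # hidden layer
    p = sigmoid(u @ W_hm + b_m)                 # output neuron
    p = np.clip(p, 1e-12, 1 - 1e-12)            # keep log finite
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, u, p

def sgd_train(X, y, eta=0.1, lam=0.01, n_hidden=8,
              max_steps=3000, tol=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = [rng.normal(scale=0.5, size=(d, n_hidden)), np.zeros(n_hidden),
         rng.normal(scale=0.5, size=n_hidden), 0.0]
    # Initial estimate of the functional Q: mean loss over the sample.
    Q = float(np.mean([forward(X[i], y[i], w)[0] for i in range(n)]))
    Q_start = Q
    for _ in range(max_steps):
        i = rng.integers(n)                     # random object x_i
        loss, u, p = forward(X[i], y[i], w)
        dz = p - y[i]                           # dL/d(output pre-activation)
        dh = dz * w[2] * (1 - u ** 2)           # backpropagate through tanh
        w[0] -= eta * np.outer(X[i], dh)        # gradient step on w_jh
        w[1] -= eta * dh
        w[2] -= eta * dz * u                    # gradient step on w_hm
        w[3] -= eta * dz
        Q = (1 - lam) * Q + lam * loss          # running estimate of Q
        if Q < tol:                             # simplified stopping rule
            break
    return w, Q_start, Q

# Synthetic two-class data (assumption): label is [x1 + x2 > 0].
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, Q_start, Q_end = sgd_train(X, y)
print(Q_start, Q_end)
```

A larger $\lambda$ makes $Q$ track recent losses more closely but also makes it noisier, which is why it is a poor sole convergence test for small samples.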
