The Nobel Prize in Physiology or Medicine, 1981
Convolutions and activations had already been used, but without gradient-descent optimization or supervised learning
Note: Later, quite good results were obtained with the LeNet architecture.
The winner of the 2012 ImageNet contest
Note: The first real breakthrough in image classification came with the AlexNet architecture.
$f_j: X \to \mathbb{R}$ — numerical features
$$a(x, w) = f(\left< w, x\right>) = f\left(\sum\limits_{j=1}^n w_j f_j(x)+b \right)$$
where $w_1, \dots, w_n \in \mathbb{R}$ — feature weights, $b$ — bias
$f(z)$ — activation function, for example, $\text{sign}(z),\ \frac{1}{1+e^{-z}},\ (z)_+$
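Below is a minimal sketch of this model as code: a single neuron computing $f(\left< w, x\right> + b)$ with activations from the list above. The feature values, weights, and bias are illustrative, not taken from the lecture.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)          # (z)_+

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # 1 / (1 + e^{-z})

def neuron(x, w, b, f=sigmoid):
    """a(x, w) = f(sum_j w_j * f_j(x) + b)."""
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # numerical features f_j(x)
w = np.array([0.3,  0.8, -0.5])  # feature weights w_j
b = 0.1                          # bias
print(neuron(x, w, b))           # output of the neuron
```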
Convolution in neural networks is the sum of element-wise products of the kernel and the input patch it currently covers
Question 1: Why is it called "convolution"?
An efficient implementation of convolution reduces it to multiplying a matrix by a vector. See, for example, the article on the implementation of the Winograd transform in cuDNN
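To make the "matrix multiplication" view concrete, here is a small sketch using the classic im2col trick: each kernel-sized patch of the input becomes a row of a matrix, and convolution becomes a matrix-vector product with the flattened kernel. This only illustrates the idea; cuDNN's Winograd transform is a different, faster algorithm not shown here, and the toy sizes are illustrative.

```python
import numpy as np

def im2col(x, k):
    """Unfold every k x k patch of a 2D array x into one row of a matrix."""
    h, w = x.shape
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append(x[i:i + k, j:j + k].ravel())
    return np.stack(rows)                      # shape: (out_h * out_w, k * k)

np.random.seed(0)
x = np.random.rand(5, 5)                       # toy 5 x 5 single-channel input
kernel = np.random.rand(3, 3)                  # toy 3 x 3 kernel

out = (im2col(x, 3) @ kernel.ravel()).reshape(3, 3)   # matrix-vector product

# The direct sum-of-products definition gives the same result.
direct = np.array([[np.sum(x[i:i + 3, j:j + 3] * kernel)
                    for j in range(3)] for i in range(3)])
assert np.allclose(out, direct)
print(out)
```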
Kernel $3\times3\times3$ (Width $\times$ Height $\times$ Number of channels)
Output size = (I - F + 2 * P) / S + 1 = (28 - 3 + 2*0) / 1 + 1 = 26
Output size = 26 $\to$ the output feature map is $26\times 26$
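As a quick check, the output-size formula can be wrapped in a tiny helper (the function name is illustrative):

```python
def conv_output_size(i, f, p=0, s=1):
    """Spatial output size: (I - F + 2P) / S + 1."""
    return (i - f + 2 * p) // s + 1

print(conv_output_size(28, 3, p=0, s=1))   # -> 26, i.e. a 26 x 26 output map
```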
The pooling layer is perhaps the simplest of all: max pooling takes the maximum element within each window. There is also average pooling, which takes the mean instead, but it is used quite rarely.
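A minimal sketch of max pooling on a single 2D feature map; the $2\times2$ window with stride 2 is a common choice used here only for illustration:

```python
import numpy as np

def max_pool2d(x, k=2, s=2):
    """Max pooling: take the maximum of each k x k window, moving with stride s."""
    h, w = x.shape
    out_h, out_w = (h - k) // s + 1, (w - k) // s + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + k, j * s:j * s + k].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))   # 2 x 2 map of window maxima
# Average pooling would use .mean() instead of .max().
```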
Or a brief history of ImageNet
The winner of the 2012 ImageNet contest
Final top-5 error on ImageNet — 25.8% $\to$ 16.4%
Momentum method [B. T. Polyak, 1964] — an exponential moving average of the gradient over roughly the last $\frac{1}{1-\gamma}$ iterations:
$$\nu = {\color{orange}\gamma} \nu + {\color{orange}(1-\gamma)} \mathcal{L}_i^\prime(w)$$
$$w = w - \eta \nu$$
From a physical point of view, the gradient of the loss function now plays the role of the acceleration of the parameter updates, rather than their velocity as in classical SGD.
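A minimal sketch of SGD with momentum in exactly the form written above, where the exponential moving average of the gradient replaces the raw gradient in the update; the quadratic test loss and the hyperparameter values are illustrative:

```python
import numpy as np

def sgd_momentum(grad, w0, eta=0.1, gamma=0.9, steps=100):
    """Gradient descent with Polyak momentum on a loss given by its gradient."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = gamma * v + (1 - gamma) * grad(w)   # nu = gamma * nu + (1 - gamma) * L_i'(w)
        w = w - eta * v                          # w = w - eta * nu
    return w

# Example: minimize L(w) = ||w||^2 / 2, whose gradient is simply w.
print(sgd_momentum(lambda w: w, w0=[5.0, -3.0]))  # converges towards [0, 0]
```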
What else can you watch?