ML is built using the tools of mathematical statistics, numerical methods, mathematical analysis, optimization methods, probability theory, graph theory, and various techniques for working with data in digital form.
What is this?
\[\begin{cases} \frac{\partial \rho}{\partial t} + \frac{\partial(\rho u_{i})}{\partial x_{i}} = 0 \\ \\ \frac{\partial (\rho u_{i})}{\partial t} + \frac{\partial[\rho u_{i}u_{j}]}{\partial x_{j}} = -\frac{\partial p}{\partial x_{i}} + \frac{\partial \tau_{ij}}{\partial x_{j}} + \rho f_{i} \end{cases} \]
These are the continuity and momentum equations of fluid dynamics (the Navier–Stokes system): a model written down in advance from physical theory. In machine learning there is no such pre-set model with equations; the dependence must be recovered from the data itself.
$X$ — set of objects
$Y$ — set of answers
$y: X \to Y$ — unknown dependence (target function)
The task: given the training sample $X^\ell = \{x_1,\dots,x_\ell\} \subset X$
with known answers $y_i = y(x_i)$,
find an algorithm ${\color{orange}a}: X \to Y$,
a decision function that approximates $y$ on the whole set $X$.
The whole ML course is about this task.
Features $f_j: X \to D_j$, $j = 1, \dots, n$
The vector $(f_1(x), \dots, f_n(x))$ — feature description of the object $x$
Types of features:
binary: $D_j = \{0, 1\}$
nominal: $D_j$ is a finite set
ordinal: $D_j$ is a finite ordered set
quantitative: $D_j = \mathbb{R}$
Data matrix (objects as rows, features as columns):
$F = ||f_j(x_i)||_{\ell\times n} = \left[ {\begin{array}{ccc}
f_1(x_1) & \dots & f_n(x_1) \\
\vdots & \ddots & \vdots \\
f_1(x_\ell) & \dots & f_n(x_\ell)
\end{array} } \right]$
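For instance, with objects $x_i \in \mathbb{R}$ and the feature set $\{1, x, x^2\}$, the matrix $F$ can be assembled as follows (a minimal NumPy sketch; all names are illustrative, not from the source):

```python
import numpy as np

# Hypothetical objects: ell = 50 real numbers (X = R).
x = np.linspace(-2.0, 2.0, 50)

# Feature functions f_j: X -> D_j, here the set {1, x, x^2}.
features = [np.ones_like, lambda t: t, lambda t: t**2]

# Data matrix F of shape (ell, n): F[i, j] = f_j(x_i).
F = np.column_stack([f(x) for f in features])
print(F.shape)  # (50, 3)
```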
A model (predictive model) — a parametric family of functions
$A = \{g(x, \theta) | \theta \in \Theta\}$,
where $g: X \times \Theta \to Y$ — a fixed function, $\Theta$ — a set of allowable values of parameter $\theta$
Linear model with vector of parameters $\theta = (\theta_1, \dots, \theta_n), \Theta = \mathbb{R}^n$:
$g(x, \theta) = \sum\limits_{j=1}^n \theta_jf_j(x)$ — for regression and ranking, $Y = \mathbb{R}$
$g(x, \theta) = \mathrm{sign}\sum\limits_{j=1}^n \theta_jf_j(x)$ — for classification, $Y = \{-1, +1\}$
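A minimal sketch of both variants, reusing the feature matrix $F$ from above (the parameter values are arbitrary illustrations, not from the source):

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 50)                    # objects
F = np.column_stack([np.ones_like(x), x, x**2])   # features {1, x, x^2}
theta = np.array([0.5, -1.0, 0.25])               # some theta in Theta = R^3 (arbitrary)

def g(F, theta):
    """Linear model g(x, theta) = sum_j theta_j * f_j(x), evaluated row-wise over F."""
    return F @ theta

a_regression = g(F, theta)               # regression/ranking: Y = R
a_classification = np.sign(g(F, theta))  # classification: Y = {-1, +1} (np.sign(0) is 0)
```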
Example: $X = Y = \mathbb{R}$, $\ell = 50$ objects
$n = 3$ features: $\{1, x, x^2\}$ or $\{1, x, \sin x\}$
A learning method $\mu$ constructs an algorithm $a = \mu(X^\ell, Y^\ell)$ from the training sample $(X^\ell, Y^\ell) = (x_i, y_i)_{i=1}^\ell$
$\boxed{ \left[ {\begin{array}{ccc} f_1(x_1) & \dots & f_n(x_1) \\ \dots & \dots & \dots \\ f_1(x_\ell) & \dots & f_n(x_\ell) \end{array} } \right] \xrightarrow{y} \left[ {\begin{array}{c} y_1 \\ \dots \\ y_\ell \end{array} }\right] \thinspace} \xrightarrow{\mu} a$
The algorithm $a$ produces answers $a(x_i^\prime)$ for new objects $x_i^\prime$
$\mathcal{L}(a, x)$ — loss function: the magnitude of the error of algorithm $a \in A$ on object $x \in X$
Empirical risk — the quality functional of algorithm $a$ on the sample $X^\ell$:
$Q(a, X^\ell) = \frac{1}{\ell} \sum\limits_{i=1}^\ell \mathcal{L}(a, x_i)$
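For example, with quadratic loss $\mathcal{L}(a, x_i) = (a(x_i) - y_i)^2$ the empirical risk is just the mean per-object loss; a minimal sketch with illustrative names:

```python
import numpy as np

def empirical_risk(a_values, y_values, loss=lambda a, y: (a - y) ** 2):
    """Q(a, X^ell) = (1/ell) * sum_i L(a, x_i); quadratic loss by default."""
    return float(np.mean(loss(a_values, y_values)))

# Illustrative usage with made-up predictions and true answers:
print(empirical_risk(np.array([0.9, 1.8]), np.array([1.0, 2.0])))  # 0.025
```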
Empirical risk minimization (ERM) method:
$\mu(X^\ell) = \arg\min\limits_{a \in A} Q(a, X^\ell)$
For quadratic loss $\mathcal{L}(a, x_i) = (a(x_i) - y_i)^2$ this becomes the least-squares problem: $\mu(X^\ell) = \arg\min\limits_{\theta} \sum\limits_{i=1}^{\ell} (g(x_i, \theta) - y_i)^2$
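For the linear model, $Q$ is quadratic in $\theta$, and (when $F$ has full column rank) the minimum satisfies the normal equations $F^\top F \theta = F^\top y$. A minimal sketch on made-up data:

```python
import numpy as np

# Illustrative data: noisy linear dependence y = 2 + 3x.
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)
y = 2.0 + 3.0 * x + 0.1 * rng.standard_normal(x.size)

F = np.column_stack([np.ones_like(x), x])      # features {1, x}
theta, *_ = np.linalg.lstsq(F, y, rcond=None)  # argmin_theta ||F theta - y||^2
print(theta)                                   # close to [2, 3]
```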
Example: the target dependence $y(x) = \frac{1}{1+25x^2}$ (Runge's function) on the interval $x \in \left[-2, 2\right]$
Feature description $x \to (1, x, x^2, \dots, x^n)$
$a(x, \theta) = \theta_0 + \theta_1 x + \dots + \theta_n x^n$ — a polynomial of degree $n$
$Q(\theta, X^\ell) = \sum\limits_{i=1}^\ell (\theta_0 + \theta_1 x_i + \dots + \theta_n x_i^n - y_i)^2 \to \min\limits_{\theta_0,\dots,\theta_n}$
Training sample: $X^\ell = \{x_i = 4\frac{i-1}{\ell-1} - 2 \mid i = 1, \dots, \ell \}$ (a uniform grid on $[-2, 2]$)
Test sample: $X^k = \{\tilde{x}_i = 4\frac{i-0.5}{\ell-1} - 2 \mid i = 1, \dots, k \}$, $k = \ell - 1$ (midpoints of the training grid)
What happens to $Q(\theta, X^\ell)$ and $Q(\theta, X^k)$ as $n$ increases?
Overfitting: when the test error far exceeds the training error, $Q(\theta, X^k) \gg Q(\theta, X^\ell)$ (see the sketch of the experiment below)
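One way to run this experiment (a sketch; np.polyfit serves as an off-the-shelf least-squares fitter, and the degrees are chosen for illustration):

```python
import numpy as np

ell = 50

def runge(x):
    return 1.0 / (1.0 + 25.0 * x**2)

i = np.arange(1, ell + 1)
x_train = 4 * (i - 1) / (ell - 1) - 2        # uniform training grid on [-2, 2]
x_test = 4 * (i[:-1] - 0.5) / (ell - 1) - 2  # test grid: midpoints of the training grid

for n in (1, 3, 5, 10, 20, 40):
    theta = np.polyfit(x_train, runge(x_train), n)  # least-squares polynomial of degree n
    q_train = np.mean((np.polyval(theta, x_train) - runge(x_train)) ** 2)
    q_test = np.mean((np.polyval(theta, x_test) - runge(x_test)) ** 2)
    print(f"n={n:2d}  Q_train={q_train:.2e}  Q_test={q_test:.2e}")
```

As $n$ grows, the training error keeps shrinking while the error on the midpoint grid eventually blows up (NumPy also warns that high-degree fits are ill-conditioned): exactly the overfitting picture described above.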
1997: IBM Deep Blue defeats world chess champion Garry Kasparov
2004: the DARPA Grand Challenge, a competition for self-driving cars
2006: Google Translate launched
2011: Apple launches the Siri voice assistant, an outgrowth of DARPA's CALO (Cognitive Assistant that Learns and Organizes) project
2011-2015: ImageNet challenge — image classification error falls from about 25% to 3.5%, versus roughly 5% for humans
2015: Elon Musk, Sam Altman, and others found OpenAI as a non-profit, with backers pledging $1 billion
2016: Google DeepMind's AlphaGo defeats world Go champion Lee Sedol
2018: At a Christie's auction, "Portrait of Edmond de Belamy", a painting generated by an AI, sells for $432,500
2020: AlphaFold 2 predicts the structure of proteins with over 90% accuracy for about two-thirds of the proteins in the dataset
2021: OpenAI releases DALL-E, an AI system that generates images from textual descriptions
2022: OpenAI releases ChatGPT (Chat Generative Pre-trained Transformer), which becomes the fastest-growing consumer software application in history