$X^\ell = (x_i, y_i)^\ell_{i=1} \subset X \times Y$ — training set, $y_i = y^*(x_i)$
$a(x) = C(b(x))$ — algorithm, where
$b: X \to R$ — base algorithm (operator)
$C: R \to Y$ — decision rule (composition)
$R$ — space of estimates
Ensemble of base algorithms \(b_1, \dots, b_T\)
\(a(x) = C(F(b_1(x), \dots, b_T(x)))\)
where \(F: R^T \to R\) — correcting operation
Why introduce the space of estimates \(R\)?
In classification tasks, the set of mappings \(\{F: R^T \to R\}\) is significantly wider than \(\{F: Y^T \to Y\}\)
To make sure the algorithms in the composition are distinct:
Let's take \(Y = \{\pm 1\}\), \(b_t: X \to \{-1, 0, +1\}\) (the value 0 means the base algorithm abstains), \(C(b) = \text{sign}(b)\)
\(a(x) = \text{sign}\left( \sum\limits_{t=1}^T \alpha_t b_t (x)\right), x \in X\)
Number of errors in \(X^\ell\): \(Q_T = \sum\limits_{i=1}^\ell \left[y_i \sum\limits_{t=1}^T \alpha_t b_t(x_i) < 0 \right]\)
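As a minimal illustration (a sketch assuming numpy arrays `B`, `alpha`, `y` with `B[t, i]` \(= b_t(x_i)\); the names are hypothetical), the weighted vote and the error count \(Q_T\) can be computed as follows:

```python
import numpy as np

def weighted_vote(B, alpha, y):
    """B[t, i] = b_t(x_i) in {-1, 0, +1}, alpha[t] >= 0, y[i] in {-1, +1}."""
    votes = alpha @ B              # sum_t alpha_t * b_t(x_i) for every object i
    a = np.sign(votes)             # a(x_i): the composition's answer
    Q_T = np.sum(y * votes < 0)    # number of objects with a negative margin
    return a, Q_T
```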
The quality functional \(Q_T\) is bounded from above using \([z < 0] \leq e^{-z}\):
\[Q_T \leq \tilde Q_T = \sum_{i=1}^\ell \underbrace{\exp\Bigl(-y_i \sum_{t=1}^{T-1} \alpha_t b_t (x_i)\Bigr)}_{w_i} \exp(-y_i \alpha_T b_T (x_i))\]
Normalized weights:
\[\tilde W^\ell = (\tilde w_1, \dots, \tilde w_\ell), \quad \tilde w_i = w_i \Big/ \sum\limits_{j=1}^\ell w_j\]
Weighted number of erroneous (negative) and correct (positive) classifications for a weight vector \(U^\ell = (u_1, \dots, u_\ell)\):
\(N(b, U^\ell) = \sum\limits_{i=1}^\ell u_i [b(x_i) = -y_i]\)
\(P(b, U^\ell) = \sum\limits_{i=1}^\ell u_i [b(x_i) = y_i]\)
\((1 - N - P)\) — weighted number of classification refusals
Assume there are no refusals, \(b_t: X \to \{\pm 1 \}\). Then \(P = 1 - N\)
Suppose that for any normalized weight vector \(U^\ell\) there exists an algorithm \(b \in B\) that classifies the sample at least slightly better than at random: \(N(b, U^\ell) < \frac12\).
Then the minimum of the functional \(\tilde Q_T\) is reached when
\[b_T = \arg\min_{b \in B} N (b, \tilde W^\ell), \quad \alpha_T = \frac12 \ln\frac{1-N(b_T, \tilde W^\ell)}{N(b_T, \tilde W^\ell)}\]
Input: training sample \(X^\ell\), parameter \(T\)
Output: Base algorithms and their weights \(\alpha_t b_t, t = 1, \dots, T\)
Train the base algorithm: \(b_t = \arg \min\limits_b N(b, W^\ell)\)
\(\alpha_t = \frac12 \ln \frac{1-N(b_t, W^\ell)}{N(b_t, W^\ell)}\)
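A minimal AdaBoost-style sketch of this loop (the base-learner interface `train_base(X, y, w)` is a hypothetical placeholder for the \(\arg\min\) over \(B\); the weight update follows from the definition of \(w_i\) above):

```python
import numpy as np

def adaboost(X, y, T, train_base):
    """y in {-1, +1}; train_base(X, y, w) returns a classifier b with b.predict(X) in {-1, +1}."""
    ell = len(y)
    w = np.full(ell, 1.0 / ell)            # normalized object weights (tilde W)
    ensemble = []
    for t in range(T):
        b = train_base(X, y, w)            # b_t = argmin_b N(b, W)
        pred = b.predict(X)
        N = np.sum(w * (pred != y))        # weighted error N(b_t, W), assumed to be in (0, 1/2)
        alpha = 0.5 * np.log((1 - N) / N)  # alpha_t from the formula above
        ensemble.append((alpha, b))
        w = w * np.exp(-alpha * y * pred)  # misclassified objects get larger weights
        w = w / w.sum()                    # renormalize: tilde w_i = w_i / sum_j w_j
    return ensemble
```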
Ways to eliminate:
Linear (convex) combination of base algorithms:
\[a(x) = \sum\limits_{t=1}^T \alpha_t b_t(x), \ x \in X, \ \alpha_t \in \mathbb{R}_+\]
Quality functional with an arbitrary loss function $\mathcal{L}(a, y)$:
\[Q(\alpha, b, X^\ell) = \sum\limits_{i=1}^\ell \mathcal{L} \Bigl( \underbrace{\overbrace{\sum\limits_{t=1}^{T-1} \alpha_t b_t(x_i)}^{f_{T-1,i}} + \alpha b(x_i)}_{f_{T,i}} ,\ y_i \Bigr) \to \min\limits_{\alpha, b}\]
$f_{T-1, i}$ — current approximation, $f_{T, i}$ — next approximation
Gradient method of minimization $Q(f) \to \min, f \in \mathbb{R}^\ell$:
$f_0 =$ initial approximation
$f_{T,i} = f_{T-1,i} - \alpha g_i, \ i = 1, \dots, \ell$
$g_i = \mathcal{L}^\prime (f_{T-1, i}, y_i)$ — components of the gradient vector, $\alpha$ — gradient step
Observation: this is very similar to one boosting iteration!
$f_{T,i} = f_{T-1,i} + \alpha b(x_i), \ i = 1, \dots, \ell$
Idea: look for a base algorithm $b_T$ such that the vector $(b_T(x_i))_{i=1}^\ell$ approximates the antigradient vector $(-g_i)_{i=1}^\ell$:
$$b_T = \arg\min\limits_b \sum\limits_{i=1}^\ell (b(x_i) + g_i)^2$$
Input: training set $X^\ell$, parameter $T$
Output: base algorithms and their weights $\alpha_t b_t, t = 1, \dots, T$
Base algorithm approximating the antigradient:
$$b_t = \arg \min\limits_b \sum\limits_{i=1}^\ell (b(x_i) + \mathcal{L}^\prime(f_i, y_i))^2$$
Idea: at each iteration, use not the entire sample \(X^\ell\) but a random subsample drawn without replacement.
Advantages: training each base algorithm on a subsample is faster, and the randomization reduces overfitting, often improving the quality of the composition.
Friedman J. Stochastic Gradient Boosting, 1999
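A sketch of the gradient boosting loop with the optional subsampling idea (the base-regressor interface `fit_base(X, r)`, the loss derivative `loss_grad(f, y)`, and the fixed step `alpha` are hypothetical placeholders under the assumptions above):

```python
import numpy as np

def gradient_boosting(X, y, T, fit_base, loss_grad, alpha=0.1, subsample=1.0):
    """Each b_t is fit by least squares to the antigradient -g_i = -L'(f_i, y_i);
    subsample < 1 gives the stochastic variant (random subsample without replacement)."""
    ell = len(y)
    f = np.zeros(ell)                       # f_0: initial approximation
    ensemble = []
    for t in range(T):
        g = loss_grad(f, y)                 # g_i = L'(f_i, y_i)
        idx = np.random.choice(ell, max(1, int(subsample * ell)), replace=False)
        b = fit_base(X[idx], -g[idx])       # b_t approximates the antigradient on the subsample
        ensemble.append((alpha, b))
        f += alpha * b.predict(X)           # f_{T,i} = f_{T-1,i} + alpha * b(x_i)
    return ensemble
```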
Regression: \(\mathcal{L}(a,y) = (a-y)^2\)
Classification: \(\mathcal{L}(a,y) = e^{-ay}, b_t \in \{-1, 0 , +1\}\)
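For these two losses the gradient components $g_i = \mathcal{L}^\prime(f_i, y_i)$ (derivative with respect to the first argument) take a simple form:
$$\mathcal{L}(a,y) = (a-y)^2 \;\Rightarrow\; g_i = 2(f_i - y_i), \qquad \mathcal{L}(a,y) = e^{-ay} \;\Rightarrow\; g_i = -y_i e^{-y_i f_i}$$
so in regression the base algorithm is fit to the residuals $y_i - f_i$ (up to a factor of 2), while in classification the currently misclassified objects contribute the largest antigradient components.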
Regression and classification trees (CART):
$$b(x) = \sum\limits_{j=1}^J w_j [x \in R_j]$$
where $R_j$ is the region of feature space covered by leaf $j$, $w_j$ are the leaf weights, and $J$ is the number of leaves in the tree.
Quality functional with $L_0$, $L_1$ and $L_2$ regularizers:
$$Q(b, \{w_j\}_{j=1}^J, X^\ell) = \sum\limits_{i=1}^\ell \mathcal{L} \left(\sum\limits_{t=1}^{T-1} \alpha_t b_t(x_i) + \alpha b(x_i),\ y_i\right) + \gamma \sum\limits_{j=1}^J [w_j \neq 0] + \mu \sum\limits_{j=1}^J |w_j| + \frac{\lambda}{2} \sum\limits_{j=1}^J w_j^2 \to \min\limits_{b, w_j}$$
The problem has an analytic solution for $w_j$.
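As a sketch of where that solution comes from (the standard second-order argument used in XGBoost; the $L_0$ and $L_1$ terms are dropped here for brevity, and the step $\alpha$ is absorbed into the leaf weights): expanding $\mathcal{L}$ to second order around the current approximation, with $g_i = \mathcal{L}^\prime(f_{T-1,i}, y_i)$ and $h_i = \mathcal{L}^{\prime\prime}(f_{T-1,i}, y_i)$, the terms of leaf $j$ become
$$\sum_{i:\, x_i \in R_j} \Bigl( g_i w_j + \tfrac12 h_i w_j^2 \Bigr) + \frac{\lambda}{2} w_j^2 \;\to\; \min_{w_j}, \qquad w_j^* = -\frac{\sum_{i:\, x_i \in R_j} g_i}{\sum_{i:\, x_i \in R_j} h_i + \lambda}$$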
Other popular implementations of gradient boosting over decision trees: LightGBM, CatBoost
Training a random forest: for each $t = 1, \dots, T$, draw a bootstrap subsample $U_t \subset X^\ell$ and train a decision tree $b_t$ on $U_t$, choosing the feature for each split from a random subset of the features; the trees then classify by voting.
The number of trees $T$ can be selected using the out-of-bag criterion — the number of errors on objects $x_i$ counting only the votes of trees whose training subsample $U_t$ did not contain $x_i$:
$$\text{out-of-bag}(a) = \sum\limits_{i=1}^\ell \left[\text{sign}\Bigl(\sum\limits_{t=1}^T [x_i \notin U_t]\, b_t(x_i)\Bigr) \neq y_i \right] \to \min$$
This is an unbiased estimate of generalization ability.
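A sketch of this criterion (the trained trees `trees[t]` and their bootstrap index sets `U[t]` are assumed to come from the training loop above; the names are hypothetical):

```python
import numpy as np

def out_of_bag_error(trees, U, X, y):
    """For each object, aggregate only the votes of trees whose bootstrap sample missed it."""
    errors = 0
    for i in range(len(y)):
        vote = sum(t.predict(X[i:i+1])[0] for t, u in zip(trees, U) if i not in u)
        errors += int(np.sign(vote) != y[i])   # if no tree left x_i out, sign(0) = 0 counts as an error
    return errors
```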
A strengthened notion of the error rate of the algorithm $a(x) = \text{sign}(b(x))$ is the share of objects whose margin does not exceed a threshold $\theta$:
$$\nu_\theta (a, X^\ell) = \frac{1}{\ell} \sum\limits_{i=1}^\ell [b(x_i) y_i \leq \theta]$$
The usual error rate is bounded by it: $\nu_0 (a, X^\ell) \leq \nu_\theta (a, X^\ell)$ for $\theta > 0$.
If $|B| < \infty$, then for all $\theta > 0$ and $\eta \in (0,1)$, with probability at least $1-\eta$
$$P[y\, a(x) < 0] \leq \nu_\theta (a, X^\ell) + C\sqrt{\frac{\ln |B| \ln \ell}{\ell\theta^2} + \frac{1}{\ell}\ln \frac{1}{\eta}}$$
The bound depends on $|B|$ but not on $T$: voting does not increase the complexity of the effectively used set of algorithms.
Distribution of margins: the share of objects with a margin smaller than a fixed $\theta$ after 5, 100, 1000 iterations (UCI classification task: vehicle)
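A sketch of how such a curve can be computed for a trained composition (the arrays `B`, `alpha`, `y` are the same hypothetical ones as before):

```python
import numpy as np

def margin_share(B, alpha, y, theta):
    """nu_theta(a, X^l): share of objects whose margin y_i * b(x_i) does not exceed theta."""
    margins = y * (alpha @ B)      # b(x_i) = sum_t alpha_t * b_t(x_i)
    return np.mean(margins <= theta)

# e.g. evaluate [margin_share(B, alpha, y, th) for th in thetas] over a grid of thresholds
# for the ensembles obtained after 5, 100 and 1000 iterations to trace such curves
```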