 
                        $X^\ell = (x_i, y_i)_{i=1}^\ell \subset X \times Y$ — training sample, $y_i = y(x_i)$
A logical rule (regularity) is a predicate $R: X \to \{0, 1\}$, satisfying two requirements:
If $R(x) = 1$, then we say "$R$ covers $x$"
 
                        If "age > 60" and "patient has previously had a heart attack", then do not perform surgery, the risk of negative outcome is 60%
If "home phone is listed on the form" and "salary > $2000" and "loan amount < $5000" then the loan can be issued, default risk is 5%
Regularity is a highly informative, interpretable, single-class classifier with rejections.
Parameters ${\color{orange}J, a_j, b_j, d, w_j}$ are tuned using the training set via optimization of the information criterion.
Pareto front — a set of non-improvable regularities. A point is non-improvable if there are no points to the left and below it
Problem: it would be desirable to have a single scalar criterion
 
                                    $ \text{IGain}(p, n) = h\left(\frac{P}{\ell} \right) - \frac{p+n}{\ell} h\left(\frac{p}{p+n} \right) - \frac{\ell-p-n}{\ell} h\left(\frac{P-p}{\ell-p-n} \right) \to \max $
$ h(q) = -q \log_2 q - (1-q) \log_2 (1 - q)$ — entropy of “two-class distribution”
$\text{IGini}(p, n) = \text{IGain}(p, n)$ when $h(q) = 4q(1 - q)$
 
                    $\text{IStat}(p, n) = -\frac{1}{\ell} \log_2 \frac{C_P^pC_N^n}{C_{P+N}^{p+n}} \to \max$ — the argument of logarith is simply the probability of realizing such a pair of $p, n$, when choosing $(p+n)$ elements from $(P+N)$
Input: Training sample $X^\ell$
Output: Set of rules $Z$
Example. Family of conjunctions of threshold conditions:
$$ R(x) = \bigwedge\limits_{j \in {\color{orange}J}} [{\color{orange}a_j} \leq f_j(x) \leq {\color{orange}b_j}]$$Local modifications of a conjunctive rule:
$a(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} w_{yt} R_{yt}(x)$
$a(x) = \arg\max_{y \in Y} \frac{1}{T_y}\sum_{t=1}^{T_y} R_{yt}(x)$
 
                                  A decision tree is a classification algorithm $a(x)$, defined by a tree (a connected acyclic graph).
Special case: $D_v = {0, 1}$ — binary decision tree
Example: $f_v(x) = [f_j(x) \ge \theta_j]$
 
                            A simple well-known dataset for classification
 
                            Classes:
Features:
 
                    Let's write a function to train on data, predict the answer for each lattice point, and visualize the result. Because the data is much better divided by petals; we visualize only them (we take the average parameters of the sepals)
 
                     
                     
                     
                     
                     
                    $v_0$ := TreeGrowing($X_\ell$)
$f_v = \arg\max\limits_{f \in F}$ Gain($f, U$)
Majority rule: Major($U$) := $\arg\max P(y|U)$
The frequency estimate of the class \(y\) at the node \(v \in V_{\text{internal}}\):
\[ p_y \equiv P(y|U) = \frac{1}{|U|} \sum\limits_{x_i \in U} [y_i = y] \]
\(\Phi(U)\) — measure of uncertainty (impurity) of the distribution \(p_y\):
 
                        where \(\mathcal{L}(p)\) decreases and \(\mathcal{L}(1) = 0\), for example: \(-\log p, 1-p, 1-p^2\)
The uncertainty of the distributions \(P(y_i | U_k)\) after branching the node \(v\) using the feature \(f\) and partitioning \[U = \bigsqcup\limits_{k \in D_v} U_k\]
\[ \Phi(U_1, \dots , U_{|D_v|}) = \frac{1}{|U|} \sum\limits_{x_i \in U} \mathcal{L}(P(y_i| U_{f(x_i)})) = \\ = \frac{1}{|U|} \sum\limits_{k \in D_v} \sum\limits_{x_i \in U_k} \mathcal{L}(P(y_i| U_{k})) = \sum\limits_{k \in D_v} \frac{|U_k|}{|U|} \Phi(U_k) \]
The gain from branching the node \(v\):
\[ \text{Gain} (f, U) = \Phi(U) - \Phi(U_1, \dots , U_{|D_v|}) = \Phi(U) - \sum\limits_{k \in D_v} \frac{|U_k|}{|U|} \Phi(U_k) \to \max\limits_{f \in F} \]Two classes, \(Y = \{0, 1\}\), \(P(y|U) = \begin{cases} q, &y = 1 \\ 1-q, &y = 0 \end{cases} \)
Generalization for regression: \( Y = \mathbb{R}, y_v \in \mathbb{R} \)
\[ C(a) = \sum\limits_{i=1}^\ell (a(x_i) - y_i)^2 \to \min\limits_a \]Let \( U \) be the set of objects \( x_i \) that reach vertex \( v \)
Uncertainty measure — mean squared error:
\[ \Phi(U) = \min_{y \in Y} \frac{1}{|U|} \sum\limits_{x_i \in U} (y - y_i)^2 \]The value \( y_v \) in the terminal vertex \( v \) is the LSE solution: $ y_v = \frac{1}{|U|} \sum\limits_{x_i \in U} y_i $
Regression tree \( a(x) \) is a piecewise constant function.
The more complex the tree (the greater its depth), the higher the influence of noise in the data and the higher the risk of overfitting.
 
                        Mean squared error with a penalty for tree complexity:
\[ C_\alpha(a) = \sum\limits_{i=1}^\ell (a(x_i) - y_i)^2 + \alpha |V_{\text{leaf}}| \to \min\limits_\alpha \]As \(\alpha\) increases, the tree is progressively simplified. Moreover, the sequence of nested trees is unique. From this sequence, the tree with the minimum error on the test sample (Hold-Out) is selected.
For the case of classification, a similar pruning strategy is used, but with the Gini criterion.
Purpose: to reduce the enumeration of predicates of the form \([{\color{orange}\alpha} \leq f(x) \leq {\color{orange}\beta}]\)
Given: a set of real-valued feature vectors \(f(x_i), x_i \in X^l\)
Find: the best (in any sense) division of the feature value range into a relatively small number of zones:
 
                                Greedy strategy overcomplicates tree structure, so we need..
\(X^q\) — Independent validation set, \(q \approx 0.5\ell\)