Given
The task is to find
such that
The solution to the clustering task is fundamentally ambiguous:
Given: set of objects $X$, set of classes $Y$
$X^k = \begin{array}{c} \{x_1, \dots, x_k \} \\ \color{orange}{\{y_1, \dots, y_k \}} \\ \end{array}$ — labeled dataTwo variants of the task formulation:
Example 1: Class densities restored.
using labeled data $X^k$ | using full data $X^\ell$ |
Example 2: Classification methods do not account for the cluster structure of unlabeled data
Example 3: Clustering methods do not consider the priority of labeling over cluster structure. How to classify objects in clusters:
Suppose only pairwise distances between objects are known.
Average intra-cluster distance:
$$F_0 = \frac{\sum\limits_{i < j}[a_i = a_j] \rho(x_i, x_j)}{\sum\limits_{i < j}[a_i = a_j]} \to \min$$Average inter-cluster distance:
$$F_1 = \frac{\sum\limits_{i < j}[a_i \neq a_j] \rho(x_i, x_j)}{\sum\limits_{i < j}[a_i \neq a_j]} \to \max$$Ratio of functionals: $F_0 / F_1 \to \min$
Let the objects \(x_i\) be represented by vectors of features \( (f_1(x_i),\dots,f_n(x_i)) \)
Sum of average intra-cluster distances:
\[ \Phi_0 = \sum\limits_{a \in Y} \frac{1}{|X_a|} \sum\limits_{i: a_i = a} \rho(x_i, \mu_a) \to \min \]
\( X_a = \{x_i \in X^\ell | a_i = a\} \) — cluster \( a \), \( \mu_a \) — centroid of cluster \( a \)
Sum of inter-cluster distances:
\[ \Phi_1 = \sum\limits_{a, b \in Y} \rho(\mu_a, \mu_b) \to \max \]
Ratio of functionals: $ \Phi_0 / \Phi_1 \to \min $
Weights on pairs of objects (proximity): \( w_{ij} = \exp(-\beta \rho(x_i, x_j)) \), where \( \rho(x, x^\prime) \) is the distance between objects, \( \beta \) is a parameter
Find cluster labels \(a_i:\ \sum\limits_{i=1}^\ell \sum\limits_{j=i+1}^\ell w_{ij} [a_i \neq a_j] \to \min\limits_{\{a_i \in Y\}}\)
where \(\lambda\) is another parameter
Minimization of the sum of squared intra-class distances:
\[ \sum\limits_{i=1}^\ell \|x_i - \mu_{a_i} \|^2 \to \min\limits_{\{a_i\}, \{\mu_a\}},\\ \|x_i - \mu_{a} \|^2 = \sum\limits_{j=1}^n (f_j(x_i) - \mu_{a_j})^2 \]
Input: \(X^\ell, K = |Y|\)
Output: cluster centers \( \mu_a \) and labels \( a \in Y \)
* Here \( D(x) \) is the distance from \( x \) to the nearest cluster
David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding, 2007
Reason: unfortunate initial approximation and "ribbon-like" clusters
Given labeled objects \( \{x_1, \dots, x_k\} \), with \( \color{orange}{U} \) being the set of unlabeled objects
Input: \( X^\ell, K = |Y| \). Output: cluster centers \( \mu_a, a \in Y \)
DBSCAN — Density-Based Spatial Clustering of Applications with Noise
Consider an object \( x \in U \), and its \( \varepsilon \)-neighborhood \( U_\varepsilon(x) = \{ u \in U: \rho(x,u) \leq \varepsilon\} \)
Each object can be one of three types:
Ester, Kriegel, Sander, Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD-1996
Input: dataset \( X^\ell = \{x_1, \dots, x_\ell\} \), parameters \( \varepsilon \) and \( m \)
Output: partition of the dataset into clusters and noise points
\( U = X^\ell \) — unlabeled, \( a = 0 \)
While there are unlabeled points in the dataset, i.e. \( U \neq \emptyset \):
Algorithm of hierarchical clustering (Lance, Williams, 1967):
Iterative recalculation of distances \( R_{UV} \) between clusters \( U, V \)
1. Nearest neighbor distance:
\( R^{\text{nn}}_{WS} = \min_{w \in W, \in S} \rho (w,s) \)
\( \alpha_{U} = \alpha_{V} = \frac12, \beta = 0, \gamma = -\frac12 \)
2. Farthest neighbor distance:
\( R^{\text{fn}}_{WS} = \max_{w \in W, \in S} \rho (w,s) \)
\( \alpha_{U} = \alpha_{V} = \frac12, \beta = 0, \gamma = \frac12 \)
3. Group average distance:
\( R^{\text{g}}_{WS} = \frac{1}{|W| |S|} \sum\limits_{w \in W} \sum\limits_{s \in S} \rho(w, s) \)
\( \alpha_{U} = \frac{|U|}{|W|}, \alpha_{V} = \frac{|V|}{|W|}, \beta = \gamma = 0 \)
\( R^{\text{c}}_{WS} = \rho^2\left(\sum\limits_{w \in W} \frac{w}{|W|}, \sum\limits_{s \in S} \frac{s}{|S|} \right) \)
\( \alpha_{U} = \frac{|U|}{|W|}, \alpha_{V} = \frac{|V|}{|W|}, \beta = -\alpha_{U}\alpha_{V}, \gamma = 0 \)
\( R^{\text{w}}_{WS} = {\color{orange}\frac{|S||W|}{|S|+|W|}} \rho^2\left(\sum\limits_{w \in W} \frac{w}{|W|}, \sum\limits_{s \in S} \frac{s}{|S|} \right) \)
\( \alpha_{U} = \frac{|S|+|U|}{|S|+|W|}, \alpha_{V} = \frac{|S|+|V|}{|S|+|W|}, \beta = \frac{-|S|}{|S|+|W|}, \gamma = 0 \)
Embedding Diagram | Dendrogram |
---|---|
Embedding Diagram | Dendrogram |
---|---|
Theorem (Milligan, 1979): Clustering is monotonic if the conditions \( \alpha_U \geq 0, \alpha_V \geq 0, \alpha_U + \alpha_V + \beta \geq 1, \min\{\alpha_U, \alpha_V\} + \gamma \geq 0 \) are met
Let \( \mu: X^k \to a \) be a method for learning classification
Classifiers are of the form \( a(x) = \arg\max\limits_{y \in Y} \Gamma_y(x) \) , where \(\Gamma_y(x)\) is raw score
Pseudo-margin — the degree of confidence in the classification \( a_i = a(x_i) \):
\[ \boxed{ M_i(a) = \Gamma_{a_i}(x_i) - \max\limits_{y \in Y/a_i} \Gamma_y(x_i)} \]Self-Training Algorithm — a wrapper over the method \( \mu \):
\( \color{orange}{M_0} \) can be determined, for example, from the condition \( |\Delta| = 0.05 |U| \)
Let \( \mu_t: X^k \to a_t \) be different learning methods, \( t = 1, \dots, T \)
Co-learning Algorithm is self-training for a composition — simple voting of base algorithms \( a_1, \dots, a_T \):
\[ \boxed{a(x) = \arg\max\limits_{y in Y} \Gamma_y(x),\ \Gamma_y(x_i) = \sum\limits_{t=1}^T[a_t(x_i) = y]} \]Let \( \mu_1: X^k \to a_1,\ \mu_2: X^k \to a_2 \) be two substantially different learning methods, using either:
What else to look at?