Why is it possible to remove part of a big model?
In short: we don’t need all the parameters for inference, but we do need them during training.
We know for sure that half the stocks in our portfolio are worthless — the only problem is, we just don't know which half!
Post-Training Quantization
For the 1-d array [ 0.5, 20, -0.0001, -0.01, -0.1 ]:
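A minimal sketch of what post-training quantization does to that array, assuming a simple asymmetric int8 (affine) scheme; the function names are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(x):
    """Affine post-training quantization: map the float range onto 256 levels."""
    x = np.asarray(x, dtype=np.float32)
    scale = (x.max() - x.min()) / 255.0      # step size of one integer level
    zero_point = np.round(-x.min() / scale)  # integer that represents 0.0
    q = np.clip(np.round(x / scale + zero_point), 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = [0.5, 20, -0.0001, -0.01, -0.1]
q, scale, zero_point = quantize_int8(x)
x_hat = dequantize(q, scale, zero_point)
# The large value 20 survives almost exactly, but the step size is ~0.079,
# so -0.0001 and -0.01 round to the same integer level and collapse to 0.
```

The outlier 20 stretches the range, which is exactly why small-magnitude parameters lose their resolution after quantization.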
Pruning: remove parameters from the model after training.
[See et al. 2016], [Han et al. 2015]
[Figure: BLEU score versus percentage of parameters pruned]
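A minimal sketch of magnitude pruning in the spirit of [Han et al. 2015]: zero out the fraction of weights with the smallest absolute values. The function name and the reuse of the earlier array are illustrative.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest |w|."""
    w = np.asarray(weights, dtype=np.float32)
    k = int(sparsity * w.size)  # number of weights to remove
    if k == 0:
        return w.copy()
    # Threshold = k-th smallest absolute value; prune everything at or below it.
    threshold = np.sort(np.abs(w).ravel())[k - 1]
    mask = np.abs(w) > threshold
    return w * mask

w = np.array([0.5, 20, -0.0001, -0.01, -0.1])
pruned = magnitude_prune(w, sparsity=0.6)
# With 60% sparsity, the three smallest-magnitude entries are zeroed,
# leaving only 0.5 and 20.
```

In practice this is done per layer and is usually followed by a fine-tuning pass to recover the lost accuracy.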
Train one model (the “student”) to replicate the behavior of another model (the “teacher”)
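A minimal sketch of the distillation objective, assuming the classic soft-target formulation: the student matches the teacher's temperature-softened output distribution via a KL divergence. The logits and temperature here are illustrative, and real setups typically add a hard-label loss term as well.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax; higher T spreads probability mass."""
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)  # student's current distribution
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [5.0, 1.0, 0.5]  # illustrative teacher logits
student = [4.0, 2.0, 0.0]  # illustrative student logits
loss = distill_loss(student, teacher)
# loss > 0; it shrinks as the student's distribution approaches the teacher's.
```

The soft targets carry more information per example than one-hot labels, which is why a small student can approach the teacher's behavior.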
Distillation vs Quantization vs Pruning
[Figure: performance comparison between models]
$0 \cdot [a, b, c] = [0, 0, 0]$ — a zero coefficient wipes out the whole vector, so whenever a gate value is exactly 0 the corresponding computation can be skipped entirely.
$$ G(x) = \text{softmax}(\text{keep\_top\_k}(f_\text{gating}(x), k)) $$
$$ \text{keep\_top\_k}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top k elements of } v \\ -\infty & \text{otherwise} \end{cases} $$
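The two equations above can be sketched directly, assuming $f_\text{gating}$ is a plain linear map (as in standard sparse-MoE gating); the names and sizes are illustrative. Keeping only the top-$k$ logits and setting the rest to $-\infty$ before the softmax makes all but $k$ gate values exactly 0, so those experts are never evaluated.

```python
import numpy as np

def keep_top_k(v, k):
    """Keep the k largest entries of v; set the rest to -inf."""
    out = np.full_like(v, -np.inf)
    top = np.argsort(v)[-k:]  # indices of the k largest logits
    out[top] = v[top]
    return out

def softmax(z):
    z = z - np.max(z[np.isfinite(z)])  # stabilize over the finite entries
    e = np.exp(z)                      # exp(-inf) = 0, so masked entries vanish
    return e / e.sum()

def gate(x, W_g, k=2):
    logits = x @ W_g  # f_gating(x): a linear layer, for illustration
    return softmax(keep_top_k(logits, k))

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # one token's hidden state
W_g = rng.normal(size=(4, 8))   # gating weights for 8 experts
g = gate(x, W_g, k=2)
# Exactly 2 of the 8 gate values are nonzero, and they sum to 1.
```

Only the $k$ selected experts run their forward pass, which is how compute stays roughly constant as the total parameter count grows.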
The SMoE (Sparse Mixture-of-Experts) MLP is Mixtral's distinguishing feature, and it is the reason Mixtral achieves strong performance at low inference cost.