Model Compression and Agents

Test your understanding

Question 1 of 15
1 Which three families of techniques were presented as the main approaches to model compression?
  • Normalization, Dropout, Data Augmentation
  • Quantization, Pruning, Distillation
  • Tokenization, Embedding, Fine-tuning
  • Regularization, Early Stopping, Cross-validation
Explanation: The lecture introduced three complementary model compression techniques:
  • Quantization keeps the model architecture the same but reduces the number of bits used to represent each weight (e.g., Int8, or 4-bit via QLoRA).
  • Pruning removes parts of the model (individual weights in unstructured pruning, or entire components in structured pruning) while retaining performance.
  • Distillation trains a smaller student model to imitate the behavior of a larger teacher model (e.g., DistilBERT).
The motivation behind all three is the observation that LLMs have become massive, and one week of ChatGPT usage is comparable in cost to its training — so we need to deploy models cheaply, efficiently, and equitably without sacrificing performance. The underlying insight is that we do not actually need all the parameters for inference, only for training.
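The first two techniques from the explanation can be sketched in a few lines of NumPy. This is an illustrative toy, not how any production library implements them: it shows per-tensor symmetric Int8 quantization and unstructured magnitude pruning on a small weight array (the function names and the 50% sparsity level are chosen for illustration).

```python
import numpy as np

# --- Quantization: same architecture, fewer bits per weight ---
def quantize_int8(w):
    """Map float32 weights onto the int8 range [-127, 127] with one scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)  # 8 bits per weight instead of 32
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for inference."""
    return q.astype(np.float32) * scale

# --- Unstructured pruning: zero out individual low-magnitude weights ---
def magnitude_prune(w, sparsity=0.5):
    """Set the `sparsity` fraction of weights with the smallest magnitudes to zero."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

w = np.array([0.12, -0.50, 0.33, -0.07], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)        # close to w, at a quarter of the storage
w_sparse = magnitude_prune(w, 0.5)  # half the weights removed (set to zero)
```

The round trip `dequantize(quantize_int8(w))` introduces at most half a quantization step of error per weight, which is why quantization often preserves accuracy while shrinking storage fourfold (float32 to int8); pruning instead trades a controllable amount of sparsity for memory and compute savings.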