1
Which three families of techniques were presented as the main approaches to model compression?
Normalization, Dropout, Data Augmentation
Quantization, Pruning, Distillation
Tokenization, Embedding, Fine-tuning
Regularization, Early Stopping, Cross-validation
Explanation: The lecture introduced three complementary model compression techniques: (1) Quantization keeps the model architecture the same but reduces the number of bits used to represent weights (e.g., Int8 or 4-bit via Q-LoRA), (2) Pruning removes parts of the model (individual weights for unstructured pruning, or entire components for structured pruning) while retaining performance, (3) Distillation trains a smaller student model to imitate the behavior of a larger teacher model (e.g., DistilBERT). The motivation behind all three is the observation that LLMs are becoming massive and one week of ChatGPT usage is comparable to its training cost — so we need to deploy models cheaply, efficiently, and equitably without sacrificing performance. The underlying insight is that we do not actually need all the parameters for inference, only for training.
2
In absolute-maximum (absmax) Int8 quantization, what scaling factor is applied to every element of an Fp16 array?
255 / sum of absolute values
127 / max of absolute values
128 / mean of absolute values
1 / standard deviation
Explanation: The absmax Int8 quantization formula is X_i8 = round(127 · X_f16 / max_ij(|X_f16ij|)). Dividing by the maximum absolute value in the tensor and multiplying by 127 rescales the range into the signed Int8 interval [−127, 127]. For example, in the lecture's worked example [0.5, 20, −0.0001, −0.01, −0.1], the maximum absolute value is 20, so every element is multiplied by 127/20 and rounded, producing [3, 127, 0, 0, −1]. Using the maximum (rather than the mean or a constant) guarantees that no value gets clipped outside the representable range, while still using the full dynamic range of Int8.
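A minimal NumPy sketch of this computation, using the lecture's worked example (the function name and the use of NumPy are illustrative):

    import numpy as np

    def absmax_quantize(x):
        # Scale so the largest-magnitude entry maps to +/-127, then round to Int8.
        scale = 127.0 / np.max(np.abs(x))
        x_int8 = np.round(x * scale).astype(np.int8)
        return x_int8, scale

    x = np.array([0.5, 20.0, -0.0001, -0.01, -0.1], dtype=np.float16)
    x_int8, scale = absmax_quantize(x)
    print(x_int8)          # [  3 127   0   0  -1]
    print(x_int8 / scale)  # rough dequantization back to floating point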
3
What is the headline result of Q-LoRA (Dettmers et al., 2023)?
It removes the need for any fine-tuning data
It doubles the accuracy of full-precision fine-tuning
It enables fine-tuning a 65B-parameter model on a single 48GB GPU
It eliminates the need for the attention mechanism
Explanation: Q-LoRA combines 4-bit quantization of the frozen base model with LoRA adapters that are trained in higher precision. The celebrated result highlighted in the slides is that you can train (fine-tune) a 65B-parameter model on a single 48GB GPU. The trick is to (1) quantize the huge base model's weights into 4 bits (massively reducing memory for the frozen backbone), (2) keep the LoRA adapters in higher precision so the learning signal is not degraded, and (3) dequantize weights on-the-fly during matmul. This makes fine-tuning of very large models accessible on commodity hardware — one of the most important practical breakthroughs in democratising LLM research.
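A hedged sketch of what this looks like with the Hugging Face transformers / peft / bitsandbytes stack (the model name, LoRA rank, and target modules are illustrative choices, not taken from the slides):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # 4-bit NF4 quantization of the frozen backbone (the "Q" in Q-LoRA).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,  # weights are dequantized to bf16 for matmuls
    )
    model = AutoModelForCausalLM.from_pretrained(
        "huggyllama/llama-65b", quantization_config=bnb_config, device_map="auto"
    )

    # Higher-precision LoRA adapters; only these small matrices receive gradients.
    lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                             target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # a tiny fraction of the 65B total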
4
What is the key distinction between pruning and quantization?
Pruning is applied during training, quantization is only post-training
Quantization changes the precision of all parameters; pruning sets a subset of parameters to zero and leaves the rest unchanged
Pruning works only on CNNs; quantization works only on Transformers
Quantization always produces a smaller file size than pruning
Explanation: As stated explicitly in the slides: "Quantization: no parameters are changed, up to k bits of precision. Pruning: a number of parameters are set to zero, the rest are unchanged." In quantization, every parameter is kept but represented with fewer bits (e.g., Int8 instead of Fp32), so the precision of every weight is reduced. In pruning, the parameters that survive keep their full precision, but a subset of weights are forced to zero. Distillation is a third category in which essentially all parameters change because we train a new, smaller student model from scratch to imitate the teacher. Understanding this taxonomy matters because the three approaches can be combined (e.g., prune + quantize) and because the deployment implications (sparse matmul kernels, low-bit arithmetic units, retraining pipelines) differ.
5
Magnitude pruning (See et al. 2016, Han et al. 2015) is classified as which kind of pruning?
Unstructured pruning
Structured pruning
Gradient-based pruning
Reinforcement-learning pruning
Explanation: Magnitude pruning zeroes out the X% of weights with the smallest absolute value anywhere in the model. Because the selected weights can sit anywhere in any matrix, the resulting sparsity pattern has no regular structure; this is unstructured pruning. Structured pruning, in contrast, removes entire components: whole neurons, attention heads, filters, or even full layers (e.g., Xia et al. 2022). Unstructured pruning often preserves accuracy at very high sparsity levels, but its irregular pattern is hard to accelerate on standard GPUs; structured pruning is easier to speed up in practice because it yields a smaller, dense model, but it usually causes a larger accuracy drop at the same level of parameter reduction, so it requires more careful tuning.
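A minimal PyTorch sketch of unstructured magnitude pruning (the tensor and sparsity level are arbitrary; torch.nn.utils.prune.l1_unstructured provides a built-in equivalent):

    import torch

    def magnitude_prune(weights, sparsity):
        # Zero out the `sparsity` fraction of weights with the smallest absolute value.
        k = int(sparsity * weights.numel())
        if k == 0:
            return weights
        threshold = weights.abs().flatten().kthvalue(k).values
        return weights * (weights.abs() > threshold)

    w = torch.randn(4, 4)
    w_pruned = magnitude_prune(w, sparsity=0.5)  # roughly half the entries become zero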
6
In the "Pruning with only forward passes" method (Dery et al. 2024), why is the avoidance of gradients a valuable property?
Gradients are unavailable for Transformer architectures
Structured pruning of big models via backprop is extremely memory-hungry; forward-only evaluation removes that bottleneck
Forward passes always produce a better solution than gradient-based methods
It allows pruning to be done without any data at all
Explanation: The slide's motivation is literally: "Structured pruning of big models requires a lot of memory — can we avoid using gradients?" Computing gradients requires storing the full computation graph and intermediate activations, which for an LLM with billions of parameters can easily exceed GPU memory. The proposed idea is to (1) mask different modules and measure the resulting performance using only forward passes, and (2) fit a regression model that predicts the impact of each mask. This lets you search for good pruning masks on very large models without ever needing a backward pass, trading a little statistical efficiency for a large reduction in memory footprint. The method does still require data (to measure performance) and does not claim to be universally better than gradient-based approaches.
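A heavily simplified, self-contained sketch of the idea (this is not the paper's exact algorithm; the module count, mask sampling, and linear regression are stand-ins used only to illustrate forward-only mask search):

    import numpy as np

    def score(mask):
        # Stand-in for evaluating the masked model on a small calibration set
        # using forward passes only (e.g., returning negative perplexity).
        return float(np.random.rand())

    n_modules, n_samples = 32, 200
    masks = (np.random.rand(n_samples, n_modules) > 0.5).astype(float)
    scores = np.array([score(m) for m in masks])

    # Regression attributes performance to individual modules without any gradients.
    coef, *_ = np.linalg.lstsq(masks, scores, rcond=None)
    keep = np.argsort(coef)[-16:]  # keep the modules with the highest estimated utility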
7
Which combination of tricks best summarises how DistilBERT (Sanh et al. 2019) is trained?
Train from scratch with the masked-LM loss only
Initialise from alternating BERT layers, then combine a distillation loss with a cosine similarity loss on hidden states (supervised loss adds little)
Randomly initialise and train only with reinforcement learning from human feedback
Use quantization-aware training with no teacher model at all
Explanation: DistilBERT keeps roughly half the layers and 60% of the parameters of BERT while retaining most of its performance. The recipe in the paper, echoed on the slide, has three ingredients: (1) Warm-start the student by initialising its layers from alternating layers of the full BERT teacher — this gives training a much better starting point than random init. (2) Combine supervised and distillation losses: the student is trained on the masked-language-model objective and on a KL loss matching the teacher's output distribution. Empirically the supervised loss alone did not help much; most of the gain comes from distillation. (3) Add a cosine-similarity loss between the teacher's and student's hidden-state vectors, which forces internal representations to align, not just the output probabilities. Together these tricks let a much smaller model stay close in quality to its teacher.
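A compact PyTorch sketch of how the three loss terms can be combined (the temperature and loss weights are illustrative, not the paper's exact values):

    import torch.nn.functional as F

    def distil_loss(student_logits, teacher_logits, hidden_s, hidden_t, labels,
                    T=2.0, w_kd=5.0, w_mlm=2.0, w_cos=1.0):
        # (1) Distillation: KL between softened teacher and student output distributions.
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      F.softmax(teacher_logits / T, dim=-1),
                      reduction="batchmean") * T * T
        # (2) Supervised masked-LM loss on the original labels.
        mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                              labels.view(-1), ignore_index=-100)
        # (3) Cosine loss aligning student and teacher hidden states.
        cos = 1 - F.cosine_similarity(hidden_s, hidden_t, dim=-1).mean()
        return w_kd * kd + w_mlm * mlm + w_cos * cos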
8
In a Sparsely Gated Mixture-of-Experts layer (Shazeer et al. 2017), the gating function is G(x) = softmax(keep_top_k(f_gating(x), k)). What is the role of the keep_top_k operation?
It normalises the expert outputs to sum to one
It sets all but the top-k gating scores to −∞ so that softmax assigns zero weight to the non-selected experts, yielding sparse computation
It averages the outputs of all experts before softmax
It randomly drops k experts as a regularizer
Explanation: The formula is keep_top_k(v, k)_i = v_i if v_i is among the top-k entries of v, and −∞ otherwise. Since softmax(−∞) = 0, any expert that is not in the top k receives exactly zero gating weight. Because a zero gating weight multiplied by an expert's output contributes nothing to the final sum, we do not have to compute the outputs of those experts at all. That is the core trick that makes Sparse MoE cheap: we keep a huge total parameter count (all experts) but only execute k of them per token (typically k=1 or 2 in modern MoE models). The other options describe operations that would either be redundant or destroy the sparsity: option A describes what the softmax itself already does, option C averages everything, and option D is a dropout-style idea unrelated to the gating formula.
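A small PyTorch sketch of this gating computation (variable names are illustrative):

    import torch
    import torch.nn.functional as F

    def sparse_gate(gating_logits, k):
        # Keep the k largest logits, set everything else to -inf, then softmax.
        topk_vals, topk_idx = gating_logits.topk(k, dim=-1)
        masked = torch.full_like(gating_logits, float("-inf"))
        masked.scatter_(-1, topk_idx, topk_vals)
        return F.softmax(masked, dim=-1)  # exactly zero weight on non-selected experts

    g = sparse_gate(torch.randn(1, 8), k=2)
    print(g)  # only 2 of the 8 gate values are non-zero, and they sum to 1

Because the non-selected experts receive a gate of exactly zero, a full MoE layer only needs to run the k selected experts for each token.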
9
What is the core trade-off that a Sparse Mixture-of-Experts (SMoE) model like Mixtral-8x7b exploits?
It trades memory for latency by caching activations aggressively
It trades high total parameter count for low per-token compute by running only a few experts per input
It trades accuracy for interpretability by using shallow experts
It trades training cost for inference cost by retraining on every query
Explanation: Mixtral-8x7b has 8 experts per MoE block but activates only 2 per token. The total parameter count is therefore large (≈ 47B parameters), yet the active parameter count per forward pass is much smaller (≈ 13B) — giving roughly the compute cost of a 13B model with the capacity of a 47B one. That is the SMoE bargain: capacity without proportional compute. The gating network learns to route tokens to the most appropriate experts, so specialisation emerges across experts. The cost of this deal is (a) the memory footprint of keeping all experts in memory, and (b) the added complexity of the learned gating mechanism, which must be kept balanced so that no single expert collapses or is starved of tokens.
10
In the LangChain notebook, what does the @tool decorator (with parse_docstring=True) on a Python function achieve?
It caches the function's return value for faster repeated calls
It wraps the function as a LangChain Tool whose schema (name, description, argument types) is built from the signature and docstring so an LLM can call it
It converts the function into an async generator for streaming
It fine-tunes the underlying LLM on the function's examples
Explanation: The @tool decorator registers a regular Python function as a Tool that an LLM can invoke. LangChain automatically builds a JSON-schema description from (a) the function name, (b) its type-annotated arguments, and (c) the docstring (with parse_docstring=True it even parses the "Args:" section to produce per-argument descriptions). That schema is then passed to the model via llm.bind_tools([...]). When the user asks something like "What is the temperature in London?", the model responds with a structured tool-call (name + arguments) instead of free text, and the notebook demonstrates executing that call with selected_tool.invoke(tool_call) and appending the result back into the message history so the LLM can produce the final answer. This is the foundation of tool-using (function-calling) agents.
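A minimal sketch in the notebook's style (the weather function body, the model name, and the example return value are illustrative stand-ins):

    from langchain_core.tools import tool
    from langchain_openai import ChatOpenAI

    @tool(parse_docstring=True)
    def get_weather(city: str) -> str:
        """Return the current weather for a city.

        Args:
            city: Name of the city to look up.
        """
        return f"It is 18 degrees and cloudy in {city}."  # stand-in for a real API call

    llm = ChatOpenAI(model="gpt-4o-mini")
    llm_with_tools = llm.bind_tools([get_weather])
    msg = llm_with_tools.invoke("What is the temperature in London?")
    print(msg.tool_calls)  # e.g. [{'name': 'get_weather', 'args': {'city': 'London'}, ...}]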
11
In the notebook, why are Pydantic BaseModel classes such as CityWeather combined with llm.with_structured_output(...)?
To train the LLM on domain-specific data
To force the LLM to emit output that conforms to a validated, typed schema so downstream code can rely on fields like city and temperature
To encrypt the LLM's response in transit
To translate the response into another human language
Explanation: Raw LLM text is fragile to parse — small variations in wording can break a pipeline. Pydantic models solve this by defining a strict, typed schema (e.g., city: str, temperature: float, wrapped in a list via CityWeatherList). LangChain's with_structured_output uses the model's built-in function/tool-calling to constrain the generation to match that schema, and automatically validates the result. This gives you two benefits: (1) the programmer can access values as typed Python attributes (e.g., answer.cities[0].temperature) or convert to JSON via model_dump(), and (2) the LLM is much less likely to produce malformed output, because both the schema and validation are part of the request. Pydantic itself is not training or encrypting anything; it is a structural contract between the LLM and your code.
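A minimal sketch of the pattern (the field definitions follow the explanation above; the model name and prompt are illustrative):

    from pydantic import BaseModel
    from langchain_openai import ChatOpenAI

    class CityWeather(BaseModel):
        city: str
        temperature: float

    class CityWeatherList(BaseModel):
        cities: list[CityWeather]

    llm = ChatOpenAI(model="gpt-4o-mini")
    structured_llm = llm.with_structured_output(CityWeatherList)
    answer = structured_llm.invoke("Report the current temperature in London and Paris.")
    print(answer.cities[0].temperature)  # typed attribute access
    print(answer.model_dump())           # plain dict, ready for JSON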
12
What is a vector store such as FAISS in the context of LangChain, and why is it central to Retrieval-Augmented Generation (RAG)?
A relational database that stores raw text and queries it with SQL
A database that stores dense embedding vectors and retrieves the most similar chunks for a query, providing external context to the LLM
A cache of previously generated answers from the LLM
A compiler that turns English prompts into SQL
Explanation: A vector store efficiently indexes high-dimensional embedding vectors and supports approximate nearest-neighbour search. The notebook's workflow is: (1) load text with a loader (WebBaseLoader, CSVLoader, UnstructuredLoader, ...), (2) split it into chunks with a text splitter like RecursiveCharacterTextSplitter, (3) embed each chunk with OpenAIEmbeddings, (4) insert the embeddings into FAISS via FAISS.from_documents. At query time, the user question is embedded and the store returns the most similar chunks via similarity_search or a retriever. Those retrieved chunks are then injected into the LLM's prompt. This is why vector stores are the heart of RAG: they let the LLM "look things up" in an external, up-to-date knowledge base instead of relying only on its fixed training data, dramatically reducing hallucinations and enabling domain-specific answers.
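A condensed sketch of that workflow (the URL, chunk sizes, and query are placeholders):

    from langchain_community.document_loaders import WebBaseLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings
    from langchain_community.vectorstores import FAISS

    docs = WebBaseLoader("https://example.com/article").load()           # 1) load
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_documents(docs)                               # 2) split
    store = FAISS.from_documents(chunks, OpenAIEmbeddings())              # 3)+(4) embed and index

    hits = store.similarity_search("What does the article say about pruning?", k=3)
    context = "\n\n".join(d.page_content for d in hits)  # injected into the LLM prompt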
13
In the agentic RAG graph built with LangGraph, the pipeline contains nodes such as agent, retrieve, grade_documents, rewrite, and generate. What is the purpose of grade_documents and rewrite?
They train the retriever on new user queries
They judge whether retrieved documents are relevant; if not, the query is reformulated and retrieval is retried before generation
They convert the LLM's text into speech for a voice assistant
They prune unused parameters of the underlying LLM
Explanation: Classical RAG is a one-shot pipeline: embed → search → generate. Agentic RAG adds a feedback loop, which is exactly what LangGraph is designed to express — a stateful, multi-node graph with conditional edges. grade_documents uses an LLM to judge whether the retrieved chunks actually answer the user's question. If they do, control flows to generate to produce the final answer. If not, control flows to rewrite, which asks the LLM to reformulate the original question into a better search query, and the loop repeats. This makes retrieval more robust against poorly worded questions and lets the system "reason about" its own search results instead of blindly trusting the first hit. The agent node orchestrates when to call tools (including the retriever tool) versus when to answer directly.
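A simplified, self-contained sketch of the grading step as a routing decision (the notebook's actual node operates on the graph state; the prompt, model name, and schema here are illustrative):

    from pydantic import BaseModel
    from langchain_openai import ChatOpenAI

    class Grade(BaseModel):
        binary_score: str  # "yes" if the chunk is relevant to the question, else "no"

    grader = ChatOpenAI(model="gpt-4o-mini").with_structured_output(Grade)

    def grade_documents(question: str, chunk: str) -> str:
        verdict = grader.invoke(
            f"Question: {question}\n\nRetrieved chunk: {chunk}\n\n"
            "Is the chunk relevant to the question? Answer yes or no."
        )
        # Relevant chunks flow to generation; irrelevant ones trigger a query rewrite.
        return "generate" if verdict.binary_score == "yes" else "rewrite"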
14
The slides classify modern LLM-based systems into "Text agent" (e.g., ELIZA), "LLM agent" (e.g., SayCan), and "Reasoning agent" (e.g., ReAct, AutoGPT). What does the Reasoning-agent level add on top of a plain LLM agent?
It replaces the LLM with symbolic rules
It uses the LLM to both reason (e.g., chain-of-thought) and act (tool use, retrieval, code, environment interaction) in interleaved steps
It removes the environment and operates purely on static prompts
It restricts the LLM to a single pre-defined action per session
Explanation: The slides present three levels: Level 1 (text agents like ELIZA) react with hard-coded rules; Level 2 (LLM agents like SayCan) use an LLM as the action policy but do not explicitly reason; Level 3 (Reasoning agents, ReAct/AutoGPT) interleave Thought → Action → Observation → Thought → ... steps, letting the LLM plan and self-correct while also calling tools or interacting with environments. ReAct's empirical advantage is shown in the table: on PaLM-540B, reason-only prompting scores 29.4 on HotpotQA and act-only 25.7, while combining reasoning and acting lifts the best score to 35.1. The same pattern holds for FEVER (fact checking) and ALFWorld (text games). The intuition is that reasoning helps the agent decide which action to take next, while acting lets the agent fetch external information that cannot be hallucinated.
15
Which statement about LangGraph best matches how it was used to build the agentic RAG system in the notebook?
LangGraph is a pre-trained language model competing with GPT-4
LangGraph is a framework for building stateful, multi-actor LLM applications as explicit graphs of nodes and (possibly conditional) edges, with a shared state such as AgentState
LangGraph is a visualisation library for neural-network computation graphs
LangGraph replaces the need for any LLM by using symbolic rules
Explanation: In the notebook, LangGraph is imported as from langgraph.graph import END, StateGraph, START, and a workflow is assembled with workflow.add_node("agent", agent), workflow.add_node("retrieve", ToolNode([retriever_tool])), etc. Nodes are connected by edges, some of them conditional (e.g., after grade_documents, branch to either generate or rewrite). The graph shares an AgentState (a TypedDict whose messages field uses add_messages to accumulate conversation state). That state threads through every node as the graph is executed. This graph-of-actors abstraction is what distinguishes LangGraph from plain LangChain chains: loops, branches, retries and multi-agent coordination become first-class, while the underlying reasoning is still done by LLMs — LangGraph is orchestration, not a model.
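A minimal runnable skeleton showing the state-and-nodes pattern (the agent node here is a stand-in; the notebook's real nodes call LLMs and tools):

    from typing import Annotated, Sequence, TypedDict
    from langchain_core.messages import BaseMessage
    from langgraph.graph import StateGraph, START, END
    from langgraph.graph.message import add_messages

    class AgentState(TypedDict):
        # add_messages makes each node's returned messages accumulate in the shared state.
        messages: Annotated[Sequence[BaseMessage], add_messages]

    def agent(state: AgentState):
        # Stand-in node: a real node would call an LLM (possibly with bound tools).
        return {"messages": []}

    workflow = StateGraph(AgentState)
    workflow.add_node("agent", agent)
    workflow.add_edge(START, "agent")
    workflow.add_edge("agent", END)
    graph = workflow.compile()
    result = graph.invoke({"messages": []})

Conditional edges (e.g., branching from grade_documents to either generate or rewrite) are added with workflow.add_conditional_edges, which is what turns the linear chain into a loop.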