The Opaque Box · Part II · Chapter 14

Passing the Torch — Distillation

Chapter 13 ended in a data center: a frontier model practicing against an answer key until long, careful reasoning emerged on its own. That model is enormous, and the training that made it is out of almost everyone’s reach — including ours. But the reasoning it learned did not stay locked inside it. Here is the crack in the vault: what was grown by reinforcement learning can be written down — and what can be written down can be taught, to a model small enough to hold. This chapter is about the apprentice.

14.1 The problem with the frontier

Take stock of what Chapter 13 actually cost, because the bill is the whole reason this chapter exists. Reinforcement learning with verifiable rewards runs a frontier-scale model through enormous numbers of practice problems, sampling groups of long answers for each one, checking them, and nudging 100-billion-plus parameters after every batch. That is data-center work, measured in weeks of cluster time. We said so plainly then, and it stays true now: nothing about that loop fits on the machine in front of you except its shape.

And the product of all that work has the same problem. DeepSeek-R1 — the open-weight reasoning model whose training the previous chapter walked through — is far too large for most people to run at all. A capability that only exists at one size, in one place, behind one API or one rack of datacenter GPUs, is a capability most of the world can only rent — on the landlord’s terms, for as long as the landlord keeps the lights on.

So here is the question this chapter answers: can the capability move? Not the weights — the weights are just numbers sized for a machine you don’t have. The capability. The learned habit of working a problem in steps, checking, backtracking, landing on an answer.

The deep-learning literature has held an answer since 2015, and it is one of the oldest and most elegant tricks in the book. The master does not hand the apprentice its hands — you cannot give away hands. It hands over the next best thing: its worked examples.

14.2 Distillation, the classic form

The founding paper is Hinton, Vinyals & Dean (2015), “Distilling the Knowledge in a Neural Network” — presented at the NIPS 2014 Deep Learning Workshop and posted to arXiv in March 2015. The setup: you have a large, cumbersome model (or an ensemble of models) that performs well, and you want a small model that performs nearly as well. The move: train the small model — the student — to match the outputs of the large model — the teacher — rather than training it from scratch on the raw labels alone.

Why should copying outputs beat learning from the original data? Because the teacher’s output is not just an answer — it is a full probability distribution, and that distribution encodes structure the hard label throws away. When an image classifier says “this is a cat: 90%,” it also says “dog: 6%, fox: 3%, car: 0.0001%.” Those small probabilities on the wrong answers carry the teacher’s hard-won sense of which mistakes are almost right — a cat is much more like a dog than like a car. The community came to call this rich structure in the soft outputs dark knowledge, and matching it (the paper softens the distributions with the same temperature dial you met in Chapter 8) gives the student a far denser training signal per example than a bare label ever could.

Dark knowledge: an ordinary hard label is one-hot — cat 100%, everything else flat zero. The teacher’s soft distribution keeps small probabilities on dog and fox (and a trace on car), and those glowing near-misses are the extra structure the student learns by matching the distribution rather than the bare label. Values are the illustrative figures used in this chapter’s text.
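To make the soft-target idea concrete, here is a minimal sketch in plain Python, so the arithmetic stays visible. The logits are invented for illustration, and the helper names (`softmax_t`, `distill_loss`) are this sketch's own, not code from the paper. The loss used here is the KL divergence between the teacher's and the student's temperature-softened distributions. It differs from cross-entropy against the soft targets only by a term that does not depend on the student. The factor of T squared follows the paper's remark that soft-target gradients shrink as 1/T² when the temperature rises.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax of logits / T. A larger T gives a flatter distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions, times T^2."""
    p = softmax_t(teacher_logits, T)          # the teacher's soft targets
    q = softmax_t(student_logits, T)          # the student's softened prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return (T * T) * kl

# Invented logits over the classes [cat, dog, fox, car].
teacher = [6.0, 3.3, 2.6, -7.0]
student = [2.0, 0.5, 0.1, 0.0]

sharp = softmax_t(teacher, T=1.0)   # close to a hard label
soft = softmax_t(teacher, T=4.0)    # near-misses become visible
print([round(x, 3) for x in sharp])
print([round(x, 3) for x in soft])
print(distill_loss(student, teacher), distill_loss(teacher, teacher))
```

Raising the temperature moves probability mass off the top class and onto the near-misses, which is what gives the student something to match. A student whose logits equal the teacher's scores a loss of zero.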

Hold on to the shape of that idea, because the 2025 version of it is about to do something the 2015 authors were not talking about at all. In classic distillation, the transmissible artifact is a distribution over answers. In reasoning distillation, the transmissible artifact is going to be the reasoning itself.

14.3 The 2025 form: the teacher writes it out

A reasoning model has a property that an image classifier never had: its intermediate work is text. The long chain Chapter 13’s reinforcement learning grew — the trying, the checking, the “wait, let me reconsider” — is emitted token by token, in language. Which means it can be captured, curated, and put in a file.

That is exactly what the DeepSeek-R1 paper did. The recipe, as the paper reports it:

The teacher writes. R1 — the full RL-trained reasoning model — generates worked solutions: problem, chain of reasoning, final answer.
The lab curates. Roughly 800,000 samples are assembled (about 600k reasoning traces plus about 200k non-reasoning examples), with low-quality generations filtered out along the way.
The students study. Six small dense models — 1.5B, 7B, 8B, 14B, 32B, and 70B parameters, built on open Qwen and Llama bases — are fine-tuned on those traces. Crucially, the paper is explicit that the distilled models are trained via SFT only: plain supervised fine-tuning, with no reinforcement learning stage at all.

Read that last point again, because it is the punchline of the whole chapter and it is almost too clean to trust. The hardest, most expensive, most exotic training process in this book — Chapter 13’s verifiable-reward RL — produces a capability that transfers to small models through the most ordinary process in this book: the supervised fine-tuning you built in Chapter 10. The RL was needed to discover the reasoning. Once discovered and written out, the reasoning travels as ordinary training text — and text does not care how much a data center cost to produce it.

And the transfer is not a consolation prize — the small model does not get a watered-down copy. The paper runs the comparison directly (its Section 4.1, “Distillation v.s. Reinforcement Learning”): take a 32B base model and either (a) train it with large-scale RL directly, the Chapter 13 way, or (b) simply fine-tune it on the big teacher’s traces. The distilled model — DeepSeek-R1-Distill-Qwen-32B — performs significantly better than the directly-RL-trained one across all their benchmarks. At small scale, learning from a great teacher’s worked examples beat doing the practice alone. The apprentice with the master’s notebooks outpaced the apprentice locked in a room with only the answer key.

An honest asymmetry lives inside that result: somebody still has to be the master. Distillation moves capability down from a model that already has it; it does not create the capability. The expensive discovery step at the top of the chain — the RL — is not eliminated. It is amortized: paid once, at frontier scale, and then spread across every student that learns from the traces.

Same 32B base, two paths: train it directly with large-scale RL (the Chapter 13 way) or simply fine-tune it on the frontier teacher’s traces. The R1 paper’s Section 4.1 comparison found the distilled model significantly better across all their benchmarks: the cheaper path also scored higher. Bar heights are qualitative; they show the direction of the finding, not benchmark numbers.

The self-taught ancestor

This idea — that a model’s own written-out reasoning is training data — did not appear from nowhere in 2025, and honesty demands naming the ancestor. Zelikman et al. (2022), “STaR: Bootstrapping Reasoning With Reasoning,” ran the loop with the teacher and student as the same model: generate reasoning chains, keep the ones that reach correct answers, fine-tune on your own successes, repeat. A model pulling itself up by its own solved problems. R1-style distillation is the two-model version of the same insight, with a much stronger teacher doing the writing.
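The self-taught loop is short enough to sketch. What follows is a skeleton under stated assumptions, not the STaR authors' code: `generate` and `fine_tune` are caller-supplied placeholders, the toy versions below are invented so the loop can run, and parts of the paper's procedure (its rationalization step, for one) are left out.

```python
def star_loop(model, problems, answers, generate, fine_tune, rounds=3):
    """Generate chains, keep those that reach the known answer, fine-tune, repeat.

    generate(model, problem)   -> (chain, answer)    supplied by the caller
    fine_tune(model, examples) -> updated model      supplied by the caller
    """
    for _ in range(rounds):
        successes = []
        for problem, gold in zip(problems, answers):
            chain, predicted = generate(model, problem)
            if predicted == gold:                # the answer key is the only filter
                successes.append((problem, chain, predicted))
        if not successes:
            break                                # nothing to learn from this round
        model = fine_tune(model, successes)
    return model

# Toy stand-ins, invented for this sketch: the "model" is an integer skill level
# that can solve any problem whose difficulty is at most skill + 1.
def toy_generate(skill, difficulty):
    if difficulty <= skill + 1:
        return f"double {difficulty}", str(2 * difficulty)
    return "a guess", "?"

def toy_fine_tune(skill, examples):
    # "Learning" here means rising to the hardest problem solved this round.
    return max(skill, max(p for p, _, _ in examples))

problems = [1, 2, 3, 4, 5]
answers = [str(2 * d) for d in problems]
print(star_loop(0, problems, answers, toy_generate, toy_fine_tune, rounds=3))
```

Each round the toy model solves one problem harder than it could before, which is the bootstrapping shape in miniature: the training data for round n+1 is whatever round n managed to get right.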

Either way, the deep fact underneath is the one to keep: the chain of thought is a transmissible artifact. Reasoning that was learned can be taught — in text, the same medium the whole book has been made of.

Reasoning distillation, R1-style: the RL-trained teacher writes its chains out; ~800k curated traces become an ordinary fine-tuning dataset; small students learn the habit by plain SFT — no reinforcement learning stage of their own.
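The middle of that pipeline, curated traces becoming a dataset, can be sketched in a few lines. Everything here is an assumption made for illustration: the single filter (keep a trace only if its final answer matches a known reference) is far simpler than whatever a lab actually runs, and the JSON Lines layout with `problem`, `chain`, and `answer` fields is this sketch's choice, not a format the R1 paper specifies.

```python
import io
import json

def normalize(ans):
    """Crude normalization: trim whitespace and trailing periods, lowercase."""
    return ans.strip().rstrip(".").strip().lower()

def curate(traces, reference):
    """Keep (problem, chain, answer) traces whose answer matches the reference.

    reference: dict mapping problem text -> known correct answer.
    Traces for problems without a reference are dropped: they cannot be checked.
    """
    return [
        (p, c, a) for p, c, a in traces
        if p in reference and normalize(a) == normalize(reference[p])
    ]

def write_jsonl(traces, fh):
    """Write one JSON object per line; newlines inside a chain are escaped."""
    for p, c, a in traces:
        fh.write(json.dumps({"problem": p, "chain": c, "answer": a}) + "\n")

def read_jsonl(fh):
    """Yield (problem, chain, answer) tuples back from a JSON Lines stream."""
    for line in fh:
        if line.strip():                       # tolerate blank lines
            r = json.loads(line)
            yield r["problem"], r["chain"], r["answer"]

problem = "A shelf holds 3 boxes of 12 books and 5 loose books. How many books?"
raw = [
    (problem, "3 * 12 = 36.\n36 + 5 = 41.", "41"),
    (problem, "3 + 12 = 15.\n15 + 5 = 20.", "20"),     # wrong answer: dropped
    (problem, "12 * 3 = 36, plus 5 is 41.", " 41. "),  # messy but correct: kept
]
kept = curate(raw, {problem: "41"})

buffer = io.StringIO()          # stands in for a file opened with open(path, "w")
write_jsonl(kept, buffer)
buffer.seek(0)
restored = list(read_jsonl(buffer))
print(len(kept), "of", len(raw), "kept; round trip ok:", restored == kept)
```

Each restored tuple can go straight into format_trace. Nothing about the file marks it as the product of reinforcement learning.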

14.4 Full circle to your bench

Here is the part that should feel like coming home — and it is the promise this book has been quietly paying down since Chapter 1. “Fine-tune a small model on (problem, chain, answer) text” is not a new machine. It is Chapter 10’s machinery, verbatim: a prompt/response template, loss masked to the response tokens, Chapter 7’s training loop at a gentle learning rate. The entire Part II arc — scratch paper, drafts, practice against an answer key — lands back on code you already own.

First, the data shape. A reasoning trace is just an instruction pair where the chain rides inside the response:

# ── Chapter 14: a reasoning trace as an SFT example ────────────────────
# Chapter 10's template, unchanged. The only new idea is WHAT goes
# in the response: the teacher's worked steps, then the answer.

TEMPLATE = (
    "### Instruction:\n"
    "{problem}\n\n"
    "### Response:\n"
)

def format_trace(problem, chain, answer):
    """
    One distillation example.
    Input:  problem (str), chain (str, the teacher's worked steps), answer (str)
    Output: (prompt, response) pair of strings.
    The chain rides INSIDE the response -- that is the entire trick.
    """
    prompt   = TEMPLATE.format(problem=problem)
    response = chain + "\n\nAnswer: " + answer
    return prompt, response


# a toy trace, in the shape the teacher writes hundreds of thousands of:
problem = "A shelf holds 3 boxes of 12 books and 5 loose books. How many books?"
chain   = ("3 boxes of 12 books is 3 * 12 = 36 books.\n"
           "Adding the 5 loose books: 36 + 5 = 41.")
answer  = "41"

prompt, response = format_trace(problem, chain, answer)

# ── sanity check ─────────────────────────────────────────────────────────────
print(prompt)      # ends with "### Response:\n" -- the model writes from here
print(response)    # chain first, answer last: the order teaches the habit

Notice the order inside the response: chain first, answer last. That is not cosmetic. The student is a next-token predictor; whatever comes earlier in the text conditions whatever comes later. Putting the worked steps before the answer means the model learns to produce the answer conditioned on its own reasoning — the same buy-compute-by-generating-tokens mechanics Chapter 11 explained. Flip the order and you would teach it to blurt the answer and then decorate it with a justification. (Chapter 15 has more to say about models that do exactly that.)
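The chain-first order has a practical side effect: because the answer sits last, behind a fixed marker, it can be pulled out of generated text with a single split from the right. The helper below is a hypothetical convenience for scoring a student's output, not part of Chapter 10's code; it assumes the student reproduces the same "Answer: " separator that format_trace writes.

```python
ANSWER_MARK = "\n\nAnswer: "    # the separator format_trace puts before the answer

def split_response(response):
    """Split generated text into (chain, answer).

    Splits on the LAST marker, so an "Answer: " appearing earlier in the
    chain is left alone. Returns (response, None) if no marker is present.
    """
    chain, mark, tail = response.rpartition(ANSWER_MARK)
    if not mark:
        return response, None
    return chain, tail.strip()

generated = "3 * 12 = 36 books.\nAdding the 5 loose books: 36 + 5 = 41.\n\nAnswer: 41"
chain, answer = split_response(generated)
print(repr(chain))
print(repr(answer))
print(split_response("I think it is 41"))   # no marker: the answer is None
```

A missing marker is itself a useful signal: it means the student has not yet picked up the format, which is the first thing fine-tuning on traces should teach.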

Second, the masking — Chapter 10’s one non-obvious line, brought forward. We grade the model only on producing the response, never on echoing the problem:

import torch
import torch.nn.functional as F

IGNORE = -100          # F.cross_entropy skips positions labelled -100

def make_example(prompt, response, encode_bpe, eos_id=0):
    """
    Turn one (prompt, response) pair into training tensors.
    Input:  two strings + the Chapter 1 tokenizer. eos_id is the id of the
            end-of-text token: the default 0 suits a tokenizer that reserves
            id 0 for it; with GPT-2's vocabulary pass 50256 (<|endoftext|>).
    Output: input_ids (T,) int64, labels (T,) int64 -- same length.
    Loss is masked to the response: the student is graded only on
    writing the chain and the answer, never on echoing the problem.
    """
    prompt_ids   = encode_bpe(prompt)                 # list of ints
    response_ids = encode_bpe(response) + [eos_id]    # teach it to stop

    input_ids = torch.tensor(prompt_ids + response_ids)                  # (T,)
    labels    = torch.tensor([IGNORE] * len(prompt_ids) + response_ids)  # (T,)
    return input_ids, labels


def sft_loss(model, input_ids, labels):
    """
    input_ids, labels: (B, T) int64. Returns scalar loss over response
    tokens only. Same next-token objective as Chapter 7 -- logits at
    position t are graded against the token at position t+1.
    """
    logits = model(input_ids)                            # (B, T, vocab_size)
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),  # predictions
        labels[:, 1:].reshape(-1),                       # shifted targets
        ignore_index=IGNORE,
    )

# ── sanity check ────────────────────────────────────────────────────────────
fake_encode = lambda s: [ord(c) % 512 for c in s]   # stand-in tokenizer
ids, labs = make_example(prompt, response, fake_encode)
print(ids.shape, labs.shape)          # equal lengths
print((labs == IGNORE).sum().item())  # = number of prompt tokens (masked)

Third, the loop. There is nothing to show that Chapter 7 did not already show — which is the point. Small learning rate, because this is Chapter 10’s bend, not a rebuild:

# ── the fine-tune loop: Chapter 7's loop, gentler steps ──────────────────
# Assumes model, num_steps, next_batch, and trace_dataset are already defined.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for step in range(num_steps):
    xb, yb = next_batch(trace_dataset)   # (B, T) input_ids, (B, T) labels
    loss = sft_loss(model, xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
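The loop takes its (B, T) batches from next_batch without showing how traces of different lengths end up in one tensor. One common approach, offered here as an assumption rather than as this book's own batching code, is to right-pad every example to the longest in the batch and give the padded label positions the IGNORE value, so that padding is never graded. The names `collate` and `pad_id` are hypothetical.

```python
import torch

IGNORE = -100    # same sentinel the loss function skips

def collate(examples, pad_id=0):
    """Pad a list of (input_ids, labels) 1-D tensors to a common length.

    input_ids are padded with pad_id; labels are padded with IGNORE,
    so the padding never contributes to the loss.
    """
    T = max(ids.size(0) for ids, _ in examples)
    xb = torch.full((len(examples), T), pad_id, dtype=torch.long)
    yb = torch.full((len(examples), T), IGNORE, dtype=torch.long)
    for i, (ids, labels) in enumerate(examples):
        n = ids.size(0)
        xb[i, :n] = ids          # real tokens on the left, padding on the right
        yb[i, :n] = labels
    return xb, yb

# Two toy examples of different lengths (token ids invented).
ex1 = (torch.tensor([5, 6, 7, 8, 9]), torch.tensor([IGNORE, IGNORE, 7, 8, 9]))
ex2 = (torch.tensor([5, 6, 4]),       torch.tensor([IGNORE, IGNORE, 4]))

xb, yb = collate([ex1, ex2])
print(xb.shape, yb.shape)
print((yb != IGNORE).sum().item())   # label positions that will be graded
```

With a causal model, right-padding is safe for the real tokens: each position attends only to what came before it, so padding appended at the end cannot change the predictions that are actually graded.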

Scale honesty, as always: run this on our 11M-parameter model from Chapter 6 and you will teach it the format of reasoning — steps, then answer — but an 11M model has very little reasoning to receive; Chapter 11 already warned that the chain amplifies capability rather than creating it. Run it on the GPT-2 weights you loaded in Chapter 9 and the miniature becomes genuinely meaningful: a real pretrained model, on your machine, learning from worked examples through code you wrote. That is the honest laptop-sized end of the exact pipeline that produced the 1.5B–70B distilled models.

14.5 What spread in 2025

Now zoom back out, because the consequences of this chapter’s trick were not academic: they landed in public, all at once, and they are still landing. The DeepSeek-R1 release put the whole ladder in the open. The code repository and the model weights are under the MIT license (commercial use allowed, modification and derivative works allowed, no permission to ask for), and the distilled students are built on open Qwen2.5 and Llama3 bases, whose own licenses (Apache 2.0 for the Qwen2.5 bases, Meta’s Llama licenses for the Llama ones) still apply to those checkpoints. Not just the giant teacher: the 1.5B, 7B, 8B, 14B, 32B, and 70B apprentices too, sized for everything from a rack to a gaming laptop.

Think about what that means mechanically, given what you now know. Because the chain of thought is text, and because text is all a student needs, reasoning capability propagates at the cost of fine-tuning rather than at the cost of discovery. One lab pays the Chapter 13 bill once. After that, the traces travel — and every open base model of every size is a candidate student. That is why the months after January 2025 saw open-weight reasoning models at essentially every size: not because everyone suddenly acquired data centers, but because nobody had to. The torch passes for the price of an SFT run.

The house will not pretend this settles anything about markets, safety, or who wins — those are different essays, differently sourced. The mechanical fact is what belongs in this book, and it is remarkable enough stated plainly: the most advanced capability in the field, grown by the most expensive training process in the field, moves between models as ordinary text.

Which leaves exactly one question standing — the one this book was named for, and the one it has been circling for fourteen chapters. The student writes out its reasoning now. The box narrates its work, step by step, in plain language you can read. So: is the box still opaque? You have earned the real answer, and it is the last chapter.

14.6 The thing to actually understand

Distillation moves capability from teacher to student. The classic form (Hinton, Vinyals & Dean 2015) trains a small model to match a big model’s soft outputs, whose near-miss probabilities carry structure that hard labels discard — the so-called dark knowledge.
A reasoning model’s work product is text, so it can be a dataset. The chain of thought is a transmissible artifact: the R1 recipe is teacher writes traces → ~800k curated (problem, chain, answer) samples → students fine-tuned on them.
The students need no RL. The distilled 1.5B–70B models were trained via SFT only. Discovery is expensive and happens once; transmission is cheap and happens everywhere.
At small scale, distillation beat direct RL. The R1 paper’s own comparison: a 32B model fine-tuned on the teacher’s traces significantly outperformed the same base trained directly with large-scale RL, across all their benchmarks.
It is Chapter 10’s machinery, verbatim. Template, response-masked loss, gentle learning rate, Chapter 7’s loop. Chain before answer in the response — the order is the lesson.
Someone must still be the master. Distillation amortizes the discovery cost; it never erases it. No teacher, no traces — the torch has to be lit before it can be passed.

14.7 Exercises

Format three traces of your own. Pick three multi-step problems you can solve by hand (arithmetic word problems work well). Write the worked steps and answer, run each through format_trace, and inspect the strings. Confirm the chain sits inside the response, before the answer.
Design the ablation: why the chain and not just the answer? Sketch the experiment that would prove the chain matters: dataset A = (problem, chain, answer), dataset B = (problem, answer only), same student, same steps. Predict what each student does on a new multi-step problem, and say precisely why in next-token-prediction terms.
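Verify the mask. Using the fake_encode stand-in, check that the number of IGNORE labels equals the number of prompt tokens. Then measure what the mask does to the loss. Note that -100 is already F.cross_entropy’s default ignore_index, so simply dropping the argument changes nothing. Instead, build an unmasked copy of the labels (labels equal to input_ids) and compare the two losses. What failure mode does grading the model on the prompt invite?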
Find the failure mode of the apprentice. Distillation copies the teacher — including its mistakes. Work out what happens to a student trained on traces where the teacher’s chains reach right answers by wrong steps. Which of Chapter 13’s tools (the answer key) could filter the traces, and what kinds of error would still slip through?
Read the source. Read the distillation section and Section 4.1 of the R1 paper, then the opening sections of Hinton, Vinyals & Dean (2015). Ten years — and what feels like several eras of deep learning — separate their vocabularies; the idea is the same. Say it in one sentence.

What’s next

Ch 15 — What Stays Opaque


A 37th-Chamber original. Methods cited: Hinton, Vinyals & Dean (2015), “Distilling the Knowledge in a Neural Network,” arXiv:1503.02531 (soft-target distillation — confirmed); DeepSeek-AI (2025), “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv:2501.12948 (six distilled dense models 1.5B/7B/8B/14B/32B/70B on Qwen and Llama bases, ~800k curated samples ≈ 600k reasoning + 200k non-reasoning, SFT-only distillation, and Section 4.1’s distillation-beats-direct-RL comparison — all confirmed); MIT licensing of the R1 code and weights per the official repository, github.com/deepseek-ai/DeepSeek-R1 (confirmed); Zelikman et al. (2022), “STaR: Bootstrapping Reasoning With Reasoning,” arXiv:2203.14465 (confirmed); instruction template and SFT machinery follow Chapter 10’s pattern after Stanford Alpaca, github.com/tatsu-lab/stanford_alpaca (confirmed), with instruction tuning per Wei et al. (2021), “Finetuned Language Models Are Zero-Shot Learners,” arXiv:2109.01652 (confirmed). All prose and code written fresh.