The Opaque Box · Chapter 10

Bending It to a Task

Chapter 9 ended with a working giant on the bench: 124 million borrowed numbers speaking coherent English through code you wrote line by line. But listen to what it says. It continues. That is all pretraining ever taught it to do — and a machine that only continues is not a machine that helps; it is a very expensive parrot. This chapter takes the trained metal and bends it: same weights, same architecture, same loop from Chapter 7 — and at the far end of the bend, an assistant. It is the last chapter of Part I, and it closes on a confession the labs rarely make out loud.

10.1 What pretraining actually bought

Be precise about what you own after Chapter 9, because the marketing never is. Pretraining bought a next-word predictor over everything — a machine that, given any stretch of text, produces a distribution over what plausibly comes next in text like the text it studied. That is a genuinely astonishing object. It is also, on its own, useless as a helper.

The classic demonstration — and we describe the behavior class here rather than fabricate a specific run — goes like this: hand a base model a question, and it may well answer with more questions. Why? Because in the wild text it studied, one question is very often followed by others — a list of FAQ entries, a quiz sheet, a forum thread of pile-ons. The model is doing its job perfectly. Its job was never “answer”; its job was “continue.” Ask it for a recipe and it may give you a plausible food-blog preamble instead, because that is what precedes recipes in its world. The base model is a mirror of its corpus, and a mirror does not take requests.

Here is the load-bearing picture for this chapter: pretraining forged the metal. All the capability — the grammar, the facts-shaped patterns, the ability to hold a topic across a paragraph — is in the bar of steel now. What the bar lacks is a shape. Fine-tuning is the bend: you do not smelt new metal, you do not change the alloy; you take the same weights and press them — gently, with more training — toward a new form. Same metal, new shape. Every section of this chapter is a bend at a different angle.

The bend, made literal: pretraining forges a straight bar of weights that only continues; fine-tuning presses that same bar — gently, at a tiny learning rate — into the shape of an assistant. Same metal, new form.

10.2 Fine-tuning is not a new mechanism

The first honest thing to say about fine-tuning is how little of it is new — which is exactly the part the mystique leaves out. You already own every part:

The loss is Chapter 7’s cross-entropy, unchanged.
The loop is Chapter 7’s loop, unchanged: batch, forward, loss, backward, step.
The optimizer is the same AdamW.

Only two things change. First, the data: instead of a mountain of everything, a small, deliberate set of examples of the behavior you want. Second, the learning rate: much smaller — for instruction tuning, typically in the neighborhood of 1e-5 to 1e-4, rather than the 3e-4 we pretrained with. The intuition follows the metaphor exactly: the metal is already forged, and you are bending it, not re-smelting it. Press too hard — too high a learning rate, too long a run on too narrow a dataset — and you deform what pretraining built; the model degrades at everything general in its rush to please the new data. Press gently and the general shape survives while the surface takes your form.
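A minimal sketch of how little changes, under stated assumptions: TinyLM is a hypothetical stand-in for the pretrained GPT, the batch is random tokens, and 2e-5 is just one value picked from the range above. It is not the book's training script. The step function is the familiar loop; only the data and the lr differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyLM(nn.Module):
    """Hypothetical stand-in for the pretrained GPT: embeddings plus a vocab head."""
    def __init__(self, vocab_size=50, d_model=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):                      # idx: (B, T)
        return self.lm_head(self.emb(idx))       # (B, T, vocab_size)

def finetune_step(model, optimizer, x, y):
    """One pass of the familiar loop: forward, loss, backward, step."""
    logits = model(x)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyLM()
# Same optimizer as pretraining; only the learning rate is gentler (assumed value).
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

tokens = torch.randint(0, 50, (4, 9))            # a small, deliberate batch
x, y = tokens[:, :-1], tokens[:, 1:]             # next-token inputs and targets

before = [p.detach().clone() for p in model.parameters()]
loss = finetune_step(model, optimizer, x, y)
max_shift = max((p.detach() - b).abs().max().item()
                for p, b in zip(model.parameters(), before))
print(f"loss {loss:.3f}, largest weight change {max_shift:.2e}")
```

On this first AdamW step no weight moves by much more than the learning rate itself, which is the gentle press in numbers: the change is tiny next to the weights it is changing.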

The first bend: a classification head

The quickest way to see fine-tuning clearly is the least glamorous one. Suppose you do not want generation at all — you want to read a snippet of text and assign it one of a few labels. The transformer body you built is already a powerful text-reading machine: after six blocks, the vector at the last position has attended over everything before it. So unbolt the vocabulary head and bolt on a smaller one:

import torch
import torch.nn as nn

class GPTClassifier(nn.Module):
    """
    The first bend: same body, new head.
    Wraps the Chapter 6 GPT; swaps the vocab head for a class head.
    Input:  (B, T) integer token ids
    Output: (B, num_classes) logits over the labels
    """
    def __init__(self, gpt, d_model, num_classes):
        super().__init__()
        self.gpt = gpt                          # same object, not a copy
        self.gpt.lm_head = nn.Identity()        # unbolt the vocab head (in place:
                                                # the caller's gpt loses its lm_head)
        self.class_head  = nn.Linear(d_model, num_classes)

    def forward(self, idx):                     # idx: (B, T)
        h = self.gpt(idx)                       # (B, T, d_model): body unchanged
        h_last = h[:, -1, :]                    # (B, d_model)  last position saw it all;
                                                # assumes rows are not right-padded
        return self.class_head(h_last)          # (B, num_classes)


# ── sanity check ─────────────────────────────────────────────────────────────
# with lm_head replaced by Identity, the GPT's forward returns the
# post-ln_f hidden states, (B, T, 384); the class head maps the last
# position to, say, 3 labels:
#   idx: (4, 8) -> h: (4, 8, 384) -> h_last: (4, 384) -> logits: (4, 3)

Train it with cross-entropy over the labels — Chapter 7’s loop again, tiny learning rate — and the eleven million pretrained parameters do almost all of the work; the new head just learns to read them. This is one code sketch, not the main event, but it makes the principle unmissable: the body is general; the head is the task. (One honest note on the sketch: it assumes the Chapter 6 GPT pipes its final hidden states through lm_head as its last step, so replacing that head with nn.Identity() exposes the (B, T, d_model) stream. That is how we built it.)
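To make the sanity check runnable without the Chapter 6 code in hand, here is a sketch with a hypothetical StubGPT in place of the real body. The stub has no attention blocks at all; it only mimics the one property the wrapper relies on, namely that lm_head is the last step of forward. The vocabulary size, labels, and learning rate are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class StubGPT(nn.Module):
    """Hypothetical stand-in for the Chapter 6 GPT: body, then lm_head as the last step."""
    def __init__(self, vocab_size=100, d_model=384):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):                       # idx: (B, T)
        return self.lm_head(self.ln_f(self.emb(idx)))

class GPTClassifier(nn.Module):
    """Same wrapper as in the text, repeated so this block runs on its own."""
    def __init__(self, gpt, d_model, num_classes):
        super().__init__()
        self.gpt = gpt
        self.gpt.lm_head = nn.Identity()
        self.class_head = nn.Linear(d_model, num_classes)

    def forward(self, idx):
        h = self.gpt(idx)                         # (B, T, d_model)
        return self.class_head(h[:, -1, :])       # (B, num_classes)

clf = GPTClassifier(StubGPT(), d_model=384, num_classes=3)
idx = torch.randint(0, 100, (4, 8))               # four snippets, eight tokens each
labels = torch.tensor([0, 2, 1, 0])               # one made-up label per snippet

logits = clf(idx)
print(logits.shape)                               # torch.Size([4, 3])

# One training step: cross-entropy over the labels, small learning rate.
optimizer = torch.optim.AdamW(clf.parameters(), lr=1e-4)
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```

Gradients reach both the new head and the body underneath it, so a single optimizer over clf.parameters() bends the two together.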

The main event, though, is the bend that turned this technology into the thing you talk to.

10.3 Instruction tuning: teaching the mirror to answer

If the base model continues whatever it is fed, then the path to an assistant is almost insultingly direct, with no secret sauce and no wizardry: feed it thousands of examples where an instruction is followed by a good response, and keep training. The model, doing exactly what it has always done (predict the next token), learns that in text shaped like this, what follows an instruction is the answer to it. This is supervised fine-tuning (SFT), and the research lineage is short and recent. Wei et al. (2021, arXiv:2109.01652) named the idea instruction tuning: they took a 137B-parameter model, fine-tuned it on more than sixty NLP tasks rephrased as natural-language instructions, and found that the result (FLAN) beat zero-shot 175B GPT-3 on 20 of the 25 datasets they evaluated. The model had learned not sixty tasks but the shape of being instructed. Two years later, Stanford’s Alpaca project (Taori et al., 2023) showed how far the recipe had commoditized: a 7B LLaMA base model, fine-tuned on 52,000 instruction–response pairs generated by the self-instruct technique using OpenAI’s text-davinci-003, with a total data-generation bill under $500.

Alpaca’s prompt template became a de-facto standard, and it is the one we will use. It is nothing but text — a frame the model learns to recognize:

Below is an instruction that describes a task.
Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}

Two details carry all the craft, and one of them is easy to miss — miss it and you burn compute teaching the model to recite the question back at you.

The mask: grade the answer, not the question

Naively, you would train on the whole formatted string with Chapter 7’s loss — every token predicting the next. But think about what that spends gradient on: the model would be pushed to get better at predicting the instruction itself, token by token. That is wasted (the user supplies the instruction; the model never needs to produce one) and mildly harmful (you are teaching it to parrot prompt boilerplate). The fix is to mask the prompt out of the loss: compute the loss only at positions whose target token belongs to the response. PyTorch’s cross-entropy has a built-in convention for this — any target set to ignore_index contributes nothing.

import torch
import torch.nn.functional as F

IGNORE_INDEX = -100     # F.cross_entropy skips positions with this target

TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_example(instruction, response, encode):
    """
    One SFT pair -> (input_ids, labels).
    Targets are the input shifted left by one (exactly as in Chapter 2);
    every position whose target is a PROMPT token is masked out.
    """
    prompt_ids   = encode(TEMPLATE.format(instruction=instruction))
    response_ids = encode(response)
    input_ids    = prompt_ids + response_ids

    labels = input_ids[1:] + [IGNORE_INDEX]      # shift-by-one targets
    for i in range(len(prompt_ids) - 1):
        labels[i] = IGNORE_INDEX                 # no loss on the prompt

    return input_ids, labels


# the training step is Chapter 7's loop, verbatim, at a gentler learning
# rate (~1e-5 to 1e-4), with one change to the loss call. Here `targets`
# is the batched `labels` returned by build_example:
#
#   loss = F.cross_entropy(
#       logits.view(-1, vocab_size),     # (B*T, vocab_size)
#       targets.view(-1),                # (B*T,)
#       ignore_index=IGNORE_INDEX,       # masked positions contribute nothing
#   )

Read build_example slowly, because the indexing is where the understanding lives. labels is input_ids shifted left by one — position i’s target is token i+1, the same next-token contract as Chapter 2’s dataset. The mask then blanks every position whose target is still inside the prompt. The last prompt position survives unmasked — deliberately: its target is the first response token, and predicting the start of the answer from the end of the question is precisely the skill we are buying.
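A small trace makes the indexing visible. This sketch repeats the masking logic with pre-encoded ids so that no tokenizer is needed; the token ids are made up, and the function signature is simplified for the trace (an assumption, not the chapter's version).

```python
IGNORE_INDEX = -100

def build_example(prompt_ids, response_ids):
    """Same masking logic as in the text, taking pre-encoded ids for brevity."""
    input_ids = prompt_ids + response_ids
    labels = input_ids[1:] + [IGNORE_INDEX]       # shift-by-one targets
    for i in range(len(prompt_ids) - 1):
        labels[i] = IGNORE_INDEX                  # target is still a prompt token
    return input_ids, labels

prompt_ids = [10, 11, 12, 13]     # four made-up prompt tokens
response_ids = [20, 21, 22]       # three made-up response tokens

input_ids, labels = build_example(prompt_ids, response_ids)
for pos, (tok, tgt) in enumerate(zip(input_ids, labels)):
    print(pos, tok, tgt)
# 0 10 -100
# 1 11 -100
# 2 12 -100
# 3 13 20     <- the seam: last prompt token, graded on the first response token
# 4 20 21
# 5 21 22
# 6 22 -100   <- nothing follows the last token, so nothing to grade
```

The final row is worth a second look: as written, the model is never graded on stopping. If your tokenizer has an end-of-text token, appending it to the response before building the example would give the last response token a graded target. That is a common practice rather than something this chapter's code does.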

The template is just text with a shading rule underneath it: everything through ### Response: is the prompt and is masked out of the loss (ignore_index); only the answer is graded. The one exception is the seam — the last prompt position, whose target is the first answer token — which stays graded on purpose.

That is instruction SFT in full — the whole of the first great bend, laid bare. No new mathematics, no new architecture, nothing behind a curtain — a template, a mask, and the loop you already wrote. On our 11M-parameter model the effect would be modest (there is only so much competence in the metal to bend); on the GPT-2 weights we loaded in Chapter 9 it is genuinely runnable on your machine; and at frontier scale this exact recipe — scaled up in data and care — is the first of the two great bends that produce every assistant you have used.
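One practical gap remains between a single (input_ids, labels) pair and a batch: examples differ in length. A hedged sketch of one way to close it follows. The collate helper and PAD_ID are hypothetical names, and random logits stand in for the model's output. Padded label slots get ignore_index, so they are skipped exactly like prompt positions.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100
PAD_ID = 0   # assumption: any valid token id works, because padded labels are ignored

def collate(examples):
    """Right-pad a list of (input_ids, labels) pairs to a common length."""
    T = max(len(ids) for ids, _ in examples)
    x = torch.full((len(examples), T), PAD_ID, dtype=torch.long)
    y = torch.full((len(examples), T), IGNORE_INDEX, dtype=torch.long)
    for row, (ids, labels) in enumerate(examples):
        x[row, :len(ids)] = torch.tensor(ids)
        y[row, :len(labels)] = torch.tensor(labels)
    return x, y

examples = [
    ([5, 6, 7, 8, 9], [-100, -100, 8, 9, -100]),   # 2 graded positions
    ([5, 6, 3],       [-100, 3, -100]),            # 1 graded position
]
x, y = collate(examples)

torch.manual_seed(0)
vocab_size = 12
logits = torch.randn(2, 5, vocab_size, requires_grad=True)   # stand-in for model(x)

loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1),
                       ignore_index=IGNORE_INDEX)
loss.backward()

graded = y != IGNORE_INDEX
print(graded.sum().item())                       # 3
print(logits.grad[~graded].abs().sum().item())   # 0.0
```

The second number is the point: masked and padded positions receive exactly zero gradient, and the loss is the mean over the three graded positions only. With a causal model, right-padding also cannot leak into the graded positions, since each position attends only to what came before it.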

10.4 The second bend: learning from preference

SFT has a ceiling, and no amount of engineering muscle lifts it — it is baked into the idea. Demonstrations teach the model what a good answer looks like — but for most prompts there is no single good answer, and writing gold-standard demonstrations is slow, expensive expert work. Worse, some of what we want from an assistant is not a property of any one answer but a ranking over possibilities: this reply is more honest than that one, less evasive, clearer, safer. Humans are noticeably better at comparing two answers than at authoring the perfect one. The second bend is built on exactly that asymmetry.

The idea predates language models. Christiano et al. (2017, arXiv:1706.03741) trained reinforcement-learning agents — Atari players, simulated robots — without any hand-written reward function, using only human answers to “which of these two clips is better?” From those pairwise preferences they fit a reward model: a network that predicts what the human would prefer. The agent then optimizes against the learned reward. Strikingly, the humans needed to label under 1% of the agent’s interactions for this to work.

Where SFT needs a written gold answer, preference tuning needs only a choice: a human marks answer A better than B, and that pairwise verdict trains a reward model to emit a single score for any answer. Reinforcement learning then optimizes the assistant against that learned score.

Stiennon et al. (2020, arXiv:2009.01325) carried the recipe to language: collect human comparisons between candidate summaries of Reddit posts, train a reward model on the comparisons, then use reinforcement learning to tune the language model against that reward. The tuned models produced summaries that human judges preferred to the human-written reference summaries — and to the outputs of much larger models trained on supervised data alone.
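The step from comparisons to a reward model can be sketched in a few lines. The loss below, minus the log-sigmoid of the score difference between the preferred and the rejected answer, is the pairwise form written out in Stiennon et al. and later in Ouyang et al. Everything else here is an assumption for illustration: TinyRewardModel is a hypothetical linear scorer over made-up feature vectors, and the "human" is a synthetic rule, where a real reward model would be a language model reading the prompt and the answer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyRewardModel(nn.Module):
    """Hypothetical reward model: one feature vector per answer in, one score out."""
    def __init__(self, d_in=8):
        super().__init__()
        self.score = nn.Linear(d_in, 1)

    def forward(self, feats):                  # feats: (B, d_in)
        return self.score(feats).squeeze(-1)   # (B,)

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Synthetic comparisons: the rater prefers whichever answer has the larger first feature.
a, b = torch.randn(256, 8), torch.randn(256, 8)
a_wins = (a[:, 0] > b[:, 0]).unsqueeze(-1)
chosen = torch.where(a_wins, a, b)
rejected = torch.where(a_wins, b, a)

rm = TinyRewardModel()
opt = torch.optim.AdamW(rm.parameters(), lr=1e-2)
first_loss = None
for step in range(200):
    loss = preference_loss(rm(chosen), rm(rejected))
    if first_loss is None:
        first_loss = loss.item()
    opt.zero_grad()
    loss.backward()
    opt.step()

# How often does the learned score rank the pair the way the rater did?
accuracy = (rm(chosen) > rm(rejected)).float().mean().item()
print(f"loss {first_loss:.3f} -> {loss.item():.3f}, agreement {accuracy:.2f}")
```

Note what the model never sees: a score. It is given only which answer of each pair won, and the single number it emits is whatever makes those verdicts most likely.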

Then Ouyang et al. (2022, arXiv:2203.02155) — the InstructGPT paper — assembled the full modern pipeline on GPT-3, and it is the pipeline in this chapter’s diagram: (1) SFT on labeler-written demonstrations (Section 10.3’s bend); (2) a reward model trained on human rankings of model outputs; (3) reinforcement learning against that reward model. This whole family is RLHF — reinforcement learning from human feedback. The result that stopped the room cold: human evaluators preferred the outputs of the 1.3B-parameter InstructGPT over those of the 175B-parameter GPT-3 — a model over a hundred times larger. Alignment with what people actually want, it turns out, is not the same axis as raw scale — and for a while it was far cheaper to buy. The whole industry noticed.
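One detail of stage (3) echoes Section 10.2's warning about pressing too hard. In both Stiennon et al. and Ouyang et al., the quantity being optimized is not the raw reward-model score but the score minus a penalty for drifting away from the SFT model, measured as the log-ratio of the two models' probabilities for the sampled answer. The sketch below computes only that shaped reward. The function name, the tensor layout, and beta=0.1 are assumptions for illustration, and the RL algorithm that consumes the result is left out entirely.

```python
import torch

def shaped_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """
    rm_score:    (B,)   reward-model score for each sampled answer
    logp_policy: (B, T) per-token log-probs under the model being tuned
    logp_ref:    (B, T) per-token log-probs under the frozen SFT model
    Returns (B,): the score minus beta times the summed log-ratio.
    """
    log_ratio = (logp_policy - logp_ref).sum(dim=-1)
    return rm_score - beta * log_ratio

rm_score = torch.tensor([1.0, 1.0])                 # both answers please the reward model equally
logp_ref = torch.tensor([[-1.0, -1.0, -1.0],
                         [-1.0, -1.0, -1.0]])
logp_policy = torch.tensor([[-1.0, -1.0, -1.0],     # policy still agrees with the SFT model
                            [-0.2, -0.2, -0.2]])    # policy has drifted toward this answer
rewards = shaped_reward(rm_score, logp_policy, logp_ref)
print(rewards)
```

The first answer keeps its full reward of 1.0; the second is docked to 0.76, because the tuned model now assigns it far more probability than the SFT model did. The penalty is a leash: it makes large departures from the already-bent metal expensive.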

The pipeline that makes an assistant: pretraining forges the metal, supervised fine-tuning bends it toward instructions, and preference tuning — human comparisons feeding a learned reward — sets the final shape. One set of weights, bent twice.

Constitutional AI: the same bend, with the principles written down

Anthropic’s variant replaces part of the human labor with something more legible. In Constitutional AI (Bai et al., 2022, arXiv:2212.08073), the training runs in two phases: first, a supervised phase in which the model critiques and revises its own responses against an explicit list of written principles — the constitution — and is fine-tuned on the revisions; second, a reinforcement-learning phase in which the preference comparisons are made by an AI judge guided by those same principles (RL from AI feedback) rather than by human raters. The paper’s stated aim is an assistant that is harmless but non-evasive — one that engages with a hard question and explains its objection rather than stonewalling. For this book’s purposes the interesting move is philosophical: the values being pressed into the metal are written in a document a person can read and argue with, instead of living implicitly in a million unrecorded rating decisions.
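The control flow of the supervised phase is simple enough to sketch. Everything named here is hypothetical: generate stands for any callable that takes text and returns a model continuation, the prompt wording is invented for illustration and is not the paper's, and the single principle is a made-up example. The scripted fake_generate exists only so the flow can be run without a model.

```python
import random

def critique_and_revise(prompt, generate, principles, rng=random):
    """
    Build one supervised example for the first phase.
    `generate` is a hypothetical callable: text in, model continuation out.
    Returns (prompt, revised_response), the pair to fine-tune on.
    """
    draft = generate(prompt)
    principle = rng.choice(principles)            # one written principle per pass
    critique = generate(
        f"Response: {draft}\nCritique this response against the principle: {principle}"
    )
    revision = generate(
        f"Response: {draft}\nCritique: {critique}\nRewrite the response to address the critique."
    )
    return prompt, revision

def fake_generate(text):
    """Scripted stand-in for a model, keyed on the invented prompt wording above."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique this response" in text:
        return "the draft is evasive"
    return "first draft"

principles = ["Engage with the question and explain any objection."]   # made-up example
pair = critique_and_revise("How do locks work?", fake_generate, principles)
print(pair)   # ('How do locks work?', 'revised answer')
```

The pair that comes out is ordinary SFT data: feed it through the template and the mask from Section 10.3 and the first phase is the first bend again, with the model writing its own demonstrations.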

An honest scope note before we close the section. What we have given you here is the concept level, and the concept level is real: reward model from comparisons, RL against it, principles guiding the feedback. The production details — the RL algorithms, the data mixtures, the guardrails, the many rounds — vary by lab, evolve fast, and are in several cases simply not disclosed; where the paper trail ends, we say so rather than guess. And none of this second bend fits on a laptop: it is data-center work, both in raw compute and in the armies of human comparisons behind the reward. Where Part II builds training methods in miniature, it will say exactly what fits on your bench and what does not. That discipline starts now.

Notice what the reward model really is: a machine trained to predict human taste, grading a machine trained to predict human text. Nothing in the loop “knows” what helpful means — the loop presses the outputs toward what raters preferred. That this works as well as it does is the empirical surprise the whole assistant era is built on.

10.5 End of Part I: the honest inventory

Stop and take stock, because you have earned every line of it — and nobody handed it to you. Over ten chapters you have personally: built the machine (tokenizer, embeddings, attention, the block, the full GPT); trained it (cross-entropy, backprop, AdamW, the loop); made it speak (temperature, top-k, nucleus — the sampler is yours too); verified it against a giant (GPT-2’s real weights, running in your class); and now bent it to a task (a head swap, a template, a mask, and the two bends that make an assistant). Nothing in that list was taken on faith. That was the deal in Chapter 0, and Part I has kept it.

Now the confession the chapter’s opening promised, the one the demos never dwell on. Look at what every one of those machines (ours at 11M, GPT-2 at 124M, the aligned assistants at frontier scale) does when you ask it something hard. It answers on reflex, without breaking stride. One forward pass per token. The same fixed stack of blocks for every token, whether that token is the “the” in a pleasantry or the final digit of a calculation everything depends on. There is no pause. There is no going back. There is no scratch paper: no place where the machine can set down an intermediate result, look at it, and reconsider. The architecture you now understand completely is also the architecture of its own limit: a fixed amount of thinking per token, spent whether the token needs it or not.
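The claim can be counted. In this sketch, CountingLM is a hypothetical stand-in with three plain linear layers where the real model has attention blocks; it exists only to tally work. The generation loop is ordinary greedy decoding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class CountingLM(nn.Module):
    """Hypothetical stand-in model that counts how often its layers run."""
    def __init__(self, vocab_size=20, d_model=8, n_layers=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.forward_calls = 0
        self.block_calls = 0

    def forward(self, idx):                       # idx: (B, T)
        self.forward_calls += 1
        h = self.emb(idx)
        for block in self.blocks:                 # the same fixed stack, every time
            h = torch.relu(block(h))
            self.block_calls += 1
        return self.lm_head(h)                    # (B, T, vocab_size)

@torch.no_grad()
def greedy_generate(model, idx, n_new):
    for _ in range(n_new):
        logits = model(idx)                                     # one pass ...
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)                  # ... one token, never revisited
    return idx

model = CountingLM()
out = greedy_generate(model, torch.tensor([[1, 2, 3]]), n_new=5)
print(model.forward_calls, model.block_calls)     # 5 15
```

Five new tokens, five forward passes, fifteen block applications: three per token, with no branch anywhere in the loop that could spend a fourth on a token that deserved it.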

For an astonishing range of things, reflex is enough — that is Part I’s genuine miracle. But hard problems — multi-step arithmetic, plans, proofs, anything where step four depends on getting step three right — need more than reflex. They need a way to work things out. Part II is about the discovery, half accidental and now deliberate, that the machine you just built can be handed exactly that — and about what it costs, what it buys, and what it still hides from you even as it shows its work.

10.6 The thing to actually understand

A base model continues; it does not help. Pretraining bought a next-word predictor over everything — a mirror of its corpus, and a mirror takes no requests. Asked a question, it may fire back more questions, because that is what its world looks like. Nothing is broken; nothing is aligned either.
Fine-tuning is the same mechanism, aimed. Chapter 7’s loss, loop, and optimizer, with new data and a smaller learning rate (~1e-5 to 1e-4). The bend reshapes the metal without re-forging it; press too hard and you deform what pretraining built.
The template is the interface; the mask is the lesson. Instruction SFT is a text frame the model learns to recognize, plus a loss computed only where the target is a response token (ignore_index). Grade the answer, not the question.
Preferences scale where demonstrations stall. Humans compare more reliably than they author. RLHF turns comparisons into a reward model and trains against it (Christiano 2017 → Stiennon 2020 → InstructGPT 2022, where 1.3B aligned beat 175B raw in human preference). Constitutional AI runs the same bend with the principles written down and an AI judge applying them.
The assistant is a shape, not a new substance. Same skeleton, same arithmetic underneath. What changed is the form pressed into it — which is why everything Part I taught you still describes the thing you talk to.
The limit is architectural: reflex only. One forward pass per token, fixed depth, no pause, no scratch paper — the same effort spent on “the” as on the answer everything hinges on. This is the boundary Part II exists to cross.

10.7 Exercises

Write your own five. Compose five instruction–response pairs about something you actually know: your city, your trade, your kitchen. Run each through build_example with the Chapter 1 tokenizer and print input_ids next to labels. Verify by eye that every prompt position except the last is masked, and that the last prompt position’s target is the first response token.
Mask versus no mask. Fine-tune the Chapter 7 model on your five pairs twice — once with the prompt mask, once training on all positions — deliberately overfitting both. Then prompt each with a bare ### Instruction: header and sample. Which one has learned to produce template boilerplate instead of answers?
Bolt on a class head. Invent a toy 3-label task (for instance: does this snippet end mid-sentence, at a sentence boundary, or at a paragraph break?). Wrap your trained model in GPTClassifier and train briefly. Compare against the same classifier wrapped around an untrained GPT — the gap is the value of the pretrained body.
Read the recipe. Read the abstract of the InstructGPT paper (arXiv:2203.02155). Identify the three training stages it describes and map each one to a section of this chapter. Then find the sentence about the 1.3B model and the 175B model, and say precisely what was measured — and what was not.

What’s next · Part II opens

Ch 11 — Thinking Out Loud — Chain-of-Thought


A 37th-Chamber original. Methods cited:
Ouyang et al. (2022), “Training language models to follow instructions with human feedback,” arXiv:2203.02155 (SFT + reward model + RL pipeline; 1.3B-preferred-over-175B result; confirmed).
Christiano et al. (2017), “Deep reinforcement learning from human preferences,” arXiv:1706.03741 (reward from pairwise preferences; feedback on under 1% of interactions; confirmed).
Stiennon et al. (2020), “Learning to summarize from human feedback,” arXiv:2009.01325 (comparisons → reward model → RL; preferred over human reference summaries; confirmed).
Bai et al. (2022), “Constitutional AI: Harmlessness from AI Feedback,” arXiv:2212.08073 (two-phase constitution-guided training; harmless-but-non-evasive aim; confirmed).
Taori et al. (2023), Stanford Alpaca, github.com/tatsu-lab/stanford_alpaca (template; 52K self-instruct pairs via text-davinci-003; under $500; confirmed).
Wei et al. (2021), “Finetuned Language Models Are Zero-Shot Learners,” arXiv:2109.01652 (instruction tuning / FLAN; 137B model, 60+ tasks; confirmed).
Base-model behavior in §10.1 is described as a class, not reported as a specific run. All prose and code written fresh.