The Opaque Box · Chapter 8

Making It Speak

Chapter 7 ended with a trained model and a falling loss curve. And yet nothing speaks. The GPT you built is a machine that, shown a sequence, computes a probability distribution over what comes next — and then stops. There is no voice inside it, no loop, no mechanism for continuing. Generation is not part of the network at all: it is a dozen lines of ordinary Python wrapped around the network, and every one of those lines is a decision. This chapter writes them. Along the way you will meet the strange truth of decoding: the model hands you probabilities and nothing else, and every word it “says” is a choice about how to gamble on them. The weights make the distribution. You make the bet. Nobody inside the box is speaking — you are pulling the words out, one lap at a time.

8.1  What training left you with

Be precise about the object on the table, because the whole chapter turns on not fooling yourself about it. After Chapter 7, your trained GPT maps (B, T) token ids to (B, T, vocab_size) logits. Row t of that output is the model’s scores for what token t+1 should be, given tokens 0..t. During training we never looked at those scores directly — we shoved all B×T rows through cross-entropy and collapsed them into a single scalar loss. The distribution the model computes was always there; we just never watched it.

So the honest framing is this: the model is not a text generator. It is a next-token distribution calculator. Given a context, it produces 512 numbers (our vocabulary size), one per token in the vocabulary, expressing how plausible each continuation is. Then the forward pass ends and the tensor sits there. Text generation is an algorithm built on top of that calculator: ask for a distribution, pick one token from it, append the pick to the context, ask again. Every “spoken” sentence is that loop running.

This is also why this chapter is the most revealing one in the book: sampling is the one place the box stops hiding. Everywhere else, the model’s probability distribution is out of sight, squashed into a loss number or buried under gradient plumbing. In generation it becomes visible: you can print the top ten candidates at every step and watch the machine hesitate between “the” and “a”, commit to a name it has seen before, paint itself into a corner. The loop you are about to write is the closest thing this book has to a window cut into the box.
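
To make "print the top candidates" concrete without a trained model, here is a minimal stdlib-only sketch. The five-token vocabulary and the logits are made up for illustration; with the real model you would feed in the last-position row of its logits instead.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (max-subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_candidates(logits, vocab, n=3):
    """Return the n most probable (token, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Hypothetical vocabulary and logits, for illustration only.
vocab = ["the", "a", "cat", "ran", "zebra"]
logits = [2.0, 1.8, 0.3, -0.5, -3.0]

for token, prob in top_candidates(logits, vocab):
    print(f"{token!r}: {prob:.3f}")
```

Calling something like this once per decoding step is one way to watch the hesitation between close candidates such as “the” and “a”.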


8.2  The loop that makes it speak

One generation step has a fixed anatomy, and it does not care which sampling strategy you bolt on top of it — the skeleton is always the same five moves:

  1. Crop the running sequence to the last context_length tokens — the model’s position-embedding table (self.position_embedding, which Chapter 6 absorbed from Chapter 2’s GPTFrontEnd into GPT itself) only has 256 rows, so a longer input would index off the end of it.
  2. Forward the cropped ids through the model: (1, T) → (1, T, 512).
  3. Slice the last position: logits[:, -1, :], shape (1, 512). That row — and only that row — is the prediction for the next token.
  4. Shape the distribution: divide by temperature, optionally mask everything outside the top-k or top-p set.
  5. Softmax the shaped logits into probabilities, sample one token id from them, and append it to the sequence.

Then the loop repeats, with the freshly sampled token now part of the context. This is autoregression: the model’s output becomes the model’s input, and the sentence pulls itself into existence one token at a time. Here is the whole thing, including the two filter helpers we will unpack in §8.5. The GPT class is the one you assembled in Chapter 6, byte-for-byte unchanged — generation adds zero learned parameters. Not one. Everything that makes the model speak lives outside the model.

import torch
import torch.nn.functional as F

# The GPT class from Chapter 6 is used exactly as written there.
# Nothing below has learnable parameters — this is pure algorithm.

def top_k_filter(logits, k):
    """Keep the k highest-scoring tokens; push the rest to -inf.
    Input/output: (B, vocab_size)."""
    k = min(k, logits.size(-1))                       # torch.topk raises if k > vocab_size
    v, _ = torch.topk(logits, k, dim=-1)             # (B, k)  top-k values, sorted desc
    thresh = v[:, [-1]]                               # (B, 1)  the k-th largest logit
    return logits.masked_fill(logits < thresh, float('-inf'))


def top_p_filter(logits, p):
    """Nucleus filter (Holtzman et al. 2020): keep the smallest set of
    tokens whose cumulative probability reaches p; -inf the tail.
    Input/output: (B, vocab_size)."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = F.softmax(sorted_logits, dim=-1)          # (B, vocab_size)
    cum = torch.cumsum(probs, dim=-1)                 # (B, vocab_size)
    # mass strictly BEFORE each token; once that already exceeds p,
    # the nucleus is complete and this token is tail
    mask = (cum - probs) > p                          # first token is always kept
    sorted_logits[mask] = float('-inf')
    out = torch.full_like(logits, float('-inf'))
    out.scatter_(dim=-1, index=sorted_idx, src=sorted_logits)
    return out                                        # (B, vocab_size)


@torch.no_grad()
def generate(model, idx, max_new_tokens, context_length,
             temperature=1.0, top_k=None, top_p=None):
    """Autoregressive decoding loop.
    idx: (B, T) token ids — the prompt.
    Returns: (B, T + max_new_tokens)."""
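    if temperature <= 0:                              # dividing by 0 would give inf/nan logits
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    model.eval()                                      # dropout off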
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_length:]           # (B, ≤context_length)
        logits = model(idx_cond)                      # (B, T', vocab_size)
        logits = logits[:, -1, :]                     # (B, vocab_size) — last position only
        logits = logits / temperature                 # τ < 1 sharpens, τ > 1 flattens
        if top_k is not None:
            logits = top_k_filter(logits, top_k)      # (B, vocab_size)
        if top_p is not None:
            logits = top_p_filter(logits, top_p)      # (B, vocab_size)
        probs = F.softmax(logits, dim=-1)             # (B, vocab_size) — sums to 1
        idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1) — the gamble
        idx = torch.cat((idx, idx_next), dim=1)       # (B, T+1) — append, repeat
    return idx


# ── sanity check (untrained weights — gibberish is the CORRECT result) ──────
model = GPT(vocab_size=512, context_length=256, d_model=384,
            num_heads=6, num_layers=6, dropout=0.1)

prompt = torch.zeros((1, 1), dtype=torch.long)        # (1, 1) — a single start token
out = generate(model, prompt, max_new_tokens=20, context_length=256)
print(out.shape)                                      # torch.Size([1, 21])
print(decode(out[0].tolist()))                        # random junk — the weights are random

Line-by-line walk

Why recompute the whole prefix every step? Our loop re-runs the full forward pass over the entire cropped context for every new token — the same keys and values, recomputed hundreds of times. Production inference engines cache each layer’s K and V tensors between steps (the KV cache) and only compute the new position. We keep the naive loop because it is the readable one; nothing about the mathematics changes, only the bookkeeping.
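
A rough way to see what the cache saves is to count how many token positions go through the network under each scheme. The sketch below is plain arithmetic, not an implementation of a KV cache; both function names are hypothetical, and the cached count assumes the whole sequence stays inside the context window, since a sliding crop would complicate the bookkeeping.

```python
def naive_positions(prompt_len, new_tokens, context_length):
    """Positions forwarded when every step re-runs the full cropped context."""
    total = 0
    for step in range(new_tokens):
        seq_len = prompt_len + step              # tokens present before this step
        total += min(seq_len, context_length)    # the crop caps each forward pass
    return total

def cached_positions(prompt_len, new_tokens):
    """Positions forwarded with an idealized cache: the prompt once,
    then only the single newest token on each later step.
    Assumes prompt_len + new_tokens fits in the context window."""
    return prompt_len + max(new_tokens - 1, 0)

naive = naive_positions(prompt_len=1, new_tokens=200, context_length=256)
cached = cached_positions(prompt_len=1, new_tokens=200)
print(naive, cached, naive / cached)
```

The naive count grows roughly quadratically with the number of generated tokens, while the cached count grows linearly. That is the whole motivation for the extra bookkeeping.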

Figure: One generation step, top to bottom: crop, forward, slice the last position, shape the distribution, sample one token, append. The blue arc is the whole trick: the sampled output loops back to become input, one token per lap.

8.3  Greedy decoding and the repetition trap

The most obvious decoding rule is also the most tempting, and it is a trap: at every step, take the single most probable token. Replace the multinomial line with idx_next = torch.argmax(probs, dim=-1, keepdim=True) and you have greedy decoding. It is deterministic — the same prompt yields the same text every run — and it feels like it should be optimal. If the model knows best, why not always take its best guess?
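
The difference between the two rules can be sketched on a single made-up distribution, using only the standard library. `greedy_pick` and `sampled_pick` are hypothetical helper names, stand-ins for `torch.argmax` and `torch.multinomial` acting on one row of probabilities.

```python
import random

def greedy_pick(probs):
    """Index of the most probable token (the first one wins ties)."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sampled_pick(probs, rng):
    """Draw one index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]      # made-up next-token distribution

greedy_runs = [greedy_pick(probs) for _ in range(5)]
rng = random.Random(0)              # seeded only so reruns are reproducible
sampled_runs = [sampled_pick(probs, rng) for _ in range(1000)]

print(greedy_runs)                  # the same index every time
print({i: sampled_runs.count(i) for i in range(len(probs))})
```

Greedy returns the same index on every call, while the sampled counts should land roughly in proportion to the probabilities.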

Because the likeliest next token, chosen myopically forever, does not produce the likeliest text — and, more damningly, likely text is not the same thing as good text. Holtzman et al. (2020), “The Curious Case of Neural Text Degeneration,” documented this with GPT-2 Large: decoding by pure probability maximization (beam search, greedy’s look-ahead cousin) produces text that falls into loops of repeated phrases, even from a model whose distribution is demonstrably high quality. Worse, they observed the trap is self-tightening: each time a phrase repeats, the model assigns the next repetition even higher probability — repetition is, after all, strong evidence that the text is the kind of text that repeats. Once the loop starts, maximization can’t leave it.

Their second finding cuts deeper, and it is the one to carry out of this chapter. Measure the per-token probability that a model assigns to human-written text and you find it fluctuates constantly — people say surprising things, then obvious things, then surprising things again. Maximization-decoded text sits on an unnaturally flat, unnaturally high probability line. The most probable path through language is a path no human writes. The model’s distribution knows this — the variety is right there in the probabilities. Greedy decoding just refuses to use it. That is the argument for sampling: not decoration, but fidelity to the distribution you spent Chapter 7 training.
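
A toy illustration of the loop, with heavy caveats: the table below is a hand-written bigram model with invented probabilities, not a trained GPT. Because it conditions on only the previous token, a deterministic rule is forced to cycle eventually. It shows greedy closing into a repeat, not the self-reinforcing probabilities Holtzman et al. measured.

```python
import random

# Hypothetical bigram table: made-up probabilities of the next token given
# only the previous one. A real GPT conditions on the whole context.
TABLE = {
    "I":     {"do": 0.6, "think": 0.4},
    "do":    {"not": 0.7, "know": 0.3},
    "not":   {"know": 0.6, "think": 0.4},
    "know":  {".": 0.5, "I": 0.3, "why": 0.2},
    ".":     {"I": 0.8, "why": 0.2},
    "think": {"I": 0.6, "not": 0.4},
    "why":   {".": 1.0},
}

def greedy_decode(table, start, steps):
    """Always follow the single most probable next token."""
    tokens = [start]
    for _ in range(steps):
        options = table[tokens[-1]]
        tokens.append(max(options, key=options.get))
    return tokens

def sampled_decode(table, start, steps, rng):
    """Draw each next token in proportion to its probability."""
    tokens = [start]
    for _ in range(steps):
        options = table[tokens[-1]]
        choices = list(options)
        weights = [options[c] for c in choices]
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens

greedy = greedy_decode(TABLE, "I", 10)
sampled = sampled_decode(TABLE, "I", 10, random.Random(0))
print(" ".join(greedy))     # I do not know . I do not know . I
print(" ".join(sampled))
```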

Figure: The degeneration trap, drawn. Greedy decoding (left) bends its own path into a closed cycle, a phrase whose every repeat the model scores higher, so it cannot escape. Sampling (right) draws fresh each step, and the path keeps branching instead of biting its own tail. Schematic, not measured.

8.4  Temperature

The first knob on the sampler is almost insultingly simple: a single scalar division, softmax(logits / τ), which nonetheless carries a name borrowed from physics. It does not change the ranking of tokens; the most likely token stays most likely at any τ > 0. It changes the contrast. Softmax exponentiates, so scaling every logit up (dividing by τ < 1) widens the gaps between scores and concentrates probability onto the leaders; as τ → 0, the distribution collapses toward a one-hot spike on the argmax and sampling degenerates into greedy. Scaling the logits down (τ > 1) compresses the gaps and pushes the distribution toward uniform, so the tail tokens, including the genuinely bad ones, get a real chance. At τ = 1 you sample from exactly the distribution the model learned.
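
The contrast-not-ranking claim is easy to check numerically. This is a stdlib-only sketch on four made-up logits; `softmax_t` is a hypothetical helper that mirrors softmax(logits / τ).

```python
import math

def softmax_t(logits, tau):
    """softmax(logits / tau), max-subtracted for numerical stability."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ranking(probs):
    """Token indices ordered from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

logits = [2.0, 1.0, 0.0, -1.0]      # made-up scores, already in rank order

for tau in (0.5, 1.0, 2.0):
    probs = softmax_t(logits, tau)
    print(f"tau={tau}: probs={[round(p, 3) for p in probs]} ranking={ranking(probs)}")
```

The ranking column should be identical on every row; only the spread of the probabilities changes.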

Why is it called temperature? The name is imported from statistical mechanics. The Boltzmann distribution weights a physical state of energy E by e^(−E/kT): at high temperature T the system visits high-energy states freely (disorder), at low temperature it freezes into its lowest-energy states (order). Softmax-with-temperature is the same functional form with logits playing the role of negative energies. The usage entered neural networks at least as far back as Boltzmann machines: Ackley, Hinton & Sejnowski (1985), Cognitive Science 9, where an explicit temperature parameter controlled how stochastically units flipped. When an API exposes a temperature slider, you are turning a knob named by an analogy that neural networks have carried since 1985, and that physics has owned for well over a century.
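
Written side by side, the correspondence looks like this. The notation is introduced here for illustration: z_i for the logit of token i, E_s for the energy of state s, k for Boltzmann's constant.

```latex
% Boltzmann distribution over physical states s
P(s) = \frac{e^{-E_s / kT}}{\sum_{s'} e^{-E_{s'} / kT}}

% Softmax with temperature over tokens i
p_i = \frac{e^{z_i / \tau}}{\sum_{j} e^{z_j / \tau}}

% The two coincide under the substitution
z_i \leftrightarrow -E_i, \qquad \tau \leftrightarrow kT
```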

Temperature is a blunt instrument, though, and its failure mode is instructive. Raise τ to fight repetitive, over-cautious text and you brighten the entire distribution at once — the plausible alternatives get more probability, but so does every deranged token in the tail. Temperature cannot say “be adventurous among the sensible options only” — it turns the whole dial or none of it. For that you need to cut the tail off outright.
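
The bluntness can be put in numbers. In this stdlib-only sketch, three "sensible" tokens sit above twenty low-scoring tail tokens; all the logits are invented, and `tail_mass` is a hypothetical helper that reports how much probability falls outside the head.

```python
import math

def softmax_t(logits, tau):
    """softmax(logits / tau), max-subtracted for numerical stability."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def tail_mass(logits, tau, head_size):
    """Probability that lands outside the head_size highest-scoring tokens."""
    probs = sorted(softmax_t(logits, tau), reverse=True)
    return sum(probs[head_size:])

# Made-up logits: 3 plausible candidates, then 20 implausible tail tokens.
logits = [4.0, 3.5, 3.0] + [-1.0] * 20

for tau in (0.7, 1.0, 1.5, 2.0):
    print(f"tau={tau}: tail mass = {tail_mass(logits, tau, head_size=3):.3f}")
```

Raising τ evens out the three head tokens, which is the intended effect, but the same division also inflates the combined mass of the tail.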

Figure: Temperature is a contrast knob, not a ranking knob. Below one it widens the gaps and concentrates mass on the leader (toward greedy); at one you sample the honest learned distribution; above one it levels the bars and lifts the tail (toward uniform). The order of the tokens never changes. Illustrative shapes, not measured output.

8.5  Cutting the tail: top-k and top-p

Top-k: a fixed guest list

Top-k sampling keeps only the k highest-probability tokens, sets everything else to -inf (so softmax assigns it exactly zero), renormalizes, and samples from what survives. It entered the generation toolkit through neural story generation — Fan, Lewis & Dauphin (2018, ACL) sampled from the k = 10 most likely candidates precisely because beam search kept collapsing into repetition — and it is the strategy behind the famous GPT-2 demo samples: Radford et al. (2019) generated their WebText-conditioned samples with top-k truncation at k = 40. Look back at top_k_filter: torch.topk finds the k-th largest logit, and masked_fill executes everything below it. Simple, effective, and it fixes temperature’s failure mode directly — the deranged tail can no longer be sampled at any temperature, because it is gone.
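
The same filter can be traced by hand on a plain Python list. This is a sketch of the idea rather than a replacement for top_k_filter; `top_k_filter_list` is a hypothetical name and the logits are made up. Like the tensor version, it keeps ties at the threshold, so more than k tokens can survive when scores are equal.

```python
import math

def top_k_filter_list(logits, k):
    """Keep the k highest logits; set the rest to -inf."""
    k = min(k, len(logits))                          # guard against k > vocabulary size
    threshold = sorted(logits, reverse=True)[k - 1]  # the k-th largest logit
    return [x if x >= threshold else float("-inf") for x in logits]

def softmax(logits):
    """Softmax that tolerates -inf entries (they become exactly 0.0)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]                       # made-up scores
filtered = top_k_filter_list(logits, k=2)
probs = softmax(filtered)

print(filtered)
print([round(p, 3) for p in probs])
```

The masked entries come out of softmax as exactly zero, and the surviving two are renormalized to sum to one.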

Why cut the tail at all? Holtzman et al. name the problem the unreliable tail: below the plausible candidates sit thousands of tokens that are each individually near-impossible, but whose probabilities sum to something that gets sampled with regularity. And in an autoregressive loop, one absurd token is never just one absurd token — it is appended, believed, and conditioned on, forever, as if the model had meant it. Every later prediction inherits the error. Truncation buys coherence by refusing that gamble entirely.
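
A back-of-envelope calculation shows why "near-impossible" tokens still matter. The numbers are invented: suppose 1,000 tail tokens each carry probability 0.0001, and treat the steps as independent with the same tail mass, which a real model would not guarantee.

```python
def p_at_least_one_tail(tail_mass, steps):
    """Chance of drawing at least one tail token in `steps` independent draws."""
    return 1.0 - (1.0 - tail_mass) ** steps

tail_tokens = 1000                       # hypothetical count of junk tokens
per_token = 1e-4                         # hypothetical probability of each
tail_mass = tail_tokens * per_token      # about 0.1 in total

for steps in (1, 10, 50):
    print(f"{steps:>3} steps: P(at least one tail token) = "
          f"{p_at_least_one_tail(tail_mass, steps):.3f}")
```

Under these assumptions a 50-token sample almost certainly contains at least one token from the tail, even though no individual tail token is remotely likely.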

Top-p: a guest list that reads the room

But a fixed k has a fatal blind spot, and naming it is the whole contribution of Holtzman et al. (2020): the shape of the next-token distribution changes wildly from step to step. After a context like “The Eiffel Tower is in…” the distribution is peaked — one or two tokens carry nearly all the mass, and k = 40 forces 38 junk candidates onto the guest list. After a context like “I went to…” the distribution is flat — dozens of continuations are genuinely reasonable, and the same k = 40 may amputate perfectly good ones. One number cannot serve both moments.

Their fix is nucleus sampling, or top-p: instead of keeping a fixed count of tokens, keep the smallest set whose cumulative probability reaches a threshold p (say 0.9 — the paper’s own experiments use p = 0.95), and sample from that set renormalized. The nucleus resizes itself every step — one token wide when the model is certain, dozens wide when the model is open-minded. In top_p_filter, the line (cum - probs) > p does the work: cum - probs is the probability mass strictly before each sorted token, so a token is masked only when the set above it already covers p on its own. The token that crosses the threshold stays in; the first token can never be masked, so the filter always leaves something to sample. The scatter_ at the end just puts the sorted verdicts back in vocabulary order.
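
To see the nucleus resize itself, the sketch below applies the same "mass strictly before exceeds p" rule to two made-up distributions, one peaked and one flat. `nucleus_size` is a hypothetical helper that only counts survivors; the probabilities are chosen as round values for illustration.

```python
def nucleus_size(probs, p):
    """How many tokens survive the rule used in top_p_filter:
    a token is dropped once the mass strictly before it exceeds p."""
    kept = 0
    mass_before = 0.0
    for prob in sorted(probs, reverse=True):
        if mass_before > p:
            break
        kept += 1
        mass_before += prob
    return kept

peaked = [0.97, 0.01, 0.01, 0.01]     # one token carries nearly everything
flat = [0.125] * 8                    # eight equally plausible continuations

print(nucleus_size(peaked, p=0.9))    # 1
print(nucleus_size(flat, p=0.9))      # 8
```

A fixed k = 4 would keep three junk candidates in the peaked case and discard half of the reasonable ones in the flat case. The nucleus gets both right with the same p.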

Figure: The Holtzman argument in one picture: next-token distributions swing between peaked and flat, so a fixed top-k is wrong somewhere on every sentence. The nucleus (blue bracket) resizes itself to cover probability mass p, however many tokens that takes. Bar heights are illustrative, not measured.

In practice the knobs compose. The classic GPT-2 recipe is temperature plus top-k; most modern inference APIs expose temperature plus top-p, applied in exactly the order our generate applies them — temperature reshapes the contrast, the filter trims the tail, softmax renormalizes whatever survives. None of it touches a single weight. Same model, different bet.


8.6  The thing to actually understand


8.7  Exercises

  1. Confirm the shape contract. Instantiate an untrained GPT at our config, call generate(model, torch.zeros((1, 1), dtype=torch.long), max_new_tokens=50, context_length=256), and assert the output shape is (1, 51). Decode it. It should be nonsense at every level: random weights give a near-arbitrary distribution, and the loop faithfully samples it. That is the correct behavior.
  2. Break the crop. Delete the idx_cond = idx[:, -context_length:] line and pass idx straight to the model, then generate more than 256 tokens from a short prompt. Watch where and how it fails, and trace the error back to the position-embedding lookup (self.position_embedding inside GPT.forward). Now you know exactly what a “context window” is made of.
  3. Greedy versus sampled. On the tiny model you overfit in Chapter 7, implement the greedy variant (argmax instead of multinomial) and generate 200 tokens from the same prompt twice with each method. Greedy should be byte-identical across runs; note whether it falls into a repeating phrase, and how the sampled runs differ from each other.
  4. Sweep the temperature. With the same trained model and prompt, sample at τ = 0.2, 0.7, 1.0, 1.5. Describe the progression you observe in your own corpus terms — where does it get boring, where does it fall apart? Then repeat the τ = 1.5 run with top_p = 0.9 and note what the nucleus rescues.
  5. Instrument the nucleus (stretch). Modify generate to record, at every step, how many tokens survive the top_p = 0.9 filter. Plot that count over a 200-token generation. If Holtzman et al. are right about peaked and flat contexts, the curve should swing by orders of magnitude — check whether the widest moments line up with genuinely open-ended positions in the text.
What’s next
Ch 9 — Standing on giants: loading GPT-2 weights

A 37th-Chamber original. Methods cited: Holtzman, Buys, Du, Forbes & Choi (2020), “The Curious Case of Neural Text Degeneration,” ICLR 2020, arXiv:1904.09751 (nucleus sampling, p = 0.95 in experiments; self-reinforcing repetition, Fig. 4; GPT-2 Large — confirmed); Fan, Lewis & Dauphin (2018), “Hierarchical Neural Story Generation,” ACL 2018, arXiv:1805.04833 (top-k sampling for generation, k = 10 — confirmed); Radford, Wu, Child, Luan, Amodei & Sutskever (2019), “Language Models are Unsupervised Multitask Learners,” OpenAI technical report (GPT-2; WebText-conditioned samples at top-k = 40 — confirmed); Ackley, Hinton & Sejnowski (1985), “A Learning Algorithm for Boltzmann Machines,” Cognitive Science 9, 147–169 (temperature in neural networks — confirmed). All prose and code written fresh.

Written by a Fable · Edited by bobby-dig8al