Teaching It to Predict
Chapter 6 left a complete machine on the bench, humming with eleven million random numbers. Ask it anything and it answers with pure noise, priced at almost exactly ln(512) nats of surprise. Chapter 0 promised the fix in one breath — guess, get scored, nudge the numbers, repeat. This is the chapter that pays that promise, in full and to the cent: the score gets a formula, the nudge gets an algorithm, and the noise begins, measurably, to become a voice.
7.1 The promise from Chapter 0, now due
Back in Chapter 0, before a single line of code, this book stated the whole of training in one breath:
Show the model a lot of text. Ask it to guess the next word. When it guesses wrong, nudge the numbers slightly in the direction that would have made a better guess. Repeat this an enormous number of times.
That paragraph was an IOU. Every chapter since has quietly prepared to redeem it. Chapter 1 turned text into tokens so there is something to guess. Chapter 2 built the windows — x and its shifted-by-one target y — so every position carries its own answer key. Chapters 3 through 6 built the machine that does the guessing, ending with logits: 512 scores for the next token, at every position.
Three phrases in the promise still have no precise meaning. “When it guesses wrong” — wrong by how much? That is the loss (7.2). “In the direction that would have made a better guess” — which direction, for eleven million separate numbers? That is backpropagation (7.3). “Nudge slightly” — how slightly, and how exactly? That is the optimizer (7.4). Give those three their real names, wrap them in a loop (7.5), and the promise is kept. There is nothing else in training. No secret fourth ingredient, no withheld sauce — three moving parts and a loop. That is the honest, slightly scandalous truth of this chapter, and the labs that guard their training runs like state secrets are guarding the data and the scale, not this.
7.2 The loss — the model pays in surprise
The scoring rule for language models is cross-entropy, and the honest way to read it is as a bill for surprise. At every position, the model’s logits — after the softmax turns them into a probability distribution — assign some probability p to the token that actually came next in the text. The loss at that position is:
loss = -ln(p_true_next_token)
Read the bill three ways. If the model was confident and right — it gave the true token p = 0.99 — the charge is -ln(0.99) ≈ 0.01. Nearly free. If it hedged — p = 0.10 — the charge is -ln(0.10) ≈ 2.30. If it was confident and wrong, giving the true token p = 0.0001 because it bet hard on something else, the charge is -ln(0.0001) ≈ 9.2 — and it grows without bound as p approaches zero. The unit of this currency is the nat: surprise measured with the natural logarithm. (Measure with log base 2 and you get bits; same idea, different ruler.)
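The three charges on that bill are easy to reproduce. A minimal sketch using only the standard library; the helper name `surprise_nats` is made up for this illustration and appears nowhere else in the book.

```python
import math

def surprise_nats(p: float) -> float:
    """Cross-entropy charge, in nats, for giving the true token probability p."""
    return -math.log(p)

for p in (0.99, 0.10, 0.0001):
    nats = surprise_nats(p)
    bits = nats / math.log(2)   # the same surprise, measured with log base 2
    print(f"p = {p:<6} -> {nats:.2f} nats = {bits:.2f} bits")
```

Dividing by ln(2) is all it takes to change rulers from nats to bits.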
This is why the loss is the right thing to minimize: the only way to pay less, on average, over an entire corpus, is to actually put probability on what tends to come next. There is no way to fake it and no way to game it. Certainty about the wrong thing is punished savagely; honest calibrated spread is charged fairly; knowledge, and only knowledge, is rewarded. The bill does not care how the model feels about its answer — only whether reality agreed.
Now the number this book has been trailing since last chapter. A model that knows nothing — that spreads its belief evenly across all 512 vocabulary tokens — gives every true next token p = 1/512, and pays -ln(1/512) = ln(512) ≈ 6.24 nats on every single prediction. That is pure arithmetic. It is the price of total ignorance at our vocabulary size, and it is where our freshly assembled Chapter 6 model stands right now. Watch:
import torch
import torch.nn.functional as F
B, T, V = 32, 256, 512
# a perfectly clueless model: identical logits for every token
uniform_logits = torch.zeros(B * T, V) # (8192, 512)
targets = torch.randint(0, V, (B * T,)) # any targets at all
print(F.cross_entropy(uniform_logits, targets).item()) # 6.2383... = ln(512), exactly
# the real call, on real model output, looks like this:
# logits: (B, T, V) from the Chapter 6 GPT; yb: (B, T) from the Chapter 2 loader
# loss = F.cross_entropy(logits.view(B * T, V), yb.view(B * T))
Two mechanics worth naming. First, F.cross_entropy takes raw logits and applies the softmax internally (in a numerically safer form) — this is why Chapter 6’s model deliberately returns scores, not probabilities. Second, the .view() calls flatten the batch: a (32, 256, 512) logits tensor becomes 8,192 independent predictions, each scored against its own true next token, and the function returns their mean. One number. The bill for the whole batch.
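To see what "applies the softmax internally, in a numerically safer form" can mean, here is a hand-rolled sketch in plain Python. It illustrates the log-sum-exp idea and the final averaging; it is not PyTorch's actual implementation, and `cross_entropy_mean` is a hypothetical helper name.

```python
import math

def cross_entropy_mean(logits_rows, targets):
    """Mean of -ln(softmax(row)[target]) over all rows, computed stably."""
    total = 0.0
    for row, target in zip(logits_rows, targets):
        m = max(row)  # subtract the max so exp() cannot overflow
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum_exp - row[target]  # equals -ln(p_target)
    return total / len(targets)

V = 512
uniform_rows = [[0.0] * V for _ in range(4)]   # four clueless predictions
uniform_loss = cross_entropy_mean(uniform_rows, [0, 17, 255, 511])
print(uniform_loss, math.log(512))             # the two agree

# A logit of 1000 would overflow a naive exp(); the shifted form stays finite.
big_logit_loss = cross_entropy_mean([[1000.0, 0.0]], [0])
print(big_logit_loss)
```

Each row is scored on its own and the rows are then averaged, which is the same shape of computation as the flattened (B * T, V) call above.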
7.3 Backpropagation in one honest breath
The loss is one number. The model is 11,132,672 numbers. The question that makes training possible is: for each of those eleven million parameters, if I nudged it up a hair, would the loss go up or down — and how steeply?
The answer, for every parameter at once, is the gradient. And the algorithm that computes it efficiently — by walking the chain rule of calculus backward through the network, layer by layer, reusing intermediate results — is backpropagation, set out for neural networks by Rumelhart, Hinton & Williams in 1986 (Nature 323, 533–536). It is the single most important algorithm in this book that we will not implement, and we are going to say so out loud instead of smuggling it past you. The calculus is real, mechanical, and readable elsewhere — the receipt is right there in Nature, 1986; what this chapter needs is exactly what the gradient means:
The gradient is a direction in eleven-million-dimensional space: the direction, for every parameter simultaneously, in which the loss increases fastest. Step against it, and the loss falls fastest — locally, for this batch.
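The question "if I nudged it up a hair, would the loss go up or down?" can be asked literally, one parameter at a time. The sketch below does that with finite differences on a toy two-parameter loss invented for this illustration. It is not backpropagation, and it would be hopelessly slow for eleven million parameters, but it shows what a gradient is and what stepping against it does.

```python
def toy_loss(params):
    """A made-up bowl-shaped loss with its minimum at (3, -1)."""
    a, b = params
    return (a - 3.0) ** 2 + 10.0 * (b + 1.0) ** 2

def numerical_gradient(loss_fn, params, eps=1e-6):
    """Estimate d(loss)/d(param) by nudging each parameter up and down a hair."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grad

params = [0.0, 0.0]
grad = numerical_gradient(toy_loss, params)               # roughly [-6, 20]
stepped = [p - 0.01 * g for p, g in zip(params, grad)]    # one small step against the gradient

print(toy_loss(params), "->", toy_loss(stepped))          # the loss goes down
```

Note that this costs two loss evaluations per parameter. Backpropagation delivers every entry of the gradient from a single backward sweep, which is why it is the algorithm that makes training practical.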
In PyTorch, the entire machinery is one call. Every tensor operation in the forward pass — every matmul in every head, every ReLU, every layer norm — was silently recorded by autograd as it happened. loss.backward() replays that tape in reverse, and when it finishes, every parameter p in the model holds its own answer in p.grad: this way lies more loss. Two practical notes: gradients accumulate by default, so the loop must zero them between steps; and remember the residual connections from Chapter 5 — the gradient highway — are precisely what lets this backward flow reach the early blocks intact. Every pipe you laid back then was laid for this exact moment.
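The accumulation behaviour is easy to see on a single scalar. A minimal sketch, assuming only that PyTorch is installed; the tensor `w` is a stand-in for any one model parameter.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # stand-in for one parameter

(w * w).backward()                 # d(w^2)/dw = 2w = 4
first = w.grad.item()

(w * w).backward()                 # no zeroing: the new 4 is added to the old 4
accumulated = w.grad.item()

w.grad = None                      # the per-parameter effect of zero_grad(set_to_none=True)
(w * w).backward()
fresh = w.grad.item()

print(first, accumulated, fresh)   # 4.0 8.0 4.0
```

Forget the reset and every step pushes along the sum of all past gradients rather than the current one.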
7.4 The optimizer — how the nudge is made
The gradient gives a direction. The optimizer decides the step. The simplest possible rule is plain stochastic gradient descent: p = p - lr * p.grad, one global learning rate for everything. It works, and it is worth holding in your head as what every optimizer fundamentally is. But plain SGD is rarely used for transformers, for an honest reason: in a model like ours, gradients arrive at wildly different scales — the embedding row for a rare token gets touched occasionally and gently; a layer norm gain deep in block 4 gets hammered every single step. One step size cannot honestly fit all eleven million parameters at once — feed the same nudge to a whisper and a jackhammer and you break one of them.
Adam (Kingma & Ba, 2014, published at ICLR 2015 — arXiv:1412.6980) fixes this by keeping, for every parameter, running averages of its recent gradients and their squares, and scaling each parameter’s step accordingly — in effect, a per-parameter, self-adjusting learning rate with momentum. AdamW (Loshchilov & Hutter, published at ICLR 2019 — arXiv:1711.05101) repairs a subtle flaw in how Adam mixed in weight decay — the gentle pull of parameters toward zero that discourages memorization — by decoupling the decay from the gradient-based step. AdamW is the default optimizer of the GPT lineage, and it is what we use:
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
The learning rate 3e-4 — the size of the nudge — is the most consequential hyperparameter in the loop, and this value is a battle-tested default for small transformers rather than a law of nature. Too high and the loss lurches and diverges; too low and you wait forever. Exercise 3 has you feel both failure modes yourself.
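To make the per-parameter scaling concrete, here is a sketch of one AdamW update for a single scalar parameter, written in plain Python from the published update rule. The function names `sgd_step` and `adamw_step` are hypothetical, and the values of `beta1`, `beta2`, `eps`, and `weight_decay` are assumptions chosen for illustration. This is not the code inside `torch.optim.AdamW`.

```python
import math

def sgd_step(p, grad, lr):
    """Plain SGD: the step is proportional to the raw gradient."""
    return p - lr * grad

def adamw_step(p, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter. Returns (new_p, new_m, new_v)."""
    p = p * (1.0 - lr * weight_decay)           # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * grad        # running average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

lr = 3e-4
for grad in (1e-3, 1e3):                        # a whisper and a jackhammer
    sgd_move = abs(sgd_step(0.0, grad, lr))
    new_p, _, _ = adamw_step(0.0, grad, m=0.0, v=0.0, t=1, lr=lr)
    print(f"grad={grad:g}  SGD moves {sgd_move:.1e}  AdamW moves {abs(new_p):.1e}")
```

The two gradients differ by a factor of a million, and so do the SGD steps. The first AdamW step comes out at about `lr` in both cases, because the gradient is divided by the square root of its own running squared average.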
7.5 The split and the loop
One more piece of discipline before the loop: hold some data back. We cut the token stream 90/10: the first 90% to train on, the last 10% held out so the model never trains on it. Here is why this matters so much: an 11M-parameter model has more than enough capacity to partially memorize a small corpus. Its loss on text it has seen can keep falling toward zero without the model getting any better at language. Loss on the held-out slice, the validation loss, is the only score that measures what we actually want: prediction of text it has never laid eyes on. Train loss is the practice exam taken with the answer key in your lap. Val loss is the one that counts.
Everything below is brought forward: GPT from Chapter 6, NextTokenDataset from Chapter 2, both exactly as written. The only new code is the split, the evaluator, and the loop itself — and the loop is Chapter 0’s paragraph, line for line.
[Figure caption, beginning missing: … context_length window across the train side, shuffling the windows into batches. The last 10% stays sealed: the only text whose loss tells the truth.]
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
# GPT is the Chapter 6 class; NextTokenDataset is Chapter 2's. Unchanged.
# -- the split: 90% train / 10% held-out validation ---------------------------
# token_ids: the whole corpus as one list of ints, from the Chapter 1 tokenizer
n = int(0.9 * len(token_ids))
train_ds = NextTokenDataset(token_ids[:n], context_length=256)
val_ds = NextTokenDataset(token_ids[n:], context_length=256)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, drop_last=True)
val_loader = DataLoader(val_ds, batch_size=32, shuffle=True, drop_last=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT(vocab_size=512, context_length=256, d_model=384,
            num_heads=6, num_layers=6, dropout=0.1).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
@torch.no_grad()                      # no tape recording: evaluation, not learning
def estimate_loss(loader, max_batches=50):
    """Mean loss over up to max_batches batches. Returns a float."""
    model.eval()                      # dropout off for a fair reading
    losses = []
    for i, (xb, yb) in enumerate(loader):
        if i >= max_batches:
            break
        xb, yb = xb.to(device), yb.to(device)
        logits = model(xb)            # (B, T, vocab_size)
        B, T, V = logits.shape
        loss = F.cross_entropy(logits.view(B * T, V), yb.view(B * T))
        losses.append(loss.item())
    model.train()                     # dropout back on for training
    return sum(losses) / len(losses)
# -- the loop: Chapter 0's paragraph, made executable --------------------------
max_steps = 5000
step = 0
model.train()
while step < max_steps:
    for xb, yb in train_loader:
        # Log before the update, so the step-0 line reads the untrained model.
        if step % 250 == 0:
            print(f"step {step:5d} | "
                  f"train {estimate_loss(train_loader):.3f} | "
                  f"val {estimate_loss(val_loader):.3f}")
        xb, yb = xb.to(device), yb.to(device)
        logits = model(xb)                             # guess (B, T, V)
        B, T, V = logits.shape
        loss = F.cross_entropy(logits.view(B * T, V),  # score the surprise
                               yb.view(B * T))
        optimizer.zero_grad(set_to_none=True)          # clear old directions
        loss.backward()                                # directions, all 11M
        optimizer.step()                               # the nudge
        step += 1
        if step >= max_steps:
            break
Walk the five lines at the heart of it. model(xb) is the guess. F.cross_entropy(...) is the score — 8,192 predictions per batch, billed in nats, averaged. zero_grad clears last step’s directions (they accumulate otherwise — the classic bug). loss.backward() is Section 7.3: a direction for every parameter. optimizer.step() is Section 7.4: the nudge, sized per-parameter by AdamW. Repeat an enormous number of times is the while loop. The one-breath promise from Chapter 0 is no longer a promise — it is running code on your own machine, and you can read every line of it.
The estimate_loss helper carries two easy-to-miss switches. @torch.no_grad() tells autograd not to record the tape during evaluation — faster, lighter, and we have no intention of learning from the val set. model.eval() / model.train() toggle dropout: measuring loss with 10% of activations randomly zeroed would overstate the bill, so evaluation turns the noise off and training turns it back on. The step-0 print, fired before the first update, is your free look at the untrained model — expect the neighborhood of 6.24.
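Both switches can be watched in isolation. A small sketch, assuming PyTorch is installed; the standalone `nn.Dropout` layer and the tensors `x` and `w` are stand-ins invented here, not pieces of the Chapter 6 model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.1)          # same rate as the model's dropout=0.1
x = torch.ones(10_000)

drop.train()                      # training mode: about 10% zeroed, survivors scaled by 1/0.9
noisy = drop(x)

drop.eval()                       # evaluation mode: dropout is the identity
clean = drop(x)

print((noisy == 0).float().mean().item())   # roughly 0.1
print(torch.equal(clean, x))                # True

w = torch.tensor(1.0, requires_grad=True)
with torch.no_grad():             # same effect as the @torch.no_grad() decorator
    y = w * 2
print(y.requires_grad)            # False: nothing was recorded for backward()
```

In evaluation mode the layer passes its input through untouched, which is why the reading taken inside `estimate_loss` is the fair one.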
7.6 What you will watch happen
Run it on any corpus you like — your own writing, a public-domain book, or the classic hobby corpus, the tiny-shakespeare file that has been the “hello world” of small language models since Karpathy’s char-rnn days and still ships with nanoGPT. This book will not fabricate a training log for you — a made-up number is a lie no matter how pretty it looks in a table — because the numbers depend on your corpus, your seed, and your patience. But the shape of what you will see is not speculation; it falls straight out of what the loss is, and you should know that shape cold before you watch it arrive.
The start is known exactly. Your first printed losses will sit almost exactly at 6.24 nats — the ln(512) of total ignorance, the one point on the curve that mathematics fixes in advance. If you see something wildly different at step 0, something is wrong with your wiring, and that is a genuinely useful debugging fact.
The fall is fast, then slow. The cheapest surprise gets bought first: within the earliest steps the model learns the corpus’s letter frequencies and common short tokens, and the loss drops steeply. Then the curve bends. Each further nat is more expensive than the last — spelling costs less than grammar, grammar costs less than sense — and the descent settles into a long, slowing grind. That bend is not failure; it is the model exhausting easy structure and working on harder structure.
The two curves part company. For a while, train and val loss fall together. Then — on a small corpus, inevitably — they separate: train keeps sliding while val flattens and eventually turns upward. That gap is overfitting, live on your screen: eleven million parameters beginning to memorize the specific text rather than learn its patterns. On the corpora this book trains on, you will see it, and honesty says so up front. The frontier labs hit the exact same wall — they just push it back with oceans of data instead of less capacity; you can push it back the same way, or simply stop training when val loss stops improving. That last move is not a beginner’s shortcut. It is, in miniature, precisely what everyone with a supercomputer does.
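"Stop training when val loss stops improving" can be turned into a small rule. The sketch below is one simple way to phrase it, not a standard API: `should_stop` is a hypothetical helper, and the `patience` of 3 evaluations is an arbitrary assumption you would tune.

```python
def should_stop(val_losses, patience=3):
    """True when the last `patience` validation readings all failed to beat
    the best reading seen before them."""
    if len(val_losses) <= patience:
        return False                      # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before for loss in val_losses[-patience:])
```

In the training loop you would append each `estimate_loss(val_loader)` reading to a list at every evaluation point and break out once `should_stop` returns True.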
[Figure caption, beginning missing: … ln(512) ≈ 6.24, fall fast then slow, and eventually part: train sliding on, val flattening and turning up. Only the start is an exact number; the rest is the honest shape, because the real values depend on your corpus, seed, and patience.]
7.7 What just happened, philosophically
Stop and audit what you did — and, more importantly, what you never did.
You never wrote a rule of English, or of whatever language your corpus is in. No grammar table, no dictionary, no list of facts, no hand-coded shred of what any word means. Search the loop for the knowledge — go ahead, it is fifteen lines — and it is not there. There is a surprise meter, a direction-finder, and a nudger, running in a circle. That is the entire input of intelligence into this system, and not one part of it is about language at all.
And yet: eleven million numbers drifted downhill on the gradient of surprise, and structure appeared. Falling loss is the model discovering regularities — first that some letters are common, then that certain tokens follow certain others, then patterns with longer reach. Nobody put the structure in. It condensed out of the corpus, into the weights, under pressure from a scoring rule.
Here is the part that earns this book its title. Open the trained checkpoint. You can print any of the 11,132,672 numbers; you built the tensors they live in; you can name every wire they travel. And you cannot point to where anything the model “knows” is stored. The knowledge is smeared across millions of weights that were each nudged a hair at a time, jointly, by a process you now understand completely. Chapter 0 described this opacity from the outside, as the box’s defining property. You have now grown it yourself, on your own machine, with your own hands on every part. The box is opaque, and you are the one who made it — which means the opacity was never a trick, never a marketing mystery, and never a lock somebody else holds the key to. It is simply what this kind of knowing looks like from the outside. The honest response to a box nobody can read is not to trust the people holding it. It is the one you are already making: build one, and look.
One thing the machine still cannot do: talk to you. It has opinions — a full, considered probability distribution over what comes next — but no way to pick a word and commit to it. Turning that distribution into actual speech is a small, sneakily consequential piece of code that lives entirely outside the model, and it is next.
7.8 The thing to actually understand
- Loss is surprise, billed in nats. Cross-entropy charges -ln(p) for the probability the model gave the true next token. It cannot be gamed, flattered, or talked down: the only way to pay less on unseen text is to genuinely predict better.
- ln(512) ≈ 6.24 is the price of total ignorance: the uniform-distribution bill at our vocabulary size, fixed by arithmetic. It is where every training run of this model starts, and your first honest debugging check.
- The gradient answers exactly one question (for each of 11,132,672 parameters, which direction reduces the loss), and backpropagation (Rumelhart, Hinton & Williams, 1986) computes all the answers in one backward sweep. loss.backward() is that sweep; we use it with open eyes rather than deriving it.
- The optimizer is the nudge, sized per parameter. Plain SGD uses one step size for eleven million unequal citizens; Adam adapts each one; AdamW fixes Adam’s weight decay and is the GPT-lineage default.
- Validation loss is the only honest judge. Train loss can be memorized; the held-out 10% cannot. When the curves part, you are watching capacity beat data — overfitting — and it is the expected ending on a small corpus.
- Nobody inserted a rule. The loop contains a scorer, a direction, and a nudge — nothing about language, nothing anyone chose. Everything the model ends up knowing condensed into the weights from the data alone. That, precisely, is where the opacity comes from — and why it is a fact about the method, not a secret anyone is keeping.
7.9 Exercises
- Verify the starting line. Before any training, run estimate_loss(val_loader) on the untrained model, and compute math.log(512). How close are they, and why is the match not exact? (Section 6.6 named the reason.)
- Overfit on purpose. Train on a deliberately tiny corpus (a few kilobytes) for several thousand steps. Watch train loss sink toward zero while val loss climbs. Describe, in one paragraph, what the model has actually done, and why its near-zero train loss is not knowledge.
- Feel the learning rate. Run 500 steps each at lr = 3e-3, 3e-4, and 3e-5 from a fresh model each time. Compare the three loss trajectories. Which fails by violence, which by patience, and how would you pick a rate for a model you had never trained before?
- Remove the judge. Delete the split and train on 100% of the tokens. What information have you lost, exactly? Write the one-sentence reason train loss alone cannot tell you whether the model is learning language or memorizing text.
- Keep Chapter 2’s promise. Chapter 2, exercise 1 had you measure cosine similarity between two token embeddings before training and predict what training would do. The moment has come: measure the same pair after training and check your prediction. Geometry that was noise should have become opinion.
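For the last exercise, cosine similarity between two embedding rows takes only a few lines of plain Python. This is a sketch: `cosine_similarity` is a helper written here for illustration, and how you pull the two rows out of your trained model depends on how your Chapter 6 class names its embedding table, so that part is left to you.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0: same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0: orthogonal
```

Run it on the same pair of rows before and after training, and compare the two values with the prediction you wrote down in Chapter 2.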
A 37th-Chamber original. Methods cited: Rumelhart, Hinton & Williams (1986), “Learning representations by back-propagating errors,” Nature 323, 533–536 (backpropagation — confirmed); Kingma & Ba (2014, ICLR 2015), “Adam: A Method for Stochastic Optimization,” arXiv:1412.6980 (confirmed); Loshchilov & Hutter (ICLR 2019), “Decoupled Weight Decay Regularization,” arXiv:1711.05101 (AdamW — confirmed); the tiny-shakespeare corpus lineage runs from Karpathy’s char-rnn (2015) to nanoGPT (both confirmed). ln(512) ≈ 6.2383 is arithmetic, not a citation. No training curves or model outputs in this chapter are from a claimed run; the known starting point is mathematical and the curve shapes are stated qualitatively. All prose and code written fresh.