The Opaque Box · Chapter 9

Standing on Giants — Loading GPT-2

Chapter 8 ended on a bet: the model prices every token, and the sampler gambles on the spread. Run that loop over our eleven million parameters and one small corpus, and the winnings are honest but modest — spelled words, local grammar, meaning that drifts. That smallness lives in the numbers and the data — not in the architecture. So here is the experiment that settles whether this book has been telling you the truth or selling you a story. Take the skeleton you built line by line, and pour a giant’s memory into it. If our class really is GPT-2’s architecture — not inspired by, but the same machine at a different size — then GPT-2’s public weights should wake up inside it and speak coherent English through code you wrote yourself. There is no partial credit on a bet like that. It works or it doesn’t.

9.1 The claim to test

This book has been making a quietly enormous claim since Chapter 1, and it is time to put it on the table where it can be shot at: that the tokenizer, the embeddings, the causal attention head, the multi-head wrapper, the pre-LN block, and the assembled GPT class are not a toy version of the real thing — they are the real thing, instantiated small. Claims like that should be testable, and this one is, because of a rare piece of luck: the weights of a real GPT are public.

A trained model, on disk, is nothing mystical. It is a checkpoint: a dictionary from parameter names to tensors. If our class truly has GPT-2’s architecture, then every tensor in OpenAI’s released checkpoint has exactly one home in our state_dict(), every shape matches after the right transformations, and the transplanted model produces fluent English through the very generate() loop you wrote in Chapter 8. If we got any wiring wrong — a norm in the wrong place, a projection transposed, a mask misapplied — the transplant produces garbage. There is no partial credit and nowhere to hide. A giant’s weights are the most unforgiving unit test ever written for an architecture, and it is the reason this chapter exists.

Call the operation what it is: a transplant. Same skeleton, borrowed memory. And one honesty note before we begin: this is the single chapter in Part I where something external enters the build — a downloaded checkpoint and its matching tokenizer. That is not a compromise of the from-scratch ethic; it is the entire point of the chapter’s title. Newton stood on the shoulders of giants and got the phrase remembered. We did the harder, humbler thing first: we built the shoulders’ exact shape by hand. Now we stand on them.

9.2 GPT-2, precisely

GPT-2 is the model described in Radford et al. (2019), “Language Models are Unsupervised Multitask Learners.” The paper’s abstract states the headline plainly: the largest model in the family is “a 1.5B parameter Transformer” that set state-of-the-art results on 7 of 8 tested language-modeling datasets, zero-shot.

It arrived in public in stages. In February 2019 OpenAI announced the work and released only the smallest model, withholding the larger ones over misuse concerns; the medium model followed in May, the large in August, and the full 1.5B in November 2019 as the final step of the staged release. That episode — a lab publicly worrying its own model was too dangerous to release — reads differently now than it did then, in an era where far larger models ship on a Tuesday. But the weights have been fully open ever since, which is the only fact this chapter needs from that history.

A naming correction worth teaching, because you will meet both numbers in the wild: the paper’s Table 2 lists the four family members as 117M, 345M, 762M, and 1542M parameters — and those counts were wrong. The openai/gpt-2 repository README says it directly: “our original parameter counts were wrong due to an error.” The corrected sizes, used by the released checkpoints and everything downstream, are 124M, 355M, 774M, and 1558M. This book says 124M throughout. (A freshness note: that repository was archived read-only in April 2026 — still live, still serving download_model.py, no longer maintained.)

The 124M model, precisely, per the paper (§2.3 and Table 2) and the released configuration:

Vocabulary: 50,257 tokens, byte-level BPE — the paper: “The vocabulary is expanded to 50,257.” Ours is 512. Same algorithm family as Chapter 1, vastly larger table.
Context length: 1024 — “We also increase the context size from 512 to 1024 tokens” (the 512 being GPT-1’s). Ours is 256.
12 layers, d_model 768 — straight from Table 2’s first row. Ours: 6 layers, 384 wide.
12 attention heads — the paper’s table does not print head counts; 12 comes from the released code and configuration lineage (nanoGPT’s from_pretrained config and the Hugging Face defaults). Note the delightful coincidence that is not a coincidence: 768 / 12 = 64, and 384 / 6 = 64. Our per-head width is GPT-2’s per-head width. Not close — identical. We have been building at the giant’s proportions all along, and nobody told you, because the point was for you to find it here.

The four family members climb a steep ladder — and our 11M model stands one rung below the smallest of them. It is the same machine at every rung; only the numbers grow. Seeing the sizes side by side is the point: the distance from our toy to GPT-2 124M is real, but it is a distance in scale, not in kind.

The size ladder, on a log axis: our 11M model (blue) sits one rung below GPT-2’s smallest. The four released sizes — 124M, 355M, 774M, 1558M — are the same architecture at growing width and depth, not different machines.

9.3 The five differences, and how we absorb them

Between our Chapter-6 machine and GPT-2 124M stand exactly five differences that matter, plus one boring one. Five. That is the whole distance between the toy on your desk and a model a research lab announced to the world — and four of the five are things this book already told you to build. Taken in order:

1. Configuration. Vocabulary, context, width, depth — all just constructor arguments. No code changes at all:

# GPT-2 124M, in our config vocabulary (Radford et al. 2019, Table 2;
# released-checkpoint naming: "124M")
gpt2_config = dict(
    vocab_size     = 50257,   # byte-level BPE; ch 1 built ours at 512
    context_length = 1024,    # ours: 256
    d_model        = 768,     # ours: 384
    num_heads      = 12,      # head_size = 768 // 12 = 64 — same as ours
    num_layers     = 12,      # ours: 6
    dropout        = 0.0,     # inference only: dropout off
)

model = GPT(**gpt2_config)    # the ch 6 class, at the giant's size

Set the two configurations side by side and the whole thesis of the chapter is visible in a table: the column of field names is identical, the column of values is all that grew. Same skeleton, heavier fill.

Same skeleton, different fill: the six constructor fields are identical; only the values grew. And the last row did not grow at all — head_size = 64 on both sides. We built at the giant’s per-head proportions from Chapter 3.
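The same comparison can be printed instead of pictured. The sketch below is only an illustration: the GPT-2 column repeats gpt2_config, and the "ours" column collects the sizes this chapter quotes for our model (vocabulary 512, context 256, width 384, 6 heads, 6 layers). The dictionary names and the head_size helper are hypothetical conveniences, not part of the book's code, and dropout is left out because our training value is not stated here.

```python
# Two configurations with the same field names. "ours" uses the sizes quoted
# in this chapter; the variable names are illustrative only.
ours = dict(vocab_size=512, context_length=256, d_model=384, num_heads=6, num_layers=6)
gpt2 = dict(vocab_size=50257, context_length=1024, d_model=768, num_heads=12, num_layers=12)

def head_size(cfg):
    """Per-head width: the model width divided evenly among the heads."""
    assert cfg["d_model"] % cfg["num_heads"] == 0, "width must divide evenly"
    return cfg["d_model"] // cfg["num_heads"]

# One row per shared field, plus the derived per-head width as the last row.
rows = [(name, ours[name], gpt2[name]) for name in ours]
rows.append(("head_size", head_size(ours), head_size(gpt2)))

print(f"{'field':<16}{'ours':>8}{'GPT-2 124M':>12}")
for name, a, b in rows:
    print(f"{name:<16}{a:>8}{b:>12}")
```

Every stored field differs between the two columns; the derived last row is the only one that stays the same.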

2. GELU, not ReLU. Chapter 5 built the feed-forward network with ReLU, following Vaswani et al. (2017), and flagged the deviation then. GPT-2 uses GELU, the Gaussian Error Linear Unit of Hendrycks & Gimpel (2016). A nuance worth knowing: the GPT-2 paper never names its activation — the choice is documented in the released code, which defines GELU in its tanh-approximation form. Chapter 5’s Exercise 3 already had you make this exact swap; now we make it permanent by giving FeedForward a switch. This is the one module we bring forward modified, and the modification is marked:

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """
    Position-wise FFN — ch 5's module with one new switch: the activation.
    Input:  (B, T, d_model)
    Output: (B, T, d_model)
    """
    def __init__(self, d_model, dropout=0.0, activation="relu"):
        super().__init__()
        if activation == "gelu":
            act = nn.GELU(approximate="tanh")   # GPT-2's released formulation
        else:
            act = nn.ReLU()                     # ch 5 / Vaswani et al. 2017
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),    # (B, T, d) -> (B, T, 4d)
            act,
            nn.Linear(4 * d_model, d_model),    # (B, T, 4d) -> (B, T, d)
            nn.Dropout(dropout),
        )

    def forward(self, x):                       # (B, T, d_model)
        return self.net(x)                      # (B, T, d_model)

Thread activation="gelu" through Block and GPT the same way dropout already travels. Note the approximate="tanh": PyTorch’s default GELU uses the exact formulation; GPT-2’s code uses the tanh approximation. For a faithful transplant, match the giant’s arithmetic, quirks included.
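To see how small the gap between the two formulations is, here is a minimal pure-Python sketch of both: the exact GELU, x·Φ(x) written with the error function, and the tanh approximation given in Hendrycks & Gimpel (2016). The function names are illustrative; inside the model the choice is made by the approximate argument of nn.GELU, as in the FeedForward listing.

```python
import math

def gelu_exact(x):
    # x * Phi(x), where Phi is the standard normal CDF (via the error function)
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation of the same curve
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Scan a grid on [-5, 5] and report the largest disagreement.
xs = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"largest gap on [-5, 5]: {max_gap:.2e}")
```

The two curves are close but not identical, and the difference is applied at every position in every block. That is the reason to load the checkpoint into the formulation it was trained with instead of the default.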

3. Weight tying is ON. Chapter 6 presented tying — reusing the token-embedding matrix as the output head, per Press & Wolf (2016) — and kept ours untied for clarity. GPT-2 ties: its released code has no separate output matrix at all; the logits are computed by multiplying the final hidden states against the token-embedding table wte directly. So the checkpoint contains one matrix doing two jobs, and our loader will copy it into both of our slots.

4. Learned positional embeddings — already ours. GPT-2’s position table wpe is a plain trainable variable in the released code — not the sinusoidal functions of the 2017 paper. That is precisely the nn.Embedding position table we built in Chapter 2. Nothing to do.

5. Pre-LN — already ours too. The paper, §2.3: “Layer normalization… was moved to the input of each sub-block… and an additional layer normalization was added after the final self-attention block.” That is the pre-LN convention Chapter 5 chose, deviation-marked, and the final ln_f Chapter 6 installed. The deviations we made from Vaswani in Chapters 5 and 6 were, all along, the moves that make us checkpoint-compatible with GPT-2. Every one of those “deviation-marked” footnotes was a promise the book was quietly making to this chapter. Here is where it pays them back, in full.

And the boring sixth: biases. Chapter 3 built the query/key/value projections with bias=False for minimalism. GPT-2’s checkpoint carries a bias vector for every projection. Flip the three flags to bias=True in SelfAttentionHead so the transplant has somewhere to put them. One keyword argument, three lines, no drama.

The classic gotcha: the Conv1D transpose

One trap remains, and it is famous enough to deserve its own heading. OpenAI’s code implements every projection with a layer it calls Conv1D — a TensorFlow-convention layer that, in Hugging Face’s own documentation, “basically works like a linear layer but the weights are transposed.” The checkpoint therefore stores those weight matrices transposed relative to nn.Linear. Exactly four weight families need a .t() on the way in — the list, straight from nanoGPT’s loader: attn.c_attn.weight, attn.c_proj.weight, mlp.c_fc.weight, mlp.c_proj.weight.

Why this bug is the classic one: attn.c_proj.weight is 768×768 — square. Forget its transpose and every shape still fits, no error is raised, and the model fluently generates noise. A wrong-shape bug announces itself and dies in your face; a wrong-orientation bug on a square matrix says nothing, runs clean, and quietly destroys the machine from the inside. That is the failure that eats afternoons. This is the single most instructive failure in the whole transplant: run the sabotage in Exercise 4 and see it once, on purpose, so it can never cost you a real one.

There is one more fused-weights wrinkle: GPT-2 stores query, key, and value as a single matrix, c_attn, which sits in the checkpoint as (768, 2304) and becomes (2304, 768) once transposed: three 768-row stacks, q then k then v. Our Chapter-3 design keeps per-head query/key/value modules. So the loader must transpose, then split into thirds, then slice each third into 64-row bands, one per head. That is not a difference in the mathematics (it is the same linear map, filed differently), and unpacking it is exactly the kind of bookkeeping that proves you understand what the fused matrix meant.
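The claim that the fused matrix and the per-head matrices are the same linear map can be checked on toy sizes. The sketch below assumes a Conv1D-style fused projection (weight stored as (in, out), applied as x @ W + b) and compares it against per-head slices applied the way nn.Linear applies its weight. All sizes and variable names are illustrative, and the random tensors stand in for real weights.

```python
import torch

torch.manual_seed(0)
d_model, num_heads = 8, 2                  # toy sizes; GPT-2 124M uses 768 and 12
head_size = d_model // num_heads
x = torch.randn(3, d_model, dtype=torch.float64)

# Fused projection in Conv1D layout: weight is (in, out) = (d_model, 3 * d_model)
c_attn_w = torch.randn(d_model, 3 * d_model, dtype=torch.float64)
c_attn_b = torch.randn(3 * d_model, dtype=torch.float64)
fused = x @ c_attn_w + c_attn_b                       # (3, 3 * d_model)
q_fused, k_fused, v_fused = fused.split(d_model, dim=-1)

# The loader's route: transpose, split into thirds, slice per-head row bands
w_q, w_k, w_v = c_attn_w.t().split(d_model, dim=0)    # each (d_model, d_model)
b_q, b_k, b_v = c_attn_b.split(d_model, dim=0)        # each (d_model,)

all_match = True
for h in range(num_heads):
    rows = slice(h * head_size, (h + 1) * head_size)
    # what a per-head nn.Linear(d_model, head_size) computes: x @ W.t() + b
    q_h = x @ w_q[rows].t() + b_q[rows]
    k_h = x @ w_k[rows].t() + b_k[rows]
    v_h = x @ w_v[rows].t() + b_v[rows]
    all_match &= torch.allclose(q_h, q_fused[:, rows])
    all_match &= torch.allclose(k_h, k_fused[:, rows])
    all_match &= torch.allclose(v_h, v_fused[:, rows])

print("per-head slices reproduce the fused projection:", all_match)
```

Head h reads rows h*head_size to (h+1)*head_size of each third, which are the same columns that head would own in the fused output.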

Picture the flip itself, because it is the whole trap in one gesture. The checkpoint stores a projection with its axes swapped relative to what nn.Linear expects; the .t() puts them back. On a rectangular matrix, skip the flip and the shapes refuse to line up — a loud, immediate crash. On a square one, they line up anyway, and nothing warns you.

The transpose trap in one picture: on a rectangular matrix a forgotten .t() crashes on shape; on the square attn.c_proj it lines up anyway and quietly generates fluent noise. Of the four transposed families, only that one is square.
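The loud and silent versions of the mistake can both be reproduced on a toy matrix before touching the real checkpoint. This is a minimal sketch, assuming the Conv1D convention is x @ W with W stored as (in, out), and using torch.nn.functional.linear to stand in for nn.Linear; the sizes and names are illustrative and the weights are random.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 5, 8, dtype=torch.float64)         # (B, T, d), toy sizes

# Conv1D convention: weight stored as (in, out), applied as x @ W
w_square = torch.randn(8, 8, dtype=torch.float64)
reference = x @ w_square

# nn.Linear convention: weight stored as (out, in), applied as x @ W.t()
with_flip = F.linear(x, w_square.t())      # correct transplant
without_flip = F.linear(x, w_square)       # forgotten .t(): runs, wrong numbers

print("with flip matches:", torch.allclose(reference, with_flip))
print("without flip, same shape:", without_flip.shape == reference.shape)
print("without flip matches:", torch.allclose(reference, without_flip))

# Rectangular case: stored as (in=8, out=32); without the flip the shapes clash
w_rect = torch.randn(8, 32, dtype=torch.float64)
try:
    F.linear(x, w_rect)
    crashed = False
except RuntimeError:
    crashed = True
print("rectangular without flip crashed:", crashed)
```

The square case is the dangerous one precisely because the only evidence of the error is in the numbers, not in any exception.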

9.4 The loading sketch

A checkpoint’s interface is the state dict: an ordered dictionary mapping parameter names to tensors. Loading a giant is therefore nothing more exotic than renaming keys, transposing four families, splitting one fused matrix, and calling load_state_dict. Here is the whole map, first as a picture, then as code.

The complete key map of the transplant: every tensor in the GPT-2 124M checkpoint has exactly one home in our class. Four weight families cross transposed; the fused qkv matrix additionally splits into thirds and then into 64-row bands per head; the bottom row is one matrix with two jobs.

And the same map as code — a faithful sketch with every mapping shown and error handling omitted:

import torch
from transformers import GPT2LMHeadModel     # the honest external dependency

hf = GPT2LMHeadModel.from_pretrained("gpt2") # downloads the 124M checkpoint once
sd_theirs = hf.state_dict()

def load_gpt2_into_ours(model, sd_theirs, num_layers=12, num_heads=12, head_size=64):
    """
    Pour the GPT-2 124M checkpoint into our GPT class.
    Requires: activation="gelu", bias=True on q/k/v (see 9.3).
    Returns the same model, weights replaced.
    """
    sd = model.state_dict()

    # embeddings: names change, tensors copy straight over
    sd['token_embedding.weight']    = sd_theirs['transformer.wte.weight']   # (50257, 768)
    sd['position_embedding.weight'] = sd_theirs['transformer.wpe.weight']   # (1024, 768)

    for i in range(num_layers):
        p = f'transformer.h.{i}.'            # their prefix
        q = f'blocks.{i}.'                   # our prefix

        # layer norms: rename only
        sd[q + 'ln1.weight'] = sd_theirs[p + 'ln_1.weight']
        sd[q + 'ln1.bias']   = sd_theirs[p + 'ln_1.bias']
        sd[q + 'ln2.weight'] = sd_theirs[p + 'ln_2.weight']
        sd[q + 'ln2.bias']   = sd_theirs[p + 'ln_2.bias']

        # fused qkv: transpose (Conv1D), split into thirds, then slice per head
        d = num_heads * head_size                        # model width; 768 for 124M
        w = sd_theirs[p + 'attn.c_attn.weight'].t()      # (2304, 768) after .t()
        b = sd_theirs[p + 'attn.c_attn.bias']            # (2304,)
        w_q, w_k, w_v = w.split(d, dim=0)                # each (768, 768)
        b_q, b_k, b_v = b.split(d, dim=0)                # each (768,)
        for h in range(num_heads):
            rows = slice(h * head_size, (h + 1) * head_size)
            sd[q + f'sa.heads.{h}.query.weight'] = w_q[rows]   # (64, 768)
            sd[q + f'sa.heads.{h}.query.bias']   = b_q[rows]   # (64,)
            sd[q + f'sa.heads.{h}.key.weight']   = w_k[rows]
            sd[q + f'sa.heads.{h}.key.bias']     = b_k[rows]
            sd[q + f'sa.heads.{h}.value.weight'] = w_v[rows]
            sd[q + f'sa.heads.{h}.value.bias']   = b_v[rows]

        # attention output projection: transpose
        sd[q + 'sa.proj.weight'] = sd_theirs[p + 'attn.c_proj.weight'].t()  # (768, 768)
        sd[q + 'sa.proj.bias']   = sd_theirs[p + 'attn.c_proj.bias']

        # feed-forward, expand then project: both transposed
        sd[q + 'ffwd.net.0.weight'] = sd_theirs[p + 'mlp.c_fc.weight'].t()   # (3072, 768)
        sd[q + 'ffwd.net.0.bias']   = sd_theirs[p + 'mlp.c_fc.bias']
        sd[q + 'ffwd.net.2.weight'] = sd_theirs[p + 'mlp.c_proj.weight'].t() # (768, 3072)
        sd[q + 'ffwd.net.2.bias']   = sd_theirs[p + 'mlp.c_proj.bias']

    # final layer norm
    sd['ln_f.weight'] = sd_theirs['transformer.ln_f.weight']
    sd['ln_f.bias']   = sd_theirs['transformer.ln_f.bias']

    # the head: GPT-2 ties it to the token embedding — one matrix, two jobs
    sd['lm_head.weight'] = sd_theirs['transformer.wte.weight']   # (50257, 768)
    # GPT-2 has NO bias here (the tied wte does both jobs); ch 6's nn.Linear
    # default gave us one — zero it so it contributes nothing
    sd['lm_head.bias'] = torch.zeros_like(sd['lm_head.bias'])    # (50257,)

    model.load_state_dict(sd)
    return model

# model must have been built with activation="gelu" and q/k/v bias=True (see 9.3)
model = load_gpt2_into_ours(model, sd_theirs)

Line-by-line walk

from_pretrained("gpt2"): the crane. Hugging Face’s openai-community/gpt2 hosts the 124M checkpoint and downloads it in one line; the original alternative is OpenAI’s own download_model.py in the archived repo, which serves TensorFlow-format files and leaves the parsing to you. We use the crane to lift the tensors — and then every one of them lands in our class. The model that speaks at the end of this chapter is the one you built.
The embedding rows: wte is (50257, 768) — our nn.Embedding stores exactly that orientation, so embeddings copy without transposing. Only the Conv1D-born matrices need .t().
The qkv unpack: after the transpose, rows 0–767 of w are the query map, 768–1535 the key map, 1536–2303 the value map; within each, rows h*64:(h+1)*64 belong to head h, because an nn.Linear(768, 64) stores its weight as (64, 768) — one output row per output feature. The fused matrix and our twelve small ones describe the identical linear map, filed differently.
The head needs no transpose — and seeing why is the best single test of your Chapter-6 understanding. GPT-2 computes logits as hidden-states-times-wte-transposed; nn.Linear computes x @ W.t() by definition. So storing wte itself as lm_head.weight gives exactly the tied computation.
The zeroed head bias: the checkpoint has no lm_head.bias at all — the tied embedding is the whole head. Our Chapter-6 nn.Linear came with a bias by default, and if we left it untouched it would keep its random initialization and quietly corrupt every logit. Zeroing it is the honest neutralization. (Bugs of this family — a tensor the map forgot — are why real loaders assert that every key was visited.)
A subtlety, marked honestly: this loader makes lm_head.weight an equal copy of the embedding, not the same tensor object — sufficient and exact for inference; true tying (one shared tensor) is what GPT-2 has and what matters if you resume training. Exercise 2 closes that gap in one line.
Keys we skip: the checkpoint also carries a few entries that are not learned weights — cached causal-mask buffers. Our class builds its own tril mask in Chapter 3’s register_buffer, so those never need to cross.
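The point about the head needing no transpose can be confirmed on toy sizes. The sketch below is an illustration under stated assumptions: a small nn.Embedding stands in for wte, a bias-free nn.Linear stands in for lm_head, and the weights are random. It copies the embedding table into the head unchanged and compares the result with the tied computation written out by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model = 11, 4                       # toy sizes, not GPT-2's

wte = nn.Embedding(vocab_size, d_model)           # weight: (vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight: (vocab_size, d_model)

# Same storage orientation, so the table can be copied in as it is.
same_shape = lm_head.weight.shape == wte.weight.shape
with torch.no_grad():
    lm_head.weight.copy_(wte.weight)

hidden = torch.randn(2, 3, d_model)               # (B, T, d_model)
with torch.no_grad():
    tied_logits = hidden @ wte.weight.t()         # the tied computation, by hand
    head_logits = lm_head(hidden)                 # nn.Linear: x @ W.t()

print("shapes agree:", same_shape)
print("logits agree:", torch.allclose(tied_logits, head_logits, atol=1e-5))
```

This uses a copy, which is all the inference path needs; sharing one tensor between the two modules is a separate step.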

One honest scope note: this sketch is complete as a map, and it is deliberately not industrial. The battle-tested loaders — every dtype edge case, every version drift handled — live free in nanoGPT, whose from_pretrained is the lineage this map follows, and in the book-companion repositories credited on the library shelf. Read them after you have written yours; they will read like an old friend’s handwriting.
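One small piece of that industrial care is cheap to add to your own loader: a check that no checkpoint tensor was forgotten. The helper below is a hypothetical sketch, not taken from nanoGPT or Hugging Face. It assumes the loader records each checkpoint key it reads in a set, and that the cached mask buffers appear under names ending in .attn.bias and .attn.masked_bias; whether those entries are present depends on the library version, and other entries you skip on purpose (such as a duplicate tied head) would need to be added to the ignore list.

```python
def unvisited_keys(checkpoint_keys, visited,
                   ignorable_suffixes=(".attn.bias", ".attn.masked_bias")):
    """Return checkpoint keys the loader never read, minus known non-weights."""
    return sorted(
        k for k in checkpoint_keys
        if k not in visited and not k.endswith(ignorable_suffixes)
    )

# Toy illustration with a handful of GPT-2-style names.
checkpoint_keys = [
    "transformer.wte.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_attn.bias",
    "transformer.h.0.attn.bias",            # cached causal mask, skipped on purpose
    "transformer.h.0.attn.c_proj.weight",   # deliberately "forgotten" below
]
visited = {
    "transformer.wte.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_attn.bias",
}

missing = unvisited_keys(checkpoint_keys, visited)
print(missing)   # → ['transformer.h.0.attn.c_proj.weight']
```

In a real loader the natural use is an assertion that the returned list is empty just before load_state_dict is called.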

9.5 What you get

Here is the reward, and it comes with a flourish: Chapter 8’s generate() does not change by a single character — and this is the moment its design pays off. It never hard-coded the model’s size; the window is an argument, so you simply pass the giant’s context_length of 1024. What must change is the tokenizer: our Chapter-1 encoder speaks a 512-token language; the transplanted weights expect GPT-2’s 50,257-token byte-level BPE. The tokenizer is part of the giant’s memory too, so it comes from the same place:

from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")        # GPT-2's 50,257-token BPE
idx = torch.tensor([tok.encode("The clearest way to understand a machine is")])
out = generate(model, idx, max_new_tokens=60, context_length=1024,   # ch 8, verbatim
               temperature=0.8, top_k=50)
print(tok.decode(out[0].tolist()))

Run it. We are not printing a sample continuation here, because we would have to invent one, and this book does not do that. What you should see, and what tells you the transplant took: coherent English — grammatical sentences that hold a topic across clauses, at a fluency your 11M model cannot approach. Sweep the Chapter-8 dial and the same personalities emerge: cautious at low temperature, inventive at 1.0, unhinged at 5. The instrument panel you built generalizes; only the mind behind it changed.

And if you see fluent garbage instead — word-shaped noise, or the same token forever — your map has a bug, and the usual suspect is a missing transpose on a square matrix. Do not curse it. That failure is not a setback; it is the verification working exactly as designed. The transplant test is binary precisely because the architecture either matches or it does not — and a binary test that just told you “no” is still telling you the truth.

Sit with what just happened, because it is the whole argument of Part I collapsing into one running program. A model trained by a research lab, on hardware you have never seen, burning money you have never spent, is running — correctly — inside a few hundred lines of Python you wrote and understand completely. The claim from 9.1 is settled, and it did not settle by an author’s say-so: what this book built is not like GPT-2. It is GPT-2’s architecture, and the weights waking up inside it are the proof you ran with your own hands.

9.6 Open weights, closed book

Now for the part this book is named after — the reckoning the title has been promising since the cover. This is the chapter that earns it.

You currently hold roughly 124 million trained numbers in a class you wrote line by line. Total access: you can print any tensor, histogram any layer, trace any forward pass value by value. Nothing is hidden from you. So look. Open model.state_dict() and stare at blocks.7.ffwd.net.0.weight: 3072 × 768, about 2.4 million floats, each one individually inspectable. Somewhere in this machine is whatever lets it complete sentences about bridges, and grammar, and the order adjectives go in English. Point to it. Which numbers know that? The question has no address, and every tool you have to answer it comes back empty-handed. It is not that the answer is classified; classified you could subpoena. It is that the question is not answerable in the vocabulary the machine is written in. The knowledge is smeared across matrices whose individual entries mean nothing on their own.
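For concreteness, this is roughly what "looking" amounts to in code. The sketch is an assumption-laden stand-in: it builds a random tensor with the shape of blocks.7.ffwd.net.0.weight (the 0.02 scale is arbitrary) so that it runs without the checkpoint. With the transplanted model you would read the tensor from model.state_dict() instead.

```python
import torch

torch.manual_seed(0)
# Stand-in with the shape of blocks.7.ffwd.net.0.weight. With the loaded model:
#   w = model.state_dict()["blocks.7.ffwd.net.0.weight"]
w = torch.randn(3072, 768) * 0.02      # random values, NOT GPT-2's weights

print("shape:", tuple(w.shape), "entries:", w.numel())
print("mean:", w.mean().item(), "std:", w.std().item())

# A coarse histogram: how many entries fall in each of 10 equal-width bins.
counts = torch.histc(w, bins=10, min=w.min().item(), max=w.max().item())
print("histogram:", counts.tolist())
```

Every statistic here is easy to compute and none of them says what the layer knows, which is the point the paragraph above is making.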

This is the distinction the public conversation about AI keeps fumbling — sometimes by accident, sometimes because the fumble sells a policy — and you have now earned the correction by construction, not by trusting anyone’s press release: open weights are not transparency. “Open” answers who may look; it says nothing whatsoever about what looking reveals. GPT-2 has been fully open since November 2019 — every parameter public, every line of architecture reproducible by a careful reader of a free book — and it is exactly as opaque as the day it was released. The box was never locked. It was always opaque. Those are different problems, and only one of them is solved by publishing a checkpoint. Anyone who tells you releasing the weights makes a model understood is selling you the lock and calling it a window.

Chapter 0 promised that the honest response to an unreadable machine is to build one and look. You have kept both halves of that promise. What the looking reveals is the mechanism — attention routing, positions thinking alone, a residual stream accumulating contributions — and the mechanism is fully, beautifully understandable. What it does not reveal is the content: what any particular weight contributes to any particular capability. You have verified every arithmetic step of a mind you cannot read a single fact out of. That sentence is this book’s thesis, and as of this chapter you own it — not because an author asserted it, but because you built the box, poured in the giant, and watched it stay opaque with your own eyes.

9.7 The thing to actually understand

Loading a giant’s weights is the strongest verification of an architecture. A checkpoint only wakes up if every module, shape, and convention matches — it cannot be fooled and it does not grade on a curve. Fluent English out of your class is proof, not luck.
Scale is configuration, not architecture. Our 11M model and GPT-2 124M differ in five constructor numbers (vocabulary, context, width, head count, depth), one activation function, one tying choice, and the q/k/v bias flags. Nothing else changes, and the per-head width, 64, is identical.
The transpose quirk is the classic trap. The checkpoint stores four weight families in TF Conv1D orientation; three are rectangular and fail loudly on shape if you forget the .t(), while the one square family, attn.c_proj, fails silently. Fused qkv additionally splits into thirds, then per-head bands.
Tying means one matrix, two jobs. GPT-2 has no separate output matrix; the token embedding doubles as the head — and lands in nn.Linear without a transpose, because Linear already computes x @ W.t().
Open weights ≠ transparent model. Full access to every number does not buy you the ability to read a single fact out of them. Opacity survives publication — open the checkpoint and the box is still shut. That is the book’s title, earned first-hand.

9.8 Exercises

1. Configure the 355M on paper. From the paper’s Table 2, the second family member has 24 layers and d_model 1024. Write its gpt2_config. Which fields change and which stay? One field, the head count, is not in the paper’s table at all; where would you have to look to get it, and what does that tell you about papers versus released code as sources of truth?
2. Close the tying gap. After loading, run torch.equal(model.lm_head.weight, model.token_embedding.weight), then check whether they are the same tensor with model.lm_head.weight.data_ptr() == model.token_embedding.weight.data_ptr(). Explain the difference between equal and shared, then make the tie real in one assignment.
3. Count the giant from the config. Apply Chapter 6’s parameter arithmetic at GPT-2’s size (embeddings, twelve blocks with biases, final norm, tied head) and compare your total to 124M. How much of the total do the embeddings alone carry, compared to our 11M model? What does that shift say about where capacity lives at scale?
4. Sabotage one transpose. Deliberately skip the .t() on attn.c_proj.weight only (a square matrix, so nothing errors) and generate. Describe what comes out. Then explain why this bug family is feared: which of the four transposed families would fail loudly instead, and why?
5. Stretch: sweep the dial on the giant. Re-run Chapter 8’s temperature-sweep exercise against the transplanted model with the same prompts and seeds. Compare how temperature feels at 124M versus 11M: at which τ does each model stop producing language? Write three sentences on what that difference suggests about how much of “coherence” lives in the distribution’s shape.

What’s next

Ch 10 — Bending It to a Task — Fine-tuning


A 37th-Chamber original. Methods cited: Radford et al. (2019), “Language Models are Unsupervised Multitask Learners” (1.5B largest model, vocab 50,257, context 1024, pre-LN + final layer norm, Table 2 sizes — confirmed against the official PDF); the openai/gpt-2 repository (weight tying in src/model.py, the GELU tanh formulation, learned wpe, the parameter-count correction 117M→124M, download_model.py; archived read-only April 2026 — confirmed); OpenAI, “Better language models and their implications” (Feb 2019) and “GPT-2: 1.5B release” (Nov 2019) (staged release — confirmed); Hendrycks & Gimpel (2016), “Gaussian Error Linear Units (GELUs),” arXiv:1606.08415 (confirmed); Press & Wolf (2016), “Using the Output Embedding to Improve Language Models,” arXiv:1608.05859 (weight tying — confirmed); Karpathy, nanoGPT (the Conv1D transpose list and from_pretrained lineage — confirmed); Hugging Face, openai-community/gpt2 model card (loading path, 12-head config — confirmed); Vaswani et al. (2017), arXiv:1706.03762 (the original ReLU FFN we deviated from — confirmed). All prose and code written fresh.