The Opaque Box · Chapter 6

Assembling the GPT

Chapters 1 through 5 machined the parts: a tokenizer, two embedding tables, the attention head, the multi-head mixer, the block that routes and thinks. Every one of them is on the bench now, tested and shape-true. This chapter invents nothing — it bolts the parts together in the only order the shapes allow, then counts every number in the machine, all 11,132,672 of them, by hand. What stands at the end is a complete GPT that cannot yet do a single useful thing — which is the most honest fact about it, and the whole reason the next chapter exists.

6.1 Where we are — the parts on the bench

Before assembly, walk the bench. Five chapters produced five parts, and it is worth stating, in one breath each, what every part does, because the whole model is nothing more than these five things in sequence. No sixth secret. No hidden magic. Five parts, in order.

The tokenizer (Chapter 1) turns raw text into a stream of integers drawn from a 512-token vocabulary — encode_bpe going in, decode coming out, with the learned merges table between them. It is the only part of the system that ever touches actual text.

The embedding tables (Chapter 2) turn each integer into a 384-dimensional vector: token_embedding answers what the token is, position_embedding answers where in the window it sits, and their sum is the (B, T, C) tensor that flows through everything downstream. Chapter 2 also built the input pipeline — NextTokenDataset and its DataLoader — which we will not need until training begins in Chapter 7.

The attention head (Chapter 3) lets each position look back at earlier positions — never forward, thanks to the tril causal mask — and pull in a weighted blend of what it finds. One head, one pattern of looking.

Multi-head attention (Chapter 4) runs six of those heads in parallel, each 64 dimensions wide, concatenates their answers, and mixes them through a final projection. Six patterns of looking, fused into one shape-preserving operation.

The block (Chapter 5) wraps multi-head attention and a position-wise feed-forward network in residual connections and pre-LN layer norms: route, then think, with a gradient highway running straight through both. Its defining property is the shape contract — (B, T, d_model) in, (B, T, d_model) out.

That contract is about to pay for itself, all at once. Because the block preserves its shape exactly, building a deep model requires no engineering at all — no glue, no adapters, no special case for layer seven. You stack.

It is worth watching the shape travel through the whole machine before we watch the parts. The tokenizer hands in a flat grid of integers; the embeddings inflate each integer into a 384-vector; every block leaves that shape untouched; and only at the very last door does the shape change again — from the model’s private 384-dimensional geometry back out to one score per vocabulary token. Three shapes, two changes, and a long stretch in the middle where nothing moves but the numbers.

The shape contract, end to end: the shape changes exactly twice — once to enter the model’s geometry, once to leave it — and the long blue middle, six blocks and a final norm, never changes shape at all.
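The three shapes can be traced with nothing but stock PyTorch layers. This is a minimal sketch, not the model itself: the blocks are left out (they would not change the shape anyway), the batch size of 4 is arbitrary, and the variable names `final_norm` and `head` are placeholders chosen for this illustration.

```python
import torch
import torch.nn as nn

# Configuration values taken from the chapter; B is arbitrary for this sketch.
vocab_size, context_length, d_model = 512, 256, 384
B, T = 4, 256

token_embedding = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(context_length, d_model)
final_norm = nn.LayerNorm(d_model)          # placeholder name for the final norm
head = nn.Linear(d_model, vocab_size)       # placeholder name for the output layer

idx = torch.randint(0, vocab_size, (B, T))                       # shape 1: (B, T)
x = token_embedding(idx) + position_embedding(torch.arange(T))   # shape 2: (B, T, d_model)
middle_shape = x.shape
# ... the six blocks would run here and leave the shape untouched ...
logits = head(final_norm(x))                                     # shape 3: (B, T, vocab_size)

shapes = [tuple(idx.shape), tuple(middle_shape), tuple(logits.shape)]
print(shapes)   # [(4, 256), (4, 256, 384), (4, 256, 512)]
```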

6.2 The missing pieces

Three things stand between the bench and a working forward pass. None of them is hard: each is a single line of code. One of them ties off a loose end that most tutorials never mention, and one comes with a word of caution about language.

Stacking N blocks

The depth of a GPT is literally the number of times Chapter 5’s block repeats. Since every block maps (B, T, 384) to (B, T, 384), the output of block one is a legal input to block two, and so on forever. nn.Sequential(*[Block(...) for _ in range(num_layers)]) is the entire construction. We set num_layers = 6 — the last free knob in the configuration we fixed back in Chapter 1, now finally used.
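To see the stacking idiom in isolation, here is a small sketch that swaps in a stand-in for the real block. `ToyBlock` is hypothetical and exists only for this example; it shares one property with Chapter 5’s Block, the shape contract, and that property is all nn.Sequential needs.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Hypothetical stand-in for the Chapter 5 Block: same shape in, same shape out."""
    def __init__(self, d_model):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, x):
        # Pre-LN branch plus residual add; the output keeps the input shape.
        return x + self.ff(self.ln(x))

d_model, num_layers = 384, 6
stack = nn.Sequential(*[ToyBlock(d_model) for _ in range(num_layers)])

x = torch.randn(2, 16, d_model)
y = stack(x)
print(len(stack), y.shape)   # 6 torch.Size([2, 16, 384])
```

Changing `num_layers` changes the depth and nothing else; no other line in the sketch has to move.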

The final layer norm

There is a subtle loose end in the pre-LN convention. Inside each block, the layer norms fire on the branches — before attention, before the feed-forward — while the residual stream itself flows through the + unnormalized. That is exactly what makes pre-LN a clean gradient highway (Chapter 5). But it also means that after the sixth block, the stream that emerges has been accumulating raw additions for six layers and has never once been normalized itself. Before we ask a linear layer to read scores off those vectors, we normalize one last time: ln_f, a single nn.LayerNorm(d_model) after the last block. This final norm is the standard companion of pre-LN in the GPT-2 lineage our code follows, the same convention you will find in Karpathy’s nanoGPT.
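A rough sketch of why the last norm matters. The six blocks are imitated here by six random additions onto the stream, which is an assumption made purely for illustration (real branch outputs are not random noise); the point is only that an un-normalized running sum drifts in scale, and one nn.LayerNorm pulls every vector back to zero mean and unit spread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 384
x = torch.randn(2, 16, d_model)      # stand-in for the residual stream after embedding

# Stand-in for six blocks: each one adds its branch output onto the stream.
# Random branch outputs are an illustrative assumption, not the real computation.
for _ in range(6):
    x = x + torch.randn_like(x)

ln_f = nn.LayerNorm(d_model)
y = ln_f(x)

stream_std = x.std().item()                        # well above 1 after six raw additions
max_abs_mean = y.mean(dim=-1).abs().max().item()   # close to 0 for every vector
mean_square = y.pow(2).mean(dim=-1).mean().item()  # close to 1 for every vector
print(stream_std, max_abs_mean, mean_square)
```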

The language-model head, and what a logit honestly is

Everything so far lives in geometry — 384-dimensional vectors that mean nothing outside the model. The last piece converts geometry back into the vocabulary: lm_head = nn.Linear(d_model, vocab_size), a single linear layer mapping each position’s 384-vector to 512 numbers — one number per token in the vocabulary. Those numbers are called logits, and it is worth being precise about what they are, because sloppy language here is the single most common way people fool themselves about what a model “believes.”

A logit is an unnormalized score. It can be any real number — negative, huge, whatever the matrix multiply produces. A higher logit means the model favors that token more as the next token at that position; that is the entire meaning. Logits are not probabilities: they don’t sum to one, and a logit of 3.2 tells you nothing on its own. Turning the 512 scores into a proper probability distribution takes one more operation — the softmax — and we deliberately do not apply it inside the model. Chapter 7’s loss function applies it internally as part of computing cross-entropy, and Chapter 8’s sampler applies it explicitly when it is time to speak. The model’s own last word is the raw scores.
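The distinction is easy to check numerically. The four scores below are made up for a pretend 4-token vocabulary; the sketch shows that logits do not sum to one, that softmax produces a distribution while preserving the ranking, and that shifting every logit by the same constant changes nothing, which is why a lone value like 3.2 carries no meaning by itself.

```python
import torch

logits = torch.tensor([3.2, -1.0, 0.5, 7.9])   # made-up scores, 4-token vocabulary
probs = torch.softmax(logits, dim=-1)

logit_sum = logits.sum().item()     # some arbitrary number, not 1
prob_sum = probs.sum().item()       # 1, up to float rounding
same_winner = probs.argmax().item() == logits.argmax().item()

# Adding a constant to every logit leaves the distribution unchanged.
shifted = torch.softmax(logits + 10.0, dim=-1)
shift_invariant = torch.allclose(probs, shifted, atol=1e-6)

print(logit_sum, prob_sum, same_winner, shift_invariant)
```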

Notice the symmetry of the two ends of the machine. The embedding table maps token → vector; the head maps vector → a score for every token. They are mirror doors on the same 384-dimensional room. Hold that thought — it becomes Section 6.5.

6.3 The code

Here is the whole machine. Block — and inside it MultiHeadAttention, SelfAttentionHead, and FeedForward — is brought forward from Chapter 5 exactly as written there; nothing is redefined. The class below is everything new this chapter adds, and it absorbs Chapter 2’s GPTFrontEnd into its opening lines.

import torch
import torch.nn as nn

# Block (and everything inside it) is brought forward from Chapter 5
# exactly as written there. Nothing is redefined in this chapter.

# -- the configuration, fixed since Chapter 1 --------------------------------
vocab_size     = 512
context_length = 256
d_model        = 384
num_heads      = 6        # head_size = 384 // 6 = 64
num_layers     = 6        # NEW: how many Blocks to stack
dropout        = 0.1


class GPT(nn.Module):
    """
    The full model: everything this book has built, assembled.
    Input:  (B, T)              -- integer token IDs from the Ch 1 tokenizer
    Output: (B, T, vocab_size)  -- logits: one score per vocab token, at every position
    """
    def __init__(self, vocab_size, context_length, d_model,
                 num_heads, num_layers, dropout=0.0):
        super().__init__()
        self.token_embedding    = nn.Embedding(vocab_size, d_model)       # ch 2: what
        self.position_embedding = nn.Embedding(context_length, d_model)   # ch 2: where
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.Sequential(*[
            Block(d_model, num_heads, context_length, dropout)
            for _ in range(num_layers)
        ])                                       # ch 3-5, repeated num_layers times
        self.ln_f    = nn.LayerNorm(d_model)     # NEW: the final norm
        self.lm_head = nn.Linear(d_model, vocab_size)   # NEW: vectors -> logits

    def forward(self, idx):                      # idx: (B, T) integer token IDs
        B, T = idx.shape
        tok = self.token_embedding(idx)                       # (B, T, d_model)
        pos = self.position_embedding(
            torch.arange(T, device=idx.device))               # (T, d_model)
        x = self.drop(tok + pos)                              # (B, T, d_model)
        x = self.blocks(x)                                    # (B, T, d_model), six times through
        x = self.ln_f(x)                                      # (B, T, d_model)
        logits = self.lm_head(x)                              # (B, T, vocab_size)
        return logits


# -- sanity check -------------------------------------------------------------
model = GPT(vocab_size, context_length, d_model, num_heads, num_layers, dropout)

idx    = torch.randint(0, vocab_size, (32, 256))   # a full (B, T) batch of token IDs
logits = model(idx)
print(logits.shape)     # torch.Size([32, 256, 512])  == (B, T, vocab_size)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)         # 11132672

Line-by-line walk

self.drop after the embedding sum: a light dropout on the combined token+position signal before it enters the stack. This placement follows the GPT-2/nanoGPT lineage; with our dropout = 0.1 it zeroes each entry with probability 0.1 during training, rescales the survivors by 1/0.9 so the expected value is unchanged, and does nothing once the model is switched to eval mode for inference.
nn.Sequential(*[Block(...) ...]): the star unpacks a plain Python list of six freshly constructed blocks. nn.Sequential then calls them in order, each feeding the next — legal only because of the shape contract.
torch.arange(T, device=idx.device): positions 0..T-1, built on whatever device the input lives on. Note it reads only the first T rows of the position table — a shorter-than-context input simply uses fewer positions (exercise 4).
self.lm_head(x): applied to the full (B, T, d_model) tensor, so we get logits at every position, not just the last. That looks wasteful for generation, but it is exactly what training needs: Chapter 2’s data pipeline packs T next-token predictions into every window, and Chapter 7 will score all of them at once.
The output shape is (32, 256, 512): for each of 32 sequences, at each of 256 positions, 512 scores — one per vocabulary token — for what comes next.
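The dropout behavior described in the walk can be observed directly on a standalone nn.Dropout. This is a small sketch on a vector of ones, separate from the model; the vector length of 10,000 is arbitrary and just makes the zeroed fraction easy to read.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(0.1)
x = torch.ones(10_000)

drop.train()                                   # training mode: dropout is active
y_train = drop(x)
zero_fraction = (y_train == 0).float().mean().item()   # close to 0.1
survivor_value = y_train.max().item()                  # 1 / (1 - 0.1), about 1.111

drop.eval()                                    # eval mode: dropout is the identity
y_eval = drop(x)
unchanged_in_eval = torch.equal(y_eval, x)

print(zero_fraction, survivor_value, unchanged_in_eval)
```

A freshly constructed nn.Module starts in training mode, so the sanity check above runs with dropout active unless model.eval() is called first. The output shape is the same either way.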

One honest bookkeeping note before the tower diagram: our lm_head keeps PyTorch’s default bias — 512 extra parameters that GPT-2’s own head does not have (in the original, the output projection is the transposed embedding table, no bias term at all). We keep it because it is what a plain nn.Linear gives you and it changes nothing about the mechanics — but it is a deliberate deviation, and the parameter ledger below counts it. Exercise 3 makes you reason about exactly this bias when the tie in Section 6.5 leaves it stranded.

The whole machine: token IDs enter at the bottom, become vectors, pass through the same block six times, get one final normalization, and leave as logits — a score for all 512 vocabulary tokens at every one of the 256 positions.

6.4 Counting the parameters

Every learnable number in this model can be counted by hand, from the configuration alone — no profiler, no framework magic, just arithmetic. Doing the count once is worth more than any diagram, because it kills the last trace of mystery about where the model is. The model is these tensors. Nothing else. There is no ghost in this machine, only a ledger.

Token embedding: 512 × 384 = 196,608
Position embedding: 256 × 384 = 98,304
Each block: 1,773,312, broken down as:
- attention Q/K/V — 6 heads × 3 bias-free linears × (384 × 64) = 442,368
- attention output projection — 384 × 384 + 384 bias = 147,840
- feed-forward — 384 × 1536 + 1536 plus 1536 × 384 + 384 = 1,181,568
- two layer norms — 2 × (384 + 384) = 1,536
Six blocks: 6 × 1,773,312 = 10,639,872
Final norm ln_f: 384 + 384 = 768
LM head: 384 × 512 + 512 bias = 197,120
Total: 11,132,672 — call it ~11M.
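The same ledger can be written as plain arithmetic, with no PyTorch involved. The function name `count_gpt_params` is made up for this sketch, and it bakes in the choices the ledger states: bias-free Q/K/V linears, a 4× feed-forward expansion, and a head that keeps its bias.

```python
def count_gpt_params(vocab_size, context_length, d_model, num_heads, num_layers):
    """Hand count following the chapter's ledger (bias-free Q/K/V, biased head)."""
    head_size = d_model // num_heads
    d_ff = 4 * d_model                               # the 4x feed-forward expansion

    token_emb = vocab_size * d_model
    pos_emb = context_length * d_model

    qkv = num_heads * 3 * (d_model * head_size)      # three bias-free linears per head
    attn_proj = d_model * d_model + d_model          # weight + bias
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (d_model + d_model)            # weight + bias, two norms
    per_block = qkv + attn_proj + feed_forward + layer_norms

    ln_f = d_model + d_model
    lm_head = d_model * vocab_size + vocab_size      # weight + bias

    total = token_emb + pos_emb + num_layers * per_block + ln_f + lm_head
    return per_block, total

per_block, total = count_gpt_params(512, 256, 384, 6, 6)
print(f"{per_block:,}  {total:,}")   # 1,773,312  11,132,672
```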

And the one-line verification, which the sanity check above already printed:

total = sum(p.numel() for p in model.parameters())
print(f"{total:,}")        # 11,132,672

# where does it live? mostly in the blocks:
for name, module in [("embeddings", nn.ModuleList([model.token_embedding,
                                                   model.position_embedding])),
                     ("blocks",     model.blocks),
                     ("ln_f",       model.ln_f),
                     ("lm_head",    model.lm_head)]:
    print(name, sum(p.numel() for p in module.parameters()))

Two observations from the ledger. First, the blocks own about 96% of the model, and within each block the feed-forward network alone owns about two-thirds. Depth and the 4× expansion are where the capacity lives. Second, the count scales in ways you can now predict: every added block contributes exactly 1,773,312 parameters, so doubling num_layers from 6 to 12 adds 10,639,872, while the embeddings and head don’t move at all (exercise 1 makes you prove this).

Drawn to scale, that first observation stops being a statistic and becomes a picture. The blocks are the model; everything else is a rounding error clinging to the ends.

Every slice is an exact count, drawn to true width. The six blocks are the model — 95.6% of it — and inside one block, the feed-forward network alone is about two-thirds. The final norm’s 768 parameters are a share too thin to draw.

Now the perspective ladder, so the number 11 million sits honestly. Our model shares its architecture with GPT-2 — the same embeddings, the same pre-LN blocks, the same final-norm-then-head exit. GPT-2’s smallest released checkpoint weighs in around 124 million parameters, roughly eleven of ours; its largest is a 1.5-billion-parameter transformer, per Radford et al. (2019) — about 135 of ours. OpenAI judged that largest model consequential enough to release in stages across 2019, small to large, culminating in the full 1.5B release that November. The ladder continues upward from there into models whose parameter counts are corporate secrets — the frontier stops publishing the number right about where it starts to matter. But every rung below that silence is the same move: the same architecture, more of it — wider d_model, more heads, more blocks, bigger vocabulary. Nothing on the bench changes shape. It just multiplies.

6.5 Weight tying — the mirror doors

Section 6.2 noticed a symmetry: the token embedding is a 512 × 384 table mapping tokens into the model’s space, and the LM head’s weight is a 512 × 384 matrix mapping the model’s space back onto tokens. Same shape, mirrored jobs. Press & Wolf (2016, arXiv:1608.05859) proposed making them literally the same matrix — weight tying — and found it both shrinks the model and improves its perplexity. The intuition: if token 371’s embedding is the direction that means token 371, then scoring “how much does this hidden vector point toward token 371” can reasonably reuse the same direction.

Two doors on the same 384-dimensional room: the embedding is the way in (token → vector), the head is the way out (vector → a score for every token). Same shape, mirrored jobs — and weight tying is the choice to make them one matrix.

GPT-2 does exactly this. In OpenAI’s released source code, there is no separate output matrix at all — the token-embedding table wte is reused, transposed, to produce the logits. In our PyTorch, the whole option is one line:

# optional: tie the head to the embedding table (the GPT-2 choice)
model.lm_head.weight = model.token_embedding.weight   # now the SAME tensor, shared

We keep ours untied, and this is a marked deviation from GPT-2: two matrices, two jobs, so that when you inspect the model in these chapters you always know which door you are looking at. The price of that clarity is 196,608 parameters — tied, the model would count 10,936,064 instead of 11,132,672 (exercise 3 has you verify this). When Chapter 9 pours GPT-2’s real weights into this class, tying will switch on, because the checkpoint only ships one matrix.
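The effect of the tie on the parameter count can be checked without the full model. The sketch below uses a hypothetical two-layer container, `doors`, holding only an embedding and a head; it relies on the fact that Module.parameters() yields a shared tensor once, so the tied matrix is counted a single time while the head’s bias remains.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 512, 384

# Hypothetical container for this sketch: just the two "doors", nothing in between.
doors = nn.ModuleDict({
    "token_embedding": nn.Embedding(vocab_size, d_model),
    "lm_head": nn.Linear(d_model, vocab_size),
})

def count(module):
    return sum(p.numel() for p in module.parameters())

untied = count(doors)
doors["lm_head"].weight = doors["token_embedding"].weight   # the one-line tie
tied = count(doors)

same_tensor = doors["lm_head"].weight is doors["token_embedding"].weight
out_shape = tuple(doors["lm_head"](torch.randn(2, d_model)).shape)

print(untied - tied)   # 196608, i.e. 512 * 384
print(same_tensor, out_shape)
```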

6.6 The machine is built, and it is noise

Run the sanity check again and sit with what it actually did. A batch of 32 sequences, 256 tokens each, went through the entire machine — embeddings, six blocks, the final norm, the head — and out came (32, 256, 512): over four million logits, a complete opinion about the next token at every position of every sequence. The plumbing is finished. Every shape is right.

And every one of those four million opinions is garbage. The 11,132,672 parameters were initialized to random values, and the model has never seen a byte of text. GPT-2’s own recipe draws weights from a normal distribution with standard deviation 0.02 (the w_init_stdev in openai/gpt-2, src/model.py). PyTorch’s defaults are random too, though not identical: nn.Linear draws from a small uniform range scaled by its input width, while nn.Embedding draws from a standard normal. Either way, the logits at every position are near-meaningless wobbles around zero, which after a softmax would give something close to a uniform distribution over the vocabulary: every token roughly equally likely, always. The plumbing is perfect and the water is mud.

We can even say precisely how wrong it is. For a model that spreads its belief uniformly over 512 tokens, the standard scoring rule — the cross-entropy loss Chapter 7 builds — assigns exactly ln(512) ≈ 6.24 nats of loss per prediction. That is arithmetic, not an experiment: the natural log of the vocabulary size is what perfect ignorance costs. A freshly initialized model isn’t exactly uniform, so in practice you will measure something in that neighborhood rather than the exact figure — but 6.24 is the number to hold in your head. It is the starting line.
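The arithmetic can be confirmed with PyTorch’s built-in cross-entropy, used here as a stand-in for the loss Chapter 7 builds. All-zero logits are the idealized case of perfect ignorance: identical scores become an exactly uniform distribution after softmax, so the loss equals ln(512) regardless of which targets are drawn. The batch of 8 is arbitrary.

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 512
logits = torch.zeros(8, vocab_size)             # identical scores -> uniform after softmax
targets = torch.randint(0, vocab_size, (8,))    # any targets; uniform belief costs the same

loss = F.cross_entropy(logits, targets).item()
print(round(loss, 4), round(math.log(vocab_size), 4))   # 6.2383 6.2383
```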

This is the honest state of the art on our bench: a structurally complete GPT — architecturally the same animal as the models running the world’s chat windows — that knows nothing whatsoever. Between this machine and one that speaks stands a single procedure: show it text, score its surprise, and nudge all eleven million numbers downhill, an enormous number of times. That is the entire distance from noise to language, and it is not a mystery either. It has a name, a formula, and a loop, and it is the whole of the next chapter.

6.7 The thing to actually understand

Assembly was free because the contract held. Every part preserves (B, T, C), so the full model is a pipeline, not a puzzle: embed, stack, normalize, project. The hard design work was done in Chapters 1–5; this chapter is composition.
Depth is repetition, not invention. A deeper GPT is the same block more times. num_layers is a multiplier on one design, and the shape contract is what makes the multiplier legal.
Logits are scores, not probabilities. The model’s final output is 512 unnormalized numbers per position. Softmax — inside the loss (Ch 7) or the sampler (Ch 8) — is what turns scores into a distribution, and it lives outside the model.
You can count every number: 11,132,672. The architecture is fully transparent — you just built it and audited the ledger, line by line, no NDA required. Whatever opacity this book is named for, it is not in the wiring diagram. It will live in the values those 11M numbers take on — and that is exactly where we are headed.
Same skeleton as the giants. GPT-2 is this same architecture at 124M to 1.5B parameters; the deviations we chose (ReLU, untied weights, a bias on the head) are marked and reversible. Scale is a difference of degree, not of kind, and Chapter 9 will test that claim in the least forgiving way there is: by pouring GPT-2’s real weights into this very class and pressing run.

6.8 Exercises

1. Prove where the capacity lives. Build the model with num_layers = 1, 2, and 12 (leave everything else fixed). Count parameters each time. Verify the count moves by exactly 1,773,312 per block, and that embeddings, ln_f, and the head never move. What fraction of the 12-layer model do the blocks own?
2. Recount at another width. On paper first: set d_model = 768, num_heads = 12 (keep vocab 512, context 256, 6 layers). Work out the per-block count by hand using Section 6.4’s ledger, then build it and check with sum(p.numel() ...). Notice which terms grow linearly in d_model and which grow with its square.
3. Tie the weights and count again. Apply the one-line tie from Section 6.5, then recount. Confirm you get 10,936,064, and explain, in one sentence, which tensor stopped being counted and why the bias of lm_head survived.
4. Short sequences. Feed the model a batch with T = 32 instead of 256: torch.randint(0, 512, (4, 32)). Predict the logits shape before you run it. Why does the position embedding not complain? (Look at what torch.arange(T) actually reads.) Then try T = 300 and explain the failure; you predicted this one back in Chapter 2.
5. Find the single biggest tensor. Loop over model.named_parameters() and print each name with its numel(). Which single weight matrix is the largest in the whole model, and does the answer surprise you given how much attention this book has spent on attention?

What’s next

Ch 7 — Teaching It to Predict

A 37th-Chamber original. Methods cited: Radford et al. (2019), “Language Models are Unsupervised Multitask Learners” — cdn.openai.com PDF (largest GPT-2 = 1.5B parameters — confirmed); OpenAI’s staged 2019 release of GPT-2, GPT-2: 1.5B release (confirmed); Press & Wolf (2016), “Using the Output Embedding to Improve Language Models,” arXiv:1608.05859 (weight tying — confirmed); GPT-2’s tied embeddings verified in the released source, openai/gpt-2 src/model.py (confirmed); the final-LayerNorm and dropout placement follow the GPT-2 lineage as implemented in Karpathy’s nanoGPT (confirmed). ln(512) ≈ 6.2383 is arithmetic, not a citation; the 124M figure for GPT-2’s smallest checkpoint is stated from the released checkpoints and examined in detail in Chapter 9. All prose and code written fresh.