The Opaque Box · Chapter 6

Assembling the GPT

Chapters 1 through 5 machined the parts: a tokenizer, two embedding tables, the attention head, the multi-head mixer, the block that routes and thinks. Every one of them is on the bench now, tested and shape-true. This chapter invents nothing — it bolts the parts together in the only order the shapes allow, then counts every number in the machine, all 11,132,672 of them, by hand. What stands at the end is a complete GPT that cannot yet do a single useful thing — which is the most honest fact about it, and the whole reason the next chapter exists.

6.1  Where we are — the parts on the bench

Before assembly, walk the bench. Five chapters produced five parts, and it is worth stating, in one breath each, what every part does, because the whole model is nothing more than these five things in sequence. No sixth secret. No hidden magic. Five parts, in order.

The tokenizer (Chapter 1) turns raw text into a stream of integers drawn from a 512-token vocabulary — encode_bpe going in, decode coming out, with the learned merges table between them. It is the only part of the system that ever touches actual text.

The embedding tables (Chapter 2) turn each integer into a 384-dimensional vector: token_embedding answers what the token is, position_embedding answers where in the window it sits, and their sum is the (B, T, C) tensor that flows through everything downstream. Chapter 2 also built the input pipeline — NextTokenDataset and its DataLoader — which we will not need until training begins in Chapter 7.

The attention head (Chapter 3) lets each position look back at earlier positions — never forward, thanks to the tril causal mask — and pull in a weighted blend of what it finds. One head, one pattern of looking.

Multi-head attention (Chapter 4) runs six of those heads in parallel, each 64 dimensions wide, concatenates their answers, and mixes them through a final projection. Six patterns of looking, fused into one shape-preserving operation.

The block (Chapter 5) wraps multi-head attention and a position-wise feed-forward network in residual connections and pre-LN layer norms: route, then think, with a gradient highway running straight through both. Its defining property is the shape contract — (B, T, d_model) in, (B, T, d_model) out.

That contract is about to pay for itself, all at once. Because the block preserves its shape exactly, building a deep model requires no engineering at all — no glue, no adapters, no special case for layer seven. You stack.

It is worth watching the shape travel through the whole machine before we watch the parts. The tokenizer hands in a flat grid of integers; the embeddings inflate each integer into a 384-vector; every block leaves that shape untouched; and only at the very last door does the shape change again — from the model’s private 384-dimensional geometry back out to one score per vocabulary token. Three shapes, two changes, and a long stretch in the middle where nothing moves but the numbers.

[Figure: the shape contract, read left to right. Token IDs of shape (B, T) are embedded into (B, T, 384); the six transformer blocks and the final layer norm hold that shape unchanged; the language-model head projects to logits of shape (B, T, 512).]
The shape contract, end to end: the shape changes exactly twice — once to enter the model’s geometry, once to leave it — and the long blue middle, six blocks and a final norm, never changes shape at all.
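The two shape changes at the ends can be watched in isolation. This is a minimal sketch, assuming only the configuration values quoted in this chapter and a deliberately small batch; the blocks in the middle are left out because they would not change any shape.

```python
import torch
import torch.nn as nn

B, T = 2, 8                                   # small batch, for illustration only
vocab_size, context_length, d_model = 512, 256, 384

token_embedding    = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(context_length, d_model)
lm_head            = nn.Linear(d_model, vocab_size)

idx = torch.randint(0, vocab_size, (B, T))    # (B, T) token IDs
tok = token_embedding(idx)                    # (B, T, 384): first shape change
pos = position_embedding(torch.arange(T))     # (T, 384): one row per position
x   = tok + pos                               # broadcasts over the batch -> (B, T, 384)
logits = lm_head(x)                           # (B, T, 512): second shape change

shapes = [tuple(t.shape) for t in (idx, tok, pos, x, logits)]
print(shapes)
```

Note that the position rows carry no batch dimension; the addition broadcasts the same (T, 384) grid across every sequence in the batch.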

6.2  The missing pieces

Three things stand between the bench and a working forward pass. None of them is hard — two are a single line each, and the third is a single word of caution about a loose end most tutorials never mention.

Stacking N blocks

The depth of a GPT is literally the number of times Chapter 5’s block repeats. Since every block maps (B, T, 384) to (B, T, 384), the output of block one is a legal input to block two, and so on forever. nn.Sequential(*[Block(...) for _ in range(num_layers)]) is the entire construction. We set num_layers = 6 — the last free knob in the configuration we fixed back in Chapter 1, now finally used.
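To see why stacking needs no glue, here is a small sketch using a stand-in block. `ToyBlock` is hypothetical: it is not Chapter 5's Block and contains no attention. It shares only the real block's shape contract, which is all the stacking argument relies on.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Hypothetical stand-in for Chapter 5's Block: same shape contract, no attention."""
    def __init__(self, d_model):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, x):
        # pre-LN residual branch: (B, T, d_model) in, (B, T, d_model) out
        return x + self.ff(self.ln(x))

d_model, num_layers = 384, 6
stack = nn.Sequential(*[ToyBlock(d_model) for _ in range(num_layers)])

x = torch.randn(2, 8, d_model)      # a small (B, T, d_model) batch
y = stack(x)
out_shape = tuple(y.shape)
print(out_shape)                    # (2, 8, 384): unchanged by the stack
print(len(stack))                   # 6
```

Changing `num_layers` to any other positive integer leaves the output shape exactly as it is, which is the whole point.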

The final layer norm

There is a subtle loose end in the pre-LN convention. Inside each block, the layer norms fire on the branches — before attention, before the feed-forward — while the residual stream itself flows through the + unnormalized. That is exactly what makes pre-LN a clean gradient highway (Chapter 5). But it also means that after the sixth block, the stream that emerges has been accumulating raw additions for six layers and has never once been normalized itself. Before we ask a linear layer to read scores off those vectors, we normalize one last time: ln_f, a single nn.LayerNorm(d_model) after the last block. This final norm is the standard companion of pre-LN in the GPT-2 lineage our code follows, the same convention you will find in Karpathy’s nanoGPT.
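A rough illustration of what ln_f cleans up. The "branch outputs" below are plain random tensors, an assumption standing in for what six real blocks would add to the stream; the point is only that repeated unnormalized additions widen the spread of each vector, and one LayerNorm brings it back.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 384

# Stand-in residual stream: start from one vector per position, then add
# six unnormalized branch outputs, as six pre-LN blocks would.
x = torch.randn(4, 16, d_model)
for _ in range(6):
    x = x + torch.randn(4, 16, d_model)    # hypothetical branch output

ln_f = nn.LayerNorm(d_model)                # same construction as the chapter's ln_f
y = ln_f(x)

std_before = x.std(dim=-1).mean().item()    # typical per-vector spread before ln_f
std_after  = y.std(dim=-1).mean().item()    # typical per-vector spread after ln_f
mean_after = y.mean(dim=-1).abs().max().item()
print(f"{std_before:.2f} -> {std_after:.2f}")   # spread falls from about sqrt(7) to about 1
```

In a trained model ln_f also learns a per-dimension scale and shift; at initialization those are ones and zeros, so the sketch shows the normalization alone.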

The language-model head, and what a logit honestly is

Everything so far lives in geometry — 384-dimensional vectors that mean nothing outside the model. The last piece converts geometry back into the vocabulary: lm_head = nn.Linear(d_model, vocab_size), a single linear layer mapping each position’s 384-vector to 512 numbers — one number per token in the vocabulary. Those numbers are called logits, and it is worth being precise about what they are, because sloppy language here is the single most common way people fool themselves about what a model “believes.”

A logit is an unnormalized score. It can be any real number — negative, huge, whatever the matrix multiply produces. A higher logit means the model favors that token more as the next token at that position; that is the entire meaning. Logits are not probabilities: they don’t sum to one, and a logit of 3.2 tells you nothing on its own. Turning the 512 scores into a proper probability distribution takes one more operation — the softmax — and we deliberately do not apply it inside the model. Chapter 7’s loss function applies it internally as part of computing cross-entropy, and Chapter 8’s sampler applies it explicitly when it is time to speak. The model’s own last word is the raw scores.
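The difference between a logit and a probability fits in a few lines of plain Python. The scores below are made up for a four-token vocabulary; the sketch shows that logits need not sum to one, that softmax fixes this, and that adding the same constant to every logit changes nothing, which is why a lone logit of 3.2 carries no meaning by itself.

```python
import math

def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to one."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits  = [3.2, -1.0, 0.5, 0.0]              # made-up scores, 4-token vocabulary
shifted = [z + 100.0 for z in logits]        # same scores, all raised by 100

probs         = softmax(logits)
probs_shifted = softmax(shifted)
winner        = max(range(len(probs)), key=lambda i: probs[i])

print(sum(logits))                           # not 1: logits are not probabilities
print(round(sum(probs), 6))                  # 1.0
print(winner)                                # 0: the highest logit wins
```

Only the gaps between logits survive the softmax, so `probs` and `probs_shifted` agree to within floating-point rounding.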

Notice the symmetry of the two ends of the machine. The embedding table maps token → vector; the head maps vector → a score for every token. They are mirror doors on the same 384-dimensional room. Hold that thought — it becomes Section 6.5.


6.3  The code

Here is the whole machine. Block — and inside it MultiHeadAttention, SelfAttentionHead, and FeedForward — is brought forward from Chapter 5 exactly as written there; nothing is redefined. The class below is everything new this chapter adds, and it absorbs Chapter 2’s GPTFrontEnd into its opening lines.

import torch
import torch.nn as nn

# Block (and everything inside it) is brought forward from Chapter 5
# exactly as written there. Nothing is redefined in this chapter.

# -- the configuration, fixed since Chapter 1 --------------------------------
vocab_size     = 512
context_length = 256
d_model        = 384
num_heads      = 6        # head_size = 384 // 6 = 64
num_layers     = 6        # NEW: how many Blocks to stack
dropout        = 0.1


class GPT(nn.Module):
    """
    The full model: everything this book has built, assembled.
    Input:  (B, T)              -- integer token IDs from the Ch 1 tokenizer
    Output: (B, T, vocab_size)  -- logits: one score per vocab token, at every position
    """
    def __init__(self, vocab_size, context_length, d_model,
                 num_heads, num_layers, dropout=0.0):
        super().__init__()
        self.token_embedding    = nn.Embedding(vocab_size, d_model)       # ch 2: what
        self.position_embedding = nn.Embedding(context_length, d_model)   # ch 2: where
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.Sequential(*[
            Block(d_model, num_heads, context_length, dropout)
            for _ in range(num_layers)
        ])                                       # ch 3-5, repeated num_layers times
        self.ln_f    = nn.LayerNorm(d_model)     # NEW: the final norm
        self.lm_head = nn.Linear(d_model, vocab_size)   # NEW: vectors -> logits

    def forward(self, idx):                      # idx: (B, T) integer token IDs
        B, T = idx.shape
        tok = self.token_embedding(idx)                       # (B, T, d_model)
        pos = self.position_embedding(
            torch.arange(T, device=idx.device))               # (T, d_model)
        x = self.drop(tok + pos)                              # (B, T, d_model)
        x = self.blocks(x)                                    # (B, T, d_model), six times through
        x = self.ln_f(x)                                      # (B, T, d_model)
        logits = self.lm_head(x)                              # (B, T, vocab_size)
        return logits


# -- sanity check -------------------------------------------------------------
model = GPT(vocab_size, context_length, d_model, num_heads, num_layers, dropout)

idx    = torch.randint(0, vocab_size, (32, 256))   # a full (B, T) batch of token IDs
logits = model(idx)
print(logits.shape)     # torch.Size([32, 256, 512])  == (B, T, vocab_size)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)         # 11132672

Line-by-line walk

One honest bookkeeping note before the tower diagram: our lm_head keeps PyTorch’s default bias — 512 extra parameters that GPT-2’s own head does not have (in the original, the output projection is the transposed embedding table, no bias term at all). We keep it because it is what a plain nn.Linear gives you and it changes nothing about the mechanics — but it is a deliberate deviation, and the parameter ledger below counts it. Exercise 3 makes you reason about exactly this bias when the tie in Section 6.5 leaves it stranded.

[Figure: the assembled GPT, read bottom to top. Token IDs of shape (B, T) enter the token-plus-position embeddings, giving (B, T, 384); six stacked transformer blocks preserve that shape; ln_f normalizes; the LM head, a linear layer from 384 to 512, produces logits of shape (B, T, 512).]
The whole machine: token IDs enter at the bottom, become vectors, pass through the same block six times, get one final normalization, and leave as logits — a score for all 512 vocabulary tokens at every one of the 256 positions.

6.4  Counting the parameters

Every learnable number in this model can be counted by hand, from the configuration alone — no profiler, no framework magic, just arithmetic. Doing the count once is worth more than any diagram, because it kills the last trace of mystery about where the model is. The model is these tensors. Nothing else. There is no ghost in this machine, only a ledger.
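The hand count can be written out as plain arithmetic. The per-block layout below is an assumption about Chapter 5's Block (Q/K/V projections without bias, an output projection with bias, a 4× feed-forward with biases, two layer norms), chosen because it reproduces the per-block totals this chapter states.

```python
# configuration from Section 6.3
vocab_size, context_length, d_model, num_layers = 512, 256, 384, 6

# embeddings: one row per token, one row per position
tok_emb = vocab_size * d_model                    # 196,608
pos_emb = context_length * d_model                #  98,304

# one block (assumed layout, chosen to match the chapter's stated totals)
qkv       = 3 * d_model * d_model                 # Q, K, V across all heads, no bias
attn_proj = d_model * d_model + d_model           # output projection, with bias
ffn       = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
norms     = 2 * (2 * d_model)                     # two LayerNorms, weight + bias each
block     = qkv + attn_proj + ffn + norms         # 1,773,312

ln_f    = 2 * d_model                             # 768
lm_head = d_model * vocab_size + vocab_size       # 197,120, bias included

total = tok_emb + pos_emb + num_layers * block + ln_f + lm_head
print(f"{total:,}")                               # 11,132,672
```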

The one-line verification, which the sanity check above already printed, followed by a breakdown by component:

total = sum(p.numel() for p in model.parameters())
print(f"{total:,}")        # 11,132,672

# where does it live? mostly in the blocks:
for name, module in [("embeddings", nn.ModuleList([model.token_embedding,
                                                   model.position_embedding])),
                     ("blocks",     model.blocks),
                     ("ln_f",       model.ln_f),
                     ("lm_head",    model.lm_head)]:
    print(name, sum(p.numel() for p in module.parameters()))

Two observations from the ledger. First, the blocks own about 96% of the model, and within each block the feed-forward network alone owns about two-thirds. Depth and the 4× expansion are where the capacity lives. Second, the count scales in ways you can now predict: every added block adds exactly 1,773,312 parameters, so doubling num_layers from 6 to 12 adds 10,639,872, while the embeddings, ln_f, and the head don’t move at all (exercise 1 makes you prove this).
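The depth scaling can be sketched as a one-line function, using only totals quoted in this chapter. The names `PER_BLOCK`, `FIXED`, and `total_params` are illustrative, not part of the book's code.

```python
PER_BLOCK = 1_773_312                  # one block, as counted in this section
FIXED     = 294_912 + 768 + 197_120    # embeddings + ln_f + lm_head

def total_params(num_layers):
    """Total parameter count as a function of depth, everything else fixed."""
    return FIXED + num_layers * PER_BLOCK

six   = total_params(6)
share = 6 * PER_BLOCK / six            # fraction of the model owned by the blocks

print(f"{six:,}")                      # 11,132,672
print(f"{share:.1%}")                  # 95.6%
print(total_params(7) - six)           # 1773312: the cost of one extra block
```

The count is a straight line in `num_layers`: a fixed intercept for the two ends of the machine, and a slope of one block.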

Drawn to scale, that first observation stops being a statistic and becomes a picture. The blocks are the model; everything else is a rounding error clinging to the ends.

[Figure: where the 11,132,672 parameters live, drawn to true scale. Upper bar, the whole model: six blocks 10,639,872 (about 95.6%); embeddings 294,912 (2.6%); language-model head 197,120 (1.8%); ln_f 768. Lower bar, one block’s 1,773,312 parameters: feed-forward network 1,181,568 (about two-thirds); attention Q/K/V plus output projection 590,208; two layer norms 1,536. All figures are exact counts, not estimates.]
Every slice is an exact count, drawn to true width. The six blocks are the model — 95.6% of it — and inside one block, the feed-forward network alone is about two-thirds. The final norm’s 768 parameters are a share too thin to draw.

Now the perspective ladder, so the number 11 million sits honestly. Our model shares its architecture with GPT-2 — the same embeddings, the same pre-LN blocks, the same final-norm-then-head exit. GPT-2’s smallest released checkpoint weighs in around 124 million parameters, roughly eleven of ours; its largest is a 1.5-billion-parameter transformer, per Radford et al. (2019) — about 135 of ours. OpenAI judged that largest model consequential enough to release in stages across 2019, small to large, culminating in the full 1.5B release that November. The ladder continues upward from there into models whose parameter counts are corporate secrets — the frontier stops publishing the number right about where it starts to matter. But every rung below that silence is the same move: the same architecture, more of it — wider d_model, more heads, more blocks, bigger vocabulary. Nothing on the bench changes shape. It just multiplies.


6.5  Weight tying — the mirror doors

Section 6.2 noticed a symmetry: the token embedding is a 512 × 384 table mapping tokens into the model’s space, and the LM head’s weight is a 512 × 384 matrix mapping the model’s space back onto tokens. Same shape, mirrored jobs. Press & Wolf (2016, arXiv:1608.05859) proposed making them literally the same matrix — weight tying — and found it both shrinks the model and improves its perplexity. The intuition: if token 371’s embedding is the direction that means token 371, then scoring “how much does this hidden vector point toward token 371” can reasonably reuse the same direction.

[Figure: weight tying. The model’s 384-dimensional geometry is a room with two doors. On the left, the token embedding, a 512 × 384 table mapping token to vector: the way in. On the right, drawn as its mirror, the language-model head, a 512 × 384 matrix mapping vector to a score for every token: the way out. Weight tying makes the two doors one shared matrix, which is GPT-2’s choice; this book keeps them separate, so you always know which door you face.]
Two doors on the same 384-dimensional room: the embedding is the way in (token → vector), the head is the way out (vector → a score for every token). Same shape, mirrored jobs — and weight tying is the choice to make them one matrix.

GPT-2 does exactly this. In OpenAI’s released source code, there is no separate output matrix at all — the token-embedding table wte is reused, transposed, to produce the logits. In our PyTorch, the whole option is one line:

# optional: tie the head to the embedding table (the GPT-2 choice)
model.lm_head.weight = model.token_embedding.weight   # now the SAME tensor, shared

We keep ours untied, and this is a marked deviation from GPT-2: two matrices, two jobs, so that when you inspect the model in these chapters you always know which door you are looking at. The price of that clarity is 196,608 parameters — tied, the model would count 10,936,064 instead of 11,132,672 (exercise 3 has you verify this). When Chapter 9 pours GPT-2’s real weights into this class, tying will switch on, because the checkpoint only ships one matrix.
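The effect of the tie on the count can be checked without building the full model. `TwoDoors` below is a hypothetical mini-module holding only the embedding table and the head, so the sketch runs without Chapter 5's Block; the saving it measures is the same 196,608.

```python
import torch.nn as nn

vocab_size, d_model = 512, 384

class TwoDoors(nn.Module):
    """Hypothetical mini-model: only the embedding table and the head."""
    def __init__(self):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

def count(module):
    # parameters() yields a shared tensor only once
    return sum(p.numel() for p in module.parameters())

doors  = TwoDoors()
untied = count(doors)

doors.lm_head.weight = doors.token_embedding.weight   # the one-line tie
tied   = count(doors)
same   = doors.lm_head.weight is doors.token_embedding.weight

print(untied - tied)     # 196608
print(same)              # True: one tensor behind both doors
```

Because the two names now point at one tensor, a gradient step through either door moves both.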


6.6  The machine is built, and it is noise

Run the sanity check again and sit with what it actually did. A batch of 32 sequences, 256 tokens each, went through the entire machine — embeddings, six blocks, the final norm, the head — and out came (32, 256, 512): over four million logits, a complete opinion about the next token at every position of every sequence. The plumbing is finished. Every shape is right.

And every one of those four million opinions is garbage. The 11,132,672 parameters were initialized to random values and nothing more. GPT-2’s own recipe draws them from a normal distribution with standard deviation 0.02 (the w_init_stdev in openai/gpt-2, src/model.py). PyTorch’s defaults are random in the same spirit but not to the same scale: nn.Linear weights start small, while nn.Embedding rows are drawn from a unit normal. Either way, the model has never seen a byte of text. Because ln_f rescales every vector just before the freshly initialized head reads it, the logits at every position are near-meaningless wobbles around zero, which after a softmax would give something close to a uniform distribution over the vocabulary: every token roughly equally likely, always. The plumbing is perfect and the water is mud.

We can even say precisely how wrong it is. For a model that spreads its belief uniformly over 512 tokens, the standard scoring rule — the cross-entropy loss Chapter 7 builds — assigns exactly ln(512) ≈ 6.24 nats of loss per prediction. That is arithmetic, not an experiment: the natural log of the vocabulary size is what perfect ignorance costs. A freshly initialized model isn’t exactly uniform, so in practice you will measure something in that neighborhood rather than the exact figure — but 6.24 is the number to hold in your head. It is the starting line.
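The starting line can be checked with the standard library alone. The first part is exact arithmetic; the second uses small Gaussian logits as an assumed stand-in for an untrained model's output, so the number it prints is illustrative rather than a measurement of this chapter's model.

```python
import math
import random

vocab_size = 512

def cross_entropy(logits, target):
    """Loss in nats for one prediction: -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Perfect ignorance: every token gets the same score.
uniform_loss = cross_entropy([0.0] * vocab_size, target=7)
print(round(uniform_loss, 4))                 # 6.2383, which is ln(512)

# Near-ignorance: small random scores (assumed stand-in, not the real model).
random.seed(0)
losses = []
for _ in range(200):
    logits = [random.gauss(0.0, 0.1) for _ in range(vocab_size)]
    losses.append(cross_entropy(logits, random.randrange(vocab_size)))
mean_loss = sum(losses) / len(losses)
print(round(mean_loss, 2))                    # in the neighborhood of ln(512)
```

The choice of target token does not matter in the uniform case: every token is scored identically, so every prediction costs the same.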

This is the honest state of the art on our bench: a structurally complete GPT — architecturally the same animal as the models running the world’s chat windows — that knows nothing whatsoever. Between this machine and one that speaks stands a single procedure: show it text, score its surprise, and nudge all eleven million numbers downhill, an enormous number of times. That is the entire distance from noise to language, and it is not a mystery either. It has a name, a formula, and a loop, and it is the whole of the next chapter.


6.7  The thing to actually understand


6.8  Exercises

  1. Prove where the capacity lives. Build the model with num_layers = 1, 2, and 12 (leave everything else fixed). Count parameters each time. Verify the count moves by exactly 1,773,312 per block, and that embeddings, ln_f, and the head never move. What fraction of the 12-layer model do the blocks own?
  2. Recount at another width. On paper first: set d_model = 768, num_heads = 12 (keep vocab 512, context 256, 6 layers). Work out the per-block count by hand using Section 6.4’s ledger, then build it and check with sum(p.numel() ...). Notice which terms grow linearly in d_model and which grow with its square.
  3. Tie the weights and count again. Apply the one-line tie from Section 6.5, then recount. Confirm you get 10,936,064 — and explain, in one sentence, which tensor stopped being counted and why the bias of lm_head survived.
  4. Short sequences. Feed the model a batch with T = 32 instead of 256 — torch.randint(0, 512, (4, 32)). Predict the logits shape before you run it. Why does the position embedding not complain? (Look at what torch.arange(T) actually reads.) Then try T = 300 and explain the failure — you predicted this one back in Chapter 2.
  5. Find the single biggest tensor. Loop over model.named_parameters() and print each name with its numel(). Which single weight matrix is the largest in the whole model, and does the answer surprise you given how much attention this book has spent on attention?
What’s next
Ch 7 — Teaching It to Predict

A 37th-Chamber original. Methods cited: Radford et al. (2019), “Language Models are Unsupervised Multitask Learners” — cdn.openai.com PDF (largest GPT-2 = 1.5B parameters — confirmed); OpenAI’s staged 2019 release of GPT-2, GPT-2: 1.5B release (confirmed); Press & Wolf (2016), “Using the Output Embedding to Improve Language Models,” arXiv:1608.05859 (weight tying — confirmed); GPT-2’s tied embeddings verified in the released source, openai/gpt-2 src/model.py (confirmed); the final-LayerNorm and dropout placement follow the GPT-2 lineage as implemented in Karpathy’s nanoGPT (confirmed). ln(512) ≈ 6.2383 is arithmetic, not a citation; the 124M figure for GPT-2’s smallest checkpoint is stated from the released checkpoints and examined in detail in Chapter 9. All prose and code written fresh.

Written by a Fable · Edited by bobby-dig8al