The Opaque Box · Chapter 5

The Transformer Block

Chapters 3 and 4 built a machine that mixes information between positions. Multi-head attention looks outward — each token reaching across the sequence, gathering what it needs from other tokens. But that gathering is a routing operation, not a reasoning one. Once the positions have shared notes, something has to think about what they heard — independently, at each position. The transformer block assembles both stages, then wraps them in the plumbing that makes the whole thing stackable.

5.1  Where we are

After Chapter 4, MultiHeadAttention takes (B, T, C) and returns (B, T, C) — the same shape it received. That shape-preserving contract was deliberate: it is the condition that lets us compose operations without reshaping. Each sublayer in the transformer is built to honor it.

What multi-head attention does to that tensor is routing: it lets each of the T positions look at the others and update itself with a weighted blend of their values. After that operation, every position’s vector carries not just its own identity but a mix of context from across the sequence.

What it does not do is transform the features nonlinearly. The Q/K/V projections and the output projection are linear maps, and the softmax (the one nonlinear step) only decides how much weight each position receives. Given those weights, the output at each position is a weighted average of linearly projected value vectors. Nothing in that pipeline applies a nonlinearity to the blended features themselves, so attention alone cannot compute much that is new from what it gathered. For that, the block adds a second sublayer: a small feed-forward network that runs independently at every position.
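One way to see the linear part of this is to freeze the attention weights and check that the value mix is linear in the values. The sketch below is illustrative only: the sizes and tensor names are made up for the example, and the weights are drawn at random rather than computed from real queries and keys.

```python
import torch

torch.manual_seed(0)
T, head_size = 5, 8                                # illustrative sizes, not from the chapter

# Stand-in attention weights: each row is non-negative and sums to 1.
wei = torch.softmax(torch.randn(T, T), dim=-1)     # (T, T)

v1 = torch.randn(T, head_size)                     # two arbitrary sets of value vectors
v2 = torch.randn(T, head_size)
a, b = 2.0, -3.0

# With the weights held fixed, mixing a combination of values equals
# the same combination of the mixed values: the step is linear in v.
mixed_combo = wei @ (a * v1 + b * v2)
combo_mixed = a * (wei @ v1) + b * (wei @ v2)

is_linear = torch.allclose(mixed_combo, combo_mixed, atol=1e-4)
print(is_linear)
```

The softmax still matters, since it chooses the weights, but once they are chosen the features are only averaged, never reshaped by a nonlinearity.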


5.2  The feed-forward network

In Vaswani et al. (2017, §3.3), the feed-forward sublayer is described as a position-wise fully connected network — “position-wise” meaning it is applied separately and identically at each position, with the same weights. It has no access to other positions; it only processes what that position now carries.

The function is:

FFN(x) = max(0, x W₁ + b₁) W₂ + b₂

Two linear transformations with a ReLU in between. In the paper’s base model: d_model = 512 and d_ff = 2048 — the inner layer is 4× wider than the residual stream. The network expands the representation, applies the nonlinearity, then projects back to d_model.
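The formula can be written out term by term as a quick check. This is a minimal sketch, assuming PyTorch and the base-model sizes quoted above; the names lin1, lin2, and ffn are illustrative and are not the chapter's FeedForward module.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 512, 2048                 # base-model sizes quoted above

lin1 = nn.Linear(d_model, d_ff)           # holds W1 and b1
lin2 = nn.Linear(d_ff, d_model)           # holds W2 and b2

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, written out term by term."""
    # nn.Linear stores its weight as (out_features, in_features), hence the .T
    hidden = torch.clamp(x @ lin1.weight.T + lin1.bias, min=0)   # (..., d_ff)
    return hidden @ lin2.weight.T + lin2.bias                    # (..., d_model)

x = torch.randn(2, 3, d_model)            # (B, T, d_model), illustrative batch
with torch.no_grad():
    out = ffn(x)
    reference = lin2(torch.relu(lin1(x))) # same computation via the module calls

# Two weight matrices of 512 x 2048 each, plus biases of 2048 and 512.
n_params = sum(p.numel() for p in list(lin1.parameters()) + list(lin2.parameters()))
print(out.shape, n_params)
```

At these sizes the two linear layers hold 2,099,712 parameters between them, which is why the feed-forward sublayer accounts for a large share of each block.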

Here is the meeting-room metaphor that makes the two sublayers click together: attention is the meeting where positions share notes. Every token gets to speak, every token listens, and the output is an updated note for each position reflecting what it heard. The feed-forward network is each position going away alone to think about what it heard. Privately, nonlinearly, without any further communication. Then it writes a revised note and hands it to the next layer.
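The "going away alone" part can be checked directly: if only one position's input changes, only that position's output should change. The following is a small sketch under assumed sizes (a width of 16 and a sequence of 6), not the chapter's module.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16                                   # small illustrative width

ffn = nn.Sequential(                           # same layout as the FFN formula
    nn.Linear(d_model, 4 * d_model),
    nn.ReLU(),
    nn.Linear(4 * d_model, d_model),
)

x = torch.randn(1, 6, d_model)                 # (B=1, T=6, d_model)
x_changed = x.clone()
x_changed[0, 3] += 1.0                         # disturb position 3 only

with torch.no_grad():
    out = ffn(x)
    out_changed = ffn(x_changed)

# Which positions produced a different output?
moved = (out - out_changed).abs().amax(dim=-1)[0] > 1e-6   # (T,)
print(moved.tolist())
```

Running the same experiment through attention would give a different picture, because there a change at one position can reach every position that is allowed to attend to it.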

Why 4×? Vaswani et al. set d_ff = 2048 = 4 × 512, and this ratio has proven remarkably durable across model scales. The expansion creates room for the nonlinearity to activate different subsets of hidden units at different positions — in effect, each position independently decides which of the 2048 “features” to light up, then projects the result back to the channel dimension. The exact factor is a practical hyperparameter, not a theorem.

5.3  Residual connections

Deep networks are hard to train. The core problem is that gradients must flow backward from the loss through every layer to reach the parameters near the input. In a naive deep stack, those gradients are multiplied by one layer Jacobian after another, so they tend to shrink (vanish) or grow (explode). By the time they reach the early layers, they are either too small to carry useful signal or too large to be stable.

He et al. (arXiv 2015; CVPR 2016, “Deep Residual Learning for Image Recognition”) found a remarkably clean fix for convolutional networks: instead of asking each layer to learn a complete transformation H(x), ask it to learn only the residual — the difference from the input. The full computation becomes:

output = x + sublayer(x)     # residual connection

The identity path — the raw x flowing straight through the + — is a gradient highway. During backpropagation, gradients flow through the addition unimpeded, all the way back to the earliest layers. The sublayer’s contribution is additive; if it is small or wrong, the identity path carries the signal regardless. Networks with residual connections can be meaningfully trained at depths that would otherwise not learn at all.
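A toy experiment makes the gradient highway visible. The sketch below is an assumption-laden illustration, not a benchmark: it stacks 30 small tanh layers (a made-up configuration), measures the gradient that reaches the input with and without the identity path, and then checks the extreme case of a sublayer that outputs exactly zero.

```python
import torch
import torch.nn as nn

def input_grad_norm(depth, use_residual, width=16, seed=0):
    """Norm of d(sum of outputs)/d(input) for a toy stack of tanh layers."""
    torch.manual_seed(seed)
    layers = [nn.Linear(width, width) for _ in range(depth)]
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for layer in layers:
        branch = torch.tanh(layer(h))
        h = h + branch if use_residual else branch   # identity path on or off
    h.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(depth=30, use_residual=False)
resid = input_grad_norm(depth=30, use_residual=True)
print(f"plain: {plain:.3e}   residual: {resid:.3e}")

# Extreme case: a sublayer whose weights and bias are all zero.
dead = nn.Linear(4, 4)
nn.init.zeros_(dead.weight)
nn.init.zeros_(dead.bias)
x = torch.randn(1, 4, requires_grad=True)
(x + dead(x)).sum().backward()
grad_through_identity = x.grad.clone()   # all ones: the + passes the gradient through
```

Even when the sublayer contributes nothing, the gradient arriving at the input is exactly the gradient that arrived at the output, which is the sense in which the identity path carries the signal regardless.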

The transformer adopts this pattern around both sublayers. After attention, instead of passing the attention output directly to the next stage, we add it back onto the input that generated it. After the feed-forward network, same thing. The residual stream — the running x that persists across the entire block — accumulates contributions from each sublayer rather than being replaced by them.

Without residuals, every layer must produce a complete, globally coherent transformation of its input. With them, each sublayer only needs to learn a correction to what already flows through, which is a much easier optimization problem, especially early in training when the sublayers are still close to random.

5.4  Layer normalization

The second piece of stabilization plumbing is layer normalization (Ba, Kiros & Hinton, 2016, arXiv:1607.06450). After every sublayer, numbers that started at a workable scale can drift: large activations compound across depth, and a network in which the numbers in one layer are ten times larger than another is difficult to optimize — learning rates calibrated for one depth are wrong for another.

Layer norm fixes this by normalizing across the channel dimension at each position: for a vector of length C, it computes the mean and variance of that vector’s elements, then rescales them to have mean zero and variance one, with learned per-channel scale and shift parameters (γ and β) that let the model restore whatever range is useful. The normalization is per position — it operates on the C-dimensional vector at each (b, t) location independently, so it is not sensitive to batch size (unlike batch normalization).
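The description above translates almost line for line into code. This is a minimal sketch, assuming PyTorch; layer_norm_by_hand is an illustrative helper written for this example, compared against the built-in nn.LayerNorm.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C = 2, 3, 8                              # illustrative sizes
x = torch.randn(B, T, C) * 10 + 5              # deliberately badly scaled

def layer_norm_by_hand(x, gamma, beta, eps=1e-5):
    """Normalize each (b, t) vector over its C channels, then scale and shift."""
    mean = x.mean(dim=-1, keepdim=True)                    # (B, T, 1)
    var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)     # biased variance, (B, T, 1)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

ln = nn.LayerNorm(C)                           # gamma starts at 1, beta at 0
with torch.no_grad():
    manual = layer_norm_by_hand(x, ln.weight, ln.bias, eps=ln.eps)
    builtin = ln(x)
    alone = ln(x[:1])                          # first example, normalized without its batch

matches_builtin = torch.allclose(manual, builtin, atol=1e-4)
batch_independent = torch.allclose(alone, builtin[:1], atol=1e-6)
print(matches_builtin, batch_independent)
print(manual.mean(dim=-1).abs().max().item())  # per-position means are near zero
```

The last comparison is the batch-size point in practice: an example normalized on its own gets the same result as when it sits inside a larger batch.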

Pre-norm vs. post-norm — an honest deviation

The 2017 paper placed layer norm after each sublayer (post-LN): LayerNorm(x + sublayer(x)). This is the architecture as Vaswani et al. described it.

Modern GPT-style models (GPT-2, GPT-3, and the nanoGPT implementation that this book’s code follows) moved the layer norm to before each sublayer (pre-LN): x + sublayer(LayerNorm(x)). The difference is not cosmetic. In post-LN, the norm is applied to the sum x + sublayer(x), so the residual stream itself, identity path included, is renormalized after every addition. In pre-LN, the norm only touches the input to the branch being computed; the identity path flows through + completely unchanged. Pre-LN gives a cleaner gradient highway and tends to train more stably; Xiong et al. (2020, arXiv:2002.04745) analyze the gradients at initialization and report that pre-LN transformers could be trained without the learning-rate warmup stage that post-LN needed in their experiments.

Our code follows the pre-LN convention. This is a deliberate deviation from the 2017 paper, following the GPT-2/nanoGPT lineage. Where you see self.ln1 and self.ln2 in the block below, they fire before their respective sublayers, not after.
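The two orderings differ by where one function call sits, and a sublayer that contributes nothing makes the consequence easy to see. The sketch below is illustrative: post_ln_step, pre_ln_step, and the silent sublayer are hypothetical helpers for this comparison, not part of the chapter's Block.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C = 8
ln = nn.LayerNorm(C)

def post_ln_step(x, sublayer):
    """2017 ordering: normalize the sum, so the identity path goes through the norm."""
    return ln(x + sublayer(x))

def pre_ln_step(x, sublayer):
    """GPT-2 / nanoGPT ordering: normalize only the input to the branch."""
    return x + sublayer(ln(x))

def silent(h):
    """Hypothetical sublayer that contributes nothing."""
    return torch.zeros_like(h)

x = torch.randn(2, 4, C) * 3 + 1               # (B, T, C) with non-unit scale
with torch.no_grad():
    pre_out = pre_ln_step(x, silent)
    post_out = post_ln_step(x, silent)

pre_keeps_identity = torch.equal(pre_out, x)                  # stream passes through untouched
post_keeps_identity = torch.allclose(post_out, x, atol=1e-3)  # stream was rescaled by the norm
print(pre_keeps_identity, post_keeps_identity)
```

With pre-LN the residual stream comes out bit-for-bit identical; with post-LN it has been shifted and rescaled even though the sublayer added nothing.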


5.5  The code

Here is the full block. MultiHeadAttention is brought forward from Chapter 4 exactly as written; the two new modules are FeedForward and Block. Read every shape comment.

import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Brought forward from Chapters 3 & 4 ─────────────────────────────────────

class SelfAttentionHead(nn.Module):
    """One head of causal self-attention. (B,T,C) -> (B,T,head_size)."""
    def __init__(self, d_model, head_size, context_length, dropout=0.0):
        super().__init__()
        self.key   = nn.Linear(d_model, head_size, bias=False)
        self.query = nn.Linear(d_model, head_size, bias=False)
        self.value = nn.Linear(d_model, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(context_length, context_length)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                        # x: (B, T, C)
        B, T, C = x.shape
        k = self.key(x)                          # (B, T, head_size)
        q = self.query(x)                        # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)             # (B, T, T)
        wei = self.dropout(wei)
        v = self.value(x)                        # (B, T, head_size)
        return wei @ v                           # (B, T, head_size)


class MultiHeadAttention(nn.Module):
    """h parallel heads, concatenated + projected. (B,T,C) -> (B,T,C)."""
    def __init__(self, d_model, num_heads, context_length, dropout=0.0):
        super().__init__()
        assert d_model % num_heads == 0
        head_size = d_model // num_heads
        self.heads = nn.ModuleList([
            SelfAttentionHead(d_model, head_size, context_length, dropout)
            for _ in range(num_heads)
        ])
        self.proj    = nn.Linear(d_model, d_model)   # W^O
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                        # x: (B, T, d_model)
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, d_model)
        return self.dropout(self.proj(out))      # (B, T, d_model)


# ── Chapter 5: new modules ───────────────────────────────────────────────────

class FeedForward(nn.Module):
    """
    Position-wise feed-forward network (Vaswani et al. 2017, §3.3).
    Applied identically and independently at each position.
    Input:  (B, T, d_model)
    Output: (B, T, d_model)
    """
    def __init__(self, d_model, dropout=0.0):
        super().__init__()
        # expand 4× into d_ff, apply nonlinearity, project back.
        # Vaswani et al. use ReLU; modern GPT-style models (GPT-2, nanoGPT)
        # use GELU — we follow the original paper's ReLU here for clarity.
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),     # (B, T, d_model) -> (B, T, 4*d_model)
            nn.ReLU(),                            # pointwise nonlinearity
            nn.Linear(4 * d_model, d_model),     # (B, T, 4*d_model) -> (B, T, d_model)
            nn.Dropout(dropout),
        )

    def forward(self, x):                        # x: (B, T, d_model)
        return self.net(x)                       # (B, T, d_model)


class Block(nn.Module):
    """
    One transformer block: pre-LN self-attention + pre-LN feed-forward,
    each wrapped in a residual connection.

    Convention note: the 2017 paper normalized AFTER each sublayer (post-LN).
    We follow the GPT-2 / nanoGPT convention: normalize BEFORE (pre-LN).
    Pre-LN preserves the identity gradient highway through the + unchanged,
    which gives more stable training at depth.

    Input:  (B, T, d_model)
    Output: (B, T, d_model)   — same shape throughout; the block is stackable.
    """
    def __init__(self, d_model, num_heads, context_length, dropout=0.0):
        super().__init__()
        self.sa   = MultiHeadAttention(d_model, num_heads, context_length, dropout)
        self.ffwd = FeedForward(d_model, dropout)
        self.ln1  = nn.LayerNorm(d_model)        # fires before attention
        self.ln2  = nn.LayerNorm(d_model)        # fires before feed-forward

    def forward(self, x):                        # x: (B, T, d_model)
        # sublayer 1: multi-head self-attention
        # pre-LN: normalise x before passing to attention
        # residual: add the attention output back onto x
        x = x + self.sa(self.ln1(x))            # (B, T, d_model)

        # sublayer 2: position-wise feed-forward
        # pre-LN: normalise x before passing to FFN
        # residual: add the FFN output back onto x
        x = x + self.ffwd(self.ln2(x))          # (B, T, d_model)

        return x                                 # (B, T, d_model)


# ── sanity check ─────────────────────────────────────────────────────────────
B, T, C = 4, 8, 384            # batch=4, sequence_len=8, d_model=384
x = torch.randn(B, T, C)

block = Block(d_model=C, num_heads=6, context_length=256, dropout=0.1)
out   = block(x)
print(out.shape)               # torch.Size([4, 8, 384]) — identical to input

Line-by-line walk


5.6  The thing to actually understand


5.7  Exercises

  1. Confirm the shape contract across a stack. Write nn.Sequential(*[Block(d_model=384, num_heads=6, context_length=256) for _ in range(6)]). Pass a random (4, 8, 384) tensor through. Assert the output is (4, 8, 384). Nothing about the shape should have changed after six blocks.
  2. Remove the residual connections. Comment out both x = x + ... and replace them with x = self.sa(self.ln1(x)) and x = self.ffwd(self.ln2(x)). Stack four of these non-residual blocks. Train on a tiny character dataset for 200 steps. Compare loss curves to the residual version. The point is not to succeed but to observe the difference.
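  3. Swap ReLU for GELU. In FeedForward, replace nn.ReLU() with nn.GELU(). Run the sanity check; the output shape should not change. This single line is the main difference between the Vaswani 2017 FFN and the GPT-2-style FFN. (GPT-2 itself used the tanh approximation of GELU, which PyTorch exposes as nn.GELU(approximate='tanh').)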
  4. Switch pre-LN to post-LN. Rewrite Block.forward to compute x = self.ln1(x + self.sa(x)) and x = self.ln2(x + self.ffwd(x)). Run the sanity check. Then discuss: why does post-LN renormalize the identity path, and what effect might that have on gradient flow compared to pre-LN?
  5. Count the parameters. Run sum(p.numel() for p in block.parameters()) on your Block(384, 6, 256). Break it down by module: block.sa, block.ffwd, block.ln1, block.ln2. Which module owns the most parameters? (Hint: the FFN’s two large linear layers at 384 × 1536 and 1536 × 384 add up quickly.)

What’s next

Chapter 6: Assembling the GPT (coming soon)

A 37th-Chamber original. Methods cited: Vaswani et al. (2017), “Attention Is All You Need,” arXiv:1706.03762, §3.3 (position-wise FFN, d_ff = 2048 = 4×d_model, ReLU activation — confirmed); He et al. (arXiv 2015; CVPR 2016), “Deep Residual Learning for Image Recognition” (residual / identity shortcut connections — confirmed); Ba, Kiros & Hinton (2016), “Layer Normalization,” arXiv:1607.06450 (confirmed); GPT-2 pre-LN convention follows the nanoGPT lineage (Radford et al. 2019; Karpathy, nanoGPT) — deviation from original post-LN noted inline. All prose and code written fresh.

Written by a Fable · Edited by Kyle Sullivan