The Opaque Box · Chapter 5

The Transformer Block

Chapters 3 and 4 built a machine that mixes information between positions. Multi-head attention looks outward — each token reaching across the sequence, gathering what it needs from other tokens. But that gathering is a routing operation, not a reasoning one. Once the positions have shared notes, something has to think about what they heard — independently, at each position. The transformer block assembles both stages, then wraps them in the plumbing that makes the whole thing stackable.

5.1 Where we are

After Chapter 4, MultiHeadAttention takes (B, T, C) and returns (B, T, C) — the same shape it received. That shape-preserving contract was deliberate: it is the condition that lets us compose operations without reshaping. Each sublayer in the transformer is built to honor it.

What multi-head attention does to that tensor is routing: it lets each of the T positions look at the others and update itself with a weighted blend of their values. After that operation, every position’s vector carries not just its own identity but a mix of context from across the sequence.

What it does not do is apply a nonlinear transformation to the token representations themselves. The Q/K/V projections and the final output projection are linear operations. The softmax over the dot-product scores is nonlinear, but it acts on the routing weights, not on the token vectors: it decides how much of each value to blend, not what to do once the blend arrives. Each position’s output is therefore still a weighted average of linearly projected value vectors. There is no operation that takes a position’s received context and runs it through a nonlinearity to compute something genuinely new. For that, the block adds a second sublayer: a small feed-forward network that fires independently at every position.
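The claim that the blend itself is linear in the value vectors can be checked numerically. The sketch below is illustrative only: it freezes a made-up set of routing weights (the sizes and the names `wei`, `v1`, `v2` are assumptions, not the chapter's modules) and confirms that mixing a linear combination of values gives the same linear combination of the mixes.

```python
import torch

torch.manual_seed(0)
T, d = 4, 8                                  # toy sequence length and channel width (assumed)

# Hypothetical fixed routing weights: each row sums to 1, as a softmax would produce.
wei = torch.softmax(torch.randn(T, T), dim=-1)

v1 = torch.randn(T, d)                       # two arbitrary sets of value vectors
v2 = torch.randn(T, d)
a, b = 2.0, -3.0

# Mixing a linear combination of values ...
mixed_combo = wei @ (a * v1 + b * v2)
# ... equals the same linear combination of the individual mixes.
combo_mixed = a * (wei @ v1) + b * (wei @ v2)

is_linear = torch.allclose(mixed_combo, combo_mixed, atol=1e-4)
print(is_linear)
```

Once the weights are fixed, nothing in the blend can bend the value vectors; that is the gap the feed-forward sublayer fills.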

5.2 The feed-forward network

In Vaswani et al. (2017, §3.3), the feed-forward sublayer is described as a position-wise fully connected network — “position-wise” meaning it is applied separately and identically at each position, with the same weights. It has no access to other positions; it only processes what that position now carries.

The function is:

FFN(x) = max(0, x W₁ + b₁) W₂ + b₂

Two linear transformations with a ReLU in between. In the paper’s base model: d_model = 512 and d_ff = 2048 — the inner layer is 4× wider than the residual stream. The network expands the representation, applies the nonlinearity, then projects back to d_model.
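To make “position-wise” concrete, here is a minimal sketch with toy sizes (16 and 64 are assumptions chosen for speed, not the paper's 512 and 2048). It runs the same two-layer network on a full (B, T, d_model) tensor and on a single position in isolation, then checks that changing a different position leaves the result untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 16, 64                       # toy sizes (assumed)

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),                # x W1 + b1
    nn.ReLU(),                               # max(0, .)
    nn.Linear(d_ff, d_model),                # (.) W2 + b2
)

x = torch.randn(2, 5, d_model)               # (B, T, d_model)
full = ffn(x)                                # every position in one call

# Position t=3 of batch element 0, processed entirely on its own.
alone = ffn(x[0, 3])                         # (d_model,)
same = torch.allclose(full[0, 3], alone, atol=1e-5)

# Perturb a different position; the output at t=3 should not move.
x2 = x.clone()
x2[0, 1] += 10.0
unaffected = torch.allclose(ffn(x2)[0, 3], full[0, 3], atol=1e-5)

print(full.shape, same, unaffected)
```

nn.Linear acts on the last dimension only, so the position-wise behavior comes for free from how the tensor is laid out.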

Here is the meeting-room metaphor that makes the two sublayers click together: attention is the meeting where positions share notes. Every token gets to speak, every token listens, and the output is an updated note for each position reflecting what it heard. The feed-forward network is each position going away alone to think about what it heard. Privately, nonlinearly, without any further communication. Then it writes a revised note and hands it to the next layer.

Why 4×? Vaswani et al. set d_ff = 2048 = 4 × 512, and this ratio has proven remarkably durable across model scales. The expansion creates room for the nonlinearity to activate different subsets of hidden units at different positions — in effect, each position independently decides which of the 2048 “features” to light up, then projects the result back to the channel dimension. The exact factor is a practical hyperparameter, not a theorem.
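The idea that each position lights up its own subset of hidden units can be observed directly. This is a small sketch with a randomly initialized expansion layer (d_model = 8 is an assumed toy size); a trained network would show different patterns, but the mechanism is the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8                                  # toy width (assumed)
d_ff = 4 * d_model                           # the 4x expansion

expand = nn.Linear(d_model, d_ff)            # first half of the FFN only
x = torch.randn(1, 4, d_model)               # (B=1, T=4, d_model)

hidden = torch.relu(expand(x))               # (1, 4, d_ff)
active = hidden > 0                          # True where a hidden unit fired

for t in range(x.shape[1]):
    on = active[0, t].nonzero().flatten().tolist()
    print(f"position {t}: {len(on)}/{d_ff} units active -> {on}")
```

Each row of `active` is a different mask, which is what lets the second linear layer compute a different function of the input at each position.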

5.3 Residual connections

Deep networks are hard to train. The core problem is that gradients must flow backward from the loss through every layer to reach the parameters near the input. In a naive deep stack, those gradients tend to shrink (vanish) or explode as they multiply through layer after layer of Jacobians. By the time they reach the early layers, they are either too small to carry useful signal or too large to keep training stable.

He et al. (arXiv 2015; CVPR 2016, “Deep Residual Learning for Image Recognition”) found a remarkably clean fix for convolutional networks: instead of asking each layer to learn a complete transformation H(x), ask it to learn only the residual — the difference from the input. The full computation becomes:

output = x + sublayer(x)     # residual connection

The identity path — the raw x flowing straight through the + — is a gradient highway. During backpropagation, gradients flow through the addition unimpeded, all the way back to the earliest layers. The sublayer’s contribution is additive; if it is small or wrong, the identity path carries the signal regardless. Networks with residual connections can be meaningfully trained at depths that would otherwise not learn at all.
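The gradient highway can be seen in a toy experiment. The sketch below is an illustration under stated assumptions, not the chapter's model: it stacks 20 deliberately weak tanh layers (default initialization scaled down by 10×) and measures how much gradient reaches the input with and without the identity path. The helper `input_grad_norm` is hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, depth = 16, 20                            # toy width and depth (assumed)

# Deliberately weak layers: default init scaled down by 10x.
layers = [nn.Linear(d, d) for _ in range(depth)]
with torch.no_grad():
    for layer in layers:
        layer.weight.mul_(0.1)

def input_grad_norm(use_residual):
    """Norm of d(sum of output)/d(input) for the stack, with or without residuals."""
    x = torch.randn(1, d, requires_grad=True)
    h = x
    for layer in layers:
        branch = torch.tanh(layer(h))
        h = h + branch if use_residual else branch
    h.sum().backward()
    return x.grad.norm().item()

plain_norm = input_grad_norm(use_residual=False)
residual_norm = input_grad_norm(use_residual=True)
print(f"plain: {plain_norm:.3e}   residual: {residual_norm:.3e}")
```

In the plain stack the gradient is a product of 20 small Jacobians and all but disappears; with the residual each Jacobian is the identity plus a small term, so the signal survives.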

The transformer adopts this pattern around both sublayers. After attention, instead of passing the attention output directly to the next stage, we add it back onto the input that generated it. After the feed-forward network, same thing. The residual stream — the running x that persists across the entire block — accumulates contributions from each sublayer rather than being replaced by them.

One transformer block: the residual stream runs bottom to top as a spine, branching into a pre-LN attention sublayer then a pre-LN feed-forward sublayer, each added back at a + so the identity path stays unbroken.

Without residuals, each layer must learn a complete transformation of its input, and every layer above depends on it getting that right. With them, each sublayer only needs to learn a correction to what already flows through, which is a much easier optimization problem.

5.4 Layer normalization

The second piece of stabilization plumbing is layer normalization (Ba, Kiros & Hinton, 2016, arXiv:1607.06450). After every sublayer, numbers that started at a workable scale can drift: large activations compound across depth, and a network in which the numbers in one layer are ten times larger than another is difficult to optimize — learning rates calibrated for one depth are wrong for another.

Layer norm fixes this by normalizing across the channel dimension at each position: for a vector of length C, it computes the mean and variance of that vector’s elements, then rescales them to have mean zero and variance one, with learned per-channel scale and shift parameters (γ and β) that let the model restore whatever range is useful. The normalization is per position — it operates on the C-dimensional vector at each (b, t) location independently, so it is not sensitive to batch size (unlike batch normalization).
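The description above maps onto a few tensor operations. As a sketch, the code below computes layer norm by hand over the channel dimension and compares it with nn.LayerNorm at its default initialization (γ = 1, β = 0); the shapes are arbitrary toy values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C = 2, 3, 8                            # toy shapes (assumed)
x = torch.randn(B, T, C) * 5.0 + 2.0         # deliberately off-scale activations

ln = nn.LayerNorm(C)                         # weight (gamma) starts at 1, bias (beta) at 0
eps = ln.eps                                 # small constant for numerical stability

mean = x.mean(dim=-1, keepdim=True)                    # (B, T, 1): one mean per position
var = x.var(dim=-1, unbiased=False, keepdim=True)      # (B, T, 1): biased variance
manual = (x - mean) / torch.sqrt(var + eps)            # (B, T, C): zero mean, unit variance
manual = manual * ln.weight + ln.bias                  # learned per-channel scale and shift

matches = torch.allclose(manual, ln(x), atol=1e-5)
print(matches, manual.mean(dim=-1).abs().max().item())
```

Every statistic is taken along the last dimension, so no value depends on other positions or other batch elements.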

Pre-norm vs. post-norm — an honest deviation

The 2017 paper placed layer norm after each sublayer (post-LN): LayerNorm(x + sublayer(x)). This is the architecture as Vaswani et al. described it.

Modern GPT-style models (GPT-2, GPT-3, and the nanoGPT implementation that this book’s code follows) moved the layer norm to before each sublayer (pre-LN): x + sublayer(LayerNorm(x)). The difference is not cosmetic. In post-LN, the residual stream itself passes through the norm at every layer, which means the identity highway is renormalized after every addition. In pre-LN, the norm only touches the branch being computed; the identity path flows through + completely unchanged. Pre-LN produces a cleaner gradient highway and tends to train more stably; Xiong et al. (2020, arXiv:2002.04745) report that it can be trained without the learning-rate warmup stage that post-LN typically depends on.

Our code follows the pre-LN convention. This is a deliberate deviation from the 2017 paper, following the GPT-2/nanoGPT lineage. Where you see self.ln1 and self.ln2 in the block below, they fire before their respective sublayers, not after.
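One way to see the difference between the two placements is to plug in a sublayer that contributes nothing. In the sketch below, `silent` is a hypothetical stand-in that returns zeros: pre-LN then hands back the residual stream exactly, while post-LN still rescales it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C = 8                                        # toy channel width (assumed)
x = torch.randn(2, 3, C) * 4.0               # residual stream with a non-unit scale
ln = nn.LayerNorm(C)

def silent(h):
    # Hypothetical stand-in for a sublayer that contributes nothing.
    return torch.zeros_like(h)

pre_out = x + silent(ln(x))                  # pre-LN:  x + sublayer(LayerNorm(x))
post_out = ln(x + silent(x))                 # post-LN: LayerNorm(x + sublayer(x))

pre_is_identity = torch.equal(pre_out, x)        # identity path untouched
post_is_identity = torch.allclose(post_out, x)   # stream was renormalized
print(pre_is_identity, post_is_identity)
```

The same asymmetry holds for gradients: in pre-LN the derivative of the output with respect to x always contains an exact identity term, whereas in post-LN every path back to x passes through the norm.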

5.5 The code

Here is the full block. SelfAttentionHead and MultiHeadAttention are brought forward from Chapters 3 and 4 exactly as written; the two new modules are FeedForward and Block. Read every shape comment.

import torch
import torch.nn as nn
import torch.nn.functional as F

# ── Brought forward from Chapters 3 & 4 ─────────────────────────────────────

class SelfAttentionHead(nn.Module):
    """One head of causal self-attention. (B,T,C) -> (B,T,head_size)."""
    def __init__(self, d_model, head_size, context_length, dropout=0.0):
        super().__init__()
        self.key   = nn.Linear(d_model, head_size, bias=False)
        self.query = nn.Linear(d_model, head_size, bias=False)
        self.value = nn.Linear(d_model, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(context_length, context_length)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                        # x: (B, T, C)
        B, T, C = x.shape
        k = self.key(x)                          # (B, T, head_size)
        q = self.query(x)                        # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)             # (B, T, T)
        wei = self.dropout(wei)
        v = self.value(x)                        # (B, T, head_size)
        return wei @ v                           # (B, T, head_size)


class MultiHeadAttention(nn.Module):
    """h parallel heads, concatenated + projected. (B,T,C) -> (B,T,C)."""
    def __init__(self, d_model, num_heads, context_length, dropout=0.0):
        super().__init__()
        assert d_model % num_heads == 0
        head_size = d_model // num_heads
        self.heads = nn.ModuleList([
            SelfAttentionHead(d_model, head_size, context_length, dropout)
            for _ in range(num_heads)
        ])
        self.proj    = nn.Linear(d_model, d_model)   # W^O
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                        # x: (B, T, d_model)
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, d_model)
        return self.dropout(self.proj(out))      # (B, T, d_model)


# ── Chapter 5: new modules ───────────────────────────────────────────────────

class FeedForward(nn.Module):
    """
    Position-wise feed-forward network (Vaswani et al. 2017, §3.3).
    Applied identically and independently at each position.
    Input:  (B, T, d_model)
    Output: (B, T, d_model)
    """
    def __init__(self, d_model, dropout=0.0):
        super().__init__()
        # expand 4× into d_ff, apply nonlinearity, project back.
        # Vaswani et al. use ReLU; modern GPT-style models (GPT-2, nanoGPT)
        # use GELU — we follow the original paper's ReLU here for clarity.
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),     # (B, T, d_model) -> (B, T, 4*d_model)
            nn.ReLU(),                            # pointwise nonlinearity
            nn.Linear(4 * d_model, d_model),     # (B, T, 4*d_model) -> (B, T, d_model)
            nn.Dropout(dropout),
        )

    def forward(self, x):                        # x: (B, T, d_model)
        return self.net(x)                       # (B, T, d_model)


class Block(nn.Module):
    """
    One transformer block: pre-LN self-attention + pre-LN feed-forward,
    each wrapped in a residual connection.

    Convention note: the 2017 paper normalized AFTER each sublayer (post-LN).
    We follow the GPT-2 / nanoGPT convention: normalize BEFORE (pre-LN).
    Pre-LN preserves the identity gradient highway through the + unchanged,
    which gives more stable training at depth.

    Input:  (B, T, d_model)
    Output: (B, T, d_model)   — same shape throughout; the block is stackable.
    """
    def __init__(self, d_model, num_heads, context_length, dropout=0.0):
        super().__init__()
        self.sa   = MultiHeadAttention(d_model, num_heads, context_length, dropout)
        self.ffwd = FeedForward(d_model, dropout)
        self.ln1  = nn.LayerNorm(d_model)        # fires before attention
        self.ln2  = nn.LayerNorm(d_model)        # fires before feed-forward

    def forward(self, x):                        # x: (B, T, d_model)
        # sublayer 1: multi-head self-attention
        # pre-LN: normalise x before passing to attention
        # residual: add the attention output back onto x
        x = x + self.sa(self.ln1(x))            # (B, T, d_model)

        # sublayer 2: position-wise feed-forward
        # pre-LN: normalise x before passing to FFN
        # residual: add the FFN output back onto x
        x = x + self.ffwd(self.ln2(x))          # (B, T, d_model)

        return x                                 # (B, T, d_model)


# ── sanity check ─────────────────────────────────────────────────────────────
B, T, C = 4, 8, 384            # batch=4, sequence_len=8, d_model=384
x = torch.randn(B, T, C)

block = Block(d_model=384, num_heads=6, context_length=256, dropout=0.1)
out   = block(x)
print(out.shape)               # torch.Size([4, 8, 384]) — identical to input

Line-by-line walk

FeedForward.__init__: nn.Sequential chains the two linear layers, the ReLU, and a final dropout in order. The first nn.Linear(d_model, 4 * d_model) expands from 384 to 1536; the second projects back. Each position’s C-dimensional vector is processed by this sequence independently. No position talks to another inside the FFN.
nn.ReLU(): max(0, x) pointwise. This is the activation Vaswani et al. used. GPT-2 and most subsequent GPT-style models switched to GELU (Gaussian Error Linear Unit), which is smooth everywhere: where ReLU is a hard zero for negative inputs with a kink at the origin, GELU curves gently, which can improve gradient flow. The comment in the code notes the switch. For this walkthrough, ReLU keeps the math transparent.
Block.__init__: self.sa is the MultiHeadAttention module from Chapter 4, unchanged. self.ln1 and self.ln2 are nn.LayerNorm(d_model) instances — each normalizes the last dimension of its input to mean 0, variance 1, then applies learned per-channel scale and shift.
x = x + self.sa(self.ln1(x)): the pre-LN residual for attention. self.ln1(x) normalizes the current residual stream; self.sa(...) runs multi-head attention on the normalized input; the result is added back onto the un-normalized x. The + is the gradient highway — gradients flow through it directly.
x = x + self.ffwd(self.ln2(x)): same pattern for the feed-forward sublayer. self.ln2 normalizes the updated x; self.ffwd expands-nonlinearizes-projects; the result is added back. x at the end of the forward pass carries both sublayers’ contributions superimposed on whatever came in.
Output shape: (B, T, d_model). The block received (4, 8, 384) and returned (4, 8, 384). This is the contract. You can feed the output of one block directly into another block with zero reshaping.
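The ReLU versus GELU contrast from the walk above is easy to tabulate. This sketch evaluates both on a handful of inputs using PyTorch's default, erf-based GELU; GPT-2's original code used a tanh approximation of the same function, which is not shown here.

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0])
relu_out = F.relu(z)                         # hard zero for every negative input
gelu_out = F.gelu(z)                         # z * Phi(z): small negative outputs survive

for zi, r, g in zip(z.tolist(), relu_out.tolist(), gelu_out.tolist()):
    print(f"z={zi:+.1f}  relu={r:+.4f}  gelu={g:+.4f}")
```

For negative inputs ReLU returns exactly zero, while GELU returns a small negative value that fades toward zero as the input becomes more negative.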

5.6 The thing to actually understand

Attention routes. FFN thinks. Multi-head attention moves information between positions; the feed-forward network processes that information nonlinearly at each position. Neither alone is sufficient. Together they constitute one complete pass of communicate-then-reason.
Residual connections are not optional at depth. x + sublayer(x) gives gradients a path that bypasses every learned transformation. Without it, deep stacks of the same block become very hard to train. This is the insight from He et al. (2015), ported directly into the transformer.
Layer norm keeps the numbers honest. Every sublayer can push activations to large values; layer norm at each position normalizes before they compound. The choice of where to normalize (before or after the sublayer) matters: pre-LN, which we use, preserves the identity path through +; post-LN, which the 2017 paper used, renormalizes the residual stream itself. Pre-LN is the modern default.
The block is a composable unit. (B, T, d_model) in, (B, T, d_model) out. Stack N of them and the shapes never change. The depth of a GPT is literally the number of times this block repeats.
4× expansion in the FFN is a convention, not a law. Vaswani et al. used d_ff = 4 × d_model; it has held up well, but modern architectures experiment with this ratio freely.

5.7 Exercises

Confirm the shape contract across a stack. Write nn.Sequential(*[Block(d_model=384, num_heads=6, context_length=256) for _ in range(6)]). Pass a random (4, 8, 384) tensor through. Assert the output is (4, 8, 384). Nothing about the shape should have changed after six blocks.
Remove the residual connections. Comment out both x = x + ... and replace them with x = self.sa(self.ln1(x)) and x = self.ffwd(self.ln2(x)). Stack four of these non-residual blocks. Train on a tiny character dataset for 200 steps. Compare loss curves to the residual version. The point is not to succeed but to observe the difference.
Swap ReLU for GELU. In FeedForward, replace nn.ReLU() with nn.GELU(). Run the sanity check — the output shape should not change. This is the single line that converts from the Vaswani 2017 FFN to the GPT-2 FFN.
Switch pre-LN to post-LN. Rewrite Block.forward to compute x = self.ln1(x + self.sa(x)) and x = self.ln2(x + self.ffwd(x)). Run the sanity check. Then discuss: why does post-LN renormalize the identity path, and what effect might that have on gradient flow compared to pre-LN?
Count the parameters. Run sum(p.numel() for p in block.parameters()) on your Block(384, 6, 256). Break it down by module: block.sa, block.ffwd, block.ln1, block.ln2. Which module owns the most parameters? (Hint: the FFN’s two large linear layers at 384 × 1536 and 1536 × 384 add up quickly.)
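For the parameter-count exercise, a hand calculation gives you something to check the PyTorch number against. The arithmetic below is a sketch that assumes the layer layout in this chapter's code: bias-free Q/K/V projections per head, an output projection with bias, two biased FFN layers, and two layer norms. The causal mask is a registered buffer, so it is not counted.

```python
d_model, num_heads, expansion = 384, 6, 4
head_size = d_model // num_heads

# Attention: each head has three bias-free projections, d_model -> head_size.
heads = num_heads * 3 * d_model * head_size
proj = d_model * d_model + d_model           # output projection W^O, with bias
sa_params = heads + proj

# Feed-forward: two linear layers, each with a bias.
d_ff = expansion * d_model
ffwd_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

# Each LayerNorm: one scale and one shift per channel.
ln_params = 2 * d_model

total = sa_params + ffwd_params + 2 * ln_params
print(sa_params, ffwd_params, ln_params, total)
```

If your module-by-module counts from block.parameters() disagree with these, the usual culprit is a bias term that was included or omitted.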

What’s next

Ch 6 — Assembling the GPT


A 37th-Chamber original. Methods cited: Vaswani et al. (2017), “Attention Is All You Need,” arXiv:1706.03762, §3.3 (position-wise FFN, d_ff = 2048 = 4×d_model, ReLU activation — confirmed); He et al. (arXiv 2015; CVPR 2016), “Deep Residual Learning for Image Recognition” (residual / identity shortcut connections — confirmed); Ba, Kiros & Hinton (2016), “Layer Normalization,” arXiv:1607.06450 (confirmed); GPT-2 pre-LN convention follows the nanoGPT lineage (Radford et al. 2019; Karpathy, nanoGPT) — deviation from original post-LN noted inline. All prose and code written fresh.