The Transformer Block
Chapters 3 and 4 built a machine that mixes information between positions. Multi-head attention looks outward — each token reaching across the sequence, gathering what it needs from other tokens. But that gathering is a routing operation, not a reasoning one. Once the positions have shared notes, something has to think about what they heard — independently, at each position. The transformer block assembles both stages, then wraps them in the plumbing that makes the whole thing stackable.
5.1 Where we are
After Chapter 4, MultiHeadAttention takes (B, T, C) and returns (B, T, C) — the same shape it received. That shape-preserving contract was deliberate: it is the condition that lets us compose operations without reshaping. Each sublayer in the transformer is built to honor it.
What multi-head attention does to that tensor is routing: it lets each of the T positions look at the others and update itself with a weighted blend of their values. After that operation, every position’s vector carries not just its own identity but a mix of context from across the sequence.
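The weighted blend described above can be sketched in a few lines. This is only an illustration: the scores and values are random stand-ins rather than the output of learned projections, and the toy sizes are assumptions chosen for readability.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, head_size = 4, 3                      # toy sizes, chosen only for illustration
scores = torch.randn(T, T)               # stand-in for the scaled q @ k^T scores
values = torch.randn(T, head_size)       # stand-in for the value vectors

# Causal mask: position t may only look at positions <= t.
mask = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(mask == 0, float("-inf"))

weights = F.softmax(scores, dim=-1)      # (T, T); each row is one set of blend weights
blended = weights @ values               # (T, head_size); one blended vector per position

row_sums = weights.sum(dim=-1)           # every row of weights sums to 1
first_matches = torch.allclose(blended[0], values[0])  # position 0 can only see itself
```

Each row of weights sums to one, so every output vector is a weighted average of the value vectors that position is allowed to see.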
What it does not do is transform the content of those vectors nonlinearly. The Q/K/V projections and the output projection are linear maps, and although the softmax is itself nonlinear, it only sets the mixing weights: each head’s output is still a weighted average of linearly projected value vectors. Nothing in the sublayer applies a nonlinearity to the features of the blended result, so the model cannot yet compute something genuinely new from it. For that, the block adds a second sublayer: a small feed-forward network that fires independently at every position.
5.2 The feed-forward network
In Vaswani et al. (2017, §3.3), the feed-forward sublayer is described as a position-wise fully connected network — “position-wise” meaning it is applied separately and identically at each position, with the same weights. It has no access to other positions; it only processes what that position now carries.
The function is:
FFN(x) = max(0, x W₁ + b₁) W₂ + b₂
Two linear transformations with a ReLU in between. In the paper’s base model: d_model = 512 and d_ff = 2048 — the inner layer is 4× wider than the residual stream. The network expands the representation, applies the nonlinearity, then projects back to d_model.
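To make the formula concrete, here is a minimal sketch that writes FFN(x) out by hand and checks the position-wise property. The weights are random stand-ins for learned parameters, and the toy sizes (d_model = 8) are assumptions made to keep the example small.

```python
import torch

torch.manual_seed(0)

B, T, d_model = 2, 5, 8                  # toy sizes; the paper's base model uses d_model = 512
d_ff = 4 * d_model

# Randomly initialised weights, standing in for learned parameters.
W1 = torch.randn(d_model, d_ff) * 0.1
b1 = torch.zeros(d_ff)
W2 = torch.randn(d_ff, d_model) * 0.1
b2 = torch.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to the last dimension."""
    hidden = torch.clamp(x @ W1 + b1, min=0)    # ReLU written as max(0, .)
    return hidden @ W2 + b2

x = torch.randn(B, T, d_model)
out_all = ffn(x)                                 # whole (B, T, d_model) tensor at once

# Same result when each position is pushed through on its own.
out_each = torch.stack([ffn(x[:, t, :]) for t in range(T)], dim=1)
same = torch.allclose(out_all, out_each, atol=1e-5)

# Perturbing position 3 changes only the output at position 3.
x2 = x.clone()
x2[:, 3, :] += 1.0
out2 = ffn(x2)
others_unchanged = (torch.allclose(out_all[:, :3], out2[:, :3])
                    and torch.allclose(out_all[:, 4:], out2[:, 4:]))
```

Because the same W1 and W2 are applied to every position's vector, running the whole tensor through at once and running one position at a time give the same answer.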
Here is the meeting-room metaphor that makes the two sublayers click together: attention is the meeting where positions share notes. Every token gets to speak, every token listens, and the output is an updated note for each position reflecting what it heard. The feed-forward network is each position going away alone to think about what it heard. Privately, nonlinearly, without any further communication. Then it writes a revised note and hands it to the next layer.
Why 4×? Vaswani et al. set d_ff = 2048 = 4 × 512, and this ratio has proven remarkably durable across model scales. The expansion creates room for the nonlinearity to activate different subsets of hidden units at different positions — in effect, each position independently decides which of the 2048 “features” to light up, then projects the result back to the channel dimension. The exact factor is a practical hyperparameter, not a theorem.
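One practical consequence of the expansion factor is parameter count. The sketch below uses a hypothetical helper, ffn_param_count (a name introduced here, not part of the book's code), to show how the FFN's size scales with the ratio, and cross-checks the arithmetic against real nn.Linear layers at the paper's d_model = 512.

```python
import torch.nn as nn

def ffn_param_count(d_model, ratio=4):
    """Weights and biases of the two linear layers in a position-wise FFN."""
    d_ff = ratio * d_model
    first = d_model * d_ff + d_ff        # W1 and b1
    second = d_ff * d_model + d_model    # W2 and b2
    return first + second

# Cross-check the closed form against actual nn.Linear layers.
d_model = 512
layers = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.ReLU(),
    nn.Linear(4 * d_model, d_model),
)
counted = sum(p.numel() for p in layers.parameters())

for ratio in (1, 2, 4, 8):
    print(ratio, ffn_param_count(d_model, ratio))
```

The count grows roughly linearly in the ratio, which is one reason the factor is treated as a tunable cost-versus-capacity knob.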
5.3 Residual connections
Deep networks are hard to train. The core problem is that gradients must flow backward from the loss through every layer to reach the parameters near the input. In a naive deep stack, those gradients tend to shrink (vanish) or explode as they multiply through layer after layer of Jacobians. By the time they reach the early layers, they carry almost no useful signal.
He et al. (arXiv 2015; CVPR 2016, “Deep Residual Learning for Image Recognition”) found a remarkably clean fix for convolutional networks: instead of asking each layer to learn a complete transformation H(x), ask it to learn only the residual — the difference from the input. The full computation becomes:
output = x + sublayer(x) # residual connection
The identity path — the raw x flowing straight through the + — is a gradient highway. During backpropagation, gradients flow through the addition unimpeded, all the way back to the earliest layers. The sublayer’s contribution is additive; if it is small or wrong, the identity path carries the signal regardless. Networks with residual connections can be meaningfully trained at depths that would otherwise not learn at all.
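The gradient-highway effect can be observed on a toy stack. The sketch below is not the transformer's sublayer: it assumes a small Linear + tanh layer as a stand-in, with width 16 and depth 20 picked arbitrarily, and measures how much gradient reaches the input with and without the identity path.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, depth = 16, 20                         # toy width and depth, chosen for illustration

def input_grad_norm(use_residual):
    """Norm of d(loss)/d(input) for a toy stack of Linear + tanh sublayers."""
    layers = [nn.Linear(d, d) for _ in range(depth)]
    x0 = torch.randn(1, d, requires_grad=True)
    x = x0
    for layer in layers:
        branch = torch.tanh(layer(x))     # the stand-in "sublayer"
        x = (x + branch) if use_residual else branch
    x.sum().backward()                    # a dummy scalar loss
    return x0.grad.norm().item()

grad_plain = input_grad_norm(use_residual=False)
grad_residual = input_grad_norm(use_residual=True)
print(f"plain: {grad_plain:.2e}   residual: {grad_residual:.2e}")
```

With default initialization, the plain stack should show a far smaller gradient at the input than the residual stack. The exact numbers depend on the seed and on the initialization, so treat them as qualitative.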
The transformer adopts this pattern around both sublayers. After attention, instead of passing the attention output directly to the next stage, we add it back onto the input that generated it. After the feed-forward network, same thing. The residual stream — the running x that persists across the entire block — accumulates contributions from each sublayer rather than being replaced by them.
Without residuals, each layer must reproduce everything worth keeping from its input while also transforming it. With them, a layer only has to learn a correction to what already flows through, and a sublayer that outputs nearly zero leaves the signal intact. In practice this makes deep stacks far easier to optimize.
5.4 Layer normalization
The second piece of stabilization plumbing is layer normalization (Ba, Kiros & Hinton, 2016, arXiv:1607.06450). As activations pass through sublayer after sublayer, values that started at a workable scale can drift. Large activations compound across depth, and a network whose activations are ten times larger in one layer than in another is difficult to optimize: a learning rate calibrated for one depth is wrong for another.
Layer norm fixes this by normalizing across the channel dimension at each position: for a vector of length C, it computes the mean and variance of that vector’s elements, then rescales them to have mean zero and variance one, with learned per-channel scale and shift parameters (γ and β) that let the model restore whatever range is useful. The normalization is per position — it operates on the C-dimensional vector at each (b, t) location independently, so it is not sensitive to batch size (unlike batch normalization).
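The computation just described can be written out by hand and compared with PyTorch's nn.LayerNorm. In this sketch the helper name layer_norm_by_hand and the toy shapes are assumptions for illustration; the small eps added to the variance for numerical safety is taken from the module itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, C = 2, 3, 8
x = torch.randn(B, T, C) * 5.0 + 2.0      # deliberately off-scale input

def layer_norm_by_hand(x, gamma, beta, eps=1e-5):
    """Normalise over the last (channel) dimension, then scale and shift."""
    mean = x.mean(dim=-1, keepdim=True)                   # one mean per (b, t)
    var = x.var(dim=-1, keepdim=True, unbiased=False)     # one variance per (b, t)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

ln = nn.LayerNorm(C)                      # gamma starts at 1, beta at 0
manual = layer_norm_by_hand(x, ln.weight, ln.bias, eps=ln.eps)
matches = torch.allclose(manual, ln(x), atol=1e-5)

per_position_mean = manual.mean(dim=-1)                  # (B, T), close to 0
per_position_var = manual.var(dim=-1, unbiased=False)    # (B, T), close to 1
```

The statistics are computed separately for every (b, t) location, so nothing here depends on how many sequences are in the batch.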
Pre-norm vs. post-norm — an honest deviation
The 2017 paper placed layer norm after each sublayer (post-LN): LayerNorm(x + sublayer(x)). This is the architecture as Vaswani et al. described it.
Modern GPT-style models (GPT-2, GPT-3, and the nanoGPT implementation that this book’s code follows) moved the layer norm to before each sublayer (pre-LN): x + sublayer(LayerNorm(x)). The difference is not cosmetic. In post-LN, the norm is applied to the sum x + sublayer(x), so the residual stream itself is renormalized after every addition and no path carries x through the block untouched. In pre-LN, the norm only touches the branch being computed; the identity path flows through + completely unchanged. Xiong et al. (2020, arXiv:2002.04745) analyze this difference and report that pre-LN transformers train more stably and can be trained without the learning-rate warmup stage that post-LN typically needs.
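The two placements can be put side by side. In this sketch a single nn.Linear is a hypothetical stand-in for attention or the FFN, and the function names post_ln and pre_ln are introduced only for the comparison.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

C = 8
ln = nn.LayerNorm(C)
sublayer = nn.Linear(C, C)                # hypothetical stand-in for attention or the FFN

def post_ln(x):
    return ln(x + sublayer(x))            # Vaswani et al. (2017): norm after the add

def pre_ln(x):
    return x + sublayer(ln(x))            # GPT-2 / nanoGPT: norm inside the branch

x = torch.randn(2, 3, C) * 10.0           # a residual stream with a large scale

out_post = post_ln(x)
out_pre = pre_ln(x)

# Post-LN: the sum itself is renormalised, so each position ends up
# with mean near 0 and variance near 1, whatever scale x had.
post_mean = out_post.mean(dim=-1)
post_var = out_post.var(dim=-1, unbiased=False)

# Pre-LN: x passes through the + untouched, so the output keeps x's scale.
pre_std = out_pre.std().item()
```

With a large-scale input, the post-LN output is squeezed back to unit variance at every position, while the pre-LN output still carries the scale of x plus whatever the branch added.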
Our code follows the pre-LN convention. This is a deliberate deviation from the 2017 paper, following the GPT-2/nanoGPT lineage. Where you see self.ln1 and self.ln2 in the block below, they fire before their respective sublayers, not after.
5.5 The code
Here is the full block. MultiHeadAttention is brought forward from Chapter 4 exactly as written; the two new modules are FeedForward and Block. Read every shape comment.
import torch
import torch.nn as nn
import torch.nn.functional as F


# ── Brought forward from Chapters 3 & 4 ─────────────────────────────────────

class SelfAttentionHead(nn.Module):
    """One head of causal self-attention. (B,T,C) -> (B,T,head_size)."""

    def __init__(self, d_model, head_size, context_length, dropout=0.0):
        super().__init__()
        self.key = nn.Linear(d_model, head_size, bias=False)
        self.query = nn.Linear(d_model, head_size, bias=False)
        self.value = nn.Linear(d_model, head_size, bias=False)
        # lower-triangular mask: position t may attend only to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(context_length, context_length)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                   # x: (B, T, C)
        B, T, C = x.shape
        k = self.key(x)                                     # (B, T, head_size)
        q = self.query(x)                                   # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)                        # (B, T, T)
        wei = self.dropout(wei)
        v = self.value(x)                                   # (B, T, head_size)
        return wei @ v                                      # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """h parallel heads, concatenated + projected. (B,T,C) -> (B,T,C)."""

    def __init__(self, d_model, num_heads, context_length, dropout=0.0):
        super().__init__()
        assert d_model % num_heads == 0
        head_size = d_model // num_heads
        self.heads = nn.ModuleList([
            SelfAttentionHead(d_model, head_size, context_length, dropout)
            for _ in range(num_heads)
        ])
        self.proj = nn.Linear(d_model, d_model)                 # W^O
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                       # x: (B, T, d_model)
        out = torch.cat([h(x) for h in self.heads], dim=-1)     # (B, T, d_model)
        return self.dropout(self.proj(out))                     # (B, T, d_model)

# ── Chapter 5: new modules ───────────────────────────────────────────────────

class FeedForward(nn.Module):
    """
    Position-wise feed-forward network (Vaswani et al. 2017, §3.3).
    Applied identically and independently at each position.

    Input:  (B, T, d_model)
    Output: (B, T, d_model)
    """

    def __init__(self, d_model, dropout=0.0):
        super().__init__()
        # expand 4× into d_ff, apply nonlinearity, project back.
        # Vaswani et al. use ReLU; modern GPT-style models (GPT-2, nanoGPT)
        # use GELU. We follow the original paper's ReLU here for clarity.
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),   # (B, T, d_model) -> (B, T, 4*d_model)
            nn.ReLU(),                         # pointwise nonlinearity
            nn.Linear(4 * d_model, d_model),   # (B, T, 4*d_model) -> (B, T, d_model)
            nn.Dropout(dropout),
        )

    def forward(self, x):                      # x: (B, T, d_model)
        return self.net(x)                     # (B, T, d_model)

class Block(nn.Module):
    """
    One transformer block: pre-LN self-attention + pre-LN feed-forward,
    each wrapped in a residual connection.

    Convention note: the 2017 paper normalized AFTER each sublayer (post-LN).
    We follow the GPT-2 / nanoGPT convention: normalize BEFORE (pre-LN).
    Pre-LN preserves the identity gradient highway through the + unchanged,
    which gives more stable training at depth.

    Input:  (B, T, d_model)
    Output: (B, T, d_model); same shape throughout, so the block is stackable.
    """

    def __init__(self, d_model, num_heads, context_length, dropout=0.0):
        super().__init__()
        self.sa = MultiHeadAttention(d_model, num_heads, context_length, dropout)
        self.ffwd = FeedForward(d_model, dropout)
        self.ln1 = nn.LayerNorm(d_model)       # fires before attention
        self.ln2 = nn.LayerNorm(d_model)       # fires before feed-forward

    def forward(self, x):                      # x: (B, T, d_model)
        # sublayer 1: multi-head self-attention
        #   pre-LN:   normalise x before passing to attention
        #   residual: add the attention output back onto x
        x = x + self.sa(self.ln1(x))           # (B, T, d_model)

        # sublayer 2: position-wise feed-forward
        #   pre-LN:   normalise x before passing to FFN
        #   residual: add the FFN output back onto x
        x = x + self.ffwd(self.ln2(x))         # (B, T, d_model)
        return x                               # (B, T, d_model)

# ── sanity check ─────────────────────────────────────────────────────────────

B, T, C = 4, 8, 384   # batch=4, sequence_len=8, d_model=384
x = torch.randn(B, T, C)
block = Block(d_model=384, num_heads=6, context_length=256, dropout=0.1)
out = block(x)
print(out.shape)      # torch.Size([4, 8, 384]), identical to input
Line-by-line walk
- FeedForward.__init__: nn.Sequential chains the two linear layers and the ReLU in order. The first nn.Linear(d_model, 4 * d_model) expands from 384 to 1536; the second projects back. Each position’s C-dimensional vector is processed by this sequence independently. No position talks to another inside the FFN.
- nn.ReLU(): max(0, x) pointwise. This is the activation Vaswani et al. used. GPT-2 and most subsequent GPT-style models switched to GELU (Gaussian Error Linear Unit), which is smooth and differentiable everywhere: where ReLU is a hard zero for negative inputs, GELU curves gently, which can improve gradient flow. The comment in the code notes this. For this walkthrough, ReLU keeps the math transparent.
- Block.__init__: self.sa is the MultiHeadAttention module from Chapter 4, unchanged. self.ln1 and self.ln2 are nn.LayerNorm(d_model) instances; each normalizes the last dimension of its input to mean 0, variance 1, then applies learned per-channel scale and shift.
- x = x + self.sa(self.ln1(x)): the pre-LN residual for attention. self.ln1(x) normalizes the current residual stream; self.sa(...) runs multi-head attention on the normalized input; the result is added back onto the un-normalized x. The + is the gradient highway: gradients flow through it directly.
- x = x + self.ffwd(self.ln2(x)): same pattern for the feed-forward sublayer. self.ln2 normalizes the updated x; self.ffwd expands, applies the nonlinearity, and projects back; the result is added back. x at the end of the forward pass carries both sublayers’ contributions superimposed on whatever came in.
- Output shape: (B, T, d_model). The block received (4, 8, 384) and returned (4, 8, 384). This is the contract. You can feed the output of one block directly into another block with zero reshaping.
5.6 The thing to actually understand
- Attention routes. FFN thinks. Multi-head attention moves information between positions; the feed-forward network processes that information nonlinearly at each position. Neither alone is sufficient. Together they constitute one complete pass of communicate-then-reason.
- Residual connections are not optional at depth. x + sublayer(x) gives gradients a path that bypasses every learned transformation. Without it, deep stacks of the same block do not train in any reasonable sense. This is the insight from He et al. (2015), ported directly into the transformer.
- Layer norm keeps the numbers honest. Every sublayer can push activations to large values; layer norm at each position normalizes before they compound. The choice of where to normalize (before or after the sublayer) matters: pre-LN, which we use, preserves the identity path through +; post-LN, which the 2017 paper used, renormalizes the residual stream itself. Pre-LN is the modern default.
- The block is a composable unit. (B, T, d_model) in, (B, T, d_model) out. Stack N of them and the shapes never change. The depth of a GPT is literally the number of times this block repeats.
- 4× expansion in the FFN is a convention, not a law. Vaswani et al. used d_ff = 4 × d_model; it has held up well, but modern architectures experiment with this ratio freely.
5.7 Exercises
- Confirm the shape contract across a stack. Write nn.Sequential(*[Block(d_model=384, num_heads=6, context_length=256) for _ in range(6)]). Pass a random (4, 8, 384) tensor through. Assert the output is (4, 8, 384). Nothing about the shape should have changed after six blocks.
- Remove the residual connections. Comment out both x = x + ... lines and replace them with x = self.sa(self.ln1(x)) and x = self.ffwd(self.ln2(x)). Stack four of these non-residual blocks. Train on a tiny character dataset for 200 steps. Compare loss curves to the residual version. The point is not to succeed but to observe the difference.
- Swap ReLU for GELU. In FeedForward, replace nn.ReLU() with nn.GELU(). Run the sanity check; the output shape should not change. This is the single line that converts from the Vaswani 2017 FFN to the GPT-2 FFN.
- Switch pre-LN to post-LN. Rewrite Block.forward to compute x = self.ln1(x + self.sa(x)) and x = self.ln2(x + self.ffwd(x)). Run the sanity check. Then discuss: why does post-LN renormalize the identity path, and what effect might that have on gradient flow compared to pre-LN?
- Count the parameters. Run sum(p.numel() for p in block.parameters()) on your Block(384, 6, 256). Break it down by module: block.sa, block.ffwd, block.ln1, block.ln2. Which module owns the most parameters? (Hint: the FFN’s two large linear layers at 384 × 1536 and 1536 × 384 add up quickly.)
A 37th-Chamber original. Methods cited: Vaswani et al. (2017), “Attention Is All You Need,” arXiv:1706.03762, §3.3 (position-wise FFN, d_ff = 2048 = 4×d_model, ReLU activation — confirmed); He et al. (arXiv 2015; CVPR 2016), “Deep Residual Learning for Image Recognition” (residual / identity shortcut connections — confirmed); Ba, Kiros & Hinton (2016), “Layer Normalization,” arXiv:1607.06450 (confirmed); GPT-2 pre-LN convention follows the nanoGPT lineage (Radford et al. 2019; Karpathy, nanoGPT) — deviation from original post-LN noted inline. All prose and code written fresh.