The Opaque Box · Chapter 3

Attention, the Whole Game

Until now each token has been an island: embedded, positioned, but alone. Attention is the bridge, the mechanism that lets every token reach back across the sequence, decide which earlier tokens matter to it, and pull their information into itself. It is the one idea that made modern language models possible. Underneath the name, it is three learned projections, a table of dot products, a softmax, and a weighted average.

3.1  The one thing that’s missing

At the end of Chapter 2 we had a tensor of shape (B, T, C) — a batch of sequences, each token a 384-dimensional vector that knows what it is and where it sits. But every vector still sits in isolation. Nothing has let "bites" look back at "dog" to work out who is doing the biting.

That’s the hole. Language is contextual: the meaning of a word depends on the words around it. “bank” near “river” is a shoreline; “bank” near “money” is a vault. For the model to disambiguate, each token must be able to gather information from the other tokens — and not equally, but selectively, weighting the ones that matter to it and ignoring the rest.

That selective gathering is attention. The version we build, where a sequence attends to itself, is self-attention (Vaswani et al., “Attention Is All You Need,” 2017). The idea has older roots in machine translation, where Bahdanau et al. (2014) first let a decoder “attend” to different parts of a source sentence; the contribution of Vaswani et al. was to throw away everything else (the recurrence, the convolutions) and show that attention alone was enough.


3.2  The library analogy: query, key, value

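Here is the whole mechanism in one image. Imagine each token walks into a library carrying three things at once:

  query: what I am looking for
  key: what I advertise to everyone else
  value: what I hand over when someone matches me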

To figure out where to look, a token compares its own query against every token’s key. Where a query and a key are aligned — pointing the same direction — that’s a match, a high score. The token then collects the values of the tokens it matched, weighted by how strong each match was, and adds that blend into itself.

The crucial move: query, key, and value are all computed from the token’s own vector, each by a different learned projection. The token decides what to look for, what to advertise, and what to share — and training shapes all three. That’s why it’s “self”-attention: one sequence, generating its own queries, keys, and values.

q (what I want)  ─┐
                  ├─ dot product → score → softmax → weights ─┐
k (what I offer) ─┘                                           ├─ weighted sum → output
v (what I'll give) ──────────────────────────────────────────┘

3.3  The score: a query against every key

Take a single token at position t. Its query is a vector q_t. Every token i in the sequence has a key k_i. The attention score of t looking at i is just their dot product:

score(t, i) = q_t · k_i

A dot product is an alignment meter: large and positive when the two vectors point the same way, near zero when they’re perpendicular, negative when opposed. So score(t, i) measures “how much does what token t is looking for match what token i is advertising?”
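The three cases can be checked by hand. This is a minimal sketch in plain Python (no PyTorch, so every number is easy to verify); the vectors and the helper name dot are made up for illustration.

```python
def dot(a, b):
    """Sum of elementwise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0]              # hypothetical query
k_aligned = [2.0, 4.0]      # points the same way as q
k_perp = [-2.0, 1.0]        # perpendicular to q
k_opposed = [-1.0, -2.0]    # points the opposite way

aligned = dot(q, k_aligned)   # 1*2 + 2*4 = 10
perp = dot(q, k_perp)         # 1*(-2) + 2*1 = 0
opposed = dot(q, k_opposed)   # 1*(-1) + 2*(-2) = -5

print(aligned, perp, opposed)
```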

Do this for all pairs at once and you get a T × T table of scores — row t, column i = how much t attends to i. In matrix form, if Q and K are (T, head_size) matrices (one row per token):

scores = Q @ Kᵀ        # (T, head_size) @ (head_size, T) = (T, T)
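To see that the matrix product really is "every query against every key", here is a small sketch in plain Python with T = 3 and head_size = 2. The numbers in Q and K are hypothetical; entry scores[t][i] is the dot product of row t of Q with row i of K.

```python
def dot(a, b):
    """Sum of elementwise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical queries and keys: one row per token, T = 3, head_size = 2
Q = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
K = [[1.0, 0.0],
     [0.0, 2.0],
     [-1.0, 1.0]]

# scores[t][i] = q_t . k_i, which is what Q @ K-transpose computes
scores = [[dot(q_t, k_i) for k_i in K] for q_t in Q]

for row in scores:
    print(row)
```

Row 2 reads [1.0, 2.0, 0.0]: token 2's query lines up best with token 1's key.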

3.4  From scores to weights: softmax

Raw scores are arbitrary real numbers. We want, for each token, a set of weights that are positive and sum to 1 — a proper mixture saying “spend 70% of my attention here, 20% there, 10% there.” That’s exactly what softmax does, applied across each row:

weights[t] = softmax(scores[t])     # over the columns i

Softmax exponentiates (making everything positive, and amplifying the biggest scores) then normalizes (so the row sums to 1). After softmax, row t is a probability distribution over which tokens t is paying attention to.
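A sketch of softmax on a single row of scores, in plain Python. Subtracting the row maximum before exponentiating is a standard numerical-stability trick that is not part of the text above; it does not change the result. The input scores are hypothetical.

```python
import math

def softmax(row):
    """Exponentiate, then normalize so the entries sum to 1."""
    m = max(row)                            # stability shift (assumption, see lead-in)
    exps = [math.exp(s - m) for s in row]   # all positive
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([2.0, 1.0, 0.0])          # hypothetical scores for one token
print(weights)                              # roughly [0.665, 0.245, 0.090]
print(sum(weights))
```

A gap of 1 between neighbouring scores becomes a ratio of e ≈ 2.7 between neighbouring weights, which is the "amplifying" effect.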


3.5  The output: a weighted sum of values

Now use those weights to blend the values:

output[t] = Σ_i  weights[t, i] · v_i

Token t’s output is a weighted average of every value, weighted by how much it attended to each. In matrix form:

output = weights @ V        # (T, T) @ (T, head_size) = (T, head_size)
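The weighted sum can be written out directly. A minimal sketch in plain Python, with hypothetical weights (each row already sums to 1) and hypothetical values; the helper name attend is made up for illustration.

```python
# Hypothetical attention weights: one row per token, each row sums to 1
weights = [[1.0, 0.0, 0.0],
           [0.5, 0.5, 0.0],
           [0.2, 0.3, 0.5]]

# Hypothetical values: T = 3, head_size = 2
V = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]

def attend(weights, V):
    """output[t][d] = sum over i of weights[t][i] * V[i][d]."""
    head_size = len(V[0])
    return [[sum(w * v[d] for w, v in zip(row, V)) for d in range(head_size)]
            for row in weights]

output = attend(weights, V)
for row in output:
    print(row)
```

Token 1 splits its attention evenly between tokens 0 and 1, so its output is the midpoint of their values, [0.5, 0.5].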

That’s it. That’s attention: softmax(Q @ Kᵀ) @ V, with two adjustments still to add — a scale factor (3.6) and a mask (3.7).


3.6  Why we divide by √(head_size)

There’s a subtle problem. Each score is a dot product summed over head_size dimensions. If the query and key components are roughly independent, with mean 0 and variance ~1 (a fair idealization of the model at initialization), then the dot product has variance of about head_size, so its standard deviation grows as √(head_size): a 16-dim dot product has a standard deviation around √16 = 4; a 64-dim one around √64 = 8.

Large scores are poison for softmax. When the inputs to softmax are large in magnitude, it saturates: it returns something nearly one-hot (≈ 1 on the max, ≈ 0 everywhere else). A near-one-hot attention at the start of training is bad — gradients through softmax vanish where it’s saturated, and the model can barely learn where to look.
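Saturation is easy to see on three numbers. The sketch below (plain Python, hypothetical scores) applies softmax to a small-magnitude row and to the same row scaled up by 100. It also prints p·(1 − p) for the largest weight, which is the derivative of a softmax output with respect to its own input, as a rough indicator of how much gradient gets through.

```python
import math

def softmax(row):
    m = max(row)                            # stability shift; does not change the result
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

gentle = softmax([0.1, 0.2, 0.3])           # small scores: diffuse weights
saturated = softmax([10.0, 20.0, 30.0])     # same pattern, 100x larger: nearly one-hot

# d p_i / d s_i = p_i * (1 - p_i) for a softmax output p_i
grad_gentle = max(gentle) * (1 - max(gentle))
grad_saturated = max(saturated) * (1 - max(saturated))

print(gentle)
print(saturated)
print(grad_gentle, grad_saturated)
```

The ordering of the scores is identical in both rows; only their magnitude changed, and that alone is enough to push almost all the weight onto one position.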

The fix (Vaswani et al. 2017, §3.2.1) is to scale the scores back down before the softmax, by exactly the standard deviation we expect:

scores = (Q @ Kᵀ) / head_size**0.5

This is why it’s called scaled dot-product attention. Now the scores start with variance ~1, softmax starts gentle and diffuse, and the model is free to sharpen its attention as it learns. (Exercise 4 has you delete the scaling and watch softmax saturate.)
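The √(head_size) claim can be checked empirically. This sketch assumes query and key components are independent draws from a standard normal (an idealization, not the exact distribution of a real model's activations) and estimates the standard deviation of the score with and without the scale factor. The function name score_std and the sample count are arbitrary choices.

```python
import math
import random

random.seed(0)

def score_std(head_size, scaled, n_samples=4000):
    """Estimate the std of q . k for random unit-variance q and k."""
    scores = []
    for _ in range(n_samples):
        q = [random.gauss(0.0, 1.0) for _ in range(head_size)]
        k = [random.gauss(0.0, 1.0) for _ in range(head_size)]
        s = sum(a * b for a, b in zip(q, k))
        if scaled:
            s /= math.sqrt(head_size)       # the scaling from this section
        scores.append(s)
    mean = sum(scores) / n_samples
    var = sum((s - mean) ** 2 for s in scores) / n_samples
    return math.sqrt(var)

std_16 = score_std(16, scaled=False)        # expected to land near 4
std_64 = score_std(64, scaled=False)        # expected to land near 8
std_64_scaled = score_std(64, scaled=True)  # expected to land near 1

print(std_16, std_64, std_64_scaled)
```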


3.7  Causal masking: no peeking at the future

Our model’s job is to predict the next token. During training we show it a whole sequence at once (Chapter 2’s sliding window) and ask it to predict the next token at every position simultaneously. That’s efficient — but it creates a trap: when computing the prediction at position t, the token must not be allowed to attend to positions t+1, t+2, …. Those are the answers. If t could see the future, training would be a fraud — the model would learn to cheat by copying the next token, and at generation time (when there is no future yet) it would collapse.

So we mask: before the softmax, we force every score where i > t (a token looking ahead) to −∞. After softmax, e^(−∞) = 0, so those positions get exactly zero weight. Each token can attend only to itself and the tokens before it. This is causal (a.k.a. masked) self-attention — the defining feature of a decoder-only model like GPT.

The mask is a lower-triangular matrix of ones (tril): keep the diagonal and everything below it, zero out everything above.

allowed (T=4):
  1 0 0 0               token 0 sees: [0]
  1 1 0 0               token 1 sees: [0,1]
  1 1 1 0               token 2 sees: [0,1,2]
  1 1 1 1               token 3 sees: [0,1,2,3]
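The mask-then-softmax step can be traced on a T = 4 example. This is a sketch in plain Python with made-up scores; it builds the lower-triangular pattern, overwrites every disallowed score with −∞, and applies softmax row by row.

```python
import math

NEG_INF = float("-inf")
T = 4

# Lower-triangular "allowed" pattern: token t may look at i only if i <= t
tril = [[1 if i <= t else 0 for i in range(T)] for t in range(T)]

# Hypothetical raw scores (any real numbers would do)
scores = [[0.5, 1.0, -1.0, 2.0],
          [1.5, 0.0, 0.3, -0.7],
          [0.2, 0.9, -0.4, 1.1],
          [-1.0, 0.6, 0.8, 0.1]]

# Future positions get -inf BEFORE the softmax
masked = [[s if keep else NEG_INF for s, keep in zip(s_row, m_row)]
          for s_row, m_row in zip(scores, tril)]

def softmax(row):
    m = max(row)                            # the diagonal is always finite, so m is too
    exps = [math.exp(s - m) for s in row]   # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

weights = [softmax(row) for row in masked]
for row in weights:
    print([round(w, 3) for w in row])
```

Masking before the softmax, rather than zeroing weights afterwards, means each row is normalized over the allowed positions only, so it still sums to 1.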

3.8  One head, from scratch

Let’s build a single attention head end-to-end with raw tensors, so every shape is visible. Type it and run it.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)

B, T, C = 4, 8, 384      # batch, time (tokens), channels (d_model) — the river from Ch2
x = torch.randn(B, T, C) # stand-in for the (B, T, C) coming out of GPTFrontEnd

head_size = 16           # the width of THIS head's q/k/v space (a hyperparameter)

# three SEPARATE learned projections — no bias, by convention
key   = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)               # (B, T, head_size) — what each token advertises
q = query(x)             # (B, T, head_size) — what each token is looking for
v = value(x)             # (B, T, head_size) — what each token will hand over

# 1) scores: every query against every key
wei = q @ k.transpose(-2, -1)        # (B,T,hs) @ (B,hs,T) = (B, T, T)
wei = wei * head_size**-0.5          # 2) scale (section 3.6)

# 3) causal mask: set future scores to -inf so softmax gives them zero weight
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))   # (B, T, T)

# 4) softmax → weights that sum to 1 across the allowed positions
wei = F.softmax(wei, dim=-1)         # (B, T, T)

# 5) weighted sum of values
out = wei @ v                        # (B,T,T) @ (B,T,hs) = (B, T, head_size)

print(wei[0])        # look at one example's (T, T) attention matrix
print(out.shape)     # torch.Size([4, 8, 16])

Run it and read wei[0]. You’ll see a lower-triangular matrix where every row sums to 1 and the upper triangle is exactly 0 — token 0’s row is [1, 0, 0, …] (it can only attend to itself), token 7’s row is a full distribution over all 8 positions. That picture is causal attention. Stare at it until it’s boring.

Note the head shrinks the width. Input channels C = 384, but this head projects down to head_size = 16. Each head works in a smaller subspace. In Chapter 4 we’ll run several heads in parallel and concatenate them back up to C — that’s multi-head attention. One head learns one kind of looking (maybe “verbs seeking subjects”); many heads look in many ways at once.

3.9  Wrapping it as a module

Now the reusable version, in the nn.Module form the rest of the book builds on. Two upgrades: it stores its mask as a buffer, and it adds dropout (regularization we’ll motivate in Ch7).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """One head of causal self-attention. Input (B,T,C) -> output (B,T,head_size)."""
    def __init__(self, d_model, head_size, context_length, dropout=0.0):
        super().__init__()
        self.key   = nn.Linear(d_model, head_size, bias=False)
        self.query = nn.Linear(d_model, head_size, bias=False)
        self.value = nn.Linear(d_model, head_size, bias=False)
        # tril isn't learned, but it must travel with the module (.to(device), state_dict).
        # register_buffer is how you attach non-parameter tensors to an nn.Module.
        self.register_buffer("tril", torch.tril(torch.ones(context_length, context_length)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                       # x: (B, T, C)
        B, T, C = x.shape
        k = self.key(x)                         # (B, T, head_size)
        q = self.query(x)                       # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5    # (B, T, T), scaled
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal
        wei = F.softmax(wei, dim=-1)            # (B, T, T)
        wei = self.dropout(wei)
        v = self.value(x)                       # (B, T, head_size)
        out = wei @ v                           # (B, T, head_size)
        return out

# sanity check
head = SelfAttentionHead(d_model=384, head_size=16, context_length=256)
x = torch.randn(4, 8, 384)        # (B, T, C)
print(head(x).shape)              # torch.Size([4, 8, 16])

The slice self.tril[:T, :T] is the small detail that makes this robust: the buffer is built at full context_length, but we only ever use the top-left T × T corner, so the same head works for any sequence length up to the ceiling — including the single growing sequence we’ll feed it during generation in Ch8.
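A quick way to convince yourself of both points (any length up to the ceiling, and no peeking ahead) is sketched below. The class is repeated in compact form only so the snippet runs on its own; the batch size, the seed, and the choice to perturb positions 5 to 7 are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    """Same head as above, repeated so this sketch is self-contained."""
    def __init__(self, d_model, head_size, context_length, dropout=0.0):
        super().__init__()
        self.key = nn.Linear(d_model, head_size, bias=False)
        self.query = nn.Linear(d_model, head_size, bias=False)
        self.value = nn.Linear(d_model, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(context_length, context_length)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v

torch.manual_seed(0)
head = SelfAttentionHead(d_model=384, head_size=16, context_length=256)
head.eval()  # dropout is 0.0 here anyway; eval() makes that explicit

with torch.no_grad():
    # 1) One head, several sequence lengths up to context_length
    shapes = {T: tuple(head(torch.randn(2, T, 384)).shape) for T in (1, 8, 256)}

    # 2) Changing tokens 5..7 must not change the outputs at positions 0..4
    x = torch.randn(2, 8, 384)
    x_changed = x.clone()
    x_changed[:, 5:] = torch.randn(2, 3, 384)
    same_past = torch.allclose(head(x)[:, :5], head(x_changed)[:, :5], atol=1e-6)

# 3) The mask is a buffer, not a parameter
param_names = [name for name, _ in head.named_parameters()]
buffer_names = [name for name, _ in head.named_buffers()]

print(shapes)
print(same_past)
print(param_names, buffer_names)
```

The second check is the causal mask seen from the outside: edit the future and the past does not move.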


3.10  The thing to actually understand


3.11  Exercises

  1. Read the matrix. Run the §3.8 code and print wei[0]. Confirm by eye: lower-triangular, rows sum to 1, upper triangle all 0. For token 3’s row, hand-check that the weights sum to 1.
  2. Attention as a moving average. Replace the learned q @ kᵀ scores with a matrix of all zeros before masking + softmax. What does each token’s output become? (Hint: a uniform average over all past tokens. This is the “bag of the past” baseline attention improves on — implement it and confirm out[:, t] equals the mean of v[:, :t+1].)
  3. Order now matters. In Ch2, Exercise 2 showed embeddings without positions are order-blind. Feed this head two sequences that are token-permutations of each other (with position embeddings from Ch2 added first) and show the outputs now differ. Attention + positions = order-aware.
  4. Break the scaling. Delete the * head_size**-0.5 term and set head_size = 256. Print wei[0] — watch the rows collapse toward one-hot (one weight ≈ 1, the rest ≈ 0) before any training. Explain in one sentence why that kills gradient flow.
  5. Break the mask. Replace tril with torch.ones(T, T) (no masking). This is now a bidirectional encoder head (BERT-style), useful for classification but wrong for next-token prediction. Explain precisely what the model could “cheat” at during training if you trained a GPT with this head.

A 37th-Chamber original. Methods cited (Bahdanau et al. 2014; Vaswani et al. 2017); all prose and code written fresh.