Words as Directions: Embeddings & the Input Pipeline
An integer ID is a name tag, nothing more. Token 257 is not “bigger” than token 256, and it is no closer to token 256 than to token 9. Before the model can reason, each token has to become a point in space: a direction it can measure angles and distances against. That’s an embedding, and it’s where meaning starts to live.
2.1 Why an ID isn’t enough
After Chapter 1, "the" might be token 256. But if we feed the raw integer 256 into the network, we’ve told it a lie: that 256 is “more” than 255 and “half of” 512. Integers carry an ordering and a magnitude that token IDs do not have. The IDs are categorical — arbitrary labels.
The fix: give every token its own vector — a list of d_model real numbers (we’ll use 384). Now a token isn’t a scalar with a fake magnitude; it’s a direction in a 384-dimensional space, and the model is free to learn where each token should sit. Tokens used in similar ways drift to similar directions. This is the famous result where “king − man + woman ≈ queen” — relationships become geometry. We don’t hand-build that; we give each token a learnable vector and let training move it.
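To make “relationships become geometry” concrete, here is a toy sketch with hand-built 3-dimensional vectors. The axes (royalty, male, female) and all four vectors are invented for illustration and are not taken from any trained model; learned embeddings find their own directions, and they are rarely this clean.

```python
import torch
import torch.nn.functional as F

# Hypothetical hand-built vectors; the axes are (royalty, male, female).
words = ["king", "queen", "man", "woman"]
vectors = torch.tensor([
    [1.0, 1.0, 0.0],  # king
    [1.0, 0.0, 1.0],  # queen
    [0.0, 1.0, 0.0],  # man
    [0.0, 0.0, 1.0],  # woman
])
lookup = dict(zip(words, vectors))

# Start at "king" and move along the direction that turns "man" into "woman".
result = lookup["king"] - lookup["man"] + lookup["woman"]

# Cosine similarity compares direction only and ignores vector length.
sims = F.cosine_similarity(result.unsqueeze(0), vectors, dim=1)  # shape (4,)
best = words[sims.argmax().item()]
print(best)  # queen
```

The point is only that an analogy can be expressed as vector arithmetic once tokens are directions rather than integers.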
2.2 The embedding table
An embedding layer is just a lookup table: a matrix of shape (vocab_size, d_model). Row i is the vector for token i. “Embedding a token” = “select its row.” In PyTorch:
import torch
import torch.nn as nn
vocab_size = 512 # from our Ch1 tokenizer
d_model = 384 # the width of each token's vector ("embedding dimension")
token_embedding = nn.Embedding(vocab_size, d_model)
# a tiny batch of token IDs (we'll explain the shape below)
ids = torch.tensor([[256, 257, 9, 14]]) # shape (1, 4): 1 sequence, 4 tokens
vectors = token_embedding(ids) # shape (1, 4, 384)
print(vectors.shape) # torch.Size([1, 4, 384])
nn.Embedding(vocab_size, d_model) creates that (512, 384) matrix, initialized randomly. Those 512×384 = 196,608 numbers are parameters — they get learned during training. The embedding table is usually one of the largest single pieces of a small model.
Mechanically, nn.Embedding is a (vocab_size, d_model) matrix indexed by token ID. It is equivalent to multiplying a one-hot vector by that matrix, but indexing is what actually happens, because it’s vastly cheaper. Knowing the equivalence matters in Chapter 9, where the embedding and the output layer can share this matrix (weight tying).
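The lookup/one-hot equivalence is easy to check numerically. The sketch below reuses the sizes from this section; the variable names one_hot, looked_up, and multiplied are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 512, 384
token_embedding = nn.Embedding(vocab_size, d_model)

# The table is a single learnable parameter of shape (vocab_size, d_model).
print(token_embedding.weight.shape)    # torch.Size([512, 384])
print(token_embedding.weight.numel())  # 196608

ids = torch.tensor([[256, 257, 9, 14]])  # (1, 4)

# Route 1: indexing, which is what nn.Embedding does.
looked_up = token_embedding(ids)  # (1, 4, 384)

# Route 2: one-hot rows multiplied by the same matrix.
one_hot = F.one_hot(ids, num_classes=vocab_size).float()  # (1, 4, 512)
multiplied = one_hot @ token_embedding.weight             # (1, 4, 384)

print(torch.allclose(looked_up, multiplied))  # True
```

Route 2 builds a (1, 4, 512) tensor that is almost entirely zeros just to pick out four rows, which is why indexing is the cheaper path.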
2.3 The problem of order
Here’s something that trips everyone up the first time. The attention mechanism we build in Chapter 3 is, by itself, order-blind. Feed it "dog bites man" and "man bites dog" as bags of token-vectors and — astonishingly — it computes the same thing for each word regardless of where the word sits. There is no built-in notion of “first” or “next.” (This is a real property of self-attention, not a simplification; it’s called permutation equivariance.)
But order is the entire game in language. “man bites dog” is news; “dog bites man” is Tuesday. So we have to inject position information ourselves.
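The order-blindness described above can be checked with a few lines. This is a minimal sketch of unmasked, single-head self-attention with random weights, not the implementation Chapter 3 builds; the function self_attention and the weight names W_q, W_k, W_v are assumptions made for this example.

```python
import torch

torch.manual_seed(0)
T, C = 3, 8  # 3 tokens (think "dog", "bites", "man"), 8-dimensional vectors

# Stand-in token vectors with no position information, in float64 for a clean comparison.
x = torch.randn(T, C, dtype=torch.float64)
W_q, W_k, W_v = (torch.randn(C, C, dtype=torch.float64) for _ in range(3))

def self_attention(x):
    """Unmasked single-head self-attention over a (T, C) sequence."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    weights = torch.softmax(q @ k.T / C**0.5, dim=-1)  # (T, T)
    return weights @ v                                  # (T, C)

perm = torch.tensor([2, 1, 0])  # reverse the order: "man", "bites", "dog"
out = self_attention(x)
out_reordered = self_attention(x[perm])

# Each word gets the same output vector; the rows simply move with the words.
print(torch.allclose(out[perm], out_reordered))  # True
```

Reordering the input only reorders the output rows, so nothing in the result records which word came first.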
The simplest effective method (and what GPT-2 uses): a second learnable embedding table, indexed by position instead of by token.
context_length = 256 # the most tokens the model ever sees at once
position_embedding = nn.Embedding(context_length, d_model)
T = ids.shape[1] # number of tokens in this sequence (4 above)
positions = torch.arange(T) # tensor([0, 1, 2, 3])
pos_vectors = position_embedding(positions) # shape (T, 384) -> (4, 384)
Position 0 has a learned vector, position 1 has another, and so on up to context_length - 1. The model learns what “being third” should mean.
Combining token + position
We simply add the two:
# token vectors: (B, T, d_model) -- B sequences, T tokens each, d_model-wide
# position vectors: ( T, d_model)
# they add by broadcasting over the batch dimension B:
x = token_embedding(ids) + position_embedding(torch.arange(T))
print(x.shape) # torch.Size([1, 4, 384])
Each token’s final input vector is “what token I am” + “where I am.” That sum is the actual input to the transformer. Addition feels almost too simple — but it works because the model has 384 dimensions to spread these two signals across, and training sorts out how to keep them legible.
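Here is a small sketch of what that addition does, assuming the same table sizes as above and a made-up batch of two sequences. It shows that broadcasting matches an explicit copy of the position vectors across the batch, and that the same token ends up with a different input vector at a different position.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, context_length, d_model = 512, 256, 384
token_embedding = nn.Embedding(vocab_size, d_model)
position_embedding = nn.Embedding(context_length, d_model)

# Two sequences holding the same tokens in different positions (illustrative data).
ids = torch.tensor([[256, 257, 9, 14],
                    [9, 14, 256, 257]])  # (B=2, T=4)
B, T = ids.shape

tok = token_embedding(ids)                 # (2, 4, 384)
pos = position_embedding(torch.arange(T))  # (4, 384)

x = tok + pos                                              # broadcast over B
x_explicit = tok + pos.unsqueeze(0).expand(B, T, d_model)  # the same sum, spelled out
print(torch.equal(x, x_explicit))  # True

# Token 256 sits at position 0 in sequence 0 and at position 2 in sequence 1.
print(torch.equal(tok[0, 0], tok[1, 2]))  # True: same token, same row of the table
print(torch.equal(x[0, 0], x[1, 2]))      # False: different position vectors were added
```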
2.4 The shape that runs through everything
Three letters will follow you for the rest of this book. Learn them now:
- B (batch size): how many independent sequences we process at once (e.g., 32).
- T (time / sequence length): how many tokens in each sequence (≤ context_length).
- C (channels): the vector width, our d_model (384). (PyTorch convention calls it C.)
The input to the model is an integer tensor of shape (B, T). The moment it hits the embeddings it becomes a float tensor of shape (B, T, C), and it stays (B, T, C) through every transformer block until the very end. If you ever get lost in later chapters, print .shape — almost every bug is a shape bug.
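A quick way to see that transition is to print the shape and dtype on both sides of the embedding. The sizes below are arbitrary illustrative values, and the dtypes shown assume PyTorch's default settings.

```python
import torch
import torch.nn as nn

B, T, C = 4, 10, 384  # illustrative sizes
vocab_size = 512

idx = torch.randint(0, vocab_size, (B, T))  # random integer token IDs
x = nn.Embedding(vocab_size, C)(idx)        # look up one vector per ID

print(idx.shape, idx.dtype)  # torch.Size([4, 10]) torch.int64
print(x.shape, x.dtype)      # torch.Size([4, 10, 384]) torch.float32
```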
2.5 The input pipeline: making “predict the next token” concrete
How does a model learn? We give it a sequence and ask it to predict the next token at every position. The training signal is free: it’s already in the text. We just slide a window across the token stream and pair each chunk with the chunk that starts one token later, so the target at every position is the token that comes next.
tokens: [ 256, 257, 9, 14, 88, 12, ... ]
context_length = 4
x (input): [256, 257, 9, 14]
y (target): [257, 9, 14, 88] <- x shifted left by one
Read it as four predictions packed together: given 256 predict 257; given 256,257 predict 9; given 256,257,9 predict 14; and so on. One window trains the model on T next-token predictions at once.
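The same unpacking in plain Python, using the example stream above with the trailing tokens dropped. The loop is only a way to read the pairs; during training all T predictions are scored together.

```python
tokens = [256, 257, 9, 14, 88, 12]
context_length = 4

x = tokens[0:context_length]      # [256, 257, 9, 14]
y = tokens[1:context_length + 1]  # [257, 9, 14, 88]

# Position t sees x[0..t] and must predict y[t].
pairs = [(x[: t + 1], y[t]) for t in range(context_length)]
for context, target in pairs:
    print(context, "->", target)
# [256] -> 257
# [256, 257] -> 9
# [256, 257, 9] -> 14
# [256, 257, 9, 14] -> 88
```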
A Dataset and DataLoader
import torch
from torch.utils.data import Dataset, DataLoader
class NextTokenDataset(Dataset):
    def __init__(self, token_ids: list[int], context_length: int):
        self.ids = token_ids
        self.context_length = context_length

    def __len__(self):
        # every start position that leaves room for a full window + its shifted target
        return len(self.ids) - self.context_length

    def __getitem__(self, idx):
        chunk = self.ids[idx : idx + self.context_length + 1] # one extra for the shift
        x = torch.tensor(chunk[:-1]) # (context_length,)
        y = torch.tensor(chunk[1:]) # (context_length,), shifted by one
        return x, y
# build it from a corpus encoded with our Ch1 tokenizer
# (uncomment the next line first: the code below needs token_ids to be defined)
# token_ids = encode_bpe(open("corpus.txt", encoding="utf-8").read(), merges)
context_length = 256
dataset = NextTokenDataset(token_ids, context_length)
loader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape) # torch.Size([32, 256]) torch.Size([32, 256]) -> this is (B, T)
shuffle=True means each batch is a random grab of windows from all over the corpus — the model shouldn’t learn the order the windows happen to sit in memory. drop_last=True discards a final ragged batch so every batch is exactly (B, T).
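Before pointing the dataset at a real corpus, it can help to run it on a stream small enough to check by eye. The sketch below repeats the class so it runs on its own and uses a made-up stream of 20 token IDs; toy_ids, toy, and toy_loader are names chosen for this example.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NextTokenDataset(Dataset):
    # Same logic as the class above, repeated so this snippet is self-contained.
    def __init__(self, token_ids, context_length):
        self.ids = token_ids
        self.context_length = context_length

    def __len__(self):
        return len(self.ids) - self.context_length

    def __getitem__(self, idx):
        chunk = self.ids[idx : idx + self.context_length + 1]
        return torch.tensor(chunk[:-1]), torch.tensor(chunk[1:])

toy_ids = list(range(20))  # made-up token stream: 0, 1, ..., 19
toy = NextTokenDataset(toy_ids, context_length=4)
print(len(toy))  # 16

x0, y0 = toy[0]
print(x0.tolist(), y0.tolist())  # [0, 1, 2, 3] [1, 2, 3, 4]

x_last, y_last = toy[len(toy) - 1]
print(x_last.tolist(), y_last.tolist())  # [15, 16, 17, 18] [16, 17, 18, 19]

# 16 windows in batches of 5: three full batches, and the leftover window is dropped.
toy_loader = DataLoader(toy, batch_size=5, shuffle=True, drop_last=True)
shapes = [tuple(xb.shape) for xb, yb in toy_loader]
print(shapes)  # [(5, 4), (5, 4), (5, 4)]
```

Note that neighbouring windows overlap almost entirely, which is one reason shuffling matters: consecutive indices would otherwise feed the model nearly identical examples back to back.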
2.6 Putting the front end together
Here’s the complete “front end” of our model — everything from token IDs to the (B, T, C) tensor the transformer blocks will consume. We’ll grow this class through the rest of the book.
import torch
import torch.nn as nn
class GPTFrontEnd(nn.Module):
    def __init__(self, vocab_size, context_length, d_model):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(context_length, d_model)

    def forward(self, idx): # idx: (B, T) integer token IDs
        B, T = idx.shape
        tok = self.token_embedding(idx) # (B, T, C)
        pos = self.position_embedding(torch.arange(T, device=idx.device)) # (T, C)
        x = tok + pos # (B, T, C) via broadcast
        return x
model = GPTFrontEnd(vocab_size=512, context_length=256, d_model=384)
xb, yb = next(iter(loader))
out = model(xb)
print(out.shape) # torch.Size([32, 256, 384]) == (B, T, C)
Run it. If you get (32, 256, 384), your tokens are now living in 384-dimensional space with a sense of position, batched and ready. The hard part — what the model does with these vectors to mix information between tokens — is Chapter 3.
2.7 The thing to actually understand
- Meaning is a location, learned not given. We don’t tell the model what “Shaolin” means. We hand it a random vector and let next-token prediction push that vector somewhere useful. Embeddings are the model’s first opinion about the world, and they start as noise.
- Position is a separate, added signal because attention is natively order-blind. Token identity and token position are two answers (“what” and “where”) summed into one vector.
- (B, T, C) is the river. Integers in (B, T), vectors out (B, T, C), and that shape is preserved by every block until the output head. Hold onto it.
2.8 Exercises
- Untrained geometry. Before any training, take your token_embedding, grab the vectors for two tokens, and compute their cosine similarity (torch.nn.functional.cosine_similarity). Why is it meaningless right now? Write a note predicting what you expect after training (we’ll check it in Ch7).
- Order-blindness, felt. Embed [256, 257] and [257, 256] without position embeddings and compare the vector each token gets in the two sequences. Then redo it with position embeddings. Show numerically that position is what breaks the symmetry. Finally, sum each sequence’s vectors and explain why a plain sum cannot tell the two orders apart, even with position embeddings.
- Shape gauntlet. For B=8, T=128, d_model=384: what is the shape after token_embedding? After adding position_embedding? How many parameters are in each embedding table? (Answer in numbers, then verify with sum(p.numel() for p in model.parameters()).)
- Window arithmetic. A corpus of 1,000,000 tokens with context_length=256: how many training windows does NextTokenDataset expose? Why len(ids) - context_length and not len(ids) // context_length?
- Break it. Set context_length=256 in the model but feed a batch with T=300. Predict the error before you run it, then run it. Which embedding blows up, and why does that tell you context_length is a hard ceiling baked into the architecture?
A 37th-Chamber original. Mechanism cited (Vaswani et al., “Attention Is All You Need,” 2017); all prose and code written fresh.