From Text to Numbers: Tokenization
A model does arithmetic. It has never seen a letter in its life. Before anything else can happen, language has to become number — and exactly how you do that conversion quietly decides what the model can and can’t learn.
1.1 The core problem
A neural network is a pile of multiplications and additions over numbers. It cannot multiply the word “Shaolin.” So step zero of building a language model is a translation layer that turns a string into a list of integers, and turns a list of integers back into a string:
"Shaolin" --encode--> [31, 47, 12, 5, 31, 9, 14] --model--> ... --decode--> "..."
That translation layer is the tokenizer. A token is the atomic unit the model sees — it might be a character, a word-piece, or a whole word. The set of all possible tokens is the vocabulary, and each token has a fixed integer ID.
Two operations, always paired and always inverse:
encode(text) -> [ids]decode([ids]) -> text
If decode(encode(x)) != x, your tokenizer is broken. That round-trip is the first test we’ll write.
1.2 The simplest tokenizer that works: character-level
Start with the most honest possible scheme: every distinct character is a token. Let’s build it on a real corpus. Grab any plain-text file — say a chunk of text in corpus.txt — and:
# char_tokenizer.py
with open("corpus.txt", "r", encoding="utf-8") as f:
text = f.read()
# the vocabulary is just the sorted set of characters that appear
chars = sorted(set(text))
vocab_size = len(chars)
print(f"{vocab_size} unique characters")
print("".join(chars))
# two lookup tables: string->int and int->string
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}
def encode(s: str) -> list[int]:
return [stoi[ch] for ch in s]
def decode(ids: list[int]) -> str:
return "".join(itos[i] for i in ids)
# the round-trip test — this MUST pass
sample = "Shaolin"
assert decode(encode(sample)) == sample, "round-trip failed"
print(encode(sample))
print(decode(encode(sample)))
That’s a complete, working tokenizer. Run it. You now have encode/decode and a vocab_size (probably 60–100 for English prose).
Why character-level is a real choice, not a toy
- Tiny vocabulary (tens, not tens-of-thousands). Cheap.
- No “unknown token” problem — any text made of seen characters encodes perfectly. New word? Still just characters.
Why we don’t stop here
Look at what the model must learn. To “understand” the word the, a character model must learn the sequence t → h → e from scratch, across every context, using up positions in its limited attention window. A 256-character context is only ~50 English words. The model spends enormous capacity just re-discovering that letters clump into words. We’re making it learn spelling before it can learn meaning.
We want a middle ground: units bigger than a character (so sequences are short and word-structure comes for free) but smaller than a word (so we never hit a word we can’t represent). That middle ground is subword tokenization, and the workhorse algorithm is byte-pair encoding.
1.3 Byte-pair encoding (BPE), from scratch
BPE (adapted to NLP by Sennrich, Haddow & Birch, 2016; used by GPT-2, Radford et al. 2019) has one idea, applied repeatedly:
Find the most frequent adjacent pair of tokens in the data, and merge it into a single new token. Repeat until you have the vocabulary size you want.
You start with the smallest possible alphabet and grow the vocabulary by gluing together pairs that occur a lot. Common chunks like th, the, and become single tokens; rare stuff stays in small pieces. Nothing is ever unrepresentable, because the foundation is always there underneath.
Start from bytes, not characters
A subtle, important choice GPT-2 makes: start not from Unicode characters but from raw bytes. There are exactly 256 possible byte values (0–255), and any text in any language — emoji, Chinese, control characters — is just a sequence of bytes in UTF-8. So a byte-level base vocabulary of 256 can represent literally anything with zero unknown tokens, ever. We build on that.
# step 1: text -> raw byte ids (base vocabulary is 0..255)
text = "the theme of the theatre"
ids = list(text.encode("utf-8")) # e.g. [116, 104, 101, 32, ...]
print(ids)
Counting pairs
def get_pair_counts(ids: list[int]) -> dict[tuple[int, int], int]:
"""Count how often each adjacent pair appears."""
counts = {}
for pair in zip(ids, ids[1:]): # (ids[0],ids[1]), (ids[1],ids[2]), ...
counts[pair] = counts.get(pair, 0) + 1
return counts
counts = get_pair_counts(ids)
top_pair = max(counts, key=counts.get)
print(top_pair, counts[top_pair]) # the most frequent adjacent pair
Merging a pair into a new token
def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
"""Replace every occurrence of `pair` with the single token `new_id`."""
out = []
i = 0
while i < len(ids):
if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
out.append(new_id)
i += 2 # skip both members of the pair
else:
out.append(ids[i])
i += 1
return out
Training the tokenizer
Training a tokenizer just means: decide which merges to make, in order. We do num_merges rounds; each round finds the top pair and assigns it the next free ID (starting at 256, after the byte range).
def train_bpe(text: str, vocab_size: int):
assert vocab_size >= 256
num_merges = vocab_size - 256
ids = list(text.encode("utf-8"))
merges: dict[tuple[int, int], int] = {} # the learned rules, in order
for i in range(num_merges):
counts = get_pair_counts(ids)
if not counts:
break
pair = max(counts, key=counts.get)
new_id = 256 + i
ids = merge(ids, pair, new_id)
merges[pair] = new_id
return merges
# train on a real corpus string
merges = train_bpe(open("corpus.txt", encoding="utf-8").read(), vocab_size=512)
print(f"learned {len(merges)} merges")
Encoding with the trained merges
To encode new text, apply the learned merges in the same order they were learned (earliest merges are the most general and must go first):
def encode_bpe(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
ids = list(text.encode("utf-8"))
# apply merges in learned order
for pair, new_id in merges.items():
ids = merge(ids, pair, new_id)
return ids
Decoding — rebuild the byte strings
To decode, we need to know what bytes each token ID expands to. Build a vocab table: IDs 0–255 are single bytes; each merged ID is the concatenation of its two parents.
def build_vocab(merges: dict[tuple[int, int], int]) -> dict[int, bytes]:
vocab = {i: bytes([i]) for i in range(256)} # base: each byte
for (a, b), new_id in merges.items(): # in insertion order
vocab[new_id] = vocab[a] + vocab[b]
return vocab
def decode_bpe(ids: list[int], vocab: dict[int, bytes]) -> str:
raw = b"".join(vocab[i] for i in ids)
return raw.decode("utf-8", errors="replace") # 'replace' guards mid-char splits
vocab = build_vocab(merges)
sample = "the theatre theme"
assert decode_bpe(encode_bpe(sample, merges), vocab) == sample
print(encode_bpe(sample, merges))
You just built — from raw bytes — the same family of tokenizer that GPT-2 and GPT-3 use. It compresses text (fewer tokens than characters), never chokes on unknown input, and round-trips perfectly. This is not a simplified stand-in. It is the real algorithm.
1.4 Special tokens
Real models need a few tokens that aren’t text — control signals. The most important:
<|endoftext|>— marks a boundary between documents. During training it tells the model “a new, unrelated stream starts here,” so it doesn’t try to predict the start of one document from the end of another. At generation time it’s the model’s way of saying “I’m done.”
Special tokens get IDs above the learned vocabulary and are inserted by hand, never produced by BPE merges:
END_OF_TEXT = len(vocab) # next free ID after all merges
vocab[END_OF_TEXT] = b"<|endoftext|>" # stored as a marker; never byte-decoded literally
We’ll wire these in properly in Chapter 7 when we assemble training data from many documents.
1.5 The thing to actually understand
- The model’s “atoms” are tokens, not words or letters. When people are surprised a model can’t spell or count letters in a word, this is why — it often never sees the letters; it sees a chunk like
theatreas one indivisible symbol. The opaque box explains the magic and the limits in the same breath. - Vocabulary size is a tradeoff. Bigger vocab → shorter sequences (good, more text per context window) but a larger embedding table and output layer (costly). GPT-2 chose ~50,257. We’ll use a small vocab so our model trains fast.
1.6 Exercises
- Round-trip stress test. Run
decode_bpe(encode_bpe(s, merges), vocab) == son text containing emoji and a non-English script. Does it hold? Why does starting from bytes (not characters) make this work? - Compression ratio. For a held-out paragraph, compute
len(encode_bpe(text)) / len(text.encode("utf-8"))atvocab_size= 300, 500, 1000. Plot tokens-vs-vocab. Where do the gains flatten? - Watch the merges. Print the first 20 merged tokens (decode each
new_idto bytes). Do they match your intuition about common English chunks? - Break it. Remove the
errors="replace"fromdecode_bpe, then decode a truncated token list that cuts a multi-byte character in half. What happens, and what does that teach you about why generation works at the token level but display happens at the character level? - Efficiency (stretch). Our
encode_bpere-scans the whole list once per merge — O(vocab × length). Sketch a faster encoder that, for a given text, repeatedly applies only the earliest-learned applicable merge. Why does merge order matter for correctness?
A 37th-Chamber original. Methods cited (Sennrich, Haddow & Birch 2016; Radford et al. 2019); all prose and code written fresh.