The Opaque Box · Chapter 1

From Text to Numbers: Tokenization

A model does arithmetic. It has never seen a letter in its life. Before anything else can happen, language has to become number — and exactly how you do that conversion quietly decides what the model can and can’t learn.

1.1  The core problem

A neural network is a pile of multiplications and additions over numbers. It cannot multiply the word “Shaolin.” So step zero of building a language model is a translation layer that turns a string into a list of integers, and turns a list of integers back into a string:

"Shaolin"  --encode-->  [31, 47, 12, 5, 31, 9, 14]  --model-->  ...  --decode-->  "..."

That translation layer is the tokenizer. A token is the atomic unit the model sees — it might be a character, a word-piece, or a whole word. The set of all possible tokens is the vocabulary, and each token has a fixed integer ID.

Two operations, always paired and always inverse:

If decode(encode(x)) != x, your tokenizer is broken. That round-trip is the first test we’ll write.


1.2  The simplest tokenizer that works: character-level

Start with the most honest possible scheme: every distinct character is a token. Let’s build it on a real corpus. Grab any plain-text file — say a chunk of text in corpus.txt — and:

# char_tokenizer.py
with open("corpus.txt", "r", encoding="utf-8") as f:
    text = f.read()

# the vocabulary is just the sorted set of characters that appear
chars = sorted(set(text))
vocab_size = len(chars)
print(f"{vocab_size} unique characters")
print("".join(chars))

# two lookup tables: string->int and int->string
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s: str) -> list[int]:
    return [stoi[ch] for ch in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

# the round-trip test — this MUST pass
sample = "Shaolin"
assert decode(encode(sample)) == sample, "round-trip failed"
print(encode(sample))
print(decode(encode(sample)))

That’s a complete, working tokenizer. Run it. You now have encode/decode and a vocab_size (probably 60–100 for English prose).

Why character-level is a real choice, not a toy

Why we don’t stop here

Look at what the model must learn. To “understand” the word the, a character model must learn the sequence t → h → e from scratch, across every context, using up positions in its limited attention window. A 256-character context is only ~50 English words. The model spends enormous capacity just re-discovering that letters clump into words. We’re making it learn spelling before it can learn meaning.

We want a middle ground: units bigger than a character (so sequences are short and word-structure comes for free) but smaller than a word (so we never hit a word we can’t represent). That middle ground is subword tokenization, and the workhorse algorithm is byte-pair encoding.


1.3  Byte-pair encoding (BPE), from scratch

BPE (adapted to NLP by Sennrich, Haddow & Birch, 2016; used by GPT-2, Radford et al. 2019) has one idea, applied repeatedly:

Find the most frequent adjacent pair of tokens in the data, and merge it into a single new token. Repeat until you have the vocabulary size you want.

You start with the smallest possible alphabet and grow the vocabulary by gluing together pairs that occur a lot. Common chunks like th, the,  and become single tokens; rare stuff stays in small pieces. Nothing is ever unrepresentable, because the foundation is always there underneath.

Start from bytes, not characters

A subtle, important choice GPT-2 makes: start not from Unicode characters but from raw bytes. There are exactly 256 possible byte values (0–255), and any text in any language — emoji, Chinese, control characters — is just a sequence of bytes in UTF-8. So a byte-level base vocabulary of 256 can represent literally anything with zero unknown tokens, ever. We build on that.

# step 1: text -> raw byte ids (base vocabulary is 0..255)
text = "the theme of the theatre"
ids = list(text.encode("utf-8"))   # e.g. [116, 104, 101, 32, ...]
print(ids)

Counting pairs

def get_pair_counts(ids: list[int]) -> dict[tuple[int, int], int]:
    """Count how often each adjacent pair appears."""
    counts = {}
    for pair in zip(ids, ids[1:]):       # (ids[0],ids[1]), (ids[1],ids[2]), ...
        counts[pair] = counts.get(pair, 0) + 1
    return counts

counts = get_pair_counts(ids)
top_pair = max(counts, key=counts.get)
print(top_pair, counts[top_pair])        # the most frequent adjacent pair

Merging a pair into a new token

def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            out.append(new_id)
            i += 2          # skip both members of the pair
        else:
            out.append(ids[i])
            i += 1
    return out

Training the tokenizer

Training a tokenizer just means: decide which merges to make, in order. We do num_merges rounds; each round finds the top pair and assigns it the next free ID (starting at 256, after the byte range).

def train_bpe(text: str, vocab_size: int):
    assert vocab_size >= 256
    num_merges = vocab_size - 256
    ids = list(text.encode("utf-8"))

    merges: dict[tuple[int, int], int] = {}   # the learned rules, in order
    for i in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)
        new_id = 256 + i
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

# train on a real corpus string
merges = train_bpe(open("corpus.txt", encoding="utf-8").read(), vocab_size=512)
print(f"learned {len(merges)} merges")

Encoding with the trained merges

To encode new text, apply the learned merges in the same order they were learned (earliest merges are the most general and must go first):

def encode_bpe(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    ids = list(text.encode("utf-8"))
    # apply merges in learned order
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

Decoding — rebuild the byte strings

To decode, we need to know what bytes each token ID expands to. Build a vocab table: IDs 0–255 are single bytes; each merged ID is the concatenation of its two parents.

def build_vocab(merges: dict[tuple[int, int], int]) -> dict[int, bytes]:
    vocab = {i: bytes([i]) for i in range(256)}     # base: each byte
    for (a, b), new_id in merges.items():           # in insertion order
        vocab[new_id] = vocab[a] + vocab[b]
    return vocab

def decode_bpe(ids: list[int], vocab: dict[int, bytes]) -> str:
    raw = b"".join(vocab[i] for i in ids)
    return raw.decode("utf-8", errors="replace")    # 'replace' guards mid-char splits

vocab = build_vocab(merges)
sample = "the theatre theme"
assert decode_bpe(encode_bpe(sample, merges), vocab) == sample
print(encode_bpe(sample, merges))

You just built — from raw bytes — the same family of tokenizer that GPT-2 and GPT-3 use. It compresses text (fewer tokens than characters), never chokes on unknown input, and round-trips perfectly. This is not a simplified stand-in. It is the real algorithm.


1.4  Special tokens

Real models need a few tokens that aren’t text — control signals. The most important:

Special tokens get IDs above the learned vocabulary and are inserted by hand, never produced by BPE merges:

END_OF_TEXT = len(vocab)          # next free ID after all merges
vocab[END_OF_TEXT] = b"<|endoftext|>"   # stored as a marker; never byte-decoded literally

We’ll wire these in properly in Chapter 7 when we assemble training data from many documents.


1.5  The thing to actually understand

  1. The model’s “atoms” are tokens, not words or letters. When people are surprised a model can’t spell or count letters in a word, this is why — it often never sees the letters; it sees a chunk like  theatre as one indivisible symbol. The opaque box explains the magic and the limits in the same breath.
  2. Vocabulary size is a tradeoff. Bigger vocab → shorter sequences (good, more text per context window) but a larger embedding table and output layer (costly). GPT-2 chose ~50,257. We’ll use a small vocab so our model trains fast.

1.6  Exercises

  1. Round-trip stress test. Run decode_bpe(encode_bpe(s, merges), vocab) == s on text containing emoji and a non-English script. Does it hold? Why does starting from bytes (not characters) make this work?
  2. Compression ratio. For a held-out paragraph, compute len(encode_bpe(text)) / len(text.encode("utf-8")) at vocab_size = 300, 500, 1000. Plot tokens-vs-vocab. Where do the gains flatten?
  3. Watch the merges. Print the first 20 merged tokens (decode each new_id to bytes). Do they match your intuition about common English chunks?
  4. Break it. Remove the errors="replace" from decode_bpe, then decode a truncated token list that cuts a multi-byte character in half. What happens, and what does that teach you about why generation works at the token level but display happens at the character level?
  5. Efficiency (stretch). Our encode_bpe re-scans the whole list once per merge — O(vocab × length). Sketch a faster encoder that, for a given text, repeatedly applies only the earliest-learned applicable merge. Why does merge order matter for correctness?
What’s next
Ch 2 — Words as Directions: embeddings & the input pipeline
Read Ch 2 →

A 37th-Chamber original. Methods cited (Sennrich, Haddow & Birch 2016; Radford et al. 2019); all prose and code written fresh.