What Is the Box?
You use it every day. It answers questions, writes paragraphs, explains things patiently. And yet the people who built it cannot fully tell you how it does any of that. That’s not a confession of incompetence. It’s just the honest state of the science.
The thing nobody can see into
When you ask an AI a question and it answers well, something is happening inside it — some chain of calculation that turns your words into its words. That chain is real. It runs on a computer. Its results can be remarkable. But if you ask one of its makers to point at a specific part of the system and say “this is where it decided to say that,” they can’t. Not fully. Not yet.
This is not a secret being kept from you. It is a genuine open problem in science, one important enough that Anthropic — the company that builds Claude — runs a dedicated interpretability research program whose entire job is to try to see inside. The field is called mechanistic interpretability, and it exists precisely because the box is opaque. Researchers are not polishing something well understood. They are trying to understand something that already works without anyone having fully designed it.
That’s the strange situation we are in. And this book is about that situation, from the ground up.
Program vs. model — the one distinction that matters
Before we go further, we need one load-bearing idea. There are two very different things that can run on a computer, and people use the word “AI” for both of them as if they were the same. They are not.
The first is a program. A program is a set of rules a human wrote down. Every branch of its behavior came from a decision a person made at a keyboard. If you open the file, the logic is there to read:
temperature = 104  # an example reading
# A person chose this threshold and both outcomes.
if temperature > 100:
    print("too hot")
else:
    print("keep running")
You can trace every outcome. You can read every instruction. The program does exactly what the author said, no more and no less. There is no mystery in it, only the question of whether the author thought of everything.
The second is a model. A model is not rules a human wrote. It is a very large pile of numbers, billions of them, that nobody sat down and chose individually. Those numbers were tuned by showing the system an enormous number of examples until its behavior became useful. Nobody wrote the instructions for how it would answer your question. The answer emerged from the numbers after training.
Written versus grown. That is the distinction. A program is written. A model is grown.
Modern AI — the kind that holds a conversation, writes code, explains a concept — is a model. The opacity follows directly from that fact. When nobody wrote the rules, there are no rules to read.
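One way to make the written-versus-grown contrast concrete is a sketch like the one below. It is an illustration under stated assumptions, not how large models are trained: the example readings are made up, the names `written_rule`, `grown_rule`, and `grow_model` are hypothetical, and the learning rule is a classic perceptron update rather than anything introduced so far. The first function contains a threshold a person typed. The second never sees the number 100; it ends up with two numbers that reproduce the same behavior on these examples.

```python
# Made-up readings for illustration: (temperature, was it too hot?)
EXAMPLES = [(40, False), (70, False), (90, False),
            (110, True), (130, True), (160, True)]


def written_rule(temperature):
    """Written: a person chose the threshold 100 and typed it in."""
    return temperature > 100


def grown_rule(temperature, w, b):
    """Grown: the behavior lives in two numbers, w and b."""
    # Dividing by 100 only keeps the inputs near 1 so learning is quick.
    return w * (temperature / 100.0) + b > 0


def grow_model(examples, max_passes=10_000, step=0.1):
    """Start from zeros and nudge w and b after every wrong answer."""
    w, b = 0.0, 0.0
    for _ in range(max_passes):
        mistakes = 0
        for temperature, too_hot in examples:
            if grown_rule(temperature, w, b) != too_hot:
                direction = 1.0 if too_hot else -1.0
                w += step * direction * (temperature / 100.0)
                b += step * direction
                mistakes += 1
        if mistakes == 0:  # every example is now answered correctly
            break
    return w, b


if __name__ == "__main__":
    w, b = grow_model(EXAMPLES)
    print("learned numbers:", w, b)
    for reading in (30, 95, 105, 200):
        print(reading, written_rule(reading), grown_rule(reading, w, b))
```

The two learned numbers do not announce "the threshold is 100" anywhere. Between 90 and 110 the grown rule saw no examples, so its answers there may differ from the written one.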
What “training” means in one breath
So where do the numbers come from? Here is the whole secret, stated plainly:
Show the model a lot of text. Ask it to guess the next word. When it guesses wrong, nudge the numbers slightly in the direction that would have made a better guess. Repeat this an enormous number of times.
That’s it. Everything else (every refinement, every variation, every capability that surprises you) is detail on top of that one loop. The model starts as random noise and ends as something that has, in some sense, absorbed patterns from everything it read. GPT-2, a 2019 model that OpenAI released in stages over that year and eventually published in full, had about 1.5 billion numbers tuned this way in its largest version (Radford et al., 2019). The models you use today are much larger, but the loop is the same loop.
Notice that at no point in that description does anyone explain anything to the model. Nobody says “a sentence has a subject and a verb” or “France is in Europe.” The model infers all of that, if it does, from seeing patterns recur across billions of examples. What it learns is real. How it stores what it learned, and exactly how it retrieves it — that is the opaque part.
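The loop described above can be sketched in a few lines. This is a deliberately tiny illustration, with assumptions that are not from the text: it guesses the next character rather than the next word, its numbers form a simple lookup table rather than a neural network, and the names `TEXT`, `softmax`, `train`, and `predict_next` are hypothetical. The shape of the loop (guess, measure the error, nudge, repeat) is the part that carries over.

```python
import math
import random

TEXT = "the cat sat on the mat. the cat sat on the mat."


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(text, rounds=200, step=0.5, seed=0):
    """Grow a table of numbers that guesses the next character."""
    rng = random.Random(seed)
    chars = sorted(set(text))
    index = {ch: i for i, ch in enumerate(chars)}
    # The model: one row of scores per character, starting as random noise.
    numbers = [[rng.uniform(-0.1, 0.1) for _ in chars] for _ in chars]
    history = []  # average loss per round, to watch the guesses improve
    for _ in range(rounds):
        total_loss = 0.0
        for current, following in zip(text, text[1:]):
            row = numbers[index[current]]
            guess = softmax(row)
            target = index[following]
            total_loss += -math.log(guess[target])
            # Nudge: raise the score of the right answer, lower the rest.
            for j in range(len(row)):
                wanted = 1.0 if j == target else 0.0
                row[j] -= step * (guess[j] - wanted)
        history.append(total_loss / (len(text) - 1))
    return chars, index, numbers, history


def predict_next(ch, chars, index, numbers):
    """Return the character the table currently thinks is most likely."""
    probs = softmax(numbers[index[ch]])
    return chars[probs.index(max(probs))]


if __name__ == "__main__":
    chars, index, numbers, history = train(TEXT)
    print("loss in first round:", history[0])
    print("loss in last round: ", history[-1])
    print("after 'h' it guesses:", predict_next("h", chars, index, numbers))
```

Nothing in the code states that "h" is followed by "e". The table ends up guessing "e" after "h" only because that is what it saw, every time, in the text it was shown.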
Why “opaque”
Imagine you could read every single number in a trained model — all 1.5 billion, or all 70 billion, laid out in front of you. You still could not read off what the model “knows” or “believes” the way you could read a program. The numbers do not say “Paris is the capital of France.” They say things like −0.3812 and 1.0047. Their meaning only becomes apparent when they work together, in combinations that interact in ways nobody designed.
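A small sketch can make this concrete. The network below is hypothetical and its numbers are made up for illustration; the names `run`, `shuffle_hidden`, `model_a`, and `model_b` are not from any library. Reordering its hidden units changes which number sits at which position, yet the output stays the same, which suggests why reading any single number in isolation tells you little.

```python
import math


def run(model, inputs):
    """A tiny network: inputs -> 3 hidden units (tanh) -> one output."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(model["hidden_weights"], model["hidden_biases"])
    ]
    # fsum adds the terms without order-dependent rounding error.
    return math.fsum(w * h for w, h in zip(model["output_weights"], hidden))


def shuffle_hidden(model, order):
    """Reorder the hidden units; each unit keeps its own weights."""
    return {
        "hidden_weights": [model["hidden_weights"][i] for i in order],
        "hidden_biases": [model["hidden_biases"][i] for i in order],
        "output_weights": [model["output_weights"][i] for i in order],
    }


# Made-up numbers, chosen only for illustration.
model_a = {
    "hidden_weights": [[-0.3812, 1.0047], [0.7310, -0.2205], [0.0519, 0.6641]],
    "hidden_biases": [0.10, -0.45, 0.27],
    "output_weights": [1.20, -0.85, 0.33],
}
model_b = shuffle_hidden(model_a, [2, 0, 1])

if __name__ == "__main__":
    point = [0.5, -1.0]
    print("first row of model_a:", model_a["hidden_weights"][0])
    print("first row of model_b:", model_b["hidden_weights"][0])
    print("outputs:", run(model_a, point), run(model_b, point))
```

The two models hold different numbers in the same positions and behave identically. What matters is how the numbers combine, not what any one of them is.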
This is not a flaw to be embarrassed about. It is what the training process produces. The system that comes out the other end genuinely works — often beautifully. But “works” and “is understood” are not the same thing, and conflating them leads to confusion about what AI can and can’t do, what it is and isn’t.
There is nothing catastrophic in this admission. Lots of things work before they are fully understood. What matters is knowing the difference between the behavior (observable, testable, improvable) and the mechanism (the object of active research). We will be honest about both throughout this book.
Why we build one
Here is our answer to the opacity: we are going to build a small language model ourselves, from scratch, step by step, in plain code that you will be able to read in full.
Not a toy that pretends. A real one: the same architecture, the same training loop, the same components that sit inside the large models you use, just smaller. Small enough that it trains on a laptop. Small enough that every piece of it fits on a single screen.
We are doing this because the only honest path to understanding something you cannot see through is to build one you can. When you have assembled each piece with your own hands and watched it fail and then not fail, the large systems stop being magic and start being the same thing, much bigger. That is a very different relationship to have with technology you use every day.
There are no prerequisites here. We will explain every concept before we use it and every line of code before we run it. This book is free, and it will stay free. Knowledge free, forever.
One problem left to solve before we can begin
We said the model reads text and learns from it. But a model cannot actually read. It is arithmetic — it works on numbers. So the very first thing we need, before any training can happen, is a way to convert text into numbers and back again without losing anything.
That sounds small. It is not. The way you make that conversion quietly shapes what the model can and cannot learn. Getting it right turns out to be its own interesting problem, with a genuinely clever solution.
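As a baseline, the round trip itself is easy to sketch. The version below is an assumption for illustration, not necessarily the scheme Chapter 1 arrives at: it maps text to its UTF-8 bytes, so every number is between 0 and 255, and the function names `encode` and `decode` are hypothetical.

```python
def encode(text):
    """Text -> list of integers, one per UTF-8 byte (each 0 to 255)."""
    return list(text.encode("utf-8"))


def decode(numbers):
    """List of integers -> the original text."""
    return bytes(numbers).decode("utf-8")


if __name__ == "__main__":
    sentence = "Paris is the capital of France."
    numbers = encode(sentence)
    print(numbers)
    print(decode(numbers) == sentence)
    # "\u00e9" is the accented letter e; compare its length with plain "e".
    print(len(encode("e")), len(encode("\u00e9")))
```

Nothing is lost on the way back, but the accented letter takes two numbers while a plain "e" takes one. Choices like that are part of what the conversion decides on the model's behalf.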
Chapter 1 is how text becomes numbers.
Anthropic interpretability research program: anthropic.com/research/team/interpretability
GPT-2 parameter count: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). “Language Models are Unsupervised Multitask Learners.”
All other prose original.