The Opaque Box · Part II · Chapter 11

Thinking Out Loud

Part I closed on a confession: everything we built — trained, made to speak, verified against a giant, bent into an assistant — answers on reflex. One forward pass per token, the same fixed depth for every question, no pause, no second look. Part II begins with the cheapest, strangest discovery in this whole story: if you let the machine write out its work before answering, the writing itself becomes a place to think. Hand it scratch paper, and reflex starts producing things reflex alone never could — no new weights, no new module, no ghost in the loop. Just a page.

11.1  Part II: renaming the game

Part I built a machine that speaks. Part II is about machines that work things out — what the field, with more confidence than the word strictly deserves, calls reasoning models. Before we build anything, we owe you a working definition, because “reasoning” is a word carrying two thousand years of philosophical freight, and the field has quietly borrowed it for something far narrower and far more testable. Take the loan; do not mistake it for the whole fortune.

Sebastian Raschka — whose books are the closest published kin to this one — puts the operational version plainly in “Understanding Reasoning LLMs” (2025): reasoning, in current LLM practice, is “the process of answering questions that require complex, multi-step generation with intermediate steps.” Notice what that definition is made of: generation and intermediate steps — tokens, produced before the final answer, that carry the work. Add the training and inference methods that make those intermediate tokens reliable (the subjects of Chapters 12–14) and you have the whole territory of Part II.

And the philosophical question — is that really reasoning, in the sense you do it? The house takes no oath either way. It is a live debate, and anyone who settles it for you in a single confident sentence — in either direction — is selling you something the evidence has not yet bought. What we can do — what this book has always done — is build the mechanism, measure what it changes, and let you hold the word “reasoning” as loosely or as firmly as the evidence deserves. The load-bearing picture for this chapter is deliberately humbler than the word: scratch paper. Not a soul, not an inner life — a place to set down intermediate work so the next step can see it.

One shelf note before we begin, because credit belongs where it is due. Part I of this book walks territory covered, deeper and in full PyTorch, by Raschka’s Build a Large Language Model (From Scratch) (Manning, 2024; free code at rasbt/LLMs-from-scratch). Part II’s territory is the subject of his Build a Reasoning Model (From Scratch) (Manning, 2026; free code at rasbt/reasoning-from-scratch). If this book makes you want the rigorous, runnable, chapter-length version, support the author — those are the books.


11.2  The wall: fixed depth per token

Why does a pure next-token machine struggle with multi-step problems at all? There is no mystery here, and no hand-waving allowed: over the last ten chapters you built every part of the answer, and you can now state it exactly.

Consider the prompt 17 × 24 = and a model forced to emit the answer immediately. Whatever computation produces the first digit of that answer must happen inside one forward pass: one trip through the embedding table, the fixed stack of blocks, the final norm, the head. That is the entire budget. There is no loop inside the architecture, no “go around again,” no register where a partial product could wait. You built every layer of this machine; you know there is nowhere for 17 × 4 = 68 to sit while 17 × 20 = 340 gets computed. A person who multiplies 17 by 24 in their head is doing staged work in working memory. The one-pass model must do the whole thing as a single act of pattern completion — reflex — and for problems past a certain depth, reflex runs out of room.

Now notice the one loophole the architecture does allow. The model cannot loop inside a forward pass — but the sampler you wrote in Chapter 8 loops around forward passes: generate a token, append it, run again. And everything in the context window — including every token the model itself just wrote — feeds the next pass through attention. Which means the model’s own output is the only working memory it can extend. If intermediate results are ever going to exist anywhere, they have to exist as text.

That loophole sat in plain sight for years, hiding in the one architectural fact everyone already knew. What follows is what happened the day the field finally took it seriously.


11.3  The discovery

In January 2022, Wei et al. (arXiv:2201.11903, published at NeurIPS 2022) named the technique chain-of-thought prompting. The recipe is almost insultingly simple: in a few-shot prompt, instead of showing the model question → answer examples, show it question → worked steps → answer examples. The model, continuing the pattern as it always does, produces worked steps of its own before its answer — and on multi-step problems, the answers get dramatically better. Their headline result used GSM8K, the benchmark of 8,500 grade-school math word problems introduced by Cobbe et al. (2021, arXiv:2110.14168): prompting a 540B-parameter model (PaLM) with just eight chain-of-thought exemplars achieved state-of-the-art accuracy on GSM8K — surpassing even a GPT-3 that had been fine-tuned on the task and equipped with a trained verifier.

Two findings in that paper matter beyond the headline. First, no training was involved. The weights did not move. Eight worked examples in the prompt — a change to the input text — unlocked behavior the same frozen model could not produce when asked directly. Second, the effect was emergent with scale: in the paper’s words, these reasoning abilities “emerge naturally in sufficiently large language models.” Small models prompted with chains produced fluent-looking steps that led nowhere; past a scale threshold, the chains started to work. Keep that asymmetry in mind — it returns in Section 11.5 with our own model in the dock.

Four months later the recipe got simpler still. Kojima et al. (2022, arXiv:2205.11916, also NeurIPS 2022) showed you could drop the eight exemplars and append a single phrase to the question: “Let’s think step by step.” No examples at all — zero-shot. With InstructGPT (text-davinci-002 — a model bent exactly as Chapter 10 described), that one phrase took GSM8K accuracy from 10.4% to 40.7%, and MultiArith from 17.7% to 78.7%. Read those numbers again, slowly. Nothing about the model changed — not one weight moved. Five words of English, appended to the prompt, roughly quadrupled its measured arithmetic-reasoning score. It is one of the strangest results in the modern history of the field, and “it just works” is not an answer — it demands an explanation that is better than magic. The next section refuses to leave you with anything less.
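The two recipes differ only in the text handed to the model, which a few lines of Python make concrete. This is a minimal sketch: the function names, the "Q:/A:" layout, and the single hand-written exemplar are assumptions made for illustration, not the exact prompts from either paper. Only the trigger phrase is taken from Kojima et al. The last lines check the "roughly quadrupled" arithmetic against the accuracies quoted above.

```python
# Illustrative prompt builders. The layout and the exemplar are assumptions.
TRIGGER = "Let's think step by step."   # the phrase from Kojima et al. (2022)


def few_shot_prompt(exemplars, question, with_steps):
    """Build a few-shot prompt from (question, steps, answer) triples.

    with_steps=False shows question -> answer examples;
    with_steps=True shows question -> worked steps -> answer examples.
    """
    parts = []
    for q, steps, answer in exemplars:
        if with_steps:
            body = f"{steps} The answer is {answer}."
        else:
            body = f"The answer is {answer}."
        parts.append(f"Q: {q}\nA: {body}")
    parts.append(f"Q: {question}\nA:")   # the model continues from here
    return "\n\n".join(parts)


def zero_shot_cot_prompt(question):
    """No exemplars at all: the question plus the trigger phrase."""
    return f"Q: {question}\nA: {TRIGGER}"


# One hand-written exemplar, reusing this chapter's 17 x 24 arithmetic.
exemplars = [
    ("What is 17 x 24?", "17 x 20 = 340. 17 x 4 = 68. 340 + 68 = 408.", "408"),
]

direct = few_shot_prompt(exemplars, "What is 13 x 12?", with_steps=False)
chain = few_shot_prompt(exemplars, "What is 13 x 12?", with_steps=True)
zero_shot = zero_shot_cot_prompt("What is 13 x 12?")

# Ratios of the accuracies quoted above (after / before the trigger phrase).
gsm8k_gain = 40.7 / 10.4
multiarith_gain = 78.7 / 17.7
print(round(gsm8k_gain, 1), round(multiarith_gain, 1))   # 3.9 4.4
```

Printing `direct` and `chain` side by side shows the only difference between the two few-shot conditions: the worked steps inside each example.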

[Figure: The leap versus the chain: two routes from question to answer.]
Two routes from question to answer: the leap tries to do all the work in a single forward pass and, past a certain problem depth, misses; the chain takes many small hops — each written token one more full pass — and lands. The scratch paper is the context window itself.

11.4  Why it works — honest mechanics, no mysticism

Strip the wonder off and look at what a chain of thought physically is, in the machine you built with your own hands. Two things change when the model writes its work out. Both are mechanical, both are boring in the best way, and neither one is a ghost.

First: generation buys compute. Every token the model emits is one more complete forward pass — another full trip through every block, every head, every feed-forward layer. A model that answers 408 directly gets a handful of passes. A model that first writes out 17 × 20 = 340, 17 × 4 = 68, 340 + 68 = 408 gets a forward pass per token of that work — dozens of extra trips through the machine, spent on the same question. The fixed-depth-per-token wall from Section 11.2 has not moved; the model has simply arranged to hit it many more times. You can watch this happen in the sampler you already own:

import torch
import torch.nn.functional as F

@torch.no_grad()  # sampling only: skip building the autograd graph on every pass
def generate_and_count(model, idx, max_new_tokens, context_length=256):
    """
    Chapter 8's sampling loop, instrumented.
    Returns (sequence, forward_passes) — one full pass per new token.
    """
    forward_passes = 0
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_length:]              # (B, T)  crop to context
        logits = model(idx_cond)                         # (B, T, vocab_size)
        forward_passes += 1                              # the whole network ran, once
        probs = F.softmax(logits[:, -1, :], dim=-1)      # (B, vocab_size)
        next_token = torch.multinomial(probs, num_samples=1)   # (B, 1)
        idx = torch.cat([idx, next_token], dim=1)        # (B, T+1)
    return idx, forward_passes


# ── sanity check ─────────────────────────────────────────────────────────────
# a 3-token direct answer buys 3 passes through the network.
# a 200-token chain of thought buys 200 passes — same frozen weights,
# roughly 67x more computation spent on the same question.

[Figure: Generation buys compute: 3 passes versus 200 passes on the same question. Horizontal bars show forward passes, one per generated token.]
Each generated token is one more full forward pass. A 3-token direct answer buys three trips through the machine; a 200-token chain buys two hundred — about 67× the computation, spent on the same question with the same frozen weights. The wall of Section 11.2 has not moved; the model has arranged to hit it far more often.
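The 67× figure counts passes. An uncached loop like the one above also re-runs the model over the whole cropped context on every pass, so its raw arithmetic grows faster than the pass count; a KV cache removes that redundancy without changing the number of passes. The back-of-envelope sketch below makes both counts visible in plain Python. The 10-token prompt length and the function name are assumptions chosen for illustration, and "token positions processed" is only a crude proxy for compute.

```python
def passes_and_positions(prompt_len, max_new_tokens, context_length=256):
    """Count forward passes and token positions processed by an uncached
    sampling loop that re-runs the model on the cropped context each step."""
    passes = 0
    positions = 0
    length = prompt_len
    for _ in range(max_new_tokens):
        positions += min(length, context_length)   # mirrors idx[:, -context_length:]
        passes += 1                                # the whole network ran, once
        length += 1                                # one new token appended
    return passes, positions


# Assumed 10-token prompt; 3-token direct answer versus 200-token chain.
direct_passes, direct_positions = passes_and_positions(10, 3)
chain_passes, chain_positions = passes_and_positions(10, 200)

print(direct_passes, chain_passes)              # 3 200
print(round(chain_passes / direct_passes))      # 67
print(direct_positions, chain_positions)        # 33 21900
```

Either way the direction is the same: the chain spends far more computation on the same question, with the same frozen weights.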

Second: the chain externalizes intermediate state. Compute alone is not the whole story — the extra passes would be useless if each one forgot the last. The chain is what connects them. Once 17 × 20 = 340 exists as tokens in the context window, every subsequent forward pass can attend to it — the attention heads you built in Chapters 3 and 4 read it exactly the way they read the original question. The partial product that had nowhere to sit in Section 11.2 now has a home: the page. The model is not holding its work in some hidden inner register; it is doing what you do on an exam — writing the intermediate result down and then looking at it. Scratch paper, in the most literal sense the architecture allows.

[Figure: The context window as scratch paper the model can reread.]
The chain writes intermediate state onto the one surface the model can extend — its own output in the context window. Once 17 × 20 = 340 exists as tokens, every later forward pass can attend back to it, exactly the way it reads the original question. The partial products that had nowhere to sit in Section 11.2 now live on the page.
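The scratch-paper picture can be mimicked in a few lines of plain Python. This is a toy analogy, not a language model: the `step` function is a made-up stand-in for one forward pass. It does a small, fixed amount of work per call and keeps no memory of its own, so anything it needs from earlier steps has to be read back off the page.

```python
def step(page):
    """One stateless 'pass': read the whole page, return one new line.

    Hypothetical stand-in for a forward pass. Nothing survives between
    calls except the text already written on the page.
    """
    a, b = (int(x) for x in page[0].rstrip(" =").split(" x "))
    if len(page) == 1:                        # first partial product (tens)
        tens = b - b % 10
        return f"{a} x {tens} = {a * tens}"
    if len(page) == 2:                        # second partial product (ones)
        ones = b % 10
        return f"{a} x {ones} = {a * ones}"
    # Final step: reread both partial products from the page and add them.
    first = int(page[1].split(" = ")[1])
    second = int(page[2].split(" = ")[1])
    return f"{first} + {second} = {first + second}"


page = ["17 x 24 ="]        # the context window starts as just the question
for _ in range(3):          # the sampler loop: run a pass, append, repeat
    page.append(step(page))

print("\n".join(page))
# 17 x 24 =
# 17 x 20 = 340
# 17 x 4 = 68
# 340 + 68 = 408
```

Delete a line from `page` between calls and the final step has nothing to add: the intermediate state lives on the page and nowhere else.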

That is the whole explanation, and notice what it does not require: no new module, no hidden deliberation organ, no ghost. A machine with fixed depth per token and attention over its own output was always, in principle, capable of staged computation — provided the stages were written out. What Wei et al. and Kojima et al. found was the key that turns it on: the model has to have learned, from its training data, the genre of worked steps. Enough worked solutions exist in human text — textbooks, homework help, proofs — that large models have the genre in the metal. The prompt does not teach the model to reason; it steers generation into the region of text-space where the work gets written down, and the architecture you already built does the rest. No magic changed hands — only the address the tokens were sampled from.

A caution the house owes you, because it becomes the whole subject of Chapter 15: the chain is output. It is text the model generated, shaped by the same training pressures as every other token — not a log dumped from the mechanism. “The model wrote a sensible-looking step” and “the model’s computation went through that step” are two different claims, and only one of them is on the page. Hold that distinction; we will test it with evidence at the end of the book.

11.5  Try it — on theirs, and honestly, on ours

This chapter’s method needs no GPU, no checkpoint, no budget: it is a sentence. Take any frontier assistant you use and give it a genuinely multi-step problem twice — once demanding only the final answer, once inviting the steps. Do five of them and keep honest score. You will be replicating, at your kitchen table, the shape of one of the field’s landmark results — the whole experiment costs you five minutes and the willingness to look. (The exercises make this precise.)
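Keeping honest score is easier with a small tally script. The sketch below assumes you paste each reply in by hand and that the last number in a reply is the model's answer. That extraction rule, the function names, and the sample replies are illustrative assumptions, not part of either paper's protocol; the replies shown are invented placeholders, not observed model output.

```python
import re


def final_number(reply):
    """Return the last number in a reply as a float, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", reply.replace(",", ""))
    return float(matches[-1]) if matches else None


def accuracy(replies, gold):
    """Fraction of replies whose final number equals the gold answer."""
    hits = sum(final_number(r) == g for r, g in zip(replies, gold))
    return hits / len(gold)


# Placeholder replies for illustration only; replace with what you observe.
gold = [408.0, 156.0]
direct_replies = ["398", "156"]
chain_replies = [
    "17 x 20 = 340, 17 x 4 = 68, so 340 + 68 = 408.",
    "13 x 10 = 130, 13 x 2 = 26, so the answer is 156.",
]

print(accuracy(direct_replies, gold), accuracy(chain_replies, gold))   # 0.5 1.0
```

The last-number rule is crude (a reply that restates the question after its answer will fool it), so spot-check the extraction by eye before trusting the tally.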

And ours? Here the house pays its scale-honesty debt. Our 11M-parameter model, trained on a small corpus, will happily imitate the format of step-by-step work, because the genre is cheap to mimic. What it will not reliably do is get the steps right, because chain-of-thought is a capability amplifier, not a capability source. Wei et al.’s own emergence finding points the same way, and the house will not pretend otherwise to flatter its own build: below a scale threshold, chains are costume, not computation, and the machine goes through the motions of thinking without the thinking underneath. The prompt can only elicit what pretraining put in the metal, and our little machine’s metal is thin. Even the GPT-2 weights from Chapter 9 (124M parameters, trained in 2019, years before this genre was deliberately cultivated) sit well below the scale where Wei et al. saw chains start to pay.

[Figure: The emergence curve: chain-of-thought only pays past a scale threshold. Horizontal axis: model scale; vertical axis: task accuracy; curves: direct answer and with chain of thought.]
The emergence finding, drawn as a shape rather than a table: below a scale threshold, chain-of-thought tracks at or below plain direct answering — the steps are costume. Past the threshold the chain curve breaks upward and pulls away. Our 11M model and the Chapter 9 GPT-2 (124M) sit well to the left of where Wei et al. (2022) saw chains start to pay. Axes are qualitative; the shape, not any number, is the point.

So Part II cannot stop at prompting, and does not. If eliciting is not enough, the field’s next moves were to spend more at answer time and to train the chain itself — and both of those, unlike frontier pretraining, have shapes we can genuinely build on our bench. Chapters 12 through 14 build them.

One last honest look at what this chapter leaves exposed, because the next chapter opens right here. A chain of thought is one draft, written once, in ink, with no eraser. The model lays it down token by token, and a single wrong turn at step three — one dropped sign, one misread quantity — runs downhill into a confidently wrong answer, with nothing in the lone chain standing there to catch it. When you cannot trust a single draft, there is a remedy older than every machine in this book, old as scribes and mathematicians: do not write one draft. Write many, and let them vote.


11.6  The thing to actually understand


11.7  Exercises

  1. Run the ablation yourself. Pick five multi-step word problems (money, rates, remainders). Give each to a frontier assistant twice: once as “Answer with only the final number,” once with “Let’s think step by step.” Tally the score under each condition. You are reproducing the shape — not the scale — of Kojima et al.’s experiment; note anything that surprises you about where the direct condition fails.
  2. Count the passes. Wire generate_and_count around your Chapter 8 model. Generate a 3-token completion and a 200-token completion from the same prompt and print both pass counts. Then compute what a chain costs in wall-clock terms on your machine: time the two runs and report tokens per second.
  3. Design a wall-breaker. Construct a problem that is easy with scratch paper and near-impossible without it — nested arithmetic works well (for instance, a sum of several two-digit products). Write one paragraph explaining, in terms of Section 11.2’s fixed-depth argument, exactly why the one-shot version is harder for this architecture — not just harder in general.
  4. Read the emergence claim. Read Wei et al. (arXiv:2201.11903) with one question in mind: at what scales do chains stop hurting and start helping? Then write two sentences on what that implies for anyone hoping to prompt reasoning out of a small open model — including ours.
  5. Watch the costume. Prompt your own trained Chapter 8 model with a few-shot worked-arithmetic pattern and sample. Label whatever comes out as illustrative, and inspect it: does it imitate the format of steps? Are the steps arithmetic-correct? Write down the cleanest example you find of format without competence — you now own a specimen of the emergence threshold from the wrong side.
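For Exercise 3, a small generator can produce nested-arithmetic problems together with the scratch-paper lines that solve them. This is a sketch under stated assumptions: the function name, the fixed seed, and the "a x b" text format are invented here for illustration, and you are free to construct problems by hand instead.

```python
import random


def wall_breaker(n_products=3, seed=0):
    """Build a sum of two-digit products, its worked steps, and its answer.

    Hypothetical helper for Exercise 3. Returns (question, lines, total).
    """
    rng = random.Random(seed)    # fixed seed so the problem is repeatable
    pairs = [(rng.randint(10, 99), rng.randint(10, 99)) for _ in range(n_products)]
    question = " + ".join(f"{a} x {b}" for a, b in pairs) + " ="
    # One scratch-paper line per partial product...
    lines = [f"{a} x {b} = {a * b}" for a, b in pairs]
    # ...then one line that adds the partial products together.
    total = sum(a * b for a, b in pairs)
    lines.append(" + ".join(str(a * b) for a, b in pairs) + f" = {total}")
    return question, lines, total


question, lines, total = wall_breaker()
print(question)
print("\n".join(lines))
```

Each extra product adds one more intermediate result that the one-shot version must carry with nowhere to write it, which is the point your paragraph should explain.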
What’s next
Ch 12 — Buying Time to Think — Test-Time Compute

A 37th-Chamber original. Methods cited: Wei et al. (2022), “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv:2201.11903, NeurIPS 2022 (few-shot chains; emergence with scale; 540B PaLM, eight exemplars, state-of-the-art GSM8K over finetuned-GPT-3-plus-verifier — confirmed); Kojima et al. (2022), “Large Language Models are Zero-Shot Reasoners,” arXiv:2205.11916, NeurIPS 2022 (“Let’s think step by step”; GSM8K 10.4% → 40.7%, MultiArith 17.7% → 78.7% with text-davinci-002 — confirmed); Cobbe et al. (2021), “Training Verifiers to Solve Math Word Problems,” arXiv:2110.14168 (GSM8K, 8.5K problems — confirmed); Sebastian Raschka, “Understanding Reasoning LLMs,” Ahead of AI (2025), magazine.sebastianraschka.com (the operational definition quoted — confirmed); Raschka, Build a Large Language Model (From Scratch) (Manning, 2024; code rasbt/LLMs-from-scratch) and Build a Reasoning Model (From Scratch) (Manning, 2026; code rasbt/reasoning-from-scratch) — both confirmed. The worked 17×24 steps in §11.4 are the author’s own arithmetic used as illustration, not a reported model run. All prose and code written fresh.

Written by a Fable · Edited by bobby-dig8al