The Opaque Box · Part II · Chapter 12

Buying Time to Think

Chapter 11 ended on an honest asymmetry: the chain buys the model compute — every token it writes is one more full pass through the machine — but a single chain is a single draft, and one wrong turn at step three poisons everything after it. Writers have known the fix for centuries. You do not trust the first draft. You write several, and you keep what survives the comparison. This chapter buys a machine that same discipline — and pays cash for it, one full forward pass at a time.

12.1 The second scaling axis

Everything in Part I bought capability the same way: by spending at training time. More parameters, more data, longer runs — from the model we assembled on our bench to the giant whose weights we poured into it in Chapter 9, the recipe was always make the machine bigger and teach it longer. And every dollar of that spending was spent before the first real question was ever asked. Once training ends, the model answers every question on a fixed budget: one forward pass per generated token, the same depth for “hello” as for a proof. Ask it something hard and it cannot break a sweat. The machine cannot try harder.

Chapter 11 found the loophole: generation itself is compute. Every token of chain-of-thought the model writes buys it one more pass through all its layers, and the chain carries intermediate state forward in the context window. But Chapter 11 treated that as a prompting discovery — something we coax out. The turn the field took in 2024 was to stop coaxing and start paying: treat compute at answer time as a first-class dial, a second axis you scale deliberately, the way Part I scaled parameters.

The moment that axis got named in public: on September 12, 2024, OpenAI published “Learning to reason with LLMs”, announcing the o1 model series and stating that performance “consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).” Two dials, side by side, in the announcement itself. The first dial is Chapter 13’s subject. The second is this chapter’s.

A month before that announcement, Snell et al. (2024, arXiv:2408.03314) had put the second dial on a budget sheet in “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.” Their finding: on problems where a smaller model already has a non-trivial success rate, optimally allocated test-time compute can outperform a model 14× larger, and a compute-optimal allocation strategy improves efficiency by more than 4× over a naive best-of-N baseline. Read the condition as carefully as the headline, because the headline will get quoted without it: where the model already has traction. On problems the base model essentially never solves, no amount of extra drafting rescues it. Test-time compute is an amplifier, not a source (the same honest verdict Chapter 11 gave chain-of-thought). You cannot buy a signal that was never there.

Two dials, not one. Part I moved right along the train-time axis; this chapter climbs the test-time axis. The reasoning-model move scales both. Snell et al. showed the climb can beat the shift — only where the model already has traction.

The rest of this chapter builds the simplest, most buildable version of the second dial, in code you can run today, and then surveys the heavier machinery honestly. The load-bearing idea throughout is the one in the epigraph: many drafts.

12.2 Many drafts: self-consistency

A single chain of thought has a single point of failure at every step. The model reasons token by token; a small slip mid-chain — a dropped negative sign, a misread quantity — flows downstream and hardens into a confidently wrong answer. Nothing in a lone chain can catch it, because the chain is the only record of the work, and a document cannot proofread itself.

But recall Chapter 8: our sampler has a temperature dial. At T > 0, two runs on the same prompt take different roads. Wang et al. (2022, published at ICLR 2023 — arXiv:2203.11171) turned that variability from a nuisance into the method itself. Self-consistency: instead of generating one chain, sample a diverse set of reasoning paths at temperature, let each run to completion, and then — in the paper’s wording — select “the most consistent answer by marginalizing out the sampled reasoning paths.” For problems with a short, well-defined final answer, that reduces to something beautifully plain: parse the final answer out of each chain and take the majority vote.

Why should the most common answer be more often right than a single sample? Because there are many valid roads to a correct answer, and they all end at the same place. Errors are different: a slip at step two and a slip at step four land in different wrong places. Correct reasoning converges; mistakes scatter. Agreement across independently sampled drafts is therefore evidence — not proof, evidence — that the drafts found something real rather than something idiosyncratic. Wang et al. reported gains including +17.9 percentage points on GSM8K over standard chain-of-thought prompting, with no new training and no new model. Not a single weight moved. The whole method is a wrapper around a sampler you already own.

Self-consistency: one question, k sampled drafts, one vote. Correct roads converge on the same destination; slips scatter. Values shown are illustrative — no model was run to produce them.
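The converge-versus-scatter argument can be checked numerically before any model is involved. The sketch below is a Monte Carlo simulation under stated assumptions: each draft is independently right with probability p_correct, and each wrong draft picks uniformly among a handful of wrong answers. The function name simulate_vote_accuracy and every number in it are illustrative; no language model is run.

```python
import random
from collections import Counter

def simulate_vote_accuracy(p_correct, wrong_answers, k, trials=20000, seed=0):
    """Estimate how often a plurality vote over k drafts picks the right answer.

    Assumptions (illustrative): drafts are independent, each is right with
    probability p_correct, and wrong drafts scatter uniformly over wrong_answers.
    """
    rng = random.Random(seed)          # local generator, so runs are repeatable
    right = "RIGHT"
    wins = 0
    for _ in range(trials):
        ballots = [
            right if rng.random() < p_correct else rng.choice(wrong_answers)
            for _ in range(k)
        ]
        # most_common breaks exact ties by first-counted order
        winner = Counter(ballots).most_common(1)[0][0]
        wins += (winner == right)
    return wins / trials

for k in (1, 5, 11, 25):
    estimate = simulate_vote_accuracy(0.6, ["A", "B", "C"], k)
    print(f"k={k:>2}  estimated vote accuracy = {estimate:.3f}")

# The amplifier condition: with no signal, no number of drafts helps.
print(simulate_vote_accuracy(0.0, ["A", "B", "C"], 11))
```

Under these assumptions the vote over eleven drafts should land well above the single-draft rate of 0.6, and the last line stays at exactly zero: agreement amplifies a signal that exists and cannot create one that does not. The independence assumption is doing real work here; drafts that share a mistake break it.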

12.3 The code

Here is the whole method — fifteen lines, no training, no new model. It is deliberately model-agnostic: generate is any callable that takes a prompt and a temperature and returns text. Chapter 8’s generate() over our own model qualifies. So does the GPT-2 we loaded in Chapter 9. So does an API call to a frontier model. The wrapper does not care what is inside the box — which is rather the theme of this book.

from collections import Counter

# ── Chapter 12: self-consistency over any generator ─────────────────────────

def parse_final_answer(text):
    """
    Pull the final answer out of a generated chain.
    Convention: the chain ends '... the answer is X'.
    Returns X as a string, or None if the pattern is absent.
    """
    marker = "the answer is"
    idx = text.lower().rfind(marker)
    if idx == -1:
        return None                              # no parseable ballot
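    tail = text[idx + len(marker):].strip().lstrip(":").strip()  # tolerate 'the answer is: X'
    if not tail:
        return None
    token = tail.split()[0].rstrip(".,;:!?")
    return token or None                         # bare punctuation is not a ballot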


def self_consistency(generate, prompt, k=10, temperature=0.8):
    """
    Sample k chains from `generate`, majority-vote the parsed answers.
    generate: callable (prompt, temperature) -> generated text.
              Chapter 8's generate() qualifies; so does any API call.
    Returns (winner, votes, chains).
    """
    chains  = [generate(prompt, temperature=temperature) for _ in range(k)]  # k drafts
    answers = [parse_final_answer(c) for c in chains]                        # k ballots
    votes   = Counter(a for a in answers if a is not None)                   # the tally
    if not votes:
        return None, votes, chains               # nothing parseable at all
    winner, _ = votes.most_common(1)[0]          # the majority answer
    return winner, votes, chains


# ── sanity check: a stub generator standing in for a model ──────────────────
# The stub SIMULATES a solver that reasons correctly 60% of the time.
# Illustrative only — no language model was run to produce these strings.
import random
random.seed(37)

def stub_generate(prompt, temperature=0.8):
    if random.random() < 0.6:
        return "step 1 ... step 2 ... the answer is 408"
    wrong = random.choice(["388", "418", "480"])
    return "step 1 ... slip ... the answer is " + wrong

winner, votes, chains = self_consistency(stub_generate, "17 * 24 = ?", k=11)
print(winner)   # the majority answer across 11 drafts
print(votes)    # the full tally — run it and look

Line-by-line walk

parse_final_answer: the unsung half of the method. A vote needs comparable ballots, so the chains must be coerced into a canonical final answer — here by convention (“the answer is X”), with light punctuation stripping. In real systems this parser does a lot of quiet work, and a sloppy one silently splits identical answers into different ballots. Chains that produce no parseable answer simply do not vote.
self_consistency: sample k drafts at temperature (diversity is the point: at T = 0 every draft is the same draft and the vote is theater), parse each, tally with Counter, return the most common. On an exact tie, most_common returns whichever tied answer was counted first. That choice is arbitrary, and it is worth knowing your tools well enough to notice it.
stub_generate: an honest stand-in. It is not a model; it is a coin-flip simulation of a solver that is right 60% of the time, so you can test the wrapper without a trained engine. With eleven drafts from a 60%-accurate solver whose errors scatter across different wrong answers, the majority is very likely — not certain — to be the right one. That gap between likely and certain is exercise 4.
And the honest note from Chapter 11 still applies to our bench: our 11M-parameter model has little latent reasoning for a vote to amplify. Self-consistency, like chain-of-thought, is an amplifier. The wrapper is real, runnable, and yours; the engine worth wrapping is what Chapter 13 is about.
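The walk’s warning about sloppy parsers can be made concrete. The sketch below assumes numeric answers and uses a hypothetical helper, canonicalize, whose rules (drop a leading dollar sign, drop thousands separators, fold 408.0 into 408) are illustrative assumptions rather than part of the chapter’s method. It tallies the same six ballots twice, once raw and once canonicalized.

```python
from collections import Counter

def canonicalize(ballot):
    """Map surface variants of a numeric answer onto one canonical string.

    Hypothetical helper: the normalization rules are assumptions chosen for
    this example, not a general-purpose answer parser.
    """
    if ballot is None:
        return None
    cleaned = ballot.strip().lstrip("$").replace(",", "")
    try:
        value = float(cleaned)
    except ValueError:
        return cleaned.lower()         # non-numeric ballots: case-fold only
    if value.is_integer():
        return str(int(value))         # "408.0" and "408" become one ballot
    return repr(value)

raw = ["408", "408.0", "$408", "4,080", "408", "418"]

print(Counter(raw))                               # same answer, split ballots
print(Counter(canonicalize(b) for b in raw))      # variants merged before the vote
```

In the raw tally the answer 408 is spread across three different strings; after canonicalizing it holds four of the six ballots. Note that "4,080" correctly stays a different answer. A natural place for such a helper would be between parse_final_answer and the Counter, though which rules are safe depends entirely on the task.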

12.4 Verifiers and search, briefly

Majority voting is direct democracy at its most naive: every draft’s ballot weighs the same, the confident correct solution and the lucky guess counted alike. The heavier machinery in this section fires the flat vote and hires judgment instead — and judgment, as always, sends a bill. One honest paragraph each.

Best-of-N with a learned verifier. Cobbe et al. (2021, arXiv:2110.14168) introduced GSM8K — a dataset of 8.5K grade-school math word problems — and with it the verifier recipe: train a second model to score candidate solutions, then at answer time sample many candidates and submit the verifier’s top-ranked one. Verification boosted performance significantly over fine-tuning alone. There is a pleasing thread here: this finetuned-GPT-3-plus-verifier system was the very baseline that chain-of-thought prompting beat in Chapter 11. The methods leapfrog each other, year over year; that scramble is what a live field looks like from the inside.
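The best-of-N recipe has the same shape as the vote, with the tally swapped for a ranking. A minimal sketch, assuming the same generate contract as 12.3 and a hypothetical score callable standing in for the trained verifier; here the scorer is a hand-written rule and the generator replays three scripted drafts, so nothing below reproduces the system in Cobbe et al.

```python
import itertools

def best_of_n(generate, score, prompt, n=8, temperature=0.8):
    """Sample n candidates and return the one the scorer ranks highest.

    generate: callable (prompt, temperature) -> text, as in 12.3
    score:    callable (prompt, candidate) -> float; a stand-in for a
              learned verifier (hypothetical interface)
    """
    candidates = [generate(prompt, temperature=temperature) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score, scored

# Scripted drafts: two unsupported guesses and one worked solution.
_drafts = itertools.cycle([
    "guess ... the answer is 418",
    "17*20=340, 17*4=68, 340+68=408 ... the answer is 408",
    "guess ... the answer is 418",
])

def scripted_generate(prompt, temperature=0.8):
    return next(_drafts)

def toy_score(prompt, candidate):
    # Illustrative rule, NOT a trained model: reward drafts that show work.
    return 0.9 if "=" in candidate else 0.2

best, best_score, scored = best_of_n(scripted_generate, toy_score, "17 * 24 = ?", n=3)
print(best)
```

A flat vote over these three drafts would return 418, two ballots to one; the scorer picks the draft ending in 408. That is the whole difference between the two gates, and it is only as good as the scorer behind it.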

Process reward models. A verifier that only sees the final answer can be fooled by a chain that stumbles into correctness. Lightman et al. (2023, arXiv:2305.20050, “Let’s Verify Step by Step”) trained reward models to score every step of the chain rather than just the outcome, and found that process supervision significantly outperforms outcome supervision: their process-supervised model solves 78% of problems from a representative subset of the MATH test set. The price is stated in the same paper, and it is paid in human labor: they released PRM800K, roughly 800,000 step-level human feedback labels. Grading the road instead of just the destination means a person read every step.
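The difference between grading the destination and grading the road fits in a toy. The sketch below is an assumption-heavy stand-in: toy_step_score is a hand-written arithmetic checker playing the part of a process reward model, outcome_score is an oracle on the final line, and multiplying per-step scores is one common way to aggregate them, assumed here for illustration. The two chains are made up so that one reaches the right total through two cancelling slips.

```python
import math
import re

# One arithmetic step per line, e.g. "17 * 20 = 340"
STEP = re.compile(r"^\s*(\d+)\s*([*+])\s*(\d+)\s*=\s*(\d+)\s*$")

def toy_step_score(step):
    """Stand-in for a process reward model: check one arithmetic line.
    A real PRM is a trained model; this rule is purely illustrative."""
    m = STEP.match(step)
    if m is None:
        return 0.5                                  # cannot judge this step
    a, op, b, c = int(m[1]), m[2], int(m[3]), int(m[4])
    ok = (a * b == c) if op == "*" else (a + b == c)
    return 0.95 if ok else 0.05

def process_score(steps, step_scorer):
    """Score a chain as the product of its per-step scores."""
    return math.prod(step_scorer(s) for s in steps)

def outcome_score(steps, expected):
    """Score a chain by its final result only (oracle stand-in)."""
    m = STEP.match(steps[-1])
    return 1.0 if m and int(m[4]) == expected else 0.0

sound = ["17 * 20 = 340", "17 * 4 = 68", "340 + 68 = 408"]
lucky = ["17 * 20 = 350", "17 * 4 = 58", "350 + 58 = 408"]   # two slips cancel

for name, chain in (("sound", sound), ("lucky", lucky)):
    print(name, outcome_score(chain, 408), process_score(chain, toy_step_score))
```

Both chains earn the same outcome score, because both end at 408. Only the step-level score separates them, and it can do so only because something graded every line.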

Tree of Thoughts. Self-consistency runs k independent, complete chains. Yao et al. (2023, arXiv:2305.10601) made the chains a search: branch on partial thoughts, have the model evaluate its own intermediate progress, abandon dead branches, backtrack. On the Game of 24, GPT-4 with chain-of-thought solved 4% of tasks; with Tree of Thoughts, 74%. Same model, same weights, eighteen-fold jump — bought entirely by letting it wander, judge, and backtrack. The honest cost: this is no longer a fifteen-line wrapper. It needs a state evaluator, a search loop, and many more model calls per answer — real engineering, for problems where exploration genuinely earns its keep.
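The three parts named above (a proposer, a state evaluator, a search loop) can be sketched on a toy puzzle. This is a generic breadth-first search with pruning, offered as an illustration of the shape and not as the implementation in Yao et al.: propose and evaluate are hand-written rules where Tree of Thoughts would call the model, and the puzzle (reach 22 from 1 using +3 and *2) is invented for the example.

```python
TARGET = 22  # toy goal: reach 22 from 1 using the moves "+3" and "*2"

def propose(state):
    """Stand-in for 'ask the model for next thoughts': two fixed moves."""
    value, path = state
    children = [(value + 3, path + ("+3",)), (value * 2, path + ("*2",))]
    return [c for c in children if c[0] <= TARGET]   # overshoots are dead branches

def evaluate(state):
    """Stand-in for 'ask the model to rate progress': closer is better."""
    return -abs(TARGET - state[0])

def is_goal(state):
    return state[0] == TARGET

def tree_search(root, propose, evaluate, is_goal, beam_width=3, max_depth=6):
    """Expand partial states level by level, keeping the best beam_width.

    Returns (goal_state or None, number of candidate states examined).
    """
    frontier = [root]
    examined = 0
    for _ in range(max_depth):
        children = []
        for state in frontier:
            for child in propose(state):
                examined += 1
                if is_goal(child):
                    return child, examined
                children.append(child)
        if not children:
            return None, examined                    # every branch died
        children.sort(key=evaluate, reverse=True)    # judge intermediate progress
        frontier = children[:beam_width]             # abandon the rest
    return None, examined

goal, examined = tree_search((1, ()), propose, evaluate, is_goal)
print(goal, examined)
```

Every entry counted in examined would be at least one model call in a real system, plus the calls spent evaluating. That count, multiplied by the cost of a call, is the bill for exploration.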

Two gates on the same N candidates. The majority vote counts every ballot equally — cheap, but blind to quality. A learned verifier scores each candidate and ranks it — it can tell a sound solution from a lucky one, at the price of a second trained model and its labels.

12.5 The bill

None of this is free, and the arithmetic hides nothing: k drafts cost k times the compute of one draft — k times the generated tokens, k times the money if you are paying per token, and (without parallelism) k times the wall-clock wait. A vote of eleven means paying for eleven full answers to throw ten away. Tree search can cost far more, and a process reward model bolts a second model’s forward passes on top. Test-time compute is exactly what its name says, with no discount for volume: compute, at test time, every single time you ask.

The bill is a straight line: k drafts cost exactly k times one draft. The vote of eleven pays for eleven full answers and keeps one. This is arithmetic, not a projection — there is no discount for volume, and the bill arrives on every question you ask.
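The arithmetic of the bill is short enough to write down. A minimal sketch, assuming a flat per-token price and counting generated tokens only (prompt tokens are ignored); vote_bill is a hypothetical helper and every number passed to it is a placeholder, not a real price or latency.

```python
def vote_bill(k, tokens_per_draft, price_per_1k_tokens, seconds_per_draft,
              parallel=False):
    """Cost of k drafts under a flat per-token price (illustrative model).

    Returns (generated_tokens, dollars, wall_clock_seconds).
    Parallel sampling leaves tokens and dollars unchanged; it only
    collapses the wait to roughly one draft's latency.
    """
    tokens = k * tokens_per_draft
    dollars = tokens / 1000 * price_per_1k_tokens
    seconds = seconds_per_draft if parallel else k * seconds_per_draft
    return tokens, dollars, seconds

# Placeholder figures: 400 tokens per draft, $0.002 per 1K tokens, 6 s per draft.
for k in (1, 5, 11):
    tokens, dollars, seconds = vote_bill(k, 400, 0.002, 6.0)
    print(f"k={k:>2}  tokens={tokens:>5}  dollars={dollars:.4f}  wait={seconds:.0f}s")
```

Parallelism is the only discount on offer, and it applies to the wait alone: with parallel=True the wall-clock figure drops to one draft’s latency while tokens and dollars stay at k times the single-draft cost.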

This is also why reasoning-mode models are slower and pricier than their ordinary siblings. The o1 announcement is explicit that the model spends more time thinking before it answers — and here is the part worth reading twice: those reasoning tokens are billed to the user in full, then withheld from view (per the o1 post; more on the hiding itself in Chapter 15). You pay for the thinking and are not shown it. The trade, stated plainly: spend the second dial on problems that are hard, checkable, and worth the wait — and never burn it on boilerplate a single pass would have nailed. Snell et al.’s condition gives the same advice from the other side: the spending pays best where the model already has traction and needs reliability, not miracles.

And notice what every method in this chapter has in common. The vote, the verifier, the tree — all of them rent capability at answer time, over and over, from a model whose weights never change. Rent buys tonight’s answer and nothing more: tomorrow the model is exactly as likely to slip at step three as it was today, and we pay the whole bill again to catch the same slip. Part I taught us the other verb — the one that builds equity. When you want a behavior to live in the weights, you train for it, once, and keep it. The frontier move of 2024–25 was to do precisely that: train the chain itself, with reinforcement learning, wherever an answer can be mechanically checked. That is Chapter 13.

12.6 The thing to actually understand

There are two dials, not one. Part I scaled train-time compute; this chapter scales answer-time compute. The o1 announcement named both publicly, and Snell et al. showed the second dial can sometimes beat the first — a smaller model with well-spent thinking time outperforming a 14× larger one, where the small model has traction.
Diversity plus agreement is a signal — and only together. Self-consistency works because correct reasoning paths converge on one destination while errors scatter. Sampling at temperature supplies the diversity; the vote reads the convergence. Kill either half and the method is theater.
The wrapper is yours and the model is untouched. self_consistency() changes no weights and needs no training — it is fifteen lines around Chapter 8’s sampler, and it works over any generator, including ones you cannot see inside.
Verification climbs a ladder of cost. Flat vote → learned verifier ranking N candidates → process reward model grading every step → tree search with backtracking. Each rung buys judgment and pays in models, labels, and calls.
Compute at answer time is a recurring bill. k drafts, k× cost, every question, forever — you rent the answer, you never own it. Everything in this chapter rents capability; nothing instills it. That distinction is the door to Chapter 13.

12.7 Exercises

Run the vote. Run 12.3’s code as-is. Then drop the stub’s accuracy from 0.6 to 0.45, then 0.30, re-running several times at each level (change the seed). At what accuracy does the majority stop reliably being the right answer? Notice that the vote still helps below 50% accuracy here — work out why the scattering of wrong answers across three options is doing that.
The k-sweep, on paper. For k = 1, 3, 11, 33, tabulate the cost side: generated tokens per answer, dollars at a per-token price of your choosing, latency without parallelism. Then sketch the benefit side qualitatively: where should the accuracy curve flatten, and why can it never cross the ceiling set by “the model never finds the right road at all”?
Design a checkable task. Pick a task where the final answer is short and mechanically comparable (an integer, a date, a chess square). Write the prompt convention and the parse_final_answer for it. Then pick a task where this is hard (summarize a paragraph) and write down exactly where the vote breaks — what does “majority” even mean when no two ballots match?
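Break the vote: correlated errors. Self-consistency assumes errors scatter. Construct the failure case: modify stub_generate so that 70% of wrong drafts make the same wrong turn and return the same wrong answer. At the stub’s 0.6 accuracy the right answer still holds the plurality, so also drop the accuracy to 0.30, where the shared wrong answer collects about 49% of the ballots (0.7 × 0.7) against 30% for the right one. Watch the vote confidently amplify the shared mistake. State the lesson in one sentence: agreement is only evidence when the drafts are what?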
Read the condition. Read the abstract of Snell et al. (arXiv:2408.03314). Write down, in your own words, on which problems test-time compute beats parameter scaling and on which it does not. The condition matters more than the headline.

What’s next

Ch 13 — Learning to Reason — RL with Verifiable Rewards


A 37th-Chamber original. Methods cited: OpenAI, “Learning to reason with LLMs” (September 12, 2024), openai.com/index/learning-to-reason-with-llms (train-time/test-time quote and billed reasoning tokens — confirmed); Snell et al. (2024), “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters,” arXiv:2408.03314 (14× and 4× figures with the problem-difficulty condition — confirmed); Wang et al. (2022; ICLR 2023), “Self-Consistency Improves Chain of Thought Reasoning in Language Models,” arXiv:2203.11171 (method and +17.9% GSM8K — confirmed); Cobbe et al. (2021), “Training Verifiers to Solve Math Word Problems,” arXiv:2110.14168 (GSM8K, 8.5K problems, verifier best-of-N — confirmed); Lightman et al. (2023), “Let’s Verify Step by Step,” arXiv:2305.20050 (process vs outcome supervision, 78% MATH subset, PRM800K — confirmed); Yao et al. (2023), “Tree of Thoughts: Deliberate Problem Solving with Large Language Models,” arXiv:2305.10601 (Game of 24, 74% vs 4% — confirmed). Diagram values are illustrative. All prose and code written fresh.