The Opaque Box · Part II · Chapter 13

Learning to Reason

Chapters 11 and 12 share a quiet limitation: they elicit. The prompt coaxes chains out of the model; the vote sorts good drafts from bad; and underneath it all the weights never move — tomorrow the model is exactly as likely to slip at step three as it was today. Part I taught us the other verb. When you want a behavior to live in the weights, you stop coaxing and you train for it. This chapter is where the field stopped asking the model nicely and learned to train the chain itself — against the oldest, most incorruptible teacher there is: an answer key.

13.1 From eliciting to instilling

Take stock of Part II so far. Chapter 11 showed that a model prompted to think out loud can solve problems it fails in one breath — but the prompt only elicits what pretraining happened to deposit; you are searching the frozen model, not improving it. Chapter 12 spent compute at answer time to make the elicited chains reliable — but the spending is rent, due again on every question, forever. In both chapters the model itself is frozen. Nothing it does well today is done better tomorrow. That is the ceiling this chapter finally breaks.

Part I’s whole arc points at the missing move. Chapter 7 taught the machine to predict by turning “be less surprised” into a loss and letting eleven million numbers drift downhill. Chapter 10 bent the trained machine to a task with more of the same. So the question writes itself: what loss teaches a model to reason well? Supervised fine-tuning on worked examples is one answer, and it returns with force in Chapter 14. But the move that defined the 2024–25 frontier was different: reinforcement learning, with the reward coming not from a judge’s opinion but from a check — did the final answer match the key, yes or no.

The field’s name for this is RLVR — reinforcement learning with verifiable rewards — a term coined by Lambert et al. (2024) in the Tülu 3 paper (arXiv:2411.15124), which introduces it as “a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR).” The chapter’s load-bearing picture is a student with a thick problem book and the answer key in the back: nobody grades the scratch work, nobody watches over the shoulder — the student does ten thousand problems, checks each answer, and keeps whatever habits of work led to checked-right results. Practice against an answer key.

13.2 RLHF vs RLVR: firing the judge

Chapter 10 ended with reinforcement learning from human feedback: collect human preferences between model outputs, train a reward model to imitate those preferences, then optimize the policy against it. That machinery built the assistants everyone uses, and it earns its keep where quality is a matter of judgment — helpfulness, tone, harm. But a learned judge has two structural weaknesses. It is expensive, because its training data is human labor. And it is gameable, because it is itself an opaque model with cracks — and an optimizer squeezing it hard will find the cracks before it finds the work, surfacing outputs the judge scores highly for all the wrong reasons. The field calls this reward hacking: the policy learns to flatter the judge instead of doing the work, and it learns fast, because flattery is cheaper than thinking.

Now notice the special structure of certain tasks. A math problem has a final answer that either matches the key or does not. A program either passes the test suite or does not. For these tasks the judge can be fired and replaced with a rule — a few lines of comparison code that cannot be flattered, bribed, or fooled, because there is nothing inside it to fool. That is the entire content of “verifiable rewards,” and it is why the DeepSeek-R1 team, in the experiment we meet in 13.4, deliberately avoided neural reward models — citing reward hacking — and used a rule-based reward system instead: an accuracy reward (is the final answer right?) plus a format reward (did the model put its thinking and its answer where they belong?) (arXiv:2501.12948).
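To make the two rule-based terms concrete, here is a minimal sketch of what such a reward could look like. The tag layout (`<think>...</think>` followed by `<answer>...</answer>`), the function names, and the 1.0 and 0.1 weights are assumptions chosen for illustration; this is not the DeepSeek-R1 team's code.

```python
import re

# Assumed output layout (hypothetical): reasoning inside <think> tags,
# then the final result inside <answer> tags, and nothing else.
LAYOUT = re.compile(r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL)

def format_reward(output: str) -> float:
    """0.1 if the whole output follows the assumed layout, else 0.0."""
    return 0.1 if LAYOUT.fullmatch(output) else 0.0

def accuracy_reward(output: str, key: str) -> float:
    """1.0 if the text inside <answer> equals the key after trimming whitespace."""
    found = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return 1.0 if found and found.group(1).strip() == key.strip() else 0.0

def rule_reward(output: str, key: str) -> float:
    """Total reward: a comparison plus a layout check, no learned model anywhere."""
    return accuracy_reward(output, key) + format_reward(output)

good = "<think>6 * 7 = 42</think><answer>42</answer>"
lucky = "<think>banana banana</think><answer>42</answer>"   # nonsense road, right destination
bare = "The answer is 42."                                   # no tags at all
wrong = "<think>6 * 7 = 48</think><answer>48</answer>"

for out in (good, lucky, bare, wrong):
    print(rule_reward(out, "42"))
```

Under these assumptions the `good` and `lucky` outputs earn the same reward, because the rule compares the answer and checks the layout but never reads the reasoning.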

Be clear about what the answer key does not see. It grades destinations, never roads. No part of the reward reads the chain of thought to judge whether the reasoning was elegant, honest, or even coherent. The format check asks only whether the thinking and the answer sit where they belong; the accuracy check looks only at the final answer. That omission looks like a weakness, a hole in the grading. Hold that thought: in 13.4 the hole turns out to be the most interesting fact in this chapter, and in Chapter 15 it comes back with a subpoena.

Same job, two graders. RLHF trains an opaque judge that an optimizer eventually learns to flatter; RLVR replaces it with a rule that compares the answer to a key — the charged element, because there is nothing inside a comparison to fool.

13.3 GRPO, the algorithm

The reward is a check; now the check must become a gradient. The algorithm that carried the 2025 results is GRPO — Group Relative Policy Optimization — introduced by Shao et al. (2024) in the DeepSeekMath paper (arXiv:2402.03300) as “a variant of Proximal Policy Optimization (PPO)” that improves mathematical reasoning “while concurrently optimizing the memory usage of PPO.”

The scheme fits in one breath. For each problem, sample a group of G answers from the current model — Chapter 12’s many drafts, reborn as training data. Grade every draft with the rule-based reward. Then compute each draft’s advantage — how much better or worse it did than its own group — as the group-normalized score: reward minus the group’s mean, divided by the group’s standard deviation. Finally, nudge the model’s weights to make above-average drafts more probable and below-average drafts less probable.
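Written out, the group-normalized score described above looks like this. The symbols are our notation, chosen to match the prose: $r_1, \dots, r_G$ are the rewards of the G drafts for one problem, and the standard deviation is the population form, which is an assumption of this sketch.

```latex
\mu = \frac{1}{G}\sum_{i=1}^{G} r_i,
\qquad
\sigma = \sqrt{\frac{1}{G}\sum_{i=1}^{G}\left(r_i - \mu\right)^2},
\qquad
A_i = \frac{r_i - \mu}{\sigma}
```

A draft with $A_i > 0$ did better than its group and is made more probable; a draft with $A_i < 0$ did worse and is made less probable.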

The clever part is what is missing. Classic PPO needs to know a baseline — “how well does the model typically do here?” — and learns a whole second network (the critic, or value network) to estimate it. GRPO’s observation is almost insolent in its simplicity: if you are already sampling G drafts of the same problem, the group is its own baseline. The paper says it plainly: GRPO “foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources.” Half the machinery of PPO, deleted — and replaced by an average you could compute in your head.

The advantage is just a dot’s signed distance from the group mean, divided by the group’s spread. The mean is the charged element because GRPO gets it for free from the group it already sampled — the value network PPO would have learned, deleted. When every draft ties, the line runs through all of them and nothing moves.

# ── PSEUDOCODE — the shape of one GRPO step (Shao et al. 2024) ──────────────
# This is the algorithm's skeleton, not a runnable trainer. The real thing
# wraps a full LLM in a clipped PPO-style objective with a KL penalty,
# and it runs on data-center hardware, not a laptop.

from statistics import mean, pstdev

def grpo_step(policy, problem, check_answer, G=16):
    """One update on one problem: sample a group, grade it, nudge the policy."""

    # 1 · sample a GROUP of G chains from the current policy (ch 8's dial, T > 0)
    chains = [policy.sample(problem, temperature=1.0) for _ in range(G)]

    # 2 · grade each chain against the ANSWER KEY — rule-based, no learned judge
    rewards = []
    for chain in chains:
        r = 0.0
        if check_answer(chain.final_answer):     # accuracy reward: is it right?
            r += 1.0
        if chain.follows_format:                 # format reward: work shown where asked
            r += 0.1
        rewards.append(r)

    # 3 · group-relative advantage: the group is its own baseline.
    #     No value network — GRPO forgoes the critic entirely.
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]

    # 4 · policy-gradient update: raise the probability of above-average
    #     chains, lower the below-average — clipped, PPO-style (ch 7's
    #     optimizer machinery, pointed at a new objective).
    policy.update(chains, advantages)

The advantage computation, though, is not pseudocode at all. It is plain arithmetic, and it runs on anything:

# ── runnable: the group-relative advantage on a toy group ───────────────────
from statistics import mean, pstdev

# 8 graded chains: 1.0 = correct answer, +0.1 = clean format
rewards = [1.1, 0.0, 0.1, 1.0, 0.0, 0.1, 1.1, 0.0]

mu, sigma  = mean(rewards), pstdev(rewards)
advantages = [round((r - mu) / (sigma + 1e-8), 2) for r in rewards]

print(round(mu, 3))    # 0.425 — the group mean, the baseline
print(advantages)      # [1.35, -0.85, -0.65, 1.15, -0.85, -0.65, 1.35, -0.85]

Read those numbers the way the optimizer does. The chains that reached checked-right answers float above zero and get pulled toward; the misses sink below zero and get pushed away; a clean format nudges a chain a little higher either way. And notice the edge case that teaches the most: if all G drafts score the same — all right, or all wrong — every advantage is zero and the update does nothing. A problem the model always aces teaches it nothing; a problem it never cracks teaches it nothing. The learning signal lives entirely at the frontier of problems the model sometimes solves. Chapter 12 met the same condition from the outside (Snell et al.’s “non-trivial success rate”); here it falls straight out of the subtraction.
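That edge case can be checked directly. The sketch below assumes pure 0/1 rewards (no format bonus) and a group of G = 8, and prints the largest and smallest advantage in the group as the number of correct drafts k runs from 0 to 8. The helper name `group_advantages` is ours.

```python
from statistics import mean, pstdev

def group_advantages(k, G=8):
    """Advantages for a group in which k of G drafts earn 1.0 and the rest 0.0."""
    rewards = [1.0] * k + [0.0] * (G - k)
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

for k in range(9):
    adv = group_advantages(k)
    # max(adv) is what a correct draft receives, min(adv) what a wrong one receives;
    # when the group is unanimous (k = 0 or k = 8) both are zero.
    print(k, round(max(adv), 2), round(min(adv), 2))
```

For 0/1 rewards the arithmetic reduces to sqrt((G - k) / k) for a correct draft and -sqrt(k / (G - k)) for a wrong one. A lone success in a struggling group therefore gets a large pull, a lone failure in a strong group gets a large push, and a unanimous group gets nothing.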

The RLVR loop: the policy samples a group of chains per problem, the answer key grades destinations by rule, the group-relative advantage separates above-average from below, and the update flows back into the weights. The key — not any judge — is the charged element.
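The loop can also be run end to end in miniature. The sketch below is a toy under loud assumptions: there is one problem, a "chain" is just one of five candidate final answers, the policy is a softmax over five numbers, and the update is a plain policy-gradient step with no clipping and no KL penalty. Every name and hyperparameter here is ours, picked for illustration; it shows the shape of the loop, not a real trainer.

```python
import math
import random
from statistics import mean, pstdev

ANSWERS = ["12", "17", "21", "42", "56"]   # toy "chains": five possible final answers
KEY = "42"                                 # the answer key for the single toy problem

def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_grpo(steps=200, G=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(ANSWERS)          # the whole "policy": one number per answer
    for _ in range(steps):
        probs = softmax(logits)
        # 1. sample a group of G answers from the current policy
        group = rng.choices(range(len(ANSWERS)), weights=probs, k=G)
        # 2. grade each one by rule against the key
        rewards = [1.0 if ANSWERS[a] == KEY else 0.0 for a in group]
        # 3. group-relative advantage: the group is its own baseline
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
        # 4. policy-gradient step; for a softmax, d log p(a) / d logit_j = [j == a] - p_j
        for j in range(len(logits)):
            grad = sum(adv * ((1.0 if a == j else 0.0) - probs[j])
                       for a, adv in zip(group, advantages)) / G
            logits[j] += lr * grad
    return softmax(logits)

before = softmax([0.0] * len(ANSWERS))[ANSWERS.index(KEY)]
after = toy_grpo()[ANSWERS.index(KEY)]
print(round(before, 2), "->", round(after, 2))
```

The policy starts by guessing uniformly, so the keyed answer begins at probability 0.2. Steps where the group is unanimous change nothing; every mixed group raises the logit of the keyed answer and lowers the logits of the wrong answers that were sampled.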

13.4 What happened when they ran it

In January 2025, DeepSeek-AI published DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv:2501.12948). The paper’s starkest experiment is DeepSeek-R1-Zero: take a base model — DeepSeek-V3-Base, a next-token predictor like the one we built, at frontier scale — and apply large-scale reinforcement learning directly, with no supervised fine-tuning stage first, using only the rule-based accuracy and format rewards from 13.2. No worked examples of reasoning. No demonstrations to imitate. Nobody showing the model how to think. Just problems, an answer key, and the loop in the diagram, run at data-center scale — and then they let it go.

What came out is the reason this chapter exists. Over training, with nobody telling it to, the model’s chains grew longer on their own. It began re-examining its work and trying alternative approaches when a line of attack stalled, behaviors no line of code requested. The paper documents this in a subsection literally titled “Aha Moment of DeepSeek-R1-Zero,” describing the model learning “to allocate more thinking time to a problem by reevaluating its initial approach.” Sit with what that means against 13.2’s omission. Nobody wrote “double-check your work” into the reward. The reward never read the reasoning at all: beyond checking that the output was laid out as asked, it graded final answers and nothing else. Checking, backtracking, and taking more time emerged because, in the space of all possible chains, the ones that check themselves reach checked-right answers more often, and the loop pulls the weights toward whatever wins. The behavior was not programmed. It was selected for, the way evolution selects, blind and relentless, keeping whatever survives the check. This is the book’s recurring lesson (nobody inserted a rule; structure appeared under pressure from a simple objective), now operating one level up: not word-structure this time, but work-habits.

What emerged, as a shape: over training, R1-Zero’s chains lengthened on their own — checking, backtracking, taking more time — though no line of the reward asked for any of it. The rising curve is the charged element. Axes are honest and unnumbered: this is the shape the paper reports, not measured data.

DeepSeek-R1 proper, the model people actually used, adds engineering around that raw discovery: in the paper’s words, multi-stage training and cold-start supervised fine-tuning data before the RL, taming R1-Zero’s raw output into a readable, usable assistant. And the receipts are unusually public. The weights are open and MIT-licensed: the official repository states the code and model weights are licensed under MIT, with commercial use, modification, and derivative works allowed (github.com/deepseek-ai/DeepSeek-R1). And in September 2025 the work crossed a line that, as far as the record shows, no frontier model had crossed before: it appeared in Nature as a peer-reviewed article, vol. 645, pp. 633–638, the issue’s cover article (nature.com/articles/s41586-025-09422-z). Nature’s own news coverage worded the claim carefully, calling R1 “thought to be the first major LLM to undergo the peer-review process” (Nature news, September 2025). Method published, weights downloadable, claims dragged through peer review like any other science. Keep that posture in mind for the next section, because the next section is about the lab that did none of it.

13.5 The closed lane

DeepSeek was not first to the territory. OpenAI’s o1, announced September 12, 2024 (“Learning to reason with LLMs”), had already staked it: a model trained with large-scale reinforcement learning to think in chains before answering, with performance climbing on both the train-time and test-time dials. What the announcement did not contain was the method — no algorithm, no training data, no weights — and the raw chains themselves are hidden from users, who are handed a model-generated summary and asked to trust it. So the record reads, plainly: the closed lane announced the capability first and told you nothing about how; the open lane published it first — algorithm, weights, license, and eventually peer review. A press release is not a method, and “trust us” is not a proof. Both facts deserve to be held at once, and this book’s job is the mechanism, not the scorecard. The house keeps its running record of the open-versus-gated divide in Dispatch 006, “Behind the Gate”.

13.6 The honest boundary

RLVR has a boundary, and it is exactly where its name says: the rewards must be verifiable. A math answer can be checked by string- and value-comparison. Code can be checked by a test suite. A formal proof can be checked by a proof assistant. But poetry has no unit tests. “Was this essay wise,” “was this advice kind,” “was this diagnosis communicated well” — no rule grades these, and no rule ever will. Where the answer key runs out, training falls back on learned judges — Chapter 10’s machinery, with its expense and its flatterable cracks. The checkable frontier is expanding — more of math, more of code, formal verification — but it is a frontier, not the whole map. A house opinion, marked as such: much of what people most want from these machines lives outside it.
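As an illustration of what “string- and value-comparison” can mean in practice, the sketch below accepts an answer if it matches the key either as text or as an exact rational number. It is a toy under stated assumptions: answers are plain integers, decimals, or fractions written like `3/4`, and the function name `matches_key` is ours. Real checkers have to handle far more notation.

```python
from fractions import Fraction

def matches_key(answer: str, key: str) -> bool:
    """True if answer equals key as text, or as an exact rational value."""
    a, k = answer.strip(), key.strip()
    if a == k:                               # string comparison
        return True
    try:
        return Fraction(a) == Fraction(k)    # value comparison: "0.5" equals "1/2"
    except (ValueError, ZeroDivisionError):
        return False                         # not a number this rule can parse: no credit

print(matches_key("1/2", "0.5"))              # True
print(matches_key(" 42 ", "42.0"))            # True
print(matches_key("0.33", "1/3"))             # False: close is not equal
print(matches_key("a lovely sonnet", "1/3"))  # False: nothing here for the rule to check
```

The last call is the boundary in miniature: the rule can say that a sonnet is not one third, but it has no way to say whether the sonnet is any good.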

The second boundary is scale, and this book does not pretend otherwise. R1-Zero’s run applied large-scale RL to a frontier-scale base model; that is data-center work, not laptop work. What fits on your bench — genuinely, today — is everything this chapter actually showed you: the reward function (a comparison), the advantage (an average and a division), the loop’s shape (13.3’s sketch), run in miniature; and the open community has replicated the recipe at small scales. Every part of the machinery is legible to you now: sampling is Chapter 8, the reward is a rule you can read, the advantage is arithmetic you just ran, the update is Chapter 7’s step. The mystery was never the machinery. The mystery is what the machinery grows — and the same R1 paper made one more move with what it grew: it wrote the chains down and taught them to smaller models. The capability, it turns out, does not stay locked in the machine that earned it. It travels. That is Chapter 14.

13.7 The thing to actually understand

The reward is an answer key, not an opinion. RLVR (coined in the Tülu 3 paper) replaces the learned, gameable judge of RLHF with a mechanical check — accuracy and format, by rule. There is nothing inside a comparison to flatter.
The group is its own baseline. GRPO samples G chains per problem and scores each against the group’s mean and standard deviation — deleting PPO’s entire value network and replacing it with an average.
The signal lives at the frontier. If every draft in the group earns the same reward, all right or all wrong, all advantages are zero and nothing is learned. Training feeds on problems the model sometimes solves, the same condition Chapter 12 found for test-time compute.
The behavior emerged; nobody wrote it. R1-Zero’s reward graded final answers and output format, never the reasoning itself, and yet long chains, re-evaluation, and the paper’s own “aha moment” appeared anyway: selected for, never programmed. Grade the destination hard enough and the road builds itself. Chapter 7’s lesson, one level up.
Two lanes, one territory. o1 announced trained reasoning first, methods behind glass; R1 published it — algorithm, MIT-licensed weights, and a peer-reviewed Nature article. One lane says trust us; the other shows its work. The mechanism is public knowledge now either way.
The boundary is checkability. RLVR reaches exactly as far as answers can be verified by rule. Beyond that line, judgment — human or learned — comes back, cracks and all.

13.8 Exercises

Hand-run the advantage. A group of five chains scores rewards [1, 0, 0, 1, 0]. Compute the group mean, the (population) standard deviation, and all five advantages by hand; then check yourself with 13.3’s runnable block. Which chains get pulled toward, by how much, and why do the two correct chains get identical advantages even if one chain’s reasoning was nonsense that lucked into the answer?
Design a rule-based reward. Pick a task you know well and write, in plain language, its accuracy check and its format check. Then be honest about coverage: which parts of “doing this task well” does your rule provably capture, and which parts does it silently ignore?
Spot the reward hack. A well-meaning engineer sets the reward to: +1 if the output contains a final line beginning “Answer:” followed by any number, +1 if the chain is at least 500 tokens long. Neither term checks correctness against a key. Describe exactly what a strong optimizer will learn to produce under this reward, and why every term in a reward function must be adversarially proofread.
Zero signal, on purpose. Using 13.3’s runnable block, set all eight rewards to 1.1. What are the advantages, and what does the 1e-8 in the denominator quietly prevent? Explain why “no update when the group agrees” is correct behavior and not a bug.
Read the aha moment. Open the R1 paper (arXiv:2501.12948) and find the subsection “Aha Moment of DeepSeek-R1-Zero.” Read it alongside 13.2’s point that the reward never inspects the chain. Write three sentences on what “emergent” does and does not mean here.

What’s next

Ch 14 — Passing the Torch — Distillation


A 37th-Chamber original. Methods cited:
Lambert et al. (2024), “Tülu 3: Pushing Frontiers in Open Language Model Post-Training,” arXiv:2411.15124 (coins “Reinforcement Learning with Verifiable Rewards”; confirmed).
Shao et al. (2024), “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models,” arXiv:2402.03300 (GRPO: group sampling, group-relative advantage, no critic; confirmed).
DeepSeek-AI (2025), “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv:2501.12948 (R1-Zero pure RL on a base model, rule-based accuracy + format rewards, neural reward models avoided for reward hacking, “Aha Moment” subsection, R1’s cold-start SFT; confirmed), published in peer-reviewed form as Guo et al., Nature 645, 633–638 (2025), nature.com/articles/s41586-025-09422-z (confirmed).
Nature news (2025), “Secrets of DeepSeek AI model revealed in landmark paper” (“thought to be the first major LLM to undergo the peer-review process”; hedge preserved, confirmed).
DeepSeek-R1 weights MIT-licensed per the official repository, github.com/deepseek-ai/DeepSeek-R1 (confirmed).
OpenAI, “Learning to reason with LLMs” (September 12, 2024), openai.com/index/learning-to-reason-with-llms (o1 announcement, methods undisclosed, hidden raw chains; confirmed).
All prose and code written fresh.