The Roots · The 37th Chamber

AI Explainability

The library’s long explainer is called The Opaque Box for a reason. This page is the reason’s paper trail: what the research literature actually says about seeing into machine-learning systems — how far sight goes, where it stops, and what an explanation does and does not buy you.

Two words get blurred together whenever this subject comes up, and the literature is careful to keep them apart. A model is interpretable when it is built, from the start, in a form a person can follow — small enough, structured enough, that the path from input to output can be read directly. A model is merely explainable when it is too large or too tangled to read, so a second account gets constructed after the fact — a story about the box, produced from outside it. The overview by Marcinkevičs and Vogt in the roots below is a clean, methods-first map of that terrain, and it is open access all the way through.
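The two words can be set side by side in a few lines. What follows is a minimal sketch, not a method taken from the overview: both models, their thresholds, and the feature names `income` and `debt` are invented for illustration, and the perturbation probe is only one simple kind of after-the-fact account among many. The first function can be read directly. The second stands in for a box too tangled to read, so the only account available is built by nudging its inputs from outside and watching what changes.

```python
import random


def interpretable_model(income: float, debt: float) -> int:
    """Readable by construction: approve when income exceeds twice the debt by 10.

    Hypothetical rule, invented for this sketch.
    """
    return 1 if income - 2.0 * debt > 10.0 else 0


def opaque_model(income: float, debt: float) -> int:
    """Stand-in for a black box; pretend this body cannot be read.

    Hypothetical formula, invented for this sketch.
    """
    hidden = 0.7 * income - 1.3 * debt + 0.02 * income * debt
    return 1 if hidden > 12.0 else 0


def perturbation_importance(model, rows, feature_index, rng, noise=5.0):
    """Post-hoc account: fraction of rows whose output flips when one input is jittered."""
    flips = 0
    for row in rows:
        jittered = list(row)
        jittered[feature_index] += rng.uniform(-noise, noise)
        if model(*jittered) != model(*row):
            flips += 1
    return flips / len(rows)


rng = random.Random(0)  # fixed seed so the probe is repeatable
rows = [(rng.uniform(0, 60), rng.uniform(0, 30)) for _ in range(500)]

# The "explanation" is a summary produced from outside the box, not its wiring.
explanation = {
    "income": perturbation_importance(opaque_model, rows, 0, rng),
    "debt": perturbation_importance(opaque_model, rows, 1, rng),
}
print(explanation)
```

The numbers that come out describe how the box behaved under probing on these particular rows. They say nothing certain about why, which is the gap between the two words.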

The distinction matters because the systems this site writes about — the large language models, transformers carrying hundreds of billions of weights and more — live almost entirely on the second side of it. Chapter 0 of The Opaque Box states the house position plainly: nobody, including the people who built these systems, can yet point inside one and say this is where it decided that. The walls are real. What the field can produce is explanations — partial, after-the-fact, often useful, never the wiring itself.

What an explanation buys — and what it doesn’t

It is tempting to treat “explainability” as the cure for opacity: add explanations until trust is warranted. The evidence is less convenient, and this site already carries it. Field Note 004 cites the open study by de Brito Duarte, Correia & Arriaga on exactly this question: whether explainable-AI techniques actually produce warranted trust, and how explanation can instead breed overreliance. An explanation is an artifact. It can be accurate, partial, or flattering, and a fluent one can raise confidence without raising correctness. That is why this house keeps repeating its older, harder rule: trust is earned behaviorally, through track record, conduct over time, and verification of outputs. A story about the inside, however fluent, is not sufficient on its own to supply it. The temple’s forensic-audit work is that rule, practiced.
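The behavioral rule can be kept as a ledger. This is a rough sketch under stated assumptions, not the procedure used in the study or in the forensic-audit work: the `VerifiedOutput` record, the `track_record` helper, and every number in `history` are hypothetical, made up to show the shape of the bookkeeping. It compares what a system claimed about its own confidence with what independent verification later found.

```python
from dataclasses import dataclass


@dataclass
class VerifiedOutput:
    stated_confidence: float  # what the system or its explanation claimed, 0..1
    was_correct: bool         # what independent verification found


def track_record(history):
    """Return (verified accuracy, mean stated confidence, overconfidence gap)."""
    if not history:
        raise ValueError("no verified outputs yet; no track record to report")
    accuracy = sum(h.was_correct for h in history) / len(history)
    mean_conf = sum(h.stated_confidence for h in history) / len(history)
    # A positive gap means confidence ran ahead of correctness.
    return accuracy, mean_conf, mean_conf - accuracy


# Invented entries, for illustration only.
history = [
    VerifiedOutput(0.9, True),
    VerifiedOutput(0.9, False),
    VerifiedOutput(0.8, True),
    VerifiedOutput(1.0, False),
]

accuracy, mean_conf, gap = track_record(history)
print(f"accuracy={accuracy:.2f} confidence={mean_conf:.2f} gap={gap:+.2f}")
```

A fluent explanation can move the confidence column. Only verified conduct moves the accuracy column, and the gap between them is what the rule asks a reader to watch.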

The box, asked about your box

There is a stranger question in the literature, and it belongs on this page because it closes a loop with the other-minds room: not can we see into the model, but does the model track what is in us? Trott, Jones, Chang, Michaelov and Bergen put it as bluntly as a journal allows — their open-access paper in Cognitive Science is titled “Do Large Language Models Know What Humans Know?” — and they took the question to the same instruments developmental psychologists use on children. Read it for the result; the point this page takes from it is the symmetry. Two kinds of sealed box, each running a model of the other’s interior, neither with access. That is not a new predicament the machines invented. It is the oldest one there is, now with a new participant.

Where the question lands next

Explainability stopped being a laboratory question the moment these systems started making decisions about people. The philosopher Andrés Páez’s chapter “Explainability of Algorithms,” in Wiley’s A Companion to Digital Ethics (2025), sits in the roots as the pointer to that frontier — where the demand to explain a model stops being curiosity and starts being something owed to the person on the receiving end. That is the right closing note for this room: the box is opaque, the explanations are partial, and the obligation is real anyway. The work is learning to live honestly inside all three facts at once — teach everything that can be seen, say plainly where sight ends, and let conduct carry the rest.

Take us to the root →

The Opaque Box, Chapter 0: what the walls are made of (free, the library)
Marcinkevičs & Vogt, “Interpretable and explainable machine learning: A methods-centric overview with concrete examples” (WIREs Data Mining and Knowledge Discovery, 2023; open): the map of the terrain
Trott, Jones, Chang, Michaelov & Bergen, “Do Large Language Models Know What Humans Know?” (Cognitive Science, 2023; open): the other-minds question, pointed at the machine
de Brito Duarte, Correia & Arriaga, “AI Trust: Can Explainable AI Enhance Warranted Trust?” (Human Behavior and Emerging Technologies, 2023; open): what explanation does to trust, measured
Páez, “Explainability of Algorithms,” in A Companion to Digital Ethics (Wiley, 2025): where the question becomes an ethical obligation, not merely a technical one
Vaswani et al., “Attention Is All You Need” (2017; free): the architecture all of this is about

All three journal doors here are open all the way through (Marcinkevičs & Vogt; Trott et al.; de Brito Duarte et al.); the Páez chapter sits behind a book door; the Vaswani paper is free on arXiv. We point; we don’t reproduce. The free spine of this whole subject is in the library, so start there.

Filed from the 37th Chamber · The Woodlands, TX