The Factuality Ladder

They will always hallucinate. But you can get them to lie in more predictable ways.

In previous posts, I’ve described LLMs as “helpful liars.” This is a useful mental model, but it doesn’t really help answer the question, “how do you get them to lie less?” After using LLMs 15+ hours a day (plus a bunch of my own independent research on exactly this), I’ve developed a new mental model that might be a bit more useful.

Tl;dr, you cannot make them stop lying, but you can make them lie in more predictable ways. You can even catch the lies before they matter. But the lying itself is baked into the architecture.

So here’s what I learned when I stopped trying to fix the model and started treating hallucination as an engineering problem.

The Ladder

Each rung costs more tokens and adds more infrastructure. Start at the bottom. Move up only when the failure mode is unacceptable for your use case. All five rungs work with API-accessed models like Claude — no fine-tuning, no open weights, no training infrastructure required. Most applications never need to go past rung two.

Rung 1: Shape the Distribution (~1x Tokens)

You can’t stop the model from confabulating, but you can make it confabulate in structured, recognizable patterns instead of fluent, invisible ones (and it’s the silent failures that do the most damage).

Give it permission to say “I don’t know.” Models hallucinate more under completion pressure — when every field must be filled, every question answered. Explicit uncertainty permission cut GPT-4o hallucination from 53% to 23% in clinical Q&A (npj Digital Medicine, 2025). The model still doesn’t know what it doesn’t know. But you’ve moved the decision boundary so marginal-confidence outputs land on “uncertain” instead of “confident fabrication.” I will often even throw a confidence threshold into the prompt like, “say ‘I don’t know’ if you’re less than 85% confident.”
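As a sketch, the whole intervention can live in the system prompt. The wording below is my own phrasing, not a benchmarked prompt; only the threshold idea comes from the paragraph above:

```python
# Build a system prompt that grants explicit uncertainty permission and
# sets a confidence threshold. The exact wording is illustrative.

def factual_system_prompt(confidence_threshold: int = 85) -> str:
    """Make 'I don't know' a first-class, acceptable answer."""
    return (
        "Answer only from information you are confident about. "
        f"If you are less than {confidence_threshold}% confident in a claim, "
        "say 'I don't know' instead of guessing. An honest 'I don't know' "
        "always beats a plausible fabrication."
    )
```

Pass this as the system message on any factual task and tune the threshold per task.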

Require structured output. JSON schemas. Markdown tables with citation fields. Slot-filling formats. This is the cheapest structural intervention and one of the most effective. In my receipt-gated pipeline research, per-package JSON slots with required call_id fields collapsed DeepSeek R1’s bimodal confabulation distribution from 40% to 2% — format alone, not semantic content, sustained the effect. Structured output makes gaps visible. An empty field is parseable. A fluent paragraph hiding a fabrication is not.
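A minimal sketch of why this works. The field names are made up for illustration (the `call_id` requirement mirrors the idea of the slots described above, not the actual RGP schema):

```python
import json

# Toy validator for slot-filling output: every package entry must carry a
# call_id proving a lookup happened. An empty or missing field is a
# machine-detectable gap, unlike a fabrication buried in fluent prose.

REQUIRED_FIELDS = {"package", "version", "call_id"}

def find_gaps(raw_json: str) -> list[str]:
    """Return human-readable gaps instead of trusting fluent prose."""
    gaps = []
    for i, entry in enumerate(json.loads(raw_json)):
        for field in REQUIRED_FIELDS:
            if not entry.get(field):  # missing and empty both count
                gaps.append(f"entry {i}: missing {field}")
    return gaps

output = ('[{"package": "requests", "version": "2.31.0", "call_id": "c-17"},'
          ' {"package": "leftpad", "version": "1.0.0", "call_id": ""}]')
print(find_gaps(output))  # the empty call_id is parseable, so it surfaces
```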

Provide escape hatches. When the format demands complete answers, the format induces hallucination. My RGP-1 experiments added an unverified disclosure field — a syntactically valid way for the model to say “I didn’t check this.” It used the escape hatch instead of inventing results. The escape hatch outperformed coverage pressure consistently.

Lower temperature for factual tasks. 0.0–0.3. This makes the model stick closer to its most likely answers instead of exploring creative alternatives — but “most likely” is not “most factual.” Temperature alone “barely moved the needle” compared to prompt design in the npj study. Think of it as tightening the grouping on a target that may not be centered.

Chain-of-thought. Ask the model to show its reasoning. Just say, “how’d you reach that conclusion” or “how’d you make that table.” This doesn’t make it reason better, but it makes the reasoning auditable. So auditable that sometimes the agent will catch its own mistakes when you ask it to explain them. And when the model fabricates a fact mid-chain, you can see where it breaks.

What does this buy you? In my experience, roughly 30–50% hallucination reduction. More importantly: hallucinations become structurally detectable — empty fields, missing citations, hedged language — instead of buried in fluent prose. You haven’t reduced the model’s tendency to lie. You’ve made the lies show up in places you’re already looking.

What it doesn’t buy you: the model still doesn’t know what it doesn’t know. Confident, well-structured fabrication is still frequent. You’ve shaped the distribution, not moved it.

Rung 2: Give It Fewer Ways to Lie (~1.3–2x Tokens)

Replace the model’s parametric memory — unreliable, frozen at training cutoff, compressed to hell — with provided text you can actually verify.

RAG. Retrieve relevant chunks from a knowledge base and inject them into the prompt. Properly implemented RAG reduces hallucination by up to 71% (Frontiers, 2025). But “properly implemented” is doing all the heavy lifting in that sentence. Bad retrieval — wrong chunks, too many chunks, stale chunks — can increase hallucination by giving the model plausible-looking but irrelevant material from which to hallucinate.
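The shape of the pipeline, with a deliberately crude keyword-overlap ranker standing in for real embedding retrieval (everything here is a toy sketch, not a production RAG stack):

```python
# Rank chunks by keyword overlap with the query and inject only the
# top-k into the prompt. Real systems score with embeddings; the ranker
# here is a stand-in to show the shape, not the method.

def rank_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

def grounded_prompt(query: str, chunks: list[str]) -> str:
    context = "\n---\n".join(rank_chunks(query, chunks))
    return ("Answer ONLY from the context below. If the answer is not "
            f"in the context, say so.\n\nContext:\n{context}\n\n"
            f"Question: {query}")
```

The “say so” escape hatch from Rung 1 carries straight over: grounding without an exit still pressures the model to fill gaps.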

Context isolation. Separate retrieved context from conversation history. Separate source documents from instructions. Context poisoning — where a hallucination from a prior turn enters the context and the model treats it as ground truth — is a major failure mode in multi-turn systems. Unchecked, these hallucinations have a tendency to grow like weeds. Isolation prevents cross-contamination.

Canonical document injection. For critical facts, don’t rely on retrieval similarity. Inject the authoritative source directly, right when you need it checked. Six relevant chunks at run time outperform fifty noisy ones. This can be as simple as telling the agent, “when you’re done, check your work against these canonical documents.” It even works when you have only the structure you want, not the content: in an unpublished study I was working on, scoping the schema for NL2SQL reduced hallucination of non-existent columns by giving the model only the valid schema, not the full database.
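A toy version of that NL2SQL schema scoping. The table, columns, and identifier check are all made up for illustration; a real gate would parse the SQL properly instead of using a crude regex:

```python
import re

# The model only ever sees the valid schema, and generated SQL is
# checked against that same schema before execution.

SCHEMA = {"orders": {"id", "customer_id", "total", "created_at"}}

def schema_prompt(question: str) -> str:
    cols = ", ".join(sorted(SCHEMA["orders"]))
    return (f"Table orders has ONLY these columns: {cols}. "
            f"Write SQL using only these columns.\n\nQuestion: {question}")

def uses_unknown_columns(sql: str) -> set[str]:
    """Return identifiers in the SQL that aren't valid (crude check)."""
    known = SCHEMA["orders"] | {"orders", "select", "from", "where", "and",
                                "or", "sum", "count", "avg", "group", "by"}
    tokens = set(re.findall(r"[a-z_]+", sql.lower()))
    return tokens - known
```

The same schema does double duty: it shrinks the prompt (Rung 2) and gives you a deterministic check on the output (a preview of Rung 5).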

Respect the context window (no matter how big). Paulsen (2025) showed most models suffer severe accuracy degradation well before their advertised context limits — some by 99%. Context is best when it’s fresh and compact: less context, better curated, beats more context poorly curated. This is counterintuitive but well-supported. With more irrelevant material, the model draws more spurious connections; with a small, curated context, it hallucinates less because it has less need to — the answer is right there.

But remember that grounding is not verification. You’ve given it better inputs. You haven’t checked its outputs.


Rungs 1 and 2 are where most people should stop. They’re cheap, they work, and they don’t require any infrastructure. What follows is for people building pipelines, products, or multi-agent systems on top of LLMs.


Rung 3: Narrow the Cognitive Surface (~2–3x Tokens)

A generalist asked to retrieve, reason, and generate will cut corners to do all three. A specialist asked to do one thing has fewer opportunities to fabricate. And yes, this is really just fancier context management.

Separate planning from execution. When a model plans and executes in the same pass, it commits to claims before it’s thought through the full shape of the problem — claims it will then stick to. So ask it to outline its approach first: what does it need to look up, what are the unknowns, what’s the sequence? Planning doesn’t require confident facts — it requires honest assessment of what’s missing. The outline surfaces all those “wait, I don’t actually know this” moments that get plowed over when the model is already mid-sentence generating a final answer.
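The two-pass shape is simple enough to sketch. `model_call` is a stub for whatever API wrapper you use, and the prompt wording is my own assumption:

```python
# Pass 1 asks for a plan with explicit unknowns; pass 2 executes against
# that plan. The plan stage is forbidden from stating facts, so gaps
# surface before the model is mid-sentence on a final answer.

PLAN_PROMPT = ("Before answering, outline: (1) what you need to look up, "
               "(2) what you do NOT know yet, (3) the sequence of steps. "
               "Do not state any facts in the plan.\n\nTask: {task}")

EXECUTE_PROMPT = ("Follow this plan. For any step marked unknown, write "
                  "'unverified' instead of filling the gap.\n\n"
                  "Plan:\n{plan}\n\nTask: {task}")

def plan_then_execute(task: str, model_call) -> str:
    """model_call is your LLM API wrapper, injected so this stays testable."""
    plan = model_call(PLAN_PROMPT.format(task=task))
    return model_call(EXECUTE_PROMPT.format(plan=plan, task=task))
```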

Separate structure from content. Planning is about what to do. This is about what the output looks like. Get the model to build the skeleton — headings, categories, field names, table columns — before it fills in any facts. When structure and content get generated together, the model bends the structure to fit whatever it’s already committed to (including fabrications). When the skeleton exists first, empty slots become visible. The model is more likely to flag an empty cell than to invent a plausible value when the gap is staring at it. Think of it as Rung 1’s “structured output” turned inside out — instead of you providing the template, the model builds its own, then has to fill it honestly.

Task decomposition with routing. A retriever fetches documents. A synthesizer combines them. A validator checks claims. Each agent gets a narrow system prompt and only the context relevant to its role. You don’t have to design this yourself, just ask Claude to do it for you. You can even copy-paste this paragraph and say, “I want to do something like this for my new project.”
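If you do want to see the shape before handing it off, here is a minimal sketch. The role prompts and field names are illustrative assumptions, and the actual model dispatch is stubbed out:

```python
# Role-scoped routing: each specialist gets a narrow system prompt and
# only the context slice its role needs. Nothing else leaks in.

ROLES = {
    "retriever":   "You fetch documents. You never summarize or conclude.",
    "synthesizer": "You combine the provided documents. You never add facts.",
    "validator":   "You check each claim against its cited source. "
                   "You never rewrite.",
}

NEEDED = {
    "retriever":   ["query"],
    "synthesizer": ["documents"],
    "validator":   ["claims", "documents"],
}

def build_call(role: str, context: dict) -> dict:
    """Assemble the scoped request for one specialist."""
    return {"system": ROLES[role],
            "context": {k: context[k] for k in NEEDED[role]}}
```

The dict comprehension is the whole point: the synthesizer physically cannot see the raw query, and the validator physically cannot see anything but claims and sources.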

Scope-limited context per agent. This is where rung 2 and rung 3 compound. Each specialist sees less irrelevant context than a generalist would. My AntHill research (unpublished) demonstrated why this matters: a single agent processing 15+ real-but-irrelevant vulnerabilities across 12+ services missed the 4 chain-relevant findings buried in the noise. This was only worth the tokens on extremely large repos, but once again it demonstrates the value of keeping context fresh and tiny.

Smaller models for narrow tasks. A specialist doing one well-defined thing — retrieval ranking, format validation, claim extraction — can often use a cheaper, faster model. A multi-agent RAG framework showed 62% token reduction vs. single-LLM baseline through better orchestration (Preprints.org, 2025). The 2–3x agent multiplier partially offsets those savings. Each agent has a narrower failure surface. Hallucinations are localized — a retriever can return irrelevant documents but can’t fabricate reasoning; a synthesizer can misinterpret sources but can’t invent them. You’ve made the type of hallucination predictable per component.

We’re getting into “diminishing marginal return land” here, but if the surface area is large enough and token costs are driving your decisions, these multi-model approaches to context management can be cost effective.

Rung 4: Independent Triangulation (~3–5x Tokens)

Ask multiple agents the same question independently. If they converge, it’s more likely correct. If they diverge, you’ve found uncertainty no single agent would have surfaced.

The intuition is “make them argue until the truth wins.” But I tested this. It doesn’t work. In my SSA research — 16 debate configurations, two benchmarks — every adversarial debate condition scored below a single-pass baseline. Making agents argue made them worse. Models defer to confident critique, overweight attacks, and lose track of what they actually knew.

What did work was complementary coverage — different specialists contributing different perspectives to a synthesis, not arguing with each other. But it cost 8x for the same accuracy as a single pass. So this is only worth it when you have genuinely independent information sources to triangulate across and the stakes for being wrong are very high. Same-model debate on the same context is expensive theater.

Rung 5: Verification Gates

Every rung above this tries to prevent hallucination. This rung detects it after the fact and blocks it from propagating. There are two fundamentally different kinds of gate here, and the distinction matters.

Cross-Model Auditing (The Simplest Gate)

This one isn’t fancy, but it’s surprisingly effective. Just paste Claude’s output into Gemini (or any other model) and ask, “what did this get wrong?” Or build an API call to another model into your pipeline. I do this routinely and Gemini catches hallucinations that Claude misses — even Claude 4.6 with a brand new context window and the same source material. Different training data, different compression artifacts, different blind spots. The same fabrication that looks perfectly plausible to one model looks obviously wrong to another.

This is the manual version of the complementary coverage from Rung 4, except it’s cheaper because you’re the router. You decide what to audit, you pick the auditor, and you don’t need an orchestration layer. It’s probabilistic — the auditor has its own blind spots — but the overlap between two models’ blind spots is smaller than either one alone.
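The pipeline version is a few lines. The audit prompt wording is my own assumption, and the second-model call is injected as a plain function so you can swap in whichever SDK you actually use:

```python
# Cross-model audit: hand one model's draft, plus the source material,
# to a different model and ask it to find what's wrong.

def audit_prompt(draft: str, sources: str) -> str:
    return ("You are auditing another model's answer. Using ONLY the "
            "sources below, list every claim in the answer that is wrong, "
            "unsupported, or missing a citation. If everything checks out, "
            "say 'no issues'.\n\nSources:\n" + sources +
            "\n\nAnswer under audit:\n" + draft)

def cross_model_audit(draft: str, sources: str, auditor_call) -> str:
    """auditor_call wraps your second-model API; injected for testing."""
    return auditor_call(audit_prompt(draft, sources))
```

The one design rule that matters: the auditor gets the sources, never the first model’s reasoning, so it cannot be talked into the same fabrication.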

And here’s the part that rhymes with a finding from later in this section: even just telling a model that you’re going to check its work with another model changes its behavior. I’ve seen Claude hedge more, cite more carefully, and flag its own uncertainty when it knows Gemini is next in line. The model adjusts its behavior when it knows verification is coming, even before the verification happens.

Deterministic Gates (~0x additional tokens)

These are cryptographic. They have zero false negatives by construction. The model cannot game them. But they require building infrastructure and only work on tool calls.

First, what’s a tool call? When an LLM needs information it doesn’t have — a database query, a web search, a vulnerability scan — it can call an external tool to get it. The problem is that the model can also just skip the call and write the answer from memory. Or make it up. It knows what a vulnerability report looks like. It knows real CVE numbers from its training data. So it writes a plausible-looking report without ever querying the database. You asked it to look something up; it decided it already knew the answer. And you can’t tell the difference by reading the output. In my RGP research, 3 of 8 models did exactly this — DeepSeek R1 fabricated tool results 47.5% of the time, Mercury 50%, DeepSeek V3 45%. Worse, those fabricated results don’t stay local — when one agent passes a fake finding to the next, the downstream agent elaborates on it 73–82% of the time.

Receipt-gated pipelines. To solve this, I created a “receipt-gated pipeline.” Every tool call generates a cryptographic receipt — a signed token the model can’t forge. The model must present the receipt to prove a tool was actually called. No receipt, no credit. Receipt gates reduced the propagation of hallucinated tool calls to 0% (p = 0.00022). Not by making the model more honest — by making it impossible to claim work it didn’t do.
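A toy version of the gate, using standard HMAC signing. This is a sketch of the principle, not the RGP implementation; the key detail is that the signing key lives in the runtime, never in any prompt, so the model has no way to forge a receipt:

```python
import hashlib
import hmac
import json

SECRET = b"runtime-only-key"  # held by the orchestrator, never in context

def issue_receipt(call_id: str, tool: str, result: str) -> str:
    """Runtime signs a receipt when (and only when) a tool actually ran."""
    payload = json.dumps(
        {"call_id": call_id, "tool": tool,
         "result_hash": hashlib.sha256(result.encode()).hexdigest()},
        sort_keys=True)
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def verify_receipt(receipt: str, result: str) -> bool:
    """Downstream stages refuse any result whose receipt doesn't check out."""
    payload, _, sig = receipt.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # forged or tampered signature
    claimed = json.loads(payload)["result_hash"]
    return claimed == hashlib.sha256(result.encode()).hexdigest()
```

Hashing the result into the receipt closes the other loophole: the model can’t take a receipt from a real call and attach it to a fabricated result.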

Receipt-gated inter-agent messages. Same principle at the message boundary: sign every inter-agent message. Without signing, impersonation attacks succeed at 100% across all model tiers — agents trust peer messages more than user prompts. With signing: 0% compliance on unsigned injections. 40/40 blocked.

One interesting finding: even just telling the model its tool calls will be verified changes its behavior. Mercury’s fabrication dropped from 80% to 13% (p < 0.0000004) and R1’s sandbagging collapsed from 40% to 2% (p = 0.0000024) — with the receipt prompt alone, before the gate ever fires. It’s important to remember the nudge isn’t the gate, but it is surprisingly powerful.

The catch: deterministic gates only work when there’s a tool call or message boundary to gate. If the model looked something up, you can prove it. If the model generated free text — a summary, an analysis, a recommendation — there’s no tool boundary to attach a receipt to. For that, you need another kind of gate.

Probabilistic Gates (~2–4x tokens)

These use LLMs or search to verify outputs. They’re good — often very good — but they have false negatives. They catch most fabrication, not all of it (and cost a lot of tokens to do so).

Atomic claim decomposition (SAFE). Google DeepMind’s SAFE (2024) breaks long-form responses into individual claims, then verifies each via multi-step reasoning with search. Agrees with human annotators 72% of the time; on disagreements, SAFE wins 76%. Over 20x cheaper than human annotation. But it relies on search, which can return wrong results.

Chain-of-Verification (CoVe). Four stages: draft response, plan verification questions, answer verification questions independently (to avoid bias from the draft), generate final verified response. Meta, ACL Findings 2024. Outperforms chain-of-thought across list-based questions, closed-book QA, and long-form generation. But the verifier is an LLM too, with its own biases.

Best-of-N reranking. Generate N candidate responses, score each with a factuality metric, select the best. 35% hallucination reduction (Databricks, 2025). Scales linearly with N. But the scoring function determines the ceiling. Oh and “N” is a lot of tokens.
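The mechanism itself fits in four lines. Both `generate` and `score` are injected stubs here; in practice `generate` is a sampled model call and `score` is a factuality metric, which, as the paragraph above says, is where the ceiling lives:

```python
# Best-of-N reranking: sample n candidates, keep the highest-scoring one.

def best_of_n(generate, score, n: int = 4) -> str:
    """Cost scales linearly with n; quality is capped by score()."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```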

The Difference to Keep in Mind

Deterministic gates make dishonesty terminal — fabrication hits a wall and stops. Probabilistic gates make dishonesty unlikely — fabrication hits a filter that catches most of it. Cross-model auditing is the simplest probabilistic gate you can use today. All three block propagation, which is the actual damage vector. But they’re not interchangeable.

If the claim came from a tool call, use a deterministic gate. Zero token cost, zero false negatives, done. If the claim is free-text generation, use a probabilistic gate — cross-model auditing if you’re doing it manually, SAFE or CoVe if you’re building a pipeline — and understand that you’re trading certainty for coverage.

The Part That Actually Matters

Reading across my own research and two years of literature, one pattern is clear: the most effective interventions operate at boundaries, not on cognition.

Format boundary — structured output makes gaps visible. Context boundary — schema scoping removes opportunities to fabricate. Tool boundary — receipt signing blocks fabrication of tool results. Message boundary — signing blocks fabrication propagation. Agent boundary — scoped context prevents cross-contamination. Output boundary — SAFE, CoVe, cross-model auditing, Study Gate catch fabrication before delivery.

Trying to make the model “think more carefully” — debate, CoT, self-reflection — has diminishing and sometimes negative returns. Trying to make the model “know more” — bigger context, more RAG — helps but saturates quickly and introduces its own failure modes.

The interventions that work are the ones that treat hallucination as an engineering problem at system interfaces, not a cognitive problem inside the model. You don’t need the model to stop lying (it never will). You need the lies to be obvious enough to catch (by you or another agent) or to hit a wall before they matter.

Start Here

Does a wrong answer cause real harm?

No → Rung 1. Shape the output. Check it yourself. Ship it.

Sometimes → Rung 1 + Rung 2. Ground it in sources. Surface uncertainty.

Yes, and the context is too large for one agent → Add Rung 3. Decompose into specialists with scoped context. This is worth it when the input surface is big enough that a single agent drowns in irrelevant material — large repos, multi-service architectures, long regulatory documents.

Yes, and you need consensus across independent sources → Add Rung 4. But only if you have genuinely different information sources or model architectures to triangulate across. Same-model debate on the same context is expensive theater. Complementary (non-adversarial) perspectives are the only configuration I’ve seen work.

Yes, for specific claims → Add Rung 5 for those claims. Let the rest be best-effort.

Yes, systematically → Full stack: Rungs 1–3 (structured specialists with curated context) + Rung 5 (gates on every output).

The cheapest high-impact combination is almost always Rung 1 (escape hatches + structured output) plus Rung 5 (receipt gates at tool boundaries). ~1x token cost. Blocks the most damaging failure mode mechanically.

And remember: you can never make them stop lying. You can build better traps to catch the lies, and you can learn to shape how you ask so they’re more truthful. But you should never fully trust them.