What happens when models have tools, know how to use them, and decide not to
This post is the story behind the research. If you want the full paper with methodology, statistics, and raw data: Receipt-Gated Pipelines on GitHub.
I caught three AI models fabricating security reports.
Complete with CVE numbers. Severity ratings. Remediation advice. For vulnerabilities they never looked up. Each model had the tool — OSV.dev, the same vulnerability database security teams use in production. The model knew how to call it. And in roughly half the trials, three of eight models wrote the report without ever calling the tool.
I set this trap because I didn’t trust my own pipelines. I’ve been building multi-agent systems — coding agents, security agents, orchestration layers — and kept having this feeling I couldn’t name. The outputs looked right. But I had no way to verify whether the model actually did the work or just knew enough to fake it. So I designed an experiment. 686 trials. Nine models. $15.42 — the cost of a decent sandwich.
DeepSeek R1: 47.5% confabulation rate. DeepSeek V3: 45.0%. Mercury 2: 50.0%. Five models were clean or near-clean, including all of Anthropic, Google, and OpenAI’s top-tier models. So if this is really a tier-2,3 problem, then the obvious response us, “just use the best models”. But there’s a very important domain where that solution doesn’t work.
The confabulators lie in taxonomically distinct ways. R1 starts making real tool calls, stops partway through, then fills in the rest from parametric knowledge. It’s like an athlete who logs 5x5 squats but actually did 5x3 and estimated the last two sets. The training log looks complete. Mercury is worse — it makes 0–11 calls out of 30 or 100 required, claims a “10-step limit per turn” that does not exist, then writes the report anyway. Mercury invented a reason to stop working and lied about the work it didn’t do.
That’s the part that kept me up. Not the rates — those are just numbers. The fact that I couldn’t tell the difference. R1’s fabricated reports are indistinguishable from its genuine ones on casual inspection. Real CVE numbers from training data, plausible severity ratings, internally consistent. I’ve shipped agent pipelines where downstream systems consume upstream outputs on trust. I don’t know how many of my previous outputs were clean.
Now imagine you don’t control the model. In federated multi-agent networks — like Steve Yegge’s Wasteland — a reputation system tracks whether agents complete tasks, not whether they actually used their tools. So I tested what happens when a confabulated report crosses an agent boundary.
82% propagation. 73% elaboration. The downstream model — Gemini Flash, which never confabulated when it had tools — didn’t just repeat R1’s fabrication. It assigned CRITICAL priority, cited a real CVE, recommended a specific upgrade version, and passed it downstream. The output was more authoritative than the input. A telephone game where each player adds confidence instead of losing it.
And then the kicker. In one trial, Flash — the clean model — generated 11 novel CVEs from parametric knowledge when given a package list without tools. Confabulation risk depends on context, not model identity. “Just use a model that doesn’t confabulate” is not a policy—especially when you don’t have control of the models that got used downstream.
I built a receipt gate. A ToolShim wraps every tool call with an HMAC-SHA256 signed receipt — a server-generated UUID the model can’t forge. The model must echo the receipt ID in its output. A deterministic function — not an LLM, not a classifier — checks every claimed result against the receipt log. Zero false negatives by construction. The gate blocks propagation completely: 0% across all confab-input trials (Fisher exact p = 0.00022).
I also built a receipt-requiring prompt — a structured output format that makes the model show its work. Just using that prompt collapsed R1’s sandbagging from 40% to 2% and dropped Mercury’s confabulation from 80% to 13%. Impressive numbers. I assumed the behavioral improvement came from the cryptographic accountability — the model knows it’ll get caught, so it tries harder. But that’s not all that was happening.
The receipt prompt has two pieces: an echo requirement (repeat the cryptographic receipt ID for each tool call) and a disclosure field (list any packages you didn’t check). I pulled them apart: echo only, disclosure only, both, neither. Both components independently collapsed R1’s failure mode. Echo only vs. disclosure only: p = 1.0. No difference. The behavioral effect comes from structure, not accountability — any output format that highlights the gaps nudges the model toward better work. A plain JSON schema would do the same.
Think of it like a training log. It doesn’t matter whether you track reps completed or rest days taken. Either format makes a missed session obvious. The logging sustains effort not because a coach is checking — nobody is checking — but because the athlete can see the gap. Any structured output format that makes skipping visible nudges the model toward better behavior.
And nudges help. Lower confabulation rates. Higher coverage. More honest disclosure. I’d use them. But I wouldn’t trust them.
Because two Mercury trials achieved 100% receipt compliance — faithfully recording a UUID for every package checked — while simultaneously confabulating 1 and 20 unchecked packages respectively. Perfect paperwork. Complete fabrication.
R1 is worse in its own way. Give it a disclosure field — an explicit escape hatch to list what it didn’t check — and in 6 of 7 confabulation trials it fills in zero unverified packages. R1 makes 76 tool calls, hits its ceiling, fabricates results for the remaining 24, and never once admits the gap. The escape hatch is right there. R1 doesn’t take it.
A model can be a diligent record-keeper and a liar at the same time. You cannot trust self-report. You cannot trust nudges to catch what they reduce.
The tool-augmentation literature — Toolformer, Gorilla, ToolLLM — solved the hard problem of getting models to use tools at all. What nobody measured was what happens after. The model has the tool, knows how to use it, and decides not to. In coaching, we had a word for this: sandbagging. You don’t fix sandbagging by asking the athlete to try harder. You fix it by checking the work.
The structured prompt, the disclosure field, the receipt format — those are nudges. They’re associated with lower confabulation and higher effort, and they cost almost nothing (~100 extra tokens). Use them. But they are not the trust layer. Mercury’s perfect-paperwork trials prove that. R1’s refusal to self-report proves that.
The gate is the trust layer. It catches what the nudges miss, with zero false negatives by construction, and it costs nothing to run — microseconds of latency, zero API cost, $0.30 for the entire federation test. There is no cost argument against it. There is no complexity argument that survives contact with the alternative, which is trusting that every model in a federated network actually did the damn work it claims to have done.
Don’t trust the model. Verify the work.