How I Let Robots Build My Encryption App

And how it got me excited to be a Product Manager again

I’ve been a Product Manager for 20 years. Design Within Reach, MyFitnessPal, Tonal, Habitry, a cybersecurity startup that Mimecast acquired, a VR fitness app that Meta acquired. On paper, wildly successful. In practice, I’ve been pretty bored for the last six years.

Not unhappy. Not ungrateful. Bored. The kind of bored where you’re good at your job and you know it and that’s the problem. I miss the Habitry days — the era when my co-founders and I were building an iOS app for behavior change coaches and every week felt like a math problem I hadn’t seen before. I studied philosophy of mind at UChicago back when probabilistic AI was still a thought experiment. I got a master’s in sport psychology because I wanted to understand why people do things, not just what they do. And somewhere in the last decade, product management became less about figuring things out and more about navigating process and helping stakeholders process their feelings.

In January, Moxie Marlinspike released confer.to — an end-to-end encrypted AI chatbot, basically Signal for LLMs — and something clicked. If the guy who wrote the crypto under Signal and WhatsApp thought E2E encrypted chatbots were a thing worth building, maybe I should try it out. So I used confer to vibe code an update to my personal website just to see what the fuss was about. It only took a few hours and I got a feeling that things were different now. Not “slightly better autocomplete” different. Different different.

The leap from there to what happened next is hard to overstate. Vibe coding a static website is one thing. Orchestrating multiple AI agents to build a privacy-first web app with real cryptography is something else entirely. But that’s what I wanted to try.

So on a Saturday afternoon in February, I signed up for the $100 tier of Claude Code and started building something on my own for the first time in 13 years.

I call it weaveto.do. An encrypted task manager where the server literally cannot read your data. End-to-end encrypted rooms. Zero accounts. Burn-after-use. The relay server is so dumb it couldn’t leak your secrets if you paid it to. Olm/Megolm key exchange, PBKDF2 PIN derivation with 600,000 iterations, Ed25519 signature verification, a 6-layer cleanup orchestrator that wipes session storage, IndexedDB, and service worker caches when a room burns.

OK that might have been gibberish to you, but this is the crazy part: I built all of it with Claude agent teams. Eight milestones. 63 commits. 12,000 lines of TypeScript and Svelte. 372 unit tests, 119 end-to-end tests. Zero dependencies added after M2. (Three production deps existed from the start — vodozemac for crypto, simplewebauthn/browser for WebAuthn, ws for the relay. Every feature after that used browser built-ins.)

I can’t remember when I’ve been this excited to just play and make things because I can. Learning because there’s a direct connection between my learning and making value in the world — the thing that made me fall in love with building products in the first place.

What it’s actually like

The marketing copy makes “agent teams” sound like you say “build me an app” and go get coffee. The reality is setting up the conditions, checks, balances, and idea flows for a bunch of well-meaning teammates who have a nasty habit of lying about whether things are finished and whether they’re actually up to the quality bar you set for them.

So here was my team of well-meaning liars — each one a markdown file that gets loaded into an agent to give it a specific role and quality bar:

  • Product Manager — writes user stories with Gherkin acceptance criteria, defines milestone goals, owns the “what” and “why”.
  • UX Designer — turns stories into user flows, grades each milestone 1–10, enforces WCAG 2.1 AA accessibility, pushes back when a story optimizes for the system instead of the person.
  • Architect — defines system layers, security invariants, agent contracts, and the threat model. Picks the cheapest appropriate model for each task.
  • Production Engineer — owns the quality gates: test coverage, type checking, TDD process, no plaintext on disk.
  • Security Auditor — enforces ten principles of secure agentic development, auto-audits every code change against OWASP’s Top 10 for Agentic Applications, halts execution if any principle is violated.

I didn’t even write the feature ideas directly. I used confer.to to riff on what I wanted to build — I don’t like to noodle in plaintext — then fed those ideas as prompts for the PM, UX Designer, and Architect agents to evaluate, scope, set into milestones, and push back on. And frankly, they did a better job keeping scope down than many human teams I’ve worked with.

For the workflows, I started with inspiration from Steve Yegge’s Gastown and glittercowboy’s Get Shit Done framework. And after milestone 2, I did something that still feels surreal — I just asked the orchestrator agent to research the GSD repository and recommend changes to the workflow I’d designed, optimizing for token efficiency. And it actually worked. I agreed to a 5-phase-to-3-phase consolidation that cut redundant context loads by 60%. The workflow settled into three phases:

Phase 1: Think. One agent, loaded with the product manager, UX designer, and architect perspectives, reads the docs once and produces acceptance criteria and an implementation plan.

Phase 2: Execute. The orchestrator spawns parallel waves of agents. Independent tasks run simultaneously; crypto waves stay serial because you do not parallelize key management changes.

Phase 3: Ship. Push, write docs, GitHub sync, retrospective. The orchestrator handles the mechanical parts. I review the retro, extract the lessons, and update the workflow for next time.

And after every milestone shipped, I told the orchestrator to create retrospective documents with efficiency lessons-learned that fed forward into the next sprint. The agents got better over time because I was running a learning loop on top of them — each sprint’s retro became the next sprint’s constraints. And after a few sprints, I learned to trust it and just chained the output of one to the input of the next.

Here’s what one sprint looks like:

                        ┌─────────────────────────────────┐
                        │           ME (Stevo)            │
                        │                                 │
                        │  • Write feature ideas          │
                        │  • Design workflows & skills    │
                        │  • Review retro → update rules  │
                        └──────────┬──────────────────────┘
                                   │ launches sprint

┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 1: THINK                                                      │
│                                                                      │
│  ┌────────────────────────────────────────┐                          │
│  │  Consensus Agent (sonnet)              │                          │
│  │  loads: PM + UX + Architect skills     │                          │
│  │  reads: codebase, docs (once)          │                          │
│  │  outputs: user stories (Gherkin)       │                          │
│  │           acceptance.md                │                          │
│  │           implementation.md (waves)    │                          │
│  └────────────────────────────────────────┘                          │
└──────────────────────────────────────────────────────────────────────┘


┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 2: EXECUTE + GATE                                             │
│                                                                      │
│  Wave 1 (parallel)          Wave 2 (parallel)        Wave 3 (serial) │
│  ┌─────────┐ ┌─────────┐   ┌─────────┐ ┌─────────┐  ┌─────────┐      │
│  │ haiku   │ │ haiku   │   │ sonnet  │ │ haiku   │  │ sonnet  │      │
│  │ UI comp │ │ UI comp │   │ crypto  │ │ tests   │  │ crypto  │      │
│  └─────────┘ └─────────┘   └─────────┘ └─────────┘  └─────────┘      │
│                                                                      │
│  ┌────────────────────────────────────────┐                          │
│  │  Ship-Readiness Gate (sonnet/opus)     │                          │
│  │  loads: prod-eng + security skills     │                          │
│  │  runs: tests, type check, OWASP audit  │                          │
│  │  verdict: PASS or FAIL (with fixes)    │                          │
│  └────────────────────────────────────────┘                          │
└──────────────────────────────────────────────────────────────────────┘


┌──────────────────────────────────────────────────────────────────────┐
│  PHASE 3: SHIP                                                       │
│                                                                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────────┐                │
│  │ git push │  │ doc sync │  │ retrospective        │                │
│  │          │  │ (haiku)  │  │ token counts, lessons│                │
│  └──────────┘  └──────────┘  └──────────┬───────────┘                │
└──────────────────────────────────────────┼───────────────────────────┘


                        ┌─────────────────────────────────┐
                        │           ME (Stevo)            │
                        │                                 │
                        │  Read retro → update workflows  │
                        │  & skills → next sprint         │
                        └─────────────────────────────────┘

Was this actually an efficient way to code?

I told Claude I wanted exact, verifiable evidence for the retrospective — not vibes. So it built a Python script that parses its own session logs, maps every API call to a milestone using git commit timestamps, and spits out the totals (that you can verify). Here’s what came back.
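
The actual script is Python and I haven’t reproduced it here; the sketch below just shows the core bucketing idea in TypeScript, with invented field names, and assumes each milestone’s final commit timestamp closes its window.

```typescript
// Sketch only: the real analysis script is Python and its field names differ.
interface ApiCall {
  timestamp: number;        // epoch ms of the API call, from the session log
  totalTokens: number;      // input + output + cache reads for that call
}

interface MilestoneBoundary {
  milestone: string;        // e.g. "M3"
  lastCommitAt: number;     // epoch ms of the milestone's final commit, from git log
}

// Assign every call to the first milestone whose closing commit is at or after it.
function bucketByMilestone(calls: ApiCall[], boundaries: MilestoneBoundary[]): Map<string, number> {
  const sorted = [...boundaries].sort((a, b) => a.lastCommitAt - b.lastCommitAt);
  const totals = new Map<string, number>();
  if (sorted.length === 0) return totals;
  for (const call of calls) {
    const bucket =
      sorted.find((b) => call.timestamp <= b.lastCommitAt) ?? sorted[sorted.length - 1];
    totals.set(bucket.milestone, (totals.get(bucket.milestone) ?? 0) + call.totalTokens);
  }
  return totals;
}
```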

461 million total tokens across 6,424 API calls. $100/month Claude Max subscription. 2.5 days of wall time. 12,000 lines of production code, 491 tests, 8 milestones of E2E encrypted software. I still have 26 days left on the billing cycle, and I’ve been using the same subscription to build Uluka — an open-source security CLI that cross-references what your README claims about security against what the code actually does. That’s the tool I’m pointing at weaveto.do for the vulnerability scanning in Milestone #8. Two production apps, $100. I haven’t had this much fun making something real since the Habitry days.

Now, 461 million tokens sounds obscene. It’s not. Here’s why.

95.7% of those tokens are cache reads — the model re-loading context (codebase, docs, conversation history) every turn from a warm cache instead of re-processing from scratch. Think of it like a Docker layer cache: every turn rebuilds the image, but the layers haven’t changed, so it’s nearly free. You’re paying for the diff, not the base image.

Is 95.7% good? The most rigorous public analysis I could find — LMCache’s trace of a Claude Code session running a SWE-bench task — found a 92% overall prefix reuse rate across 92 API calls, with the execution phase hitting 97.8%. Developer self-reports on Hacker News cluster around 90%. My 95.7% across 6,424 calls is above the LMCache baseline and well within the execution-phase range. It’s not anomalous — it’s what well-structured multi-turn sessions with stable system prompts and consistent tool definitions naturally produce. But it does mean the workflow was doing the right thing: accumulating context efficiently instead of thrashing.

The actual new-work tokens — where the model is thinking, generating code, making decisions — total about 20 million. That’s 4.3% of the headline number.

For the curious: if you were paying per-token on the API instead of a flat subscription, cache reads run 90% off the base rate. So 442 million cache reads cost roughly $200 at API prices. The 20 million new-work tokens add about $113. Grand total at API rates: roughly $315 for an entire application with real cryptography. Well, I’m pretty sure the crypto isn’t all lies; Uluka hasn’t verified it yet.
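
If you want to sanity-check that arithmetic, here’s the back-of-the-envelope version. The blended rate is an assumption implied by the dollar figures above, not official pricing:

```typescript
// Uses only numbers from this post; the blended rates are implied, not a price sheet.
const totalTokens = 461e6;
const cacheReadTokens = totalTokens * 0.957;                    // ≈ 441M cache reads
const newWorkTokens = totalTokens - cacheReadTokens;            // ≈ 20M doing the real thinking

const assumedBasePerMTok = 4.5;                                 // $/M input, implied by a sonnet/opus mix
const cacheReadPerMTok = assumedBasePerMTok * 0.10;             // cache reads at 90% off the base rate

const cacheCost = (cacheReadTokens / 1e6) * cacheReadPerMTok;   // ≈ $199
const newWorkCost = 113;                                        // taken straight from the post
const impliedNewWorkRate = newWorkCost / (newWorkTokens / 1e6); // ≈ $5.7/M blended for new-work tokens

console.log(Math.round(cacheCost + newWorkCost));               // ≈ $312, in the ballpark of the ~$315 above
console.log(impliedNewWorkRate.toFixed(1));                     // sanity check on the implied rate
```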

Model mix was the other big efficiency lever. The sweet spot was 27% opus, 56% sonnet, 17% haiku. When I let the orchestrator implement waves directly instead of delegating, opus usage spiked to 72% and costs went up 5x. Same mistake as hiring a surgeon to put on Band-Aids. The orchestrator should orchestrate. The workers should work.

What Actually Worked

Warning: this gets (even more) technical

The pure function pattern was unbeatable. My task store is event-sourced. The auto-assign algorithm is a pure function: inputs in, events out, no side effects. These hit 100% test coverage trivially. If your agents are writing pure functions, they write near-perfect code on the first try.
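
To make that concrete, here’s the shape of the pattern. This is a sketch with invented names, not the repo’s actual store, but it shows why agents do well here: the whole contract is visible in the signature.

```typescript
// Illustrative sketch: the real task/event types in the repo differ.
interface Task {
  id: string;
  title: string;
  assignee: string | null;
}

type TaskEvent =
  | { type: 'task-assigned'; taskId: string; assignee: string }
  | { type: 'task-skipped'; taskId: string; reason: string };

// Pure function: same inputs always yield the same events, no storage or network access.
function autoAssign(tasks: Task[], members: string[]): TaskEvent[] {
  const unassigned = tasks.filter((t) => t.assignee === null);
  if (members.length === 0) {
    return unassigned.map((t): TaskEvent => ({ type: 'task-skipped', taskId: t.id, reason: 'no members' }));
  }
  // Simple round-robin for the sketch; the real algorithm can be anything, as long as it stays pure.
  return unassigned.map((t, i): TaskEvent => ({
    type: 'task-assigned',
    taskId: t.id,
    assignee: members[i % members.length],
  }));
}
```

Testing it is just calling it: feed in tasks and members, assert on the events that come back.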

Pre-writing interfaces before spawning code agents was the single biggest improvement. In M2, I told the orchestrator to start writing TypeScript interfaces and pasting them directly into agent prompts. Agents got typed contracts, not a scavenger hunt through the codebase. This eliminated an entire class of cross-agent blocking. It’s the difference between handing a contractor blueprints versus telling them “build me a house.” One produces a house. The other produces a conversation about what you meant by “house.”
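
Here’s the flavor of what got pasted into a prompt. These particular names are hypothetical, not the repo’s actual contracts, but the point is that the agent starts from a typed surface instead of an open question:

```typescript
// Hypothetical example of a contract handed to a UI agent up front.
export type RoomEvent =
  | { kind: 'task-created'; taskId: string; title: string }
  | { kind: 'task-assigned'; taskId: string; assignee: string }
  | { kind: 'room-burned' };

export interface RoomSession {
  roomId: string;
  expiresAt: number;                                          // epoch ms; the room burns at this time
  send(event: RoomEvent): Promise<void>;
  subscribe(handler: (event: RoomEvent) => void): () => void; // returns an unsubscribe function
}
```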

The zero-dependency constraint forced good decisions. The browser’s built-in WebAssembly API was sufficient for agent sandboxing. The Web Crypto API handled PIN derivation and AES-GCM. vodozemac WASM handles Olm/Megolm. At no point did an agent suggest adding a library that was actually needed; the built-ins covered it. As Colin Chapman said: simplify, then add lightness.
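
For a sense of what “the built-ins were enough” looks like, here’s a rough sketch of PIN key derivation and encryption with the standard Web Crypto API. The 600,000-iteration figure comes from earlier in this post; salt handling, parameters, and key usage in the actual app may differ.

```typescript
// Sketch only: error handling, salt storage, and key lifecycle are the app's business, not shown here.
async function deriveRoomKey(pin: string, salt: BufferSource): Promise<CryptoKey> {
  const pinKey = await crypto.subtle.importKey(
    'raw',
    new TextEncoder().encode(pin),
    'PBKDF2',
    false,
    ['deriveKey'],
  );
  return crypto.subtle.deriveKey(
    { name: 'PBKDF2', salt, iterations: 600_000, hash: 'SHA-256' },
    pinKey,
    { name: 'AES-GCM', length: 256 },
    false,                          // non-extractable: key material never leaves the crypto layer
    ['encrypt', 'decrypt'],
  );
}

async function sealMessage(key: CryptoKey, plaintext: BufferSource) {
  const iv = crypto.getRandomValues(new Uint8Array(12));     // 96-bit nonce, standard for AES-GCM
  const ciphertext = await crypto.subtle.encrypt({ name: 'AES-GCM', iv }, key, plaintext);
  return { iv, ciphertext };
}
```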

Security audits before shipping, not after. In M3, an opus-level review of the agent runtime caught 7 findings including a memory corruption vector where the host was writing unsolicited data into WASM sandbox memory. These are semantic issues that no linter or type checker will ever find. One opus review pass costs almost nothing compared to shipping a crypto vulnerability. This is --dry-run for your entire security model: you don’t get to skip it because the tests passed.
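
The shape of that memory finding, sketched from the description above (this is not the actual code, and the declared values stand in for the real sandbox setup):

```typescript
declare const sandboxMemory: WebAssembly.Memory;   // the guest's linear memory (stand-in for the real setup)
declare const hostPayload: Uint8Array;

// The flagged pattern, in shape: the host writes into guest memory at an offset
// the guest never handed out, silently clobbering whatever the guest stored there.
new Uint8Array(sandboxMemory.buffer).set(hostPayload, 1024);
```

Tests pass, types check, and the sandbox is still corruptible; only a reviewer reasoning about who owns that memory catches it.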

What Didn’t Work

I should have used agent teams from the start. M1 had three clearly parallel work streams — store logic, UI components, agent algorithms — and I ran them serially because I was being cautious. I was wrong. A team would have cut wall time significantly. I was so worried about race conditions in my own workflow that I accidentally serialized the whole thing.

Cheap models take shortcuts when you’re vague. Claude comes in three sizes — Haiku (cheap/fast), Sonnet (mid-range), and Opus (expensive/smart). I used Haiku for simple UI work to save tokens. In M6, I told a Haiku agent to “build a PIN entry component.” It needed to tell the parent page how many times the user got the PIN wrong. The correct way is to pass data through the component’s official interface — clean, traceable, testable. Instead, the agent duct-taped a hidden variable onto the browser’s global state. A sticky note on the fridge that any code in the entire app can read or overwrite. It works, technically. It’s also the kind of thing that causes bugs three milestones later. When I changed the instruction to “PinEntry receives failedAttempts as a prop from its parent,” it built it correctly. The specific instruction isn’t micromanaging — it’s giving the agent a blueprint instead of a napkin sketch. Give it one, or it’ll invent its own, and you won’t like what it invents.
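
The contract that fixed it, roughly (the actual Svelte component is in the repo; the callback name here is my own illustration):

```typescript
// Prop contract for the PinEntry component, as described above.
// The parent page owns the counter and passes it down; nothing lives in global state.
interface PinEntryProps {
  failedAttempts: number;               // read-only from the component's point of view
  onAttempt: (pin: string) => void;     // hypothetical callback: the parent updates the count
}

// What the Haiku shortcut amounted to (anti-pattern), give or take:
// (globalThis as any).__pinFailedAttempts = 3;   // readable and writable from anywhere in the app
```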

Agents don’t grep all test files. In M5.5, an agent updated 7 E2E test files but missed 2 assertions in a file it wasn’t explicitly told about. The old string “Invite to Room” became “Invite to {roomName}” and two tests broke. Now every fix agent gets an explicit instruction: grep for ALL occurrences of the old string across ALL test files before you’re done. Every. Damn. File.

Vitest doesn’t catch TypeScript errors. This burned me in M7. All unit tests passed. Only npm run check caught a type mismatch where Uint8Array doesn’t satisfy WebCrypto’s BufferSource type. Your tests passing is not the same as your code being correct. Green CI is necessary but not sufficient. Remember: they’re well-meaning liars.

The Uncomfortable → Fun Part

Here’s the thing I keep coming back to. This app has real cryptography. Not “we base64 the password” cryptography. Real key exchange, real forward secrecy, real signatures, real authenticated encryption. A 6-layer cleanup orchestrator that wipes everything when a room burns.

Agents built all of it.

Not from scratch. I had to know what to ask for. I had to catch what they got wrong. And I have to verify the crypto is actually correct in Milestone #8. But the implementation was theirs.

I don’t know how to feel about that. But I know how I feel about the rest of it.

I spent six years being really good at a job that didn’t make me curious anymore. I studied philosophy of mind twenty years ago because I wanted to understand how thinking works. I got a master’s in sport psychology because I wanted to understand why people do things. And then I spent a decade in product management where most of the job is managing the egos of people with more money than product sense.

This is the first time in years where I’m learning something genuinely new every day. Not “new framework” new — new category new. How do you design a workflow that makes unreliable agents produce reliable output? How do you write a prompt that functions like a job description, a spec, and a quality gate all at once? How do you run a retrospective on a team that doesn’t remember the last sprint?

I have 26 days left on this subscription and a list of things I want to build that keeps getting longer. That’s the feeling I missed. Not the shipping. The wanting to ship.

Verify it yourself

The entire codebase is public: github.com/smledbetter/Weaveto.do

Every claim in this post has a paper trail:

  • Token numbers & model mix — [docs/RETROSPECTIVE.md](https://github.com/smledbetter/Weaveto.do/blob/main/docs/RETROSPECTIVE.md) has the verified per-milestone breakdown
  • 5 skill prompts — [.claude/skills/](https://github.com/smledbetter/Weaveto.do/tree/main/.claude/skills) contains the exact markdown files loaded into each agent role
  • Gherkin acceptance criteria — [docs/milestones/](https://github.com/smledbetter/Weaveto.do/tree/main/docs/milestones) has acceptance.md files for every milestone
  • 3-phase workflow — [docs/WORKFLOW.md](https://github.com/smledbetter/Weaveto.do/blob/main/docs/WORKFLOW.md) describes the full sprint process
  • Wave-based commits — run git log --oneline and look for the feat(MN): ... (Wave N) pattern
  • Zero dependencies after M2 — run git log -- package.json to see that only 2 commits touch it
  • Tests — clone the repo, then npm install && npm run test:unit && npm run test:e2e