case study · 03

Salazar

The tool that builds itself.

Autonomous coding orchestrator. Hand it a markdown spec and the planner-generator-evaluator loop runs until working software exists. Pointed at its own CLI spec, it built a 1,141-test Ink terminal app in four hours. Pointed at its own codebase in brownfield mode, it added features to itself.

Salazar illustration

~ what shipped ~

Three numbers that matter.

metric · 01

1,141 tests

The Salazar CLI itself. Spec written, agent dispatched, walked away. Four hours later: 63/63 features, 1,141 tests passing, fully functional terminal app. The tool built its own interface.

metric · 02

$9.27

Total cost to build mini-jwt: 38/38 features, 76 tests, 96% coverage, 70 minutes. End-to-end from spec to merged PR. The economics make this an actual tool, not a demo.

metric · 03

Contract-gated

Planner → Generator → Evaluator handoffs are typed contracts, not vibes. Each agent reads the previous agent's structured output. Failures route back, not forward.

The loop

Three agents, contract-gated, run until convergence. Planner reads the spec, emits a structured task graph. Generator consumes one task, writes code. Evaluator runs tests, decides: accept, retry, escalate. The loop continues until the spec is satisfied or the evaluator gives up. No human in the inner loop.
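
A minimal sketch of what contract-gated could mean in practice, in TypeScript (the stack of the Ink CLI Salazar built for itself). Every name here (Task, Patch, Verdict, plan, generate, evaluate, MAX_ATTEMPTS) is an illustrative assumption, not Salazar's actual interface; the point is that each handoff is a typed value the next agent can parse, and a failing verdict pushes the task back to the Generator instead of flowing forward.

```ts
// A sketch, not Salazar's real API: every name below is an assumption.

interface Task {
  id: string;
  description: string;
  attempts: number;
  feedback?: string;                 // evaluator notes carried into a retry
}

interface Patch {
  taskId: string;
  files: Record<string, string>;     // path → new contents
}

type Verdict =
  | { kind: "accept"; taskId: string }
  | { kind: "retry"; taskId: string; feedback: string }
  | { kind: "escalate"; taskId: string; reason: string };

interface Agents {
  plan(spec: string): Promise<Task[]>;       // Planner: spec → task graph
  generate(task: Task): Promise<Patch>;      // Generator: one task → code
  evaluate(patch: Patch): Promise<Verdict>;  // Evaluator: run tests, decide
}

const MAX_ATTEMPTS = 3;

// Run until every task is accepted, or a task escalates / exhausts retries.
async function runLoop(spec: string, agents: Agents): Promise<void> {
  const queue = await agents.plan(spec);

  while (queue.length > 0) {
    const task = queue.shift()!;
    const patch = await agents.generate(task);
    const verdict = await agents.evaluate(patch);

    switch (verdict.kind) {
      case "accept":
        break;                       // task satisfied, move on
      case "retry":
        // Failure routes back to the Generator, not forward to the next task.
        if (task.attempts + 1 >= MAX_ATTEMPTS) {
          throw new Error(`gave up on ${task.id} after ${MAX_ATTEMPTS} attempts`);
        }
        queue.unshift({ ...task, attempts: task.attempts + 1, feedback: verdict.feedback });
        break;
      case "escalate":
        // No human in the inner loop; escalation ends the run for review.
        throw new Error(`escalated ${task.id}: ${verdict.reason}`);
    }
  }
}
```

The discriminated Verdict union is the gate: the loop only advances through an explicit accept, never by default.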

The ouroboros

Named for the serpent that eats its own tail. The first big proof: we wrote a spec for an Ink-based terminal UI for Salazar itself, pointed Salazar at it, and walked away. Four hours later there was a CLI. We then pointed it at its own codebase in brownfield mode to add new features. The tool maintains the tool that maintains the tool.

What it doesn't do

Salazar isn't a chat assistant. There's no conversation. You hand it a spec and read the resulting PR. If the spec is bad, the build is bad. The loop can't divine intent. Garbage in, working code in the shape of garbage out. That's a feature: the spec discipline transfers.

Cost realism

mini-jwt at $9.27 isn't an outlier. Most builds in the 30–100 minute range cost single-digit dollars on Claude Sonnet 4.6 / 4.7. Long horizons or messy specs push that up. The economics make Salazar an actual production tool, not a "wait until inference gets cheaper" demo.

↳ "we pointed it at a spec for its own CLI and it built a 1,141-test terminal app in four hours."

~ on the workbench ~

The tooling.

~ counterfactual ~

What would have been worse.

Without it: every greenfield build is a thousand human-hours of code that an agent loop could write while you sleep. Or worse, the build never starts, because the dev surface for saying 'just build this' doesn't exist outside Bedrock-grade procurement processes. Salazar is the cheap version of expensive infrastructure.

~ got something like this on the bench? ~

Pull the cord.

Start the conversation

/ salazar / built by hand / shipped to a working URL /