case study · 03
The tool that builds itself.
Autonomous coding orchestrator. Hand it a markdown spec, and the planner-generator-evaluator loop runs until working software exists. Pointed at its own CLI spec, it built a 1,141-test Ink terminal app in four hours. Pointed at its own codebase in brownfield mode, it added features to itself.
~ what shipped ~
metric · 01
1,141 tests
The Salazar CLI itself. Spec written, agent dispatched, walked away. Four hours later: 63/63 features, 1,141 tests passing, fully functional terminal app. The tool built its own interface.
metric · 02
$9.27
Total cost to build mini-jwt: 38/38 features, 76 tests, 96% coverage, 70 minutes. End-to-end from spec to merged PR. The economics make this an actual tool, not a demo.
metric · 03
Contract-gated
Planner → Generator → Evaluator handoffs are typed contracts, not vibes. Each agent reads the previous agent's structured output. Failures route back, not forward.
Three agents, contract-gated, run until convergence. Planner reads the spec, emits a structured task graph. Generator consumes one task, writes code. Evaluator runs tests, decides: accept, retry, escalate. The loop continues until the spec is satisfied or the evaluator gives up. No human in the inner loop.
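A minimal sketch of what "contract-gated" means here, in TypeScript since the CLI itself is an Ink app. All type and function names are illustrative assumptions, not Salazar's actual API: the point is that each handoff is a typed value, and that a failed evaluation routes back to the generator rather than forward.

```typescript
// Hypothetical contracts for the three-agent loop. Names are
// illustrative only; Salazar's real interfaces are not published here.

type Task = { id: string; description: string; dependsOn: string[] };
type Plan = { tasks: Task[] };                                  // Planner output
type Patch = { taskId: string; files: Record<string, string> }; // Generator output
type Verdict =                                                  // Evaluator output
  | { kind: "accept"; taskId: string }
  | { kind: "retry"; taskId: string; feedback: string }
  | { kind: "escalate"; taskId: string; reason: string };

// Failures route back, not forward: "retry" re-enters the generator
// at the front of the queue; "escalate" drains the queue and
// surfaces the task to the spec author.
function routeVerdict(
  v: Verdict,
  queue: Task[],
  done: Set<string>,
  plan: Plan
): Task[] {
  switch (v.kind) {
    case "accept":
      done.add(v.taskId);
      return queue;
    case "retry": {
      const task = plan.tasks.find((t) => t.id === v.taskId)!;
      return [task, ...queue]; // back through the generator
    }
    case "escalate":
      return []; // stop dispatching; no human in the inner loop, but one at the exit
  }
}
```

The discriminated-union `Verdict` is what makes the handoff "typed contracts, not vibes": the generator cannot receive an evaluator result without the compiler forcing every outcome to be handled.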
Named for the serpent that eats its own tail. The first big proof: we wrote a spec for an Ink-based terminal UI for Salazar itself, pointed Salazar at it, and walked away. Four hours later there was a CLI. We then pointed it at its own codebase in brownfield mode to add new features. The tool maintains the tool that maintains the tool.
Salazar isn't a chat assistant. There's no conversation. You hand it a spec and read the resulting PR. If the spec is bad, the build is bad. The loop can't divine intent. Garbage in, working code in the shape of garbage out. That's a feature: the spec discipline transfers.
mini-jwt at $9.27 isn't an outlier. Most builds in the 30-100 minute range run in single-digit dollars on Claude Sonnet 4.6 / 4.7. Long horizons or messy specs push that up. The economics make Salazar an actual production tool, not a "wait until inference gets cheaper" demo.
↳ "we pointed it at a spec for its own CLI and it built a 1,141-test terminal app in four hours."
~ on the workbench ~
~ counterfactual ~
Without it: every greenfield build is a thousand human-hours of code that an agent loop could write while you sleep. Or worse, the build never starts because the dev surface to express 'just build this' didn't exist outside Bedrock-grade procurement processes. Salazar is the cheap version of expensive infrastructure.
~ got something like this on the bench? ~
/ salazar / built by hand / shipped to a working URL /