How agentic AI made us waterfall again

By Wednesday morning, the part nobody had warned us about was already obvious: the meetings. (Yes, the irony was lost on no one.)

We were two and a half days into a one-week experiment that, on paper, should have been mostly keyboard and not very much conversation. Three senior engineers. A locked room. An agent in every editor. We had been told — by tweets, by vendors, by our own optimism — that putting agentic AI in the loop would make us faster, looser, more agile. And it did make us faster. By the time the week ended we’d shipped 137,900 lines, 194 commits, and 30 features that a traditional team our size would normally have spent two to three months on. But by Wednesday we hadn’t shipped a single feature yet. We were on day three of a week and we were still writing specs. We were spending more time in alignment than I had on any project in years. Acceptance criteria sounded like a lawyer drafted them. The keyboard, the part of the job we used to fight over, was suddenly the easy bit — but only because we hadn’t actually reached it.

This is a note about why that happened, what we built to make it work, and why I now think it’s the correct shape for this kind of bounded experiment — even though it isn’t the shape anyone sells you when they’re talking about AI, and even though I wouldn’t want to live in this shape long term.

A bit of setup. The project was an internal proof of concept: a one-week workshop where three senior engineers — tech leads and staff — would build a real product surface against an on-the-books business problem, with Claude Code driving the keystrokes. The deliverable wasn’t just code. It was a number leadership could trust: how much faster does our most expensive engineering tier actually move with this stuff? So everything that follows was decided with one eye on the work and one eye on whether we’d be able to defend the result.

The week broke into three phases, planned tightly upfront:

Monday — input. Business stakeholders walked us through the requirements (Excel + Figma), and we used the same day to prep the repo for the agent: writing the architecture docs, the DDD strategic-design notes, the workflow rules, the .claude/rules/ files, a Vertical Slice playbook, the CLAUDE.md, the testing strategy. Markdown, not code. By Monday evening the repo knew what kind of repo it wanted to become.
Tuesday and Wednesday — planning. Two full days of turning requirements into Jira: Epics, Features, Subtasks, dependencies, acceptance criteria. We did this with Claude — feeding it the business inputs and the prep markdown and letting it propose the decomposition, which we then read, pushed back on, and refined.
Thursday and Friday — development. Pick up Subtasks from the board. Run the Vertical Slice. Let Claude build, self-review against Figma, open the PR. Review, merge, next.

Three days of input and planning. Two days of code. The shape of the waterfall is right there in the calendar.

The repo we built for the agent, not for ourselves

The first real decision was the repo. Most of us came in assuming we’d spin up four or five microservice repos and let polyrepo discipline keep the boundaries clean. That instinct lasted about a week.

It turns out that what’s good for human cognition (small repos, narrow scopes, separate CI) is terrible for an agent. An agent reasons over files. If the file you need to change lives in another repo with another build system and another set of conventions, the agent has to hop. Each hop is a place where context gets dropped and hallucination creeps in. Multiply that across a feature that touches four services and you’ve built a maze specifically tuned to make the AI worse at its job. (It took us most of a week to admit this out loud.)

So we collapsed everything into a Go monorepo on Bazel. The shape settled quickly:

Fig 1. The repo, predictable to a fault.

protos/ is the contract layer — every gRPC service has its .proto files here, and nothing else. service/ holds all the runtimes: the Go domain services, the Next.js frontend, and the API gateway. pkg/ is the shared Go library — auth, database, kafka, logging, grpc, generics, the usual things, written once, used everywhere. tests/, docs/, hack/, tools/, schemas/ round it out.

The point isn’t that this is novel. It very much isn’t. The point is that it’s boring — and boring, when an LLM is reading your repo, turns out to be a feature. The agent learns the shape once. After that, every new service, every new feature, every new contract change lands in the place you’d predict. We never had to teach Claude where things go. The folder structure was the teaching.

A second thing fell out of the monorepo that I didn’t expect to value as much as I did: a single go.mod. Cross-service refactors stopped being a dependency-version dance. If the pkg/auth interface changed, every consumer’s tests ran in the same CI invocation. The agent could suggest a breaking change, run the entire test graph, and tell us whether it had broken anything across all services. In a polyrepo world, that workflow either doesn’t exist or takes a week to cobble together.

One front door, on purpose

The other early structural decision was the gateway. We could have let the frontend talk to each Go service directly over gRPC-Web, or we could have written one BFF per consumer, or any of the other six patterns people defend in conference talks. We picked GraphQL, with a single gateway service in front of the gRPC backends, partly for boring reasons (we needed it for the frontend anyway) and partly for one reason that turned out to matter more than we expected.

The reason was: we wanted the APIs to be a product, not an integration problem. One schema. One front door. Anyone who wanted to consume the platform — the Next.js frontend, a future mobile app, an internal back office, a partner — would aim at exactly one URL and exactly one set of types.

That decision compounded with the agent in two ways.

First, the gateway gave Claude a single source of truth for the question “what data exists in this product?” In a polyglot, multi-endpoint world, “is there a way to fetch the customer’s open invoices?” is a question that requires the agent to crawl service definitions, REST routes, and tribal knowledge. With one GraphQL schema sitting in service/api-gateway/, the answer is one file away. Every time we added a new field to the schema, we were also extending the agent’s working vocabulary for what the platform could do.

Second, it kept service ownership clean behind that surface. The gateway resolves into the gRPC services, but it doesn’t own domain logic. A Subscription feature lives in service/agreement/ and service/billing/ and service/consumption/, each with its own proto and its own boundary, and the gateway is the place where they’re stitched together for a UI that doesn’t need to know how the sausage was made. The agent gets to reason about each service in isolation, then reason about composition exactly once — at the resolver — and stop.

I can’t overstate how much this mattered for keeping AI-generated work coherent across services. The gateway isn’t a clever pattern; it’s a containment strategy.

Every Feature is shaped the same way

Once the repo and the gateway were settled, the workflow shape almost designed itself. Every Feature in Jira decomposed into the same five artifacts, in the same order, and we started calling it the Vertical Slice.

Fig 2. The Vertical Slice. Same five files, same order, every Feature.

Proto first, BDD last, no exceptions. The proto pins the contract before anyone writes Go. The Go server can be implemented and tested in isolation. The GraphQL resolver is a thin translation layer. The React component is fed mocked GraphQL until the resolver is real. The BDD scenario closes the loop by exercising the whole stack from a user’s point of view.

This is not a new idea. It’s just unusually rigid. And rigidity, again, is what the agent rewards. Hand Claude a Subtask labelled “implement the gRPC handler for GetSubscription,” and it knows exactly which file to write, which pkg/ libraries to import, which test fixtures to follow, which adjacent handlers to read for style. The decomposition wasn’t doing the AI’s homework for it. The decomposition was the homework — and the moment we let it slip, output quality fell off a cliff.

Which brings me to the part I came here to talk about.

We accidentally became waterfall

Here’s the realization that crept up on us somewhere in the middle of the planning phase, when we noticed the calendar had eaten more of the week than we’d budgeted for.

The agile playbook — small slices, learn-by-doing, refine on the fly — is a strategy for managing implementation ambiguity. You don’t fully specify the work because you know the act of building will teach you what you actually need. The first version is wrong, the second version is less wrong, the third version starts to look right. Specifications are liabilities; running code is the spec.

Agentic AI quietly upends that. When the cost of producing a first implementation drops to nearly zero, the bottleneck doesn’t disappear. It moves. It moves up — to the spec.

You can’t hand a mushy ticket to a competent agent and expect implementation to clarify it for you. The agent will produce something, quickly, plausibly, and — if the spec was vague — wrongly. And because it’s so quick, you won’t notice you’ve gone in the wrong direction until you’ve covered a lot of ground. The faster the typing, the more expensive a wrong direction becomes.

So you start refining harder. You get sharper about acceptance criteria. You agree on field names before anyone proposes a proto. You sketch the React component against Figma before there’s a resolver to feed it. The spec stops being a sketch and starts being a contract. And the workflow stops looking like this:

spec a bit → build a bit → learn → spec a bit more → build a bit more → learn …

…and starts looking like this:

Fig 3. The rhythm we ended up with. The Figma badge is a recursive self-check Claude runs against the design via MCP before opening a PR — the human Review step gets a result that already passed its own visual diff.

The loop is still there — but it loops back to refinement, not to a half-built version of the thing. There’s no “let’s just see what it looks like, then iterate.” There’s a spec; there’s an implementation; there’s a review. If the review fails, you don’t tweak the implementation. You go back to the spec.

The little Figma badge hanging off the Implement box is the part of this diagram I keep pointing at when I tell people about it.

We hooked Claude up to Figma over MCP, and the agent learned to perform a recursive visual diff on its own work before opening the pull request. The flow goes: implement the React component, render it, fetch the Figma frame for the same screen via get_design_context(), compare them. If spacing is off or a token is wrong, Claude adjusts the code and runs the check again. By the time a human reviewer opens the PR, the screen they see has already been validated against the design. Sometimes more than once.
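
The control flow of that self-check can be sketched as a bounded loop. This is an illustration of the shape only, not how Claude or the Figma MCP server is implemented: the Diff type, the Checker struct, and the round limit are all hypothetical, and the render, design fetch, and comparison are collapsed into one injected function so the loop runs without a browser or Figma.

```go
package main

import "fmt"

// Diff is a hypothetical list of mismatches between the rendered screen and the design.
type Diff []string

// Checker holds the two injected steps of the loop.
type Checker struct {
	Compare func() Diff        // render, fetch the design frame, compare the two
	Adjust  func(d Diff) error // change the code in response to the mismatches
}

// SelfCheck compares and adjusts until the diff is empty or maxRounds is used up.
// It returns the number of compare rounds run and whether a compare came back clean.
// A false result means "not verified", so the remaining work goes to a human.
func SelfCheck(c Checker, maxRounds int) (int, bool, error) {
	for round := 1; round <= maxRounds; round++ {
		diff := c.Compare()
		if len(diff) == 0 {
			return round, true, nil
		}
		if err := c.Adjust(diff); err != nil {
			return round, false, err
		}
	}
	return maxRounds, false, nil
}

func main() {
	// Simulated mismatches; each adjustment fixes one of them.
	pending := Diff{"padding is 4px off", "wrong color token"}
	c := Checker{
		Compare: func() Diff { return pending },
		Adjust: func(d Diff) error {
			pending = pending[1:]
			return nil
		},
	}
	rounds, clean, err := SelfCheck(c, 5)
	fmt.Println("rounds:", rounds, "clean:", clean, "err:", err)
}
```

The round limit matters as much as the comparison: without it, the check becomes another place for the agent to disappear.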

Two things made this matter more than it might sound. First, it killed an entire class of review nit — the “your padding is 4px off” comment that used to consume a third of frontend reviews. The agent caught those itself, silently, with better fidelity than a human eyeballing the diff. Second, it sharpened the boundary between what computers can verify and what humans need to verify. Pixel correctness against a known design? Computer. Whether the design was the right design in the first place, or whether the resulting interaction felt right to a real customer? Still ours. The Figma self-check didn’t replace human review — it raised its floor, so when a human did open the PR, they were always reviewing something that already passed the easy bar.

The funny thing is that even when Claude did the refinement — and it did, often — the rhythm didn’t change. The agent was excellent at turning a vague Feature description into a sharp, well-decomposed set of Subtasks, but you still had to read every one. You still had to push back. You still had to agree on the spec before anybody (human or AI) started producing code against it. Outsourcing the typing didn’t shorten the upstream conversation; if anything, it made it more important, because the typing step would no longer slow anyone down enough for misalignment to be caught organically.

I’d resisted this framing for a while, mostly because “we became waterfall” sounds like a confession of failure. But waterfall isn’t intrinsically bad — it’s just the wrong tool when the cost of building is high relative to the cost of specifying. Agentic AI inverts that ratio. Once a feature costs an hour to build but a day to specify correctly, spending most of your time on the spec is the rational move. The methodology was tracking the economics, not betraying them.

What we’d actually backed into has a name in some circles: spec-driven development. Write the spec, let the agent build to spec, review against spec. It’s a perfectly defensible methodology and it has been around in various flavours for decades. But spec-driven development done unchecked has a precise failure mode in the agentic era — full refinement before any development. Big spec → big batch → big blast radius. The bigger the spec at hand-off, the further off-course the agent can drift before anyone notices, and the more code you have to throw away when you do.

The clearest signal that we’d over-specified was the autonomous loop. Hand Claude a fat spec (a whole feature, multiple files, a half-dozen acceptance criteria) and it would happily disappear into a code-and-self-review cycle for thirty, forty, sometimes forty-five minutes without prompting. Refining. Regenerating. Second-guessing its own diff. Coming back with a wall of code that was internally coherent and wrong in the two or three places that mattered. Smaller batches don’t just lower the blast radius of a wrong direction; they cut the agent’s autonomous time short enough that misalignment surfaces before it compounds. The spec is the throttle. (That’s the one metaphor I’m taking with me.)

That, more than anything, is the lesson I’d hand the next team running this experiment. Don’t fall in love with the spec.

Next time: agile spec-driven AI development

The waterfall rhythm fit the workshop because the workshop was bounded. One week, fixed scope, every minute on the clock: there was no room for the spec to evolve, only to be agreed and executed. Outside that constraint, though, I don’t think waterfall is what you actually want. What you want is closer to agile spec-driven AI development: same recognition that the spec is now load-bearing, but iterated in slices the way agile teams used to iterate on code.

We didn’t get to practise this — by design, we were measuring a single sprint. But here’s the shape I’d try next time, in roughly the order I’d introduce it to a team:

Spec the smallest thing that can ship. Not the feature; the thinnest slice of the feature that exercises every layer end-to-end. One screen, one endpoint, one user story. Let that round-trip teach you what the spec for the next slice should sharpen.
Time-box the agent’s autonomous time. If Claude has been off thinking for fifteen minutes without checking in, that’s a signal the spec was too big or too vague. Interrupt, refine, restart. Forty-five-minute solo loops aren’t agentic; they’re a stalled context window pretending to make progress.
Treat the spec as a living document, not a contract. The day-one prep artifacts (architecture, conventions, DDD, the Vertical Slice rules) should be stable — that’s the frame, and the frame is expensive to change. But the per-feature spec should grow alongside the code, with each slice sharpening the spec for the slices that follow.
Smaller PRs, more PRs. “Large AI changesets are riskier than large human changesets” was the most actionable bullet in our post-mortem. The agent doesn’t have intuition about blast radius; you have to enforce it externally, with batch-size discipline.
Spec interfaces, not implementations. The richer the spec, the more the agent acts like a transcription machine and the less like a collaborator. Specifying the contract — proto, schema, component signature — and leaving the body open is where Claude’s judgment is at its most useful.

None of this is novel; it’s mostly classic agile, ported one layer up. The trick is recognising that with an agent in the loop the unit of agility has moved. You’re no longer iterating on the code. You’re iterating on the spec, with the code as a fast-feedback compiler for whether the spec was any good.

What this means for senior-engineer time

Stack three senior engineers on one project and you’d normally expect them to drown each other in opinions. The classic failure mode is that you have three planners and no implementer, so the planning becomes the work and nothing ships.

Agentic AI fixes that — but in a particular way. It doesn’t turn the planners into implementers. It makes the implementing part small enough that a planners-only team can ship. The cognitive ratio of the team matches the cognitive ratio of the work. We were three planners; the work became mostly planning and review; nobody had to pretend to be a junior engineer for a week to absorb the typing tax.

The unglamorous version of the headline finding is that AI’s biggest leverage on senior engineers isn’t writing code. It’s offloading the decomposition work that seniors normally do for other people. Take that off their plate and give them an agent that can be trusted to translate a sharp spec into a working slice, and you discover that a small, senior team can drive a real product from contract to ship without the usual machinery around them.

There’s a line I keep coming back to from the post-mortem write-up, because it’s the most honest summary of what we saw:

Where structure exists, AI multiplies speed. Where structure is missing, AI multiplies chaos.

That’s it. That’s the whole experiment in one sentence. Every hour of prep — every markdown file, every ADR, every CLAUDE.md, every rule under .claude/rules/ — was an hour invested in the speed side of that equation. Skip the prep and the same agent, on the same problem, would have produced an enthusiastic mess at twice the rate of a junior engineer with a Stack Overflow tab open.

What the experiment proved

The point of the project wasn’t to prove that AI can write Go. We knew that. The point was to put a number on AI-augmented velocity for the most expensive engineering tier, on a real on-the-books business problem, so the result couldn’t be dismissed as a hype demo.

The numbers, then. Three engineers, one week, structured into one day of input, two days of planning, and two days of build. 137,900 lines of code. 194 commits. 59 pull requests, 53 of them merged. 30 features done, 199 of 330 issues closed. Around $1,725 in agent tokens for the whole sprint — less than the catering. A traditional team of three would have spent eight to twelve weeks to land equivalent scope: requirements engineering through working code, CI/CD, deployable infrastructure, the lot.
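
For anyone who wants to sanity-check the ratios behind those figures, here is the arithmetic as a small Go program. The inputs are the numbers reported above; the Sprint type and the derived metrics (merge rate, close rate, cost per feature, lines per commit) are my own framing for illustration, not something we tracked as formal KPIs.

```go
package main

import "fmt"

// Sprint holds the raw figures reported for the week.
type Sprint struct {
	Lines        float64
	Commits      float64
	PRsOpened    float64
	PRsMerged    float64
	IssuesTotal  float64
	IssuesClosed float64
	Features     float64
	TokenCostUSD float64
}

// MergeRate is the share of opened pull requests that were merged.
func (s Sprint) MergeRate() float64 { return s.PRsMerged / s.PRsOpened }

// IssueCloseRate is the share of issues that were closed.
func (s Sprint) IssueCloseRate() float64 { return s.IssuesClosed / s.IssuesTotal }

// CostPerFeature is the agent token spend divided by features done.
func (s Sprint) CostPerFeature() float64 { return s.TokenCostUSD / s.Features }

// LinesPerCommit is the average size of a commit in lines.
func (s Sprint) LinesPerCommit() float64 { return s.Lines / s.Commits }

// reported contains the figures from the paragraph above.
var reported = Sprint{
	Lines:        137900,
	Commits:      194,
	PRsOpened:    59,
	PRsMerged:    53,
	IssuesTotal:  330,
	IssuesClosed: 199,
	Features:     30,
	TokenCostUSD: 1725,
}

func main() {
	fmt.Printf("PR merge rate:     %.1f%%\n", 100*reported.MergeRate())
	fmt.Printf("issue close rate:  %.1f%%\n", 100*reported.IssueCloseRate())
	fmt.Printf("cost per feature:  $%.2f\n", reported.CostPerFeature())
	fmt.Printf("lines per commit:  %.0f\n", reported.LinesPerCommit())
}
```

Roughly nine in ten PRs merged, six in ten issues closed, and $57.50 in tokens per finished feature. The lines-per-commit average is large, which is consistent with the batch-size concern raised earlier.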

What we ended up with wasn’t just a working slice of a real product. It was a stack of Jira artifacts that documented every decision and how long it took. A GraphQL surface a non-technical stakeholder could click through. And — more usefully than the numbers — a defensible story about how the velocity was achieved. The shape of the repo. The shape of the gateway. The shape of the slice. The shape of the workflow that fell out of all three.

The line I’d port out of the experiment to anyone considering one of their own is this: the model is a force multiplier on the step before the keyboard, not the keyboard itself. Build the repo for the agent. Build the gateway as a product. Build the slice as a contract. Spend day one on the markdown. And then accept that with agentic AI in the loop, the unit of agility has moved upstream — you’re not iterating on the code anymore, you’re iterating on the spec, and the code is just the fastest-feedback compiler you’ve ever had for whether the spec was any good.