In early 2026, Stripe disclosed that a fleet of autonomous agents internally called "Minions" ships over 1,300 pull requests a week. Roughly 70% of them get merged with no human modification. In the same window, Faros AI's telemetry across 22,000 developers reported that median PR review time had climbed 441%, incident rate per PR was up 242%, and 31% more PRs were being merged with no review at all.
Pick whichever number unsettles you more. They describe the same trend.
The bottleneck in software delivery has moved. It used to live in typing. Then it lived in deploying. Now it lives in trusting. We have agents that can produce more code in a week than most teams shipped in a quarter, and we have nothing remotely equivalent on the receiving end of the merge button.
Most teams' first answer to this has been to slap an AI review bot on the PR. CodeRabbit. Greptile. Bugbot. Diamond. GitHub Copilot Code Review. Anthropic's Claude Code Review. The bot reads the diff, leaves a summary, flags a few things. Done.
It is not the wrong direction. It is the wrong shape. What teams actually need is a validation harness, and the right way to think about a validation harness is to treat it like a CI platform. Many layers. Many specialized agents. Real infrastructure underneath. Different owners for different parts. The single-bot model is to the validation harness what a 2009 Jenkins box (still called Hudson back then) was to a modern internal developer platform.
Here is what that looks like.
01 / The Bottleneck Moved
Generation is cheap. Trust isn't.
The numbers are not subtle. Satya Nadella said in April 2025 that 20 to 30 percent of Microsoft's code is now machine-written. Sundar Pichai put Google's number above 25%. Boris Cherny, who runs Claude Code at Anthropic, says 4% of all public GitHub commits are now Claude-authored and that code output per Anthropic engineer is up 200% in a year. Stripe's Minions ship 1,300 PRs a week, autonomously. Stack Overflow's 2025 survey has 82% of developers using AI weekly.
On the other side of the merge button, almost nothing has changed.
Faros AI's 2026 dataset across 22,000 developers shows PR volume up 98%, PR review time up 441%, bug rate per developer up 54%, incidents per PR up 242%, and the share of PRs merging without any human review up 31%. A 2025 internal Agoda study and METR's randomized trial on experienced open-source maintainers both found the same counterintuitive thing: developers using AI tools were 19% slower on real, large codebases, even though they felt 20% faster. The cost of generation collapsed. The cost of comprehension did not.
This is what a bottleneck shift looks like. Gergely Orosz put it cleanly: speed of typing was never the bottleneck. AWS's Executive in Residence blog reaches for Goldratt directly: "When you use AI assistants to speed up coding, bottlenecks shift elsewhere in the value stream." The 2025 DORA Report (~5,000 respondents) is more careful and arrives at the same place. AI is an amplifier. If your system is mature, it goes faster. If your system is chaotic, the chaos goes faster too.
Right now, for most teams, the chaos is going faster.
"The most striking thing about the 2025 DORA report was that the majority of companies report that AI is just adding more chaos to a system already defined by chaos." Charity Majors, charity.wtf
02 / Why One Bot Is Not Enough
The structural argument against the summary-on-the-PR.
The most popular response to the review crunch has been to drop a single AI reviewer into the PR flow. The bot reads the diff, comments on a few things, posts a summary. This is better than nothing. It is also, by every independent benchmark, catastrophically incomplete.
Greptile's own benchmark on 50 real bugs put Greptile at 82% catch rate, but with 11 false positives per run. Bugbot caught 58%. CodeRabbit caught 44%. Graphite caught 6%. Macroscope's later sweep had nobody above 48%. A 28-PR audit of CodeRabbit found 15% of comments were noise and another 21% were nitpicks. Augment Code's own benchmark put the best single tool at an F-score of 0.59.
There is a reason for this, and it is not that the models are bad. It is structural.
A single AI review bot is, at the end of the day, one LLM call wrapped in a webhook. It has one context window, which means it loses cross-file context. It has one judgment, which means nothing checks its work. It has no execution environment, which means it cannot observe behavior. It has no specialization, which means it averages over everything. It does not know whether your company is bound by HIPAA. It does not know about the CVE that landed in your dependency this morning. It has no memory of last week's incident, so it does not warn you when you reintroduce the bug. It has no adversary, so it never gets stress-tested on its own failures.
Every one of those limitations maps to a missing component. And here is the giveaway: the vendors who built the best single-shot review bots have all already moved past the pattern. Anthropic's Claude Code Review, which Boris Cherny describes as "a team of agents to hunt for bugs," dispatches parallel specialists, ranks findings, and verifies before posting. CodeRabbit's "agentic code validation" runs validators in sandboxes ("tools in jail") that actually execute and stress-test the code. Qodo added cross-repo context. Microsoft's internal PRAssistant runs across 90% of Microsoft PRs (about 600,000 a month) with per-repo configurations.
The architecture is converging. Nobody who is serious about this is shipping a single LLM call anymore.
The May 2025 Lovable vulnerability disclosure (CVE-2025-48757) covered 170+ production apps shipped with missing Row Level Security policies, because the AI-generated stack skipped access checks. Escape.tech later scanned 5,600 live vibe-coded apps and found 2,000+ high-impact vulnerabilities, 175 instances of exposed personal data, and 400+ leaked secrets. These are not edge cases. They are what happens when generation is cheap and validation is a single bot leaving a summary.
03 / A Harness, Not A Bot
Treat validation like a CI platform.
The right mental model is borrowed from a decade ago. Jenkins started as one Java server with a web UI running cron-like jobs. Then it became Jenkinsfile and pipeline-as-code. Then distributed agents, parallel stages, artifact management, ephemeral runners. Then GitHub Actions and GitLab CI and Buildkite and CircleCI. Then internal developer platforms with self-service golden paths. Each iteration absorbed complexity behind an interface that the application team did not have to think about, while specialized teams owned each stage.
Validation is on the same arc. The "single AI review bot" is the Jenkins-1.0 stage. The next stage is a validation harness, which is just a name for what the more serious teams are already building: a layered, mostly parallel system of specialized validation agents running on a shared platform.
For a typical web SaaS team, the baseline harness has four layers:
A standards agent that knows your repo conventions, lint rules, idioms, and the way your team writes code. This is the layer that catches the boring stuff and silences the boring complaints.
A security agent that does SAST, SCA, secret scanning, and dependency analysis, with a hot feed of CVEs and a graph of your codebase. It is the layer that should know about CVE-2025-X this morning, not next quarter.
A functional agent that boots an ephemeral sandbox, pulls the branch, runs the relevant tests, and ideally drives the change behaviorally. It records a video of what the change actually does. It attaches the recording to the PR. The reviewer watches 30 seconds instead of inferring from diff lines.
A summarizer that takes the structured findings from the others, deduplicates, ranks by severity, and produces a single readable verdict with evidence pointers. Without this, a harness becomes noise. With it, the reviewer gets one panel that says "here is what matters, here is the evidence, here is what to look at."
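None of this composes unless every layer emits findings in the same machine-readable shape. A minimal sketch of what that contract might look like, in Python with invented field names (this is not any vendor's actual schema):

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

@dataclass(frozen=True)
class Finding:
    """One structured finding emitted by a validation layer."""
    layer: str              # e.g. "security", "standards", "functional"
    rule: str               # stable identifier, so duplicates can be detected
    severity: Severity
    file: str
    line: int
    message: str
    evidence_url: str = ""  # pointer to a trace, recording, or scanner report

def summarize(findings: list[Finding], top_n: int = 10) -> list[Finding]:
    """Deduplicate by (rule, file, line), then rank by severity for the verdict panel."""
    unique = {(f.rule, f.file, f.line): f for f in findings}
    ranked = sorted(unique.values(), key=lambda f: f.severity.value, reverse=True)
    return ranked[:top_n]
```

The point of the stable rule identifier and the evidence pointer is that the summarizer can deduplicate, rank, and link without another model call.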
That is the baseline. From there, the shape varies with what your team actually ships. A few concrete examples to make the point.
A fintech team handling payments and KYC adds a compliance agent for PCI DSS, PSD2, SOX, and the relevant local regulator. They add a behavioral agent that replays a slice of anonymized production transactions against the change, looking for monetary drift, rounding errors, or currency edge cases. Audit trail integrity becomes a validator. A data-migration safety agent sits alongside the functional layer, flagging schema changes that would lock a hot ledger table during peak hours or backfill a column without a chunked plan — the kind of migration that passes review and bricks production at 3am. The summarizer learns to weight financial-impact findings heavily because the cost of a missed bug is denominated in actual money.
A team at a regulated enterprise shipping a mobile app adds a compliance agent for the policies they answer to (SOC 2, HIPAA, GDPR, the EU AI Act's tiered obligations, internal data residency rules). They add an app-store policy agent that knows about Apple's Guideline 5.1.2(i) on third-party AI disclosure (which quietly landed in November 2025) and Google Play Protect's ML policy checks. They add a behavioral agent that watches crash rates and battery impact across representative device profiles. An accessibility validator becomes load-bearing in this profile — the EU Accessibility Act bound consumer mobile apps from June 2025 and ADA Title III suits against inaccessible apps keep climbing — and a design-regression validator catches the UI drift across device sizes that crash reports never surface.
A team shipping an agentic product adds a red-team agent as one of its most important validators. The red-teamer argues against every change from a permanent oppositional role, generating prompt injections, jailbreak attempts, and malformed tool inputs. They probably also wrap a model evaluation harness as a validator, since they ship prompts and tools alongside code, and a prompt regression is functionally identical to a code regression.
A healthcare team shipping a Software-as-a-Medical-Device platform adds a HIPAA compliance agent, a PHI redaction validator that scans every diff for leaked patient identifiers, and an FDA SaMD lifecycle agent that tracks risk classifications across changes. The functional layer here is heavy on safety scenarios. The bar for a false negative is much higher than for a false positive, and the harness is tuned accordingly.
A games studio shipping a live-ops title adds a cheat-detection validator that flags changes affecting server authority or client trust boundaries. They add an anti-tampering agent for the client binary, a console certification agent for platform requirements, and a live-ops regression agent that replays a slice of player traffic against the change.
An open source maintainer drowning in AI-generated PRs needs the harness less for catching deep bugs and more for triage. The functional agent here is the single most important layer. Did the PR actually run? Did the tests pass? Did the contributor touch files they have never touched before? That last signal alone is often enough to deprioritize most of the slop. A documentation-drift validator catches the next most common signal: a contributor's diff lands without touching the README, the OpenAPI spec, or the examples that still reference the old shape. A license and SBOM validator catches the more dangerous version, where AI-generated code pastes in a copyleft snippet with no provenance.
Across all of these, two layers tend to get added once teams have lived with the harness for a quarter. A telemetry and observability validator that checks every new code path emits the traces, metrics, and structured logs that on-call will need at the right cardinality — the layer that makes sure a silent regression in prod actually looks loud. And the inverse of the security layer: a license and SBOM validator that asks not "is this safe to run" but "is this safe to ship," catching copyleft contamination and dependency provenance gaps before they become a legal problem.
These compositions are not optional add-ons stacked on top of a "real" harness. They are different shapes of the same idea, configured for what the team actually ships. The principle is consistent. The composition is not. And the only way any of these work is if the layers run in parallel, return structured outputs, and are owned by the right teams.
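One way to keep "same idea, different shape" concrete is to treat each profile as a declarative composition over a shared catalogue of layers. A rough sketch, with invented layer names and only a few of the profiles above:

```python
# Hypothetical profile catalogue; the layer names are illustrative, not a product's.
BASELINE = ["standards", "security", "functional", "summarizer"]

PROFILES = {
    "web_saas": BASELINE,
    "fintech": BASELINE + [
        "compliance_pci_psd2_sox",
        "transaction_replay",
        "audit_trail_integrity",
        "migration_safety",
    ],
    "oss_maintainer": [
        "functional",            # did it run, did the tests pass?
        "contributor_signals",   # first touch on unfamiliar files
        "doc_drift",
        "license_sbom",
        "summarizer",
    ],
}

def layers_for(profile: str) -> list[str]:
    """Resolve which validators to fan out, in parallel, for a given repository profile."""
    return PROFILES.get(profile, BASELINE)
```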
04 / The Platform Underneath
The AI is the easy part.
Anyone who has actually shipped one of these will tell you something close to the same line. Stripe, in their own writeup of the Minions: "The primary reason the Minions work has almost nothing to do with the AI model powering them. It has everything to do with the infrastructure that Stripe built for human engineers, years before LLMs existed." Datadog frames their internal approach as "harness-first engineering": invest in checks that can tell you in seconds whether the agent's output is correct, then trust the agent. CodeRabbit's case study with Google Cloud reveals what a credible reviewer actually costs to run: 200+ Cloud Run instances at peak, each 8 vCPU and 32 GiB, two layers of sandboxing, 10 requests per second, reviews running 10 to 20 minutes.
That is the platform. The agents are tenants on it.
Concretely, a real validation harness needs:
Ephemeral sandboxes at scale
Functional validation requires actually running the code, which requires throwaway environments. E2B went from 40,000 sandbox sessions a month in early 2024 to 15 million a month a year later. Daytona claims 27 to 90 millisecond cold starts. Modal supports GPU workloads. Vercel Sandbox, Fly's Sprites (with checkpoint and restore), Northflank (VPC and BYOC for compliance), Bunnyshell and hopx.ai for full-stack environments with databases. The choice of isolation layer (Firecracker or Kata microVMs, gVisor, Sysbox) is a real threat-model decision, not a procurement detail.
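In outline, what the functional layer does with one of these sandboxes is short. The sketch below uses a hypothetical sandbox client, since the real SDKs (E2B, Daytona, Modal, Vercel Sandbox) all differ; every call on `sandbox` is a stand-in:

```python
def validate_functionally(sandbox, repo_url: str, branch: str) -> dict:
    """Boot the branch in a throwaway environment and report what actually happened.

    `sandbox` is a hypothetical client with run() and upload() methods;
    swap in whichever vendor SDK you actually use.
    """
    sandbox.run(f"git clone --depth 1 --branch {branch} {repo_url} app")
    sandbox.run("cd app && make install")              # project-specific bootstrap
    tests = sandbox.run("cd app && make test")         # exit code and logs captured
    demo = sandbox.run("cd app && make e2e-record")    # e.g. a Playwright video
    return {
        "layer": "functional",
        "tests_passed": tests.exit_code == 0,
        "evidence_url": sandbox.upload(demo.artifact_path),  # the recording for the PR
    }
```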
Threat intelligence as a live feed
The security layer is only as good as how recent its threat data is. That means CVE, EPSS, OSV, GHAS, plus your private incident history, all indexed and retrievable by the security agent at runtime. This is the data the dedicated SAST vendors have aggregated for a decade. The harness either ingests it or buys a layer that does.
Context routing between agents
Naive multi-agent fan-out is wildly expensive. AgentPrune found common multi-agent topologies use 2 to 11.8 times the tokens of a simple chain. LangChain's Deep Agents documentation describes offloading content above 20k tokens to filesystem references and compressing message history at 85% context window utilization. Anthropic's research-system writeup is explicit about persisting plans to memory before the 200k cutoff. The schema you use for inter-agent communication is the most important design decision in the harness. Structured JSON findings, not free-text prose, are what let the summarizer compose without re-running the model.
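The offloading pattern itself is small. Rather than pasting a large artifact into the next agent's prompt, write it to shared storage and hand over a reference. A rough sketch, using the 20k-token figure above as an arbitrary threshold and a crude character-based token estimate:

```python
import hashlib
from pathlib import Path

OFFLOAD_THRESHOLD_TOKENS = 20_000
WORKSPACE = Path("/tmp/harness-workspace")

def rough_token_count(text: str) -> int:
    # Crude heuristic (~4 characters per token); use a real tokenizer in practice.
    return len(text) // 4

def pack_for_next_agent(payload: str) -> dict:
    """Pass small payloads inline; offload large ones as a file reference plus preview."""
    if rough_token_count(payload) <= OFFLOAD_THRESHOLD_TOKENS:
        return {"inline": payload}
    WORKSPACE.mkdir(parents=True, exist_ok=True)
    ref = WORKSPACE / f"{hashlib.sha256(payload.encode()).hexdigest()}.txt"
    ref.write_text(payload)
    return {"ref": str(ref), "preview": payload[:500]}
```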
Cost discipline
Running a fleet of validation agents per PR adds up fast, especially when functional and adversarial layers boot real sandboxes and the security layer runs on a frontier model. The teams managing this well route by tier. Heavier models on security-critical layers. Cheaper models on style and conventions. Draft-tier validation on draft PRs, full-tier on main. Boris Cherny's "underfund projects on purpose" principle applies here. Compute scarcity at the validator layer forces good design.
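Tier routing can start as a lookup table keyed on layer and PR state. A sketch with placeholder tier names (map them to whatever models you actually run):

```python
# Placeholder tiers; point these at whatever models you actually run.
TIERS = {
    "frontier": "large-reasoning-model",
    "mid": "general-purpose-model",
    "cheap": "small-fast-model",
}

def pick_model(layer: str, is_draft_pr: bool, target_branch: str) -> str:
    """Heavier models where a miss is expensive, cheaper ones everywhere else."""
    if is_draft_pr:
        return TIERS["cheap"]                    # draft-tier validation on draft PRs
    if layer in {"security", "compliance", "red_team"}:
        return TIERS["frontier"]                 # the layers where misses cost real money
    if layer in {"standards", "summarizer"}:
        return TIERS["cheap"]                    # style and conventions
    return TIERS["mid"] if target_branch == "main" else TIERS["cheap"]
```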
Observability for the harness itself
You cannot manage a fleet of non-deterministic validators without trace-shaped telemetry. Charity Majors has been arguing this for two years: agents need observability designed for parallel async work, not for three-pillar dashboards. Honeycomb, Datadog, and the open Holistic Agent Leaderboard (21,730 rollouts across 9 models and 9 benchmarks) are all converging on the same shape. Without it, false-positive rates drift and nobody notices.
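The minimum viable version is one trace span per validator run, carrying the verdict, finding count, and cost, so false-positive drift shows up next to latency. A sketch using the standard OpenTelemetry Python API; the attribute names are invented:

```python
from opentelemetry import trace

tracer = trace.get_tracer("validation-harness")

def run_validator(name: str, pr_id: str, validator_fn) -> dict:
    """Wrap every validator run in a span so verdicts, volume, and cost are queryable."""
    with tracer.start_as_current_span(f"validator.{name}") as span:
        span.set_attribute("pr.id", pr_id)
        result = validator_fn()
        span.set_attribute("validator.findings", len(result.get("findings", [])))
        span.set_attribute("validator.verdict", result.get("verdict", "unknown"))
        span.set_attribute("validator.cost_usd", result.get("cost_usd", 0.0))
        return result
```

With that in place, a false-positive rate drifting upward becomes a query over spans rather than an anecdote from a frustrated reviewer.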
05 / Who Owns What
You regulate it. You validate it.
One of the most useful frames inherited from the DevOps shift is "you build it, you run it." The team that ships the code owns its operational fate. The validation harness extends this frame in a way that has not been named clearly yet, and that is the thing that matters most for the next two years of org design:
You regulate it. You validate it.
The team that owns a policy owns the agent that enforces it. AppSec owns the security validator and its threat-intel feed. GRC owns the compliance validator and its controls library. The mobile platform team owns the app store policy agent. QA or SRE owns the functional sandbox and its test fixtures. A dedicated AI safety or pen-test team owns the red-team agent. Platform engineering owns the orchestration, the sandboxes, the observability, and the budget.
This is exactly how mature CI ownership evolved. Today, in any well-run company, security owns the SAST stage, the platform team owns the runners, QA owns integration tests, and the application team owns the build. Each contributes their stage. The platform team owns the pipeline. The validation harness inherits all of that, and the AI part is mostly a delivery mechanism.
The reason this matters is regulatory. EU AI Act Article 55 GPAI obligations are already in effect; the high-risk system clauses bind from August 2026. OWASP's Agentic Top 10 shipped in December 2025. Apple quietly updated Guideline 5.1.2(i) to add a third-party-AI disclosure requirement. None of this slows down for a central platform team's ticket queue. If GRC cannot deploy a new compliance validator on its own when an amendment ships, you will be out of compliance before you ship the change.
"I don't ever see AI agents becoming a stand-in for an actual human engineer signing off on a pull request." Greg Foster, Graphite
The point of the harness is not to remove the human. The point is to make the human reviewer's job possible at AI-generated throughput. That means the harness handles the mechanical, the structural, the regulatory, and the reproducible. The human handles judgment, intent, and architecture. Anything else is unsustainable.
06 / The Adversarial Layer
Agents reviewing agents.
One of the more interesting properties of a mature harness is that the validators argue with each other. Constitutional AI established the canonical pattern back in 2022: a critic evaluates outputs against an explicit constitution, then prompts revision. Multi-agent debate scaled this to two agents on opposing sides with a third as judge. RedDebate and BlueCodeAgent extended it to safety and to code generation specifically.
The applied form for a validation harness is straightforward. An implementor agent proposes a verdict ("this change is safe"). A critic agent argues against it from a permanent oppositional role, citing past incidents, known anti-patterns, CVE patterns, and your internal post-mortems compiled into a constitution. A judge agent reconciles, weighing evidence. The constitution updates whenever you have a production incident. Boris Cherny's CLAUDE.md file is a small-scale version of this: "every mistake becomes a rule." A team-level constitution is the same principle applied to validation policy.
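Stripped to control flow, the applied form is a short loop. The sketch below assumes a generic `call(role, prompt)` helper rather than any particular SDK, and a constitution kept as a plain list of clauses:

```python
def adversarial_verdict(diff: str, constitution: list[str], call, rounds: int = 2) -> str:
    """Implementor proposes, critic attacks from a fixed oppositional role, judge reconciles.

    `call(role, prompt)` stands in for however you invoke a model; the constitution
    is a plain list of clauses, updated after every production incident.
    """
    clauses = "\n".join(f"- {c}" for c in constitution)
    verdict = call("implementor", f"Assess this change and propose a verdict:\n{diff}")
    critique = ""
    for _ in range(rounds):
        critique = call(
            "critic",
            f"Argue against this verdict, citing any clause it violates.\n"
            f"Constitution:\n{clauses}\n\nVerdict:\n{verdict}",
        )
        verdict = call(
            "implementor",
            f"Revise your verdict given this critique, or defend it:\n{critique}",
        )
    return call("judge", f"Weigh the evidence and decide.\nVerdict:\n{verdict}\n\nCritique:\n{critique}")
```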
Empirical sweeps of debate patterns across 322 benchmarks show that competitive refinement and debate tournaments outperform cooperative patterns on reasoning tasks by wide margins. Rigid red versus blue role assignments do worse than fluid adversarial roles, because the defender spends most of its token budget reacting. The lesson for harness design is to favor competition over cooperation and fluid roles over fixed ones. The lesson for the regulatory side is that adversarial validation is moving from a quality lever to a compliance requirement.
07 / Routing The Human
Some PRs still need a person. Make sure it's the right one.
There's a temptation, once the harness is working, to imagine the human reviewer eventually disappearing. That's the wrong target. The right target is making sure the human shows up only when their judgment is the thing being asked for, and that when they do show up, they're the right human.
Some changes have to go through a person. SOX requires it for anything that touches financial reporting code. HIPAA requires it for changes to PHI flows. Most internal policies require it for changes to authentication, authorization, payments, key rotation, or anything tagged as architecturally risky. In regulated industries the requirement is statutory: separation of duties means the person who wrote the code cannot be the only one who approves it, and the approval has to be recorded against a named human. No harness gets you out of this. The harness's job here is to make that human's work better, not to remove them.
This is where the harness gets to do something the old single-bot model never could. A routing agent reads the diff, queries the git history of every file touched, looks at who has actually shipped meaningful changes to that code, and produces a ranked list of reviewers. Whoever owns the auth module gets pinged for an auth change. Whoever wrote the original webhook handler gets pinged when someone touches it. CODEOWNERS files have done a static version of this for years; a routing agent does it dynamically and with context. It can also notice things CODEOWNERS cannot, like "the last person who meaningfully edited this file left the company eight months ago, escalate to their replacement" or "this PR touches three modules with disjoint owners, you need all three to sign off."
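A first pass at the dynamic version needs nothing more exotic than git history. A sketch that ranks candidate reviewers by recent commits to the files in the diff; the lookback window and cutoff are arbitrary placeholders:

```python
import subprocess
from collections import Counter

def rank_reviewers(changed_files: list[str], repo_path: str = ".",
                   since: str = "18 months ago", top_n: int = 3) -> list[str]:
    """Rank candidate reviewers by how often they recently committed to the changed files."""
    scores: Counter[str] = Counter()
    for path in changed_files:
        log = subprocess.run(
            ["git", "log", f"--since={since}", "--format=%ae", "--", path],
            capture_output=True, text=True, cwd=repo_path, check=True,
        ).stdout
        scores.update(author for author in log.splitlines() if author)
    return [author for author, _ in scores.most_common(top_n)]
```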
For changes that legally require a named approver, the routing agent has another job: making sure the right reviewer is selected for the right regulatory reason. A change touching financial reporting code needs a SOX-trained reviewer. A change touching PHI needs someone with the HIPAA training on file. The harness can read the change, identify the regime, and route accordingly, with the approval trail captured for the audit.
There are also changes where you want a different shape of human involvement entirely. Not "approve this PR" but "watch this run." The functional agent boots the change in a sandbox, drives it through a few representative flows, attaches the recording. The reviewer watches two minutes of video and approves. This is the right model for changes that are mechanically safe (tests pass, security clean) but have user-facing or behavioral consequences someone with judgment should sign off on. The reviewer doesn't read the diff. They watch the change happen.
And there are changes where the human's role is to confirm a verdict rather than produce one. The harness has determined the change is safe. The summarizer has produced a one-paragraph explanation. The reviewer reads it, glances at the evidence pointers if anything is surprising, and clicks approve. This is the bulk of routine changes, and it should look like routine work. If it doesn't, the harness is leaking too much detail to the reviewer.
The pattern across all of these is the same. The harness handles the mechanical, the structural, the regulatory, and the reproducible. The human handles judgment, risk, accountability, and the things you do not want to defend in an audit by saying "an agent approved it." Different changes need different mixes. The harness's job is to pick the right mix and put the right human in front of it.
08 / Where To Start
A staged checklist.
Most teams cannot go from a single review bot to a seven-layer harness in a quarter. Most should not try. The progression that actually works, drawn from the teams that have made it furthest, looks like this:
Measure first
Instrument PR cycle time, comment-action rate, defect escape rate, incident rate per PR, and review-to-merge ratio. Set a 90-day baseline. Without these numbers you cannot tell whether the harness is working or whether you are just adding noise.
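A hedged sketch of that baseline, computed from whatever PR export your Git host gives you; the field names are invented, so map them to your own data:

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class PRRecord:
    # Invented field names; adapt to whatever your Git host exports.
    hours_open_to_merge: float
    ai_comments: int
    ai_comments_acted_on: int
    human_reviews: int
    caused_incident: bool

def baseline(prs: list[PRRecord]) -> dict:
    """The 90-day numbers to collect before adding any layers to the harness."""
    total = len(prs) or 1
    ai_comments = sum(p.ai_comments for p in prs) or 1
    return {
        "median_pr_cycle_hours": median(p.hours_open_to_merge for p in prs),
        "comment_action_rate": sum(p.ai_comments_acted_on for p in prs) / ai_comments,
        "incidents_per_pr": sum(p.caused_incident for p in prs) / total,
        "unreviewed_merge_share": sum(p.human_reviews == 0 for p in prs) / total,
    }
```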
Adopt one structured AI reviewer
Pick one tool. Calibrate its false-positive rate below 15%. Above that threshold, engineers stop reading and the whole thing becomes ambient noise. CodeRabbit, Greptile, Qodo, Claude Code Review, and Copilot Code Review are all reasonable starting points. The right answer depends on platform fit, not model rank.
Add functional validation in a real sandbox
This is where you get the largest single quality jump and where pure-static reviewers cap out. E2B, Daytona, Modal, or Vercel Sandbox for cloud-only. Bunnyshell or Northflank in VPC for stacks that need databases and compliance.
Decompose your single reviewer into specialists
Split security off first. Give it its own threat-intel feed and its own constitution. Then compliance. Then a summarizer that takes their structured outputs and produces one human-readable verdict. Now you have a real harness.
Hand each validator to its natural owner
AppSec owns security. GRC owns compliance. Mobile platform owns app-store policy. QA owns functional. Platform engineering owns orchestration, sandboxes, observability, and budget. Document the contract between layers as a JSON schema, not a Slack thread.
Add adversarial and learn from incidents
Stand up a red-team agent. Start a team-level constitution. Every production incident becomes a new clause. This is where you cross from a "reviewer farm" into a real harness that gets better over time.
Treat the harness as a product
Give it SLOs. A budget. A false-positive rate target. A roadmap. A team. The same way Backstage and Humanitec turned CI plumbing into internal developer platforms, the harness becomes an internal product with paying tenants (the application teams).
A few signals that should change your priorities at each stage. False-positive rate above 15% means recalibrate before adding layers. Incidents per PR rising while review time falls means you have over-automated and need to slow down. Median PR size above 400 lines means the upstream problem is batch size, not downstream validation.
09 / The Honest Part
What this does not solve.
It would be dishonest to end without naming what a harness still cannot do.
It cannot replace architectural judgment. It cannot tell you whether a design is the right one for where the system is going next year. It cannot mentor the junior on your team or notice when someone is struggling. It cannot prevent the failure mode Gergely Orosz called out, where engineers at large companies game AI-usage metrics by spawning autonomous agents to produce junk code. It cannot fix a team that does not write tests. It cannot save you if your batch sizes are too big or your branching strategy is broken. It cannot prevent the slow erosion of comprehension that Addy Osmani and Simon Willison have both written about, where the codebase keeps growing and nobody fully understands any part of it anymore.
What it can do is move the bottleneck back to where humans add the most value. It can free senior engineers from the mechanical parts of review. It can make AI-generated code shippable in regulated environments where right now it is not. It can give you back the ability to merge with confidence at the rate your generators are producing.
That is the actual prize. Not "AI ships your code." Not "no more code review." The prize is a system where AI generation throughput and human review capacity stay in some kind of balance, and where the balance gets better over time rather than worse.
The teams that get there will be the ones who realized early that "AI code review" is the wrong shape of the problem, stopped looking for a better bot, and started building a platform.
"Your ability to get any returns on your investments into AI will be limited by how swiftly you can validate your changes and learn from them. Another word for this is observability." Charity Majors