# Three Articles That Changed How I Think About AI Agents
In February 2026, three blog posts landed within weeks of each other:
- Ramp’s Inspect — they run coding agents in Modal sandboxes, each task isolated, with self-verification loops before any human sees the output.
- Stripe’s Minions — they introduced “Blueprints” (deterministic scaffolding around LLM reasoning), limited CI iterations (because returns diminish fast), and shift-left feedback (catch errors locally before pushing).
- A USCardForum post on agent architecture — this one hit different. The author argued that AI agents have a structural blind spot: they can’t question their own premises. They optimize for task completion, not for being right.
I’d been running my own issue-hunter tools for months — single-agent scripts that pick up GitHub issues, write fixes, and open PRs. They worked, sometimes. But the failure modes were always the same: the agent gets stuck in a loop, duplicates existing code it didn’t bother to search for, or confidently submits a fix that doesn’t actually fix anything.
These three articles gave me a framework for why.
## The Problem with One Agent
A single agent doing everything — read issue, explore codebase, implement fix, run tests, create PR — is like asking one person to be architect, developer, tester, and code reviewer simultaneously. On the same code. In the same brain.
The failure modes are predictable:
Information cocoon. The agent commits to approach A early. When approach A doesn’t work, it tries A′, A″, A‴. It never steps back and asks “should I try B?” Humans do this naturally — we call it “sleeping on it” or “asking a colleague.” Agents don’t sleep, and they don’t have colleagues.
No meta-cognition. The agent can’t think about what it’s thinking. It can’t evaluate whether its current direction is promising or a dead end. It just… keeps going. The USCardForum post nailed this: humans have the ability to judge the current state and refuse to go along with a bad plan. Agents don’t.
Context pollution. When one agent works on multiple tasks sequentially, state leaks between tasks. I’ve had PRs that included changes from a completely different issue because the agent’s working directory wasn’t clean.
## Orbit: The Harness
So I built Orbit — not a smarter agent, but a smarter way to use agents.
```text
You (Orchestrator)
│
├── Scout agent → explores repo, saves conventions
│
├── Router → assesses complexity, picks agent type
│
├── Hunter agent (background) ← simple issues
├── Hunter agent (background) ← simple issues
├── Hunter-Pro agent (background) ← complex issues
│
├── Verifier agent ← independent review
│
└── Reworker agent ← fixes based on verifier feedback
```
The key ideas, stolen shamelessly from the three articles above:
### 1. State Lives in Files, Not in Agent Brains
Inspired by Ramp’s approach and lean-collab’s “Ensue Memory” pattern. Every task’s lifecycle lives in .orbit/tasks.json:
```json
{
  "id": "4a5266b3",
  "repo": "tw93/Kaku",
  "issue": 153,
  "state": "rework",
  "pr_url": "https://github.com/tw93/Kaku/pull/155",
  "verify_status": "fail",
  "verify_feedback": "Implementation over-engineered",
  "rework_attempts": 1
}
```
Agents don’t need to remember what happened. The file remembers. When an agent dies, crashes, or runs out of context — the state survives. Another agent can pick up where it left off.
This is the single most important design decision. Context windows are expensive and fragile. Files are free and permanent.
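The persistence layer for this can be tiny. Here is a minimal sketch of what it might look like; the `load_tasks`/`update_task` helpers and the write-then-rename trick are my illustration, not Orbit's actual code:

```python
import json
from pathlib import Path

TASKS_FILE = Path(".orbit/tasks.json")

def load_tasks() -> list[dict]:
    # An absent file just means no tasks have been dispatched yet.
    if not TASKS_FILE.exists():
        return []
    return json.loads(TASKS_FILE.read_text())

def update_task(task_id: str, **fields) -> None:
    # Rewrite one task's fields, then persist the whole list.
    tasks = load_tasks()
    for task in tasks:
        if task["id"] == task_id:
            task.update(fields)
    # Write to a temp file and rename: the rename is atomic on POSIX
    # filesystems, so a crashing agent never leaves a half-written
    # state file behind for the next agent to choke on.
    tmp = TASKS_FILE.with_suffix(".tmp")
    tmp.write_text(json.dumps(tasks, indent=2))
    tmp.replace(TASKS_FILE)
```

The rename step is the part that earns its keep: an agent can die at any instruction, and the state file is always either the old version or the new one, never a corrupt mix.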
### 2. One Task, One Directory
Every agent clones into /tmp/orbit-{task_id}/. Not the current directory. Not a shared workspace.
I learned this the hard way. In my second run, two agents both cloned googleworkspace/cli into the same cli/ directory. Agent B picked up Agent A’s uncommitted changes. The PR included fixes for two different issues mashed together. The verifier caught it — “Scope: 1/2, includes unrelated gmail scope changes.”
Isolation is not optional. It’s structural.
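Enforcing this is one function. A sketch under my own naming (the `make_workspace` helper and its collision check are hypothetical, not Orbit's code):

```python
import subprocess
from pathlib import Path

def make_workspace(task_id: str, repo: str) -> Path:
    # One task, one directory: nothing shared, nothing inherited.
    workdir = Path(f"/tmp/orbit-{task_id}")
    if workdir.exists():
        # A collision means another task (or a stale run) owns this
        # directory; refusing loudly is safer than silently sharing it.
        raise RuntimeError(f"workspace collision for task {task_id}")
    # Shallow clone: the agent needs the working tree, not the history.
    subprocess.run(
        ["git", "clone", "--depth", "1",
         f"https://github.com/{repo}.git", str(workdir)],
        check=True,
    )
    return workdir
```

Failing hard on a dirty workspace is the point: the googleworkspace/cli incident above happened precisely because two agents quietly shared one directory.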
### 3. Complexity Router
Not every issue needs a senior engineer. The router assesses each issue and dispatches accordingly:
| Signal | Agent |
|---|---|
| Typo, docs, config change | Hunter (fast, 2 iterations max) |
| Bug with clear repro, single module | Hunter (fast, 2 iterations max) |
| Multi-file, cross-module | Hunter-Pro (deep analysis, 10 iterations) |
| Architecture change | Skip — report to human |
Simple issues had a 100% pass rate. Complex issues: 75%. Knowing when NOT to use the expensive agent matters as much as having it.
### 4. Independent Verification
This is from the USCardForum post’s core argument: the agent that wrote the code should never be the one to verify it.
The verifier is a separate agent. It has never seen the implementation. It reads the issue, reads the PR diff, and scores on four dimensions:
- Relevance (0-2): Does this change address the issue?
- Completeness (0-2): Is the fix complete?
- Correctness (0-2): Is the logic right?
- Scope (0-2): Are changes minimal and focused?
PASS requires ≥ 6/8 and no dimension at 0. In practice, this caught:
- A PR that duplicated an existing `get_logs_dir()` function (scored 2/8 — FAIL)
- A PR that copied an entire 137-line file to change 3 lines (scored 5/8 — FAIL)
- Cross-task pollution where unrelated changes leaked in (scored 6/8 — borderline PASS)
The verifier is not smarter than the implementer. It’s just looking from a different angle. That’s enough.
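The pass rule itself is mechanical. A sketch of the scoring gate (the `Verdict` class is my illustration of the rule, not Orbit's code):

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    relevance: int     # 0-2: does the change address the issue?
    completeness: int  # 0-2: is the fix complete?
    correctness: int   # 0-2: is the logic right?
    scope: int         # 0-2: are changes minimal and focused?

    def passes(self) -> bool:
        scores = (self.relevance, self.completeness,
                  self.correctness, self.scope)
        # PASS needs a total of at least 6/8 AND no dimension at 0:
        # a zero anywhere means the PR is broken on that axis,
        # however strong the other three look.
        return sum(scores) >= 6 and 0 not in scores
```

The "no dimension at 0" clause matters: a PR can total 6/8 while being completely irrelevant to the issue, and the gate still rejects it.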
### 5. Feedback Loop
When the verifier fails a PR, the task enters “rework” state. A reworker agent checks out the existing PR branch, reads the verifier’s specific criticism, and makes surgical fixes. Not a rewrite — a targeted improvement.
```text
Agent submits PR → Verifier: FAIL "over-engineered, copied entire file"
  → Reworker: checks out branch, simplifies to 3-line patch, pushes
  → Verifier: PASS (8/8)
```
Max one rework. If it fails twice, it’s a human problem.
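The rework cap reduces to a three-line transition function. A sketch, assuming the task fields from `.orbit/tasks.json` shown earlier (the function itself is my illustration):

```python
MAX_REWORKS = 1

def next_state(task: dict, verify_status: str) -> str:
    """Decide where a task goes after a verification pass/fail."""
    if verify_status == "pass":
        return "done"
    # A failing PR gets exactly one shot at rework; a second failure
    # escalates to a human instead of looping forever.
    if task.get("rework_attempts", 0) < MAX_REWORKS:
        return "rework"
    return "needs_human"
```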
### 6. Repo Knowledge — Agents That Learn Rules
Here’s something no article prepared me for.
I ran Orbit against openclaw/openclaw. The agent submitted a PR. Five minutes later, a bot closed it:
“Closing this PR because the author has more than 10 active PRs in this repo.”
The agent didn’t know this rule existed. How could it? It’s not in the issue, not in the README, not in any code. It’s a bot configuration that you only discover by hitting the wall.
So I added a scout agent. Before dispatching any workers to a new repo, the scout explores:
- Branch naming conventions
- Test commands
- PR templates
- Bot rules (PR limits, CLA requirements)
- Gotchas from recent closed PRs
It saves everything to .orbit/repo-knowledge/openclaw-openclaw.md. Future agents read this before starting work. They know the rules before they break them.
This is the piece that makes agents actually viable for contributing to open source. Every repo is different. Every repo has unwritten rules. An agent that doesn’t learn them will keep hitting walls.
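Consuming that knowledge is the easy half. A sketch of how workers might load it before starting (the filename mapping mirrors the `openclaw-openclaw.md` example above; the helper itself is hypothetical):

```python
from pathlib import Path

KNOWLEDGE_DIR = Path(".orbit/repo-knowledge")

def knowledge_for(repo: str) -> str:
    """Return the scout's notes for a repo, or "" if it was never
    scouted. 'openclaw/openclaw' maps to 'openclaw-openclaw.md';
    the contents get prepended to a worker's prompt so it knows
    branch conventions and bot rules before its first commit."""
    path = KNOWLEDGE_DIR / (repo.replace("/", "-") + ".md")
    return path.read_text() if path.exists() else ""
```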
## The Numbers
Two runs across 4 repos (HKUDS/nanobot, tw93/Kaku, openclaw/openclaw, googleworkspace/cli), 10 issues total:
| Metric | Result |
|---|---|
| PRs created | 10/10 (100%) |
| Verified PASS | 8/10 (80%) |
| Perfect score (8/8) | 3 PRs |
| Average score (passing) | 7.0/8 |
| Simple issues pass rate | 100% |
| Complex issues pass rate | 75% |
Run 1 (single repo, no harness improvements): 67% pass rate. Run 2 (multi-repo, with isolation + verification): 86% pass rate.
The improvement came entirely from the harness, not from a better model.
## Where It Still Fails
### The Direction Problem
The verifier can catch bad implementations. It cannot catch bad directions. If an agent decides to solve an issue by adding a new utility file when the fix should go in an existing module, the verifier might still pass it — the logic is correct, the tests pass, the scope is reasonable. But the approach is wrong.
A human would look at the PR and say “wait, why didn’t you just modify the existing helper?” An agent doesn’t have that instinct.
### The Stuck Loop
Despite iteration limits, agents still get stuck. They try the same approach with minor variations, burning through their iteration budget without changing strategy. A human debugging the same issue would step away, re-read the error message, and try something completely different.
The agent’s context window is its prison. Everything it’s tried is right there, biasing it toward more of the same. Humans forget their failed attempts — and that forgetting is actually useful.
### The Unwritten Rules Problem (Partially Solved)
The scout agent helps, but it only catches rules that are visible in the repo’s history. Some rules are communal knowledge — “don’t touch that module, it’s being refactored” or “the maintainer prefers functional style.” These exist in Slack channels, in contributors’ heads, nowhere an agent can find them.
## The Ceiling
Here’s what I think the ceiling is:
AI agents are power tools. A power drill doesn’t decide where to put the hole. A human decides. The drill executes. If you point the drill at the wrong spot, it will very efficiently make a hole in the wrong place.
The harness (Orbit, Ramp’s system, Stripe’s Minions) is the jig — the guide that keeps the drill straight. It can prevent wobble, it can set depth limits, it can verify the result. But it can’t decide where the hole should go.
That decision — the direction, the strategy, the “should we even do this?” — remains human territory. Not because AI isn’t smart enough, but because it lacks the ability to question its own premises. It can reason within a frame. It cannot step outside the frame and ask “is this the right frame?”
The articles call this different things. Ramp calls it “self-verification.” Stripe calls it “shift-left.” The USCardForum post calls it “the limits of agent deception.” They’re all describing the same boundary: AI can optimize within constraints, but it can’t evaluate whether the constraints are right.
## What This Means
The most productive setup isn’t “human vs AI” or “AI replaces human.” It’s:
```text
Human: direction, judgment, strategy
        ↓
Harness: orchestration, isolation, verification
        ↓
Agents: implementation, testing, iteration
```
The harness amplifies human judgment. One human, through Orbit, can direct 7 agents across 3 repos simultaneously. The leverage isn’t in the agents being smart — it’s in the human’s judgment being multiplied.
In the cloud computing era, EC2 was the outlet and Kubernetes was the harness. The real value was in the orchestration layer, not the compute.
In the AI era, Claude/GPT is the outlet. The harness is what makes it useful at scale. And the human is what makes it useful at all.