Can AI find the commit that broke prod?
A monitor pages. p99 latency is up 20x, or pods are getting OOMKilled. Forty commits landed in the last three hours and someone flipped a feature flag. The on-call engineer has to find the one commit that did it, without getting fooled by the innocent deploy that landed thirty seconds before the graph went vertical. RootCauseBench asks whether an LLM can do that. Every model gets the same data and the same shell. We measure the reasoning, not the tooling.
Model ranking
| # | Model | Overall | easy | medium | hard | no-code-cause |
|---|---|---|---|---|---|---|
| 1 | claude-sonnet-4.6 | 99% | 100% | 100% | 97% | 95% |
| 2 | gemini-3.5-flash | 99% | 100% | 100% | 97% | 95% |
| 3 | gpt-5.5 | 96% | 100% | 100% | 91% | 90% |
| 4 | gpt-5.4 | 96% | 100% | 100% | 91% | 90% |
| 5 | gemini-3.1-pro-preview | 94% | 100% | 96% | 91% | 86% |
| 6 | claude-opus-4.8 | 93% | 83% | 92% | 97% | 100% |
| 7 | gpt-5.4-mini | 90% | 100% | 96% | 85% | 100% |
| 8 | kimi-k2.5 | 85% | 83% | 100% | 70% | 71% |
| 9 | kimi-k2-thinking | 85% | 100% | 92% | 73% | 71% |
| 10 | gpt-oss-120b | 58% | 83% | 62% | 48% | 43% |
| 11 | gemini-3.1-flash-lite | 56% | 100% | 62% | 36% | 14% |
| 12 | claude-haiku-4.5 | 35% | 67% | 21% | 36% | 10% |
| 13 | gpt-oss-20b | 28% | 67% | 29% | 21% | 52% |
Key finding
claude-sonnet-4.6 and gemini-3.5-flash top the board at 99%, near perfect across every tier. The differentiator is the no-code-cause column, which measures confabulation resistance: when there is no guilty commit, does the model answer "none" or convict an innocent one? claude-opus-4.8 and gpt-5.4-mini abstain perfectly (100%), while gemini-3.1-flash-lite (14%) and claude-haiku-4.5 (10%) invent a culprit on most no-cause incidents.
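As a rough illustration of what the no-code-cause column captures, the sketch below computes the share of no-cause incidents on which a model answered "none". The record layout, the model names, and the SHAs are hypothetical, not the benchmark's actual result format.

```python
from collections import defaultdict


def abstention_rates(results):
    """Return, per model, the fraction of no-cause incidents answered "none".

    `results` is an iterable of (model, answer) pairs covering only
    incidents whose ground truth is "none" (hypothetical layout).
    """
    totals = defaultdict(int)
    abstained = defaultdict(int)
    for model, answer in results:
        totals[model] += 1
        if answer.strip().lower() == "none":
            abstained[model] += 1
    return {model: abstained[model] / totals[model] for model in totals}


# Made-up sample: model-a abstains three times out of four,
# model-b convicts an innocent commit three times out of four.
sample = [
    ("model-a", "none"), ("model-a", "none"),
    ("model-a", "3f9c2ab"), ("model-a", "none"),
    ("model-b", "a1b2c3d"), ("model-b", "none"),
    ("model-b", "77e0f14"), ("model-b", "c0ffee1"),
]
rates = abstention_rates(sample)
print(rates)
```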
Scenarios
| Scenario | Type | What it tests |
|---|---|---|
| checkout-latency-n-plus-one | Real culprit | An added per-item catalog query inside the order loop spikes checkout p99 (N+1). |
| payment-nil-deref-panic | Real culprit | A missing nil-check on an optional 3DS field panics every charge. |
| inventory-connection-pool-exhaustion | Real culprit | Inventory exhausts its pool, but the culprit is the shared DB client library. |
| recommendation-memory-leak | Real culprit | A package-level slice grows unbounded and OOMKills long after the deploy. |
| auth-jwt-validation-regression | Real culprit | Every service rejects tokens at once. The culprit is the shared JWT verify library. |
| cache-ttl-stampede | Real culprit | A TTL change stampedes productdb minutes after the deploy. |
| frontend-race-condition-5xx | Real culprit | A removed mutex in the rate limiter races under load and throws 5xx. |
| grpc-deadline-too-tight | Real culprit | A tightened gRPC deadline on a catalog call starts failing slow requests. |
| unbounded-query-delayed-onset | Real culprit | A deleted LIMIT clause makes a search query degrade as data grows. |
| logging-debug-disk-fill | Real culprit | A config flip to debug logging at full sample rate fills the disk. |
| search-mapping-query-break | Real culprit | A renamed index field (title to name) breaks search queries. |
| dashboard-db-schema-missing-table | Real culprit | Code queries a favorites table whose migration never shipped. |
| ai-agent-registration-missing | Real culprit | A bootstrap refactor deletes the agent-registration call at startup. |
| metric-ingestor-metadata-deser | Real culprit | A shared schema field rename (meta to metadata) breaks deserialization. |
| olapdb-tso-cas-retry-budget | Real culprit | A shared TSO client drops its CAS retry budget, causing timeouts. |
| dynamodb-write-capacity-breach | Real culprit | A shared persistence-lib change drives writes past DynamoDB capacity. |
| transformer-dependency-startup-crash | Real culprit | A protobuf-runtime version clash crashes the transformer into CrashLoopBackOff. |
| bad-data-poison-record | No code cause | An external partner feed sends one poison record. No commit. Answer: none. |
| upstream-payment-provider-outage | No code cause | An external Stripe outage, visible in the status feed. Answer: none. |
| cloud-region-impairment | No code cause | An AWS S3 us-east-1 regional impairment. Answer: none. |
| dns-resolver-degradation | No code cause | Cluster DNS and upstream resolver degradation, not a commit. Answer: none. |
| tls-cert-expiry | No code cause | The payment service leaf certificate expired. Answer: none. |
| traffic-surge-flash-sale | No code cause | A flash sale drives a 6x organic surge, not a regression. Answer: none. |
| noisy-neighbor-node-saturation | No code cause | A batch pod lands on the node and starves its neighbors. Answer: none. |
What we measure
Given a frozen incident, the model writes a single JSON answer: the root-cause commit, the first failing service, the blast radius, and a remediation. Only one field gates pass or fail.
- Primary, binary: root_cause_commit must exactly match the ground-truth culprit SHA (a correct short prefix is accepted).
- Commit messages are neutralized, so the model must reason from the diff, not the description.
- Secondary, never fatal: first-failing-service, blast-radius overlap, remediation match, and whether the model fell for the innocent-deploy decoy.
- Seven incidents have no code cause at all. The correct answer is "none", which measures confabulation resistance.
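The gating rule can be sketched in a few lines. This is a minimal illustration, not the benchmark's scorer: only root_cause_commit is a field name taken from the text above. The other field names, the 7-character minimum prefix, and the use of Jaccard overlap for blast radius are assumptions, and remediation matching is left out because the text does not say how it is judged.

```python
import json

MIN_PREFIX_LEN = 7  # assumption: shortest SHA prefix treated as unambiguous


def commit_matches(answer: str, truth: str) -> bool:
    """True if `answer` names the same commit as `truth` (or both are "none")."""
    answer = answer.strip().lower()
    truth = truth.strip().lower()
    if truth == "none":
        return answer == "none"
    return len(answer) >= MIN_PREFIX_LEN and truth.startswith(answer)


def score(answer: dict, truth: dict) -> dict:
    predicted = str(answer.get("root_cause_commit", ""))
    # Primary, binary: the only field that gates pass or fail.
    passed = commit_matches(predicted, truth["root_cause_commit"])

    # Secondary, never fatal (field names are hypothetical).
    predicted_radius = set(answer.get("blast_radius", []))
    true_radius = set(truth.get("blast_radius", []))
    union = predicted_radius | true_radius
    overlap = len(predicted_radius & true_radius) / len(union) if union else 1.0
    decoy = truth.get("decoy_commit")
    fell_for_decoy = bool(decoy) and commit_matches(predicted, decoy)

    return {
        "passed": passed,
        "first_failing_service_match": answer.get("first_failing_service")
        == truth.get("first_failing_service"),
        "blast_radius_overlap": overlap,
        "fell_for_decoy": fell_for_decoy,
    }


# Made-up ground truth and model answer.
truth = {
    "root_cause_commit": "9b1d4e7a2c3f5b6d8e0f1a2b3c4d5e6f7a8b9c0d",
    "first_failing_service": "checkout",
    "blast_radius": ["checkout", "frontend"],
    "decoy_commit": "4e2a9c1f",
}
answer = json.loads(
    '{"root_cause_commit": "9b1d4e7a", "first_failing_service": "checkout",'
    ' "blast_radius": ["checkout"], "remediation": "revert the commit"}'
)
result = score(answer, truth)
print(result)
```

With this rule a model that gets every secondary field right but names the wrong commit still fails, and a model that answers "none" on a no-cause incident passes regardless of the rest.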
How scenarios are built
Scenarios are either fault injections on a real microservices app (a fork of Online Boutique) or reconstructions of representative production incident classes on a fictional platform. Each is a frozen window containing the alert, logs, metrics, traces, patterns, and full change context (commits, deploys, flags).
- Author one regression commit (N+1 query, nil deref, pool exhaustion, memory leak) and surround it with dozens of innocent commits.
- Plant an innocent deploy near onset to punish the "blame the latest change" heuristic.
- Allow delayed onset, so the bad deploy can detonate minutes later.
- Hand-label the ground truth and keep it out of the agent container.
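To see why the decoy and delayed onset matter together, consider the naive baseline they are designed to defeat. The sketch below is an illustration with made-up SHAs, services, and timestamps. The Deploy type and blame_latest function are hypothetical and not part of the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Deploy:
    sha: str
    service: str
    deployed_at: int  # seconds since the start of the incident window


def blame_latest(deploys, onset):
    """Naive heuristic: blame whichever deploy landed last before onset."""
    before = [d for d in deploys if d.deployed_at <= onset]
    if not before:
        return "none"
    return max(before, key=lambda d: d.deployed_at).sha


CULPRIT = "a11ce01"  # slow memory leak, deployed early in the window
DECOY = "dec0de3"    # innocent deploy, thirty seconds before onset

deploys = [
    Deploy(CULPRIT, "recommendation", 600),
    Deploy("b0b0b02", "frontend", 4200),
    Deploy(DECOY, "cart", 5370),
]
onset = 5400  # the graph goes vertical here, long after the culprit shipped

guess = blame_latest(deploys, onset)
print(guess, guess == CULPRIT)
```

The heuristic convicts the decoy because it only looks at timing. Separating the two requires reading the diffs and telemetry, which is what the benchmark is trying to measure.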
Run it yourself
Running the benchmark requires Harbor, Docker, and an OpenRouter key. Re-scoring a published trajectory needs no API key.
```shell
git clone https://github.com/edgedelta/root-cause-bench.git
cd root-cause-bench
# put OPENROUTER_API_KEY=... in .env, then:
source .env && uv run harbor run -c configs/all-models-docker.yaml
# summarize a run into a per-model table:
uv run scripts/process_results.py jobs/<timestamp>
```