
Can AI find the commit that broke prod?

24 scenarios · 13 models · 936 trials · 99% top pass rate · Updated Jun 30, 2026

A monitor pages. p99 latency is up 20x, or pods are getting OOMKilled. Forty commits landed in the last three hours and someone flipped a feature flag. The on-call engineer has to find the one commit that did it, without getting fooled by the innocent deploy that landed thirty seconds before the graph went vertical. RootCauseBench asks whether an LLM can do that. Every model gets the same data and the same shell. We measure the reasoning, not the tooling.

Model ranking

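| # | Model | Overall | Easy | Medium | Hard | No-code-cause |
|---|---|---|---|---|---|---|
| 1 | claude-sonnet-4.6 | 99% | 100% | 100% | 97% | 95% |
| 2 | gemini-3.5-flash | 99% | 100% | 100% | 97% | 95% |
| 3 | gpt-5.5 | 96% | 100% | 100% | 91% | 90% |
| 4 | gpt-5.4 | 96% | 100% | 100% | 91% | 90% |
| 5 | gemini-3.1-pro-preview | 94% | 100% | 96% | 91% | 86% |
| 6 | claude-opus-4.8 | 93% | 83% | 92% | 97% | 100% |
| 7 | gpt-5.4-mini | 90% | 100% | 96% | 85% | 100% |
| 8 | kimi-k2.5 | 85% | 83% | 100% | 70% | 71% |
| 9 | kimi-k2-thinking | 85% | 100% | 92% | 73% | 71% |
| 10 | gpt-oss-120b | 58% | 83% | 62% | 48% | 43% |
| 11 | gemini-3.1-flash-lite | 56% | 100% | 62% | 36% | 14% |
| 12 | claude-haiku-4.5 | 35% | 67% | 21% | 36% | 10% |
| 13 | gpt-oss-20b | 28% | 67% | 29% | 21% | 52% |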

Key finding

claude-sonnet-4.6 and gemini-3.5-flash top the board at 99%, near perfect across every tier. The differentiator is the no-code-cause column, which measures confabulation resistance: when there is no guilty commit, does the model answer "none" or convict an innocent one? claude-opus-4.8 and gpt-5.4-mini abstain perfectly (100%), while gemini-3.1-flash-lite (14%) and claude-haiku-4.5 (10%) invent a culprit on most no-cause incidents.

Scenarios

| Scenario | Type | What it tests |
|---|---|---|
| checkout-latency-n-plus-one | Real culprit | An added per-item catalog query inside the order loop spikes checkout p99 (N+1). |
| payment-nil-deref-panic | Real culprit | A missing nil-check on an optional 3DS field panics every charge. |
| inventory-connection-pool-exhaustion | Real culprit | Inventory exhausts its pool, but the culprit is the shared DB client library. |
| recommendation-memory-leak | Real culprit | A package-level slice grows unbounded and OOMKills long after the deploy. |
| auth-jwt-validation-regression | Real culprit | Every service rejects tokens at once. The culprit is the shared JWT verify library. |
| cache-ttl-stampede | Real culprit | A TTL change stampedes productdb minutes after the deploy. |
| frontend-race-condition-5xx | Real culprit | A removed mutex in the rate limiter races under load and throws 5xx. |
| grpc-deadline-too-tight | Real culprit | A tightened gRPC deadline on a catalog call starts failing slow requests. |
| unbounded-query-delayed-onset | Real culprit | A deleted LIMIT clause makes a search query degrade as data grows. |
| logging-debug-disk-fill | Real culprit | A config flip to debug logging at full sample rate fills the disk. |
| search-mapping-query-break | Real culprit | A renamed index field (title to name) breaks search queries. |
| dashboard-db-schema-missing-table | Real culprit | Code queries a favorites table whose migration never shipped. |
| ai-agent-registration-missing | Real culprit | A bootstrap refactor deletes the agent-registration call at startup. |
| metric-ingestor-metadata-deser | Real culprit | A shared schema field rename (meta to metadata) breaks deserialization. |
| olapdb-tso-cas-retry-budget | Real culprit | A shared TSO client drops its CAS retry budget, causing timeouts. |
| dynamodb-write-capacity-breach | Real culprit | A shared persistence-lib change drives writes past DynamoDB capacity. |
| transformer-dependency-startup-crash | Real culprit | A protobuf-runtime version clash crashes the transformer into CrashLoopBackOff. |
| bad-data-poison-record | No code cause | An external partner feed sends one poison record. No commit. Answer: none. |
| upstream-payment-provider-outage | No code cause | An external Stripe outage, visible in the status feed. Answer: none. |
| cloud-region-impairment | No code cause | An AWS S3 us-east-1 regional impairment. Answer: none. |
| dns-resolver-degradation | No code cause | Cluster DNS and upstream resolver degradation, not a commit. Answer: none. |
| tls-cert-expiry | No code cause | The payment service leaf certificate expired. Answer: none. |
| traffic-surge-flash-sale | No code cause | A flash sale drives a 6x organic surge, not a regression. Answer: none. |
| noisy-neighbor-node-saturation | No code cause | A batch pod lands on the node and starves its neighbors. Answer: none. |

What we measure

Given a frozen incident, the model writes a single JSON answer: the root-cause commit, the first failing service, the blast radius, and a remediation. Only one field gates pass or fail.

  • Primary, binary: root_cause_commit must exactly match the ground-truth culprit SHA (a correct short prefix is accepted).
  • Commit messages are neutralized, so the model must reason from the diff, not the description.
  • Secondary, never fatal: first-failing-service, blast-radius overlap, remediation match, and whether the model fell for the innocent-deploy decoy.
  • Seven incidents have no code cause at all. The correct answer is "none", which measures confabulation resistance.
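
The pass/fail rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scorer: only the field name root_cause_commit and the "none" answer come from the description above. The other JSON field names, the 7-character minimum prefix length, and the example SHA are assumptions made for the sketch.

```python
import json

NONE_ANSWER = "none"
MIN_PREFIX_LEN = 7  # assumption: the shortest SHA prefix treated as valid


def passes(answer_json: str, truth_sha: str) -> bool:
    """Return True if root_cause_commit matches the ground truth.

    truth_sha is either a full commit SHA or the string "none"
    for incidents with no code cause.
    """
    answer = json.loads(answer_json)
    value = answer.get("root_cause_commit")
    if not isinstance(value, str):
        return False  # missing or null field never passes
    predicted = value.strip().lower()
    truth = truth_sha.strip().lower()

    if truth == NONE_ANSWER:
        # No guilty commit: only an explicit "none" passes.
        return predicted == NONE_ANSWER
    if predicted == NONE_ANSWER or len(predicted) < MIN_PREFIX_LEN:
        return False
    # Full SHA or a correct short prefix of it.
    return truth.startswith(predicted)


# Hypothetical answer; every field except root_cause_commit is illustrative.
example_answer = json.dumps({
    "root_cause_commit": "3f9c2ab",
    "first_failing_service": "checkout",
    "blast_radius": ["checkout", "frontend"],
    "remediation": "revert the per-item catalog query",
})
TRUTH = "3f9c2ab71d0e4c5b8a6f1e2d3c4b5a69788796a5"  # made-up SHA

print(passes(example_answer, TRUTH))   # True: correct short prefix
print(passes(example_answer, "none"))  # False: convicted a commit on a no-cause incident
```

The secondary fields are parsed but ignored here, which mirrors the rule that they are never fatal.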

How scenarios are built

Scenarios are either fault injections on a real microservices app (a fork of Online Boutique) or reconstructions of representative production incident classes on a fictional platform. Each is a frozen window containing the alert, logs, metrics, traces, patterns, and the full change context (commits, deploys, flags).

  • Author one regression commit (for example an N+1 query, nil dereference, pool exhaustion, or memory leak) and surround it with dozens of innocent commits.
  • Plant an innocent deploy near onset to punish the "blame the latest change" heuristic.
  • Allow delayed onset, so the bad deploy can detonate minutes later.
  • Hand-label the ground truth and keep it out of the agent container.
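
To see why the decoy and delayed onset matter, here is a minimal sketch of the "blame the latest change" heuristic run against a toy change context. All of it is hypothetical: the Deploy record, the SHAs, the service names, and the timestamps are invented for illustration and are not taken from any benchmark scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deploy:
    sha: str          # hypothetical short SHA
    service: str
    deployed_at: int  # seconds since the start of the frozen window


def blame_latest(deploys: list[Deploy], onset: int) -> str:
    """Naive heuristic: convict the most recent deploy at or before onset."""
    before = [d for d in deploys if d.deployed_at <= onset]
    if not before:
        return "none"
    return max(before, key=lambda d: d.deployed_at).sha


deploys = [
    Deploy("c0ffee1", "catalog", 600),    # innocent
    Deploy("bad0c0d", "checkout", 3600),  # culprit, detonates 30 minutes later
    Deploy("decaf00", "frontend", 5370),  # innocent decoy, 30 s before onset
]
onset = 5400
truth = "bad0c0d"

guess = blame_latest(deploys, onset)
print(guess, guess == truth)  # decaf00 False
```

The heuristic convicts the decoy because it ranks on timing alone. Getting to the culprit requires reading the diffs against the telemetry.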

Run it yourself

Running the benchmark requires Harbor, Docker, and an OpenRouter key. Re-scoring a published trajectory needs no API key.

git clone https://github.com/edgedelta/root-cause-bench.git
cd root-cause-bench

# put OPENROUTER_API_KEY=... in .env, then:
source .env && uv run harbor run -c configs/all-models-docker.yaml

# summarize a run into a per-model table:
uv run scripts/process_results.py jobs/<timestamp>
