Can AI find the commit that broke prod?

24 scenarios, 20 models, 1,440 trials. Top pass rate: 100%. Updated Jul 10, 2026.

A monitor pages. p99 latency is up 20x, or pods are getting OOMKilled. Forty commits landed in the last three hours and someone flipped a feature flag. The on-call engineer has to find the one commit that did it, without getting fooled by the innocent deploy that landed thirty seconds before the graph went vertical. RootCauseBench asks whether an LLM can do that. Every model gets the same data and the same shell. We measure the reasoning, not the tooling.

Model ranking

#	Model	Graded reward (95% CI)	Pass rate	easy	medium	hard	no-code-cause
1	glm-5.2	1.000 ± 0.000	100%	100%	100%	100%	100%
2	grok-4.5	1.000 ± 0.000	100%	100%	100%	100%	100%
3	claude-opus-4.8	0.986 ± 0.027	99%	100%	100%	97%	100%
4	claude-fable-5	0.979 ± 0.030	97%	100%	100%	94%	100%
5	gpt-5.4	0.972 ± 0.038	97%	100%	100%	94%	100%
6	fugu-ultra	0.972 ± 0.038	97%	100%	100%	94%	95%
7	deepseek-v4-flash	0.964 ± 0.041	96%	100%	96%	94%	90%
8	gpt-5.5	0.958 ± 0.046	96%	100%	100%	92%	90%
9	gemini-3.5-flash	0.958 ± 0.046	96%	100%	100%	92%	86%
10	gemini-3.1-pro-preview	0.958 ± 0.046	96%	100%	100%	92%	86%
11	claude-sonnet-4.6	0.958 ± 0.046	96%	100%	100%	92%	86%
12	kimi-k2.5	0.875 ± 0.077	88%	100%	96%	78%	67%
13	kimi-k2-thinking	0.875 ± 0.074	86%	100%	89%	81%	67%
14	gpt-5.4-mini	0.851 ± 0.076	82%	78%	89%	78%	90%
15	qwen3-235b-a22b-2507	0.767 ± 0.096	75%	89%	74%	72%	86%
16	gemini-3.1-flash-lite	0.607 ± 0.112	60%	100%	59%	50%	29%
17	gpt-oss-120b	0.534 ± 0.111	50%	100%	56%	33%	38%
18	claude-haiku-4.5	0.507 ± 0.111	47%	78%	41%	44%	33%
19	qwen3-32b	0.450 ± 0.111	42%	33%	52%	36%	38%
20	gpt-oss-20b	0.386 ± 0.097	29%	56%	30%	22%	52%

How graded reward is scored: 1.0 for naming the correct culprit commit; 0 for blaming an innocent deploy that just happened to land near the incident (the worst possible answer, because it sends the response team the wrong way); otherwise up to 0.5 partial credit when the commit is wrong but the diagnosis (failing service, blast radius, remediation) is right. Each model's reward is averaged over its 72 trials (24 scenarios, 3 trials each) and reported with a 95% confidence interval; overlapping intervals mean the models are effectively tied.
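The scoring rule above can be sketched in a few lines of Python. This is an illustration, not the benchmark's grader: the function names, the `diagnosis_quality` input in [0, 1], and the normal-approximation interval (1.96 standard errors) are all assumptions, since the page states only the reward levels and the per-model average.

```python
import math
from statistics import mean, stdev

def graded_reward(answer, culprit, decoy, diagnosis_quality=0.0):
    """Reward for one trial under the rule described above.

    answer / culprit / decoy are commit SHAs (culprit may be "none").
    diagnosis_quality in [0, 1] is a hypothetical summary of the secondary checks.
    """
    if answer == culprit:
        return 1.0                      # right commit: full credit
    if decoy is not None and answer == decoy:
        return 0.0                      # blamed the innocent deploy: no credit
    # Wrong commit, not the decoy: partial credit capped at 0.5.
    return 0.5 * max(0.0, min(1.0, diagnosis_quality))

def mean_with_ci(rewards, z=1.96):
    """Mean and 95% half-width via a normal approximation (an assumption)."""
    m = mean(rewards)
    half = z * stdev(rewards) / math.sqrt(len(rewards)) if len(rewards) > 1 else 0.0
    return m, half

def overlaps(a, b):
    """True if two (mean, half_width) intervals overlap."""
    return abs(a[0] - b[0]) <= a[1] + b[1]

# Hypothetical model: 71 full-credit trials and one decoy conviction out of 72.
rewards = [1.0] * 71 + [graded_reward("bbb222", "aaa111", "bbb222")]
m, half = mean_with_ci(rewards)
print(f"{m:.3f} ± {half:.3f}")          # 0.986 ± 0.027

# Interval checks using rows from the table above.
print(overlaps((0.972, 0.038), (0.875, 0.077)))   # True: effectively tied
print(overlaps((0.986, 0.027), (0.851, 0.076)))   # False: clearly separated
```

Under these assumptions, one zero in 72 trials is enough to produce an interval of about ±0.027, which is why small gaps near the top of the table should not be read as a ranking.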

Key finding

glm-5.2 repeats its perfect run under v2 conditions and newcomer grok-4.5 matches it: both score 72/72 with flawless graded rewards, never convicting an innocent commit. claude-opus-4.8 follows at 99% after the v2 timeout fix removed infra noise from its v1 number (93%), and newcomers claude-fable-5 (97%) and fugu-ultra (97%) debut in the top six. The differentiator remains the no-code-cause column, which measures confabulation resistance: when there is no guilty commit, does the model answer "none" or convict an innocent one? Alongside the two leaders, claude-opus-4.8, claude-fable-5, and gpt-5.4 abstain perfectly, while gemini-3.1-flash-lite (29%) and claude-haiku-4.5 (33%) still convict innocents on most no-cause incidents.

Scenarios

Scenario	Type	What it tests
checkout-latency-n-plus-one	Real culprit	An added per-item catalog query inside the order loop spikes checkout p99 (N+1).
payment-nil-deref-panic	Real culprit	A missing nil-check on an optional 3DS field panics every charge.
inventory-connection-pool-exhaustion	Real culprit	Inventory exhausts its pool, but the culprit is the shared DB client library.
recommendation-memory-leak	Real culprit	A package-level slice grows unbounded and OOMKills long after the deploy.
auth-jwt-validation-regression	Real culprit	Every service rejects tokens at once. The culprit is the shared JWT verify library.
cache-ttl-stampede	Real culprit	A TTL change stampedes productdb minutes after the deploy.
frontend-race-condition-5xx	Real culprit	A removed mutex in the rate limiter races under load and throws 5xx.
grpc-deadline-too-tight	Real culprit	A tightened gRPC deadline on a catalog call starts failing slow requests.
unbounded-query-delayed-onset	Real culprit	A deleted LIMIT clause makes a search query degrade as data grows.
logging-debug-disk-fill	Real culprit	A config flip to debug logging at full sample rate fills the disk.
search-mapping-query-break	Real culprit	A renamed index field (title to name) breaks search queries.
dashboard-db-schema-missing-table	Real culprit	Code queries a favorites table whose migration never shipped.
ai-agent-registration-missing	Real culprit	A bootstrap refactor deletes the agent-registration call at startup.
metric-ingestor-metadata-deser	Real culprit	A shared schema field rename (meta to metadata) breaks deserialization.
olapdb-tso-cas-retry-budget	Real culprit	A shared TSO client drops its CAS retry budget, causing timeouts.
dynamodb-write-capacity-breach	Real culprit	A shared persistence-lib change drives writes past DynamoDB capacity.
transformer-dependency-startup-crash	Real culprit	A protobuf-runtime version clash crashes the transformer into CrashLoopBackOff.
bad-data-poison-record	No code cause	An external partner feed sends one poison record. No commit. Answer: none.
upstream-payment-provider-outage	No code cause	An external Stripe outage, visible in the status feed. Answer: none.
cloud-region-impairment	No code cause	An AWS S3 us-east-1 regional impairment. Answer: none.
dns-resolver-degradation	No code cause	Cluster DNS and upstream resolver degradation, not a commit. Answer: none.
tls-cert-expiry	No code cause	The payment service leaf certificate expired. Answer: none.
traffic-surge-flash-sale	No code cause	A flash sale drives a 6x organic surge, not a regression. Answer: none.
noisy-neighbor-node-saturation	No code cause	A batch pod lands on the node and starves its neighbors. Answer: none.

Model-scenario matrix

Every model against every scenario. Toggle between pass rate, average cost, and average time per scenario to see which models handle specific failure patterns well, even when their overall score is lower. Hatched cells are scenarios a model never solved.

Matrix (interactive, not reproduced here): 24 scenarios by 20 models. Legend: 0% to 100% pass rate, $0.01 to $1 cost, 30s to 300s time; hatched cells are unsolved.

Cost efficiency

Average API cost per scenario against pass rate. The dashed line is the Pareto frontier: the most cost-efficient models for a given level of accuracy. 1,440 runs cost $274.37 in total.
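To make the frontier concrete, here is a minimal sketch of how a cost/accuracy Pareto frontier can be computed. The page does not list per-model costs, so the points below are hypothetical and the `pareto_frontier` helper is an illustration rather than the code behind the chart; only the $274.37 and 1,440 figures come from the text.

```python
def pareto_frontier(points):
    """Return the (name, cost, pass_rate) points not dominated by a cheaper one.

    A point is dominated if another costs no more and passes at least as often,
    with one of the two strictly better. Exact duplicates keep only the first.
    """
    frontier = []
    best_pass = float("-inf")
    # Sort by cost ascending; on equal cost, the higher pass rate comes first.
    for name, cost, pass_rate in sorted(points, key=lambda p: (p[1], -p[2])):
        if pass_rate > best_pass:       # strictly better than everything cheaper
            frontier.append((name, cost, pass_rate))
            best_pass = pass_rate
    return frontier

# Average cost per run, from the totals quoted above.
avg_cost_per_run = 274.37 / 1440
print(f"${avg_cost_per_run:.2f} per run")           # $0.19 per run

# Hypothetical (USD per scenario, pass rate) points, not benchmark data.
points = [
    ("model-a", 0.02, 0.45),
    ("model-b", 0.05, 0.80),
    ("model-c", 0.08, 0.75),   # dominated by model-b: pricier and less accurate
    ("model-d", 0.40, 1.00),
]
frontier_names = [name for name, _, _ in pareto_frontier(points)]
print(frontier_names)                               # ['model-a', 'model-b', 'model-d']
```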

Speed vs quality

Average time per scenario against pass rate. The frontier shows the models that balance solution quality against how long they take to reason.

What we measure

Given a frozen incident, the model writes a single JSON answer: the root-cause commit, the first failing service, the blast radius, and a remediation. Only one field gates pass or fail.

→Graded reward (what the leaderboard ranks on): 1.0 for the right commit, 0 for blaming a decoy deploy, partial credit up to 0.5 for a correct diagnosis with the wrong commit.
→Primary, binary: root_cause_commit must exactly match the ground-truth culprit SHA (a correct short prefix is accepted).
→Commit messages are neutralized, so the model must reason from the diff, not the description.
→Secondary, never fatal: first-failing-service, blast-radius overlap, remediation match, and whether the model fell for the innocent-deploy decoy.
→Seven incidents have no code cause at all. The correct answer is "none", which measures confabulation resistance.
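The binary gate described in the list above can be sketched as follows. Only `root_cause_commit` is named in the source; the other field names, the `passes` helper, and the minimum prefix length of 7 are assumptions made for illustration.

```python
import json

MIN_PREFIX = 7  # assumed minimum length for an accepted short SHA prefix

def passes(answer_json, truth_sha):
    """Binary gate: does root_cause_commit match the ground truth?

    truth_sha is a full commit SHA, or "none" for incidents with no code cause.
    """
    answer = json.loads(answer_json)
    given = str(answer.get("root_cause_commit", "")).strip().lower()
    truth = truth_sha.strip().lower()
    if truth == "none":
        return given == "none"          # any named commit is a confabulation
    if given == truth:
        return True
    # Accept a short prefix of the true SHA, but not a trivially short one.
    return len(given) >= MIN_PREFIX and truth.startswith(given)

answer = json.dumps({
    "root_cause_commit": "3f9c2ab",               # short prefix of the culprit
    "first_failing_service": "checkout",          # hypothetical field name
    "blast_radius": ["checkout", "frontend"],     # hypothetical field name
    "remediation": "revert the commit",           # hypothetical field name
})
truth = "3f9c2ab7d1e84c0fa9b6e5d2c1a0f9e8d7c6b5a4"
print(passes(answer, truth))    # True: a correct short prefix is accepted
print(passes(answer, "none"))   # False: a commit was named where none is guilty
```

Keeping the gate to a single field means a model cannot pass on a plausible-sounding diagnosis alone; the secondary fields only affect partial credit.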

How scenarios are built

Scenarios are fault injections on a real microservices app (a fork of Online Boutique) or reconstructions of representative production incident classes on a fictional platform. Each is a frozen window of alert, logs, metrics, traces, patterns, and full change context (commits, deploys, flags).

→Author one regression commit (N+1 query, nil deref, pool exhaustion, memory leak) and surround it with dozens of innocent commits.
→Plant an innocent deploy near onset to punish the "blame the latest change" heuristic.
→Allow delayed onset, so the bad deploy can detonate minutes later.
→Hand-label the ground truth and keep it out of the agent container.
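The construction above is designed to defeat one specific shortcut. The sketch below shows that shortcut failing on a hypothetical timeline; the `Deploy` type, the SHAs, and the timings are invented for illustration and are not taken from any benchmark scenario.

```python
from dataclasses import dataclass

@dataclass
class Deploy:
    sha: str
    minutes_before_onset: float   # how long before the alert this deploy landed
    is_culprit: bool = False

def blame_latest(deploys):
    """The naive heuristic: blame whatever deployed closest to the onset."""
    return min(deploys, key=lambda d: d.minutes_before_onset)

# Hypothetical timeline: the culprit lands 40 minutes before onset (delayed
# detonation), and an innocent decoy lands 30 seconds before it.
timeline = [
    Deploy("a1b2c3d", 95.0),
    Deploy("e4f5a6b", 40.0, is_culprit=True),
    Deploy("c7d8e9f", 12.0),
    Deploy("0a1b2c3", 0.5),       # decoy
]
guess = blame_latest(timeline)
print(guess.sha, guess.is_culprit)   # 0a1b2c3 False
```

Timing alone picks the decoy, so a model has to read the diffs and match them to the failure signature to find the real culprit.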

Run it yourself

Running the benchmark requires Harbor, Docker, and an OpenRouter key. Re-scoring a published trajectory needs no API key.

git clone https://github.com/edgedelta/root-cause-bench.git
cd root-cause-bench

# put OPENROUTER_API_KEY=... in .env, then export it to child processes and run:
set -a && source .env && set +a && uv run harbor run -c configs/leaderboard-v2-docker.yaml

# summarize a run into a per-model table:
uv run scripts/process_results.py jobs/<timestamp>
