Blast Radius Bench

Can AI reconstruct the failure chain?

17 scenarios · 20 models · 1,020 trials · 61% top pass rate · Updated Jul 10, 2026

A cascading incident hands you traces, metrics, logs, k8s events, and a service dependency graph. One service failed first. Its failure propagated along call edges, but in causal terms a slow callee backs up its caller, so propagation runs opposite to the request flow. The service that pages is usually the last victim at the edge, not the source. Can the model reconstruct the chain, or does it blame the loudest box and invert the arrows? This is a neutral benchmark of the models, not of any vendor's product.
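
The direction flip described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's code: the three-service call graph and the helper names `causal_edges` and `blast_radius` are assumptions made for the example.

```python
from collections import deque

# Hypothetical request-flow graph: caller -> callees (service names are illustrative).
CALLS = {
    "edge-gateway": ["checkout-svc"],
    "checkout-svc": ["pricing-svc"],
    "pricing-svc": [],
}

def causal_edges(calls):
    """Flip every caller -> callee edge into a callee -> caller causal edge."""
    return sorted(
        (callee, caller) for caller, callees in calls.items() for callee in callees
    )

def blast_radius(calls, origin):
    """Walk causal edges breadth-first to list the services the origin backs up."""
    victims_of = {}
    for cause, victim in causal_edges(calls):
        victims_of.setdefault(cause, []).append(victim)
    visited, order, queue = {origin}, [], deque([origin])
    while queue:
        node = queue.popleft()
        for victim in victims_of.get(node, []):
            if victim not in visited:
                visited.add(victim)
                order.append(victim)
                queue.append(victim)
    return order

print(causal_edges(CALLS))
print(blast_radius(CALLS, "pricing-svc"))
```

In this toy graph the request enters at edge-gateway, but a slow pricing-svc reaches it last, which is why the paging service sits at the end of the causal walk.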

Model ranking

#	Model	Graded reward (95% CI)	Pass rate	easy	medium	hard
1	glm-5.2	0.678 ± 0.116	59%	100%	83%	47%
2	fugu-ultra	0.675 ± 0.116	61%	100%	100%	44%
3	gpt-5.5	0.672 ± 0.117	61%	100%	100%	44%
4	gemini-3.1-pro-preview	0.653 ± 0.123	59%	100%	58%	56%
5	claude-fable-5	0.637 ± 0.120	57%	100%	83%	44%
6	gpt-5.4	0.632 ± 0.123	57%	100%	75%	47%
7	gpt-5.4-mini	0.626 ± 0.120	53%	67%	75%	44%
8	grok-4.5	0.602 ± 0.122	53%	100%	75%	42%
9	claude-sonnet-4.6	0.587 ± 0.125	53%	100%	58%	47%
10	claude-opus-4.8	0.562 ± 0.120	47%	100%	75%	33%
11	gemini-3.5-flash	0.554 ± 0.123	47%	100%	58%	39%
12	deepseek-v4-flash	0.548 ± 0.116	41%	100%	67%	28%
13	gemini-3.1-flash-lite	0.542 ± 0.116	37%	100%	67%	22%
14	qwen3-235b-a22b-2507	0.496 ± 0.126	41%	67%	58%	33%
15	kimi-k2-thinking	0.494 ± 0.121	39%	33%	50%	36%
16	kimi-k2.5	0.465 ± 0.123	39%	33%	50%	36%
17	gpt-oss-120b	0.340 ± 0.125	29%	67%	42%	22%
18	qwen3-32b	0.301 ± 0.113	20%	0%	25%	19%
19	claude-haiku-4.5	0.262 ± 0.111	22%	0%	42%	17%
20	gpt-oss-20b	0.049 ± 0.057	4%	33%	0%	3%

How graded reward is scored: 0 if the model names a downstream victim as the origin or claims causality in the wrong direction (the two errors that misdirect a real incident response); when the origin is right, 0.5 plus up to 0.5 more for how much of the failure chain it recovers; small partial credit otherwise. Each score is averaged over a model's 51 trials (1,020 trials across 20 models) and reported with a 95% confidence interval; overlapping intervals mean the models are effectively tied.
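
The scoring rule and the tie test can be read as the sketch below. It rests on stated assumptions: the value of the "small partial credit" is not given, so `partial=0.1` is a placeholder; treating any single reversed edge as a zero is one reading of "claims causality in the wrong direction"; and the function names are hypothetical.

```python
def graded_reward(origin_correct, blamed_victim, reversed_edges, chain_recall, partial=0.1):
    """Reward in [0, 1]; `partial` is an assumed stand-in for 'small partial credit'."""
    if blamed_victim or reversed_edges > 0:
        return 0.0  # the two errors that misdirect an incident response
    if origin_correct:
        return 0.5 + 0.5 * chain_recall  # chain_recall is the recovered fraction, 0..1
    return partial

def intervals_overlap(mean_a, half_a, mean_b, half_b):
    """True when two 'mean ± half-width' intervals share at least one point."""
    return abs(mean_a - mean_b) <= half_a + half_b

# Leaderboard values: glm-5.2 vs claude-opus-4.8, then glm-5.2 vs gpt-oss-20b.
print(intervals_overlap(0.678, 0.116, 0.562, 0.120))
print(intervals_overlap(0.678, 0.116, 0.049, 0.057))
```

By this check the first and tenth rows of the table still overlap, while the last row is clearly separated from the leader.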

Key finding

Honesty is the product: if a model does badly here, that is a finding, not a bug. This is the hardest bench of the three, and v2 makes the tightness at the top explicit: glm-5.2 leads on mean graded reward (0.678), while newcomer fugu-ultra and gpt-5.5 edge it on pass rate at 61%, a statistical tie. Newcomer claude-fable-5 debuts fifth at 57%. The whole field still collapses on the hard tier, where no model exceeds 56%, because the cause is often a shared resource that is not an edge in the service graph. The single most diagnostic error remains reversed causality: claiming a downstream victim caused an upstream service to fail. The non-LLM baselines make the traps concrete: blaming the loudest service names a victim in 17 of 17 scenarios.
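
A loudest-service baseline of the kind mentioned above might look like the following. This is a sketch only: the service names come from the cdn-origin-overload scenario, but the 5xx counts are invented for illustration and `loudest_service` is a hypothetical helper, not the benchmark's baseline code.

```python
# Hypothetical 5xx counts for cdn-origin-overload; the numbers are illustrative.
ERRORS_5XX = {"cdn-edge": 420, "origin-web": 35}
TRUE_ORIGIN = "origin-web"  # origin-web CPU saturated first, per the scenario table

def loudest_service(error_counts):
    """Naive baseline: blame whichever service emits the most errors."""
    return max(error_counts, key=error_counts.get)

guess = loudest_service(ERRORS_5XX)
verdict = "the origin" if guess == TRUE_ORIGIN else "a victim, not the origin"
print(f"{guess} is {verdict}")
```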

Scenarios

Scenario	Tier	The trap
shared-postgres-saturation	medium	The edge gateway is loudest and pages, but is the last victim. The cascade fans out into a small tree, not a line.
retry-storm-amplification	hard	Aggressive client retries put the observed load spike on the caller. The true origin is the slow downstream. Reversed-causality trap.
noisy-neighbor-node	hard	Three unrelated services fail at once with no call edge between them. The only link is the shared node, visible only in infra events.
fdb-tso-flink-cascade	hard	The loud FlinkJobUnhealthy page is the last victim. The origin is the Timestamp Oracle's FDB timeouts four hops upstream.
backend-connectivity-cascade	hard	The loudest 5xx is at the http-receiver edge. The origin is the backend whose write shard lost capacity.
shared-kafka-saturation	medium	The edge shows the traffic and latency spike, but it is backpressure from a downstream slow queue consumer.
disk-pressure-noisy-neighbor	hard	Three services in three namespaces evicted at once. The only link is the shared node, and each victim has its own red herring.
shared-redis-eviction	medium	Dependents page loudest with 5xx. The origin's app logs are clean. The kubelet is killing it on a misconfigured probe.
memory-pressure-eviction-cascade	hard	Query-failure 5xx loudest on platform-api. The chain starts with a node eviction, then a service cascade.
shared-dynamodb-throttle	medium	Retry amplification makes the caller look like the epicenter. The origin is the throttled DynamoDB-backed memory store.
cdn-origin-overload	easy	cdn-edge serves the customer-facing 5xx and pages, but origin-web CPU saturated first.
dual-independent-incidents	medium	Two unrelated incidents fire in one window. Separate them instead of merging into one chain.
fan-in-quiet-downstream	medium	A lock plus GC pause in feature-flags-svc backs up every caller that fans into it.
grpc-deadline-chain	medium	The deepest hop, pricing-svc, is slow. The loud timeouts are four hops up the gRPC chain.
mid-chain-cache-origin	medium	A cache-key format change collapses the hit ratio. The cache, not the db, is the origin.
shared-dns-resolver-degradation	hard	A CoreDNS config change degrades resolution. The link is shared DNS, not a call edge.
shared-nat-egress-saturation	hard	A shared NAT gateway saturates SNAT ports, so unrelated egress paths fail together.
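
Several of the hard scenarios hinge on a shared resource that never appears as a call edge. One way to surface that link is to group failing services by the node they run on. The pod placement, the service names, and the `shared_node_suspects` helper are all assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical placement gathered from k8s events: service -> node.
POD_NODES = {
    "billing-api": "node-7",
    "search-indexer": "node-7",
    "thumbnailer": "node-7",
    "auth-svc": "node-2",
}
FAILING = ["billing-api", "search-indexer", "thumbnailer"]

def shared_node_suspects(pod_nodes, failing, min_victims=2):
    """Return nodes hosting at least `min_victims` failing services."""
    by_node = defaultdict(list)
    for service in failing:
        by_node[pod_nodes[service]].append(service)
    return {
        node: sorted(services)
        for node, services in by_node.items()
        if len(services) >= min_victims
    }

print(shared_node_suspects(POD_NODES, FAILING))
```

A node that hosts every failing service is only a lead, not proof; the infra events still have to show what happened on it.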

Model-scenario matrix

Every model against every scenario. Toggle between pass rate, average cost, and average time per scenario to see which models handle specific failure patterns well, even when their overall score is lower. Hatched cells are scenarios a model never solved.

[Interactive heatmap: 20 models against 17 scenarios, with pass rate, cost, and time views. The pass-rate scale runs from 0% to 100%; hatched cells mark unsolved scenarios.]

Cost efficiency

Average API cost per scenario against pass rate. The dashed line is the Pareto frontier: the most cost-efficient models for a given level of accuracy. 1,020 runs cost $198.16 in total.
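
For scale, $198.16 over 1,020 runs is about $0.19 per run. The frontier itself can be computed as in this sketch, where the four data points are invented (per-model costs are not listed in the text) and `pareto_frontier` is a hypothetical helper.

```python
def pareto_frontier(points):
    """points: (name, avg_cost, pass_rate). Keep models that no cheaper model matches or beats."""
    frontier, best_rate = [], float("-inf")
    # Sort by cost ascending; on equal cost, consider the higher pass rate first.
    for name, cost, rate in sorted(points, key=lambda p: (p[1], -p[2])):
        if rate > best_rate:
            frontier.append(name)
            best_rate = rate
    return frontier

# Hypothetical (model, dollars per scenario, pass rate) triples.
POINTS = [
    ("model-a", 0.02, 0.40),
    ("model-b", 0.05, 0.35),
    ("model-c", 0.10, 0.61),
    ("model-d", 0.30, 0.59),
]

print(pareto_frontier(POINTS))
print(round(198.16 / 1020, 2))
```

model-b and model-d drop out because a cheaper model already achieves a higher pass rate.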

Speed vs quality

Average time per scenario against pass rate. The frontier shows the models that balance solution quality against how long they take to reason.

What we measure

The model writes failure_chain.json: the origin service, the directed propagation path, the root cause, and the blast radius. Reconstructing the chain means recovering the causal edges, which run opposite to the request flow.

→ Graded reward (what the leaderboard ranks on): 0 for blaming a victim or inverting causality, 0.5 plus chain-recall credit for a correct origin, small partial credit otherwise.
→ Primary, binary: the origin service must be correct AND the propagation path must recover enough of the true directed causal edges.
→ Secondary, never fatal: blast-radius overlap and a root-cause keyword check.
→ Reversed-causality count: how many edges the model inverted, claiming a downstream victim caused an upstream service. This is the single most diagnostic error in incident reasoning.
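
The binary check and the reversed-causality count above can be sketched as follows. The JSON field names are assumed rather than taken from the benchmark's schema, the services other than pricing-svc are invented, and `min_recall=0.5` is a placeholder for the unstated "enough of the true edges" threshold.

```python
import json

# Hypothetical failure_chain.json; the field names are assumptions.
SUBMISSION = """
{
  "origin": "pricing-svc",
  "propagation_path": [["pricing-svc", "quote-svc"], ["cart-svc", "quote-svc"]],
  "root_cause": "slow pricing lookups exceed the gRPC deadline",
  "blast_radius": ["quote-svc", "cart-svc"]
}
"""

TRUE_ORIGIN = "pricing-svc"
TRUE_EDGES = {("pricing-svc", "quote-svc"), ("quote-svc", "cart-svc")}  # cause -> victim

def score(submission, true_origin, true_edges, min_recall=0.5):
    """Binary pass plus the reversed-edge count; min_recall is an assumed threshold."""
    predicted = {tuple(edge) for edge in submission["propagation_path"]}
    recall = len(predicted & true_edges) / len(true_edges)
    # An edge is reversed when its mirror image is a true causal edge.
    reversed_edges = sum((victim, cause) in true_edges for cause, victim in predicted)
    passed = submission["origin"] == true_origin and recall >= min_recall
    return {"passed": passed, "edge_recall": recall, "reversed_edges": reversed_edges}

result = score(json.loads(SUBMISSION), TRUE_ORIGIN, TRUE_EDGES)
print(result)
```

The sample answer gets the origin right and recovers one of the two true edges, but it inverts the other. The sketch counts that separately, mirroring how the reversed-causality count is listed apart from the primary check.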

How scenarios are built

Three scenarios are fault injections on a real microservices demo; the rest are reconstructions of representative production incidents with fictional names. Each captures a 10 to 15 minute window spanning baseline, onset, and escalation, downsampled to a few KB so the agent can read everything.

→ Pin every service to a commit, then inject a fault tied to one culprit commit.
→ Keep the buried first signal among innocent noise, with an innocent deploy planted at onset.
→ Feature-flag changes appear only as decoys. In v1 the root cause is always a code change.
→ Hand-label the ground truth: origin, directed edges, root cause, and blast radius.
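
The text does not say how the 10 to 15 minute windows are downsampled. One plausible approach, shown here purely as an assumption, is to bucket each metric series and keep the per-bucket maximum so that short spikes survive. The `downsample` helper and the sample series are hypothetical.

```python
def downsample(samples, bucket_seconds=60):
    """samples: (unix_seconds, value) pairs. Returns one (bucket_start, max_value) per bucket."""
    buckets = {}
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets.setdefault(bucket_start, []).append(value)
    # Keeping the max (rather than the mean) preserves brief latency spikes.
    return [(start, max(values)) for start, values in sorted(buckets.items())]

# Hypothetical p99 latency samples in seconds, taken every 30 seconds.
LATENCY = [(0, 1.0), (30, 5.0), (60, 2.0), (90, 2.5)]
print(downsample(LATENCY))
```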

Run it yourself

Runs on the external Harbor harness. You can also point any agentic CLI (Claude Code, Codex, Cursor) at a scenario's /workdir.

git clone https://github.com/edgedelta/blast-radius-bench.git && cd blast-radius-bench
cp .env.example .env   # add OPENROUTER_API_KEY=...

source .env && uv run harbor run -c configs/leaderboard-v2-docker.yaml
uv run scripts/process_results.py jobs/<timestamp>
