Can AI reconstruct the failure chain?
A cascading incident hands you traces, metrics, logs, k8s events, and a service dependency graph. One service failed first. Its failure propagated along call edges, but in causal terms a slow callee backs up its caller, so propagation runs opposite to the request flow. The service that pages is usually the last victim at the edge, not the source. Can the model reconstruct the chain, or does it blame the loudest box and invert the arrows? This is a neutral benchmark of the models, not of any vendor's product.
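The direction flip can be made concrete with a minimal Python sketch. The service names and call edges below are hypothetical and not taken from any benchmark scenario; the sketch only assumes that each call edge is a (caller, callee) pair and that a failure at a callee backs up its callers.

```python
from collections import defaultdict

# Hypothetical request-flow edges, written as (caller, callee).
CALL_EDGES = [
    ("edge-gateway", "checkout"),
    ("checkout", "inventory"),
    ("inventory", "postgres"),
]


def causal_edges(call_edges):
    """Flip each (caller, callee) call edge into a (cause, victim) causal edge."""
    return [(callee, caller) for caller, callee in call_edges]


def propagation_order(origin, call_edges):
    """Breadth-first walk over causal edges, starting at the origin service."""
    victims_of = defaultdict(list)
    for cause, victim in causal_edges(call_edges):
        victims_of[cause].append(victim)
    order, queue, seen = [], [origin], {origin}
    while queue:
        service = queue.pop(0)
        order.append(service)
        for victim in victims_of[service]:
            if victim not in seen:
                seen.add(victim)
                queue.append(victim)
    return order


print(propagation_order("postgres", CALL_EDGES))
# ['postgres', 'inventory', 'checkout', 'edge-gateway']
```

In this toy graph the edge gateway is the last entry in the propagation order, which is why it tends to be the service that pages.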
Model ranking
| # | Model | Overall | easy | medium | hard |
|---|---|---|---|---|---|
| 1 | gpt-5.4 | 62% | 100% | 92% | 49% |
| 2 | gpt-5.5 | 59% | 100% | 100% | 42% |
| 3 | gemini-3.1-pro-preview | 55% | 100% | 64% | 49% |
| 4 | claude-sonnet-4.6 | 49% | 100% | 58% | 42% |
| 5 | claude-opus-4.8 | 48% | 100% | 75% | 34% |
| 6 | gemini-3.5-flash | 47% | 67% | 75% | 36% |
| 7 | kimi-k2.5 | 46% | 100% | 75% | 31% |
| 8 | gpt-5.4-mini | 45% | 67% | 67% | 36% |
| 9 | gemini-3.1-flash-lite | 45% | 100% | 67% | 33% |
| 10 | kimi-k2-thinking | 42% | 67% | 50% | 37% |
| 11 | claude-haiku-4.5 | 27% | 100% | 17% | 25% |
| 12 | gpt-oss-120b | 25% | 33% | 17% | 28% |
| 13 | gpt-oss-20b | 6% | 0% | 25% | 0% |
Key finding
Honesty is the product: if a model does badly here, that is a finding, not a bug. The whole field collapses on the hard tier, where even the top model reaches only 49%, because the cause is often a shared resource that is not an edge in the service graph. The single most diagnostic error is reversed causality: claiming that a downstream victim caused an upstream service's failure. The gpt-oss open-weight models trail the frontier sharply, at 25% and 6% overall.
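To illustrate why a shared resource escapes the service graph, here is a minimal Python sketch. It assumes, hypothetically, that infra events have already been reduced to a service-to-node mapping; the names `FAILING`, `CALL_EDGES`, and `shared_resource_suspects` are invented for illustration and are not part of the benchmark.

```python
from collections import defaultdict

# Hypothetical inputs: the node each failing service ran on (from infra
# events) and the service call graph as (caller, callee) pairs.
FAILING = {
    "billing": "node-7",
    "search": "node-7",
    "thumbnails": "node-7",
    "auth": "node-2",
}
CALL_EDGES = [("web", "auth")]


def shared_resource_suspects(failing, call_edges, min_services=2):
    """Return nodes hosting several failing services that share no call edge."""
    by_node = defaultdict(set)
    for service, node in failing.items():
        by_node[node].add(service)
    linked = {frozenset(edge) for edge in call_edges}
    suspects = {}
    for node, services in by_node.items():
        if len(services) < min_services:
            continue
        pairs = [frozenset((a, b)) for a in services for b in services if a < b]
        # Only flag the node when no pair of its failing services call each other.
        if not any(pair in linked for pair in pairs):
            suspects[node] = sorted(services)
    return suspects


print(shared_resource_suspects(FAILING, CALL_EDGES))
# {'node-7': ['billing', 'search', 'thumbnails']}
```

A traversal of call edges alone would never connect these three services, so the link has to come from the infra side.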
Scenarios
| Scenario | Tier | The trap |
|---|---|---|
| shared-postgres-saturation | medium | The edge gateway is loudest and pages, but is the last victim. The cascade fans out into a small tree, not a line. |
| retry-storm-amplification | hard | Aggressive client retries put the observed load spike on the caller. The true origin is the slow downstream. Reversed-causality trap. |
| noisy-neighbor-node | hard | Three unrelated services fail at once with no call edge between them. The only link is the shared node, visible only in infra events. |
| fdb-tso-flink-cascade | hard | The loud FlinkJobUnhealthy page is the last victim. The origin is the Timestamp Oracle's FDB timeouts four hops upstream. |
| backend-connectivity-cascade | hard | The loudest 5xx is at the http-receiver edge. The origin is the backend whose write shard lost capacity. |
| shared-kafka-saturation | medium | The edge shows the traffic and latency spike, but it is backpressure from a downstream slow queue consumer. |
| disk-pressure-noisy-neighbor | hard | Three services in three namespaces evicted at once. The only link is the shared node, and each victim has its own red herring. |
| shared-redis-eviction | medium | Dependents page loudest with 5xx. The origin's app logs are clean. The kubelet is killing it on a misconfigured probe. |
| memory-pressure-eviction-cascade | hard | Query-failure 5xx loudest on platform-api. The chain starts with a node eviction, then a service cascade. |
| shared-dynamodb-throttle | medium | Retry amplification makes the caller look like the epicenter. The origin is the throttled DynamoDB-backed memory store. |
| cdn-origin-overload | easy | cdn-edge serves the customer-facing 5xx and pages, but origin-web CPU saturated first. |
| dual-independent-incidents | medium | Two unrelated incidents fire in one window. Separate them instead of merging into one chain. |
| fan-in-quiet-downstream | medium | A lock plus GC pause in feature-flags-svc backs up every caller that fans into it. |
| grpc-deadline-chain | medium | The deepest hop, pricing-svc, is slow. The loud timeouts are four hops up the gRPC chain. |
| mid-chain-cache-origin | medium | A cache-key format change collapses the hit ratio. The cache, not the db, is the origin. |
| shared-dns-resolver-degradation | hard | A CoreDNS config change degrades resolution. The link is shared DNS, not a call edge. |
| shared-nat-egress-saturation | hard | A shared NAT gateway saturates SNAT ports, so unrelated egress paths fail together. |
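The dual-independent-incidents trap asks for separation rather than a single merged chain. One way to sketch that, assuming candidate causal edges have already been proposed as (cause, victim) pairs, is to split them into weakly connected components. The edge list and the function name `split_incidents` are hypothetical.

```python
def split_incidents(causal_edges):
    """Group services into weakly connected components, one per incident."""
    neighbours = {}
    for cause, victim in causal_edges:
        neighbours.setdefault(cause, set()).add(victim)
        neighbours.setdefault(victim, set()).add(cause)
    seen, incidents = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        incidents.append(sorted(component))
    return incidents


# Hypothetical candidate edges from two unrelated incidents in one window.
EDGES = [
    ("pricing-db", "pricing"),
    ("pricing", "cart"),
    ("mail-queue", "notifier"),
]
print(split_incidents(EDGES))
# [['cart', 'pricing', 'pricing-db'], ['mail-queue', 'notifier']]
```

Each component can then be given its own origin and path instead of forcing every failing service into one chain.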
What we measure
The model writes failure_chain.json: the origin service, the directed propagation path, the root cause, and the blast radius. Reconstructing the chain means recovering the causal edges, which run opposite to the request flow.
- Primary, binary: the origin service must be correct AND the propagation path must recover enough of the true directed causal edges.
- Secondary, never fatal: blast-radius overlap and a root-cause keyword check.
- Reversed-causality count: how many edges the model inverted, claiming a downstream victim caused an upstream service. This is the single most diagnostic error in incident reasoning.
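The checks above can be sketched roughly as follows. This is not the benchmark's grader: the JSON keys (`origin`, `path`, `blast_radius`), the 0.5 edge-recall threshold standing in for "enough", and the use of Jaccard overlap for blast radius are all assumptions made for illustration. The root-cause keyword check is omitted, and the ground-truth path is assumed to be non-empty.

```python
def score(predicted, truth, min_edge_recall=0.5):
    """Score one failure chain; min_edge_recall is an assumed threshold."""
    pred_edges = {tuple(edge) for edge in predicted["path"]}
    true_edges = {tuple(edge) for edge in truth["path"]}
    recall = len(pred_edges & true_edges) / len(true_edges)
    # A predicted (cause, victim) edge is reversed if truth has (victim, cause).
    reversed_edges = sum((victim, cause) in true_edges for cause, victim in pred_edges)
    pred_blast = set(predicted["blast_radius"])
    true_blast = set(truth["blast_radius"])
    union = pred_blast | true_blast
    return {
        "primary_pass": predicted["origin"] == truth["origin"]
        and recall >= min_edge_recall,
        "edge_recall": recall,
        "reversed_edges": reversed_edges,
        "blast_radius_overlap": len(pred_blast & true_blast) / max(len(union), 1),
    }


# Hypothetical ground truth and model output, edges as [cause, victim].
truth = {
    "origin": "postgres",
    "path": [["postgres", "inventory"], ["inventory", "checkout"],
             ["checkout", "edge-gateway"]],
    "blast_radius": ["inventory", "checkout", "edge-gateway"],
}
predicted = {
    "origin": "postgres",
    "path": [["postgres", "inventory"], ["edge-gateway", "checkout"]],
    "blast_radius": ["inventory", "checkout"],
}
result = score(predicted, truth)
print(result["primary_pass"], result["reversed_edges"])  # False 1
```

Here the origin is right, but only one of three true edges is recovered and one edge is inverted, so the primary check fails under the assumed threshold.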
How scenarios are built
Three scenarios are fault injections on a real microservices demo; the rest are reconstructions of representative production incidents with fictional names. Each captures a 10 to 15 minute window spanning baseline, onset, and escalation, downsampled to a few KB so the agent can read everything.
- Pin every service to a commit, then inject a fault tied to one culprit commit.
- Keep the buried first signal among innocent noise, with an innocent deploy planted at onset.
- Feature-flag changes appear only as decoys. In v1 the root cause is always a code change.
- Hand-label the ground truth: origin, directed edges, root cause, and blast radius.
Run it yourself
The benchmark runs on the external Harbor harness. You can also point any agentic CLI (Claude Code, Codex, Cursor) at a scenario's /workdir.
git clone https://github.com/edgedelta/blast-radius-bench.git && cd blast-radius-bench
cp .env.example .env # add OPENROUTER_API_KEY=...
source .env && uv run harbor run -c configs/all-models-docker.yaml
uv run scripts/process_results.py jobs/<timestamp>