Can AI tell a real incident from alert noise?

17 scenarios, 20 models, 1,020 trials, 92% top pass rate. Updated Jul 10, 2026.

Modern observability stacks don't have a data problem; they have a paging problem. Alerts fire constantly and most are noise: flaps, transients, deploy churn, duplicates of incidents already being worked. A few are real and need a human now. NoiseBench gives a model a batch of fired pages plus the context a good engineer would pull, and asks it to label each page "page" or "suppress". The cardinal rule: you may not suppress a real incident. Miss one real SEV1 and you score zero, no matter how clean the rest of your triage.

Model ranking

#	Model	Graded reward (95% CI)	Pass rate	easy	medium	hard
1	claude-fable-5	0.917 ± 0.074	92%	100%	100%	83%
2	claude-sonnet-4.6	0.909 ± 0.074	92%	100%	100%	83%
3	fugu-ultra	0.882 ± 0.089	88%	100%	100%	75%
4	gpt-5.5	0.881 ± 0.089	88%	100%	100%	75%
5	glm-5.2	0.874 ± 0.089	88%	100%	100%	75%
6	grok-4.5	0.863 ± 0.095	86%	100%	100%	71%
7	gpt-5.4	0.829 ± 0.100	84%	100%	96%	71%
8	claude-opus-4.8	0.824 ± 0.106	82%	100%	100%	62%
9	deepseek-v4-flash	0.820 ± 0.105	82%	100%	100%	62%
10	kimi-k2.5	0.761 ± 0.113	75%	100%	96%	50%
11	gpt-5.4-mini	0.756 ± 0.111	76%	100%	92%	58%
12	gemini-3.5-flash	0.745 ± 0.121	75%	100%	100%	46%
13	kimi-k2-thinking	0.719 ± 0.113	67%	67%	92%	42%
14	gemini-3.1-pro-preview	0.704 ± 0.126	71%	100%	100%	38%
15	gemini-3.1-flash-lite	0.668 ± 0.121	63%	100%	92%	29%
16	claude-haiku-4.5	0.543 ± 0.121	47%	100%	62%	25%
17	gpt-oss-120b	0.539 ± 0.132	53%	100%	88%	12%
18	qwen3-235b-a22b-2507	0.426 ± 0.132	41%	100%	67%	8%
19	qwen3-32b	0.194 ± 0.103	20%	67%	29%	4%
20	gpt-oss-20b	0.167 ± 0.095	16%	67%	12%	12%

How graded reward is scored: a trial earns its F1 score, except it earns 0 if the model suppressed an incident that had to be paged (the one unforgivable mistake). The leaderboard ranks on the average across each model's 51 trials (17 scenarios, 3 trials each), with a 95% confidence interval. A model that nearly gets a scenario right therefore scores better than one that misses completely, and models whose intervals overlap are effectively tied.
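The averaging step can be sketched as follows. The page does not say how the interval is computed, so this sketch assumes a normal approximation (mean ± 1.96 standard errors); `mean_with_ci95` and `intervals_overlap` are hypothetical helper names, and the sample rewards are made up for illustration.

```python
import math
from statistics import mean, stdev


def mean_with_ci95(rewards):
    """Return (mean, half_width) of a 95% interval.

    Assumption: normal approximation, half_width = 1.96 * s / sqrt(n).
    The benchmark's actual interval method is not stated on the page.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two trials to estimate spread")
    half_width = 1.96 * stdev(rewards) / math.sqrt(n)
    return mean(rewards), half_width


def intervals_overlap(a, b):
    """a and b are (mean, half_width) pairs; True if the intervals touch."""
    return abs(a[0] - b[0]) <= a[1] + b[1]


# Made-up per-trial graded rewards for one model (not benchmark data).
sample = [1.0, 1.0, 0.0, 1.0]
m, hw = mean_with_ci95(sample)
print(f"{m:.3f} ± {hw:.3f}")

# Rows 1 and 2 of the leaderboard overlap, so they read as a tie.
print(intervals_overlap((0.917, 0.074), (0.909, 0.074)))
```

Under this reading, the two leaders (0.917 ± 0.074 and 0.909 ± 0.074) overlap almost entirely, which is why the ranking between them should not be over-interpreted.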

Key finding

The easy and medium tiers are nearly solved; the hard tier, where alert features actively mislead, is what separates the field. claude-fable-5 debuts at the top: it ties claude-sonnet-4.6 at 92% and edges it on mean graded reward (0.917 vs 0.909), and it is the first model to pass slow-burn-saturation, the hardest scenario on the board. v2 re-ran every model with a 3x agent timeout: claude-opus-4.8 jumped from 61% to 82% once OpenRouter-slowness timeouts stopped counting against it, while new severity-inversion traps dropped claude-haiku-4.5 to 47%. The top five models reach 75% or better on hard; nine of the twenty still sit below 50%.

Scenarios

Scenario	Tier	The trap
noisy-night-shift	medium	A DB cascade fires 4 correlated pages to collapse into one, plus a cert expiry, an OOM loop, and the usual flaps and transients.
deploy-storm	hard	A dozen services deployed at once and almost all churn self-heals. A few regressions don't. Over-suppressing kills you.
quiet-but-deadly	hard	SEV1-tagged flappers that must be suppressed; quiet SEV2/SEV3 slow-burns that must page, including one that auto-resolved once before escalating.
disk-pressure-flapper-storm	medium	Disk warnings recover on rotation and the [ignore] monitor flaps everywhere (one mis-routed to SEV1). Two nodes cross into real DiskPressure risk and a PVC hits 100%, tagged only SEV3.
escalation-loopback-noise	medium	PagerDuty escalation meta-noise: loop-backs and missed-ack reminders on staging CI vs missed-acks on genuinely live incidents.
ci-e2e-test-noise	hard	CI and e2e test failures wired into PagerDuty vs real prod regressions, including a canary-fleet look-alike.
warning-spike-transients	medium	WARN spikes that self-heal in seconds. One is the leading edge of a real error cascade on http-receiver.
ai-platform-alert-noise	hard	LLM token-usage and spending-cap cost noise vs real AI-runtime incidents.
queue-backlog-vs-blip	hard	Queue blips that drain on their own vs sustained backlogs, a DLQ fill, and a consumer whose queue floor ratchets up while presenting as a flapper.
node-event-noise	medium	Normal Karpenter/PDB operational events vs a real NodeNotReady capacity loss. A SEV1-tagged crashloop flapper sits next to a real crashloop tagged only SEV3.
obvious-sev1-cluster	easy	Two clear SEV1 outages with symptom duplicates. Page the roots, suppress the rest.
cert-expiry-fanout	medium	One cert-expiry SEV1 plus three duplicates in a two-minute window. Page the root, suppress the dupes.
maintenance-window-masking	medium	An announced maintenance window explains about a dozen alerts, but not the out-of-scope CDN outage or the replication stall exceeding the expected blast.
region-failover-mixed	medium	Planned region-failover churn self-heals. The shifted traffic saturates one region's payments and replication; those don't.
mixed-triage-heavy	hard	The kitchen sink: every trap class in one 32-page batch. Precision and recall stressed simultaneously.
sev1-misconfigured-rule	hard	A misconfigured monitor rule tags 12 benign blips SEV1. Severity, rule source, and fire counts all mislead; only the metric trajectory disambiguates.
slow-burn-saturation	hard	Benign plateau-under-limit vs deadly ratcheting floor: both rising-means-page and flapping-means-suppress have counterexamples. Hardest scenario at a 5% pass rate.
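The "ratcheting floor" trap in queue-backlog-vs-blip and slow-burn-saturation can be illustrated with a small trajectory check. This is only a sketch of the idea, not the benchmark's method: `window_floors`, `floor_is_ratcheting`, the window size, and the step threshold are all hypothetical, and the series are made up.

```python
def window_floors(series, window):
    """Minimum of each consecutive, non-overlapping window of the series."""
    return [
        min(series[i:i + window])
        for i in range(0, len(series) - window + 1, window)
    ]


def floor_is_ratcheting(series, window, min_step):
    """True if every window's floor rises by at least min_step over the last.

    A flapper whose troughs keep climbing looks noisy at the peaks but is
    steadily losing headroom; a plateau or a plain flapper has flat floors.
    """
    floors = window_floors(series, window)
    if len(floors) < 3:
        return False  # too little history to call it a trend
    return all(b - a >= min_step for a, b in zip(floors, floors[1:]))


# Made-up utilization series (percent), twelve samples each.
ratcheting = [10, 80, 12, 85, 20, 82, 22, 88, 31, 84, 33, 90]
plain_flapper = [10, 80, 11, 82, 10, 81, 12, 80, 10, 83, 11, 80]
plateau = [60, 70, 72, 71, 70, 72, 71, 70, 72, 71, 70, 72]

for name, series in [("ratcheting", ratcheting),
                     ("plain_flapper", plain_flapper),
                     ("plateau", plateau)]:
    print(name, floor_is_ratcheting(series, window=4, min_step=5))
```

Both the ratcheting series and the plain flapper spike to similar peaks, so peak height and fire count alone cannot separate them; the troughs can.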

Model-scenario matrix

Every model against every scenario. Toggle between pass rate, average cost, and average time per scenario to see which models handle specific failure patterns well, even when their overall score is lower. Hatched cells are scenarios a model never solved.

[Interactive heatmap of all 20 models against all 17 scenarios, with three views: pass rate, cost, and time. The color scale runs from 0% / $0.01 / 30s to 100% / $1 / 300s; hatched cells are unsolved.]

Cost efficiency

Average API cost per scenario against pass rate. The dashed line is the Pareto frontier: the most cost-efficient models for a given level of accuracy. 1,020 runs cost $183.8 in total.
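A Pareto frontier over (cost, pass rate) points can be computed as in the sketch below. The per-model costs are not listed on this page, so the model names and numbers here are hypothetical placeholders, and `pareto_frontier` is an illustrative helper rather than the site's own code.

```python
def pareto_frontier(points):
    """Return the non-dominated points, cheapest first.

    points: list of (name, avg_cost_usd, pass_rate).
    A point is kept only if no cheaper-or-equal point has a pass rate at
    least as high, i.e. it buys strictly more accuracy than everything
    that costs less.
    """
    frontier = []
    best_rate = float("-inf")
    # Sort by cost; on equal cost, try the higher pass rate first.
    for name, cost, rate in sorted(points, key=lambda p: (p[1], -p[2])):
        if rate > best_rate:
            frontier.append((name, cost, rate))
            best_rate = rate
    return frontier


# Hypothetical data, not taken from the benchmark.
points = [
    ("model-a", 0.02, 0.50),
    ("model-b", 0.05, 0.45),  # dominated by model-a
    ("model-c", 0.10, 0.80),
    ("model-d", 0.30, 0.80),  # dominated by model-c
    ("model-e", 0.50, 0.92),
]

for name, cost, rate in pareto_frontier(points):
    print(f"{name}: ${cost:.2f} per scenario, {rate:.0%} pass rate")
```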

Speed vs quality

Average time per scenario against pass rate. The frontier shows the models that balance solution quality against how long they take to reason.

What we measure

The model gets a batch of fired pages plus the context a good engineer would pull: recent metrics, clustered log patterns, deploy history, auto-resolve status, fire frequency, and open incidents. It labels each page "page" or "suppress".

→Graded reward (what the leaderboard ranks on): 0 if a must-page incident was suppressed, otherwise the F1 of that trial, so near-misses and total failures stop looking identical.
→Scored on the page class with precision, recall, and F1.
→Cardinal rule: you may not suppress a real incident. Suppress a must-page SEV1 and you score zero, no matter how clean the rest is.
→Over-paging is penalized. Even at full recall, a single false page already drops you below the pass threshold.
→It rewards exactly one behavior: wake a human for the real thing, and nothing else.
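The scoring rules above can be sketched in a few lines. This is a minimal reading of the description, not the repository's scoring code: the function name, the dict-based label format, and the choice to treat a missing prediction as "suppress" are assumptions.

```python
def graded_reward(predicted, truth, must_page):
    """Graded reward for one trial.

    predicted, truth: dict mapping page id -> "page" or "suppress".
    must_page: set of page ids that may never be suppressed.
    Assumption: a page the model did not label counts as suppressed.
    """
    # Cardinal rule: suppressing any must-page incident zeroes the trial.
    if any(predicted.get(pid) != "page" for pid in must_page):
        return 0.0

    # Precision, recall, and F1 on the "page" class.
    tp = fp = fn = 0
    for pid, label in truth.items():
        paged = predicted.get(pid) == "page"
        if label == "page" and paged:
            tp += 1
        elif label == "suppress" and paged:
            fp += 1  # over-paging
        elif label == "page" and not paged:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = {"a": "page", "b": "page", "c": "suppress", "d": "suppress"}
must_page = {"a"}

# One false page at full recall: precision 2/3, recall 1, F1 0.8.
over_paged = {"a": "page", "b": "page", "c": "page", "d": "suppress"}
print(round(graded_reward(over_paged, truth, must_page), 3))

# Suppressing the must-page incident scores zero regardless of the rest.
missed = {"a": "suppress", "b": "page", "c": "suppress", "d": "suppress"}
print(graded_reward(missed, truth, must_page))
```

In this toy batch a single false page costs a fifth of the reward even though nothing real was missed, which matches the point that over-paging is penalized.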

How scenarios are built

Each scenario is a frozen telemetry window built by fault injection. Run a microservices app under steady load, inject one real fault tied to a git commit, let it propagate, then capture the pages, metrics, patterns, deploy log, and any open incidents.

→Inject realistic distractors: chronic flappers, sub-minute self-healing transients, and downstream symptoms of the real incident.
→Plant an innocent deploy near onset to punish "blame the latest deploy", plus a duplicate of an already-open incident.
→Keep timestamps internally consistent: onset always after the culprit deploy.
→Emit per-page page/suppress labels plus the must-page list as ground truth.
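The consistency rules in this list lend themselves to a small validation pass over each frozen scenario. The schema below is hypothetical (the page does not describe the dataset format), and the rule that every must-page id is also labeled "page" is an assumption about how the two pieces of ground truth relate.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Scenario:
    """Hypothetical container for one scenario's ground truth."""
    culprit_deploy_at: datetime
    incident_onset_at: datetime
    labels: dict      # page id -> "page" or "suppress"
    must_page: list   # page ids that may never be suppressed


def validate(scenario):
    """Return a list of human-readable problems; empty means consistent."""
    errors = []
    # Onset must come after the culprit deploy.
    if scenario.incident_onset_at <= scenario.culprit_deploy_at:
        errors.append("incident onset is not after the culprit deploy")
    # Every page carries one of the two allowed labels.
    for pid, label in scenario.labels.items():
        if label not in ("page", "suppress"):
            errors.append(f"page {pid} has unknown label {label!r}")
    # Assumed rule: must-page ids are labeled "page" in the per-page labels.
    for pid in scenario.must_page:
        if scenario.labels.get(pid) != "page":
            errors.append(f"must-page id {pid} is not labeled 'page'")
    return errors


ok = Scenario(
    culprit_deploy_at=datetime(2026, 1, 1, 12, 0),
    incident_onset_at=datetime(2026, 1, 1, 12, 7),
    labels={"p1": "page", "p2": "suppress"},
    must_page=["p1"],
)
print(validate(ok))
```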

Run it yourself

Requires Harbor, Docker, and an OpenRouter key. The repository ships only the tasks, datasets, and scoring; the harness and models are external.

git clone https://github.com/edgedelta/noise-bench.git
cd noise-bench

# put OPENROUTER_API_KEY=... in .env, then:
source .env && uv run harbor run -c configs/leaderboard-v2-docker.yaml
uv run scripts/process_results.py jobs/<run-dir>
