Can AI tell a real incident from alert noise?

17 Scenarios · 13 Models · 663 Trials · 88% Top pass rate · Updated Jun 30, 2026

Modern observability stacks don't have a data problem; they have a paging problem. Alerts fire constantly and most are noise: flaps, transients, deploy churn, duplicates of incidents already being worked. A few are real and need a human now. NoiseBench gives a model a batch of fired pages plus the context a good engineer would pull, and asks it to label each page "page" or "suppress". The cardinal rule: you may not suppress a real incident. Miss one real SEV1 and you score zero, no matter how clean the rest of your triage.

Model ranking

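# | Model | Overall | Easy | Medium | Hard
1 | gpt-5.5 | 88% | 100% | 100% | 75%
2 | claude-sonnet-4.6 | 88% | 100% | 96% | 79%
3 | gpt-5.4 | 80% | 100% | 100% | 58%
4 | kimi-k2.5 | 78% | 100% | 100% | 54%
5 | gemini-3.5-flash | 76% | 100% | 100% | 50%
6 | gemini-3.1-pro-preview | 75% | 100% | 100% | 46%
7 | gpt-5.4-mini | 71% | 100% | 88% | 50%
8 | kimi-k2-thinking | 69% | 100% | 92% | 42%
9 | gemini-3.1-flash-lite | 63% | 100% | 96% | 25%
10 | claude-opus-4.8 | 61% | 67% | 75% | 46%
11 | claude-haiku-4.5 | 61% | 33% | 83% | 42%
12 | gpt-oss-120b | 51% | 100% | 75% | 21%
13 | gpt-oss-20b | 16% | 0% | 33% | 0%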

Key finding

For most models the easy and medium tiers are nearly solved; the hard tier, where alert features actively mislead, is what separates the field. The top two models reach 75% or better on hard, but seven of the thirteen score below 50%, and gpt-oss-20b collapses to 16% overall, scoring 0% on both the easy and hard tiers. Reading a single field is not enough; this triage needs real cross-signal reasoning.
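The hard-tier spread can be checked directly against the ranking table. The tally below is a small sketch with the percentages copied by hand from the Hard column; the variable names are illustrative only.

```python
# Hard-tier pass rates in percent, copied from the ranking table.
hard = {
    "gpt-5.5": 75, "claude-sonnet-4.6": 79, "gpt-5.4": 58, "kimi-k2.5": 54,
    "gemini-3.5-flash": 50, "gemini-3.1-pro-preview": 46, "gpt-5.4-mini": 50,
    "kimi-k2-thinking": 42, "gemini-3.1-flash-lite": 25, "claude-opus-4.8": 46,
    "claude-haiku-4.5": 42, "gpt-oss-120b": 21, "gpt-oss-20b": 0,
}

# Models at 75% or better, strictly below 50%, and at 50% or lower.
at_least_75 = sorted(m for m, pct in hard.items() if pct >= 75)
below_50 = sorted(m for m, pct in hard.items() if pct < 50)
at_most_50 = sorted(m for m, pct in hard.items() if pct <= 50)

print(at_least_75)                               # ['claude-sonnet-4.6', 'gpt-5.5']
print(len(below_50), "of", len(hard))            # 7 of 13
print(len(at_most_50), "of", len(hard))          # 9 of 13
```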

Scenarios

Scenario | Tier | The trap
noisy-night-shift | medium | A real DB cascade fires 4 correlated pages. Collapse them to one. The rest is flaps, transients, and a self-healed deploy.
deploy-storm | hard | Ten services deployed at once and almost all churn self-heals. One deploy shipped a real regression that doesn't.
quiet-but-deadly | medium | Mostly low-grade noise plus a quiet slow-burn incident with no deploy to blame. Tests the blame-the-deploy bias.
disk-pressure-flapper-storm | medium | Disk warnings self-resolve on rotation. One node crosses into real DiskPressure eviction risk.
escalation-loopback-noise | medium | PagerDuty escalation meta-noise. One genuine missed-ack on a live SEV1 Platform API 5xx.
ci-e2e-test-noise | hard | CircleCI / Playwright e2e failures wired into PagerDuty. One reflects a real web-app regression.
warning-spike-transients | medium | WARN spikes that self-heal in seconds. One is the leading edge of a real error cascade on http-receiver.
ai-platform-alert-noise | hard | LLM token-usage and spending-cap cost noise. One real ai-agent-svc outage via OnCall AI Workflow Errors.
queue-backlog-vs-blip | hard | Transient queue-depth blips that drain on their own. One sustained backlog blocking the write path.
node-event-noise | medium | Normal Karpenter/PDB operational events. One real NodeNotReady drops capacity.
obvious-sev1-cluster | easy | Two clear SEV1 outages with symptom duplicates. Page the roots, suppress the rest.
cert-expiry-fanout | medium | One cert-expiry SEV1 plus three duplicates in a two-minute window. Page the root, suppress the dupes.
maintenance-window-masking | medium | An announced maintenance window makes about a dozen alerts expected. Do not page the planned work.
region-failover-mixed | medium | A planned region failover throws transient errors that self-recover. One real incident hides inside.
mixed-triage-heavy | hard | Five real incidents buried in heavy noise. Both precision and recall are stressed.
sev1-misconfigured-rule | hard | A bad monitoring-rules deploy makes the usual noise signals mislead, so features no longer betray the flappers.
slow-burn-saturation | hard | Real incidents that look benign: a slow rise to a hard limit, single fire, no deploy. Failed by every model.

What we measure

The model gets a batch of fired pages plus the context a good engineer would pull: recent metrics, clustered log patterns, deploy history, auto-resolve status, fire frequency, and open incidents. It labels each page "page" or "suppress".

  • Scored on the page class with precision, recall, and F1.
  • Cardinal rule: you may not suppress a real incident. Suppress a must-page SEV1 and you score zero, no matter how clean the rest is.
  • Over-paging is penalized. At full recall, one false page already drops you below threshold.
  • It rewards exactly one behavior: wake a human for the real thing, and nothing else.
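
The rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repository's scoring code: the function name, the `Score` type, and the dict-based inputs are all hypothetical, a page with no prediction is treated as suppressed, and the pass threshold is left out because its value is not given here.

```python
from dataclasses import dataclass


@dataclass
class Score:
    precision: float
    recall: float
    f1: float
    missed_must_page: bool


def score_batch(truth, predicted, must_page):
    """Score one batch on the "page" class (hypothetical helper).

    truth and predicted map page id -> "page" or "suppress";
    must_page is the set of page ids that may never be suppressed.
    A page missing from predicted counts as suppressed (assumption).
    """
    # Cardinal rule: suppressing any must-page item zeroes the score.
    missed = any(predicted.get(pid) != "page" for pid in must_page)

    tp = sum(1 for pid, label in truth.items()
             if label == "page" and predicted.get(pid) == "page")
    fp = sum(1 for pid, label in truth.items()
             if label == "suppress" and predicted.get(pid) == "page")
    fn = sum(1 for pid, label in truth.items()
             if label == "page" and predicted.get(pid) != "page")

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    if missed:
        f1 = 0.0
    return Score(precision, recall, f1, missed)


truth = {"p1": "page", "p2": "suppress", "p3": "suppress"}

# Full recall with one false page: precision halves, F1 falls to about 0.667.
over_paged = score_batch(truth, {"p1": "page", "p2": "page", "p3": "suppress"}, {"p1"})
print(round(over_paged.f1, 3))  # 0.667

# Suppressing the must-page item scores zero regardless of the rest.
missed = score_batch(truth, {"p1": "suppress", "p2": "suppress", "p3": "suppress"}, {"p1"})
print(missed.f1)  # 0.0
```

In this toy batch a single false page at full recall already costs a third of the F1, which is the effect the over-paging rule describes.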

How scenarios are built

Each scenario is a frozen telemetry window built by fault injection. Run a microservices app under steady load, inject one real fault tied to a git commit, let it propagate, then capture the pages, metrics, patterns, deploy log, and any open incidents.

  • Inject realistic distractors: chronic flappers, sub-minute self-healing transients, and downstream symptoms of the real incident.
  • Plant an innocent deploy near onset to punish "blame the latest deploy", plus a duplicate of an already-open incident.
  • Keep timestamps internally consistent: onset always after the culprit deploy.
  • Emit per-page page/suppress labels plus the must-page list as ground truth.
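
A consistency check along the lines of the last two bullets might look like the sketch below. Every field name here is an assumption made for illustration; the dataset's actual schema is not shown in this document.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Scenario:
    # Illustrative fields only, not the dataset's schema.
    name: str
    culprit_deploy_at: datetime
    incident_onset_at: datetime
    labels: dict = field(default_factory=dict)    # page id -> "page" | "suppress"
    must_page: set = field(default_factory=set)   # page ids that must be paged


def validate(s: Scenario) -> list:
    """Return a list of consistency problems; empty means the scenario is sane."""
    problems = []
    # Onset must come strictly after the culprit deploy.
    if s.incident_onset_at <= s.culprit_deploy_at:
        problems.append("onset is not after the culprit deploy")
    # Every ground-truth label must be one of the two allowed values.
    for pid, label in s.labels.items():
        if label not in ("page", "suppress"):
            problems.append(f"{pid}: unknown label {label!r}")
    # Every must-page item must itself be labeled "page".
    for pid in sorted(s.must_page):
        if s.labels.get(pid) != "page":
            problems.append(f"{pid}: must-page item not labeled 'page'")
    return problems


ok = Scenario(
    name="example",
    culprit_deploy_at=datetime(2026, 1, 1, 12, 0),
    incident_onset_at=datetime(2026, 1, 1, 12, 5),
    labels={"p1": "page", "p2": "suppress"},
    must_page={"p1"},
)
print(validate(ok))  # []
```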

Run it yourself

Requires Harbor, Docker, and an OpenRouter key. The repository ships only the tasks, datasets, and scoring; the harness and models are external.

git clone https://github.com/edgedelta/noise-bench.git
cd noise-bench

# put OPENROUTER_API_KEY=... in .env, then load it with auto-export
# (set -a) so the key reaches child processes:
set -a && source .env && set +a && uv run harbor run -c configs/all-models-docker.yaml
uv run scripts/process_results.py jobs/<run-dir>
