Successful observability practices require you to spot issues before they impact your customers. In the past, this meant:
- Understanding which conditions indicated a production issue.
- Building monitors to trigger alerts when telemetry trends outside of acceptable ranges.
Today, you can use artificial intelligence and machine learning to supplement these efforts. This article is a deep dive into AI anomaly detection. I will discuss how AI anomaly detection works and how it can improve your observability practice.
What is Anomaly Detection?
Anomaly detection is a capability within observability and monitoring products. It identifies abnormal or suspicious behaviors when compared to established patterns. Observability tools analyze different telemetry, including logs and metrics, to identify anomalies. As a byproduct, you will run across different forms of anomaly detection:
- Time-series anomaly detection: Identifies metrics or monitoring KPIs that fall outside normal ranges.
- Log file anomaly detection: Pinpoints unexpected behavior in either the volume or contents of your logs.
This article focuses on AI anomaly detection, which can support both metric and log-based approaches.
How is Anomaly Detection Different from Traditional Monitoring?
There is some overlap between modern anomaly detection and traditional monitoring practices. The latter requires your team to determine a monitoring objective and define alert thresholds. Then, when a metric goes beyond acceptable limits, you can address the issue.
This approach is sub-optimal for a few reasons. It is time-consuming to create alerts at scale. Your team must be cognizant of every condition they'd like to alert against. (In other words, you're not spotting "unknown unknowns.") Static alerts don't convey whether behavior is changing, nor do they self-correct when baselines change. They also don’t take into account seasonality, time of day, or other factors that impact behavior patterns.
When an anomaly detection algorithm identifies an issue, it similarly triggers an alert. Yet, anomaly detection relies on artificial intelligence and machine learning algorithms, not humans. Automating this process gives time back to your team and gives you better coverage when it comes to spotting issues.
AI anomaly detection factors in the context when identifying unusual behaviors. For example, a spike in failed logins on an internal system might be normal on Tuesday morning at 8:30 am when your employees start their day. However, this would be anomalous on a Saturday at 2:00 am. Additionally, 10x traffic on an e-commerce application might be normal on Black Friday. But, this would be a major issue during a random Wednesday.
Anomaly Detection in Log Events
Anomaly detection is not one-size-fits-all. Within the broader category, vendors use different artificial intelligence and machine learning models. In this section, I'll touch on different approaches for anomaly detection in log files.
More traditional approaches entail pinpointing anomalous keywords or sequences of log events. Using machine learning models, platforms can detect abnormal spikes in problematic log events. Here you’re identifying deviations in the established pattern of ERROR or FAIL logs, for example.
How Edge Delta Supports Log Anomaly Detection
To identify anomalies in log events, Edge Delta relies on log patterns (sometimes referred to as “pattern analytics.”) Log patterns cluster together similar or recurring events. While some other observability providers support log patterns, they do so as a batch processing job initiated by the user over a specified dataset. Edge Delta proactively runs log patterns across all data in real-time as it’s created.
Every event our agent collects is compared against existing log patterns. If the log event is similar to an existing pattern, it is bucketed into that group, and the count for that pattern is incremented. If the log event is unique, a new pattern is created based on the content of that message, and flagged as “new”.
Next to each log pattern, you can see:
- The volume or count of events that make up the pattern
- How the volume over a given time period compares to previous periods
- Whether the log events that make up the pattern are negative or neutral (more on this point later)
Skyline Pattern Monitor
We run a couple AI anomaly detection monitors on log patterns, the first being our Skyline Pattern monitor. This monitor uses a proprietary algorithm to spot unusual spikes in negative sentiment data from a given data source. Edge Delta triggers an alert in two scenarios:
- There is an abnormal volume of negative sentiment log messages
- There is an abnormal amount of unique, negative sentiment patterns
To determine negative sentiment, Edge Delta analyzes the contents of the log event. Keywords associated with an issue (like error, fail, or exception) impact the sentiment analysis.
We also use a concept of Anomaly Scoring to tune the Pattern Skyline algorithm. Anomaly Scoring improves alert integrity, reducing false positives and negating alert fatigue. We also refine the algorithm to account for repetitive behavior (e.g., recurring batch jobs) and normal fluctuations in log volume.
Anomaly Detection in Metrics
Another approach to AI anomaly detection leverages metrics. Here, your observability platform baselines each time series dataset. In doing so, it learns normal traffic patterns, error rates, CPU utilization, and so on. By creating baselines to understand what is "normal," you can pinpoint any “abnormal” behavior.
This approach is best suited for datasets that indicate strong trends. Otherwise, you can experience alert fatigue.
How Edge Delta Handles Anomaly Detection for Metrics
Edge Delta runs time series anomaly detection on both:
- Metrics scraped directly from the data source
- Metrics extracted from log events
We can also detect anomalies in data collected by individual agents, as well as data aggregated from multiple agents.
Agent Processor Alerts
Agent Processor Alerts run at the agent level on metrics derived from logs via the Log to Metric processor. It identifies abnormal behaviors over time and notifies you of its findings. For instance, you'll receive an alert if there's an unusual frequency of logs containing ERROR or EXCEPTION.
To assess the likelihood of an issue, the agent calculates an anomaly score. Edge Delta deems metric values anomalous if they surpass the anomaly score threshold within an interval.
When an Agent Processor Alert fires, you can also trigger an Anomaly Capture. This feature captures raw logs generated during the time of the anomaly. Our agent routes this data to the Edge Delta backend or a designated third-party streaming destination. This enables your team to easily identify the relevant data for investigation.
Metrics Alert Monitors
Metric Alert Monitors run on aggregated metrics. This approach is the best fit for services that operate across multiple hosts. Metric Alert Monitors trigger an alert when behaviors surpass the Anomaly Score threshold. (No different than Agent Processor Alerts.)
The Benefits of AI Anomaly Detection
There are several benefits to using AI anomaly detection in your observability practice. The two we see most commonly are (1) your ability to identify unknown unknowns and (2) time savings.
Identify Unknown Unknowns
When you define monitors yourself, you have to understand every behavior you ought to alert on. By nature, you're only going to detect known behaviors. Unfortunately, there is a long tail of errors in observability. It's impossible to understand every potential issue, even if you have lots of resources on the task.
When you use AI to automate anomaly detection, you can identify production issues without having to be conscious of them in advance. Moreover, this will allow you to alert on issues you've never seen before or would expect to occur.
This is a benefit that Fama, a leading background screening company, experienced firsthand. Previously, Fama's engineering team created monitors manually in Datadog. This allowed them to trigger alerts on behaviors they knew were problematic. However, when they started using Edge Delta, they began seeing new alerts on issues they didn't expect.
Brendten Eickstaedt, Fama’s CTO, explained, "Edge Delta provided us a view into issues we didn't know were going on. It detected anomalies on its own without us having to specify what it should be looking for.”
Save Time in Creating Alerts
Another shortcoming of manually-defined alerts is the amount of time they take to set up. Agi, a financial technology company based in Brazil that serves over 4.5 million clients, experienced this challenge.
Agi’s Cloud Engineering team built alerts for each application teams’ individual service. This process took anywhere from one to five days, including:
- Determining what to alert on
- Instrumenting code for custom metrics
- Configuring alerts
- Testing the alerts to ensure they worked as expected.
Switching to Edge Delta enabled the team to leverage AI anomaly detection, which automated this process. “It takes just a few minutes for alerts to start working,” explains Agi Cloud Engineer, Bruno da Silva Verch.
Summarizing AI Anomaly Detection
AI anomaly detection enables engineering teams to identify issues before they lead to an incident. Rather than relying on humans to recognize and define the conditions to alert on, this approach automates monitoring entirely. As a byproduct, you can give time back to your engineering team and also spot unknown issues. Moving forward, AI will be critical to accelerating MTTD and limiting application downtime.