You’ve heard of observability. You’ve heard of monitoring. And by now, you’ve perhaps read dozens of blog posts that purport to unpack the similarities and differences between observability and monitoring.
The problem with many of those posts, though, is that they don’t really dive deeply into what observability and monitoring look like in practice or how they relate to each other in a day-to-day, operational sense. They also tend to ignore issues like how observability and monitoring work in distributed environments as opposed to conventional infrastructures.
That’s why we think it’s worth adding yet another article to the long lineup of “observability vs. monitoring” pieces. In this piece, we’ll dive deeper than most others into the complex relationship between observability and monitoring by exploring the histories of each concept in addition to their similarities and differences. We’ll also use examples to contextualize what observability and monitoring look like in practice – and not just in theory – for modern IT, DevOps, and SRE teams.
What Is Observability?
The classic definition of observability – as you probably already know – is that it involves using external outputs to evaluate the internal state of a system and its overall health.
That’s a fancy way of saying that observability means collecting and analyzing data from the “surface” of a system to understand what is happening deep inside it.
In other words, when you observe an application, you collect data (like metrics, logs, and traces, the so-called three pillars of observability) so that you can analyze it to figure out what happens when the application processes a transaction, or which specific microservices within the application are the source of performance issues.
Observability, then, is about searching for information you haven’t identified beforehand. You don’t need to search for predefined patterns; you simply collect the data and use it to understand what is happening inside the system and to get to the root cause proactively.
The History of Observability
The concept originated within the academic field of control theory in the 1960s. It was the brainchild of a Hungarian-American engineer named Rudolf Kalman.
Obviously, Kalman was not thinking about cloud-native software, microservices, automation, Kubernetes, or anything else that modern IT teams care about when he came up with the concept of observability. Computer science at the time was in its infancy, and Kalman wasn’t a computer scientist, anyway. He was an engineer and a mathematician who was concerned with abstract system design and management, not with software delivery pipelines, lifecycles, telemetry pipelines, or application performance.
It wasn’t until much more recently that folks in the IT industry started glomming onto the concept of observability. Arguably, they did so mostly because IT systems have become much more complex over the past decade as we moved to distributed, serverless, cloud-native architectures, which makes them more difficult to manage using conventional techniques (like monitoring) alone. The other part of the reason was arguably that marketers of monitoring tools wanted a new buzzword to help sell their products.
Either way, observability today has become the foundation for the way teams seek to gain insights into the complex, distributed environments that they manage.
What Is Monitoring?
Monitoring is the use of tooling to understand or watch the state of a system based on predefined knowledge or datasets. Monitoring systems help us analyze trends, build dashboards, and set up alerting so that we know how apps are functioning. In other words, when you monitor, you look for patterns you’ve seen before. It’s inherently reactive.
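To make "looking for patterns you’ve seen before" concrete, here is a minimal Python sketch of rule-based monitoring. It is an illustration under stated assumptions, not how any particular monitoring tool works: the `Rule` class, the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A predefined check: a metric name, a condition, and an alert message."""
    metric: str
    is_violated: Callable[[float], bool]
    message: str


# Hypothetical rules; the metric names and thresholds are illustrative only.
RULES = [
    Rule("cpu_percent", lambda v: v > 90.0, "CPU usage above 90%"),
    Rule("error_rate", lambda v: v > 0.05, "Error rate above 5%"),
]


def evaluate(sample: dict[str, float], rules: list[Rule] = RULES) -> list[str]:
    """Return the alert message of every rule that the sample violates."""
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        # Only metrics that a rule was written for can ever raise an alert.
        if value is not None and rule.is_violated(value):
            alerts.append(rule.message)
    return alerts
```

Calling `evaluate({"cpu_percent": 95.0, "error_rate": 0.01})` returns only the CPU alert. A sample such as `{"latency_ms": 5000.0}` returns no alerts at all, because nobody wrote a rule for that metric in advance. That gap is the reactive limitation described above.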
The History of Monitoring
As you know, IT teams have been performing monitoring at least since the 1990s, which is when modern application and infrastructure monitoring platforms first appeared. Some tools that are still used today, like Nagios and Zabbix, trace their roots to this era.
Monitoring, then, is nothing new, even within the IT industry. We’ve seen generations of monitoring: the first focused on infrastructure monitoring like server uptime, and the second focused on gathering information through a myriad of tools like log management and error aggregation. Monitoring tools were created for two purposes: to understand what’s broken and why. Monitoring is crucial for analyzing long-term trends, for building dashboards, and for alerting. It lets you know how your apps are functioning, how they’re growing, and how they’re being utilized.
Nowadays, when IT professionals talk about monitoring, they focus on the tools that allow them to watch and understand the state of their systems.
Observability vs. Monitoring
That brings us to the relationship between observability and monitoring.
If you’ve read this far, the differences should be clear enough: by modern definitions, observability is the complete set of processes required to understand what is happening within a system (especially a complex, distributed system), whereas monitoring deals mostly with data collection and presentation.
Thus, you can think of monitoring as one process within the broader workflow of observability. You need monitoring to perform observability. Essentially, both rely on telemetry from different sources to reveal the state of all system components.
But observability also hinges on activities beyond monitoring. To achieve full observability, you also need to be able to correlate discrete data sets (such as application metrics and infrastructure logs), then analyze them to understand how they relate to each other. Instrumentation complements observability by helping teams gauge whether releases are behaving as expected and ask new questions to debug an application or service.
Kubernetes Observability and Monitoring
To illustrate what this means in practice, take the example of Kubernetes. Kubernetes is a great platform to study in the context of monitoring and observability because it’s a complex, distributed system that involves multiple dependencies: nodes, pods, an API server, a scheduler, and so on.
When you monitor Kubernetes, you merely collect data from these various components. For example, you might collect:
- Logs from node operating systems.
- Metrics about pod start time, running time, failure rate, and similar trends.
- Logs from kube-scheduler that let you track which pods were scheduled on which nodes.
- Logs and metrics from the cloud service that hosts your clusters (if indeed you are running them in the cloud), which will reveal insights about the health of the underlying infrastructure.
- Audit logs, which track API requests and their outcomes within Kubernetes.
- Distributed traces from transactions that you send to applications running in Kubernetes.
Based on this data, you may be able to identify trends or patterns you’ve seen before. But that’s all you can do, because monitoring is based on reactively looking for familiar information. When it comes to debugging pods or services, your team must rely on intuition and/or explore known anomalies to get to the source of the problem.
By correlating and analyzing data, observability takes things a step further. In Kubernetes, observability entails the following steps:
- Deploying agents to the various components of Kubernetes so that you can collect data from them.
- Generating event streams from these components to expose data.
- Correlating data points from across these varied event streams.
- Analyzing the correlated data to understand whether events are interrelated – even if they are patterns you have not encountered before. For example, if a pod fails to start, you can use data from kube-scheduler logs and/or nodes to figure out whether the root of the issue lies with the scheduler, the node, or something else (like a bad API request, perhaps).
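The correlation step in the list above can be sketched in a few lines of Python. This is a simplified illustration, not a real Kubernetes client: the `Event` record, its fields, the sample messages, and the 60-second window are all assumptions made for the example. It also assumes every event has already been parsed and tagged with the pod it refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some reference point
    source: str       # e.g. "kube-scheduler", "node", "audit"
    pod: str          # pod that the event refers to
    message: str


def correlate(failure: Event, events: list[Event], window_s: float = 60.0) -> list[Event]:
    """Return events from other sources about the same pod that occurred
    within window_s seconds before the failure, oldest first."""
    related = [
        e for e in events
        if e.pod == failure.pod
        and e.source != failure.source
        and 0 <= failure.timestamp - e.timestamp <= window_s
    ]
    return sorted(related, key=lambda e: e.timestamp)


# Hypothetical sample data, already parsed from several event streams.
events = [
    Event(130.0, "node", "web-1", "image pull failed"),
    Event(100.0, "kube-scheduler", "web-1", "assigned to node-a"),
    Event(120.0, "node", "db-1", "container started"),
    Event(10.0, "audit", "web-1", "create pod request accepted"),
]
failure = Event(140.0, "kubelet", "web-1", "pod failed to start")

timeline = correlate(failure, events)
```

Here `timeline` contains the scheduler assignment followed by the node’s image pull failure. The unrelated `db-1` event and the audit entry outside the time window are dropped. Reading that timeline suggests the scheduler did its job and the trouble began on the node, which is the kind of conclusion that no single stream would give you on its own.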
If you’ve ever worked with Kubernetes, it should be clear enough why monitoring alone won’t lead to actionable insights. Kubernetes includes so many moving parts, each of which depends in complex ways on the other parts, that collecting data from the individual parts doesn’t tell you very much. You need correlation and comprehensive analysis to get the full picture.
Monitoring vs. Observability at the Edge
The most efficient way to maximize observability is to collect and analyze data at the “edge,” meaning at its original source, instead of centralizing it in an observability platform before analyzing it. When you observe Kubernetes this way, you can collect every data point and begin analyzing it as quickly as possible. Instead of having to look for predefined trends in conventional monitoring tools when analyzing or debugging an application or service, you can detect complex patterns at the original data source.
The Need for Speed
A consideration that’s often missing from discussions of observability and monitoring in the context of distributed systems is the fact that moving data from the source where it is generated to a place where it can be analyzed often introduces major delays. It may mean that you don’t glean insights in real time, which undercuts the value of observability and monitoring.
This is why teams should strive to perform both monitoring and observability at the data source whenever possible. Instead of pulling data from nodes and pods to an observability platform, for instance, you should generate data streams directly at the sources so that they can be correlated and analyzed immediately.
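As a rough sketch of what analyzing at the source could look like, the Python class below keeps a small rolling window of recent values on the node itself and flags a value as soon as it arrives, without first shipping raw data elsewhere. The technique is a simple rolling z-score check, chosen here only as a stand-in for richer analysis. The `EdgeAnalyzer` name, the window size, the five-value warm-up, and the threshold are all assumptions for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class EdgeAnalyzer:
    """Keeps a rolling window of recent values and flags outliers locally."""

    def __init__(self, window: int = 20, threshold: float = 3.0):
        self.values = deque(maxlen=window)  # oldest values fall off automatically
        self.threshold = threshold          # allowed distance, in standard deviations

    def observe(self, value: float) -> bool:
        """Record a value; return True if it deviates sharply from the recent window."""
        anomalous = False
        if len(self.values) >= 5:  # wait for a few values before judging
            mu = mean(self.values)
            sigma = pstdev(self.values)
            anomalous = sigma > 0 and abs(value - mu) > self.threshold * sigma
        self.values.append(value)
        return anomalous
```

Feeding it a stream of latencies such as `100, 102, 98, 101, 99, 100, 500` returns `False` for the first six values and `True` for the last one, at the moment that value is produced. Only the flagged event, rather than every raw data point, would then need to travel to a central platform.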
It’s a safe bet that we’ll continue to hear lots of talk about observability and monitoring in the IT industry for the foreseeable future. While part of the conversation is probably driven merely by people’s tendency to glom onto buzzwords, there’s undeniably something real and valuable here that is worth thinking about.
In complex, distributed systems like Kubernetes or edge environments, you simply can’t understand what is happening using monitoring alone. You need observability – not just observability as the type of abstract theory that Kalman defined, but as a practical process that you can operationalize by systematically collecting, correlating, and analyzing data from across the resources you manage.