Data has gravity, and one of the most common themes coming out of this year’s Snowflake and Databricks conferences was pushing compute upstream to combat that gravity. As Snowflake CEO Frank Slootman put it, “the work needs to come to the data, we want to stop the data from coming to the work.”
“Data gravity” refers to the tendency of data to attract more data, making it increasingly difficult to move and process as it accumulates over time. In observability, the standard model is to route all data to a central service before you can begin processing it. Given the rate at which logs, metrics, and traces have grown, the theme of data gravity is especially relevant to observability practices.
How Data Gravity Impacts Observability
Analysis Becomes Slower and More Expensive
As data grows and becomes more challenging to move, the costs associated with storage and analytics escalate significantly. Moreover, the time taken to move and analyze large datasets can slow insights, impacting real-time monitoring use cases.
Centralizing Data in One Place Leads to Vendor Lock-In
Observability teams aren’t only challenged by how much data is created, but also by where it resides. Relying on a central service can result in vendor lock-in, where it’s increasingly difficult to change observability backends as data accumulates. When all data is going to one place, teams lose flexibility to improve performance and right-size spending.
How to Combat Data Gravity in Observability
Part #1: Edge Processing
To address the challenges of data gravity upstream, you can embrace edge processing. Edge processing involves analyzing data at its source, right when it is created, rather than waiting for it to land in a central service.
This approach helps solve the challenges created by large-scale datasets in two ways. First, you can move data into its optimal shape before you pay a premium to derive value from it. For example, you can summarize noisy data, standardize the format of your loglines, or even enrich data. Second, as a byproduct of processing data upstream, you reduce the footprint you ship to downstream platforms, optimizing costs and improving performance.
Example: Summarizing Data Upstream, Before Ingestion
Consider a scenario where one service is throwing hundreds of thousands of INFO logs communicating the same behavior. Does your team need access to each individual logline?
Instead of shipping all the raw log data to the central observability platform, edge processing can be used to identify common patterns and group together similar loglines. The edge processing layer can then send a summarized version of the data to the central service. This results in a more efficient dataset, reducing costs, traffic volume, and noise.
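The grouping step described above can be sketched in a few lines of Python. This is a minimal illustration, not a production approach: the masking rules, the `to_pattern` and `summarize` names, and the sample loglines are all hypothetical, and real edge pipelines typically use more robust pattern clustering.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace variable tokens so that loglines
# describing the same behavior collapse into a single pattern.
_MASKS = [
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<HEX>"),    # long hex-like identifiers
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),   # integers and decimals
]


def to_pattern(line: str) -> str:
    """Mask the variable parts of a logline to expose its underlying pattern."""
    for regex, token in _MASKS:
        line = regex.sub(token, line)
    return line


def summarize(lines):
    """Group loglines by pattern and return one summary record per pattern."""
    counts = Counter(to_pattern(line) for line in lines)
    return [{"pattern": p, "count": c} for p, c in counts.most_common()]


if __name__ == "__main__":
    raw = [
        "INFO request 101 served in 12 ms",
        "INFO request 102 served in 15 ms",
        "INFO cache warmed",
    ]
    for record in summarize(raw):
        print(record)
```

Only the summary records would be shipped downstream; whether the raw lines are discarded or archived elsewhere is a separate decision.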
Part #2: The Flexibility to Route Data Anywhere
Another crucial aspect in overcoming data gravity is the ability to route data to different destinations. As I explained earlier, it becomes more complex to move off of a centralized service when everything accumulates there. Instead, you should be able to embrace a tiered or multi-faceted approach. In this scenario, you can ship different data to different storage targets, based on the specific use case and cost considerations.
For example, maybe you delegate data used to monitor your core application in real time to a premium data store. Then, data created from less critical resources can be retained in a more cost-effective alternative. You can even split up data by log level, as we’ll explore in the example below.
Example: Shipping Different Subsets of Data to the Optimal Destination
For example’s sake, let’s imagine an application team is hoping to both reshape their data and move different subsets of data to different destinations. They break the dataset down by the log level:
ERROR and FATAL log levels: These are critical for troubleshooting and need to be ingested in full fidelity. They are shipped directly to the observability platform. However, the team can use edge processing to enrich this data for faster troubleshooting.
WARN log level: While this data might still be useful for troubleshooting, the team shouldn’t ingest all WARN logs by default. Instead, it can be sent to a low-cost storage target where specific portions of the data can be rehydrated only when needed.
INFO log level: The team does not need this data ingested raw. However, specific insights, such as the rate of 400 status codes, might be useful to track over time. These insights can be extracted as a metric upstream and ingested in the observability platform to populate dashboards.
DEBUG log level: The team might forget to turn DEBUG off every once in a while, causing spikes in bills. Instead of relying on a human to turn DEBUG off and on, this data can be sent to a low-cost storage target and rehydrated selectively to avoid unnecessary ingestion.
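The breakdown by log level can be sketched as a small routing function. This is an illustrative sketch under stated assumptions: the record shape (`level` and `message` keys), the `status=NNN` message format, the `route` function, and the `http_400_count` metric name are all hypothetical. The sketch does not forward raw INFO lines anywhere, and it sends unrecognized levels to low-cost storage; both are choices made for the example.

```python
import re

# Assumed message format: INFO lines carry a token such as "status=400".
STATUS_RE = re.compile(r"\bstatus=(\d{3})\b")


def route(records, enrichment=None):
    """Split a batch of log records into per-destination outputs."""
    enrichment = enrichment or {}
    platform, archive = [], []
    status_400 = 0

    for rec in records:
        level = rec.get("level", "").upper()
        if level in ("ERROR", "FATAL"):
            # Full fidelity, enriched with extra context for troubleshooting.
            platform.append({**rec, **enrichment})
        elif level == "INFO":
            # Not ingested raw; only a derived count is kept.
            match = STATUS_RE.search(rec.get("message", ""))
            if match and match.group(1) == "400":
                status_400 += 1
        else:
            # WARN, DEBUG, and anything unrecognized go to low-cost storage,
            # where they can be rehydrated selectively later.
            archive.append(rec)

    metrics = [{"name": "http_400_count", "value": status_400}]
    return {"platform": platform, "archive": archive, "metrics": metrics}


if __name__ == "__main__":
    batch = [
        {"level": "ERROR", "message": "db connection lost"},
        {"level": "WARN", "message": "slow query"},
        {"level": "INFO", "message": "GET /cart status=400"},
        {"level": "DEBUG", "message": "cache lookup"},
    ]
    print(route(batch, enrichment={"service": "checkout"}))
```

Dividing the 400 count by the batch window length would give the rate that feeds a dashboard.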
In the example above, adding an additional storage target gives teams more optionality to maximize data usability and reduce costs. Moreover, there are other ways teams can get creative with their streaming destinations. For example, you can use edge processing to identify deviations in application behavior and route them directly to an alerting platform, cutting minutes off time-to-detect.
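One simple way to picture deviation-based routing is a threshold check on a per-window count, such as errors per minute. The sketch below assumes a basic z-score rule; the `is_deviation` and `pick_destination` names, the destination labels, and the threshold of 3.0 are all hypothetical, and real anomaly detection at the edge is usually more sophisticated.

```python
from statistics import mean, pstdev


def is_deviation(history, current, threshold=3.0):
    """Return True if `current` is far from the mean of recent windows."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any change counts as a deviation.
        return current != mu
    return abs(current - mu) / sigma > threshold


def pick_destination(history, current):
    """Send deviating windows to alerting, everything else to the platform."""
    if is_deviation(history, current):
        return "alerting_platform"
    return "observability_platform"


if __name__ == "__main__":
    errors_per_minute = [10, 12, 11, 9, 10]
    print(pick_destination(errors_per_minute, 11))
    print(pick_destination(errors_per_minute, 50))
```

Because the check runs where the data is created, the alert does not have to wait for the data to be indexed in a central service first.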
By using edge processing to identify and centralize critical data for analysis while preserving access to data that may be needed later, companies can right-size costs and efficiently manage observability.
Observability Should Start at the Source
To address data gravity challenges in observability effectively, companies must embrace an edge-oriented approach that enables transparency, control, and flexibility:
Transparency to understand the value and usage of each dataset, so you can make informed decisions about how to handle it.
Control to optimally shape data based on how it's used.
Flexibility to route data anywhere to balance the needs of the use case and cost efficiency.
By incorporating these principles, you can mitigate the impacts of data gravity.