How to Collect Log Data from Kubernetes Using emptyDir

In this article, we discuss how you can ingest logs into your observability pipeline using emptyDir — a semi-persistent storage volume available in Kubernetes.
Chad Sigler
Senior Solutions Engineer
Jun 13, 2024
7 minute read

Recently, I was helping a customer configure their observability pipeline. This team uses Kubernetes to ensure a scalable infrastructure with distributed nodes in many zones and regions. They also need to store logs in a more traditional way – on disk, rather than streaming them to Kubernetes stdout. However, they have a policy that prevents engineers from using persistent volumes to store log data created by their Kubernetes nodes. As a result, no persistent volume claims (PVCs) are allowed.

The team struggled to understand how to collect logs into their pipeline without a persistent storage volume in the container. To solve this problem, I helped the team implement a semi-persistent storage target, along with tools to ensure their log data is retained and processed appropriately.

In this blog, I want to share my approach. I hope this will help other engineering teams build effective observability pipelines under the same constraints. I’ll do so by walking through a scenario that uses the emptyDir storage type in Kubernetes. But first, let’s provide more context around persistent and semi-persistent storage.

Persistent Storage vs. Semi-Persistent Storage

In Kubernetes, pods are ephemeral in nature, meaning they can be easily created, destroyed, and rescheduled. Persistent storage provides a way to store data – such as telemetry – in a manner that survives beyond the lifecycle of the pod. In other words, if your pod is destroyed, the data remains.

While storing data in a persistent volume can be beneficial, it can also create excessive costs if the storage volumes aren’t well maintained and pruned over time.

On the other hand, semi-persistent storage preserves data when the containers in your pod restart. However, it deletes the storage when your pods are destroyed. This prevents your team from storing (and paying for) data that you no longer need.

A few examples of semi-persistent storage volumes include emptyDir, hostPath, and local. In this article, we’ll focus on emptyDir.

What are emptyDir volumes?

Since the customer I support cannot use persistent volumes, I helped them collect log data from the emptyDir storage type in Kubernetes. This volume type gives containers and pods storage that is somewhat resilient: it outlives individual containers, but not the pod.

When you use the emptyDir volume type in Kubernetes, you create a temporary directory on the host node's filesystem. Data stored in this volume is preserved across container restarts within the same pod, but it is not persistent beyond the pod's lifecycle. The data is lost if the pod is terminated or scheduled on a different node.

Understanding the emptyDir Architecture

When you host Kubernetes infrastructure in a cloud provider, such as Amazon Web Services (AWS) or Microsoft Azure, ensuring each host has ample storage space is beneficial. This will prevent you from running out of resources as you create more data.

When your pod is assigned to a node, Kubernetes creates the emptyDir volume declared in the pod spec. The volume lives on the host, outside of the container. By default, it stores data on whatever medium backs the node (disk, SSD, network storage, etc.). In part, this is why emptyDir is so beneficial to use here: you don’t need to spin up additional resources to support the volume.

All containers running in a given pod can read and write the same files in the emptyDir volume, although each container can mount it at a different path. On the host, the emptyDir volumes for each pod live under this root location:

/var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/

For this example, I will be using a volume that is located in this folder:

/var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/{volume}

Now, let’s take a look at my example workload. Here, I have deployed an Nginx workload to my Kubernetes cluster with the following configuration:

kind: Namespace
apiVersion: v1
metadata:
  name: nginx-local
  labels:
    name: nginx-local
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-local
  namespace: nginx-local
  labels:
    app: nginx-local
spec:
  containers:
    - name: nginx-local
      image: chadtsigler/ed-gen-nginx-local:latest
      volumeMounts:
        - mountPath: /var/log/test
          name: logs
  volumes:
    - name: logs
      emptyDir: {}

When I use this configuration, my Nginx pod will write all its logs to the proper mountPath, which is /var/log/test. This will ultimately translate into the following folder structure:

/var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/logs/{filename}
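To make the mapping concrete, here is a minimal Python sketch that translates a file path inside the container into the corresponding path on the host. The pod ID `1234-abcd` and the file name `access.log` are made-up placeholders, and `host_path_for` is a hypothetical helper for illustration only; the volume name and mountPath come from the manifest above.

```python
from pathlib import PurePosixPath

# Root directory where the kubelet keeps per-pod data on the host.
KUBELET_PODS_ROOT = PurePosixPath("/var/lib/kubelet/pods")


def host_path_for(pod_uid, volume_name, mount_path, container_file):
    """Translate a file path inside the container to its location on the host.

    Raises ValueError if container_file is not under mount_path.
    """
    relative = PurePosixPath(container_file).relative_to(mount_path)
    return (
        KUBELET_PODS_ROOT
        / pod_uid
        / "volumes"
        / "kubernetes.io~empty-dir"
        / volume_name
        / relative
    )


# Placeholder pod ID and file name; volume name and mountPath match the manifest.
print(host_path_for("1234-abcd", "logs", "/var/log/test", "/var/log/test/access.log"))
```

In a real cluster, the pod ID is the pod's UID, which you can read from the pod's metadata.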

With this configuration, all my logs are stored in the emptyDir volume on the host, and any container in the pod that mounts the volume can read and write them.

Getting Logs from emptyDir into Edge Delta

Now that we’ve set up emptyDir, we need to answer the question: How can I get my logs into Edge Delta?

Edge Delta is deployed as a DaemonSet, which runs one agent on each node and ensures the data from the logs is gathered only once per container. To make the host folder available to Edge Delta, I mount it inside the Edge Delta container on each host. This allows the Edge Delta container to read all logs stored in the emptyDir volumes on that host.

To keep it simple, the emptyDir folder is mounted to the same folder inside the Edge Delta container.

Here’s a snippet from the kube manifest:

# Source: edgedelta/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: edgedelta
  namespace: edgedelta
  annotations:
    prometheus.io/scrape: "true"
  labels:
    app.kubernetes.io/name: edgedelta
    app.kubernetes.io/instance: edgedelta
    edgedelta/agent-type: processor
    version: v1
    kubernetes.io/cluster-service: "true"

# ...

volumeMounts:
  - name: emptydirs
    mountPath: /var/lib/kubelet/pods
    readOnly: true

# ...

volumes:
  - name: emptydirs
    hostPath:
      path: /var/lib/kubelet/pods

To make the data from the files available to the Edge Delta container, we mount the file paths and then add them to the agent configuration. To do so, add the appropriate file input to the Edge Delta agent configuration YAML:

path: /var/lib/kubelet/pods/*/volumes/kubernetes.io~empty-dir/**/*.log
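To get a feel for which files a pattern like this selects, here is a small Python sketch that builds a throwaway directory tree and expands the same pattern with the standard `glob` module. It assumes that `*` matches a single path segment and `**` matches any number of nested directories. That is how Python's `glob` behaves with `recursive=True`, but Edge Delta's own matcher may differ in edge cases. The pod directory `pod-a` and the file names are made up for the demo.

```python
import glob
import os
import tempfile

# Same pattern as the file input, made relative so it can live under a temp root.
PATTERN = "var/lib/kubelet/pods/*/volumes/kubernetes.io~empty-dir/**/*.log"


def matching_logs(root):
    """Return the paths (relative to root) that the pattern selects."""
    matches = glob.glob(os.path.join(root, PATTERN), recursive=True)
    return sorted(os.path.relpath(m, root) for m in matches)


def build_demo_tree(root):
    """Create a fake kubelet pod directory with two .log files and one .txt file."""
    base = os.path.join(
        root, "var/lib/kubelet/pods/pod-a/volumes/kubernetes.io~empty-dir/logs"
    )
    os.makedirs(os.path.join(base, "nested"))
    for name in ("access.log", "nested/error.log", "notes.txt"):
        open(os.path.join(base, name), "w").close()


with tempfile.TemporaryDirectory() as root:
    build_demo_tree(root)
    for path in matching_logs(root):
        print(path)  # only the two .log files are listed
```

Because the pattern ends in `*.log`, files with other extensions in the volume are ignored, so adjust the suffix if your workload writes logs under a different name.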

You can also add a new file input from within our Visual Pipelines interface, updating the path with the appropriate value.

Enriching Log Data With Kubernetes Labels

Now that we’re collecting logs in Edge Delta, let’s enhance the data quality to ensure I can query the data in Edge Delta’s backend or another log management tool.

Let's say I have the same Pod running in multiple Namespaces. It would be beneficial to decorate the logs with enough information, so they’re easier to isolate by Namespace downstream. To do this, I can add Kubernetes labels to the log files.

By adding a Resource Transform Node to the path, I can transform the file logs into Kubernetes logs. Here’s the configuration YAML I will use:

target_source_type: k8s
source_field_overrides:
  - field: k8s.namespace.name
    expression: from_k8s(regex_capture(item["resource"]["ed.filepath"], "/var/lib/kubelet/pods/(?P<id>(.+))/volumes.*")["id"], "k8s.namespace.name")
  - field: k8s.pod.name
    expression: from_k8s(regex_capture(item["resource"]["ed.filepath"], "/var/lib/kubelet/pods/(?P<id>(.+))/volumes.*")["id"], "k8s.pod.name")
  - field: k8s.container.name
    expression: from_k8s(regex_capture(item["resource"]["ed.filepath"], "/var/lib/kubelet/pods/(?P<id>(.+))/volumes.*")["id"], "k8s.container.name")
  - field: container.image.name
    expression: from_k8s(regex_capture(item["resource"]["ed.filepath"], "/var/lib/kubelet/pods/(?P<id>(.+))/volumes.*")["id"], "container.image.name")
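The key step in each expression is the regular expression that pulls the pod ID out of the file path. The Python sketch below runs the same pattern with Python's `re` module to show what the `id` group captures. Edge Delta's regex engine may not behave identically, and the file paths are made-up examples. The `STRICT` pattern is a hypothetical alternative that is not part of the configuration above. It guards against one edge case: because `.+` is greedy, a log path that contains a second `/volumes` segment would cause the original pattern to capture more than the pod ID.

```python
import re

# Pattern copied from the Resource Transform Node expressions.
ORIGINAL = re.compile(r"/var/lib/kubelet/pods/(?P<id>(.+))/volumes.*")
# Hypothetical stricter variant: the ID may not contain a slash.
STRICT = re.compile(r"/var/lib/kubelet/pods/(?P<id>[^/]+)/volumes/.*")


def pod_id(pattern, filepath):
    """Return the captured pod ID, or None if the path does not match."""
    match = pattern.match(filepath)
    return match.group("id") if match else None


# Placeholder pod ID and file names for illustration.
typical = "/var/lib/kubelet/pods/1234-abcd/volumes/kubernetes.io~empty-dir/logs/access.log"
tricky = "/var/lib/kubelet/pods/1234-abcd/volumes/kubernetes.io~empty-dir/logs/volumes/access.log"

print(pod_id(ORIGINAL, typical))  # just the pod ID
print(pod_id(ORIGINAL, tricky))   # greedy match runs past the pod ID
print(pod_id(STRICT, tricky))     # just the pod ID again
```

For the folder layout used in this article, both patterns return the same pod ID, which `from_k8s` then uses to look up the namespace, pod name, container name, and image.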

These logs will also contain all Kubernetes tags and labels from the deployment. Adding this Resource Transform Node gives the file logs the same shape as native Kubernetes logs. As a result, my customer can locate data based on this Kubernetes metadata and group all data from a namespace at the destination.

Final Thoughts

In this blog, I covered how you can collect data in Edge Delta using an emptyDir Kubernetes storage volume. This approach is useful if your team cannot use persistent volumes.

Across two Edge Delta Observability Pipeline nodes, we were able to gather logs from any file that has been written to the host via emptyDir. Additionally, all logs contain Kubernetes tags and labels to enhance the data quality.