Keeping Kubernetes Clusters Clean and Tidy

As your cluster grows, so does the number of resources, volumes and other API objects, and sooner or later you will hit a limit somewhere, whether it's etcd, storage, memory or CPU. Why subject yourself to unnecessary pain and trouble when you can set up simple yet sophisticated rules, automation and monitoring that keep your cluster clean and tidy, without rogue workloads eating your resources or stale objects lying around?

Why Bother?

A few forgotten Pods, an unused persistent volume, some Completed Jobs or maybe an old ConfigMap or Secret - does it really matter? It's just sitting there, and I might need it at some point!

Well, it isn't causing any damage right now, but as things accumulate over time, they start to impact cluster performance and stability. So, let's look at some common, basic issues these forgotten resources can cause:

Every Kubernetes cluster has some basic limits. The first of them is the number of pods per node, which according to the Kubernetes docs should be at most 110. That said, if you have very big nodes with lots of memory and CPU, you can surely go higher - possibly even 500 pods per node, as tested with OpenShift - but if you push the limits, don't be surprised when things start acting up, and not just because of insufficient memory or CPU.
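If you do decide to raise the per-node pod limit on appropriately sized nodes, it's controlled by the kubelet's maxPods setting. A minimal sketch of a kubelet config fragment (the value 250 is purely illustrative):

```yaml
# KubeletConfiguration fragment - maxPods defaults to 110.
# 250 is just an example; size the limit to your nodes and test before relying on it.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 250
```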

Another issue you might encounter is insufficient ephemeral storage. Every pod running on a node uses at least a bit of ephemeral storage for things like logs, cache, scratch space or emptyDir volumes, so you can hit the limit pretty quickly, which can lead to pods getting evicted or new pods being unable to start on the node. Running too many pods on a node also contributes to this issue, because ephemeral storage is used for container images and the writable layers of containers. If a node starts to run low on ephemeral storage, it becomes tainted and you will probably find out about it pretty quickly. If you want to check the current state of storage on a node, you can run df -h /var/lib on the node itself.

Similarly to ephemeral storage, persistent volumes can also become a source of issues, especially if you're running Kubernetes in a cloud and therefore paying for each PVC you provision. So, it's obviously important to clean up unused PVCs to save money. Keeping the used PVCs clean is also important to avoid having your applications run out of space, especially if you're running databases in your cluster.

Problems can also arise from having too many objects lying around, as they're all stored in etcd. As the etcd database grows, its performance can start to degrade, and you should try really hard to avoid this, considering that etcd is the brain of the Kubernetes cluster. That said, you would need a really big cluster to reach the limits of etcd, as shown in this post by OpenAI. There's, however, no single metric you can watch to gauge etcd performance, as it depends on the number of objects, their size and how frequently they change - so you'd better keep it clean, otherwise you might get some nasty surprises.

Finally, a messy cluster can also lead to security problems. Leaving role bindings or service accounts lying around when they're no longer needed or used is an open invitation for someone to grab and abuse them.

Prevention

You don't need to do anything complicated to solve most of the above-mentioned issues, and the best way to solve them is to prevent them from happening altogether. One way to achieve that is using object quotas, which you (as a cluster administrator) can enforce on individual namespaces.

The first issue you can solve with quotas is the number and size of PVCs:

apiVersion: v1
kind: LimitRange
metadata:
  name: pvc-limit
spec:
  limits:
  - type: PersistentVolumeClaim
    max:
      storage: 10Gi
    min:
      storage: 1Gi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pvc-quota
spec:
  hard:
    persistentvolumeclaims: "10"
    requests.storage: "10Gi"
    # the sum of storage requested in the bronze storage class cannot exceed 5Gi
    bronze.storageclass.storage.k8s.io/requests.storage: "5Gi"

In the above snippet we have two objects. The first is a LimitRange that enforces the minimum and maximum size of individual PVCs in a namespace, which can be helpful for stopping users from requesting huge volumes. The second object (ResourceQuota) additionally enforces a hard limit on both the number of PVCs and their cumulative size.

Next, to stop people from creating a bunch of objects and leaving them around when they're no longer needed, you can use object count quotas, which create hard limits on the number of instances of a specific type of resource in a given namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-count-quota
spec:
  hard:
    configmaps: "2"
    secrets: "10"
    services: "5"
    count/jobs.batch: "8"

There are a couple of built-in fields you can use to specify object count quotas - for example configmaps, secrets or services, as shown above. For all other resources you can use the count/<resource>.<group> format, as shown with count/jobs.batch, which can be useful for preventing a misconfigured CronJob from spawning a huge number of Jobs.

We probably all know that we can set memory and CPU quotas, but you might not know that you can also set a quota for ephemeral storage. Quota support for local ephemeral storage was added as an alpha feature in v1.18 and allows you to limit ephemeral storage the same way as memory and CPU:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ephemeral-storage-quota
spec:
  hard:
    requests.ephemeral-storage: 1Gi
    limits.ephemeral-storage: 2Gi

Be careful with these, however, as pods can get evicted when they exceed the limit, which can be triggered, for example, by the size of container logs.
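Note that once such a quota is in place, every new pod in the namespace has to declare ephemeral-storage requests and limits, or it will be rejected. A minimal sketch of what that looks like (the pod name, image and sizes are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: quota-demo  # hypothetical name
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        ephemeral-storage: 100Mi  # counted against requests.ephemeral-storage
      limits:
        ephemeral-storage: 500Mi  # counted against limits.ephemeral-storage
```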

Apart from setting quotas and limits on resources, you can also set a revision history limit on Deployments to reduce the number of old ReplicaSets kept in the cluster. This is done using .spec.revisionHistoryLimit, which defaults to 10.
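As a sketch, a Deployment that keeps only its three most recent ReplicaSets could look like this (the nginx Deployment itself is just an illustration):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  revisionHistoryLimit: 3  # keep only the 3 most recent old ReplicaSets (default is 10)
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.21
```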

Finally, you can also set a TTL (time to live) to clean up objects that have existed in the cluster for too long. This uses the TTL controller, which has been in beta since v1.21 and currently only works for Jobs, using the .spec.ttlSecondsAfterFinished field, but it might be extended to other resources (for example Pods) in the future.
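As a sketch, a Job that cleans itself up five minutes after finishing might look like this (name, image and command are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: self-cleaning-job  # hypothetical name
spec:
  ttlSecondsAfterFinished: 300  # the TTL controller deletes the Job (and its Pods) 5 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: task
        image: busybox
        command: ["echo", "done"]
```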

Manual Cleanup

If prevention isn't enough and you already have a bunch of orphaned, unused or otherwise dead resources lying around, you can do a one-time purge. This can be done with just kubectl get and kubectl delete. Some basic examples of what you could do:

kubectl delete all -l some-label=some-value  # Delete based on label
kubectl delete pod $(kubectl get pod -o=jsonpath='{.items[?(@.status.phase=="Succeeded")].metadata.name}')  # Delete all "Succeeded" Pods

The first command does a basic delete using a resource label, which obviously requires you to first label all the related resources with some key=value pair. The second one shows how you can delete one type of resource based on a field (usually a status field) - in this case, all the completed/succeeded pods. This approach can be applied to other resources too, for example completed Jobs.

Beyond these two commands, it's quite difficult to find a pattern that would let you delete things in bulk, so you would have to look for individual resources that are unused. There's, however, one tool that might help with this - it's called k8spurger. It looks for unused RoleBindings, ServiceAccounts, ConfigMaps, etc. and produces a list of resources that are good candidates for removal, which can help narrow down your search.

Automated Cleanup

The previous sections explored some approaches to simple, ad-hoc cleanup, but the ultimate solution for getting rid of any clutter is kube-janitor. kube-janitor is a tool that runs in your cluster like any other workload and uses JMESPath queries to find resources, which it then deletes based on a specified TTL or expiry date.

To deploy this tool into your cluster you can run the following:

git clone
cd kube-janitor
kubectl apply -k deploy/

This deploys kube-janitor into the default namespace and runs it with a sample rules file and in dry-run mode (using the --dry-run flag in the kube-janitor Deployment).

Before switching off the dry-run mode, we should set up our own rules. Those live in the kube-janitor ConfigMap, which looks something like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-janitor
  namespace: default
data:
  rules.yaml: |-
    rules:
      ...
We're obviously interested in the rules: section, which we need to populate. So, here are some useful samples you can just grab and use for your cluster cleanup:

  # Delete Jobs in development namespaces after 2 days.
  - id: remove-old-jobs
    resources:
    - jobs
    jmespath: "metadata.namespace == 'development'"
    ttl: 2d
  # Delete pods in development namespaces that are not in Running state (Failed, Completed).
  - id: remove-non-running-pods
    resources:
    - pods
    jmespath: "(status.phase == 'Completed' || status.phase == 'Failed') && metadata.namespace == 'development'"
    ttl: 2h
  # Delete all PVCs which are not mounted by a Pod
  - id: remove-unused-pvcs
    resources:
    - persistentvolumeclaims
    jmespath: "_context.pvc_is_not_mounted"
    ttl: 1d
  # Delete all Deployments whose name starts with 'test-'
  - id: remove-test-deployments
    resources:
    - deployments
    jmespath: "starts_with(metadata.name, 'test-')"
    ttl: 1d
  # Delete all resources in playground namespace after 1 week
  - id: remove-playground-resources
    resources:
    - "*"
    jmespath: "metadata.namespace == 'playground'"
    ttl: 7d

This example shows some basic use cases for cleaning up temporary, stale or unused resources. Apart from rules like these, you can also set an absolute expiry date/time for a specific object. That can be done using annotations, for example like this:

apiVersion: v1
kind: Namespace
metadata:
  annotations:
    # Gets deleted on 18.6.2021 at midnight
    janitor/expires: "2021-06-18"
  name: temp
spec: {}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    # Gets deleted on 20.6.2021 at 17:30
    janitor/expires: "2021-06-20T17:30:00Z"
  name: nginx
spec:
  replicas: 1

When you're done setting up your rules and annotations, you should probably let the tool run for a while in dry-run mode with debug logs turned on, to verify that the correct objects would get deleted and to avoid wiping something you don't want to lose (in other words - don't blame me if you wipe, for example, your production volumes because of a faulty config and a lack of testing).

Finally, one thing to consider when using kube-janitor: if you have a lot of objects in your cluster, it might require more memory than its default limit of 100Mi. So, to avoid its pod getting stuck in CrashLoopBackOff, I prefer to give it a 1Gi memory limit.
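As a sketch, the relevant fragment of the kube-janitor Deployment's container spec might then look like this (container name and request values are illustrative):

```yaml
# Fragment of the kube-janitor Deployment pod spec - raises the memory
# limit from the default 100Mi to 1Gi for clusters with many objects.
containers:
- name: kube-janitor
  resources:
    requests:
      cpu: 5m
      memory: 100Mi
    limits:
      memory: 1Gi
```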

Monitoring Cluster Limits

Not everything can be solved with manual or even automated cleanup, and in some cases monitoring is the best way to make sure you're not hitting any limits in your cluster - whether it's the number of pods, available ephemeral storage or, for example, the etcd object count.

Monitoring is however a huge topic and warrants an article (or a couple) of its own, so for purposes of this article, I will just list a couple of metrics from Prometheus which you might find useful when keeping your cluster nice and tidy:

  • etcd_db_total_size_in_bytes - Size of etcd database
  • etcd_object_counts - etcd object count
  • pod:container_cpu_usage:sum - CPU usage for each pod in cluster
  • pod:container_fs_usage_bytes:sum - Filesystem usage for each pod in cluster
  • pod:container_memory_usage_bytes:sum - Memory usage for each pod in cluster
  • node_memory_MemFree_bytes - Free memory for each node
  • namespace:container_memory_usage_bytes:sum - Memory usage per namespace
  • namespace:container_cpu_usage:sum - CPU usage per namespace
  • kubelet_volume_stats_used_bytes - Used space for each volume
  • kubelet_running_pods - Number of running pods on each node
  • kubelet_container_log_filesystem_used_bytes - Size of logs for each container/pod
  • kube_node_status_capacity_pods - Maximum number of pods for each node
  • kube_node_status_capacity - Maximums for all metrics (CPU, Pods, ephemeral storage, memory, hugepages)

These are just a few of the many metrics you could use. Which metrics are available also depends on your monitoring tooling, meaning you might be able to get extra custom metrics exposed by the services you run.
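As an illustration of how such metrics could be turned into an early warning, here's a sketch of a Prometheus alerting rule built from two of the metrics above (the threshold, the for duration and the on(node) label join are assumptions that may need adjusting for your monitoring setup):

```yaml
groups:
- name: cluster-capacity
  rules:
  - alert: NodeApproachingPodCapacity
    # Fires when a node runs more than 90% of its pod capacity for 15 minutes.
    # The label used for the join may differ between metric sources; adjust as needed.
    expr: kubelet_running_pods / on(node) kube_node_status_capacity_pods > 0.9
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.node }} is close to its pod limit"
```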

Closing Thoughts

In this article we explored many options for cleaning up a Kubernetes cluster - some very simple, some more sophisticated. Regardless of which solution(s) you choose, try to stay on top of this "cleanup duty" and clean things up as you go. It can save you a big headache, and if nothing else, it will get rid of some unnecessary clutter in your cluster - it serves the same purpose as cleaning your desk. Also keep in mind that if you let things lie around for a while, you will forget why they're there and whether they're needed, which makes it much more difficult to get back to a clean state.

Beyond the approaches presented here, you might also want to use a GitOps solution such as ArgoCD or Flux to create and manage your resources, which can greatly reduce the number of orphaned resources. It also makes cleanup easier, as it usually only requires deleting a single custom resource, which triggers cascading deletion of the dependent resources.