Kubernetes v1.25 introduced the Container Checkpointing API as an alpha feature. It provides a way to back up and restore containers running in Pods without ever stopping them.
This feature is primarily aimed at forensic analysis, but general backup-and-restore is something any Kubernetes user can take advantage of.
So, let's take a look at this brand-new feature and see how we can enable it in our clusters and leverage it for backup-and-restore or forensic analysis.
Setup
Before we start checkpointing any containers, we need a playground where we can mess with kubelet and its workloads. For that we will need a v1.25+ Kubernetes cluster and a container runtime that supports container checkpointing.
We will create such a cluster using kubeadm inside VM(s) built with Vagrant. I've created a repository with everything necessary to spin up such a cluster with just vagrant up, so if you want to follow along, do check it out.
If you want to build your own cluster, then make sure it satisfies the following:
The cluster must have the ContainerCheckpoint feature gate enabled. For kubeadm, use the following configuration:
# kubeadm-config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ContainerCheckpoint: true
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.25.0
apiServer:
  extraArgs:
    feature-gates: "ContainerCheckpoint=true"
controllerManager:
  extraArgs:
    feature-gates: "ContainerCheckpoint=true"
scheduler:
  extraArgs:
    feature-gates: "ContainerCheckpoint=true"
networking:
  podSubnet: 10.244.0.0/16
This will pass the --feature-gates flag to each of the cluster components. For a full list of available feature gates, see the docs.
Additionally, we will also need to use a container runtime that supports checkpointing. At the time of writing, only CRI-O supports it, with containerd support probably coming soon-ish.
To configure your cluster with CRI-O, install it using the instructions in the docs, or use the convenience script in the above-mentioned repository (you should run this in the VM, not on your local machine).
Additionally, we need to enable CRIU for CRI-O; CRIU is the tool that does the actual checkpointing in the background. To enable it, we need to set the --enable-criu-support=true flag. The above convenience script does that for you.
Also, if you plan to restore the checkpoint back into a Pod, you will also need --drop-infra-ctr set to false, otherwise you will get a CreateContainerError with a message like:
kubelet Error: pod level PID namespace requested for the container, ...
... but pod sandbox was not similarly configured, and does not have an infra container
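If you prefer to configure CRI-O by hand rather than via the convenience script, a minimal sketch could look like the following - the drop-in path and TOML keys are assumptions based on CRI-O's enable_criu_support and drop_infra_ctr options, so verify them against your CRI-O version:
# Assumed drop-in config equivalent to the two CLI flags above:
cat <<EOF > /etc/crio/crio.conf.d/05-checkpoint.conf
[crio.runtime]
enable_criu_support = true
drop_infra_ctr = false
EOF
systemctl restart crio

# CRIU itself can verify that the kernel supports checkpoint/restore:
criu check
# Expected to print "Looks good." on a supported kernel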
With CRI-O installed, we also have to tell kubeadm to use its socket; the following configuration will take care of that:
# kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.56.2
  bindPort: 6443
nodeRegistration:
  criSocket: "unix:///var/run/crio/crio.sock"
---
# ... the previous snippet goes here
# Full config at:
# https://github.com/MartinHeinz/kubeadm-vagrant-playground/blob/container-checkpoint-api/kubernetes/kubeadm-config.yaml
With that in place, we can spin up the cluster with:
kubeadm init --config=.../kubeadm-config.yaml --upload-certs | tee kubeadm-init.out
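As usual with kubeadm, point kubectl at the generated admin kubeconfig before running the commands below (or copy it to ~/.kube/config as the kubeadm init output suggests):
export KUBECONFIG=/etc/kubernetes/admin.conf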
This should give us a single-node cluster like the one below (note the container runtime version):
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION ... OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
kubemaster Ready control-plane 82s v1.25.4 ... Ubuntu 20.04.5 LTS 5.4.0-125-generic cri-o://1.25.0
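To double-check that the feature gate actually landed on the components, you can grep the files kubeadm generates (the paths below are kubeadm defaults):
grep -A1 featureGates /var/lib/kubelet/config.yaml
# featureGates:
#   ContainerCheckpoint: true
grep feature-gates /etc/kubernetes/manifests/kube-apiserver.yaml
#     - --feature-gates=ContainerCheckpoint=true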
Note: Generally, the best and simplest way to play with Kubernetes is KinD. KinD, however, doesn't support container checkpointing as of the time of writing. Alternatively, you can also try the local-up-cluster.sh script in the Kubernetes repository.
Checkpointing
With that out of the way, we can try creating a checkpoint. The usual operations on Kubernetes can be done with kubectl or by running curl commands against the cluster API server. This, however, won't work here, as the checkpointing API is only exposed by the kubelet on each cluster node.
Therefore, we have to jump onto the node and talk to the kubelet directly:
vagrant ssh kubemaster
sudo su -
# Check if it's running...
systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Sat 2022-11-12 10:25:29 UTC; 30s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 29501 (kubelet)
Tasks: 14 (limit: 2339)
Memory: 34.7M
CGroup: /system.slice/kubelet.service
└─29501 /usr/bin/kubelet --bootstrap-kubeconfig=... --kubeconfig=...
To create a checkpoint, we also need a running Pod. Rather than using the system Pods in kube-system, let's create a dummy Nginx webserver in the default namespace:
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
kubectl run webserver --image=nginx -n default
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
webserver 1/1 Running 0 27s 10.85.0.4 kubemaster
Above, you can see that we also removed taints from the node - this allows us to schedule workloads on it even though it's part of the control plane.
Next, let's make a sample API request to the kubelet to see whether we get a valid response:
curl -skv -X GET "https://localhost:10250/pods" \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt \
--cert /etc/kubernetes/pki/apiserver-kubelet-client.crt
{
"kind": "PodList",
"apiVersion": "v1",
"metadata": {},
"items": [
{
"metadata": {
"name": "webserver",
"namespace": "default",
...
}
}
...
}
The kubelet listens on port 10250 by default, so we curl it and ask for all its Pods. We also had to specify the CA certificate, client certificate, and key for authentication.
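As a side note, if you don't want to juggle the certificates yourself, the same kubelet /pods endpoint should also be reachable through the API server's node proxy using the admin kubeconfig (this goes through the API server instead of hitting the kubelet directly):
kubectl get --raw "/api/v1/nodes/kubemaster/proxy/pods" | head -c 250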
Now it's time to finally create a checkpoint:
curl -sk -X POST "https://localhost:10250/checkpoint/default/webserver/webserver" \
--key /etc/kubernetes/pki/apiserver-kubelet-client.key \
--cacert /etc/kubernetes/pki/ca.crt \
--cert /etc/kubernetes/pki/apiserver-kubelet-client.crt
# Response:
# {"items":["/var/lib/kubelet/checkpoints/checkpoint-webserver_default-webserver-2022-11-12T10:28:13Z.tar"]}
# Check the directory:
ls -l /var/lib/kubelet/checkpoints/
total 3840
-rw------- 1 root root 3931136 Nov 12 10:28 checkpoint-webserver_default-webserver-2022-11-12T10:28:13Z.tar
# Verify that original container is still running:
crictl ps --name webserver
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
880ee7ddff7f3 docker.io/library/nginx@sha256:... 48 seconds ago Running webserver 0 d584446dd8d5e webserver
The checkpointing API is available at .../checkpoint/${NAMESPACE}/${POD}/${CONTAINER}; here we used the webserver Pod created earlier. The request created an archive at /var/lib/kubelet/checkpoints/checkpoint-<pod>_<namespace>-<container>-<timestamp>.tar.
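Since the curl invocation is fairly long, you might want to wrap it in a small helper - a sketch of a hypothetical checkpoint function, using the same kubeadm certificate paths as above:
# Hypothetical helper around the kubelet checkpoint endpoint
checkpoint() {
    local namespace="$1" pod="$2" container="$3"
    curl -sk -X POST "https://localhost:10250/checkpoint/${namespace}/${pod}/${container}" \
        --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
        --cacert /etc/kubernetes/pki/ca.crt \
        --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt
}

# Same request as before:
checkpoint default webserver webserver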
Depending on the setup you're using, after running the above curl you might receive an error along the lines of:
checkpointing of default/webserver/webserver failed (CheckpointContainer is only supported in the CRI v1 runtime API)
# or
checkpointing of default/webserver/webserver failed (rpc error: code = Unknown desc = checkpoint/restore support not available)
That means that your container runtime doesn't (yet) support checkpointing, or it's not enabled correctly.
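One quick way to check which CRI API version your runtime exposes is crictl (assuming it's configured to talk to your runtime's socket; the output values below are illustrative):
crictl version
# Version:            0.1.0
# RuntimeName:        cri-o
# RuntimeVersion:     1.25.0
# RuntimeApiVersion:  v1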
Analyzing
We now have a checkpointed container archive, so let's take a look at what's inside:
cd /var/lib/kubelet/checkpoints/
# Rename because "tar" doesn't like ":" in names
mv "checkpoint-webserver_default-webserver-2022-11-12T10:28:13Z.tar" webserver.tar
# View contents:
tar --exclude="*/*" -tf webserver.tar
dump.log
checkpoint/
config.dump
spec.dump
rootfs-diff.tar
io.kubernetes.cri-o.LogPath
# Extract:
tar -xf webserver.tar
ls checkpoint/
cgroup.img fdinfo-4.img ids-31.img mountpoints-13.img pages-2.img tmpfs-dev-139.tar.gz.img
core-1.img files.img inventory.img netns-10.img pages-3.img tmpfs-dev-140.tar.gz.img
core-30.img fs-1.img ipcns-var-11.img pagemap-1.img pages-4.img tmpfs-dev-141.tar.gz.img
core-31.img fs-30.img memfd.img pagemap-30.img pstree.img tmpfs-dev-142.tar.gz.img
descriptors.json fs-31.img mm-1.img pagemap-31.img seccomp.img utsns-12.img
fdinfo-2.img ids-1.img mm-30.img pagemap-shmem-94060.img timens-0.img
fdinfo-3.img ids-30.img mm-31.img pages-1.img tmpfs-dev-136.tar.gz.img
cat config.dump
{
"id": "880ee7ddff7f3ce11ee891bd89f8a7356c97b23eb44e0f4fbb51cb7b94ead540",
"name": "k8s_webserver_webserver_default_91ad1757-424e-4195-9f73-349b332cbb7a_0",
"rootfsImageName": "docker.io/library/nginx:latest",
"runtime": "runc",
"createdTime": "2022-11-12T10:27:56.460946241Z"
}
tar -tf rootfs-diff.tar
var/cache/nginx/proxy_temp/
var/cache/nginx/scgi_temp/
var/cache/nginx/uwsgi_temp/
var/cache/nginx/client_temp/
var/cache/nginx/fastcgi_temp/
etc/mtab
run/nginx.pid
run/secrets/kubernetes.io/
run/secrets/kubernetes.io/serviceaccount/
If you don't need a running Pod/container for analysis, then extracting and reading through some of the files shown above might give you the necessary information.
With that said, I'm no security expert, so I'm not going to feed you questionable information here about how to analyze these files.
As a starting point though, you might want to check out tools like docker-explorer or this talk on container forensics in Kubernetes.
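As a very rough starting point, CRIU ships the crit tool, which can decode the .img files in the checkpoint directory, and the rootfs-diff.tar archive holds the files that changed on top of the base image - a small sketch, assuming crit is installed (it comes with CRIU, though some distros package it separately):
# Show the process tree captured in the checkpoint images:
crit x checkpoint/ ps

# Unpack the filesystem diff and poke around the changed files:
mkdir rootfs-diff && tar -xf rootfs-diff.tar -C rootfs-diff
cat rootfs-diff/run/nginx.pid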
Restoring
While the checkpointing API is currently aimed more at forensic analysis, it can still be used to restore a Pod/container from the archive.
The simplest way is to create an image from the checkpoint archive:
FROM scratch
# Need to use ADD because it extracts archives
ADD webserver.tar .
Here we use an empty (scratch
) image to which we add the archive. We need to use ADD
because it automatically extracts archives. Next, we build it with docker
or buildah
:
cd /var/lib/kubelet/checkpoints/
# Or docker build ...
buildah bud \
--annotation=io.kubernetes.cri-o.annotations.checkpoint.name=webserver \
-t restore-webserver:latest \
Dockerfile .
buildah push localhost/restore-webserver:latest docker.io/martinheinz/restore-webserver:latest
Above, we also specify an annotation that records the original human-readable name of the container, and then we push the image to a registry so that Kubernetes can pull it.
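If you want to double-check that the annotation made it into the image manifest, something like the following should work (assuming skopeo and jq are installed; the containers-storage transport points at the local image store):
skopeo inspect --raw containers-storage:localhost/restore-webserver:latest | jq '.annotations'
# {
#   "io.kubernetes.cri-o.annotations.checkpoint.name": "webserver"
# }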
Finally, we create a Pod, specifying the previously pushed image:
# pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: restore-webserver
  labels:
    app: nginx
spec:
  containers:
  - name: webserver
    image: docker.io/martinheinz/restore-webserver:latest
  nodeName: kubemaster
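Apply it and wait for the container to come up (the output shown is illustrative):
kubectl apply -f pod.yaml
kubectl get pods restore-webserver
# NAME                READY   STATUS    RESTARTS   AGE
# restore-webserver   1/1     Running   0          15s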
To test whether it worked, we can expose the Pod via a Service and curl its IP:
kubectl expose pod restore-webserver --port=80 --target-port=80
kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 14m
restore-webserver ClusterIP 10.104.30.90 <none> 80/TCP 17s
curl http://10.104.30.90
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...
</html>
And it worked! We successfully backed up a running Pod without stopping it and recreated it in its original state.
Closing Thoughts
General checkpoint-and-restore for containers has been possible for a while thanks to CRIU, but this is still a big step for Kubernetes, and hopefully, we will see this feature/API graduate to Beta and eventually GA at some point.
The previous sections demonstrated the usage of the checkpointing API - it's very much usable, but it also lacks some basic features, such as native restore functionality or support from all major container runtimes. So, be aware of its limitations if you decide to enable it in production (or even development) environments/clusters.
With that said, this feature is a very cool addition, and not just for forensic analysis - in the future, when a better/native restore process is available, this could become a proper backup-and-restore workflow for container workloads, which might be very useful for certain types of long-running Kubernetes workloads.