Hardening Docker and Kubernetes with seccomp

There are a lot of misconceptions about container security. Many people assume that containers are secure by default, which is unfortunately not true. There are quite a few tools that can help you improve the security of your containers, and therefore also the security of Docker and Kubernetes. One way to harden them is to apply proper seccomp profiles. If you have no idea what seccomp is, then read on to see what it is and how to use it to protect your Docker and Kubernetes workloads from security threats!

What is seccomp, Anyway?

If you've been working with Docker or Kubernetes for a while, you might have heard the term seccomp, but chances are you haven't really looked deeper into this obscure tool, right?

The simplest and easiest to understand definition of seccomp is probably "a firewall for syscalls". seccomp is essentially a mechanism to restrict the system calls that a process may make, so the same way one might block packets coming from some IPs, one can also block a process from making certain system calls to the kernel.

That's cool and all, but how does that help us make the system more secure? The Linux kernel has a lot of syscalls (a few hundred), but most of them are not needed by any given process. If a process gets compromised and is tricked into using some of these syscalls, though, it can lead to serious security issues for the whole system. So, restricting which syscalls a process can make greatly reduces the attack surface of the kernel.

Now, if you're running any decently up-to-date version of Docker (1.10 or higher), then you're already using seccomp. You can check that using docker info or by looking in /boot/config-*:

~ $ docker info
...
 Security Options:
  apparmor
  seccomp
   Profile: default  # default seccomp profile applied by default
...

~ $ grep SECCOMP /boot/config-$(uname -r)
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
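
If you'd rather check from inside a running process, the kernel also exposes the seccomp state in the Seccomp field of /proc/<pid>/status (0 = disabled, 1 = strict, 2 = filter). Below is a minimal sketch of reading that field; the function name seccomp_mode is made up for this illustration, and the code only parses text you hand to it:

```python
from typing import Optional

# Meaning of the numeric "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> Optional[str]:
    """Return the seccomp mode named in the text of a /proc/<pid>/status file.

    seccomp_mode is a hypothetical helper for this article, not a library call.
    """
    for line in status_text.splitlines():
        # "Seccomp_filters:" (newer kernels) does not match this prefix.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1].strip())
            return SECCOMP_MODES.get(value, f"unknown ({value})")
    return None  # field missing, e.g. kernel built without seccomp


# Made-up excerpt of a status file, as seen inside a filtered container.
sample = "Name:\tsh\nSeccomp:\t2\nSeccomp_filters:\t1\n"
print(seccomp_mode(sample))
```

On a Linux machine you could pass it the contents of /proc/self/status; inside a container started with the default profile you would expect to see "filter".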

Well, that's great! We already have seccomp, everything is secure and we don't need to mess with any of this, right? Well, not quite - there are quite a few reasons you might want to get your hands dirty and dig deeper into seccomp...

Why Bother Creating One?

The default seccomp profile in Docker may often be "good enough" and if you have no experience with it, then "improving" or otherwise customizing it might not be the best use of your time. There are however some reasons that might give you enough motivation to change the default seccomp profile:

From time to time a security vulnerability is found. This is inevitable, and further restricting the default seccomp profile might lower the chance of such a vulnerability affecting your applications. One such incident was CVE-2016-0728, which allowed a user to escalate privileges using the keyctl() syscall. After this incident, this syscall is now blocked by the default seccomp profile, but there might be more such exploits in the future...

Another reason to restrict the seccomp profile is that if a certain subsystem has a security bug, attackers won't be able to exploit it from your containers, because the related syscalls have been blocked. This subsystem can be a dependency, a library or some part of the system over which you have no control. You probably also have no way to fix the issue in that subsystem, so to prevent it from being exploited you need to block it on a different layer, e.g. in the seccomp profile.

Potential exploits and security breaches become even more important when you're dealing with PII or when building mission-critical applications, such as software used in health care or in the power grid. In these cases any little security improvement counts, and customizing the seccomp profile might be a worthwhile effort.

Besides all these specific reasons, limiting syscalls that can be used by your containers limits the attack vectors that can be used, e.g. backdoors in Docker upstream images or exploitable bugs in applications.

Customizing a Profile

Now, if we have established that we have a good reason to change the seccomp profile, how would we actually go about doing that? One way to write a seccomp filter is to use the Berkeley Packet Filter (BPF) language. Using this language isn't really simple or convenient. Luckily we don't have to use it - instead we can write JSON that is compiled into a profile by libseccomp. A simple, minimal example of such a profile could look like this:

{
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": [
                "accept",
                "chown",
                "kill",
                "mmap",
                ...
            ],
            "action": "SCMP_ACT_ALLOW",
            "args": [],
            "comment": "",
            "includes": {},
            "excludes": {}
        }
    ]
}

Generally it's preferred to use a whitelist instead of a blacklist, that is, to explicitly list the allowed syscalls and forbid all others. That's exactly what the above profile does - by default, it will use the action SCMP_ACT_ERRNO, which makes every syscall fail with an error (typically Operation not permitted). For the ones we do want to allow, we list their names and specify the SCMP_ACT_ALLOW action.
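
To make the whitelist idea a bit more concrete, here is a small sketch that generates a profile of the same shape from a list of syscall names. The helper build_allowlist_profile is purely illustrative (it is not part of Docker or libseccomp), and a list this short is an assumption for the example, far too small to start a real container:

```python
import json


def build_allowlist_profile(allowed):
    """Build a deny-by-default profile that allows only the given syscalls.

    Hypothetical helper for illustration; it mirrors the JSON shown above.
    """
    return {
        # Anything not listed below fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                # Deduplicate and sort so the output is stable.
                "names": sorted(set(allowed)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


profile = build_allowlist_profile(["mmap", "accept", "kill", "chown", "mmap"])
print(json.dumps(profile, indent=4))
```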

These JSON profiles can use quite a few options and can become very complex, so the one above is really trimmed down to the bare minimum. To see what a real profile looks like, you can check out Docker's default profile here.

Now that we understand seccomp profiles a little more, let's play with Docker. Before we try applying any custom profiles, though, let's first experiment a little bit and override the seccomp defaults:

# Run without seccomp profile:
~ $ docker run --rm -it --security-opt seccomp=unconfined alpine sh
/ # reboot  # Works, oops
# Run with default seccomp profile
~ $ docker container run --rm -it alpine sh
/ # reboot  # Doesn't work

Above we can see what happens if we disable seccomp completely: no syscall is filtered anymore, so a user in the container can call things like reboot that the default profile would reject. Whether such a call then actually affects the host depends only on the remaining layers, such as Linux capabilities. The --security-opt seccomp=... option is used in this example to disable seccomp. It's also the way to apply custom profiles, so let's build and apply our own custom profile.

It's not a good idea to start from scratch, so we will modify Docker's existing profile (referenced above) and restrict it a bit by removing the chmod, fchmod and fchmodat syscalls, which effectively denies permission to use the chmod command. This might not be the most reasonable change or "improvement", but for demonstration purposes it works just fine:

# Run with custom seccomp profile (Disallowing chmod)
~ $ docker container run --rm -it --security-opt seccomp=no-chmod.json alpine sh
/ # whoami
root
/ # chmod 777 -R /etc
chmod: /etc: Operation not permitted

In this snippet we load the custom profile from no-chmod.json and try to chmod the whole of /etc. We can see the effect in the container - any attempt at running chmod results in Operation not permitted.
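
Editing a long JSON file by hand is error-prone, so a profile like no-chmod.json could also be produced with a few lines of code. This is only a sketch: remove_syscalls is a hypothetical helper, it assumes the profile keeps its allow rules under "syscalls" with a "names" list (as in the example earlier), and the tiny sample profile is made up to stand in for Docker's real one:

```python
import copy


def remove_syscalls(profile, blocked):
    """Return a copy of profile with the blocked names removed from allow rules.

    Hypothetical helper; assumes rules look like
    {"names": [...], "action": "SCMP_ACT_ALLOW"}.
    """
    blocked = set(blocked)
    result = copy.deepcopy(profile)  # leave the input untouched
    kept_rules = []
    for rule in result.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            rule["names"] = [n for n in rule.get("names", []) if n not in blocked]
            if not rule["names"]:
                continue  # drop allow rules that became empty
        kept_rules.append(rule)
    result["syscalls"] = kept_rules
    return result


# Made-up miniature profile standing in for the real default profile.
sample = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["chmod", "chown", "fchmod"], "action": "SCMP_ACT_ALLOW"},
        {"names": ["fchmodat"], "action": "SCMP_ACT_ALLOW"},
    ],
}
no_chmod = remove_syscalls(sample, ["chmod", "fchmod", "fchmodat"])
print(no_chmod["syscalls"])
```

With the real profile you would load the JSON file with json.load, pass it through the function and write the result to no-chmod.json with json.dump.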

With this simple "no chmod" profile it was quite easy to pick which syscalls should be blocked. In general, however, it's not so easy to decide what can or cannot be blocked. You can make an educated guess, but you risk blocking too much and not enough at the same time: blocking syscalls your application needs to operate correctly, while missing others that can and should be blocked.

The only feasible way to choose which syscalls to block is to trace the calls your application actually makes. One way to trace syscalls is to use strace:

~ $ strace -c -f -S name chmod 2>&1 1>/dev/null | tail -n +3 | head -n -2 | awk '{print $(NF)}'
syscall
----------------
access
arch_prctl
brk
close
execve
fstat
mmap
mprotect
munmap
openat
pread64
read
write

The command above gives us a list of all syscalls used by a command (chmod in this case), from which we can choose what to block or allow. If this strace command is a little too complicated for you, then strace -c ls would also work; it outputs some extra info like time and number of calls. We obviously can't block all of these, because that would render our container unusable, so it's best to look up each of them, for example using this syscall table.
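
If you save the plain strace -c summary to a file, the shell pipeline above can also be replaced by a few lines of Python. The sketch below assumes the usual layout of that summary (a header row, a dashed rule, one row per syscall, another dashed rule and a total row); the function name and the sample numbers are made up for illustration:

```python
def syscalls_from_strace_summary(summary):
    """Extract syscall names from the table printed by `strace -c`.

    Hypothetical helper; it reads the rows between the two dashed rules and
    takes the last column of each row as the syscall name.
    """
    names = []
    in_table = False
    for line in summary.splitlines():
        if line.startswith("------"):
            if in_table:
                break  # second dashed rule: the "total" row follows
            in_table = True
            continue
        if in_table and line.strip():
            names.append(line.split()[-1])
    return sorted(set(names))


# Made-up summary in the layout described above.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 40.00    0.000020          10         2           read
 60.00    0.000030          15         2         1 openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000050                     4         1 total
"""
print(syscalls_from_strace_summary(SAMPLE))
```

The resulting list is a starting point for an allowlist, not a finished one: a single traced run only shows the calls made on that particular code path.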

What About Kubernetes?

In the title and intro I also mentioned Kubernetes, yet we have only talked about Docker so far. So, what's the situation around seccomp in the Kubernetes world? Well, unfortunately, in Kubernetes seccomp is not applied by default, so syscalls are not filtered by seccomp at all; only a few really dangerous ones stay unavailable, for other reasons such as missing capabilities. So, with Docker you could get away with not configuring any profile and just using the defaults, but in Kubernetes there's really nothing preventing some security issues from being exploited.

How do we "fix" that? That depends on the version of Kubernetes you're using - for versions before 1.19, you will need to apply annotations to your pods. This would look something like this:

apiVersion: v1
kind: Pod
metadata:
  name: some-pod
  labels:
    app: some-pod
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: localhost/profiles/some-profile.json
spec:
    ...

For 1.19 (which we will focus on here) and later, seccomp profiles are a GA feature and you can use the seccompProfile section in the securityContext of a pod. The definition of such a pod would look like so:

apiVersion: v1
kind: Pod
metadata:
  name: some-pod
  labels:
    app: some-pod
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/some-profile.json
  containers:
    ...
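
If you are migrating pods from the annotation to the securityContext field, the mapping between the two is mechanical. Here is a rough sketch of it, assuming the annotation values runtime/default, docker/default, unconfined and localhost/<path>; the function name is invented for this example and is not part of any Kubernetes client library:

```python
def annotation_to_seccomp_profile(value):
    """Translate a legacy seccomp annotation value into a seccompProfile dict.

    Hypothetical helper for illustration only.
    """
    if value in ("runtime/default", "docker/default"):
        return {"type": "RuntimeDefault"}
    if value == "unconfined":
        return {"type": "Unconfined"}
    prefix = "localhost/"
    if value.startswith(prefix) and len(value) > len(prefix):
        # The path stays relative to the kubelet's seccomp directory.
        return {"type": "Localhost", "localhostProfile": value[len(prefix):]}
    raise ValueError(f"unrecognized seccomp annotation value: {value!r}")


print(annotation_to_seccomp_profile("localhost/profiles/some-profile.json"))
```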

Now we know how to solve it on Kubernetes, so it's time for a little demonstration. For that we will need a cluster - here I'm going to use KinD (Kubernetes in Docker) to set up a minimal local cluster. We will also need all the seccomp profiles before we spin up the cluster, as these need to be available on the cluster nodes. So, the definition of the cluster itself is as follows:

apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  extraMounts:
  - hostPath: "./profiles"
    containerPath: "/var/lib/kubelet/seccomp/profiles"

This defines a single-node cluster which mounts the local profiles directory into the node at /var/lib/kubelet/seccomp/profiles. And what do we put in this profiles directory? For the purposes of this example, we will use 3 profiles: Docker's default profile, the "no chmod" profile used previously, and additionally a so-called "audit"/"complain" profile. Docker's default needs no file of its own, as we will select it through the RuntimeDefault type.
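
It's easy to get the paths wrong here, because the localhostProfile value in a pod is relative to the kubelet's seccomp directory on the node, not to your local directory. The sketch below spells out that relationship for the mount above; the function name is hypothetical, and /var/lib/kubelet/seccomp is assumed to be the kubelet's seccomp root (its usual default):

```python
import posixpath

# Assumed kubelet seccomp root on the node.
KUBELET_SECCOMP_ROOT = "/var/lib/kubelet/seccomp"


def localhost_profile_for(host_file, host_mount, node_mount,
                          root=KUBELET_SECCOMP_ROOT):
    """Return the localhostProfile value for a profile file on the host.

    Hypothetical helper: host_mount/node_mount correspond to the hostPath and
    containerPath of the extraMounts entry in the KinD config.
    """
    relative = posixpath.relpath(host_file, host_mount)   # e.g. audit.json
    node_path = posixpath.join(node_mount, relative)      # path on the node
    return posixpath.relpath(node_path, root)             # relative to root


print(localhost_profile_for(
    "./profiles/audit.json",
    "./profiles",
    "/var/lib/kubelet/seccomp/profiles",
))
```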

We will start with the audit profile. This one doesn't block any syscalls but rather just logs them to syslog when they are used by some command/program. This can be very useful both for debugging and exploring the behavior of an application and for finding syscalls that can or cannot be blocked. The definition of this profile is really just one line and looks like this:

# ./profiles/audit.json
{
    "defaultAction": "SCMP_ACT_LOG"
}

One more thing we will need is - obviously - a pod. This pod will have a single Ubuntu container, which runs ls to trigger some syscalls and then sleeps, so that it doesn't terminate and restart:

# ./pods/audit.yaml
apiVersion: v1
kind: Pod
metadata:
  name: audit-seccomp
  labels:
    app: audit-seccomp
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/audit.json
  containers:
  - name: test-container
    image: ubuntu
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "ls /; while true; do sleep 30; done;" ]
    securityContext:
      allowPrivilegeEscalation: false

With that out of the way, let's build the cluster and apply the first profile:

~ $ tree .
.
├── kind.yaml
├── pods
│   ├── audit.yaml
│   ├── default.yaml
│   └── no-chmod.yaml
└── profiles
    ├── audit.json
    └── no-chmod.json

~ $ kind create cluster --image kindest/node:v1.19.4 --config=kind.yaml
~ $ kubectl apply -f pods/audit.yaml
~ $ tail /var/log/syslog

Nov 25 19:38:18 kernel: [461698.749294] audit: ... syscall=21 compat=0 ip=0x7ff8f8412d5b code=0x7ffc0000    # access
Nov 25 19:38:18 kernel: [461698.749306] audit: ... syscall=257 compat=0 ip=0x7ff8f8412ec8 code=0x7ffc0000   # openat
Nov 25 19:38:18 kernel: [461698.749315] audit: ... syscall=5 compat=0 ip=0x7ff8f8412c99 code=0x7ffc0000     # fstat
Nov 25 19:38:18 kernel: [461698.749317] audit: ... syscall=9 compat=0 ip=0x7ff8f84130e6 code=0x7ffc0000     # mmap
Nov 25 19:38:18 kernel: [461698.749323] audit: ... syscall=3 compat=0 ip=0x7ff8f8412d8b code=0x7ffc0000     # close

We perform the above example in a directory containing our KinD cluster definition, the pods directory and the profiles directory. We first use the kind command to create the cluster from the definition. We then apply the audit-seccomp pod, which uses the audit.json profile. Finally we inspect syslog to see the audit messages logged for the audit-seccomp pod. Each of these messages contains syscall=..., which specifies the syscall ID (for the x86_64 architecture); it translates to the name shown in the comment at the end of each line.
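
Translating the IDs by hand gets tedious quickly, so here is a small sketch that pulls the syscall=... values out of such log lines and maps them to names. The lookup table contains only the five x86_64 numbers that appear in the log excerpt above; a real tool would need the complete table, and the function name is made up:

```python
import re

# Only the x86_64 syscall numbers seen in the excerpt; deliberately incomplete.
X86_64_SYSCALLS = {3: "close", 5: "fstat", 9: "mmap", 21: "access", 257: "openat"}

SYSCALL_RE = re.compile(r"\bsyscall=(\d+)\b")


def syscalls_in_audit_log(lines, table=X86_64_SYSCALLS):
    """Return syscall names in order of first appearance in audit log lines.

    Hypothetical helper; numbers missing from the table are reported as
    "unknown(<number>)".
    """
    seen = []
    for line in lines:
        match = SYSCALL_RE.search(line)
        if not match:
            continue  # not a seccomp audit line
        number = int(match.group(1))
        name = table.get(number, f"unknown({number})")
        if name not in seen:
            seen.append(name)
    return seen


log = [
    "kernel: audit: ... syscall=21 compat=0 ip=0x7ff8f8412d5b code=0x7ffc0000",
    "kernel: audit: ... syscall=257 compat=0 ip=0x7ff8f8412ec8 code=0x7ffc0000",
    "kernel: audit: ... syscall=21 compat=0 ip=0x7ff8f8412d5b code=0x7ffc0000",
]
print(syscalls_in_audit_log(log))
```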

Now that we have confirmed that seccomp works as expected, we can apply a real profile (Docker's default). To do that we will use a different pod:

# pods/default.yaml
apiVersion: v1
kind: Pod
metadata:
  name: default-seccomp
  labels:
    app: default-seccomp
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: test-container
    image: r.j3ss.co/amicontained
    command: [ "/bin/sh", "-c", "--" ]
    args: [ "amicontained" ]
    securityContext:
      allowPrivilegeEscalation: false

We made a few changes here. Namely, we changed the seccompProfile section, where we now specify the RuntimeDefault type, and we also changed the image to amicontained, a container introspection tool that will tell us which syscalls are blocked, as well as some other interesting security info.

After applying this pod, we can see the following in the logs:

~ $ kubectl apply -f pods/default.yaml
~ $ kubectl logs default-seccomp
Container Runtime: docker
Has Namespaces:
	pid: true
	user: false
AppArmor Profile: unconfined
Capabilities:
	BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: filtering
Blocked Syscalls (61):
	PTRACE SYSLOG SETPGID SETSID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE
    DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY
    REQUEST_KEY KEYCTL MIGRATE_PAGES UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF
    USERFAULTFD PKEY_MPROTECT PKEY_ALLOC PKEY_FREE

This shows that Docker's default profile blocks the above 61 syscalls. If we didn't include the seccompProfile section with the RuntimeDefault type, it would be just 22 (you can test it for yourself if you don't trust me on this 😉). In my opinion this is a great improvement in security for very little actual effort.
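
If you want to compare two such reports (for example with and without the RuntimeDefault profile), you can diff the blocked lists instead of reading them by eye. The following is a rough sketch that relies only on the layout of the log output shown above, a "Blocked Syscalls (N):" line followed by indented names; the function name and the tiny sample reports are made up:

```python
import re


def blocked_syscalls(report):
    """Return (declared_count, set_of_names) from an amicontained-style report.

    Hypothetical helper; it reads the indented lines that follow the
    "Blocked Syscalls (N):" header.
    """
    lines = report.splitlines()
    for index, line in enumerate(lines):
        match = re.match(r"Blocked Syscalls \((\d+)\):", line.strip())
        if not match:
            continue
        names = []
        for follow in lines[index + 1:]:
            if not follow[:1].isspace():
                break  # first unindented (or empty) line ends the list
            names.extend(follow.split())
        return int(match.group(1)), set(names)
    return 0, set()


# Made-up miniature reports, not real measurements.
with_profile = "Seccomp: filtering\nBlocked Syscalls (3):\n\tPTRACE SYSLOG\n    REBOOT\n"
without_profile = "Seccomp: disabled\nBlocked Syscalls (1):\n\tSYSLOG\n"

count, names = blocked_syscalls(with_profile)
_, baseline = blocked_syscalls(without_profile)
print(sorted(names - baseline))  # syscalls blocked only when the profile is on
```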

If we decide that the default is not good enough or that we need to make some modifications, we can deploy our custom profile. Here we will demonstrate that with the "no chmod" profile and the following pod:

# pods/no-chmod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-chmod-seccomp
  labels:
    app: no-chmod-seccomp
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/no-chmod.json
  containers:
  - name: test-container
    image: ubuntu
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "touch test; chmod +x test; while true; do sleep 30; done;" ]
    securityContext:
      allowPrivilegeEscalation: false

This pod is very similar to the "audit" pod shown previously. We just switched the localhostProfile to point to a different file and changed the container args to include a chmod command, so that we can confirm that our modified seccomp profile works as expected:

~ $ kubectl apply -f pods/no-chmod.yaml
~ $ kubectl logs no-chmod-seccomp
chmod: changing permissions of 'test': Operation not permitted

The logs show the expected result - the seccomp profile blocks the attempt to chmod the test file, and chmod fails with Operation not permitted. This shows that it's quite straightforward to adjust the profile to our liking or needs.

Be careful though when modifying these kinds of default profiles - if you end up blocking a bit too much and your pod/container can't start, the pod will be in Error state, but you will see nothing in logs and nothing useful in events in kubectl describe pod ..., so bear that in mind when debugging seccomp related issues.

Conclusion

Even after reading this article, modifying or creating your own seccomp profile might not be one of your top priorities. It is, however, important to be aware of this powerful tool and to be able to use it when needed, for example in Kubernetes, where it's not enforced by default, which can easily become a big security problem. For this reason I would recommend that you - at the very least - enable the "audit" profile, so you can monitor the syscalls being used and use that information to later create your own profile or validate that the default will work for your applications.

Also, if you take away anything from this article, it should probably be that seccomp is an important security layer and that you should never run your containers unconfined.
