Goodbye etcd, Hello PostgreSQL: Running Kubernetes with an SQL Database

etcd is the brain of every Kubernetes cluster - the key-value store that keeps track of all the objects in a cluster. It's intertwined and tightly coupled with Kubernetes, and it might seem like an inseparable part of a cluster. Or is it?

In this article we will explore how we could replace etcd with a PostgreSQL database, as well as why and when it might make sense to do so.

Why?

If you're running your own Kubernetes cluster, then you know the pains of managing etcd. Beyond the operator's perspective, etcd has little usage outside of Kubernetes and has been in a state of decline because no one wants to maintain it. Because of that, critical bugs take a long time to get fixed.

Besides, why not? Why shouldn't we use other storage backends with Kubernetes? Having more options is a good thing, and there really are no downsides to running Kubernetes with an RDBMS, whether it's PostgreSQL, MySQL, or anything else you might be comfortable with.

Also, running Kubernetes with an RDBMS isn't such a novel idea either - k3s, a production-grade Kubernetes distribution, can run with a relational database instead of etcd. If it works for k3s, why wouldn't it work for any other cluster?
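
For a sense of what that looks like in practice, here is roughly how k3s is pointed at a PostgreSQL datastore instead of etcd (an illustrative sketch - the connection string is a placeholder; see the k3s datastore documentation for the exact syntax):


# Illustrative sketch - k3s server backed by PostgreSQL instead of etcd
k3s server \
    --datastore-endpoint="postgres://someuser:somepass@localhost:5432/k3s"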

How?

As was mentioned in the beginning, Kubernetes and etcd are tightly coupled. Cluster components (namely the API server) expect an etcd-like interface they can write to and read from. Therefore, to use an SQL database as storage, we need an etcd-to-SQL translation layer, and that layer is Kine. Kine is the component of k3s that allows it to use various RDBMSs as an etcd replacement. It implements the gRPC functions that Kubernetes relies upon, so as far as Kubernetes is concerned, it is talking to an etcd server.

First things first though - before we get to running Kine, if we want to run Kubernetes with PostgreSQL, we will obviously need an instance of PostgreSQL:

Note: If you want to follow along or quickly spin up a cluster in VM backed by PostgreSQL, then you can check out my repository (k8s-without-etcd branch).


apt -y install postgresql postgresql-contrib
systemctl start postgresql.service

In this tutorial we will use PostgreSQL running as a systemd service, and we will also set up SSL for the database:


# Generate self signed root CA cert
openssl req -addext "subjectAltName = DNS:localhost" -nodes \
    -x509 -newkey rsa:2048 -keyout ca.key -out ca.crt -subj "/CN=localhost"

# Generate server cert to be signed
openssl req -addext "subjectAltName = DNS:localhost" -nodes \
    -newkey rsa:2048 -keyout server.key -out server.csr -subj "/CN=localhost"

# Sign the server cert
openssl x509 -extfile <(printf "subjectAltName=DNS:localhost") -req \
    -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out server.crt

chmod og-rwx ca.key
chmod og-rwx server.key

cp {server.crt,server.key,ca.crt} /var/lib/postgresql/
chown postgres:postgres /var/lib/postgresql/server.key

sed -i -e "s|ssl_cert_file.*|ssl_cert_file = '/var/lib/postgresql/server.crt'|g" /etc/postgresql/14/main/postgresql.conf
sed -i -e "s|ssl_key_file.*|ssl_key_file = '/var/lib/postgresql/server.key'|g" /etc/postgresql/14/main/postgresql.conf
sed -i -e "s|#ssl_ca_file.*|ssl_ca_file = '/var/lib/postgresql/ca.crt'|g" /etc/postgresql/14/main/postgresql.conf

systemctl restart postgresql.service

We use openssl to generate a root CA certificate, a signing request, and the server certificate and key. We then restrict permissions on the keys so that only their owner can read them. We copy the certificates and the server key to /var/lib/postgresql/ and make the postgres user the owner of the server key. Finally, we modify postgresql.conf to tell PostgreSQL to use our new SSL certificates and key.
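
One detail the commands above don't cover: Kine will later connect to the database over TCP with password authentication, so the postgres role needs a password. A minimal sketch, assuming the default postgres database and role are used - the value is just a placeholder and must match the connection string we hand to Kine later:


# Set a password for the postgres role (referenced later in Kine's connection string)
sudo -u postgres psql -c "ALTER USER postgres WITH PASSWORD 'somepass';"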

With the database running, we can move on to creating a cluster. We will do so using kubeadm and the configuration below:


# kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.56.2
  bindPort: 6443
nodeRegistration:
  criSocket: "unix:///var/run/crio/crio.sock"
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.25.0
networking:
  podSubnet: 10.244.0.0/16
etcd:
  external:
    endpoints:
    - http://127.0.0.1:2379
apiServer:
  timeoutForControlPlane: 1m0s

You can customize this to your needs. The only important part here is the etcd.external.endpoints array, which tells Kubernetes where etcd (or the etcd-compatible interface) is - in our case, this is where Kine will listen.

To build a cluster using this configuration, run:


kubeadm init --config=/.../kubeadm-config.yaml --upload-certs --ignore-preflight-errors ExternalEtcdVersion 2>&1 || true
kubectl taint nodes --all node-role.kubernetes.io/control-plane-

In addition to passing in the configuration file, we also tell kubeadm to ignore the etcd-related preflight error during startup, and considering that this playground cluster has only one node, we un-taint the control plane node so that we can run workloads on it.
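
Because we declared an external etcd in the kubeadm configuration, kubeadm skips creating its own etcd static Pod and simply points the API server at our endpoint. You can verify this in the generated manifest, which should contain something along these lines:


grep etcd-servers /etc/kubernetes/manifests/kube-apiserver.yaml
    - --etcd-servers=http://127.0.0.1:2379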

Now comes the fun part - setting up Kine. There are a couple of ways we could deploy it - as a plain process, as a systemd service, as a Pod in the kube-system namespace, or, my preferred option, as an API server sidecar.
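
The standalone option is the simplest to reason about - Kine is just a single binary. As a rough sketch (reusing the connection string and certificates from this article; Kine listens on port 2379 by default, which is exactly the endpoint we configured in kubeadm above), it would look something like this:


# Illustrative sketch - running Kine directly on the host as a plain process
kine --endpoint="postgres://postgres:somepass@localhost:5432/postgres" \
    --ca-file=/var/lib/postgresql/ca.crt \
    --cert-file=/var/lib/postgresql/server.crt \
    --key-file=/var/lib/postgresql/server.key

Here though, we will go with the sidecar approach.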

When we deployed the cluster with kubeadm, it already generated the static Pod manifests for us, including one for the API server (kube-apiserver.yaml). Therefore, we will need to patch that manifest to include a container running Kine:


# vim /etc/kubernetes/manifests/kube-apiserver.yaml
# ...
  containers:
  - name: kube-apiserver
  # ...  Existing API server container
  # -------------------------------------
  - image: rancher/kine:v0.10.1-amd64
    name: kine
    securityContext:  # Don't do this in real deployment...
      runAsUser: 0
      runAsGroup: 0
    command: [ "/bin/sh", "-c", "--" ]
    args: [ 'kine --endpoint="postgres://$(POSTGRES_USERNAME):$(POSTGRES_PASSWORD)@localhost:5432/postgres"
      --ca-file=/var/lib/postgresql/ca.crt
      --cert-file=/var/lib/postgresql/server.crt
      --key-file=/var/lib/postgresql/server.key' ]
    env:
      - name: POSTGRES_USERNAME  # This should be a secret
        value: "postgres"
      - name: POSTGRES_PASSWORD  # This should be a secret
        value: "somepass"
    volumeMounts:
      - mountPath: /var/lib/postgresql/
        name: kine-ssl
        readOnly: true
  volumes:
  # -------------------------------------
  # ... Existing volumes used by API Server container
  # -------------------------------------
  - hostPath:
      path: /var/lib/postgresql
      type: DirectoryOrCreate
    name: kine-ssl

Above is the container that we need to add to the API server static Pod. It uses the Kine image from Docker Hub (rancher/kine:...) and specifies a command that points Kine to the PostgreSQL database, as well as to the SSL certificates and key we generated earlier.

After applying these changes to the API server manifest, we will see that Kubelet successfully starts both the kube-apiserver and kine containers:


crictl ps
CONTAINER      IMAGE              CREATED         STATE    NAME            ATTEMPT  POD ID         POD
09172381b42f1  4d2edfd10d3e3f...  29 seconds ago  Running  kube-apiserver  0        3187286722eba  kube-apiserver-kubemaster
55ac5108ae677  ccdd8a15f4ca3e...  29 seconds ago  Running  kine            0        3187286722eba  kube-apiserver-kubemaster
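
Since Kine exposes an etcd-compatible API on 127.0.0.1:2379 and the API server Pod runs on the host network, we can even query it with plain etcdctl from the node. Assuming you have etcdctl installed, something like this should list keys straight through Kine:


# Illustrative sanity check - talk to Kine through the etcd v3 API
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 \
    get /registry/namespaces/ --prefix --keys-only | head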

And just to prove that we are up and running without etcd, we can check the kube-system namespace and see that there are no etcd Pods:


kubectl get pods -n kube-system
NAME                                 READY   STATUS    RESTARTS   AGE
kube-controller-manager-kubemaster   1/1     Running   0          5d19h
kube-scheduler-kubemaster            1/1     Running   0          5d19h
kube-apiserver-kubemaster            1/1     Running   0          5d19h
kube-proxy-mrfs5                     1/1     Running   0          5d19h
coredns-565d847f94-wrn82             1/1     Running   0          5d19h
coredns-565d847f94-7zvwr             1/1     Running   0          5d19h

Or, as an extra test, we can deploy some resources to the cluster - for example, a simple Deployment and Service:


kubectl get pods -n default
NAME                                  READY   STATUS    RESTARTS   AGE
example-deployment-78d75878cc-b56kl   1/1     Running   0          22h
example-deployment-78d75878cc-4ftj5   1/1     Running   0          22h
example-deployment-78d75878cc-bvpx6   1/1     Running   0          22h

kubectl get all -n default
NAME                                      READY   STATUS    RESTARTS   AGE
pod/example-deployment-78d75878cc-b56kl   1/1     Running   0          22h
pod/example-deployment-78d75878cc-4ftj5   1/1     Running   0          22h
pod/example-deployment-78d75878cc-bvpx6   1/1     Running   0          22h

NAME                      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/kubernetes        ClusterIP   10.96.0.1      <none>        443/TCP   6d17h
service/example-service   ClusterIP   10.98.111.70   <none>        80/TCP    22h

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/example-deployment   3/3     3            3           22h

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/example-deployment-78d75878cc   3         3         3       22h

And hooray, we can see everything running, which wouldn't be possible without working backing storage.

Exploring/Poking Around

With the complete cluster running, we can poke around and explore the PostgreSQL database. First, we log in:


psql -U postgres -p 5432 -h 127.0.0.1  # somepass

And then we can view the schema and tables:


-- List tables:
\dt public.*
        List of relations
 Schema | Name | Type  |  Owner
--------+------+-------+----------
 public | kine | table | postgres
(1 row)

-- Describe Kine table
\d public.kine;
                                        Table "public.kine"
     Column      |          Type          | Collation | Nullable |             Default
-----------------+------------------------+-----------+----------+----------------------------------
 id              | integer                |           | not null | nextval('kine_id_seq'::regclass)
 name            | character varying(630) |           |          |
 created         | integer                |           |          |
 deleted         | integer                |           |          |
 create_revision | integer                |           |          |
 prev_revision   | integer                |           |          |
 lease           | integer                |           |          |
 value           | bytea                  |           |          |
 old_value       | bytea                  |           |          |
Indexes:
    "kine_pkey" PRIMARY KEY, btree (id)
    "kine_id_deleted_index" btree (id, deleted)
    "kine_name_id_index" btree (name, id)
    "kine_name_index" btree (name)
    "kine_name_prev_revision_uindex" UNIQUE, btree (name, prev_revision)
    "kine_prev_revision_index" btree (prev_revision)

As you can see above, there's a single table named kine that holds all the data. Kine uses the database as log-structured storage, so every write from the API server creates a new row that stores the created or updated Kubernetes object.

Let's take a look at the data:


-- Some 1000+ rows
select count(*) from public.kine;
 count
-------
  1319
(1 row)

select name, encode(public.kine.value, 'escape') as value_plain
from public.kine where name like '/registry/pods/default/example%' limit 1;

select name, encode(public.kine.value, 'escape') as value_plain
from public.kine where name like '/registry/configmaps/default/example%' limit 1;

The name column uses the same structure as etcd - it specifies the path to the object in the cluster: /registry/RESOURCE_TYPE/NAMESPACE/NAME. The value column holds the actual manifest as a byte array. I omit the actual results of the queries here because the decoded data is quite ugly due to escaped whitespace and the presence of "managed fields" data, but if you try it yourself, you will be able to decipher it.
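
As an illustrative, hedged example of working with that key structure, you can get a rough overview of which resource types are stored by splitting the name column on the path separator (some keys also include an API group segment, so treat the result as approximate):


psql -U postgres -h 127.0.0.1 -c \
    "select distinct split_part(name, '/', 3) as resource_type from public.kine order by 1;"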

Scaling/Performance

We figured out how to run Kubernetes without etcd, but should you? How is the performance of such a cluster and does it scale?

As already mentioned, Kine - and therefore also RDBMS backends - is used by k3s, so we can check the k3s resource profiling docs to see how SQL databases perform compared to etcd.

k3s also has a test suite with a customizable database engine/backend, so you could run the tests and compare if you really wanted to.

For scaling with PostgreSQL in particular, you could also follow the advice in this GitHub issue and store different data types in different tables using PostgreSQL's partitioned tables.

Closing Thoughts

I believe that there are no downsides to using an RDBMS instead of etcd for a Kubernetes cluster, but getting rid of etcd won't solve all your issues. Every tool brings its own set of problems and challenges, and the same applies to PostgreSQL or any other SQL database.

You'd have to weigh the pros and cons of running your cluster with an SQL database for your particular use case - maybe the familiar SQL interface and your expertise in managing an RDBMS outweigh the overhead, hassle, or any possible issues that might come with replacing etcd.

...or, maybe just use k3s 😉.
