Synthetic monitoring can be a great tool for proactively identifying performance problems, checking server availability, monitoring DNS resolution and much more.
However, when it comes to synthetic testing, engineers often rely on third-party platforms such as Datadog or New Relic that provide this type of monitoring. If you're running your applications and services on Kubernetes though, you can spin up a synthetic monitoring platform yourself using Kuberhealthy. In this article we will take a look at how to deploy it, configure it, create synthetic checks and set up monitoring and alerting, all inside your own cluster.
Deploy It
Before we go ahead and deploy it, let's first answer one question: what actually is Kuberhealthy?
Kuberhealthy is a CNCF incubator project. It's a Kubernetes operator that provides the KuberhealthyCheck custom resource, which lets you create built-in or custom synthetic checks that test whether your cluster, its components or even external services are running as expected.
To deploy it, we first need to satisfy some prerequisites:
Kuberhealthy exposes synthetic check results as Prometheus metrics, so naturally we need the Prometheus stack running on our cluster. If you're running applications and services on Kubernetes, chances are you already have it, but in case you don't, or you want to follow along in a playground cluster, you can use the following:
minikube delete && minikube start \
--kubernetes-version=v1.26.1 \
--memory=6g \
--bootstrapper=kubeadm \
--extra-config=kubelet.authentication-token-webhook=true \
--extra-config=kubelet.authorization-mode=Webhook \
--extra-config=scheduler.bind-address=0.0.0.0 \
--extra-config=controller-manager.bind-address=0.0.0.0
minikube addons disable metrics-server
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack -f values.yaml
# https://raw.githubusercontent.com/kuberhealthy/kuberhealthy/master/deploy/grafana/dashboard.json
kubectl apply -f https://gist.githubusercontent.com/MartinHeinz/6c52f677d7a2fd3a3ff1819190ecd59d/raw/\
f53a93c425027d9051d1bcdfc3568c6f4a7d9505/kuberhealthy-dashboard-cm.yaml
The above creates a new minikube cluster with the flags necessary for running kube-prometheus-stack (Prometheus Operator), which it then deploys using a Helm Chart. You may also notice that we provided values.yaml during chart installation - this file includes the Alertmanager configuration that we will later use to send Slack notifications/alerts for failing Kuberhealthy checks. This values.yaml can be found in this gist.
Finally, we also deployed a custom Grafana dashboard for Kuberhealthy. This dashboard is available in the Kuberhealthy repository, and here we deploy it as a ConfigMap which will be automatically picked up by kube-prometheus-stack.
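For reference, the Alertmanager part of that values.yaml is a route and Slack receiver roughly like the sketch below - the webhook URL and channel here are placeholders, the actual file in the gist is what gets installed:
alertmanager:
  config:
    route:
      receiver: slack
      routes:
        - receiver: slack
          matchers:
            - severity =~ "warning|critical"
    receivers:
      - name: slack
        slack_configs:
          - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder webhook URL
            channel: "#alerts"
            send_resolved: true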
With all of this deployed, we can access Prometheus, Grafana and Alertmanager using:
kubectl port-forward -n default svc/monitoring-kube-prometheus-prometheus 9090
kubectl port-forward -n default svc/monitoring-grafana 3000:80 # user: admin, password: prom-operator
kubectl port-forward -n default svc/monitoring-kube-prometheus-alertmanager 9093
With Prometheus out of the way, let's now deploy Kuberhealthy:
helm repo add kuberhealthy https://kuberhealthy.github.io/kuberhealthy/helm-repos
helm install -n kuberhealthy kuberhealthy kuberhealthy/kuberhealthy --create-namespace --values values.yaml
Kuberhealthy is also available as a Helm Chart, which makes it easy to deploy. We again provide a values.yaml to tweak the configuration:
# https://github.com/kuberhealthy/kuberhealthy/tree/master/deploy/helm/kuberhealthy
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
    release: monitoring
    namespace: default
    endpoints:
      # https://github.com/kuberhealthy/kuberhealthy/issues/726
      bearerTokenFile: ''
  prometheusRule:
    enabled: true
    release: monitoring
    namespace: default

check:
  daemonset:
    enabled: false
  deployment:
    enabled: false
  dnsInternal:
    enabled: false
We use values.yaml to enable the integration with Prometheus and to specify how the serviceMonitor and prometheusRule provided by Kuberhealthy should be configured so that Prometheus recognizes them. This is done by setting the release field to the name of the Prometheus release (helm install monitoring ...).
Kuberhealthy also comes with some built-in checks enabled by default - we disable those for the time being (the check stanza).
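If you want to confirm that the ServiceMonitor and PrometheusRule objects were actually created by the chart (exact names and namespaces can differ between chart versions), a quick look like this should do:
kubectl get servicemonitors,prometheusrules --all-namespaces | grep -i kuberhealthy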
Now we can test if it's running:
kubectl port-forward -n kuberhealthy svc/kuberhealthy 8080:80
curl localhost:8080 | jq .
{
"OK": true,
"Errors": [],
"CheckDetails": {},
"JobDetails": {},
"CurrentMaster": "kuberhealthy-6b897c89cf-2jpt7"
}
curl 'localhost:8080/metrics'
# HELP kuberhealthy_running Shows if kuberhealthy is running error free
# TYPE kuberhealthy_running gauge
kuberhealthy_running{current_master="kuberhealthy-6b897c89cf-2jpt7"} 1
# HELP kuberhealthy_cluster_state Shows the status of the cluster
# TYPE kuberhealthy_cluster_state gauge
kuberhealthy_cluster_state 1
...
Generally, you shouldn't need to query these endpoints, because Kuberhealthy metrics are automatically scraped by Prometheus, but it can be handy for doing a manual check or debugging.
Here we curl the kuberhealthy service, which gives us the JSON status of the Kuberhealthy cluster. By default this returns info for all checks in the cluster. If you want to filter by namespace, you can instead use e.g. curl 'localhost:8080/?namespace=default' for the default namespace.
In the response from the /metrics endpoint we see two metrics - kuberhealthy_cluster_state and kuberhealthy_running. The former provides an "aggregated status" of all checks, meaning that if any check in the cluster returns 0 (fail), then kuberhealthy_cluster_state will also be 0. The latter - kuberhealthy_running - tells us whether the Kuberhealthy cluster itself is running.
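With Prometheus port-forwarded (localhost:9090), two quick sanity queries based on these semantics might look like this:
# does any check in the cluster currently fail?
kuberhealthy_cluster_state < 1

# is Kuberhealthy itself reporting as healthy?
kuberhealthy_running == 1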
Configuring Checks
With deployment and configuration done, we can start deploying our checks:
apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: ping-check
  namespace: kuberhealthy
spec:
  runInterval: 30m
  timeout: 10m
  podSpec:
    containers:
      - env:
          - name: CONNECTION_TIMEOUT
            value: "10s"
          - name: CONNECTION_TARGET
            value: "tcp://google.com:443"
        image: kuberhealthy/network-connection-check:v0.2.0
        name: main
Each check is an instance of the KuberhealthyCheck custom resource, and each of them specifies runInterval, timeout and podSpec. These three fields set how often the check should run, after how long it should fail due to timeout, and the YAML spec of the Pod that will run the check. For the podSpec, the important parts are image and env. The image decides which check we will run - e.g. kuberhealthy/network-connection-check performs a ping check, while kuberhealthy/ssl-expiry-check runs an SSL certificate expiration check - while the env variables passed to the Pod (and container) configure how the check will be run, e.g. which host it should query (CONNECTION_TARGET).
After deploying this check, a check Pod will be created in the kuberhealthy namespace. We can check its logs:
kubectl logs -n kuberhealthy ping-check-1679831147
time="2023-03-26T11:45:53Z" level=info msg="Found instance namespace: kuberhealthy"
time="2023-03-26T11:45:53Z" level=info msg="Kuberhealthy is located in the kuberhealthy namespace."
time="2023-03-26T11:45:53Z" level=info msg="Check time limit set to: 9m48.53977037s"
time="2023-03-26T11:45:53Z" level=info msg="CONNECTION_TARGET_UNREACHABLE could not be parsed."
time="2023-03-26T11:45:53Z" level=info msg="Running network connection checker"
time="2023-03-26T11:45:53Z" level=info msg="Successfully reported success to Kuberhealthy servers"
time="2023-03-26T11:45:53Z" level=info msg="Done running network connection check for: tcp://google.com:443"
And it was successful!
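Logs aren't the only place to look - Kuberhealthy also records each check's last result in a KuberhealthyState resource. In recent Kuberhealthy versions the short names khcheck and khstate should work; if they don't resolve in yours, use the full resource names:
kubectl get khcheck -n kuberhealthy
kubectl get khstate -n kuberhealthy ping-check -o yaml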
We've now tested a simple ping - what else can we deploy?
apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: duration-check
  namespace: kuberhealthy
spec:
  runInterval: 5m
  timeout: 10m
  podSpec:
    containers:
      - name: main
        image: kuberhealthy/http-check:v1.5.0
        imagePullPolicy: IfNotPresent
        env:
          - name: CHECK_URL
            value: "https://httpbin.org/delay/9"
          - name: COUNT
            value: "5"
          - name: SECONDS
            value: "1"
          - name: REQUEST_TYPE
            value: "GET"
          - name: PASSING
            value: "80"
---
apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: http-content-check
  namespace: kuberhealthy
spec:
  runInterval: 60s
  timeout: 2m
  podSpec:
    containers:
      - image: kuberhealthy/http-content-check:v1.5.0
        imagePullPolicy: IfNotPresent
        name: main
        env:
          - name: "TARGET_URL"
            value: "https://httpbin.org/anything/whatever"
          - name: "TARGET_STRING"
            value: "whatever"
          - name: "TIMEOUT_DURATION"
            value: "30s"
Here we use two more built-in checks - http-check and http-content-check - which serve a similar use case as the ping check. The first one lets you make HTTP request(s) to a specified URL, with a customizable number of requests, expected passing percentage, timeout for individual requests, as well as request type.
The other lets you check whether some value (TARGET_STRING) is present in the response from the requested URL.
Another useful synthetic test is checking whether a website's SSL certificate is about to expire - there's a built-in check for that too:
apiVersion: comcast.github.io/v1
kind: KuberhealthyCheck
metadata:
  name: website-ssl-expiry-30d
  namespace: kuberhealthy
spec:
  runInterval: 24h
  timeout: 15m
  podSpec:
    containers:
      - env:
          - name: DOMAIN_NAME
            value: "martinheinz.dev"
          - name: PORT
            value: "443"
          - name: DAYS
            value: "30"
          - name: INSECURE
            value: "false" # Switch to 'true' if using 'unknown authority' (intranet)
        image: kuberhealthy/ssl-expiry-check:v3.2.0
        imagePullPolicy: IfNotPresent
        name: main
There are a couple more built-in checks you can use, but I don't want to go over every single one of them. Instead, I would recommend you check out the check registry, which includes a list of all the available built-in checks along with examples.
Monitoring
With all the checks in place, it's time to start monitoring their results. To do so, we will write some PromQL queries. Besides the kuberhealthy_cluster_state and kuberhealthy_running metrics mentioned earlier, Kuberhealthy provides kuberhealthy_check{check='namespace/check-name'} and kuberhealthy_check_duration_seconds{check='namespace/check-name'} for each check. We will use those to build our monitoring rules and alerts.
For availability reasons, Kuberhealthy runs multiple replicas of the operator, which means that we will get multiple results (series) for each queried check - one for each operator Pod.
To avoid that, we will only query results (series) from the current master Pod in the Kuberhealthy cluster, which provides the authoritative data. To do so, we will use the following query:
label_replace(kuberhealthy_check{check="kuberhealthy/ping-check"}, "current_master", "$1", "pod", "(.+)")
  * on (current_master) group_left()
  topk(1, kuberhealthy_running{}) < 1
Let's explain what's happening here. The kuberhealthy_running metric has a current_master label that refers to the master Pod in the Kuberhealthy cluster, while the kuberhealthy_check metric has a pod label which refers to the Pod from which it originates. We only want the kuberhealthy_check series whose pod label value equals the current_master label value of kuberhealthy_running, but to be able to match two metrics against each other they need to have a label with the same name. So, we use the label_replace function to copy the value of the pod label in kuberhealthy_check into a current_master label. Now both metrics have a current_master label and we can query only the ones that match.
Alternatively, instead of taking results only from the current master, you can simply use topk(1, kuberhealthy_check{check="kuberhealthy/ping-check"}) to grab the first series, which should work fine almost always, assuming the data is consistent across all cluster Pods.
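Written out as a full alert expression, that simpler variant would look something like this (it's essentially what we will use for the SSL expiry rule below):
topk(1, kuberhealthy_check{check="kuberhealthy/ping-check"}) < 1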
Now, when you query any of the above kuberhealthy_* metrics, you will notice that they have a status label with a value of 0/1. While the metric value describes whether the check succeeded or failed, the status label describes whether there was an error (0) or not (1). This is important for monitoring, because there is a big difference between a failing check and a broken one.
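Going by those semantics, you could single out errored (broken) check runs with a selector like the one below - though it's worth verifying against your Kuberhealthy version that the label behaves this way before alerting on it:
# series whose last run ended in an error rather than a plain check failure
kuberhealthy_check{status="0"}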
Now we know what queries to build, but we need to create PrometheusRule resource(s) from them to be able to set up monitoring/alerting:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: synthetics
  namespace: default
  labels:
    prometheus: prometheus
    release: monitoring
spec:
  groups:
    - name: synthetics
      rules:
        - alert: PingFailed
          expr: >
            label_replace(kuberhealthy_check{check="kuberhealthy/ping-check"}, "current_master", "$1", "pod", "(.+)")
            * on (current_master) group_left() topk(1, kuberhealthy_running{}) < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: HTTP Ping failed
            description: "Kuberhealthy was not able to reach tcp://google.com:443"
        - alert: SslExpiryLessThan30d
          expr: topk(1, kuberhealthy_check{check='kuberhealthy/website-ssl-expiry-30d'}) < 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Server certificate will expire in less than 30 days
            description: "Server certificate will expire in less than 30 days"
        - alert: DurationCheckFailed
          expr: >
            avg without (endpoint, container, service, current_master, exported_namespace, job)
            (label_replace(kuberhealthy_check_duration_seconds{check='kuberhealthy/duration-check'}, "current_master", "$1", "pod", "(.+)")
            * on (current_master) group_left() topk(1, kuberhealthy_running{})) > 50
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Check taking more than 50s to complete
            description: Check taking more than 50s to complete ().
The important parts above really are just the expr stanzas. In the first one we use the query that takes data from the Kuberhealthy master Pod and expects the value to be 1, otherwise the alert gets triggered (after 5 minutes). In the second one we take the simpler approach and just grab the first series using topk.
In the third rule we use the kuberhealthy_check_duration_seconds metric to compute the average time it takes to run the check, and we have it fail if that's more than 50 seconds. A word of caution for the "duration" metrics though - they describe the lifetime of the whole check Pod, not the individual check attempts - you should take that into consideration when deciding the threshold for rule success/fail.
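After applying the PrometheusRule (the filename below is just a hypothetical name for the manifest above), you can confirm that Prometheus actually loaded the group - it can take a minute or two for the operator to sync it - with something like:
kubectl apply -f synthetics-prometheusrule.yaml
curl -s localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name == "synthetics")'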
Finally, if you used the provided Alertmanager configuration shown in the beginning, you should be able to receive Slack alerts such as:
In addition to these binary (true/false) rules, the Kuberhealthy docs provide examples of calculating availability, utilization and latency from the available metrics.
Writing Custom Checks
So far we've only used the built-in checks which, to be honest, will be sufficient for most tests. However, it's possible to build your own. Such a custom check could - for example - test a service that requires authentication, or run a database-native check that validates whether it's possible to connect to and run a query against the DB.
Building a custom check would warrant a separate article, so instead, to avoid making this one way too long, I will just leave you with the docs link, which explains how you can build a check image in the language of your choice.
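The gist of the protocol is simple though: every check Pod gets a KH_REPORTING_URL environment variable injected by Kuberhealthy, and the check reports its result back as JSON. A minimal shell sketch could look roughly like this - my-test-command is a placeholder for whatever you actually want to test:
#!/bin/sh
# Run the actual test, then report the result back to Kuberhealthy.
if my-test-command; then   # placeholder for the real test
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{"OK": true, "Errors": []}' "$KH_REPORTING_URL"
else
  curl -s -X POST -H "Content-Type: application/json" \
    -d '{"OK": false, "Errors": ["my-test-command failed"]}' "$KH_REPORTING_URL"
fi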
Also, if you want some inspiration or a starting point, I have a repository with custom check(s), such as jq-check, which uses a jq query to check whether an HTTP JSON response contains the expected data/value.
Conclusion
If you're running applications and services on Kubernetes, chances are you're also running the Prometheus stack and relying on application metrics for monitoring. While this is a good practice, it's not necessarily sufficient, as metrics alone aren't suitable for monitoring everything, such as SSL expiration or server/database connectivity.
Tools like Kuberhealthy are great for filling these gaps in monitoring, while allowing you to use the same, familiar interface - that is, Kubernetes and Prometheus.
Finally, while Kuberhealthy runs on Kubernetes and is aimed at testing the Kubernetes cluster itself, it is not confined to the cluster. It can also be used to test external resources, such as databases or services in the cloud, or legacy applications that don't expose metrics.