Advanced Features of Kubernetes' Horizontal Pod Autoscaler

Most people who use Kubernetes know that you can scale applications using the Horizontal Pod Autoscaler (HPA) based on their CPU or memory usage. There are, however, many more HPA features that you can use to customize the scaling behaviour of your application, such as scaling on custom application metrics or external metrics, as well as alpha/beta features like "scaling to zero" or container metrics scaling.

In this article we will explore all of these options, so that we can take full advantage of the available features of HPA and get a head start on the features that are coming in future Kubernetes releases.

Setup

Before we get started with scaling, we first need a testing environment. For that we will use a KinD (Kubernetes in Docker) cluster defined by the following YAML:


# cluster.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  HPAScaleToZero: true
  HPAContainerMetrics: true
  LogarithmicScaleDown: true
nodes:
- role: control-plane
- role: worker
- role: worker
- role: worker

This manifest configures the KinD cluster with 1 control plane node and 3 workers. Additionally, it enables a couple of feature gates related to autoscaling, which will later allow us to use some alpha/beta features of HPA. To create a cluster with the above configuration, you can run:


kind create cluster --config ./cluster.yaml --name autoscaling --image=kindest/node:v1.23.6

Apart from the cluster, we will also need an application to scale. For that we will use the resource consumer tool and its image, which are used in Kubernetes end-to-end testing. To deploy it, you can run:


kubectl create deployment resource-consumer --image=gcr.io/k8s-staging-e2e-test-images/resource-consumer:1.11
kubectl set resources deployment resource-consumer --requests=cpu=500m,memory=256Mi

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  labels:
    app: resource-consumer
  name: resource-consumer
  namespace: default
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: resource-consumer
EOF

This application is very handy in this situation, as it allows us to simulate CPU and memory consumption of a Pod. It can also expose custom metrics which are needed for scaling based on custom/external metrics. To test this out we can run:


# Consume CPU (300m for 10min):
kubectl run curl --image=curlimages/curl:7.83.1 \
  --rm -it --restart=Never -- \
  curl --data "millicores=300&durationSec=600" http://resource-consumer:8080/ConsumeCPU

# Expose metric "custom_metric" with value 100 for 10min at endpoint /metrics
kubectl run curl --image=curlimages/curl:7.83.1 --rm \
  -it --restart=Never -- \
  curl --data "metric=custom_metric&delta=100&durationSec=600" http://resource-consumer:8080/BumpMetric

Next, we also need to deploy the services that collect the metrics on which we will later scale our test application. The first of these is the Kubernetes metrics-server, which is usually available in a cluster by default. That's not the case in KinD, so to deploy it we need to run:


kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.5.0/components.yaml
kubectl patch -n kube-system deployment metrics-server --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

metrics-server allows us to monitor basic metrics such as CPU and memory usage, but we also want to implement scaling based on custom metrics, such as the ones exposed by an application on its /metrics endpoint, or even external ones like the depth of a queue running outside of the cluster. For these we will need:

Prometheus Operator to gather the custom/external metrics.
ServiceMonitor object(s) to tell Prometheus how to scrape our application's metrics.
Prometheus adapter to get custom/external metrics from Prometheus instance into Kubernetes API.

You can refer to the end-to-end walkthrough for more details of the setup.

The above requires a lot of setup, so for the purpose of this article and for your convenience, I've made a script and a set of manifests that you can use to spin up a KinD cluster along with all the required components. All you need to do is run the setup.sh script from this repository.

After running the script, we can verify that everything is ready using the following commands:


# To verify availability of metrics run:
kubectl top nodes
# NAME                        CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# autoscaling-control-plane   113m         0%     1024Mi          1%
# autoscaling-worker          49m          0%     385Mi           0%
# autoscaling-worker2         42m          0%     381Mi           0%
# autoscaling-worker3         37m          0%     276Mi           0%

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq .  # also works with "pods"
# ...
#     {
#      "metadata": {
#        "name": "autoscaling-worker3",
#        "labels": { ... }
#      },
#      "window": "20s",
#      "usage": {
#        "cpu": "43077193n",
#        "memory": "283212Ki"
#      }

# To query/verify custom metrics:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq . # also works for "external" instead of "custom"
# ...
# "name": "pods/custom_metric",
# "singularName": "",
# "namespaced": true,
# "kind": "MetricValueList",
# "verbs": [ "get" ]
# ...

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/custom_metric" | jq .
# ...
# {
#   "kind": "MetricValueList",
#   "apiVersion": "custom.metrics.k8s.io/v1beta1",
#   "metadata": {"selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/custom_metric"},
#   "items": [{
#       "describedObject": {
#         "kind": "Pod",
#         "namespace": "default",
#         "name": "resource-consumer-6bf5898d6f-gzzgm",
#         "apiVersion": "/v1"
#       },
#       "metricName": "custom_metric", "value": "100"
#     }]}
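
The raw API responses above report usage as Kubernetes quantities with unit suffixes, which can be harder to read than the output of kubectl top. As a small sketch (the helper names are made up for illustration, and only the n and Ki suffixes seen above are handled), the values can be converted with plain shell arithmetic:

```shell
# Convert a CPU quantity in nanocores (suffix "n") to whole millicores.
# 1 millicore = 1,000,000 nanocores; the result is truncated.
nanocores_to_millicores() {
  n=${1%n}                      # strip the trailing "n"
  echo "$(( n / 1000000 ))m"
}

# Convert a memory quantity in kibibytes (suffix "Ki") to whole mebibytes.
kib_to_mib() {
  k=${1%Ki}                     # strip the trailing "Ki"
  echo "$(( k / 1024 ))Mi"
}

nanocores_to_millicores 43077193n   # prints 43m
kib_to_mib 283212Ki                 # prints 276Mi
```

The 276Mi result lines up with the kubectl top nodes output for autoscaling-worker3 shown earlier. CPU readings fluctuate between samples, so they need not match exactly.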

More helpful commands can be found in the output of the above-mentioned script or in the repository README.

Basic Autoscaling

Now that we have our infrastructure up and running, we can start scaling the test application. The simplest way to do so is to create an HPA with a command like kubectl autoscale deploy resource-consumer --min=1 --max=5 --cpu-percent=75. This, however, creates an HPA with an apiVersion of autoscaling/v1, which lacks most of the features.

So, instead, we will create the HPA with YAML, specifying autoscaling/v2 as the apiVersion:


apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: resource-consumer-v2
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 200Mi

The above HPA will use basic metrics gathered from application Pod(s) by metrics-server. To test out the scaling we can simulate heavy memory usage:


kubectl run curl --image=curlimages/curl:7.83.1 \
  --rm -it --restart=Never -- \
  curl --data "megabytes=500&durationSec=600" http://resource-consumer:8080/ConsumeMem
kubectl get hpa -w
# NAME                   REFERENCE                      TARGETS                       MINPODS   MAXPODS   REPLICAS   AGE
# resource-consumer-v2   Deployment/resource-consumer   4689920/200Mi, 0%/75%         1         5         1          81s
# resource-consumer-v2   Deployment/resource-consumer   530415616/200Mi, 0%/75%       1         5         1          2m23s
# resource-consumer-v2   Deployment/resource-consumer   265820160/200Mi, 0%/75%       1         5         3          2m31s
# resource-consumer-v2   Deployment/resource-consumer   212226867200m/200Mi, 0%/75%   1         5         5          5m50s
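
The replica counts in this output follow from the scaling formula described in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas * currentValue / targetValue), with no change when the ratio stays within a tolerance (10% by default). The following is a simplified sketch of that calculation, not the controller's actual code. The function name is hypothetical, and the sketch ignores maxReplicas, unready Pods and stabilization windows:

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_VALUE TARGET_VALUE
# Values must be integers in the same unit (here: bytes).
desired_replicas() {
  replicas=$1; current=$2; target=$3
  # Within the default 10% tolerance the replica count is left alone.
  if [ $(( current * 10 )) -ge $(( target * 9 )) ] &&
     [ $(( current * 10 )) -le $(( target * 11 )) ]; then
    echo "$replicas"
  else
    # Integer form of ceil(replicas * current / target).
    echo $(( (replicas * current + target - 1) / target ))
  fi
}

target=$(( 200 * 1024 * 1024 ))            # 200Mi in bytes
desired_replicas 1 530415616 "$target"     # prints 3
desired_replicas 5 212226867 "$target"     # prints 5
```

The first call corresponds to the second row of the output, where a single replica using about 506Mi leads to 3 replicas. The last row uses the m (milli) suffix, so 212226867200m is roughly 212226867 bytes, which is within 10% of the 200Mi target and therefore leaves the replica count unchanged.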

Custom Metrics

Scaling based on CPU and memory usage is often enough, but we're after the advanced scaling options. The first of them is scaling using custom metrics exposed by an application:


apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: resource-consumer-v2-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: resource-consumer
  minReplicas: 1
  maxReplicas: 5
  metrics:
  # kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/custom_metric" | jq .
  - type: Pods
    pods:
      metric:
        name: custom_metric
      target:
        type: AverageValue
        averageValue: 100

This HPA is configured to scale the application based on the value of custom_metric, which Prometheus scrapes from the application's /metrics endpoint. It will scale the application up if the average value of the specified metric across all Pods (.target.type: AverageValue) goes over 100.

The above uses a Pod metric to scale, but it's possible to specify any other object that has a metric attached to it:


# ...
  - type: Object
    object:
      metric:
        name: custom_metric
      describedObject:
        apiVersion: v1
        kind: Service
        name: resource-consumer
      target:
        type: Value
        value: 100

This snippet achieves the same as the previous one, this time, however, using a Service instead of a Pod as the source of the metric. It also shows that you can compare the metric directly against the scaling threshold by setting .target.type to Value instead of AverageValue.

To figure out which objects expose metrics that you can use in scaling, you can traverse the API using kubectl get --raw. For example, to look up the custom_metric for either Pod or Service you can use:
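
The difference between the two target types is easiest to see with numbers. In this sketch (hypothetical function names, with tolerance and replica limits ignored), a Value target compares the metric as-is and scales relative to the current replica count, while an AverageValue target first divides the metric by the number of Pods, so the result no longer depends on the current replica count:

```shell
# type: Value -> ceil(replicas * metric / target)
value_target_replicas() {
  replicas=$1; metric=$2; target=$3
  echo $(( (replicas * metric + target - 1) / target ))
}

# type: AverageValue -> ceil(metric / target)
average_value_target_replicas() {
  metric=$1; target=$2
  echo $(( (metric + target - 1) / target ))
}

# Object metric of 150 with a threshold of 100 and 2 running replicas:
value_target_replicas 2 150 100          # prints 3
average_value_target_replicas 150 100    # prints 2
```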


# Pod
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/custom_metric" | jq .
# Service
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/services/resource-consumer/custom_metric" | jq .
# Everything
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/" | jq .

Also, to help you troubleshoot, the HPA object provides a status stanza that shows whether the applied metric was recognized:


kubectl get hpa resource-consumer-v2-custom -o json | jq .status.conditions
[
...
  {
    "lastTransitionTime": "2022-05-17T12:36:03Z",
    "message": "the HPA was able to successfully calculate a replica count from pods metric custom_metric",
    "reason": "ValidMetricFound",
    "status": "True",
    "type": "ScalingActive"
  },
...
]

Finally, to test out the behavior of the above HPA, we can bump the metric exposed by the application and see how the application scales up:


# Raise custom_metric to 150
kubectl run curl --image=curlimages/curl:7.83.1 \
  --rm -it --restart=Never -- curl \
  --data "metric=custom_metric&delta=150&durationSec=600" http://resource-consumer:8080/BumpMetric

kubectl get hpa -w
# NAME                          REFERENCE                      TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# resource-consumer-v2-custom   Deployment/resource-consumer     0/100   1         5         1          10s
# resource-consumer-v2-custom   Deployment/resource-consumer   150/100   1         5         1          24s
# resource-consumer-v2-custom   Deployment/resource-consumer   150/100   1         5         2          40s
# resource-consumer-v2-custom   Deployment/resource-consumer   150/100   1         5         5          75s

External Metrics

To show the full potential of HPA, we will also try scaling an application based on an external metric. This would require us to scrape metrics from an external system running outside of the cluster, such as Kafka or PostgreSQL. We don't have that available, so instead we've configured Prometheus Adapter to treat certain metrics as external. The configuration that does this can be found [here](https://github.com/MartinHeinz/metrics-on-kind/blob/master/custom-metrics-config-map.yaml). All you need to know, though, is that with this test cluster, any application metric prefixed with external will go to the external metrics API. To test this out, we bump up such a metric and check if the API gets populated:


# Set external_queue_messages_ready to 150 for 10min
kubectl run curl --image=curlimages/curl:7.83.1 \
  --rm -it --restart=Never -- \
  curl --data "metric=external_queue_messages_ready&delta=150&durationSec=600" \
  http://resource-consumer:8080/BumpMetric

kubectl get --raw /apis/external.metrics.k8s.io/v1beta1/namespaces/default/external_queue_messages_ready | jq .
{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "items": [
    {
      "metricName": "external_queue_messages_ready",
      "value": "150"
    }
  ]
}

To then scale our deployment based on this metric, we can use the following HPA:


apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: resource-consumer-v2-external
spec:
# ...
  metrics:
  - type: External
    external:
      metric:
        name: external_queue_messages_ready
      target:
        type: Value
        value: 100

HPAScaleToZero

Now that we've gone through all the well-known features of HPA, let's also take a look at the alpha/beta ones that we enabled using feature gates. The first one is HPAScaleToZero.

As the name suggests, this allows you to set minReplicas in the HPA to zero, effectively turning the service off when there's no traffic. This can be useful for "bursty" workloads, for example when your application receives data from an external queue. In this use case the application can be safely scaled to zero when there are no messages waiting to be processed.

With the feature gate enabled we can simply run:


kubectl patch hpa resource-consumer-v2-external -p '{"spec":{"minReplicas": 0}}'

This sets the minimum replicas of the previously shown HPA to zero.

Be aware though, that this will only work for metrics of type External or Object.
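
To illustrate how the replica count reaches and leaves zero, here is a rough sketch based on my understanding of the HPA replica calculation for a Value target. The function name is hypothetical, and tolerance, stabilization windows and maxReplicas are left out:

```shell
# external_value_replicas CURRENT_REPLICAS METRIC TARGET
external_value_replicas() {
  replicas=$1; metric=$2; target=$3
  if [ "$replicas" -eq 0 ]; then
    # Scaling up from zero: there are no Pods to multiply by,
    # so the ratio alone decides: ceil(metric / target).
    echo $(( (metric + target - 1) / target ))
  else
    # Usual case: ceil(replicas * metric / target).
    echo $(( (replicas * metric + target - 1) / target ))
  fi
}

external_value_replicas 3 0 100     # queue is empty  -> prints 0
external_value_replicas 0 0 100     # still empty     -> prints 0
external_value_replicas 0 150 100   # messages arrive -> prints 2
```

A result of 0 is only allowed because minReplicas was patched to 0. With the default minimum of 1, the HPA would keep one replica running.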

HPAContainerMetrics

Another feature gate that we can make use of is HPAContainerMetrics, which allows us to use metrics of type: ContainerResource:


apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: resource-consumer-v2-container
spec:
# ...
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: resource-consumer
      target:
        type: Utilization
        averageUtilization: 75

This makes it possible to scale based on the resource utilization of individual containers rather than the whole Pod. This can be useful if you have a multi-container Pod with an application container and a sidecar, and you want to ignore the sidecar and scale the deployment based only on the application container.
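
To see why this matters, consider a hypothetical Pod in which both the application container and a sidecar request 500m of CPU (the sidecar and all numbers here are made up for illustration). Pod-level utilization is computed from the summed usage and requests of all containers, so a mostly idle sidecar can mask a busy application container:

```shell
# utilization USAGE_MILLICORES REQUEST_MILLICORES -> whole percent
utilization() {
  echo $(( $1 * 100 / $2 ))
}

app_usage=450;    app_request=500       # busy application container
sidecar_usage=50; sidecar_request=500   # mostly idle sidecar

pod_util=$(utilization $(( app_usage + sidecar_usage )) \
                       $(( app_request + sidecar_request )))
app_util=$(utilization "$app_usage" "$app_request")

echo "pod: ${pod_util}%  app container: ${app_util}%"
# prints: pod: 50%  app container: 90%
```

With a 75% target, the Pod-level metric (50%) would not trigger scaling, while the ContainerResource metric for the application container (90%) would.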

You can also view the breakdown of Pod/container metrics by running the following command:


POD=$(kubectl get pod -l app=resource-consumer -o jsonpath="{.items[0].metadata.name}")
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/$POD" | jq .

{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "resource-consumer-6bf5898d6f-gzzgm",
    "namespace": "default"
  },
  "window": "16s",
  "containers": [{
      "name": "resource-consumer",
      "usage": {
        "cpu": "0",
        "memory": "11028Ki"
      }}]}

LogarithmicScaleDown

Last but not least is the LogarithmicScaleDown feature flag.

Without this feature, the Pod that's been running for the least amount of time gets deleted first during downscaling. That's not always ideal, though, as it can create an imbalance in replica distribution, because newer Pods tend to serve less traffic than the older ones.

With this feature flag enabled, a semi-random selection of Pods is used instead when choosing the Pod to be deleted.
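
As I understand the proposal behind this feature, Pod ages are bucketed on a logarithmic scale, so that Pods whose ages fall into the same bucket are treated as equally old and the choice between them is effectively random. A minimal sketch of such bucketing, assuming base-2 buckets over the age in seconds (the unit and the function name are assumptions for illustration, not the controller's implementation):

```shell
# log2_bucket AGE -> floor(log2(AGE)) for AGE >= 1, via repeated halving
log2_bucket() {
  n=$1; bucket=0
  while [ "$n" -gt 1 ]; do
    n=$(( n / 2 ))
    bucket=$(( bucket + 1 ))
  done
  echo "$bucket"
}

log2_bucket 3600    # 1 hour old     -> prints 11
log2_bucket 4000    # ~67 minutes    -> prints 11
log2_bucket 86400   # 1 day old      -> prints 16
```

Here the Pods that are 3600 and 4000 seconds old land in the same bucket, while the day-old Pod lands in a higher one.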

For a full rationale and algorithm details see KEP-2185.

Closing Thoughts

In this article, I tried to cover most of the things you can do with the Kubernetes HPA to scale your application. There are, however, many more tools and options for scaling applications running in Kubernetes, such as the vertical pod autoscaler, which can help keep Pod resource requests and limits up to date.

Another option would be the predictive HPA by Digital Ocean, which tries to predict how many replicas a resource should have.

Finally, autoscaling doesn't end with Pods. The next step after setting up Pod autoscaling is to also set up cluster autoscaling, to avoid running out of available resources in your whole cluster.