So, you are interested in implementing autoscaling in your Kubernetes cluster. I wrote a blog on that topic last year to help you see clearly what you can configure depending on the API version of HorizontalPodAutoscaler (HPA). With version 2 of HPA, you can use CPU and memory metrics as criteria for dynamically scaling your application up or down.

Now you know what it is and how to configure it, but you need to understand in more detail the calculation done by HPA in order to design it for your application in your cluster. Read on, you are in for a treat!

Using HPA for scaling a deployment

The HPA algorithm is described in the Kubernetes documentation in a great deal of detail. It also provides a nice walkthrough with some examples. However, I wanted a concrete example I could relate to, with autoscaling in my own cluster. In my previous blog I showed some testing I did with minikube, but today buckle up as we will jump into a live vanilla Kubernetes cluster!

Below is an example of what we see when using API version 2 of HPA with memory and CPU metrics used as criteria for autoscaling:

$ kubectl get hpa -n nginx
NAME            REFERENCE                  TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
nginx-ingress   Deployment/nginx-ingress   4%/50%, 6%/50%    2         10        2          198d

First, let's fully understand the output of the command above. We have a deployment named nginx-ingress that is piloted by an HPA also called nginx-ingress. Both objects (deployment and HPA) are configured in the namespace called nginx.

The HPA is used here to dynamically scale this deployment out and in, with a minimum of 2 pods and a maximum of 10. Currently there are 2 pods (called replicas) in this deployment doing the job of an NGINX Ingress Controller.

The TARGETS column shows the metrics used by the HPA. The first one is memory (4%/50%), where the current calculated average memory utilization is 4% and the target used for scaling is set to 50%. The second one is CPU (6%/50%), following the same logic.
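For reference, here is a minimal sketch of the manifest such an HPA could be created from, assuming autoscaling/v2 with two resource metrics and averageUtilization targets (the names match the output above; your actual manifest may differ):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress
  namespace: nginx
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress
  minReplicas: 2
  maxReplicas: 10
  metrics:
  # utilization is measured against the resource requests of the pods
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50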

Criteria for calculation in depth

Now that we have refreshed our memory on the basics of HPA, we can go further. To design and fine-tune the autoscaling of an application (here the deployment of an NGINX Ingress Controller) you will need to better understand how this calculation is done.

Here we use 2 metrics: memory and CPU. According to the Kubernetes documentation, “the HPA will calculate proposed replica counts for each metric, and then choose the one with the highest replica count”. It uses the following formula:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
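To see the formula at work with the documentation's own example: if the current metric value is 200m and the desired value is 100m, the ratio is 2.0 and the replica count is doubled:

desiredReplicas = ceil[currentReplicas * (200 / 100)] = 2 * currentReplicas

The next sections walk through this calculation with the real numbers of our cluster.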

The currentMetricValue can be seen by using the kubectl top command as follows:

$ kubectl top pod -n nginx
NAME                     CPU(cores)   MEMORY(bytes)
nginx-defaultbackend-1   1m           10Mi
nginx-ingress-1          4m           81Mi
nginx-ingress-2          8m           87Mi

I have simplified the names of the pods to make them easier to read. The nginx-ingress deployment currently has 2 nginx-ingress pods (= currentReplicas in the formula); nginx-defaultbackend is not part of it. currentMetricValue is the average value across all pods of this deployment, so (4 + 8) / 2 for the CPU and (81 + 87) / 2 for the memory.
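As a side note, if the namespace contains many pods you can filter on the labels of the deployment's pods. The label below is an assumption for illustration; check yours with kubectl get pods --show-labels:

$ kubectl top pod -n nginx -l app=nginx-ingress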

The requests used as the reference for these utilization percentages come from the resources of the container as defined in our deployment configuration:

...
        resources:
          limits:
            cpu: 200m
            memory: 4Gi
          requests:
            cpu: 100m
            memory: 2Gi
...
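For context, this resources block sits in the container spec of the deployment, under spec.template.spec.containers. A trimmed sketch of the path (the container name is an assumption):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress
  namespace: nginx
spec:
  ...
  template:
    spec:
      containers:
      - name: nginx-ingress
        ...
        resources:
          requests:
            cpu: 100m     # reference for the CPU utilization percentage
            memory: 2Gi   # reference for the memory utilization percentage
...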

Let's start by understanding where the TARGETS calculation of HPA comes from. We had 4% for the memory and 6% for the CPU, recalculated every 15 seconds by default. Each percentage is the current average utilization: the average usage of the pods divided by their resource request. For the memory this gives:

((81 + 87) / 2) / (2 * 1024) = 0.0410 = 4% by keeping only the integer part

The numerator is the average memory usage of the 2 current pods, and the denominator is the memory request of our deployment converted to Mi so that both use the same unit.

The same logic applies to the CPU metric as follows:

((4 + 8) / 2) / 100 = 0.06 = 6%

Now we can calculate the desiredReplicas. With a Utilization target, the ratio currentMetricValue / desiredMetricValue becomes the current average utilization divided by the 50% target:

desiredReplicas = ceil[2 * (4 / 50)] = ceil[0.16] = 1 for the memory
desiredReplicas = ceil[2 * (6 / 50)] = ceil[0.24] = 1 for the CPU

The ceil function rounds the result up to the next integer, so here 1. As the HPA is configured with a minimum of 2 pods, the deployment keeps 2 replicas, as in the current state. If the result for the memory or the CPU showed a desiredReplicas higher than 2, that value would be used instead and the deployment would dynamically scale out.
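You can cross-check these numbers at any time: kubectl describe hpa shows the current metrics, the configured targets, and the recent scaling events:

$ kubectl describe hpa nginx-ingress -n nginx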

Autoscaling in action

With this understanding of how the metrics and the number of replicas are calculated we can now look at an example of autoscaling in action.

So we have our deployment with 2 pods, but at some point a peak of traffic required more processing from the NGINX Ingress pods and the calculated average CPU utilization went up to 104%. Our configured HPA average utilization target is set to 50%, while the calculated memory stayed around 4%. Our deployment immediately autoscaled by using the formula:

desiredReplicas = ceil[2 * (104 / 50)] = ceil[4.16] = 5

New pods were then created to cope with this increase of CPU (the memory metric, still low, proposed fewer replicas, and the highest proposal wins).

Then the CPU calculation went down again to around 36% and, after the scale-down stabilization, the deployment settled at 3 replicas. We can monitor this calculation as follows:

$ kubectl get hpa -n nginx -w
NAME            REFERENCE                  TARGETS            MINPODS   MAXPODS   REPLICAS   AGE
nginx-ingress   Deployment/nginx-ingress   4%/50%, 36%/50%    2   7   3          199d
nginx-ingress   Deployment/nginx-ingress   4%/50%, 43%/50%    2   7   3          199d
nginx-ingress   Deployment/nginx-ingress   4%/50%, 36%/50%    2   7   3          199d
nginx-ingress   Deployment/nginx-ingress   4%/50%, 103%/50%   2   7   3          199d
nginx-ingress   Deployment/nginx-ingress   4%/50%, 43%/50%    2   7   4          199d
nginx-ingress   Deployment/nginx-ingress   4%/50%, 43%/50%    2   7   4          199d
nginx-ingress   Deployment/nginx-ingress   4%/50%, 42%/50%    2   7   4          199d
nginx-ingress   Deployment/nginx-ingress   4%/50%, 37%/50%    2   7   4          199d

Roughly every 15 seconds we see a new line with the current calculation. You can notice 2 things there. First, when the memory and CPU are below the target value of 50%, the number of replicas doesn't immediately go down: here it stays at 3, and it is only scaled down once the calculation has stayed low for the scale-down stabilization window, 5 minutes by default. Second, if at some point a metric goes above 50% again, as with the CPU spike to 103%, the formula triggers a new scale-up, and a fourth pod was created here. Note that the HPA acts on the utilization it measures at its own sync, which may differ from the snapshots printed by the watch: a fourth pod corresponds to a measured average slightly above the target at 3 replicas, for example:

desiredReplicas = ceil[3 * (60 / 50)] = ceil[3.6] = 4

(A sustained 103% would instead give ceil[3 * (103 / 50)] = 7, the configured maximum here.)

We will now keep 4 replicas for at least 5 minutes; if the calculated memory and CPU metrics stay low enough, the deployment is then scaled down, and after further stabilization windows it can go back down to 2 replicas, which is the minimum number of pods in our configuration.
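This 5-minute delay is the default downscale stabilization window. With autoscaling/v2 you can tune it, and the scaling speed, through the behavior field of the HPA spec. A minimal sketch with illustrative values:

spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling down (the default)
      policies:
      - type: Pods
        value: 1            # remove at most 1 pod...
        periodSeconds: 60   # ...per minute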

HPA tuning and autoscaling design

By now the HPA calculation shouldn't hold any secrets for you! Here comes the hardest part: fine-tuning its parameters for an efficient design. There is no straight answer of course, as it depends on the capacity of the nodes in your cluster as well as the number of applications deployed in it. In addition, each application is different and you'll have to observe it first for days and weeks in order to establish a baseline and get a more precise idea of the memory and CPU it needs and uses.

Here are a few thoughts to help you with this design. First, set the memory and CPU requests properly in your resource (the deployment in our example), after having observed the application. For example, we used a request of 100m for the CPU; with our 50% utilization target, the HPA aims at an average usage of 50m per pod. If you set the request to 10m instead, the target average usage becomes 5m. Let's look at the consequences, using our example with 2 current pods and an average usage of 100m per pod:

desiredReplicas = ceil[2 * (100 / 5)] = 40

In this configuration you would need (desire, to be exact!) 40 pods, capped only by maxReplicas! So it is important to set a request value that matches a realistic need for that application. If by observing your application you see it always uses between 100m and 200m of CPU, then set a value in this range as the CPU request. With a request of 100m, an average usage around 50m meets the 50% target and the deployment stays at its minimum of 2 pods.

Similarly, don't just increase the maxReplicas of your HPA without observing your application. If I take the same example as above with a request of 10m and I set maxReplicas to 10, I'll quickly see 10 pods for my application. If I increase that number to 15, I'll quickly see 15 of them. By understanding and doing the calculation, you know the HPA is pushing toward 40 of them for an average usage of 100m per pod. So first set the request value properly, based on the needs of your application, before increasing the maximum number of replicas. Also set the resource limits based on the maximum capacity you have observed your application needs, and add a safety margin, as in the sketch below.
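Putting these two pieces of advice together for our example, the container resources could look like this (values are illustrative, assuming an observed usage of 100m to 200m of CPU):

        resources:
          requests:
            cpu: 150m      # within the observed 100m-200m usage range
            memory: 2Gi
          limits:
            cpu: 400m      # observed maximum plus a safety margin
            memory: 4Gi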

Finally, base your design on the physical capacity of the worker nodes that will host your application. You can use the command kubectl describe node <node_name> to get a view of the resource requests and limits of all the pods currently running on that node. At the bottom you'll also see the total CPU and memory allocated for those pods, which tells you whether your node is under- or over-used. You can then decide to increase the resource requests of your applications, or conclude that you have capacity to add more applications to your cluster.
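Note that requests and limits are reservations, not live consumption. To see what a node actually uses right now, you can complement this with:

$ kubectl top node node01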

Let's finish with an example of the describe command, using kubectl describe node node01:

Non-terminated Pods:          (16 in total)
  Namespace     Name   CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------     ----   ------------  ----------  ---------------  -------------  ---
  ns1           pod1    350m (1%)     1 (3%)      512Mi (0%)       1536Mi (1%)    20d
  ns2           pod2    100m (0%)     4 (12%)     512Mi (0%)       4Gi (3%)       27d
  ns3           pod3    250m (0%)     250m (0%)   512Mi (0%)       1Gi (0%)       12d
  ns3           pod4    250m (0%)     250m (0%)   1Gi (0%)         1Gi (0%)       12d
  ns3           pod5    2 (6%)        2 (6%)      6Gi (5%)         6Gi (5%)       12d
  ns3           pod6    250m (0%)     1 (3%)      1Gi (0%)         2Gi (1%)       12d
  ns4           pod7    8 (25%)       8 (25%)     16Gi (14%)       32Gi (28%)     44d
  ns5           pod8    75m (0%)      100m (0%)   50Mi (0%)        100Mi (0%)     20d
  ns5           pod9    50m (0%)      0 (0%)      32Mi (0%)        0 (0%)         20d
  ns6           pod10   100m (0%)     200m (0%)   128Mi (0%)       256Mi (0%)     17d
  ns7           pod11   100m (0%)     100m (0%)   64Mi (0%)        64Mi (0%)      38d
  ns7           pod12   100m (0%)     200m (0%)   128Mi (0%)       256Mi (0%)     46d
  ns7           pod13   100m (0%)     200m (0%)   128Mi (0%)       256Mi (0%)     32d
  ns7           pod14   100m (0%)     200m (0%)   128Mi (0%)       256Mi (0%)     46d
  ns7           pod15   100m (0%)     200m (0%)   128Mi (0%)       256Mi (0%)     46d
  ns8           pod16   100m (0%)     2 (6%)      200Mi (0%)       2Gi (1%)       32d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                12025m (38%)   19700m (62%)
  memory             27098Mi (23%)  52132Mi (44%)

Note that 1 CPU = 1000m and that the output mixes both ways of defining a CPU request: 2 is then equal to 2000m. If you add up the CPU Requests of all pods, you'll find the total allocated resources of 12025m for the CPU. This node has 32 CPUs, so the 19700m (or 19.7 CPU) of limits represents 62% of 32 CPUs. The same addition is done for the memory, with a mix of Mi and Gi, and the total is expressed in Mi.
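As a quick check, summing the CPU Requests column of the table above gives:

350 + 100 + 250 + 250 + 2000 + 250 + 8000 + 75 + 50 + 100 + 100 + 100 + 100 + 100 + 100 + 100 = 12025m
12025m / 32000m = 0.376, displayed as 38%

The same exercise with the Memory Requests column, after converting Gi to Mi, gives the 27098Mi total.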

Conclusion

I hope this blog helped you better understand this topic by looking at some examples and going through them step by step. This is the method we use in our course Docker and Kubernetes Essential Skills to learn from an example of a containerized application. HPA is already an advanced topic, but if you need to understand the basics of Docker and Kubernetes, check it out!