This is the second part of High scale web server with Kubernetes. We will go over the Kubernetes Horizontal Pod Autoscaler (HPA) and how we use it at Dynamic Yield.
While serving a huge number of requests, we can easily observe that our traffic graph looks like a sine wave, with a high rate at midday and a lower rate at night. The difference is relatively big: around 2–3 times more requests in the rush hours. Moreover, on special occasions such as Black Friday, Cyber Monday, and sale campaigns, our traffic can rise to as much as 3x.
Using Kubernetes’s elasticity capabilities helps us in several ways:
- Latency: ensure the best user experience. All you have to know is the optimal load that a single replica can handle.
- Stability/Reliability: we want to ensure that we won’t lose even a single request.
- Velocity: save time by avoiding manual interventions and letting our developers concentrate on building great software.
- Cost optimization: pay only for the resources you need at the moment, instead of being over-provisioned all the time.
Horizontal Pod Autoscaler
Kubernetes HPA supports several options for scaling out and in. We are using both container resource metrics (CPU and memory) and custom metrics (applicative metrics collected by Prometheus).
While resource metrics are straightforward (for example, a targetAverageUtilization of 80%), custom metrics are more interesting.
HPA based on resources alone wasn’t sufficient for us. Our web servers mostly make asynchronous network calls to other internal services and databases rather than doing CPU-intensive work, so CPU utilization is a poor proxy for their load.
In terms of memory, we are not doing anything fancy either. We have some basic LRU/LFU cache layers to save some expensive calls to databases, but those are bounded so that we won’t exceed the container’s requests/limits. One thing that does affect our memory consumption is a burst of requests waiting to be handled. In this scenario memory usage can increase drastically, so we keep enough headroom to absorb a sudden spike in traffic.
Having said that, keep in mind that exceeding the memory limit gets your container killed, whereas exceeding the CPU limit only gets it throttled. My advice is to leave enough memory headroom so you won’t see your pods collapse one after another without enough time to recover.
We decided to bet on a custom metric for our HPA: the average requests per pod. We tested how many requests/second each pod can handle while our response time still meets our SLA and memory/CPU usage stays stable.
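For intuition, the HPA's documented scaling rule is desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). With a per-pod average target, this reduces to the total requests/second divided by the per-pod target, rounded up and kept within minReplicas/maxReplicas. The sketch below is only an approximation: the function name and the sample numbers are illustrative assumptions, and the real controller also applies a tolerance and stabilization behavior that are left out here.

```python
import math


def desired_replicas(total_rps: float, target_rps_per_pod: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate the HPA decision for an average-per-pod target.

    Hypothetical helper for illustration; not part of any Kubernetes API.
    """
    if target_rps_per_pod <= 0:
        raise ValueError("target_rps_per_pod must be positive")
    # Smallest pod count that keeps the per-pod average at or below the target.
    wanted = math.ceil(total_rps / target_rps_per_pod)
    # The HPA never leaves the configured replica bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Assumed numbers: 2500 requests/second in total, 20 requests/second per pod.
    print(desired_replicas(2500, 20, min_replicas=50, max_replicas=300))
```

With these assumed inputs the helper asks for 125 replicas; very low traffic falls back to the minimum and very high traffic is capped at the maximum.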
We are using prometheus-client (in our Python Tornado web-server) to collect applicative metrics:
    import tornado.httpserver
    import tornado.ioloop
    from tornado.web import Application
    from prometheus_client import Counter


    class MyApplication(Application):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.ready_state = True
            # Counts every handled request, labeled for visibility.
            self.requests_total = Counter(
                namespace="ns",
                subsystem="app",
                name="requests_total",
                documentation="Total number of handled requests.",
                labelnames=("handler", "method", "status"),
            )

        def log_request(self, handler):
            # Tornado calls log_request once for every finished request.
            super().log_request(handler)
            handler_name = type(handler).__name__
            method = handler.request.method
            status = handler.get_status()
            self.requests_total.labels(handler_name, method, status).inc()


    def main():
        application = MyApplication([...])  # handlers elided
        server = tornado.httpserver.HTTPServer(application)
        server.listen(80)
        tornado.ioloop.IOLoop.current().start()
In the snippet above (inspired by tornado-prometheus) you can see how we count all incoming requests and label them with some useful information for visibility.
In addition to the custom metric, we set CPU/memory as backup HPA metrics, but they never kicked in. We are using kube-metrics-adapter to collect the custom metrics for the HPA. Now we can use the external metric:
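A counter like this only ever grows, so a requests/second figure has to be derived from successive samples of it. The sketch below shows the basic idea under simplifying assumptions: the function name and sample format are hypothetical, and Prometheus's own rate() function additionally extrapolates to the edges of the query window, which is not reproduced here.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Hypothetical helper; a simplification of what PromQL rate() does.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # A drop means the process restarted and the counter began
            # again from zero, so the whole current value is new increase.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    # Two scrapes 60 seconds apart, 1200 requests handled in between.
    print(per_second_rate([(0, 100), (60, 1300)]))
```

Averaging this per-pod rate across all pods gives the kind of "average requests per pod" value an autoscaler can compare against its target.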
- type: Resource
- type: Resource
- type: External
It takes our HPA up to 2 minutes to respond, depending on:
- Prometheus scrape_interval (default 1min).
- Image size and whether the image already exists on the relevant node (together with imagePullPolicy: IfNotPresent).
- If there is not enough capacity on the existing Kubernetes nodes, we have to wait for the cluster to scale out and add another node. Using AWS spot instances, it might take up to 2 minutes for another instance to join.
- Headroom (by spot.io), which is a buffer of spare capacity (in terms of both memory and CPU) used to ensure that whenever we want to scale more pods, we don’t have to wait for new instances.
Configuring minReplicas/maxReplicas
There are several things to consider before deciding about minimum replicas:
- What is the nature of your traffic? Do you have sudden bursts (like publishing a new campaign)? For example, say each of your pods can handle 20 requests/second, every pod is already running at that rate, and you get a sudden burst of 100 additional requests/second. There’s a huge difference between having 10 replicas running and having 2. In the first case, each replica has to deal with 10 additional requests/second, which is 150% load. With only 2 replicas running, each one has to deal with 50 additional requests/second, which is 350% load.
- If you know what your lowest traffic rate is, your minReplicas should at least support that rate. For example, if 500 requests/second is your lowest rate (while your users are napping) and each of your replicas can handle 20 requests/second, your minReplicas should be at least 25. That way, if something bad happens (your application crashes, a Prometheus scrape fails, kube-metrics-adapter goes down) and the HPA desired count is zero, you’ll still have enough replicas serving to keep your service available.
- In case you don’t have much traffic but still need to ensure high availability, set minReplicas to at least 2–3: nodes might be drained behind the scenes, and you want to ensure that at least one replica is always available. Read about Pod Disruption Budget for more information.
maxReplicas is easier. Setting limits is always good advice, as someone needs to pay the bill at the end of the day. You don’t want to wake up at the end of the month and find out that you ran 10x more replicas than you thought.
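The arithmetic behind these considerations can be sketched in a few lines of Python. The helper names are hypothetical, load is measured as total requests/second per replica divided by its capacity, and the burst example assumes every replica was already running at its 20 requests/second capacity before the burst arrived. Under that definition, the two-replica case comes out at 3.5 times capacity.

```python
import math


def load_after_burst(replicas: int, current_rps_per_replica: float,
                     burst_rps: float, capacity_rps: float) -> float:
    """Per-replica load as a fraction of capacity once a burst is spread evenly."""
    per_replica = current_rps_per_replica + burst_rps / replicas
    return per_replica / capacity_rps


def min_replicas_for(lowest_rps: float, capacity_rps: float,
                     availability_floor: int = 2) -> int:
    """Replicas needed for the lowest expected traffic, never below a small floor."""
    return max(availability_floor, math.ceil(lowest_rps / capacity_rps))


if __name__ == "__main__":
    print(load_after_burst(10, 20, 100, 20))  # 10 replicas absorb the burst
    print(load_after_burst(2, 20, 100, 20))   # 2 replicas absorb the burst
    print(min_replicas_for(500, 20))          # lowest rate of 500 requests/second
```

The availability floor of 2 is an assumption taken from the high-availability advice above; raise it if your nodes are drained often.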
In the graph above we can see requests/second (left y-axis) combined with the HPA replica count (right y-axis) over 24 hours. The dashed yellow line shows the number of replicas (pods), and the colorful lines below show the traffic per replica. You can see the correlation between the incoming traffic and the number of replicas running.
HPA metric together with some other useful metrics can be observed without any applicative metric:
The graph above (based on those metrics) shows the number of replicas over the last 7 days. You can see that minReplicas is 50 and maxReplicas is 300, which allows us to handle 3x the daily peak of about 100 replicas.
You can see the sine pattern, where the traffic in the rush hours (noon through the afternoon) is double the night traffic (03:00–06:00). You can also see some spikes from time to time, which mean there was a sudden increase in the traffic rate. As you can guess, this elasticity saved us a lot of money by adding and removing resources dynamically as we need them, while ensuring that:
- Our customers’ serving experience stays the same.
- Our team can concentrate on developing the next feature.
We saw how using custom metrics can help with auto-scaling our service in and out. After running with this setup for almost a year, we can definitely say that it saves us a lot of time and money, and has brought us happiness and peace 🙂
This post was originally published on Medium.