Startup Journey 0002 | Why Isn't Auto-Scaling Working? The System Just Crashes

(This post was automatically translated by AI.)

In early 2026, I left my senior software engineer position at a large tech company to join a startup as a tech lead. Still figuring out both the technical and managerial sides, I’m hoping this series of journal entries will help me look back years later and see how I’ve grown.

In the previous entry, we unified the deployment process across all environments, giving us more confidence in changes and tests going forward. Now it was finally time to look at the system’s behavior under traffic spikes.

The system runs on an AKS (Azure Kubernetes Service) cluster. In the past, when it received sudden bursts of traffic, auto-scaling did not kick in quickly enough, and customers complained about long wait times.

My first step was to understand the baseline: what does “peak load” actually look like in terms of concurrent users? What is the auto-scaling mechanism expected to achieve? And how long is the expected end-to-end processing time per request?

Based on current requirements, a traffic spike means roughly 100–200 simultaneous user requests. Consumer pods are responsible for processing these requests, and a mechanism called the “hot standby job” is responsible for auto-scaling before requests pile up. Ideally, since one end-to-end request is expected to take about six minutes, we want all users’ wait times to fall within that window.

200 concurrent requests isn’t a lot. With an auto-scaling mechanism in place and a maximum replica count of 200, you’d expect it to handle this gracefully. So why was there still delay?

I used k6 to simulate concurrent users. I never even got to 200: a run with just 100 users took 50 minutes in total. Worse, the per-request processing time increased as the number of users grew.

There was clearly a bottleneck in the system.

But before addressing the bottleneck, I noticed something more serious: the auto-scaling mechanism wasn’t actually scaling consumers as expected. Even under elevated load with K8s resources still available, the number of consumers only reached around 10. With so few workers, slow throughput was inevitable.
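A rough back-of-the-envelope check shows why so few workers matter. The sketch below is an idealized model, not a measurement: it assumes each consumer handles one request at a time, every request takes exactly the expected six minutes, and all consumers are available from the start. The function name is made up for illustration.

```python
import math


def drain_time_minutes(requests: int, consumers: int,
                       minutes_per_request: float = 6.0) -> float:
    """Idealized time to finish a burst of requests.

    Assumes one request per consumer at a time and a fixed
    processing time, so the burst is handled in "waves".
    """
    if requests < 0 or consumers <= 0:
        raise ValueError("requests must be >= 0 and consumers must be > 0")
    waves = math.ceil(requests / consumers)
    return waves * minutes_per_request


# 100 requests with 10, 50, and 100 consumers
for consumers in (10, 50, 100):
    print(consumers, drain_time_minutes(100, consumers))
```

Under these assumptions, 100 requests on 10 consumers need 10 waves, or about 60 minutes, which is in the same range as the 50 minutes I measured. With 100 consumers the same burst would fit in a single six-minute wave. The real system does not behave this cleanly, since per-request time also grew with load, but the model makes the point that the consumer count alone could explain most of the delay.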

So I had to look at exactly how the “hot standby job” was doing its scaling. The mechanism works like this: a monitor-like worker periodically checks the status of each consumer. When a consumer enters “working” state, the monitor calculates how many additional workers are needed and calls the K8s API to increase the consumer count.
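To make the idea concrete, here is a minimal sketch of the kind of decision the monitor makes. It is not the actual implementation: the formula, the state names, and the standby buffer of 3 are assumptions for illustration. Only the maximum of 200 replicas comes from our setup.

```python
from typing import Iterable

MAX_REPLICAS = 200   # maximum replica count mentioned in this post
STANDBY_BUFFER = 3   # hypothetical number of idle "hot standby" consumers


def desired_replicas(states: Iterable[str],
                     standby: int = STANDBY_BUFFER,
                     max_replicas: int = MAX_REPLICAS) -> int:
    """Return a target consumer count from the observed consumer states.

    Keeps `standby` idle consumers on top of the busy ones, capped at
    `max_replicas`. The state name "working" is an assumption.
    """
    working = sum(1 for state in states if state == "working")
    return min(max_replicas, working + standby)


print(desired_replicas(["working", "idle", "working"]))
```

In a setup like this, the result would then be passed to the K8s API as the new replica count. Note that the decision is only as good as the states it reads, which is where the problems below come from.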

This sounds reasonable on the surface, but there were a few failure points. First, consumers save their state to Redis during state transitions, and the monitor worker reads Redis to decide whether to scale. The two operate at different polling frequencies, which creates a timing gap. Second, the original implementation didn’t account for every state transition scenario, so a consumer’s state could linger in Redis even after its request was done.
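The second failure point can be illustrated with a toy example. This is a hypothetical sketch, not our code: a plain dict stands in for Redis, the state names are invented, and the missed transition is placed on an error path purely as an example of how one could be overlooked.

```python
from typing import Dict


def count_working(store: Dict[str, str]) -> int:
    """What the monitor would see: consumers marked as working."""
    return sum(1 for state in store.values() if state == "working")


def handle_request(store: Dict[str, str], consumer_id: str, fails: bool) -> None:
    """Process one request and record state transitions in the store."""
    store[consumer_id] = "working"
    if fails:
        # Hypothetical bug: this path returns without resetting the state,
        # so "working" lingers even though the consumer is done.
        return
    store[consumer_id] = "idle"


store: Dict[str, str] = {}
handle_request(store, "consumer-1", fails=False)
handle_request(store, "consumer-2", fails=True)

# No consumer is busy any more, yet the monitor still counts one as working.
print(count_working(store), store)
```

Once the recorded state drifts away from reality like this, any scaling decision based on it becomes unreliable, no matter how the formula is tuned.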

The result was a fairly complex system. Which raised the question: did we actually need that complexity?

Common Kubernetes auto-scaling approaches include HPA and KEDA. I asked the engineer why neither of these was used and why the hot standby job mechanism was built from scratch.

The answer: customers had complained about slow processing, so the goal was to keep a few “pre-warmed” consumers on standby so they could accept new tasks immediately, skipping image pull and initialization time.

I later measured that initialization time: around 200 milliseconds, which is negligible next to a six-minute request. This is exactly why data should drive decision-making.

Since the existing design couldn’t reliably meet our needs, we needed a better-maintained and more stable solution for auto-scaling. After evaluating options, KEDA was the right fit: its event-driven scaling model is better suited than HPA on its own to our use case, which is scaling on RabbitMQ queue depth. We decided to migrate to KEDA.
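For intuition, queue-depth scaling boils down to a simple calculation: divide the number of waiting messages by a target number of messages per consumer, round up, and clamp the result between a minimum and maximum replica count. The sketch below is a simplified model of that rule, not KEDA’s API. The function and parameter names are made up, and it ignores polling intervals, cooldown periods, and the smoothing applied by the underlying HPA.

```python
import math


def queue_based_replicas(queue_length: int, target_per_replica: int,
                         min_replicas: int, max_replicas: int) -> int:
    """Simplified queue-depth scaling rule.

    Aim for about `target_per_replica` waiting messages per consumer,
    clamped to the [min_replicas, max_replicas] range.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# A burst of 150 queued requests, one request per consumer, cap of 200.
print(queue_based_replicas(150, target_per_replica=1,
                           min_replicas=1, max_replicas=200))
```

The appeal of this approach is that the input is the queue itself, which is the source of truth for pending work, instead of consumer states that must be written and read correctly by two separate components.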

The result: after switching to KEDA, the system became significantly more stable and easier to track. Peak traffic processing time dropped from nearly an hour to under 20 minutes, a reduction of more than half. While we haven’t hit the six-minute target window yet, the auto-scaling consumer count now reaches the expected level, and the system itself is simpler and more stable. A solid first step.
