Cloud Optimization

KEDA-based video frame processing optimization

A Kubernetes scaling redesign on AWS EKS that reduced video frame processing time from more than seven hours to about two hours.

2026-05-03 · 3 min

KEDA, Kubernetes, AWS EKS, scaling

The project processed film frames in parallel on AWS EKS. Before the redesign, a single processing run could take more than seven hours.

Challenge

The workload was highly parallel but uneven. A processing run could contain a large number of frames, and the runtime depended on how quickly workers could be provisioned, how well the queue drained, and where the infrastructure limits appeared. Simply increasing the default pod count would have increased cost and still left the team guessing about the right operating point.

The goal was to process all frames in the shortest practical time while keeping the architecture understandable enough for future runs. That meant using queue depth and throughput as the scaling signal instead of treating frame processing like a fixed web workload.

Change

KEDA was introduced to scale workers according to the processing queue and required throughput. The core idea was to let the queue describe current demand, then scale Kubernetes deployments up and down based on that demand instead of maintaining a permanently high number of pods.
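To make the queue-as-demand idea concrete, here is a minimal sketch of the replica arithmetic behind queue-based scaling. It assumes an average-value target, where the autoscaler aims for roughly a fixed number of queued frames per pod. The function name, the target of 50 frames per pod, and the replica bounds are hypothetical and are not taken from the project's configuration. KEDA itself is configured declaratively, so this Python only illustrates the calculation.

```python
import math


def desired_replicas(queue_length: int,
                     frames_per_pod_target: int,
                     min_replicas: int = 0,
                     max_replicas: int = 100) -> int:
    """Return the replica count implied by the current queue depth.

    Assumption: scaling follows ceil(queue_length / target), clamped to
    the configured minimum and maximum replica counts.
    """
    if frames_per_pod_target <= 0:
        raise ValueError("frames_per_pod_target must be positive")
    raw = math.ceil(queue_length / frames_per_pod_target)
    # Clamp so an empty queue scales down and a huge queue stays bounded.
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical numbers: 1,000 queued frames at 50 frames per pod.
print(desired_replicas(1000, 50))
```

With an empty queue the result falls back to the minimum replica count, which is what removes the need for a permanently high number of pods.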

The implementation involved tuning the relationship between queued work, worker concurrency, pod startup time, and cloud-provider limits. The team tested how many frames a worker could process, how fast pods became useful after scheduling, and where cluster or account quotas started to become the bottleneck.
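The relationship between these variables can be approximated with a back-of-the-envelope model. The sketch below assumes every pod starts at the same time, every frame takes the same time to process, and frames are processed in full "waves" across all worker slots. Real runs are uneven, so treat it as a rough estimate only. All parameter names and numbers are hypothetical.

```python
import math


def estimate_run_seconds(total_frames: int,
                         pods: int,
                         concurrency_per_pod: int,
                         seconds_per_frame: float,
                         pod_startup_seconds: float) -> float:
    """Estimate wall-clock time for a run under simplifying assumptions."""
    if pods <= 0 or concurrency_per_pod <= 0:
        raise ValueError("pods and concurrency_per_pod must be positive")
    slots = pods * concurrency_per_pod        # frames processed at once
    waves = math.ceil(total_frames / slots)   # full passes over the slots
    # Startup is paid once up front; after that, each wave costs one frame time.
    return pod_startup_seconds + waves * seconds_per_frame


# Hypothetical run: 1,000 frames, 10 pods, 2 frames at a time per pod,
# 30 s per frame, 120 s before a pod does useful work.
print(estimate_run_seconds(1000, 10, 2, 30, 120))  # 1620.0 seconds
```

Even this crude model shows why pod startup time matters: as pod count rises and the number of waves shrinks, the fixed startup cost becomes a larger share of the total.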

Architecture considerations

The work ran on AWS EKS, so the optimization was not only a Kubernetes configuration change. Node capacity, scheduling behavior, container image pull time, and external service limits all affected end-to-end throughput. KEDA handled event-driven scaling, but the surrounding platform still needed enough headroom to make that scaling effective.

Observability was important during tuning. Queue length showed whether the system was falling behind. Worker completion rate showed whether additional pods were still useful. Cluster capacity and provider quotas showed whether the next bottleneck had moved outside the application.
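One way to read those three signals together is a small decision rule over two consecutive metric samples. This is a hypothetical sketch: the `Sample` fields, the labels, and the 5% gain threshold are assumptions made for illustration, not the project's actual dashboards or alerts.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    queue_length: int            # frames still waiting
    completed_per_minute: float  # worker completion rate
    running_pods: int            # pods doing work
    pending_pods: int            # pods that could not be scheduled


def diagnose(prev: Sample, cur: Sample, min_rate_gain: float = 0.05) -> str:
    """Classify where the bottleneck most likely sits."""
    # Unschedulable pods point at cluster capacity or provider quotas.
    if cur.pending_pods > 0:
        return "platform-limit"
    # Pods were added but throughput barely moved: extra pods are not useful.
    added = cur.running_pods - prev.running_pods
    if added > 0 and prev.completed_per_minute > 0:
        gain = (cur.completed_per_minute - prev.completed_per_minute) \
            / prev.completed_per_minute
        if gain < min_rate_gain:
            return "pods-not-helping"
    # The queue is growing: the system is falling behind.
    if cur.queue_length > prev.queue_length:
        return "falling-behind"
    return "keeping-up"


before = Sample(queue_length=1000, completed_per_minute=100,
                running_pods=10, pending_pods=0)
after = Sample(queue_length=900, completed_per_minute=101,
               running_pods=20, pending_pods=0)
print(diagnose(before, after))  # pods-not-helping
```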

Tradeoffs

The final configuration favored predictable completion time over maximum theoretical parallelism. Scaling too aggressively can move the problem into quota failures, noisy retries, or expensive idle capacity after the queue drains. Scaling too conservatively keeps the cluster stable but extends processing time.

The useful engineering question was not "how many pods can we start?" but "how many useful pods can process frames before another part of the system becomes the constraint?" That framing made the scaling conversation clearer for both engineering and product stakeholders.
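The "useful pods" question can be expressed as taking the smallest of several independent ceilings. The sketch below is illustrative only: the constraint names and numbers are made up, and a real analysis would use measured limits for the specific account and cluster.

```python
from typing import Dict, Tuple


def useful_pod_ceiling(ceilings: Dict[str, int]) -> Tuple[str, int]:
    """Return the binding constraint and the pod count it allows.

    Each entry maps a constraint name to the maximum number of pods that
    constraint can support on its own.
    """
    if not ceilings:
        raise ValueError("at least one ceiling is required")
    name = min(ceilings, key=lambda k: ceilings[k])
    return name, ceilings[name]


# Hypothetical limits, each converted to "pods supported".
ceilings = {
    "vcpu_quota": 400 // 2,            # 400 vCPUs available, 2 per pod
    "ip_addresses": 250,               # free pod IPs in the subnets
    "storage_throughput": 1200 // 8,   # 1,200 MB/s shared, 8 MB/s per pod
}
print(useful_pod_ceiling(ceilings))  # ('storage_throughput', 150)
```

In this made-up case, raising the vCPU quota would change nothing, because storage throughput binds first. That is the kind of answer the reframed question is meant to surface.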

Result

Processing time dropped from more than seven hours to about two hours. During optimization the system reached cloud-provider limits, which made capacity planning and scaling constraints part of the architecture conversation.

The project also produced a repeatable way to reason about future batch workloads: define the queue signal, measure worker throughput, identify infrastructure ceilings, and tune autoscaling against the real completion target rather than a generic CPU threshold.
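The four steps can be combined into one planning calculation that works backwards from the completion target. This is a minimal sketch under stated assumptions: uniform frame time, a single up-front pod startup delay, and a pod ceiling that has already been measured. The function name and all numbers are hypothetical.

```python
import math
from typing import Tuple


def plan_pods(total_frames: int,
              seconds_per_frame: float,
              concurrency_per_pod: int,
              target_seconds: float,
              pod_startup_seconds: float,
              pod_ceiling: int) -> Tuple[int, bool]:
    """Return (pods to run, whether the target is reachable under the ceiling)."""
    usable = target_seconds - pod_startup_seconds
    if usable <= 0 or concurrency_per_pod <= 0:
        raise ValueError("target must exceed startup time; concurrency must be positive")
    total_work = total_frames * seconds_per_frame       # frame-seconds of work
    needed = math.ceil(total_work / (usable * concurrency_per_pod))
    # Never plan beyond the infrastructure ceiling; report if that breaks the target.
    return min(needed, pod_ceiling), needed <= pod_ceiling


# Hypothetical run: 100,000 frames at 30 s each, 4 frames at a time per pod,
# a two-hour target, 5 minutes of startup, and a ceiling of 150 pods.
print(plan_pods(100_000, 30, 4, 7200, 300, 150))  # (109, True)
```

If the second value comes back `False`, the completion target cannot be met by autoscaling alone, and the conversation has to move to quotas, worker throughput, or the target itself.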