Currently (particularly when load is heavy), we queue jobs on each Thumbor pod's haproxy, which in turn dispatches them to the Thumbor workers within that pod. This is done for two reasons:
- It is a hangover from the on-metal setup, where a single haproxy served a large (30+) pool of workers, as opposed to the current much smaller per-pod pool (under 10, currently 8).
- Thumbor workers are single-threaded: a worker blocks until it has finished processing a job. kube-probe cannot poll frequently enough to reliably detect a busy worker, and in any case we don't necessarily want a busy worker to be considered unhealthy just because it is busy.
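The per-pod queueing described above can be sketched as an haproxy backend where each worker accepts at most one connection, so excess jobs wait in haproxy's queue rather than at a blocked worker. This is a minimal illustrative sketch, not the production config: the ports, server names, and timeouts are assumptions.

```
frontend thumbor
    bind :8800
    default_backend thumbor_workers

backend thumbor_workers
    balance leastconn
    timeout queue 60s                       # how long a job may wait in the queue
    # maxconn 1 means each single-threaded worker handles one job at a time;
    # everything else queues in haproxy (ports are illustrative)
    server worker1 127.0.0.1:8801 maxconn 1
    server worker2 127.0.0.1:8802 maxconn 1
    server worker3 127.0.0.1:8803 maxconn 1
```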
The current setup is fairly non-k8s-friendly:
- Having n pods means having n different queues.
- Losing a pod with x workers due to an OOM or similar issue means that we have the potential to lose up to x jobs at once.
- Resources are shared between all workers in a single pod, so if we process n medium-sized jobs concurrently we can hit CPU throttling (or worse, an OOM kill). Likewise, a number of smaller jobs can be impacted by a single job processing a large file.
- Having to account for the above means we have to set onerously large CPU and memory limits for each pod, which are frequently under-utilised (compare any server-8080 to the corresponding server-8086 here to see the relative usage).
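To illustrate the sizing problem above: with 8 workers sharing one pod, the pod's limits have to cover the worst case of all workers being busy at once, even though that is rare. The figures below are hypothetical, purely to show the shape of the problem, not our actual values.

```yaml
# Illustrative only: limits sized for 8 concurrent workers, so the pod
# sits well under its reservation most of the time.
resources:
  requests:
    cpu: "2"
    memory: 2Gi
  limits:
    cpu: "8"        # worst case: all 8 workers busy simultaneously
    memory: 8Gi     # one large-image job can spike memory for the whole pod
```

With one worker per pod, requests and limits could instead be sized for a single job, and the scheduler would handle packing.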
Debugging of workers/jobs, resource management, and scaling would all be much easier with the more standard k8s approach of one worker per pod. However, we need to keep the haproxy queueing model, as it works well for us. The HAProxy ingress for Kubernetes supports all of the core HAProxy features we use and would let us manage Thumbor workers on a one-worker-per-pod basis.
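A sketch of what that could look like, assuming the haproxytech ingress controller: queueing moves from the per-pod haproxy to the ingress, with each single-worker pod limited to one in-flight job. The annotation names below (`haproxy.org/pod-maxconn`, `haproxy.org/load-balance`) are from that controller's documentation and should be verified against the version deployed; the service name and port are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thumbor
  annotations:
    haproxy.org/pod-maxconn: "1"        # one job at a time per single-worker pod
    haproxy.org/load-balance: leastconn # mirror the current haproxy balancing
spec:
  selector:
    app: thumbor
  ports:
    - port: 8800
      targetPort: 8800
```

Losing a pod would then cost at most one in-flight job, and the queue would be centralised in the ingress rather than fragmented across n per-pod queues.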