
Consider moving to haproxy ingress for Thumbor workers
Open, Needs Triage, Public

Description

Currently (particularly when load is getting heavy), we queue jobs on each Thumbor pod's haproxy, which in turn dispatches them to the Thumbor workers within that pod. This is done for two reasons:

  • A hangover from the on-metal setup, in which a single haproxy served a large pool of workers (30+), as opposed to the current much smaller number per pod (under 10, currently 8)
  • Thumbor workers are single-threaded: a worker blocks until it has finished processing a job. If a worker is busy, kube-probe won't poll fast enough to detect it, and we also don't necessarily want a busy worker to be considered unhealthy just because it is processing a job.

The current setup is fairly non-k8s-friendly:

  • Having n pods means having n different queues.
  • Losing a pod with x workers due to an OOM or similar issue means that we have the potential to lose up to x jobs at once.
  • Resources are shared between the workers in a single pod, so processing n medium-sized jobs at once can push the pod into CPU throttling (or, worse still, a memory-related kill). Likewise, a number of smaller jobs can be impacted by a single job on a significantly-sized file.
  • Having to account for the above means setting onerously large CPU and memory limits for each pod, which are frequently under-utilised (compare any server-8080 to the corresponding server-8086 here to see the relative usage).

Debugging workers/jobs, managing resources, and scaling would all be much easier with the more standard k8s approach of one worker per pod. However, we need to keep the haproxy queuing model, as it works well for us. The HAProxy ingress for Kubernetes supports all of the core features that we use from HAProxy and would let us manage Thumbor workers on a one-worker-per-pod basis (roughly the shape sketched below).
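For illustration only, a minimal sketch of what the one-worker-per-pod shape could look like, with per-pod queuing and concurrency limits then handled at the ingress layer rather than by an haproxy inside each pod. The names, image, port, replica count and resource numbers are assumptions, not values from the actual deployment:

```yaml
# Hypothetical one-worker-per-pod layout (illustrative values only).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thumbor
spec:
  replicas: 8                      # one single-threaded worker per pod,
  selector:                        # scaled by replica count rather than
    matchLabels:                   # by the number of workers in one pod
      app: thumbor
  template:
    metadata:
      labels:
        app: thumbor
    spec:
      containers:
        - name: thumbor
          image: thumbor:latest    # placeholder image reference
          ports:
            - containerPort: 8800  # assumed worker port
          resources:               # sized for a single worker, so limits can be
            requests:              # much smaller than the current per-pod ones
              cpu: "1"
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 512Mi
```

Queuing a single request per busy worker would then be the ingress layer's job (HAProxy's per-server maxconn-style limits) instead of something each pod does for itself.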

Event Timeline

JMeybohm subscribed.

I would actually love it if we could try to reproduce what we do with haproxy using istio ingressgateway before introducing another ingress controller (but tbh I did not check at all whether that is feasible)

Thanks for the writeup. I agree on almost everything, but I think some clarification would help me figure a few things out.

If a worker is busy, kube-probe won't poll fast enough to detect it, and we also don't necessarily want a busy worker to be considered unhealthy just because it is processing a job.

Does this still hold true with a single container per pod? If we adjust the readiness probe[1] a bit:

  • switch periodSeconds from 10 to 1, so we'll be doing an HTTP check every second
  • switch failureThreshold from 3 to 2 or 1

then we'll be pretty quick to depool a busy pod, directing traffic elsewhere. To be clear, we don't touch the liveness probe (which is currently a TCP connection check).
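As a rough sketch (the port, path and exact timings are assumptions, not taken from the current Thumbor chart), the probe tweak would look something like:

```yaml
# Hypothetical probe settings for a single-worker Thumbor container.
containers:
  - name: thumbor
    ports:
      - containerPort: 8800       # assumed worker port
    readinessProbe:
      httpGet:
        path: /healthcheck        # assumed healthcheck path
        port: 8800
      periodSeconds: 1            # was 10: probe every second
      failureThreshold: 1         # was 3: depool after a single failed probe
      timeoutSeconds: 1
    livenessProbe:                # left as-is: a plain TCP connection check
      tcpSocket:
        port: 8800
      periodSeconds: 10
```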

There might be some in-flight traffic that reaches the pod, of course. It depends a bit on how Thumbor handles these, but the TCP connection will be queued for a bit in the OS TCP queue. Whether that will be enough depends on how quickly the payload that is already being processed finishes. Judging from https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&viewPanel=83&from=now-12h&to=now, gifs should be OK; everything else will be timing out (I am assuming here the conservative 250ms timeout we have everywhere).

[1] https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#Probe

I would actually love it if we could try to reproduce what we do with haproxy using istio ingressgateway before introducing another ingress controller (but tbh I did not check at all whether that is feasible)

That would be awesome, but https://github.com/envoyproxy/envoy/issues/21121 paints a different picture :-(. We can try it out anyway of course, but it should be time-capped as it doesn't look very promising.