We currently use Nginx to front Thumbor instances. However, this comes with a big limitation: a Thumbor instance that is busy rendering an expensive thumbnail can be handed its next request "too early", and that request then waits needlessly while other Thumbor instances free up.
Ideally, because Thumbor is single-threaded, an instance should only get a new request when it is not currently processing one. This would maximize core usage and ensure that each request is sent to a free instance as soon as one frees up. Doing so requires combining request queueing with load balancing, which Nginx cannot do. While Nginx is scriptable with Lua, the Lua code can't communicate across workers without an external service such as memcached or Redis, which is quite inefficient.
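To make the difference concrete, here is a small illustrative simulation (not taken from our setup). It assumes all requests arrive at the same time, that the current behaviour can be approximated by plain round-robin assignment, and that render times are known up front. The function names and the durations are made up for the example.

```python
import heapq


def round_robin_waits(durations, n_instances):
    """Assign requests to instances in turn, whether or not they are busy."""
    free_at = [0] * n_instances  # time at which each instance clears its backlog
    waits = []
    for i, duration in enumerate(durations):
        inst = i % n_instances
        waits.append(free_at[inst])  # all requests arrive at t=0, so wait == start time
        free_at[inst] += duration
    return waits


def shared_queue_waits(durations, n_instances):
    """Keep requests in one FIFO queue; give each to the first instance to free up."""
    free_at = [0] * n_instances
    heapq.heapify(free_at)
    waits = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest time any instance is free
        waits.append(start)
        heapq.heappush(free_at, start + duration)
    return waits


# One expensive thumbnail (10 time units) followed by three cheap ones.
durations = [10, 1, 1, 1]
rr = round_robin_waits(durations, 2)
sq = shared_queue_waits(durations, 2)
print(rr, sum(rr))  # [0, 0, 10, 1] 11
print(sq, sum(sq))  # [0, 0, 1, 2] 3
```

In this toy case the third request waits 10 time units behind the expensive render under round-robin, but only 1 time unit when it is held back until an instance is actually free.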
Instead, we should set up a proxy that meets Thumbor's needs exactly, to replace Nginx. The feature set should be the following:
- Retries. When Thumbor instances die (OOM, bug, upgrade), it's necessary to retry the request on another Thumbor instance.
- Queueing. Requests should only be sent to Thumbor instances when they're free.
- Max queue size. Send back 503s when it's reached.
- Reading and adding headers.
- Monitoring (preferably with Prometheus). Request latency, duration, etc.
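The queueing, max queue size and retry requirements above could be sketched roughly as follows. This is only a model of the desired behaviour, not a proposed implementation: `Dispatcher`, `QueueFull`, `max_queue` and `max_retries` are hypothetical names, backends are modelled as async callables, and a died instance is assumed to surface as a `ConnectionError`. Header handling and monitoring are left out.

```python
import asyncio


class QueueFull(Exception):
    """Raised when the queue limit is hit; a real proxy would answer with a 503."""


class Dispatcher:
    def __init__(self, backends, max_queue, max_retries=1):
        # Pool of instances that are currently idle.
        self._free = asyncio.Queue()
        for backend in backends:
            self._free.put_nowait(backend)
        self._waiting = 0  # requests currently waiting for a free instance
        self._max_queue = max_queue
        self._max_retries = max_retries

    async def handle(self, request):
        # Max queue size: refuse new work once enough requests are already waiting.
        if self._free.empty() and self._waiting >= self._max_queue:
            raise QueueFull("queue limit reached")
        last_error = None
        for _attempt in range(1 + self._max_retries):
            # Queueing: block until some instance is idle.
            self._waiting += 1
            try:
                backend = await self._free.get()
            finally:
                self._waiting -= 1
            try:
                return await backend(request)
            except ConnectionError as exc:
                # Retries: the instance died, so try again on the next free one.
                last_error = exc
            finally:
                # Simplification: the instance goes back to the end of the pool
                # even after a failure; a real proxy would health-check it first.
                self._free.put_nowait(backend)
        raise last_error


async def demo():
    async def healthy(request):
        await asyncio.sleep(0.01)  # stands in for rendering time
        return f"ok:{request}"

    async def dead(request):
        raise ConnectionError("instance died")

    dispatcher = Dispatcher([dead, healthy], max_queue=1)
    return await dispatcher.handle("a.jpg")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because each instance is taken out of the pool while it works, it never holds more than one request at a time, which is the property Nginx cannot give us today.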
Testing scenario: Before making any Puppet changes, we disable Puppet on the thumbor1001 host and temporarily have its HAProxy listen on port 8800. If that is successful, we move on to the Puppet changes.
This task is an alternative to T187203: Modify upstream Thumbor to allow true async engines