Replace the Nginx fronting Thumbor with Haproxy
Closed, Resolved (Public)

Description

We currently use Nginx to front Thumbor instances. However, this comes with a big limitation: a Thumbor instance that is busy rendering an expensive thumbnail can be handed its next request "too early", and that request then waits needlessly while other Thumbor instances free up.

Ideally, due to the single-threaded nature of Thumbor, instances should only get new requests if they're not currently processing one. This would maximize core usage and ensure that requests are sent to a free instance as soon as it frees up. This requires combining request queueing with load balancing, which Nginx cannot do. While Nginx is scriptable with Lua, the Lua code can't communicate across workers without using a service like memcache or redis, which is quite inefficient.

Instead, we should set up a proxy that meets Thumbor's needs exactly, to replace Nginx. The feature set should be the following:

  • Retries. When Thumbor instances die (OOM, bug, upgrade), it's necessary to retry the request on another Thumbor instance.
  • Queueing. Requests should only be sent to Thumbor instances when they're free.
  • Max queue size. Send back 503s when it's reached.
  • Timeouts.
  • Reading and adding headers.
  • Monitoring (preferably with Prometheus). Request latency, duration, etc.
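
To illustrate why the queueing requirement matters for single-threaded backends, here is a minimal Python simulation. It is a sketch under stated assumptions, not a model of the production setup: the job list is invented, `round_robin_wait` stands in for a balancer that assigns requests without knowing which instance is busy (Nginx's actual balancing may differ), and `shared_queue_wait` stands in for a proxy that holds requests in one queue and releases each to the first instance that frees up.

```python
import heapq

# Hypothetical workload: (arrival_time, service_time), sorted by arrival.
# The first job is an expensive thumbnail; the rest are cheap.
JOBS = [(0, 10), (0, 1), (1, 1), (2, 1)]


def round_robin_wait(jobs, workers):
    """Total queueing time when job i is always sent to worker i % workers."""
    free_at = [0.0] * workers  # time at which each worker becomes idle
    total_wait = 0.0
    for i, (arrival, service) in enumerate(jobs):
        w = i % workers
        start = max(arrival, free_at[w])  # wait if that worker is still busy
        total_wait += start - arrival
        free_at[w] = start + service
    return total_wait


def shared_queue_wait(jobs, workers):
    """Total queueing time when each job goes to the earliest-free worker."""
    free_at = [0.0] * workers
    heapq.heapify(free_at)  # min-heap of worker idle times
    total_wait = 0.0
    for arrival, service in jobs:
        earliest = heapq.heappop(free_at)
        start = max(arrival, earliest)
        total_wait += start - arrival
        heapq.heappush(free_at, start + service)
    return total_wait


print(round_robin_wait(JOBS, 2))   # 9.0: a cheap job is stuck behind the expensive one
print(shared_queue_wait(JOBS, 2))  # 0.0: cheap jobs go to the idle worker
```

In HAProxy, a per-server connection limit (`maxconn 1` on each `server` line) is one way to get the second behaviour; the configuration actually merged for this task is not reproduced here.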

Testing scenario: before making any Puppet changes, we disable Puppet on the thumbor1001 host and have its haproxy listen on port 8800 temporarily. If that is successful, we move on to the Puppet changes.

This task is an alternative to T187203: Modify upstream Thumbor to allow true async engines

Revisions and Commits

rTHMBREXT Thumbor Plugins
Restricted Differential Revision

Event Timeline

It looks like this might be achievable with a combination of Apache, MPM prefork and mod_wsgi. Apache would run a fixed number of child processes (e.g. as many as there are cores), not use worker threads, and mod_wsgi would run a copy of the thumbor app in each process. The number of processes would be capped. When the cap is reached, Apache should wait for the next available process before handling the next request.

Change 417233 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Front Thumbor instances with Haproxy

https://gerrit.wikimedia.org/r/417233

Change 417982 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Front Thumbor instances with Haproxy instead of Nginx

https://gerrit.wikimedia.org/r/417982

Change 417982 merged by jenkins-bot:
[mediawiki/vagrant@master] Front Thumbor instances with Haproxy instead of Nginx

https://gerrit.wikimedia.org/r/417982

Gilles added a revision: Restricted Differential Revision. Mar 14 2018, 10:32 AM
Gilles changed the task status from Open to Stalled. Jun 25 2018, 6:45 PM

Stuck in review since March

Change 417233 merged by Filippo Giunchedi:
[operations/puppet@production] Front Thumbor instances with Haproxy

https://gerrit.wikimedia.org/r/417233

Change 449175 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] haproxy: drop default puppet server

https://gerrit.wikimedia.org/r/449175

Change 449175 merged by Filippo Giunchedi:
[operations/puppet@production] haproxy: drop default puppet server

https://gerrit.wikimedia.org/r/449175

Change 449183 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] haproxy: add stats socket to default config

https://gerrit.wikimedia.org/r/449183

Change 449183 merged by Filippo Giunchedi:
[operations/puppet@production] haproxy: add stats socket to default config

https://gerrit.wikimedia.org/r/449183

fgiunchedi changed the task status from Stalled to Open. Jul 30 2018, 1:44 PM

Sorry for the delay! I've merged the patches so haproxy is now running alongside nginx on thumbor instances.
Things still missing off the top of my head:

Thanks, @fgiunchedi -- FYI, Gilles is on vacation for several more weeks, and so this will probably hang out waiting until he gets back.

Testing haproxy on thumbor1001, it works fine. The only thing that's missing is the Thumbor-Request-Id header, which Nginx generates. The haproxy config is supposed to add one, but it is missing from the response:

# Send a unique request id to Thumbor instance
unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt:%pid
unique-id-header Thumbor-Request-Id

I suspect it's because in Nginx's case, we have 2 statements, one to add the header to the request sent to Thumbor, and one to add it to the response Nginx generates. In haproxy's case, it's only sending it to Thumbor. Should be easy to fix.

Change 456151 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Send Thumbor-Request-Id in haproxy response

https://gerrit.wikimedia.org/r/456151

Change 456153 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Send Thumbor-Request-Id in haproxy response

https://gerrit.wikimedia.org/r/456153

Change 456153 merged by jenkins-bot:
[mediawiki/vagrant@master] Send Thumbor-Request-Id in haproxy response

https://gerrit.wikimedia.org/r/456153

@fgiunchedi haproxy-exporter doesn't have a Debian package yet: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=906097. What do you want to do about it?

Concurrency/perf testing.

Nginx:

gilles@thumbor1001:~$ ab -n 1000 -c 5 http://127.0.0.1:8800/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/805px-Ariano_Irpino_ZI.jpeg

Concurrency Level:      5
Time taken for tests:   46.548 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      85208283 bytes
HTML transferred:       84073000 bytes
Requests per second:    21.48 [#/sec] (mean)
Time per request:       232.739 [ms] (mean)
Time per request:       46.548 [ms] (mean, across all concurrent requests)
Transfer rate:          1787.65 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       1
Processing:    25  219 2274.6     43   46071
Waiting:       25  219 2274.6     42   46071
Total:         25  219 2274.6     43   46071

Percentage of the requests served within a certain time (ms)
  50%     43
  66%     47
  75%     51
  80%     54
  90%     75
  95%    148
  98%    864
  99%   1830
 100%  46071 (longest request)

Haproxy:

gilles@thumbor1001:~$ ab -n 1000 -c 5 http://127.0.0.1:9800/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/805px-Ariano_Irpino_ZI.jpeg

Concurrency Level:      5
Time taken for tests:   16.323 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      85169043 bytes
HTML transferred:       84073000 bytes
Requests per second:    61.26 [#/sec] (mean)
Time per request:       81.617 [ms] (mean)
Time per request:       16.323 [ms] (mean, across all concurrent requests)
Transfer rate:          5095.33 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:    38   81 117.9     68    2284
Waiting:       38   81 117.9     67    2284
Total:         38   81 118.0     68    2284

Percentage of the requests served within a certain time (ms)
  50%     68
  66%     75
  75%     80
  80%     86
  90%    104
  95%    116
  98%    184
  99%    339
 100%   2284 (longest request)

This is extremely compelling in haproxy's favor...
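
As a quick sanity check on the two `ab` runs above, the headline figures can be recomputed from the reported totals. This is a minimal sketch: the numbers are copied from the output above, the dictionary layout and function names are illustrative only, and the results can differ from `ab`'s own in the last digit because `ab` works from an unrounded elapsed time.

```python
# Figures copied from the two ab runs above (1000 requests, concurrency 5).
REQUESTS = 1000
CONCURRENCY = 5
RUNS = {
    "nginx": {"seconds": 46.548, "median_ms": 43, "p99_ms": 1830},
    "haproxy": {"seconds": 16.323, "median_ms": 68, "p99_ms": 339},
}


def requests_per_second(seconds):
    """Mean throughput: completed requests divided by elapsed time."""
    return REQUESTS / seconds


def mean_time_per_request_ms(seconds):
    """ab's first 'Time per request' line: concurrency * elapsed / requests."""
    return CONCURRENCY * seconds / REQUESTS * 1000


for name, run in RUNS.items():
    print(
        name,
        round(requests_per_second(run["seconds"]), 2),
        round(mean_time_per_request_ms(run["seconds"]), 2),
    )

# Ratios between the two runs (nginx / haproxy).
throughput_gain = RUNS["nginx"]["seconds"] / RUNS["haproxy"]["seconds"]
p99_ratio = RUNS["nginx"]["p99_ms"] / RUNS["haproxy"]["p99_ms"]
print(round(throughput_gain, 2), round(p99_ratio, 2))
```

Note that the median is higher under haproxy (68 ms vs 43 ms) while the tail is far shorter (99th percentile 339 ms vs 1830 ms). That is consistent with requests no longer queueing behind a busy instance, though a single `ab` run against one URL is limited evidence.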

Note that we'll only get the full picture once haproxy is running against organic parallel requests. We've just started mirroring requests to the inactive DC, which is handy: we can switch to haproxy on the inactive DC only and see the effect on concurrency/latency there.

Change 456151 merged by Filippo Giunchedi:
[operations/puppet@production] Send Thumbor-Request-Id in haproxy response

https://gerrit.wikimedia.org/r/456151

The header doesn't work because apparently this feature is only available in haproxy 1.7+, and Jessie has haproxy 1.5.8-3...

This means I'm going to need to have Thumbor parrot the header it receives instead.

Change 456416 had a related patch set uploaded (by Krinkle; owner: Gilles):
[operations/puppet@production] Revert "Send Thumbor-Request-Id in haproxy response"

https://gerrit.wikimedia.org/r/456416

Change 456416 abandoned by Gilles:
Revert "Send Thumbor-Request-Id in haproxy response"

https://gerrit.wikimedia.org/r/456416

1.7.5 is actually in jessie-backports; we just need to tell Puppet to install that.

@fgiunchedi haproxy-exporter doesn't have a Debian package yet: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=906097. What do you want to do about it?

We'll need the haproxy exporter package for databases as well anyway. Since there's already an ITP open, we should reach out to that person and see where they are at.

Change 456578 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Use jessie-backports version of haproxy

https://gerrit.wikimedia.org/r/456578

Change 456578 merged by Filippo Giunchedi:
[operations/puppet@production] Use jessie-backports version of haproxy

https://gerrit.wikimedia.org/r/456578

The header now works as expected, thanks to the haproxy upgrade:

Xkey: File:Ariano_Irpino_ZI.jpeg
Proxy-Request-Date: 31/Aug/2018:12:33:55 +0000
Content-Type: image/jpeg
Proxy-Response-Date: 31/Aug/2018:12:33:55 +0000
X-Upstream: 127.0.0.1:8801
Thumbor-Request-Id: 7F000001:8D60_7F000001:2648_5B8935B3_000B:02FE
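
The header value can be decoded by hand, since the `unique-id-format` quoted earlier emits each field in hexadecimal (`%{+X}o`): client IP and port, frontend IP and port, timestamp, request counter and PID. The following Python sketch does that decoding. The function name and the returned dictionary keys are illustrative, and it assumes IPv4 addresses as in the sample above.

```python
from datetime import datetime, timezone
from ipaddress import IPv4Address


def decode_request_id(request_id):
    """Decode an ID built with %ci:%cp_%fi:%fp_%Ts_%rt:%pid in hex (IPv4 only)."""
    client, frontend, timestamp, tail = request_id.split("_")
    client_ip, client_port = client.split(":")
    frontend_ip, frontend_port = frontend.split(":")
    counter, pid = tail.split(":")
    return {
        "client": (str(IPv4Address(int(client_ip, 16))), int(client_port, 16)),
        "frontend": (str(IPv4Address(int(frontend_ip, 16))), int(frontend_port, 16)),
        "accepted_at": datetime.fromtimestamp(int(timestamp, 16), tz=timezone.utc),
        "request_counter": int(counter, 16),
        "pid": int(pid, 16),
    }


decoded = decode_request_id("7F000001:8D60_7F000001:2648_5B8935B3_000B:02FE")
print(decoded["client"])       # ('127.0.0.1', 36192)
print(decoded["frontend"])     # ('127.0.0.1', 9800)
print(decoded["accepted_at"])  # 2018-08-31 12:33:55+00:00
```

The decoded frontend port (9800) and timestamp match the haproxy test port used in the `ab` run and the Proxy-Request-Date header shown above.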

Change 461596 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Set up Thumbor Haproxy Prometheus exporter

https://gerrit.wikimedia.org/r/461596

Change 461596 merged by Filippo Giunchedi:
[operations/puppet@production] Set up Thumbor Haproxy Prometheus exporter

https://gerrit.wikimedia.org/r/461596

Change 461631 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: do not quote haproxy-exporter systemd args

https://gerrit.wikimedia.org/r/461631

Change 461631 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: do not quote haproxy-exporter systemd args

https://gerrit.wikimedia.org/r/461631

Change 461724 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Define Haproxy Prometheus jobs

https://gerrit.wikimedia.org/r/461724

Change 461724 merged by Filippo Giunchedi:
[operations/puppet@production] Define Haproxy Prometheus jobs

https://gerrit.wikimedia.org/r/461724

Adding profile::prometheus::haproxy_exporter::listen_port: 9901 to hiera data for deployment-imagescaler* hosts in deployment-prep

Change 465185 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP: define haproxy for thumbor

https://gerrit.wikimedia.org/r/465185

Gilles changed the task status from Open to Stalled. Feb 5 2019, 9:27 AM

@jijiki I don't believe this is blocked by the Buster upgrade, is it? Would be nice if you can make some room for it in your Q4 goals.

@Gilles, I am a bit unclear as to what remains to be done for this. Could you shed some light?

@akosiaris finishing and deploying Filippo's patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/465185

haproxy is already configured and running on the thumbor servers; it's just a matter of pointing to it (and making sure that things still work as expected, of course).

Mentioned in SAL (#wikimedia-operations) [2019-04-08T14:06:04Z] <jijiki> Temporarily serve thumbor traffic on thumbor1001 via haproxy - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-09T07:10:31Z] <jijiki> Depool thumbor1004 for testing - T187765

jijiki changed the task status from Stalled to Open. Apr 9 2019, 12:22 PM
jijiki raised the priority of this task from Low to Medium.

Mentioned in SAL (#wikimedia-operations) [2019-04-10T15:00:01Z] <jijiki> Enable puppet on thumbor1001, switch back to nginx, pool thumbor1004 - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-16T13:58:35Z] <jijiki> Disable puppet on thumbor1001 for ~24h to serve traffic via haproxy - T187765

Change 465185 abandoned by Filippo Giunchedi:
WIP: define haproxy service for thumbor

Reason:
Not needed

https://gerrit.wikimedia.org/r/465185

Mentioned in SAL (#wikimedia-operations) [2019-04-20T07:52:01Z] <jijiki> depool thumbor1001, switch back to nginx - T187765

Change 505759 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] thumbor: Use port 8800 for haproxy

https://gerrit.wikimedia.org/r/505759

Change 505759 merged by Effie Mouzeli:
[operations/puppet@production] thumbor: Use port 8800 for haproxy

https://gerrit.wikimedia.org/r/505759

Mentioned in SAL (#wikimedia-operations) [2019-04-23T14:14:34Z] <jijiki> Depool thumbor1001 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T14:16:39Z] <jijiki> Depool thumbor2001 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T14:21:39Z] <jijiki> Depool thumbor1002 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T14:27:37Z] <jijiki> Depool thumbor2002 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T16:39:57Z] <jijiki> Depool thumbor1003 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T16:43:16Z] <jijiki> Depool thumbor2003 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T16:49:27Z] <jijiki> Depool thumbor1004 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T16:55:19Z] <jijiki> Depool thumbor2004 for 505759 and pool back - T187765

jijiki renamed this task from "Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests" to "Replace the Nginx fronting Thumbor with Haproxy". Apr 23 2019, 5:27 PM
jijiki closed this task as Resolved.

All traffic is served by haproxy. If we have any issues, this can be easily reverted. Closing for now.