Replace the Nginx fronting Thumbor with Haproxy
Closed, Resolved (Public)

Assigned To

Authored By

	• Gilles
	Feb 20 2018, 9:06 AM

Description

We currently use Nginx to front Thumbor instances. However, this comes with a big limitation: a Thumbor instance that is busy rendering an expensive thumbnail can be handed its next request "too early", leaving that request waiting needlessly while other Thumbor instances free up.

Ideally, due to the single-threaded nature of Thumbor, instances should only get new requests if they're not currently processing one. This would maximize core usage and ensure that each waiting request is sent to an instance as soon as one frees up. This requires combining request queueing with load balancing, which Nginx cannot do. While Nginx is scriptable with Lua, the Lua code can't communicate across workers without using a service like memcache or redis, which is quite inefficient.

Instead, we should set up a proxy that meets Thumbor's needs exactly, to replace Nginx. The feature set should be the following:

Retries. When Thumbor instances die (OOM, bug, upgrade), it's necessary to retry the request on another Thumbor instance.
Queueing. Requests should only be sent to Thumbor instances when they're free.
Max queue size. Send back 503s when it's reached.
Timeouts.
Reading and adding headers.
Monitoring (preferably with Prometheus). Request latency, duration, etc.
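
To illustrate the queueing requirement in the list above, here is a minimal Python sketch. It is a toy model, not a description of how Nginx or Haproxy actually schedule requests. It assumes every request arrives at time 0 and that each instance handles one request at a time; the function names and sample durations are made up for the illustration. `dispatch_on_arrival` binds each request to an instance immediately, while `dispatch_when_free` holds requests in a single queue and hands the next one to whichever instance frees up first.

```python
import heapq


def dispatch_on_arrival(durations, workers):
    """Bind request i to instance i % workers as soon as it arrives.

    Returns the completion time of each request, in request order.
    """
    free_at = [0] * workers  # time at which each instance becomes idle
    done = []
    for i, duration in enumerate(durations):
        w = i % workers
        free_at[w] += duration  # the request waits behind that instance's backlog
        done.append(free_at[w])
    return done


def dispatch_when_free(durations, workers):
    """Keep requests in one shared queue; give the next one to the first idle instance.

    Returns the completion time of each request, in request order.
    """
    free_at = [0] * workers
    heapq.heapify(free_at)  # min-heap: the instance that frees up first is on top
    done = []
    for duration in durations:
        start = heapq.heappop(free_at)
        finish = start + duration
        heapq.heappush(free_at, finish)
        done.append(finish)
    return done


# Hypothetical workload: one expensive thumbnail followed by five cheap ones.
durations = [10, 1, 1, 1, 1, 1]
early = dispatch_on_arrival(durations, workers=2)
queued = dispatch_when_free(durations, workers=2)
print("dispatch on arrival:", early)
print("dispatch when free: ", queued)
```

With two instances, the first strategy parks two cheap requests behind the expensive one, while the second lets the idle instance drain all the cheap requests in the meantime.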

Testing scenario: before performing any Puppet changes, we disable Puppet on the thumbor1001 host and have its haproxy listen on port 8800 temporarily. If successful, we move on to the Puppet changes.

This task is an alternative to T187203: Modify upstream Thumbor to allow true async engines

Details

Subject	Repo	Branch	Lines +/-
thumbor: Use port 8800 for haproxy	operations/puppet	production	+3 -2
WIP: define haproxy service for thumbor	operations/puppet	production	+42 -1
Define Haproxy Prometheus jobs	operations/puppet	production	+19 -1
prometheus: do not quote haproxy-exporter systemd args	operations/puppet	production	+1 -1
Set up Thumbor Haproxy Prometheus exporter	operations/puppet	production	+64 -0
Use jessie-backports version of haproxy	operations/puppet	production	+8 -0
Revert "Send Thumbor-Request-Id in haproxy response"	operations/puppet	production	+0 -1
Send Thumbor-Request-Id in haproxy response	operations/puppet	production	+1 -0
Send Thumbor-Request-Id in haproxy response	mediawiki/vagrant	master	+1 -0
haproxy: add stats socket to default config	operations/puppet	production	+2 -0
haproxy: drop default puppet server	operations/puppet	production	+0 -9
Front Thumbor instances with Haproxy	operations/puppet	production	+101 -0
Front Thumbor instances with Haproxy instead of Nginx	mediawiki/vagrant	master	+279 -40

Revisions and Commits

rTHMBREXT Thumbor Plugins
	Restricted Differential Revision	rTHMBREXT8f851994136d Rename request header to more generic name

Related Objects
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here.

Status	Assigned	Task
		· · ·
Resolved	hnowlan	T216815 Upgrade Thumbor to Buster
Resolved	• Gilles	T187765 Replace the Nginx fronting Thumbor with Haproxy
Resolved	fgiunchedi	T204266 Backport prometheus haproxy exporter for Jessie
Resolved	jijiki	T220499 Export useful metrics from haproxy logs for Thumbor
		· · ·

Event Timeline


• Gilles triaged this task as Low priority. Feb 20 2018, 9:07 AM

It looks like this might be achievable with a combination of Apache, MPM prefork and mod_wsgi. Apache would run a fixed number of child processes (e.g. as many as there are cores) without worker threads, and mod_wsgi would run a copy of the Thumbor app in each process. The number of processes would be capped; once the cap is reached, Apache should wait for the next available process before handling the next request.

• Gilles moved this task from Inbox, needs triage to Backlog: Maintenance, non-prioritized on the Performance-Team board. Feb 26 2018, 9:14 PM

Haproxy might actually work: https://stackoverflow.com/questions/8750518/difference-between-global-maxconn-and-server-maxconn-haproxy I'll try that first.

Change 417233 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Front Thumbor instances with Haproxy

https://gerrit.wikimedia.org/r/417233

gerritbot added a project: Patch-For-Review. Mar 8 2018, 11:34 AM

Change 417982 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Front Thumbor instances with Haproxy instead of Nginx

https://gerrit.wikimedia.org/r/417982

Change 417982 merged by jenkins-bot:
[mediawiki/vagrant@master] Front Thumbor instances with Haproxy instead of Nginx

https://gerrit.wikimedia.org/r/417982

• Gilles moved this task from Backlog to Doing on the Thumbor board. Mar 14 2018, 10:23 AM

• Gilles added a revision: Restricted Differential Revision. Mar 14 2018, 10:32 AM

• Gilles added a commit: rTHMBREXT8f851994136d: Rename request header to more generic name. Mar 22 2018, 8:34 AM

• Gilles moved this task from Backlog: Maintenance, non-prioritized to Blocked (old) on the Performance-Team board. Apr 24 2018, 7:30 AM

@Gilles Schedule for Q1, talk with SRE to review

Stuck in review since March

elukey subscribed. Jul 29 2018, 10:46 AM

Change 417233 merged by Filippo Giunchedi:
[operations/puppet@production] Front Thumbor instances with Haproxy

https://gerrit.wikimedia.org/r/417233

Change 449175 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] haproxy: drop default puppet server

https://gerrit.wikimedia.org/r/449175

Change 449175 merged by Filippo Giunchedi:
[operations/puppet@production] haproxy: drop default puppet server

https://gerrit.wikimedia.org/r/449175

Change 449183 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] haproxy: add stats socket to default config

https://gerrit.wikimedia.org/r/449183

Change 449183 merged by Filippo Giunchedi:
[operations/puppet@production] haproxy: add stats socket to default config

https://gerrit.wikimedia.org/r/449183

Sorry for the delay! I've merged the patches so haproxy is now running alongside nginx on thumbor instances.
Things still missing, off the top of my head:

Prometheus stats (via https://github.com/prometheus/haproxy_exporter)
Firewall rules
Queueing behaviour testing

Thanks, @fgiunchedi -- FYI, Gilles is on vacation for several more weeks, and so this will probably hang out waiting until he gets back.

Krinkle removed a project: Patch-For-Review. Jul 30 2018, 11:31 PM

Testing haproxy on thumbor1001, it works fine. The only thing that's missing is the Thumbor-Request-Id header, which Nginx generates. The haproxy config is supposed to add one, but for some reason it's not present in the response:

# Send a unique request id to Thumbor instance
unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt:%pid
unique-id-header Thumbor-Request-Id

I suspect it's because in Nginx's case, we have 2 statements, one to add the header to the request sent to Thumbor, and one to add it to the response Nginx generates. In haproxy's case, it's only sending it to Thumbor. Should be easy to fix.

Change 456151 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Send Thumbor-Request-Id in haproxy response

https://gerrit.wikimedia.org/r/456151

Change 456153 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Send Thumbor-Request-Id in haproxy response

https://gerrit.wikimedia.org/r/456153

Change 456153 merged by jenkins-bot:
[mediawiki/vagrant@master] Send Thumbor-Request-Id in haproxy response

https://gerrit.wikimedia.org/r/456153

@fgiunchedi haproxy-exporter doesn't have a Debian package yet: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=906097 what do you want to do about it?

Concurrency/perf testing.

Nginx:

gilles@thumbor1001:~$ ab -n 1000 -c 5 http://127.0.0.1:8800/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/805px-Ariano_Irpino_ZI.jpeg

Concurrency Level:      5
Time taken for tests:   46.548 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      85208283 bytes
HTML transferred:       84073000 bytes
Requests per second:    21.48 [#/sec] (mean)
Time per request:       232.739 [ms] (mean)
Time per request:       46.548 [ms] (mean, across all concurrent requests)
Transfer rate:          1787.65 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       1
Processing:    25  219 2274.6     43   46071
Waiting:       25  219 2274.6     42   46071
Total:         25  219 2274.6     43   46071

Percentage of the requests served within a certain time (ms)
  50%     43
  66%     47
  75%     51
  80%     54
  90%     75
  95%    148
  98%    864
  99%   1830
 100%  46071 (longest request)

Haproxy:

gilles@thumbor1001:~$ ab -n 1000 -c 5 http://127.0.0.1:9800/wikipedia/commons/thumb/c/ca/Ariano_Irpino_ZI.jpeg/805px-Ariano_Irpino_ZI.jpeg

Concurrency Level:      5
Time taken for tests:   16.323 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      85169043 bytes
HTML transferred:       84073000 bytes
Requests per second:    61.26 [#/sec] (mean)
Time per request:       81.617 [ms] (mean)
Time per request:       16.323 [ms] (mean, across all concurrent requests)
Transfer rate:          5095.33 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:    38   81 117.9     68    2284
Waiting:       38   81 117.9     67    2284
Total:         38   81 118.0     68    2284

Percentage of the requests served within a certain time (ms)
  50%     68
  66%     75
  75%     80
  80%     86
  90%    104
  95%    116
  98%    184
  99%    339
 100%   2284 (longest request)

This is extremely compelling in haproxy's favor...
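
As a sanity check on the two ab runs above, the headline figures can be rederived from the totals that ab reports. This is a small Python sketch; the function and variable names are made up here, and the inputs are copied from the output above.

```python
def ab_summary(requests, seconds, concurrency):
    """Rederive ab's mean figures from its reported totals."""
    requests_per_second = requests / seconds
    # Mean time per request as seen by one of the concurrent clients.
    ms_per_request = concurrency * seconds * 1000 / requests
    return requests_per_second, ms_per_request


nginx_rps, nginx_ms = ab_summary(1000, 46.548, 5)
haproxy_rps, haproxy_ms = ab_summary(1000, 16.323, 5)

# Percentiles in ms, copied from the two runs above.
nginx_pct = {50: 43, 99: 1830}
haproxy_pct = {50: 68, 99: 339}

throughput_gain = haproxy_rps / nginx_rps
p99_gain = nginx_pct[99] / haproxy_pct[99]
median_cost_ms = haproxy_pct[50] - nginx_pct[50]

print(f"throughput: {nginx_rps:.2f} -> {haproxy_rps:.2f} req/s ({throughput_gain:.2f}x)")
print(f"mean time per request: {nginx_ms:.1f} -> {haproxy_ms:.1f} ms")
print(f"99th percentile: {p99_gain:.1f}x lower, median: {median_cost_ms} ms higher")
```

On these numbers, throughput is roughly 2.85 times higher with haproxy and the 99th percentile is about 5.4 times lower, while the median is 25 ms higher. That trade-off would be expected if requests wait briefly in a queue instead of piling up behind a busy instance, but a single synthetic run can't confirm the cause.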

Note that we'll only get the full picture once haproxy is running against organic parallel requests. Since we've just started mirroring requests to the inactive DC, it will be handy, as we'll be able to switch to haproxy only on the inactive DC and see the effect on concurrency/latency there.

Change 456151 merged by Filippo Giunchedi:
[operations/puppet@production] Send Thumbor-Request-Id in haproxy response

https://gerrit.wikimedia.org/r/456151

The header doesn't work because apparently this feature is only available in haproxy 1.7+ and Jessie has haproxy 1.5.8-3...

This means I'm going to need to have Thumbor parrot the header it receives instead.

• Gilles removed a project: Patch-For-Review. Aug 30 2018, 4:37 PM

Change 456416 had a related patch set uploaded (by Krinkle; owner: Gilles):
[operations/puppet@production] Revert "Send Thumbor-Request-Id in haproxy response"

https://gerrit.wikimedia.org/r/456416

gerritbot added a project: Patch-For-Review. Aug 30 2018, 5:59 PM

Change 456416 abandoned by Gilles:
Revert "Send Thumbor-Request-Id in haproxy response"

https://gerrit.wikimedia.org/r/456416

1.7.5 is actually in jessie-backports; we just need to tell Puppet to install that.

In T187765#4542419, @Gilles wrote:

@fgiunchedi haproxy-exporter doesn't have a Debian package yet: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=906097 what do you want to do about it?

We'll need the haproxy exporter package for databases as well anyway. Since there's already an ITP open, we should reach out to that person and see where they are at.

Change 456578 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Use jessie-backports version of haproxy

https://gerrit.wikimedia.org/r/456578

Change 456578 merged by Filippo Giunchedi:
[operations/puppet@production] Use jessie-backports version of haproxy

https://gerrit.wikimedia.org/r/456578

The header now works as expected, thanks to the haproxy upgrade:

Xkey: File:Ariano_Irpino_ZI.jpeg
Proxy-Request-Date: 31/Aug/2018:12:33:55 +0000
Content-Type: image/jpeg
Proxy-Response-Date: 31/Aug/2018:12:33:55 +0000
X-Upstream: 127.0.0.1:8801
Thumbor-Request-Id: 7F000001:8D60_7F000001:2648_5B8935B3_000B:02FE
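
For reference, the id above can be decoded by hand. The following Python sketch assumes the `unique-id-format` quoted earlier in this task (hex-encoded client IP and port, frontend IP and port, timestamp, request counter and PID) and IPv4 addresses only; the function name and the dictionary keys are made up for the illustration.

```python
import ipaddress
from datetime import datetime, timezone


def decode_request_id(request_id):
    """Split a Thumbor-Request-Id built from %ci:%cp_%fi:%fp_%Ts_%rt:%pid (hex)."""
    client, frontend, timestamp, counter_pid = request_id.split("_")
    client_ip, client_port = client.split(":")
    frontend_ip, frontend_port = frontend.split(":")
    counter, pid = counter_pid.split(":")

    def ipv4(hex_ip):
        # Assumption: the address is IPv4, encoded as 8 hex digits.
        return str(ipaddress.IPv4Address(int(hex_ip, 16)))

    return {
        "client": (ipv4(client_ip), int(client_port, 16)),
        "frontend": (ipv4(frontend_ip), int(frontend_port, 16)),
        "timestamp": datetime.fromtimestamp(int(timestamp, 16), tz=timezone.utc),
        "request_counter": int(counter, 16),
        "pid": int(pid, 16),
    }


decoded = decode_request_id("7F000001:8D60_7F000001:2648_5B8935B3_000B:02FE")
for key, value in decoded.items():
    print(key, value)
```

For the id shown above, the frontend decodes to 127.0.0.1 port 9800 and the timestamp to 2018-08-31 12:33:55 UTC, which matches the Proxy-Request-Date header in the same response.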

Paladox subscribed. Sep 3 2018, 9:58 PM

fgiunchedi added a project: User-fgiunchedi. Sep 5 2018, 9:49 AM

fgiunchedi closed subtask T204266: Backport prometheus haproxy exporter for Jessie as Resolved. Sep 19 2018, 8:44 AM

Change 461596 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Set up Thumbor Haproxy Prometheus exporter

https://gerrit.wikimedia.org/r/461596

Joe subscribed. Sep 20 2018, 8:46 AM

Change 461596 merged by Filippo Giunchedi:
[operations/puppet@production] Set up Thumbor Haproxy Prometheus exporter

https://gerrit.wikimedia.org/r/461596

Change 461631 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: do not quote haproxy-exporter systemd args

https://gerrit.wikimedia.org/r/461631

Change 461631 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: do not quote haproxy-exporter systemd args

https://gerrit.wikimedia.org/r/461631

Change 461724 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Define Haproxy Prometheus jobs

https://gerrit.wikimedia.org/r/461724

Change 461724 merged by Filippo Giunchedi:
[operations/puppet@production] Define Haproxy Prometheus jobs

https://gerrit.wikimedia.org/r/461724

Adding profile::prometheus::haproxy_exporter::listen_port: 9901 to hiera data for deployment-imagescaler* hosts in deployment-prep

Change 465185 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] WIP: define haproxy for thumbor

https://gerrit.wikimedia.org/r/465185

jijiki added a project: User-jijiki. Dec 19 2018, 3:28 PM

jijiki subscribed.

jijiki mentioned this in T212946: Stream Thumbor logs to logstash. Jan 8 2019, 10:46 AM

jijiki mentioned this in T170817: Upgrade Thumbor servers to Stretch. Jan 8 2019, 11:18 AM

• Gilles changed the task status from Open to Stalled. Feb 5 2019, 9:27 AM

jijiki mentioned this in T216815: Upgrade Thumbor to Buster. Feb 26 2019, 12:22 PM

jijiki added a parent task: T216815: Upgrade Thumbor to Buster.

@jijiki I don't believe this is blocked by the Buster upgrade, is it? Would be nice if you can make some room for it in your Q4 goals.

@Gilles, I am a bit unclear as to what remains to be done for this. Could you shed some light?

@akosiaris finishing and deploying Filippo's patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/465185

haproxy is already configured and running on the thumbor servers; it's just a matter of pointing traffic to it (and making sure that things still work as expected, of course).

jijiki moved this task from Inbox 🐅 to In Progress 🏋️‍♀️ on the User-jijiki board. Apr 4 2019, 9:22 PM

jijiki updated the task description. Apr 8 2019, 2:04 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-08T14:06:04Z] <jijiki> Temporarily serve thumbor traffic on thumbor1001 via haproxy - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-09T07:10:31Z] <jijiki> Depool thumbor1004 for testing - T187765

jijiki changed the task status from Stalled to Open. Apr 9 2019, 12:22 PM

jijiki raised the priority of this task from Low to Medium.

jijiki added a subtask: T220499: Export useful metrics from haproxy logs for Thumbor.

Mentioned in SAL (#wikimedia-operations) [2019-04-10T15:00:01Z] <jijiki> Enable puppet on thumbor1001, switch back to nginx, pool thumbor1004 - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-16T13:58:35Z] <jijiki> Disable puppet on thumbor1001 for ~24h to serve traffic via haproxy - T187765

Change 465185 abandoned by Filippo Giunchedi:
WIP: define haproxy service for thumbor

Reason:
Not needed

https://gerrit.wikimedia.org/r/465185

Mentioned in SAL (#wikimedia-operations) [2019-04-20T07:52:01Z] <jijiki> depool thumbor1001, switch back to nginx - T187765

• Gilles removed a project: Patch-For-Review. Apr 22 2019, 2:52 PM

• Gilles closed subtask T220499: Export useful metrics from haproxy logs for Thumbor as Resolved. Apr 22 2019, 5:21 PM

Change 505759 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] thumbor: Use port 8800 for haproxy

https://gerrit.wikimedia.org/r/505759

gerritbot added a project: Patch-For-Review. Apr 23 2019, 12:26 PM

Change 505759 merged by Effie Mouzeli:
[operations/puppet@production] thumbor: Use port 8800 for haproxy

https://gerrit.wikimedia.org/r/505759

Mentioned in SAL (#wikimedia-operations) [2019-04-23T14:14:34Z] <jijiki> Depool thumbor1001 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T14:16:39Z] <jijiki> Depool thumbor2001 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T14:21:39Z] <jijiki> Depool thumbor1002 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T14:27:37Z] <jijiki> Depool thumbor2002 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T16:39:57Z] <jijiki> Depool thumbor1003 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T16:43:16Z] <jijiki> Depool thumbor2003 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T16:49:27Z] <jijiki> Depool thumbor1004 for 505759 and pool back - T187765

Mentioned in SAL (#wikimedia-operations) [2019-04-23T16:55:19Z] <jijiki> Depool thumbor2004 for 505759 and pool back - T187765

All traffic is served by haproxy. If we have any issues, this can be easily reverted. Closing for now.

akosiaris awarded a token. Apr 24 2019, 9:06 AM
