
Track incoming HTTP request count on the Thumbor boxes
Closed, Resolved, Public

Description

We should probably track this at the nginx level, to get a clear picture of incoming traffic. At the moment we only track the end result, once thumbor has tried to process a request.

In fact, if possible, we should also track status codes returned by nginx, which can be different to thumbor's. That's particularly interesting in the timeout scenario where nginx gives up because thumbor isn't responding fast enough.

Event Timeline

@fgiunchedi what do you think of using something like https://github.com/zebrafishlabs/nginx-statsd or https://github.com/knyar/nginx-lua-prometheus ?

Do we have any precedent of doing something like this with nginx?

@Gilles AFAIK there's no precedent like that, no. It would be useful, though, for other places where we have nginx deployed and want to gain more insight. The Lua code seems better: it has tests and, I think, depends less on the nginx API changing. I'd try that first in beta and see how it goes. Trying the C plugin should be equally easy, since there's already statsite listening on localhost:8125 on thumbor.
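
For reference, the statsite route boils down to sending StatsD-formatted UDP datagrams to localhost:8125. Below is a minimal Python sketch of that wire format, not the nginx-statsd plugin itself; the metric name in the trailing comment and both helper function names are made up for illustration.

```python
import socket


def statsd_counter(name: str, value: int = 1) -> bytes:
    """Format a StatsD counter datagram: b'<name>:<value>|c'."""
    return f"{name}:{value}|c".encode("ascii")


def send_metric(payload: bytes, host: str = "localhost", port: int = 8125) -> None:
    """Send one datagram to a StatsD-compatible listener such as statsite.

    UDP is fire-and-forget: the sender typically gets no error even if
    nothing is listening on the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical usage (the metric name is an assumption, not an existing one):
# send_metric(statsd_counter("thumbor.nginx.status.502"))
```

The Lua/Prometheus option inverts this model: nothing is pushed per request, and Prometheus scrapes counters that nginx keeps in memory.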

As a note, just looking at yesterday's data, nginx 502s once per minute on average. Much larger old error log files suggest that this might peak at times. We really need to record that in a graph.
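
Until there is a graph, a per-minute count of this kind can be pulled out of logs by hand. A rough Python sketch follows, assuming access log lines in nginx's default "combined" format (the figures above may have come from the error log instead, which has a different layout). The sample lines are invented for illustration, and the regex does not handle request strings containing escaped quotes.

```python
import re
from collections import Counter

# Matches the '[time_local] "request" status ' part of a combined-format line.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) ')


def count_per_minute(lines, status="502"):
    """Count responses with the given status, keyed by 'dd/Mon/yyyy:HH:MM'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status") == status:
            # The first 17 characters of time_local cover date, hour and minute.
            counts[match.group("ts")[:17]] += 1
    return counts


# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    '10.0.0.1 - - [23/Aug/2017:10:15:32 +0000] "GET /a.jpg HTTP/1.1" 502 173 "-" "-"',
    '10.0.0.1 - - [23/Aug/2017:10:15:48 +0000] "GET /b.jpg HTTP/1.1" 200 5120 "-" "-"',
    '10.0.0.1 - - [23/Aug/2017:10:16:03 +0000] "GET /c.jpg HTTP/1.1" 502 173 "-" "-"',
]

if __name__ == "__main__":
    for minute, n in sorted(count_per_minute(SAMPLE_LINES).items()):
        print(minute, n)
```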

Gilles raised the priority of this task from Medium to High.

Change 372235 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Expose Thumbor nginx metrics in Prometheus format

https://gerrit.wikimedia.org/r/372235

Change 372235 merged by jenkins-bot:
[mediawiki/vagrant@master] Expose Thumbor nginx metrics in Prometheus format

https://gerrit.wikimedia.org/r/372235

Change 372254 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Expose Thumbor Nginx metrics in Prometheus format

https://gerrit.wikimedia.org/r/372254

Change 372543 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet/nginx@master] Add Prometheus lua script for nginx-full

https://gerrit.wikimedia.org/r/372543

Change 372543 merged by jenkins-bot:
[operations/puppet/nginx@master] Add Prometheus lua script for nginx-extras

https://gerrit.wikimedia.org/r/372543

Change 372254 merged by Filippo Giunchedi:
[operations/puppet@production] Expose Thumbor Nginx metrics in Prometheus format

https://gerrit.wikimedia.org/r/372254

Change 372821 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: put thumbor hosts in cluster thumbor

https://gerrit.wikimedia.org/r/372821

Change 372821 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: put thumbor hosts in cluster thumbor

https://gerrit.wikimedia.org/r/372821

Change 372823 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: poll nginx metrics from thumbor hosts

https://gerrit.wikimedia.org/r/372823

Change 372823 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: poll nginx metrics from thumbor hosts

https://gerrit.wikimedia.org/r/372823

Change 372825 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] hieradata: add thumbor hostgroups

https://gerrit.wikimedia.org/r/372825

Change 372825 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: add thumbor hostgroups

https://gerrit.wikimedia.org/r/372825

Change 372826 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] role: explicit cluster label for thumbor

https://gerrit.wikimedia.org/r/372826

Change 372826 merged by Filippo Giunchedi:
[operations/puppet@production] role: explicit cluster label for thumbor

https://gerrit.wikimedia.org/r/372826

Patches are merged and stats are being polled by Prometheus in codfw and eqiad. I've added basic request rates by status to https://grafana.wikimedia.org/dashboard/db/thumbor

I've added request latency percentiles to the thumbor dashboard as well. Note that we'll need to add more histogram buckets, since the highest is currently 10s and thumbor can take longer than that.
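
To illustrate why the 10s ceiling matters, here is a small Python sketch of the linear interpolation that, as far as I understand it, Prometheus's histogram_quantile applies to cumulative buckets. It is a simplified re-implementation for illustration, not the Prometheus code, and the bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with the last bound being math.inf (the "+Inf" bucket).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the highest finite bucket, so the
                # best available answer is that bucket's upper bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Invented counts: 3 requests out of 100 took longer than 10s.
CAPPED = [(1.0, 90), (5.0, 95), (10.0, 97), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.99, CAPPED))
```

With these counts the p99 comes out as exactly 10.0, however long the slowest requests actually took. Adding finite buckets above 10s gives the interpolation something to work with in that range.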

Change 374208 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: tune histogram buckets for nginx request duration

https://gerrit.wikimedia.org/r/374208

Change 374208 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: tune histogram buckets for nginx request duration

https://gerrit.wikimedia.org/r/374208

Change 374327 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] thumbor: more latency buckets for nginx requests

https://gerrit.wikimedia.org/r/374327

Change 374327 merged by Filippo Giunchedi:
[operations/puppet@production] thumbor: more latency buckets for nginx requests

https://gerrit.wikimedia.org/r/374327