Prometheus exporter for XHGui
Open, Needs Triage, Public

Description

We need monitoring for the XHGui service.

My plan is to add a simple Prometheus exporter endpoint, which exposes some basic stats about the profiles in the database. This would let us track growth over time, and alert if too much time passes without any profiles being written.
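As a rough illustration of what such an endpoint could return, here is a minimal sketch in Python (not the exporter's actual code) that renders three stats in the Prometheus text exposition format. The stats chosen (profile count, latest profile timestamp, total profile size) mirror the query quoted later in this task. The metric names and both function names are hypothetical.

```python
from typing import Union

Number = Union[int, float]


def render_metrics(profiles: int, latest_ts: Number, total_bytes: int) -> str:
    """Render profile stats in the Prometheus text exposition format.

    All metric names below are made up for illustration.
    """
    samples = [
        ("xhgui_profiles_count",
         "Number of profiles in the database.", profiles),
        ("xhgui_latest_profile_timestamp_seconds",
         "Unix time of the most recent profile.", latest_ts),
        ("xhgui_profile_bytes",
         "Total size of stored profile data in bytes.", total_bytes),
    ]
    lines = []
    for name, help_text, value in samples:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The exposition format expects a trailing newline.
    return "\n".join(lines) + "\n"


def is_stale(latest_ts: Number, now: Number, max_age_seconds: Number) -> bool:
    """Return True if no profile has been written within max_age_seconds."""
    return (now - latest_ts) > max_age_seconds
```

In practice the staleness check would more likely live in a Prometheus alerting rule that compares the timestamp gauge against `time()`. `is_stale` only spells out the comparison.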

Event Timeline

Change 608973 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] [WIP] webperf: Enable prometheus-apache-exporter

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608973

Change 610948 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/debs/xhgui@master] [WIP] Add Prometheus exporter

https://gerrit.wikimedia.org/r/610948

Change 610948 abandoned by Dave Pifke:
[performance/debs/xhgui@master] [WIP] Add Prometheus exporter

Reason:
Wrong branch; will resubmit against wmf to remove dependency on packaging stuff.

https://gerrit.wikimedia.org/r/610948

Change 611448 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/debs/xhgui@master] Add Prometheus exporter

https://gerrit.wikimedia.org/r/611448

Change 612394 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/debs/xhgui@wmf] Add Prometheus exporter

https://gerrit.wikimedia.org/r/612394

Change 611448 abandoned by Dave Pifke:
[performance/debs/xhgui@master] Add Prometheus exporter

Reason:
git-review uploaded this to the wrong branch. Correct version is pending on the wmf branch.

https://gerrit.wikimedia.org/r/611448

Change 608973 abandoned by Dave Pifke:
[operations/puppet@production] [WIP] webperf: Enable prometheus-apache-exporter

Reason:
The information from this exporter is useful for some forms of troubleshooting, but not for the stated goal of generating alerts if one of the performance.wikimedia.org backends starts throwing errors. I therefore don't see a lot of benefit in collecting these metrics at the moment.

https://gerrit.wikimedia.org/r/608973

Change 612394 abandoned by Dave Pifke:
[performance/debs/xhgui@wmf] Add Prometheus exporter

Reason:
Merged upstream.

https://gerrit.wikimedia.org/r/612394

Change 622447 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] xhgui: scrape Prometheus exporter

https://gerrit.wikimedia.org/r/622447

I tried out the endpoint in prod and the request timed out. The error ("upstream request timeout") appears to come from a proxy somewhere. I don't think a timeout from the XHGui Apache or the Webperf Apache looks like this, so perhaps it is from ATS/Envoy middleware higher up in the stack.

Anyway, if the responses don't time out internally, it might be acceptable for them to occasionally take 60 seconds, I suppose. I don't know how this affects Prometheus scraping or what the tolerance is there.

I checked Tendril and did find the SQL query there:

SELECT COUNT(*) AS profiles, MAX(request_ts) AS latest, SUM(LENGTH(profile)) AS bytes FROM xhgui /* db1107 xhgui 65s */

Depending on whether optimising the query is trivial, we might want to fence this off from the public Apache, for now at least, e.g. by denying /metrics* entirely in the Webperf Apache config, especially since this is targeting the m2 master database host. That makes me wonder: would xhgui be able to run from an m2 replica? That might be worth exploring as well, since afaik our xhgui setup is in theory meant to be read-only anyhow.

Removing inactive task assignee (please do so as part of offboarding processes).