Prometheus exporter for XHGui
Open, Needs Triage · Public

Assigned To

None

Authored By

	• dpifke
	Jun 22 2020, 6:05 PM

Description

We need monitoring for the XHGui service.

My plan is to add a simple Prometheus exporter endpoint, which exposes some basic stats about the profiles in the database. This would let us track growth over time, and alert if too much time passes without any profiles being written.
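As a rough illustration of that plan, the sketch below renders three gauges in the Prometheus text exposition format and computes how long it has been since the last profile was written. It is not the exporter from the linked patches: the metric names, `render_metrics`, and `seconds_since_last_profile` are hypothetical and chosen only for this example.

```python
from typing import Mapping

# Hypothetical metric names and help strings; the real exporter may differ.
METRICS = (
    ("xhgui_profiles", "Number of profiles stored in the database."),
    ("xhgui_latest_profile_timestamp_seconds",
     "Unix time of the most recently written profile."),
    ("xhgui_profile_bytes", "Total size of stored profile data, in bytes."),
)


def render_metrics(stats: Mapping[str, float]) -> str:
    """Render the stats as Prometheus text exposition format gauges."""
    lines = []
    for name, help_text in METRICS:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {stats[name]}")
    # The exposition format expects a trailing newline.
    return "\n".join(lines) + "\n"


def seconds_since_last_profile(stats: Mapping[str, float], now: float) -> float:
    """Age of the newest profile; an alert could fire when this grows too large."""
    return now - stats["xhgui_latest_profile_timestamp_seconds"]
```

Exposing the latest timestamp as a gauge, rather than a precomputed age, leaves the "too much time has passed" threshold to the alerting rule instead of the exporter.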

Details

Subject	Repo	Branch	Lines +/-
xhgui: scrape Prometheus exporter	operations/puppet	production	+12 -0
Add Prometheus exporter	performance/debs/xhgui	wmf	+102 -0
[WIP] webperf: Enable prometheus-apache-exporter	operations/puppet	production	+3 -0
Add Prometheus exporter	performance/debs/xhgui	master	+96 -0
[WIP] Add Prometheus exporter	performance/debs/xhgui	master	+96 -0

Related Objects

Mentioned In: T260086: Performance team services health dashboard

Event Timeline

• dpifke created this task. Jun 22 2020, 6:05 PM

Restricted Application added a subscriber: Aklapper. Jun 22 2020, 6:05 PM

• dpifke moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board. Jun 22 2020, 6:06 PM

Change 608973 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] [WIP] webperf: Enable prometheus-apache-exporter

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608973

gerritbot added a project: Patch-For-Review. Jul 1 2020, 11:04 PM

Change 610948 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/debs/xhgui@master] [WIP] Add Prometheus exporter

https://gerrit.wikimedia.org/r/610948

Change 610948 abandoned by Dave Pifke:
[performance/debs/xhgui@master] [WIP] Add Prometheus exporter

Reason:
Wrong branch; will resubmit against wmf to remove dependency on packaging stuff.

https://gerrit.wikimedia.org/r/610948

Change 611448 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/debs/xhgui@master] Add Prometheus exporter

https://gerrit.wikimedia.org/r/611448

Change 612394 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/debs/xhgui@wmf] Add Prometheus exporter

https://gerrit.wikimedia.org/r/612394

Change 611448 abandoned by Dave Pifke:
[performance/debs/xhgui@master] Add Prometheus exporter

Reason:
git-review uploaded this to the wrong branch. Correct version is pending on the wmf branch.

https://gerrit.wikimedia.org/r/611448

fgiunchedi moved this task from Inbox to Radar on the observability board. Jul 20 2020, 12:57 PM

• dpifke mentioned this in T260086: Performance team services health dashboard. Aug 10 2020, 8:16 PM

Krinkle added a project: WikimediaDebug. Aug 14 2020, 1:44 PM

Change 608973 abandoned by Dave Pifke:
[operations/puppet@production] [WIP] webperf: Enable prometheus-apache-exporter

Reason:
The information from this exporter is useful for some forms of troubleshooting, but not for the stated goal of generating alerts if one of the performance.wikimedia.org backends starts throwing errors. I therefore don't see a lot of benefit in collecting these metrics at the moment.

https://gerrit.wikimedia.org/r/608973

Change 612394 abandoned by Dave Pifke:
[performance/debs/xhgui@wmf] Add Prometheus exporter

Reason:
Merged upstream.

https://gerrit.wikimedia.org/r/612394

Change 622447 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] xhgui: scrape Prometheus exporter

https://gerrit.wikimedia.org/r/622447

I tried out the endpoint in prod and found that the request timed out, I think at a proxy somewhere ("upstream request timeout"). I don't think a timeout from the XHGui Apache or the Webperf Apache looks like this, so perhaps it comes from ATS/Envoy middleware higher up in the stack.

Anyway, if the responses don't time out internally, it might be acceptable for them to occasionally take 60 seconds to respond, I suppose. I don't know how this affects Prometheus scraping or what the tolerance is there.

I checked Tendril and did find the SQL query there:

SELECT COUNT(*) AS profiles, MAX(request_ts) AS latest, SUM(LENGTH(profile)) AS bytes FROM xhgui /* db1107 xhgui 65s */

Depending on whether optimising the query is trivial, we might want to fence this off from the public Apache, for now at least, e.g. by denying /metrics* entirely in the Webperf Apache config. This matters especially since the query targets the m2 master database host. That makes me wonder: would xhgui be able to run from an m2-replica? That might be worth exploring as well, since afaik our xhgui setup is in theory meant to be read-only anyhow.
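On the scraping tolerance: Prometheus applies a per-scrape timeout (`scrape_timeout`, 10 seconds by default), so a query that takes around 65 seconds would cause scrapes to fail unless that timeout were raised. One possible mitigation, which is not something proposed in this task and is shown only as an assumption-laden sketch, is to cache the query result so the slow query runs at most once per interval. `CachedStats` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class CachedStats(Generic[T]):
    """Serve a cached result, re-running the slow fetch at most once per TTL."""

    def __init__(
        self,
        fetch: Callable[[], T],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # e.g. a function that runs the slow SQL query
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._value: Optional[T] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> T:
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            # Only this call pays the cost of the slow query.
            self._value = self._fetch()
            self._fetched_at = now
        return self._value  # type: ignore[return-value]
```

Note that the first request, and each request after the TTL expires, would still block for the full duration of the query, so this reduces load on the database host rather than removing the timeout problem.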

Krinkle moved this task from Doing (old) to Doing: Goals on the Performance-Team board. Mar 2 2021, 9:08 PM

• dpifke moved this task from Doing: Goals to Backlog: Maintenance, non-prioritized on the Performance-Team board. Oct 18 2021, 6:42 PM

Removing inactive task assignee (please do so as part of offboarding processes).

Krinkle removed a project: Performance-Team. Aug 17 2023, 1:41 PM

Krinkle removed a subscriber: • dpifke.
