Automated uploads of minimal & comprehensible timeseries metrics for statuspage display
Closed, Resolved · Public

Assigned To

Authored By

	CDanis
	Jun 25 2021, 5:32 PM

Description

There's a great benefit in displaying a few high-level timeseries metrics on our status page: it means that semi-technical users can tell that something is amiss even before SRE has had a chance to manually update the page with an incident notification.

After discussion we selected four metrics:

Total edge HTTP requests per second
Number of appserver error (5xx) responses per second
Average* latency for all appserver requests
Number of successfully-saved wiki edits

These metrics can be summarized in a fairly self-explanatory way, will reflect many different kinds of possible outages, and are also things that most users are likely to care about.

In the future we might also want to include RUM latency data from a broad swath of users, possibly using it to replace appserver latency, possibly adding it as another graph.

Another likely addition is the rate of incoming NEL reports of certain types (and with a low age field). In recent networking-related outages they've been a good signal of trouble and we page on them now, so why not present cleaned-up data to users as well?

However, we should also strive to keep the number of metrics to a minimum: the page should be readable at a glance and not overwhelming. I think six plots is an absolute upper bound.

The plan is to upload these metrics using a simple Python utility we wrote, named statograph, running on both alerting hosts via a systemd timer. It will query Prometheus & Graphite and push data to the Statuspage.io API.
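To make the query-and-push flow concrete, here is a minimal sketch of one cycle for a Prometheus-backed metric. It is not statograph's actual code: the function names, the example hostname, and the way most_recent_data_at is used to skip already-uploaded points are all assumptions for illustration. The Statuspage.io endpoint path and OAuth header follow the public API as I understand it and should be checked against the current API reference.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholders; real values would come from a config file.
PROMETHEUS_URL = "http://prometheus.example.org"
STATUSPAGE_API = "https://api.statuspage.io/v1"


def prometheus_range_url(base_url, query, start, end, step=60):
    """Build a Prometheus /api/v1/query_range URL for one PromQL expression."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url}/api/v1/query_range?{params}"


def datapoints_from_prometheus(response):
    """Flatten a single-series 'matrix' result into (timestamp, value) pairs."""
    result = response["data"]["result"]
    if not result:
        return []
    # Prometheus returns each sample as [unix_time, "value-as-string"].
    return [(int(ts), float(value)) for ts, value in result[0]["values"]]


def newer_than(points, most_recent_data_at):
    """Keep only points newer than what the status page already has."""
    return [(ts, value) for ts, value in points if ts > most_recent_data_at]


def statuspage_request(page_id, metric_id, api_key, timestamp, value):
    """Build (but do not send) a POST that submits one datapoint."""
    body = json.dumps({"data": {"timestamp": timestamp, "value": value}})
    return urllib.request.Request(
        f"{STATUSPAGE_API}/pages/{page_id}/metrics/{metric_id}/data",
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"OAuth {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A real run would fetch prometheus_range_url(...) with urllib.request.urlopen, pass the decoded JSON through datapoints_from_prometheus and newer_than, and then send one statuspage_request per remaining point. Keeping the network calls out of the helpers makes them easy to unit test.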

(*: Yes, average latency; percentiles are tricky to explain to the uninitiated, and I believe that skewing displayed data towards the long tail is actually beneficial in this case.)
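As a small worked example of the point about the long tail, the sketch below compares the mean and the median of a made-up latency sample. The numbers are hypothetical and chosen only to show that a handful of very slow requests moves the average visibly while the median stays put.

```python
import statistics


def summarize(latencies_ms):
    """Return the mean and median of a list of request latencies."""
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
    }


# Hypothetical sample: 95 fast requests and 5 very slow ones.
healthy = [100] * 100
degraded = [100] * 95 + [5000] * 5

print(summarize(healthy))
print(summarize(degraded))
```

In the degraded sample the median is still 100 ms, but the mean rises to 345 ms, so a plot of the average shows the problem even though only 5% of requests are affected.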

Code review of statograph
Puppetization and deployment as a systemd timer on alert1001/alert2001 hosts
Basic alerting provided by systemd unit failure (from statograph's exit code)
Add feature to statograph upload_metrics to export a Prometheus node_exporter textfile of the most_recent_data_at timestamp for each Metric, plus some basic IRC alerting on those timestamps being too far behind current time
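The last two items could look roughly like the sketch below. It is an illustration under assumptions, not the implemented feature: the Prometheus metric name statograph_most_recent_data_at_seconds, the label name, and the helper functions are hypothetical. The output follows the Prometheus text exposition format that the node_exporter textfile collector reads from files ending in .prom.

```python
import os
import tempfile

# Hypothetical metric name; nothing in this task fixes the real one.
GAUGE = "statograph_most_recent_data_at_seconds"


def render_textfile(most_recent_data_at):
    """Render {metric name: unix timestamp} in Prometheus text format.

    Assumes metric names need no label-value escaping.
    """
    lines = [
        f"# HELP {GAUGE} Unix time of the newest datapoint uploaded.",
        f"# TYPE {GAUGE} gauge",
    ]
    for name, ts in sorted(most_recent_data_at.items()):
        lines.append(f'{GAUGE}{{metric="{name}"}} {ts}')
    return "\n".join(lines) + "\n"


def write_atomically(path, content):
    """Write via a temp file and rename so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # mkstemp creates the file as 0600; make it readable by node_exporter.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


def exit_code(upload_results):
    """Return 0 only if every upload succeeded, so systemd marks failures."""
    return 0 if all(upload_results.values()) else 1
```

A script would end with sys.exit(exit_code(results)), which is what lets the systemd unit failure alert fire. With the textfile in place, an alert on staleness could use an expression along the lines of time() - statograph_most_recent_data_at_seconds > 600, where the threshold is again an assumption.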

Details

Subject	Repo	Branch	Lines +/-
Cross-ref Grafana dashboard in statograph hiera	operations/puppet	production	+3 -0
Add rate of high-signal NELs as a status page metric	operations/puppet	production	+7 -0
O:alerting_host: fix job command for statograph systemd timer	operations/puppet	production	+2 -2
O:alerting_host: create puppet class for statograph service.	operations/puppet	production	+142 -0
statograph: add debian folder allowing us to package	operations/software/statograph	master	+40 -0
setup.py: create an entry point so we have an executable script	operations/software/statograph	master	+5 -0
Report upload_metrics failures in our exit code	operations/software/statograph	master	+4 -0
statograph: Initial commit	operations/software/statograph	master	+1 K -0
[operations/software/statograph] Configure tox	integration/config	master	+4 -0


Related Objects

Status	Subtype	Assigned	Task
Open	Feature	None	T22079 Provide a better means of status update delivery in WMF error message
Open		None	T202061 Implement an accurate and easy to understand status page for all wikis
Resolved		CDanis	T285569 Automated uploads of minimal & comprehensible timeseries metrics for statuspage display
Resolved		cmooney	T290425 statograph_post service fail on alert hosts
Resolved		CDanis	T298619 "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC

Event Timeline

CDanis created this task. Jun 25 2021, 5:32 PM

Restricted Application added a subscriber: Aklapper. Jun 25 2021, 5:33 PM

Change 701597 had a related patch set uploaded (by CDanis; author: CDanis):

[integration/config@master] [operations/software/statograph] Configure tox

https://gerrit.wikimedia.org/r/701597

gerritbot added a project: Patch-For-Review. Jun 25 2021, 5:34 PM

CDanis updated the task description. Jun 25 2021, 5:37 PM

CDanis added a project: SRE-OnFire. Jun 25 2021, 5:39 PM

Change 701599 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/software/statograph@master] statograph: Initial commit

https://gerrit.wikimedia.org/r/701599

CDanis updated the task description. Jun 25 2021, 6:02 PM

Change 701597 merged by jenkins-bot:

[integration/config@master] [operations/software/statograph] Configure tox

https://gerrit.wikimedia.org/r/701597

CDanis updated the task description. Jun 28 2021, 8:28 PM

lmata moved this task from Inbox to Radar on the observability board. Jun 29 2021, 2:54 PM

cmooney subscribed. Jun 29 2021, 3:57 PM

As a fun curiosity, here's today's datacenter switchover as shown by Statuspage:


To aid in the Puppetization and deployment: P16741 is a private-to-SRE paste that contains a configuration file suitable for use in production.

Change 701599 merged by CDanis:

[operations/software/statograph@master] statograph: Initial commit

https://gerrit.wikimedia.org/r/701599

Change 702187 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/software/statograph@master] Report upload_metrics failures in our exit code

https://gerrit.wikimedia.org/r/702187

Change 702187 merged by jenkins-bot:

[operations/software/statograph@master] Report upload_metrics failures in our exit code

https://gerrit.wikimedia.org/r/702187

herron triaged this task as Medium priority. Jul 1 2021, 5:29 PM

jbond added a project: User-jbond. Jul 12 2021, 11:15 AM

Change 704133 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/software/statograph@master] statograph: add debian folder allowing us to package

https://gerrit.wikimedia.org/r/704133

MoritzMuehlenhoff subscribed. Jul 13 2021, 10:56 AM

Change 704314 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/software/statograph@master] setup.py: create an entry point so we have an executable script

https://gerrit.wikimedia.org/r/704314

jbond moved this task from Unsorted 💣 to Active 🚁 on the User-jbond board. Jul 13 2021, 1:55 PM

Change 704314 merged by jenkins-bot:

[operations/software/statograph@master] setup.py: create an entry point so we have an executable script

https://gerrit.wikimedia.org/r/704314

Change 704133 merged by jenkins-bot:

[operations/software/statograph@master] statograph: add debian folder allowing us to package

https://gerrit.wikimedia.org/r/704133

Change 708095 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] First attempt to create puppet class for statograph service, which exports statistics about WMF infra to statuspage.io for external visibility. Info: https://gerrit.wikimedia.org/r/admin/repos/operations/software/statograph

https://gerrit.wikimedia.org/r/708095

Change 708095 merged by Cathal Mooney:

[operations/puppet@production] O:alerting_host: create puppet class for statograph service.

https://gerrit.wikimedia.org/r/708095

Change 709023 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] O:alerting_host: fix job command for statograph systemd timer

https://gerrit.wikimedia.org/r/709023

Change 709023 merged by Cathal Mooney:

[operations/puppet@production] O:alerting_host: fix job command for statograph systemd timer

https://gerrit.wikimedia.org/r/709023

jbond closed subtask T290425: statograph_post service fail on alert hosts as Resolved. Sep 8 2021, 11:14 AM

CDanis claimed this task. Oct 14 2021, 8:11 PM

CDanis updated the task description.

Change 731171 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Add rate of high-signal NELs as a status page metric

https://gerrit.wikimedia.org/r/731171

Change 731171 merged by CDanis:

[operations/puppet@production] Add rate of high-signal NELs as a status page metric

https://gerrit.wikimedia.org/r/731171

jbond moved this task from Active 🚁 to Watching 👀 on the User-jbond board. Nov 25 2021, 2:34 PM

CDanis added a subtask: T298619: "User-reported connectivity errors" (NEL data) not being posted to statuspage since 1 Jan 00:00 UTC. Jan 24 2022, 3:53 PM

Change 770944 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Cross-ref Grafana dashboard in statograph hiera

https://gerrit.wikimedia.org/r/770944

Change 770944 merged by CDanis:

[operations/puppet@production] Cross-ref Grafana dashboard in statograph hiera

https://gerrit.wikimedia.org/r/770944

@CDanis: Hi, all related patches in Gerrit have been merged. Can this task be resolved (via Add Action... → Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee. Thanks in advance!

@CDanis: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

In practice the very basic alerting from systemd unit failures has been enough for every statograph issue so far.

	F34531856: image.png
	Jun 29 2021, 4:57 PM
