Automated uploads of minimal & comprehensible timeseries metrics for statuspage display
Closed, Resolved (Public)

Description

There's a great benefit in displaying a few high-level timeseries metrics on our status page: it means that semi-technical users can tell that something is amiss even before SRE has had a chance to manually update the page with an incident notification.

After discussion we selected four metrics:

  • Total edge HTTP requests per second
  • Number of appserver error (5xx) responses per second
  • Average* latency for all appserver requests
  • Number of successfully saved wiki edits

These metrics can be summarized in a fairly self-explanatory way, will reflect many different kinds of possible outages, and are also things that most users are likely to care about.

In the future we might also want to include RUM latency data from a broad swath of users, possibly using it to replace appserver latency, possibly adding it as another graph.

Another likely addition is the rate of incoming NEL reports of certain types (and with a low age field). In recent networking-related outages they've been a good signal of trouble and we page on them now, so why not present cleaned-up data to users as well?

However, we should also strive to keep the number of metrics to a minimum: the page should be readable at a glance and not overwhelming. I think six plots is an absolute upper bound.

The plan is to upload these metrics with a simple Python utility we wrote, named statograph, which runs on both alerting hosts via a systemd timer. It will query Prometheus & Graphite and push the data to the Statuspage.io API.
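
As a rough illustration of the query-then-push flow, the sketch below turns a Prometheus `query_range` JSON response into one payload per datapoint. The function names and the payload shape (`{"data": {"timestamp": ..., "value": ...}}`) are assumptions made for this example and are not taken from statograph itself; check the payload against the Statuspage API documentation before relying on it.

```python
import math
from typing import Any, Dict, List, Tuple


def points_from_prometheus(response: Dict[str, Any]) -> List[Tuple[int, float]]:
    """Extract (unix_timestamp, value) pairs from a query_range response.

    Assumes the query collapses to exactly one series, e.g. via sum().
    """
    if response.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {response.get('error')}")
    data = response["data"]
    if data["resultType"] != "matrix":
        raise ValueError(f"expected a matrix result, got {data['resultType']}")
    series = data["result"]
    if len(series) != 1:
        raise ValueError(f"expected exactly one series, got {len(series)}")
    points = []
    for timestamp, raw_value in series[0]["values"]:
        value = float(raw_value)  # Prometheus encodes sample values as strings
        if math.isfinite(value):  # skip NaN/Inf samples instead of uploading them
            points.append((int(timestamp), value))
    return points


def statuspage_payloads(points: List[Tuple[int, float]]) -> List[Dict[str, Any]]:
    """Build one request body per datapoint (assumed payload shape)."""
    return [{"data": {"timestamp": ts, "value": value}} for ts, value in points]


# Hand-written sample response, not real production data.
example_response = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {},
                "values": [
                    [1624642380, "120000.5"],
                    [1624642440, "NaN"],
                    [1624642500, "119000"],
                ],
            }
        ],
    },
}
payloads = statuspage_payloads(points_from_prometheus(example_response))
```

Keeping the parsing separate from the HTTP calls means the conversion can be unit-tested without network access; the actual upload would be an authenticated POST per metric.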

(*: Yes, average latency; percentiles are tricky to explain to the uninitiated, and I believe that skewing displayed data towards the long tail is actually beneficial in this case.)
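
To make the footnote's point concrete, here is a small sketch with made-up latency samples (not production data) in which the slow tail doubles from 5% to 10% of requests. The mean moves noticeably, while the median does not move at all.

```python
import statistics
from typing import Dict, List


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return the mean and median of a list of request latencies."""
    return {
        "mean": statistics.mean(latencies_ms),
        "median": statistics.median(latencies_ms),
    }


# Hypothetical samples: most requests take 40 ms, a slow tail takes 2000 ms.
healthy = [40] * 95 + [2000] * 5
degraded = [40] * 90 + [2000] * 10

healthy_summary = summarize(healthy)    # mean 138, median 40
degraded_summary = summarize(degraded)  # mean 236, median 40
```

A graph of the median would stay flat through this degradation, whereas the average visibly rises, which is the behaviour wanted on a status page.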

  • Code review of statograph
  • Puppetization and deployment as a systemd timer on alert1001/alert2001 hosts
  • Basic alerting provided by systemd unit failure (from statograph's exit code)
  • Add feature to statograph upload_metrics to export a Prometheus node_exporter textfile of the most_recent_data_at timestamp for each Metric, plus some basic IRC alerting on those timestamps being too far behind current time
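
The textfile export in the last item could look roughly like the sketch below. The metric name, label name, and helper functions are hypothetical, and the staleness check only mirrors in Python what an alerting rule would evaluate on the Prometheus side. Label values are assumed to contain no quotes or backslashes.

```python
import os
import tempfile
import time
from typing import Dict, List, Optional

# Hypothetical metric name; statograph may use a different one.
METRIC_NAME = "statograph_most_recent_data_timestamp_seconds"


def render_textfile(latest: Dict[str, float]) -> str:
    """Render one gauge sample per status page metric in the text exposition format."""
    lines = [
        f"# HELP {METRIC_NAME} Unix time of the newest datapoint per metric.",
        f"# TYPE {METRIC_NAME} gauge",
    ]
    for name, timestamp in sorted(latest.items()):
        lines.append(f'{METRIC_NAME}{{metric="{name}"}} {timestamp}')
    return "\n".join(lines) + "\n"


def write_atomically(path: str, content: str) -> None:
    """Write via a temp file and rename so a scrape never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter must read it
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def stale_metrics(
    latest: Dict[str, float], max_age_seconds: float, now: Optional[float] = None
) -> List[str]:
    """Names of metrics whose newest datapoint is older than max_age_seconds."""
    current = time.time() if now is None else now
    return sorted(n for n, ts in latest.items() if current - ts > max_age_seconds)
```

The temporary file uses a `.tmp` suffix because the node_exporter textfile collector only reads files ending in `.prom`, so the half-written file is never scraped.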

Event Timeline

Restricted Application added a subscriber: Aklapper. Jun 25 2021, 5:33 PM

Change 701597 had a related patch set uploaded (by CDanis; author: CDanis):

[integration/config@master] [operations/software/statograph] Configure tox

https://gerrit.wikimedia.org/r/701597

Change 701599 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/software/statograph@master] statograph: Initial commit

https://gerrit.wikimedia.org/r/701599

Change 701597 merged by jenkins-bot:

[integration/config@master] [operations/software/statograph] Configure tox

https://gerrit.wikimedia.org/r/701597

As a fun curiosity, here's today's datacenter switchover as shown by Statuspage:

image.png (1×824 px, 69 KB)

To aid in the Puppetization and deployment: P16741 is a private-to-SRE paste that contains a configuration file suitable for use in production.

Change 701599 merged by CDanis:

[operations/software/statograph@master] statograph: Initial commit

https://gerrit.wikimedia.org/r/701599

Change 702187 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/software/statograph@master] Report upload_metrics failures in our exit code

https://gerrit.wikimedia.org/r/702187

Change 702187 merged by jenkins-bot:

[operations/software/statograph@master] Report upload_metrics failures in our exit code

https://gerrit.wikimedia.org/r/702187

herron triaged this task as Medium priority. Jul 1 2021, 5:29 PM

Change 704133 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/software/statograph@master] statograph: add debian folder allowing us to package

https://gerrit.wikimedia.org/r/704133

Change 704314 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/software/statograph@master] setup.py: create an entry point so we have an executable script

https://gerrit.wikimedia.org/r/704314

Change 704314 merged by jenkins-bot:

[operations/software/statograph@master] setup.py: create an entry point so we have an executable script

https://gerrit.wikimedia.org/r/704314

Change 704133 merged by jenkins-bot:

[operations/software/statograph@master] statograph: add debian folder allowing us to package

https://gerrit.wikimedia.org/r/704133

Change 708095 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] First attempt to create puppet class for statograph service, which exports statistics about WMF infra to statuspage.io for external visibility. Info: https://gerrit.wikimedia.org/r/admin/repos/operations/software/statograph

https://gerrit.wikimedia.org/r/708095

Change 708095 merged by Cathal Mooney:

[operations/puppet@production] O:alerting_host: create puppet class for statograph service.

https://gerrit.wikimedia.org/r/708095

Change 709023 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/puppet@production] O:alerting_host: fix job command for statograph systemd timer

https://gerrit.wikimedia.org/r/709023

Change 709023 merged by Cathal Mooney:

[operations/puppet@production] O:alerting_host: fix job command for statograph systemd timer

https://gerrit.wikimedia.org/r/709023

CDanis updated the task description.

Change 731171 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Add rate of high-signal NELs as a status page metric

https://gerrit.wikimedia.org/r/731171

Change 731171 merged by CDanis:

[operations/puppet@production] Add rate of high-signal NELs as a status page metric

https://gerrit.wikimedia.org/r/731171

Change 770944 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] Cross-ref Grafana dashboard in statograph hiera

https://gerrit.wikimedia.org/r/770944

Change 770944 merged by CDanis:

[operations/puppet@production] Cross-ref Grafana dashboard in statograph hiera

https://gerrit.wikimedia.org/r/770944

@CDanis: Hi, all related patches in Gerrit have been merged. Can this task be resolved (via Add Action... → Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee. Thanks in advance!

@CDanis: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… → Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!

CDanis claimed this task.

In practice the very basic alerting from systemd unit failures has been enough for every statograph issue so far.
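
For reference, the exit-code pattern this relies on can be sketched as follows. The `run_uploads` helper and its arguments are hypothetical and not statograph's actual interface: every metric is attempted, failures are logged, and a non-zero return value marks the run as failed.

```python
import logging
from typing import Callable, Dict


def run_uploads(uploaders: Dict[str, Callable[[], None]]) -> int:
    """Run every upload, log failures, and return a process exit code."""
    failures = []
    for name, upload in uploaders.items():
        try:
            upload()
        except Exception:
            # Keep going so one broken metric does not block the others.
            logging.exception("upload failed for metric %s", name)
            failures.append(name)
    return 1 if failures else 0
```

Passing the result to `sys.exit()` in the entry point makes systemd mark the unit as failed whenever any upload raised, and that unit failure is what the existing alerting picks up.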