
write up impact estimation procedure
Open, Medium, Public

Description

There are two useful high-level metrics for measuring the impact of an outage/incident:

  1. Number of requests 'missing' (easy to measure over the short term by looking at 'holes'/'divots' in the frontend traffic graphs). Unit: generally a whole number of millions of requests.
  2. Fraction of traffic affected (the number of missing requests as the numerator, and the expected number of requests as the denominator), e.g. "about 20%".

An example of both of these being reported in a document is 20200211-caching-proxies.

#1 gives a single number that scales with both the severity and the duration of the outage. #2 gives an idea of the overall severity.

In particular, #1 seems useful because it's easy and meaningful to sum up that number across different incidents. Imagine doing the following:

  • estimate queries-lost for each incident
  • attach some 'tags' to each incident: which pieces of tech or parts of the stack were involved, high-level causes, important-but-not-urgent projects we've been putting off that would have helped mitigate or have prevented the incident, etc
  • compute per-tag sums for the past quarter / the past year

This might help us get an idea of what we 'should' be working on.
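The per-tag summing above could look roughly like the following Python sketch. The incident names, tags, and numbers are made up purely for illustration; nothing here reflects real incident data.

```python
from collections import defaultdict

# Hypothetical incident records: (name, estimated requests lost, tags).
# All values are invented for illustration only.
incidents = [
    ("incident-a", 12_000_000, {"caching", "config-change"}),
    ("incident-b", 3_000_000, {"dns"}),
    ("incident-c", 5_000_000, {"caching", "hardware"}),
]


def lost_per_tag(incidents):
    """Sum the estimated requests lost for every tag across all incidents."""
    totals = defaultdict(int)
    for _name, lost, tags in incidents:
        for tag in tags:
            # An incident counts in full toward each of its tags.
            totals[tag] += lost
    return dict(totals)


totals = lost_per_tag(incidents)
# Print tags from most to least requests lost.
for tag, lost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag}: {lost:,}")
```

Note that an incident with several tags contributes its full count to each of them, so the per-tag sums are meant for ranking tags against each other and will not add up to the overall total.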

This task is to refine and write up a procedure to easily compute 'queries lost during an outage' from Prometheus data.
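As a possible starting point, here is a minimal Python sketch of both metrics. It assumes the observed request rates for the outage window have already been exported from Prometheus as evenly spaced samples, and that a single expected rate (for example, taken from the same window a week earlier) is a good enough baseline. The function name, parameters, and sample values are all hypothetical.

```python
def estimate_impact(observed_rates, expected_rate, step_seconds):
    """Estimate requests lost and the fraction of traffic affected.

    observed_rates: request rates (req/s), one sample per step in the outage window.
    expected_rate: assumed baseline rate (req/s) had there been no outage.
    step_seconds: spacing between samples, in seconds.
    """
    # Requests we would have expected over the whole window.
    expected = expected_rate * step_seconds * len(observed_rates)
    # Sum the shortfall per step; samples above the baseline count as zero loss.
    missing = sum(
        max(expected_rate - rate, 0) * step_seconds for rate in observed_rates
    )
    fraction = missing / expected if expected else 0.0
    return missing, fraction


# Made-up samples: a dip from 1000 req/s, sampled once a minute.
missing, fraction = estimate_impact([1000, 600, 500, 900], 1000, 60)
print(f"requests lost: {missing}, fraction affected: {fraction:.0%}")
```

A flat baseline is the simplest choice; traffic with a strong daily pattern would probably need a per-sample expected rate instead.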

Event Timeline

CDanis triaged this task as Medium priority. Mar 3 2020, 7:46 AM
CDanis updated the task description.
Aklapper removed a subscriber: crusnov.

Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 2022.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!