write up impact estimation procedure
Open, Medium, Public

Assigned To

None

Authored By

	CDanis
	Mar 3 2020, 7:30 AM

Description

There are two useful high-level metrics for measuring the impact of an outage/incident:

1. Number of requests 'missing'. This is easy to measure over the short term by looking at 'holes'/'divots' in the frontend traffic graphs. Unit: generally a whole number of millions of requests.
2. Fraction of traffic affected. Take the number of missing requests as the numerator and the expected number of requests as the denominator, e.g. "about 20%".

An example of both of these being reported in a document is 20200211-caching-proxies.

#1 gives a single number that scales with both severity and duration of outage. #2 gives you an idea of the overall severity.
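As a rough illustration of how the two metrics relate, here is a minimal Python sketch. It assumes you already have an observed request-rate series and an expected (baseline) series sampled at the same timestamps. The function name `estimate_impact`, the sample numbers, and the choice to ignore samples above the baseline are all assumptions made for this example, not part of any agreed procedure.

```python
def estimate_impact(observed_rps, expected_rps, step_seconds):
    """Return (missing_requests, fraction_affected) for one outage window.

    observed_rps and expected_rps are equal-length sequences of request
    rates (requests per second), one sample per step_seconds interval.
    """
    if len(observed_rps) != len(expected_rps):
        raise ValueError("series must have the same length")
    # Assumption: only shortfalls count; samples above the baseline
    # do not offset the loss.
    missing = sum(
        max(e - o, 0.0) * step_seconds
        for o, e in zip(observed_rps, expected_rps)
    )
    expected_total = sum(e * step_seconds for e in expected_rps)
    fraction = missing / expected_total if expected_total else 0.0
    return missing, fraction


# Made-up example: a 5-minute window sampled every 60 s with a dip in the middle.
observed = [100_000, 60_000, 50_000, 80_000, 100_000]
expected = [100_000] * 5
missing, fraction = estimate_impact(observed, expected, step_seconds=60)
print(f"~{missing / 1e6:.1f}M requests missing, about {fraction:.0%} of expected traffic")
```

The first value grows with both the depth and the length of the dip, while the second only reflects how much of the expected traffic in the window was lost.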

In particular, #1 seems useful because it's easy and meaningful to sum up that number across different incidents. Imagine doing the following:

1. Estimate queries lost for each incident.
2. Attach some 'tags' to each incident: which pieces of tech or parts of the stack were involved, high-level causes, important-but-not-urgent projects we've been putting off that would have helped mitigate or prevent the incident, etc.
3. Compute per-tag sums for the past quarter / the past year.

This might help us get an idea of what we 'should' be working on.
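The per-tag summation could be sketched as follows. The incident records, tag names, and the `per_tag_totals` helper are hypothetical and only illustrate the bookkeeping; the numbers are made up.

```python
from collections import defaultdict


def per_tag_totals(incidents):
    """Sum estimated requests lost per tag across a list of incidents."""
    totals = defaultdict(float)
    for incident in incidents:
        for tag in incident["tags"]:
            totals[tag] += incident["requests_lost"]
    return dict(totals)


# Hypothetical incidents for one quarter.
incidents = [
    {"id": "incident-a", "requests_lost": 6_600_000, "tags": ["caching", "config-change"]},
    {"id": "incident-b", "requests_lost": 2_000_000, "tags": ["network"]},
    {"id": "incident-c", "requests_lost": 1_400_000, "tags": ["caching"]},
]

# Print tags from largest to smallest total.
for tag, total in sorted(per_tag_totals(incidents).items(), key=lambda kv: -kv[1]):
    print(f"{tag}: ~{total / 1e6:.1f}M requests lost")
```

Note that an incident carrying several tags counts toward each of them, so the per-tag sums are not expected to add up to the overall total.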

This task is to refine and write up a procedure to easily compute 'queries lost during an outage' from Prometheus data.
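One possible starting point, offered only as a sketch: run two range queries over the outage window against the Prometheus HTTP API (`/api/v1/query_range`), one for the observed request rate and one for the same expression shifted back with `offset 1w` as a stand-in for expected traffic, then integrate the shortfall. The metric name `frontend_requests_total`, the 5-minute rate window, the one-week baseline, and the helper names below are assumptions, not an established procedure. The code works on already-fetched JSON responses so that it does not depend on any particular HTTP client.

```python
# Hypothetical PromQL expressions; the metric name is a placeholder.
OBSERVED_QUERY = "sum(rate(frontend_requests_total[5m]))"
BASELINE_QUERY = "sum(rate(frontend_requests_total[5m] offset 1w))"


def samples(response):
    """Extract {timestamp: value} from a query_range 'matrix' response
    that contains exactly one series."""
    result = response["data"]["result"]
    if len(result) != 1:
        raise ValueError("expected exactly one series; aggregate with sum() in the query")
    # Prometheus encodes each sample as [unix_timestamp, "value-as-string"].
    return {float(ts): float(value) for ts, value in result[0]["values"]}


def requests_lost(observed_resp, baseline_resp, step_seconds):
    """Integrate the shortfall of observed vs. baseline rate over the window."""
    observed = samples(observed_resp)
    baseline = samples(baseline_resp)
    lost = 0.0
    for ts, expected in baseline.items():
        if ts not in observed:
            continue  # skip timestamps missing from the observed series
        # Assumption: samples above the baseline do not offset the loss.
        lost += max(expected - observed[ts], 0.0) * step_seconds
    return lost


def fake_response(values):
    """Build a minimal response of the same shape, for illustration only."""
    return {"status": "success",
            "data": {"resultType": "matrix",
                     "result": [{"metric": {}, "values": values}]}}


observed_resp = fake_response([[1000, "100"], [1060, "40"], [1120, "100"]])
baseline_resp = fake_response([[1000, "100"], [1060, "100"], [1120, "100"]])
print(requests_lost(observed_resp, baseline_resp, step_seconds=60))
```

Because `offset` shifts the data rather than the evaluation time, both queries return samples at the same timestamps when run with the same start, end, and step, which is what lets the two series be compared point by point.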

Event Timeline

CDanis created this task. Mar 3 2020, 7:30 AM

Restricted Application added a subscriber: Aklapper. Mar 3 2020, 7:30 AM

CDanis updated the task description. Mar 3 2020, 7:31 AM

CDanis triaged this task as Medium priority. Mar 3 2020, 7:46 AM

CDanis updated the task description.

RLazarus subscribed. Mar 3 2020, 2:30 PM

crusnov subscribed. Mar 4 2020, 5:52 PM

ayounsi subscribed. Mar 4 2020, 5:59 PM

CDanis added a project: SRE-OnFire. Mar 11 2020, 1:02 PM

jbond subscribed. Mar 13 2020, 2:25 PM

lmata added a project: Observability-Metrics. Apr 20 2022, 11:31 AM

Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 2022.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!
