There are two useful high-level metrics for measuring the impact of an outage/incident:
- number of requests 'missing' (easy to measure over the short term by looking at 'holes'/'divots' in the frontend traffic graphs). Unit: generally a whole number of millions of requests.
- fraction of traffic affected (the number of missing requests divided by the expected number of requests), e.g. "about 20%".
An example of both of these being reported in a document is 20200211-caching-proxies.
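As a rough illustration of how the two numbers relate, here is a minimal Python sketch. It assumes we already have two aligned series of request rates for the outage window: what we expected to serve and what we actually served. The function name `outage_impact`, its parameters, and the sample numbers are all made up for illustration. Clamping negative differences to zero is one possible choice, not an established convention.

```python
def outage_impact(expected_rps, actual_rps, step_seconds):
    """Return (missing_requests, fraction_affected) for one outage window.

    expected_rps and actual_rps are request rates (requests per second),
    one sample per step, aligned in time. Both inputs are hypothetical.
    """
    if len(expected_rps) != len(actual_rps):
        raise ValueError("series must have the same length")
    expected_total = sum(expected_rps) * step_seconds
    # Clamp at zero so samples above the baseline do not cancel out the hole.
    missing = sum(
        max(e - a, 0.0) for e, a in zip(expected_rps, actual_rps)
    ) * step_seconds
    fraction = missing / expected_total if expected_total else 0.0
    return missing, fraction


# Made-up data: five one-minute samples against a flat 100k rps baseline.
expected = [100_000.0] * 5
actual = [100_000.0, 60_000.0, 50_000.0, 90_000.0, 100_000.0]
missing, fraction = outage_impact(expected, actual, step_seconds=60)
print(f"{missing / 1e6:.1f}M requests missing, {fraction:.0%} of expected traffic")
# prints: 6.0M requests missing, 20% of expected traffic
```

Note that the fraction depends on how wide a window is chosen for the denominator, so it is worth recording the window alongside the number.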
The first metric (missing requests) gives a single number that scales with both the severity and the duration of the outage. The second (fraction of traffic affected) gives an idea of the overall severity.
In particular, the missing-requests number seems useful because it is easy and meaningful to sum it across different incidents. Imagine doing the following:
- estimate queries-lost for each incident
- attach some 'tags' to each incident: which pieces of tech or parts of the stack were involved, high-level causes, important-but-not-urgent projects we've been putting off that would have helped mitigate or prevent the incident, etc.
- compute per-tag sums for the past quarter / the past year
This might help us get an idea of what we 'should' be working on.
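The per-tag sums described above could be sketched roughly as follows. The incident records, tag names, and numbers are invented for illustration, and `queries_lost_by_tag` is a hypothetical helper. This assumes each incident counts fully toward every tag attached to it.

```python
from collections import defaultdict

# Hypothetical incident records; names, tags and numbers are made up.
incidents = [
    {"name": "incident-a", "queries_lost": 6_000_000, "tags": ["caching", "config-change"]},
    {"name": "incident-b", "queries_lost": 2_000_000, "tags": ["dns"]},
    {"name": "incident-c", "queries_lost": 11_000_000, "tags": ["caching", "capacity"]},
]


def queries_lost_by_tag(incidents):
    """Sum queries lost per tag across a list of incident records."""
    totals = defaultdict(int)
    for incident in incidents:
        # An incident with several tags counts fully toward each of them,
        # so the per-tag sums can add up to more than the grand total.
        for tag in set(incident["tags"]):
            totals[tag] += incident["queries_lost"]
    return dict(totals)


totals = queries_lost_by_tag(incidents)
# Largest contributors first.
for tag, lost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag:15s} {lost / 1e6:6.1f}M")
```

Restricting the input list to incidents from the past quarter or year before calling the helper would give the per-period sums.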
This task is to refine and write up a procedure to easily compute 'queries lost during an outage' from Prometheus data.
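As a starting point for discussion rather than the final procedure, here is one possible sketch in Python. It uses Prometheus' HTTP API endpoint `/api/v1/query_range` and compares the outage window against the same window one week earlier. The week-earlier baseline is an assumption of this sketch, as are the helper names and the metric name `frontend_requests_total` mentioned in the comment. The demo at the bottom runs on canned responses so it does not need a live server.

```python
import json
import urllib.parse
import urllib.request


def query_range(base_url, promql, start, end, step):
    """Call Prometheus' /api/v1/query_range and return the decoded JSON.

    A hypothetical promql value: 'sum(rate(frontend_requests_total[5m]))'.
    """
    params = urllib.parse.urlencode(
        {"query": promql, "start": start, "end": end, "step": step}
    )
    with urllib.request.urlopen(f"{base_url}/api/v1/query_range?{params}") as resp:
        return json.load(resp)


def series_values(response):
    """Extract {timestamp: value} from a single-series matrix result."""
    result = response["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected exactly one series, got {len(result)}")
    return {float(ts): float(value) for ts, value in result[0]["values"]}


def queries_lost(outage, baseline, offset, step):
    """Sum the shortfall of outage rates against the baseline.

    outage and baseline map timestamps to requests per second; baseline
    samples are expected `offset` seconds before the matching outage sample.
    """
    lost = 0.0
    for ts, rate in outage.items():
        expected = baseline.get(ts - offset)
        if expected is not None:
            # Ignore samples where traffic exceeded the baseline.
            lost += max(expected - rate, 0.0) * step
    return lost


def canned(start, rates, step):
    """Build a fake query_range response, for demonstration only."""
    values = [[start + i * step, str(r)] for i, r in enumerate(rates)]
    return {
        "status": "success",
        "data": {"resultType": "matrix", "result": [{"metric": {}, "values": values}]},
    }


WEEK = 7 * 24 * 3600
STEP = 60
outage = series_values(canned(1_700_000_000, [1000, 400, 500, 1000], STEP))
baseline = series_values(canned(1_700_000_000 - WEEK, [1000] * 4, STEP))
lost = queries_lost(outage, baseline, WEEK, STEP)
print(f"estimated queries lost: {lost:.0f}")
# prints: estimated queries lost: 66000
```

A week-earlier baseline ignores traffic growth and any anomalies in the earlier week, so the written-up procedure may want a more careful estimate of expected traffic.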