Proposal: create a framework to build containerized incident management projects
Open, Low, Public

Description

I have often thought it would be great to have a repository of containerized projects that replicate real incidents, with the intention of training engineers in good incident management skills.

As a very loose concept, I see this as something like a git repo with a folder for each incident report we produce. Each folder should contain:

  • A method to configure a virtual environment representing the incident (e.g. a Vagrantfile). This would need to include our environment and likely the client environment which triggers the incident.
  • A set of injects to guide the user, e.g. Icinga alerts on X; received user reports of Y; person foobar discovered Z; Grafana graph posted in the SRE IRC channel; etc.
    • These injects would likely map to real events from the incident timeline and would act first as incident indicators, but also as hints or clues to the engineer in training, likely with the last ones explaining the exact issue and fix, e.g.:
      • $INJECT_LAST - 1: _joe_ noticed the following query signature producing the error $some_uri
      • $INJECT_LAST: cdanis merges change to rate limit uri at the cache layer (incident over)
    • The engineer in training would be able to choose when to reveal each inject, in their own time. At a simple level the injects would just be files labelled inject_{1..n}, but we could have some interactive CLI managing this as well.
  • There should be some way for the engineer in training to work out whether they have resolved the incident without reading all the injects.
    • This could be quite tricky, IMO. The temptation is to check for the exact fix implemented during the incident; however, this prevents us and the trainee from exploring fixes not considered previously.
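
The "reveal one inject at a time" idea could be sketched roughly as follows. This is only an illustration, assuming the inject_{1..n} file naming described above; the function names and the `.revealed` state file are hypothetical and not part of any existing tool.

```python
from pathlib import Path


def list_injects(lab_dir):
    """Return inject files sorted by numeric suffix (inject_1, inject_2, ...)."""
    files = [
        p for p in Path(lab_dir).glob("inject_*")
        if p.name.split("_", 1)[1].isdigit()
    ]
    # Sort numerically so that inject_10 comes after inject_2.
    return sorted(files, key=lambda p: int(p.name.split("_", 1)[1]))


def reveal_next(lab_dir, state_file=".revealed"):
    """Return the text of the next unrevealed inject, or None if none remain.

    The count of injects revealed so far is kept in a small state file
    (hypothetical name) inside the lab folder.
    """
    state = Path(lab_dir) / state_file
    revealed = int(state.read_text()) if state.exists() else 0
    injects = list_injects(lab_dir)
    if revealed >= len(injects):
        return None
    state.write_text(str(revealed + 1))
    return injects[revealed].read_text()
```

Each call to `reveal_next` hands the trainee one more clue, so they stay in control of the pace.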

I would love to see something like this; however, if we have to create all of this manually then it likely won't happen and definitely won't get updated. As such, I would like to explore the possibility of creating some type of framework so that we can try to automate these lab exercises. Ideally we would be able to create these container projects based on the incident report. In reality I doubt we would ever get to that state, as there are too many subtleties, but I think we can get close. The injects, for instance, would be very easy to script, as they would pretty much be the incident timeline; we could enhance this by asking the incident report author to add tags to the timeline marking the entries to be used as injects.
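
As a rough illustration of scripting injects from a tagged timeline, the sketch below pulls out tagged entries. The `HH:MM [inject] text` line format is purely an assumption made for this example; incident reports do not currently use such a tag.

```python
import re

# Hypothetical convention: a timeline line tagged for use as an inject
# looks like "14:02 [inject] icinga alerts on X".
INJECT_TAG = re.compile(
    r"^\s*(?P<time>\d{2}:\d{2})\s+\[inject\]\s+(?P<text>.+?)\s*$"
)


def extract_injects(timeline_text):
    """Return (time, text) pairs for every tagged timeline line, in order."""
    injects = []
    for line in timeline_text.splitlines():
        match = INJECT_TAG.match(line)
        if match:
            injects.append((match.group("time"), match.group("text")))
    return injects
```

Untagged timeline lines are simply skipped, so the report author decides which events become clues.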

The most difficult part is creating a simple framework that makes it easy to create labs that represent the incident, the clients that trigger it, and a script to identify that the issue is resolved.

The first goal, to "create labs that represent the incident", should in theory be simple. All we need is to start our containerized wikimedia development environment with a specific:

  • puppet git revision
  • wikimedia git revision/package version
  • wikimedia-config revision
  • potentially specific debian packages (this could be tricky)
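
The pinned versions listed above could be captured per incident in a small spec. The sketch below is one possible shape, not an existing format; the `LabSpec` class, its field names, and the `LAB_*` variable names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LabSpec:
    """Revisions that pin a lab to the state at the time of the incident."""
    puppet_rev: str
    wikimedia_rev: str
    config_rev: str
    # Optional Debian package pins, mapping package name -> version.
    debian_packages: dict = field(default_factory=dict)

    def as_env(self):
        """Render the spec as environment variables for a lab start script."""
        return {
            "LAB_PUPPET_REV": self.puppet_rev,
            "LAB_WIKIMEDIA_REV": self.wikimedia_rev,
            "LAB_CONFIG_REV": self.config_rev,
            # "name=version" is the form apt accepts for installing a pinned version.
            "LAB_DEBIAN_PACKAGES": " ".join(
                f"{name}={version}"
                for name, version in sorted(self.debian_packages.items())
            ),
        }
```

Keeping the spec as plain data in each incident folder would let one generic start script serve every lab.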

However, I don't think we have a "containerized wikimedia development environment", and I don't underestimate how difficult getting one would be. In fact, I think this is likely the biggest blocker to the whole proposal. That said, I think it would definitely have use cases beyond this project, such as increasing community contributions.

The scripting is also difficult, as it's hard to come up with a generic way to test whether someone has fixed a unique, unknown incident, which is why the obvious solution is to just check whether the user implements something similar to what was used in the original incident resolution. However, I think that is an adequate starting point and we could learn from there.
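
One possible alternative to matching the original fix is to check whether the symptom has gone away, which leaves room for fixes nobody considered at the time. The sketch below assumes the lab's client script can collect HTTP status codes from the affected service; the function names and the 1% threshold are illustrative only.

```python
def error_ratio(status_codes):
    """Return the fraction of sampled responses that were server errors (5xx)."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no samples collected")
    errors = sum(1 for code in codes if code >= 500)
    return errors / len(codes)


def is_resolved(status_codes, max_error_ratio=0.01):
    """Treat the incident as resolved once the error ratio is at or below the threshold."""
    return error_ratio(status_codes) <= max_error_ratio
```

A check like this only fits incidents whose symptom is visible in responses; others would need their own indicator.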