Proposal: create a framework to build containerized incident management projects
Open, Low, Public

Assigned To

None

Authored By

	jbond
	Oct 9 2020, 4:12 PM

Description

I have often thought it would be great to have a repository of containerized projects that replicate real incidents, with the intention of training engineers in good incident management skills.

As a very loose concept, I see this as something like a git repo with a folder for each incident report we produce. The folder should contain:

- A method to configure a virtual environment representing the incident (e.g. a Vagrantfile). This would need to include our environment and likely the client environment that triggers the incident.
- A set of injects to guide the user, e.g. Icinga alerts on X; received user reports of Y; person foobar discovered Z; Grafana graph posted in the SRE IRC channel; etc.
  - These injects would likely map to real events from the incident timeline and would act first as incident indicators, but also as hints or clues to the engineer in training, likely with the last ones explaining the exact issue and fix, i.e.:
    - $INJECT_LAST - 1: _joe_ noticed the following query signature producing the error $some_uri
    - $INJECT_LAST: cdanis merges change to rate limit uri at the cache layer (incident over)
  - The engineer in training would be able to choose when to reveal each inject in their own time. At a simple level the injects would just be files labelled inject_{1..n}, but we could have some interactive CLI managing this as well.
- There should be some way for the engineer in training to work out whether they have resolved the incident without reading all the injects.
  - This IMO could be quite tricky. The temptation is to check for the exact fix implemented during the incident; however, this prevents us and the trainee from exploring fixes not considered previously.
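
To make the "files labelled inject_{1..n}" idea concrete, here is a minimal sketch of how revealing injects one at a time could work. The `.revealed` state file and the function names are hypothetical, invented only for illustration; nothing like this exists yet.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: inject files are named inject_1 .. inject_n, as proposed above.
INJECT_RE = re.compile(r"^inject_(\d+)$")
# Hypothetical state file holding the number of injects revealed so far.
STATE_FILE = ".revealed"


def list_injects(lab_dir: Path) -> list[Path]:
    """Return inject files sorted numerically (inject_2 before inject_10)."""
    found = []
    for path in lab_dir.iterdir():
        match = INJECT_RE.match(path.name)
        if match and path.is_file():
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def reveal_next(lab_dir: Path) -> Optional[str]:
    """Return the text of the next unrevealed inject, or None if none remain."""
    state = lab_dir / STATE_FILE
    revealed = int(state.read_text()) if state.exists() else 0
    injects = list_injects(lab_dir)
    if revealed >= len(injects):
        return None
    state.write_text(str(revealed + 1))
    return injects[revealed].read_text()
```

A trainee-facing CLI could simply call `reveal_next(Path("."))` each time the engineer asks for another hint and print the result.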

I would love to see something like this. However, if we have to create all of this manually then it likely won't happen and definitely won't get updated. As such, I would like to explore the possibilities of creating some type of framework so that we can try to automate these lab exercises. Ideally we would be able to create these container projects based on the incident report. In reality I doubt we would ever get to that state, as there are too many subtleties, but I think we can get close. The injects, for instance, would be very easy to script, as they would pretty much be the incident timeline. We could enhance this by asking the incident report author to add tags to the timeline entries that are to be used as injects.
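
As a rough illustration of scripting injects from a tagged timeline, the sketch below assumes each timeline entry is one line of text and that authors mark inject-worthy entries with a literal `[inject]` tag. Both the tag and the line format are assumptions; the real incident report format may differ.

```python
from pathlib import Path

# Hypothetical tag that report authors would add to timeline entries.
INJECT_TAG = "[inject]"


def extract_injects(timeline: str) -> list[str]:
    """Return tagged timeline entries, in order, with the tag removed."""
    injects = []
    for line in timeline.splitlines():
        line = line.strip()
        if INJECT_TAG in line:
            injects.append(line.replace(INJECT_TAG, "").strip())
    return injects


def write_injects(injects: list[str], lab_dir: Path) -> list[Path]:
    """Write each inject to lab_dir/inject_<n>, numbering from 1."""
    lab_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(injects, start=1):
        path = lab_dir / f"inject_{number}"
        path.write_text(text + "\n")
        paths.append(path)
    return paths
```

Untagged entries are skipped, so the report author stays in control of which timeline events become hints.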

The most difficult part is creating a simple framework that makes it easy to create labs that represent the incident, the clients that trigger it, and a script to identify that the issue is resolved.

The first goal, to "create labs that represent the incident", should in theory be simple. All we need is to start our containerized Wikimedia development environment with a specific:

- puppet git revision
- wikimedia git revision/package version
- wikimedia-config revision
- potentially specific Debian packages (this could be tricky)
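
One way to capture the pinned versions listed above is a small per-incident spec that the lab tooling turns into environment variables for whatever builds the environment. This is only a sketch: the `LabSpec` class and the `LAB_*` variable names are hypothetical, and no such development environment currently consumes them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LabSpec:
    """Pinned versions needed to reproduce one incident (hypothetical schema)."""

    puppet_revision: str
    wikimedia_revision: str
    wikimedia_config_revision: str
    # Optional package pins, mapping package name to version.
    debian_packages: dict[str, str] = field(default_factory=dict)

    def to_env(self) -> dict[str, str]:
        """Render the spec as environment variables (names are assumptions)."""
        env = {
            "LAB_PUPPET_REVISION": self.puppet_revision,
            "LAB_WIKIMEDIA_REVISION": self.wikimedia_revision,
            "LAB_WIKIMEDIA_CONFIG_REVISION": self.wikimedia_config_revision,
        }
        if self.debian_packages:
            # name=version is the form apt-get install accepts for pinned installs.
            env["LAB_DEBIAN_PACKAGES"] = " ".join(
                f"{name}={version}"
                for name, version in sorted(self.debian_packages.items())
            )
        return env
```

Keeping the spec as plain data means it could live alongside the incident folder in the repo and be reviewed like any other change.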

However, I don't think we have a "containerized Wikimedia development environment", and I don't underestimate how difficult getting one would be. In fact, I think this is likely the biggest blocker to the whole proposal. That said, I think it would have use cases beyond this project, such as increasing community contributions.

The scripting is also difficult, as it's hard to come up with a generic way to test whether someone has fixed a unique, unknown incident. This is why the obvious solution is to just check whether the user implements something similar to what was used in the original incident resolution. I think that is an adequate starting point, and we could learn from there.
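
For comparison with matching the original fix, one possible alternative is to check for the symptom instead: replay the client behaviour that triggered the incident and see whether it still fails. The sketch below is not the approach proposed above, just an illustration of that alternative. It assumes each lab can supply a `probe` callable that returns True when a single request succeeds; the function names and the default threshold are hypothetical.

```python
from typing import Callable


def error_rate(probe: Callable[[], bool], attempts: int) -> float:
    """Run the probe `attempts` times and return the fraction that failed."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = sum(1 for _ in range(attempts) if not probe())
    return failures / attempts


def is_resolved(
    probe: Callable[[], bool],
    attempts: int = 20,
    max_error_rate: float = 0.05,
) -> bool:
    """Treat the incident as resolved once the symptom rate is low enough."""
    return error_rate(probe, attempts) <= max_error_rate
```

Because the check only looks at the symptom, it would accept fixes other than the one used in the real incident, at the cost of each lab having to define a meaningful probe.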

Event Timeline

jbond triaged this task as Low priority. Oct 9 2020, 4:12 PM

jbond created this task.

Restricted Application added a subscriber: Aklapper. Oct 9 2020, 4:12 PM

jbond updated the task description. Oct 9 2020, 4:18 PM

CDanis subscribed. Oct 9 2020, 4:20 PM

jijiki subscribed. Oct 9 2020, 4:28 PM

herron subscribed. Oct 9 2020, 5:06 PM

RLazarus subscribed. Oct 9 2020, 9:42 PM

ayounsi subscribed. Oct 12 2020, 8:33 AM

jbond moved this task from Unsorted 💣 to Friday tasks on the User-jbond board. Nov 20 2020, 11:25 AM

cmooney subscribed. Mar 29 2023, 10:31 AM

jbond edited projects, added Infrastructure-Foundations, Puppet; removed SRE. Mar 29 2023, 2:23 PM

jbond edited projects, added Puppet CI; removed Puppet, User-jbond. Jul 17 2023, 3:05 PM
