Proposal
- A new project called #Wikimedia-Incident with the following columns:
- "To Triage (default)"
- "Active Emergency" - for tasks about an on-going emergency
- "Follow-up/Actionables" - for tasks from the "Actionables" section of an Incident Report
- Incident-specific milestones, to track a larger-than-normal number of follow-up tasks/actionables; creating such a milestone is up to the people involved in the response/follow-up (in other words: only if they want it).
If a milestone is created for follow-up actionables:
- It'd look like #Wikimedia-Incident-20140228-Cirrus
- The incident milestone project will be archived once its follow-up tasks are completed (or, upon further reflection/discussion, removed or delayed for longer than 3 months)
- For the avoidance of doubt: if, after discussion with stakeholders, the only remaining follow-up tasks are deemed unnecessary in the short term (3 months) but still valid, then we should archive the incident milestone AND keep those tasks in it as a historical marker/context for future work
- It is up to the people working on the specific incident to create workboard columns in that milestone
- They may choose everything from not having columns to using columns extensively.
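To make the naming and archival rules above concrete, here is a minimal Python sketch. It is only an illustration: `milestone_tag`, `FollowUp`, and `can_archive` are hypothetical names, nothing here talks to Phabricator, and treating "removed" tasks as simply absent from the list is an assumption.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Iterable

PARENT_TAG = "#Wikimedia-Incident"


def milestone_tag(incident_date: date, short_name: str) -> str:
    """Build a tag shaped like #Wikimedia-Incident-20140228-Cirrus."""
    # Assumption: the short name is reduced to letters and digits only.
    slug = re.sub(r"[^A-Za-z0-9]+", "", short_name)
    if not slug:
        raise ValueError("short_name must contain at least one letter or digit")
    return f"{PARENT_TAG}-{incident_date:%Y%m%d}-{slug}"


@dataclass
class FollowUp:
    """Hypothetical stand-in for a follow-up task in an incident milestone."""
    title: str
    done: bool = False
    # Set after discussion with stakeholders: still valid, but not needed
    # within the next 3 months.
    deferred_over_3_months: bool = False


def can_archive(tasks: Iterable[FollowUp]) -> bool:
    """A milestone can be archived when nothing short-term is left in it."""
    return all(t.done or t.deferred_over_3_months for t in tasks)
```

For example, `milestone_tag(date(2014, 2, 28), "Cirrus")` produces the tag shown above. A milestone whose only open task is deferred past 3 months is archivable, and that task stays in it as context.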
How it'd work in practice
- See something horrible
- File task
- Add #Wikimedia-Incident to alert appropriate people
- It goes in the "To Triage (default)" column
- It is triaged into "Active Emergency"
- Fix it! (Don't worry about moving the task between columns; just fix it. Greg/someone can triage it.)
- Write up an incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation
- Will there be a large number of follow-up tasks/do the people doing the follow-up want a milestone?
- Yes: create a milestone, e.g. #Wikimedia-Incident-20160725-Whatever
- No: use the "Follow-up/Actionables" column for any listed actionables from the incident report
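The flow above can be sketched as a small piece of Python. This is a rough illustration under assumptions, not how Phabricator implements workboards: the `Column` enum only mirrors the three proposed column names, and `column_for_new_task` and `follow_up_destination` are hypothetical helpers.

```python
from enum import Enum


class Column(Enum):
    """The three proposed columns on the #Wikimedia-Incident workboard."""
    TO_TRIAGE = "To Triage (default)"
    ACTIVE_EMERGENCY = "Active Emergency"
    FOLLOW_UP = "Follow-up/Actionables"


def column_for_new_task() -> Column:
    """A task newly tagged with #Wikimedia-Incident lands in the default column."""
    return Column.TO_TRIAGE


def follow_up_destination(wants_milestone: bool, date_stamp: str, short_name: str) -> str:
    """Decide where actionables from the incident report should go.

    date_stamp is assumed to already be in YYYYMMDD form.
    """
    if wants_milestone:
        # Many follow-ups, or the responders simply prefer their own board.
        return f"#Wikimedia-Incident-{date_stamp}-{short_name}"
    # Otherwise the actionables stay on the main workboard.
    return Column.FOLLOW_UP.value
```

With `wants_milestone=True`, `"20160725"`, and `"Whatever"`, the helper returns the example tag from the list above. With `wants_milestone=False` it returns the "Follow-up/Actionables" column name.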
Benefits
- This allows anyone to get a quick view (by viewing the #Wikimedia-Incident project workboard) of "tasks of importance and be reminded of outstanding technical debt that has retroactively demonstrated its need in the form of an outage" (as Krinkle said in the original description).
- And each incident's responders can manage their work in the best way for them.
Disadvantages
- If many incidents have follow-up that takes a long time, the "Follow-up/Actionables" column will grow very large.
- NB: Greg is considering new options for something to replace the quarterly review of incident reports and follow-up, see T141287 for that.
Original Description:
I'd like to propose a task related to Wikimedia-production-error with the purpose of tracking bugs, technical debt, and feature work that were prompted by an outage or other incident and that, as such, would help prevent, or reduce the impact of, future incidents.
In the past we sometimes created new components for incidents (e.g. Incident-20150205-SiteOutage). See also T134624: Archive old Incident-* projects. I believe it would be useful to track these tasks under a combined project instead. This would make it easier to discover tasks of importance and be reminded of outstanding technical debt that has retroactively demonstrated its need in the form of an outage.
I personally think we don't need the individual incident components, but that's a separate question. If we decide to keep them, we'd be tagging with both. More commonly, incidents that didn't get a component would at least be tracked here.
The workboard for this would be a more manageable version of the wiki pages we created in 2014 that transclude the "actionable" sections of each incident per quarter.
https://wikitech.wikimedia.org/wiki/Incident_documentation/QR201403
https://wikitech.wikimedia.org/wiki/Incident_documentation/QR201407