Create #Wikimedia-Incident for tracking Wikimedia incident actionables
Closed, Resolved · Public

Description

Proposal

  • A new project called #Wikimedia-Incident with the following columns:
    • "To Triage (default)"
    • "Active Emergency" - for tasks about an on-going emergency
    • "Follow-up/Actionables" - for tasks from the "Actionables" section of an Incident Report
    • Specific incident milestones, to track a larger-than-normal number of follow-up tasks/actionables; the creation of such milestones is up to the people involved in the response/follow-up (IOW: only if they want it).

If a milestone is created for follow-up actionables:

  • It'd look like #Wikimedia-Incident-20140228-Cirrus
  • The incident milestone projects will be archived when the follow-up tasks are completed (or removed/delayed for longer than 3 months upon further reflection/discussion)
    • For the avoidance of doubt: if, after discussion with stakeholders, it is deemed that the only follow-up task(s) that remain is/are not needed in the short term (3 months) but is/are still valid then we should archive the incident milestone AND keep the task in it as a historical marker/context for future work
  • It is up to the people working on the specific incident to create workboard columns in that milestone
    • They may choose everything from not having columns to using columns extensively.
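
The naming pattern and archive rule above could be sketched roughly as follows. This is only an illustration, not part of Phabricator: `FollowUpTask`, `milestone_tag` and `can_archive` are hypothetical names, and the two boolean flags are assumed stand-ins for "completed" and "deemed not needed within 3 months after discussion".

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FollowUpTask:
    # Hypothetical record for illustration; not a Phabricator API object.
    title: str
    resolved: bool = False
    # Assumed flag: stakeholders agreed the task is still valid but not
    # needed in the short term (3 months).
    deferred_beyond_three_months: bool = False


def milestone_tag(incident_date: date, label: str) -> str:
    """Build a tag shaped like #Wikimedia-Incident-20140228-Cirrus."""
    return f"#Wikimedia-Incident-{incident_date:%Y%m%d}-{label}"


def can_archive(tasks: list[FollowUpTask]) -> bool:
    """A milestone may be archived once every follow-up task is either
    resolved or explicitly deferred; deferred tasks stay in the milestone
    as historical context."""
    return all(t.resolved or t.deferred_beyond_three_months for t in tasks)
```

For example, `milestone_tag(date(2014, 2, 28), "Cirrus")` produces the tag shown in the proposal.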

How it'd work in practice

  1. See something horrible
  2. File task
  3. Add #Wikimedia-Incident to alert appropriate people
    1. It goes in the "To Triage (default)" column
  4. It is triaged into "Active Emergency"
  5. Fix it! (don't worry about column moving, just fix it, Greg/someone can triage it)
  6. Write up an incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation
  7. Will there be a large number of follow-up tasks/do the people doing the follow-up want a milestone?
    1. Yes: Create a milestone eg #Wikimedia-Incident-20160725-Whatever
    2. No: use the "Follow-up/Actionables" column for any listed actionables from the incident report
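
As a rough sketch of the steps above, the column flow and the final milestone-or-column decision might look like this. The column names come from the proposal; the function names and the `wants_milestone` parameter are assumptions made for illustration, and nothing here calls a real Phabricator API.

```python
from datetime import date
from typing import Optional

# Column names as listed in the proposal.
TO_TRIAGE = "To Triage (default)"
ACTIVE_EMERGENCY = "Active Emergency"
FOLLOW_UP = "Follow-up/Actionables"


def initial_column() -> str:
    """Step 3: a newly tagged task lands in the default triage column."""
    return TO_TRIAGE


def triage() -> str:
    """Step 4: the task is triaged into the emergency column."""
    return ACTIVE_EMERGENCY


def follow_up_destination(
    wants_milestone: bool,
    incident_date: Optional[date] = None,
    label: Optional[str] = None,
) -> str:
    """Step 7: responders pick a dedicated milestone or the shared column."""
    if not wants_milestone:
        return FOLLOW_UP
    if incident_date is None or label is None:
        raise ValueError("a milestone needs an incident date and a label")
    return f"#Wikimedia-Incident-{incident_date:%Y%m%d}-{label}"
```

The decision is left as an explicit boolean because the proposal leaves it to the judgement of the people doing the follow-up rather than to a fixed task-count threshold.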

Benefits

  • This allows anyone to get a quick view (by viewing the #Wikimedia-Incident project workboard) of "tasks of importance and be reminded of outstanding technical debt that has retroactively demonstrated its need in the form of an outage" (as Krinkle said in the original description).
  • And each incident's responders can manage their work in the best way for them.

Disadvantages

  • If a large number of incidents have follow-up that takes a long time, the "Follow-up/Actionables" column will become very large.
    • NB: Greg is considering new options for something to replace the quarterly review of incident reports and follow-up, see T141287 for that.

Original Description:

I'd like to propose a task related to Wikimedia-production-error with the purpose of tracking bugs, technical debt, feature work that was prompted by an outage or other incident and as such would help prevent, or reduce impact of, future incidents.

We previously sometimes created new components for incidents (e.g. Incident-20150205-SiteOutage). See also T134624: Archive old Incident-* projects. I believe it would be useful to track these tasks under a combined project instead. This would make it easier to discover tasks of importance and be reminded of outstanding technical debt that has retroactively demonstrated its need in the form of an outage.

I personally think we don't need the individual incident components, but that's a separate question. If we decide to keep them, we'd be tagging with both. And, more commonly, for incidents that didn't get a component, they'd at least be tracked here.

The workboard for this would be a more manageable version of the wiki pages we created in 2014 that transclude the "actionable" sections of each incident per quarter.

https://wikitech.wikimedia.org/wiki/Incident_documentation/QR201403
https://wikitech.wikimedia.org/wiki/Incident_documentation/QR201407

Event Timeline

Krinkle created this task. Jul 13 2016, 12:50 AM
Restricted Application added subscribers: Zppix, Aklapper. Jul 13 2016, 12:50 AM
Danny_B claimed this task. Jul 13 2016, 1:00 AM
Danny_B triaged this task as Medium priority.
Danny_B moved this task from Incoming to Projects to create on the Project-Admins board.
Danny_B added a subscriber: Danny_B.

Title says "tag", description says "task". Please synchronize. Thank you.

I think we should get input from Operations/Release-Engineering-Team on how they want to move forward, since they would probably be the ones using this the most.

I would imagine that having an Incident-Report parent project, with subprojects for each issue, would be the easiest method.

greg added a comment. Jul 13 2016, 3:21 AM

The per-incident projects were probably an over-engineered part (from me).

I like the one project idea.

What do you think the columns would look like to replicate those older quarterly reviews? Status? Something else?

My idea when doing the Phabricator cleanup was #Wikimedia-sites-incident with milestones for each incident (and convert current incident projects to milestones).

greg added a comment. Jul 18 2016, 6:20 PM

And, it'd (hopefully) force us to actively manage the tasks in those milestones (resolve them or remove them in a reasonable amount of time) so that we don't have a workboard with 100 incident milestones still active.

Cf. T134624: Archive old Incident-* projects ;-)

OK, if there are no objections soon, I'll proceed with the first steps towards this aim.

greg added a comment. Jul 19 2016, 6:12 PM

@Krinkle: yay/nay to the above proposal?

Looks good. Though personally I don't feel the need for milestones. It's fine to have them I guess, but it does force our hand with regards to the main workboard - as tasks can then only be in the dedicated column for that milestone.

greg added a comment. Jul 19 2016, 8:35 PM

Yeah, it does, I just couldn't think of a better workboard column purpose (other than status, which is kind of annoying to track, since many teams track it somewhere else already).

There is no way a task could be in more than one column, regardless of whether they are milestones or plain columns.

However, a milestone at least, unlike a regular column, allows further breakdown, typically progress-wise.

greg updated the task description. Jul 25 2016, 6:20 PM

I've updated the description with the proposal as I see it. I don't think this is too radical of a proposal. We should implement it soon. We just need someone to do the conversion of old incident projects into milestones of #Wikimedia-Incident (needs shell/permissions on the Phabricator host) after it is created.

greg added a comment. Jul 25 2016, 6:46 PM

I had an epiphany while responding in T140207 that makes this proposal even better (I think):

#Wikimedia-Incident will be the parent project for the specific incident milestones (see T140202 for details). #Wikimedia-Incident could also be the solution to this problem. Something like this:

  1. See something horrible
  2. File task
  3. Add #Wikimedia-Incident to alert appropriate people
    1. It goes in the "To Triage (default)" column
  4. It is triaged into "Active Emergency"
  5. Fix it! (don't worry about column moving, just fix it, Greg/someone can triage it)
  6. Does it need an incident report with follow-up tasks? Create a milestone eg #Wikimedia-Incident-20160725-Whatever and move that task to it.
  7. See T140202 for how those milestones would be used

#Wikimedia-Incident would then become the project that is watched by myself and responsible people from Operations (though, don't count on us to see the bug mail, a ping on IRC/email is vastly preferred).
I... think that's a good solution, no?

greg updated the task description. Jul 25 2016, 6:49 PM
faidon added a subscriber: faidon. Jul 25 2016, 8:11 PM

I personally find absolutely no use for the #Wikimedia-Incident-20160725-Whatever tags. I don't recall ever using them as tags or their workboards and they just become noise after a while.

I'm all for a Wikimedia-Incident supertag, though. I actually have proposed that before too, cf. T119944 :)

I think I agree with @faidon regarding individual milestones per incident. That seems like a lot of task management overhead. Creating subtasks for any followup work would be much simpler / easier than creating a new milestone for an incident just because it has more than one task associated with it.

I might be missing something though - is there a specific benefit to the milestones that is being overlooked?

I would think it should probably be based on best judgement in relation to the number of subtasks that the incident report generates.

greg added a comment. Jul 27 2016, 6:06 PM

My opinion is pretty much "what @Peachey88 said".

We're not automatons and can reason whether a milestone would be useful for incident follow-up tasks (aka: the "actionables" section in the incident report).

We could have a "Follow-up/Actionables" (wordsmithing allowed ;) ) column where those tasks go if they don't need a milestone/project themselves. I see this probably being the most common/default choice. I'll edit the description to reflect this.

greg updated the task description. Jul 27 2016, 6:12 PM
greg renamed this task from "Create tag for tracking Wikimedia incident actionables" to "Create #Wikimedia-Incident for tracking Wikimedia incident actionables". Jul 27 2016, 9:53 PM
greg added a comment. Jul 27 2016, 10:16 PM

To keep things as simple as possible at the beginning, let's:

  • Just create the #Wikimedia-Incident project/tag now.
  • I'll go through as many incident reports as I can and add it to any follow-up tasks listed. Help appreciated with that. I'll create another task for that so others can help (please ;) ).

And looking at the data we have (those 23 still open tasks in those 9 #Incident-* projects) it looks like no one is using the workboards actively (some don't have workboards at all). To clean things up I propose adding them to the #Wikimedia-Incident project and putting them in the 'Follow-up/Actionables' column and archiving the #Incident-* projects. IOW: it seems the people working on them didn't actively need a milestone/workboard, so no need for them now, they can live in that column just fine.

greg updated the task description. Jul 27 2016, 10:17 PM
Danny_B added a comment. (Edited) Jul 29 2016, 10:39 AM

For the sake of order and better orientation in causes and dependencies, it would still be good to have at least a column on the Wikimedia-Incident workboard for each incident. Such a column can be hidden when the incident is considered "solved" and the follow-up actions have been moved to the proper column.

A flat board doesn't make much sense, because then only the tasks reporting the incident should be tagged, not all subsequent actions.

greg added a comment. Jul 29 2016, 4:33 PM

I disagree, @Danny_B. The current workboard setup is fine and meets the needs of this task and those who are responding to Wikimedia Incidents. We don't need over-engineering of it. If we need more columns/structure we can do that later (as needed). Thus far it is not needed.