
Plan how to improve reminders for teams/people to address identified actionables from incident reports
Closed, Resolved (Public)

Description

Plan of action

One easy thing to do would be to ping all of the very old tasks (e.g., tasks whose last update was me adding them to #wikimedia-incident on July 27+28th) with a note like:

"This task is a follow-up action from an incident report and has not recently seen updates. If this is no longer a valid task/actionable or it has been superseded by another one, please indicate as such. If it is still valid you should prioritize this work appropriately in your team/personal backlog. If you have any questions feel free to ask me (Greg Grossmeier)."

(wordsmithing appreciated if this idea makes sense)

It might make sense to rerun that query at the end of this quarter so that more than 20 days have passed since I added Wikimedia-Incident (there'd be ~60 days). NB: When I went through and added Wikimedia-Incident to those tasks, I only made it as far back as May 2016; see T141493#2500584. It's tedious.

tl;dr: One common statement in the meeting we had was ~"awareness is useful, and people forget, so reminding them is good". The proposal above seems like a low-cost method of doing that.
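
As a rough illustration of the plan above, here is a minimal Python sketch of the "find the stale follow-up tasks and draft the reminder" step. It is only a sketch under stated assumptions: the task dictionaries, their field names (`id`, `status`, `date_modified`) and the 60-day threshold are hypothetical stand-ins, not the actual Phabricator data model or API. Fetching the tasks and posting the comments are deliberately left out.

```python
from datetime import datetime, timedelta, timezone

# Reminder text taken from the proposal above.
REMINDER_NOTE = (
    "This task is a follow-up action from an incident report and has not "
    "recently seen updates. If this is no longer a valid task/actionable or "
    "it has been superseded by another one, please indicate as such. If it "
    "is still valid you should prioritize this work appropriately in your "
    "team/personal backlog. If you have any questions feel free to ask me "
    "(Greg Grossmeier)."
)


def find_stale_tasks(tasks, now, min_idle_days=60):
    """Return the ids of open tasks not updated in the last `min_idle_days`.

    `tasks` is a list of dicts with hypothetical keys:
      id            -- task identifier, e.g. "T1"
      status        -- "open" or anything else (treated as closed)
      date_modified -- last update as a Unix timestamp (UTC)
    """
    cutoff = now - timedelta(days=min_idle_days)
    stale = []
    for task in tasks:
        last_update = datetime.fromtimestamp(task["date_modified"], tz=timezone.utc)
        if task["status"] == "open" and last_update <= cutoff:
            stale.append(task["id"])
    return stale


def draft_reminders(tasks, now, min_idle_days=60):
    """Map each stale task id to the reminder note that would be posted."""
    return {task_id: REMINDER_NOTE for task_id in find_stale_tasks(tasks, now, min_idle_days)}
```

Running this at the end of the quarter with `min_idle_days=60` would match the "~60 days" idea above; a task last touched by the July 27+28th tagging pass would be selected, while anything updated since then or already closed would be skipped.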

Original-ish description

Just filing a task for now as a placeholder.

I am meeting with TPG this week to brainstorm.

See a previous attempt of this at quarterly reviews of incident reports:
https://wikitech.wikimedia.org/wiki/Incident_documentation#Quarterly.28ish.29_reviews_of_post-mortems

See also: T123753: Establish retrospective reports for #security and #performance incidents

Event Timeline

greg created this task. Jul 25 2016, 6:19 PM
Restricted Application added a project: User-greg. Jul 25 2016, 6:19 PM
Restricted Application added a subscriber: Aklapper.
greg moved this task from Backlog to Next on the User-greg board. Jul 25 2016, 7:34 PM
faidon added a subscriber: faidon. Jul 25 2016, 8:17 PM

I'd like us (SRE) to be involved in those discussions. We are de facto and de jure the primary incident responders as well as enforcers of the current system (and review each incident at length on a weekly basis).

I believe it'd be more productive, with a better end result, if we worked together on that. I don't want to get in the way of your brainstorming or invite myself to a meeting; I'm happy to get involved when you feel it makes sense.

greg added a comment. (Edited) Jul 25 2016, 8:23 PM

+1 :) I think this first meeting I'm having with TPG is just "is someone willing to help brainstorm with this work" :) so, yeah, I'll loop you in on any actual brainstorming.

greg moved this task from Next to In Progress on the User-greg board. Jul 29 2016, 5:30 AM
greg added a comment. Aug 4 2016, 5:05 PM

Update:

We (Faidon, Kevin S, and myself) just had a conversation about this. Notes at https://etherpad.wikimedia.org/p/reinstateincidentreviews

Conclusions:

  • DECISION: We will continue the conversation here in T141287, with any ideas for how to make the reviews useful (there was debate/uncertainty about whether the old model was)
  • ACTION: Greg will follow up with Faidon and Kevin via email in 2 weeks, unless other actions would make that unnecessary

I don't really have the context to know what is going to make sense, so I'll toss out some ideas. Assume some will be irrelevant, impractical, etc.

  • The mere act of reminding someone later that they committed to do something has value. Even if they don't end up acting on the reminder, at least their awareness has been raised.
  • Could it become a standard practice to edit every incident report with a "90 days later" section describing what actions (if any) were taken?
  • Would there be value in compiling statistics about components or teams associated with incident reports? Not for public shaming purposes, but to help any teams that have problems in this area realize that they have problems in this area.
  • Any process should have clear ownership. "If more than one person/group is responsible, then effectively nobody is responsible."
  • Any process should be relatively lightweight. At least until the benefits have been shown, a large investment isn't justified.
  • As much as possible, the process should be automatic. Nobody should have to remember to check something monthly or quarterly.
  • There should be a path for org-wide or systemic problems or actions to be escalated to someone who could act on them. There are times when the appropriate actions coming out of an incident would need attention from someone outside the group that had the incident or filed the report.

I think I'll stop there, for now.

greg removed greg as the assignee of this task. Aug 8 2016, 10:13 PM
greg added a comment. (Edited) Aug 18 2016, 6:36 PM
  • ACTION: Greg will follow up with Faidon and Kevin via email in 2 weeks, unless other actions would make that unnecessary

It's been 2 weeks.

We now have the Wikimedia-Incident workboard column "follow-up/actionable". This allows us to do queries against the corpus of incident actionables, like:

I haven't had any other bright ideas for this, even after re-reading @ksmith's brainstorm list above. I like some of them, of course (statistics of incident reports by components/software to gauge health/need of extra support, clear ownership is obvious/true, light-weight is obvious/true, hopefully automatic), I'm just not sure what our priority should be here.

One easy thing to do would be to ping all of the very old tasks (e.g., tasks whose last update was me adding them to #wikimedia-incident on July 27+28th) with a note like:

"This task is a follow-up action from an incident report and has not recently seen updates. If this is no longer a valid task/actionable or it has been superseded by another one, please indicate as such. If it is still valid you should prioritize this work appropriately in your team/personal backlog. If you have any questions feel free to ask me (Greg Grossmeier)."

(wordsmithing appreciated if this idea makes sense)

It might make sense to rerun that query at the end of this quarter so that more than 20 days have passed since I added Wikimedia-Incident (there'd be ~60 days). NB: When I went through and added Wikimedia-Incident to those tasks, I only made it as far back as May 2016; see T141493#2500584. It's tedious.

tl;dr: One common statement in the meeting we had was ~"awareness is useful, and people forget, so reminding them is good". The proposal above seems like a low-cost method of doing that.

If this makes sense, I plan to re-title this task to align with that plan (eg: "Improve reminders for teams/people to address identified actionables from incident reports"). That wouldn't prevent the reinstatement of a quarterly review in the future, but it wouldn't be the immediate thing which we do now.

greg added a comment. Sep 2 2016, 11:25 PM

I'll retitle next week mid-week if no objections.

greg renamed this task from "Institute quarterly(?) review of incident reports and follow-up" to "Improve reminders for teams/people to address identified actionables from incident reports". Sep 7 2016, 5:31 PM
greg updated the task description.
greg updated the task description.
greg renamed this task from "Improve reminders for teams/people to address identified actionables from incident reports" to "Plan how to improve reminders for teams/people to address identified actionables from incident reports". Sep 7 2016, 5:37 PM
greg closed this task as Resolved.
greg claimed this task.

With the retitling and documenting the plan in the description, I think we're done here. I've created a parent task to actually do the work (mostly as a reminder to myself) by the end of this current quarter.

Thanks all for the brainstorming here.