Establish retrospective reports for #security and #performance incidents
Closed, Resolved
Public

Description

Summary

Institute a norm around security and performance retrospective reports, reinforcing appropriate incentives

Background

In T114419: Event on "Make code review not suck", @ori made the case that we should move to a similar ethic to the operations/puppet repo, where self-merges are allowed, but responsibility lies with whoever merged the patch. Such a move would require a lot more discipline and know-how than has traditionally been exhibited by everyone who currently has +2 rights ("everyone" being the key word). Even for the people that have the know-how, the discipline can be difficult because of our current traditions.

One ethic that would help us develop the discipline required for self-merging would be TechOps-style retrospectives/postmortems for all serious security and performance issues. These wouldn't be about assigning blame, but about building trust and helping us collectively learn from our mistakes. In the aftermath of a serious mistake, a well-written retrospective would help build trust in the author, and would provide the author with deserved esteem.

One anti-pattern we should avoid: dumping this on WMF specialist teams. The Security-Team and Performance-Team at WMF can play an essential role here, but they wouldn't be responsible for writing the reports. If we instituted this, their responsibility would be to establish the practice, guide committers who need assistance completing reports, and identify incidents for which retrospective reports are appropriate. Additional responsibilities these teams could take on: quarterly retrospective review meetings (similar to the ones @greg runs) and other aggregate reporting, such as the percentage of incidents deserving reports that don't have them.
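As an illustration of the aggregate reporting mentioned above, here is a minimal sketch of how the "percentage of incidents deserving reports that don't have them" could be computed. Everything in it is an assumption for illustration only: the `Incident` record, its field names, and the sample IDs are hypothetical and do not correspond to any existing Phabricator field or export.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical record; not an actual Phabricator data structure."""
    task_id: str        # made-up identifier for illustration
    needs_report: bool  # flagged by Security-Team or Performance-Team
    has_report: bool    # a retrospective report has been written


def missing_report_percentage(incidents):
    """Percentage of flagged incidents that still lack a report."""
    flagged = [i for i in incidents if i.needs_report]
    if not flagged:
        # Nothing was flagged, so nothing is missing.
        return 0.0
    missing = sum(1 for i in flagged if not i.has_report)
    return 100.0 * missing / len(flagged)


# Made-up sample data: four flagged incidents, one without a report.
sample = [
    Incident("example-1", needs_report=True, has_report=True),
    Incident("example-2", needs_report=True, has_report=False),
    Incident("example-3", needs_report=False, has_report=False),
    Incident("example-4", needs_report=True, has_report=True),
    Incident("example-5", needs_report=True, has_report=True),
]

print(missing_report_percentage(sample))
```

Incidents that were never flagged are left out of the denominator, which matches the "deserving reports" wording; whether that is the right denominator would be up to the teams doing the reporting.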

One day we might also ask for "technical debt retrospectives", but let's crawl before we walk. ;-)

RobLa-WMF updated the task description.
RobLa-WMF raised the priority of this task from to Needs Triage.
RobLa-WMF claimed this task.
RobLa-WMF added subscribers: RobLa-WMF, ori, Krinkle and 4 others.
Restricted Application added subscribers: StudiesWorld, Aklapper. Jan 15 2016, 5:10 PM
RobLa-WMF triaged this task as Normal priority. Jan 25 2016, 9:34 PM
ori awarded a token. Jan 25 2016, 11:26 PM
ori moved this task from Inbox to Blocked on the Performance-Team board. Feb 8 2016, 7:55 PM

What is the relationship between this and https://wikitech.wikimedia.org/wiki/Incident_documentation ? The line between "really bad performance" and "outage" is not always clear.

In T123753#2161839, @Mattflaschen wrote:

What is the relationship between this and https://wikitech.wikimedia.org/wiki/Incident_documentation ? The line between "really bad performance" and "outage" is not always clear.

Thanks for showing up for E152 and asking that question! My abstract summary of our discussion: Security-Team and Performance-Team should make recommendations for what should have a retrospective, and then the parties responsible for the root cause would hopefully often do them. The log of our discussion is saved here at 21:46 (line 99):

1 21:01:30 <robla> #startmeeting https://phabricator.wikimedia.org/E152
2 21:01:30 <wm-labs-meetbot> Meeting started Wed Mar 30 21:01:30 2016 UTC and is due to finish in 60 minutes. The chair is robla. Information about MeetBot at http://wiki.debian.org/MeetBot.
3 21:01:30 <wm-labs-meetbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
4 21:01:30 <wm-labs-meetbot> The meeting name has been set to 'https___phabricator_wikimedia_org_e152'
5 21:01:49 <robla> #topic Please note: Channel is logged and publicly posted (DO NOT REMOVE THIS NOTE) | Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/
6 21:02:13 <robla> hi folks!
7 21:02:58 * robla begins to wonder if he's going to be the only one at this office hour ;-)
8 21:03:11 <Scott_WUaS> Hi Robla!
9 21:03:26 <robla> hi Scott_WUaS
10 21:05:19 <Scott_WUaS> @robla: What in particular do you want to focus on today?
11 21:05:47 <robla> this is really just going to be more of an office hour in a somewhat traditional sense. we only had a couple of ArchCom folks at the telecon last hour (gwicke and Krinkle)
12 21:07:01 <Scott_WUaS> sounds good ... and perhaps an opportunity to get things done in a different way, with relatively few participants
13 21:07:28 <robla> I listed a few RFCs that I'm shepherding in https://phabricator.wikimedia.org/E152 that I'm specifically happy to answer questions about, but really, no locked down agenda. Scott_WUaS, any specific questions you have?
14 21:07:57 <Scott_WUaS> Yes, thanks ...
15 21:08:30 <ostriches> robla: I'm around too if we need to discuss the Gerrit/Phab one a bit too
16 21:08:55 <robla> ostriches! o/
17 21:09:05 <Scott_WUaS> WUaS which donated WUaS to Wikidata last autumn is curious what the process is for communicating about further developing WUaS in Wikidata / MediaWiki and re ArchCom?
18 21:10:29 <robla> Scott_WUaS: I think your donation is something we can discuss in a different venue, and I'm happy to do so in the hour after this meeting
19 21:10:51 <Scott_WUaS> WUaS is currently talking with former CC MIT OCW Executive Director MIT Dean of Online Learning, Cecilia d'Oliveira and have received Creative Commons' permissions from her to develop and adapt MIT OCW in 7 languages and in Wikidata
20 21:11:02 <robla> ostriches: we touched on the Gerrit->Phab migration conversation in our last meeting
21 21:11:14 <Scott_WUaS> robla: thanks
22 21:12:10 <ostriches> robla: Yeah I saw. Was there any followup needed on that? I think the only question really is the status. It's not really in draft, it's under implementation now if we consider it accepted
23 21:12:47 <matt_flaschen> robla, I had one question about T123753
24 21:12:47 <stashbot> T123753: Establish retrospective reports for #security and #performance incidents - https://phabricator.wikimedia.org/T123753
25 21:12:48 <robla> ostriches: is there help you need from ArchCom? I think the "done"ness is something we can discuss a little bit
26 21:13:26 <ostriches> I don't think we really need much in the way of help from ArchCom at this point. Considering the outcome of the various discussions we've had so far I think there's consensus for it.
27 21:13:36 <ostriches> (for it to be accepted and move forward, that is)
28 21:14:37 <legoktm> ostriches: Are we going to see a Gerrit upgrade happen before we dump it? :)
29 21:14:46 <robla> ostriches: I wouldn't go so far as to consider it "accepted", but that did get us into a general conversation about what does "accepted" mean by ArchCom. I don't think anyone in ArchCom wants to block it
30 21:14:55 <ostriches> legoktm: Yes, I've been working on that this week. "Soon"
31 21:15:03 <legoktm> <3
32 21:15:37 * TimStarling is partially online this hour, as well as looking after kids
33 21:15:47 <robla> o/ TimStarling :-)
34 21:15:48 <ostriches> hi TimStarling :)
35 21:17:09 <ostriches> robla: In which case I think we're good then? I don't think we need to bikeshed over the template status too much :)
36 21:17:28 <ostriches> As long as ArchCom doesn't need to block and we've got general consensus based on passed discussions, I think RelEng can move ahead
37 21:17:37 <ostriches> *past
38 21:18:31 <robla> ostriches: I really appreciate that y'all wrote up an RFC on this, as I think having that written up is going to help the migration go more smoothly. there's some nitpicking we can do about the RFC about the "how" and the "when", but I don't personally see any problems with the "what"
39 21:19:00 <ostriches> I think some more of the how/when will become clear in the coming quarter.
40 21:19:05 <robla> Krinkle: do you mind if I further paraphrase what you said in the past hour?
41 21:19:16 <ostriches> It's going to be like the Gerrit RFC insofar as this one isn't going to be "done" for a long time.
42 21:19:49 <Krinkle> robla: OK
43 21:20:36 <robla> ostriches: I think the how/when questions need to be clear in order for it to be marked "approved" (under the current ArchCom process)
44 21:21:41 <robla> marking ArchCom-RFCs as "approved" is a subject that sends me down the process wonk rabbithole
45 21:22:02 <ostriches> robla: We do have https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Project/Differential_Migration as a result of our annual planning.
46 21:22:08 <ostriches> Which should be incorporated into the RFC.
47 21:22:15 * robla looks
48 21:23:02 <greg-g> (it's linked to from the RFC, as I was tired of copy/pasting tables all last week ;) )
49 21:23:29 <robla> greg-g: I understand, truly :-)
50 21:23:40 <ostriches> So many tables that was.
51 21:24:04 <greg-g> the outcome of planning is great, the process can sometimes be... subpar :)
52 21:25:59 <robla> ostriches: lemme see if I can paraphrase... Phase 1: T130418 done hopefully June 30 (and then do the same quarter math for Phase 2 and Phase 3)
53 21:25:59 <stashbot> T130418: Goal: Phase 1 repository migrations - https://phabricator.wikimedia.org/T130418
54 21:26:51 <ostriches> I think thats how the quarter math works out
55 21:27:20 <robla> Phase 2: T130420 done hopefully by December 31 of this year
56 21:27:21 <stashbot> T130420: Goal: Phase 2 repository migrations - https://phabricator.wikimedia.org/T130420
57 21:28:12 <robla> phase 3: T130421 done hopefully by 2017-03-31
58 21:28:13 <stashbot> T130421: Goal: Phase 3 repository migrations - https://phabricator.wikimedia.org/T130421
59 21:28:35 <robla> does that sum up the plan about right?
60 21:28:39 <greg-g> yup, and the KPIs might be helpful to understand what we consider {{done}} along the way
61 21:28:44 <greg-g> (they're at the bottom of the doc)
62 21:29:41 <greg-g> ie: what "phase X" means :)
63 21:30:35 <robla> "Q1: By the end of Q1 we plan to have a system in place to manage Differential and Nodepool/Continuous Integration interaction, from the baseline of no system in place." (Q1 ends 2016-09-30, so the middle of Phase2, right?)
64 21:30:54 <greg-g> FY
65 21:31:18 <greg-g> the system will be in place before phase 2
66 21:31:33 <greg-g> phase 2 is in Q2, semantically luckily enough
67 21:31:59 <robla> so....is there a numberless phase to this project? ;-)
68 21:32:15 <greg-g> I'm confused by the "in the middle of phase2" part
69 21:32:43 <greg-g> phase2 happens in Q2, building the glue happens in Q1...
70 21:32:59 * greg-g goes to get his hoodie he left outside, his office is suprisingly cold
71 21:33:23 <robla> greg-g: my apologies, I was extrapolating phases based on when the endpoints were
72 21:33:48 * greg-g nods
73 21:34:30 <greg-g> I was worried I miss-aligned something along the way and was not looking forward to copy/pasting a lot more
74 21:34:33 <greg-g> :)
75 21:34:42 <robla> since Phase 1 hopefully ends 2016-06-30, and Phase 2 hopefully ends by 2016-12-31, I put 2016-09-30 in the "middle of Phase 2"
76 21:35:03 <greg-g> ah, I see what you mean, yeah
77 21:35:49 <greg-g> as we imagined it (correct me if I'm wrong, ostriches ) is that there'd be a period of "build integration and respond to our phase1 users" before starting the phase 2, er phase
78 21:36:36 <ostriches> phase1 completion requires integration work to be done, yeah.
79 21:36:55 <robla> so, phase 1.1 ;-)
80 21:37:24 <greg-g> 1.uhoh? ;)
81 21:37:58 <greg-g> (the release after 1.0, to fix the inevitable bug you missed, for those that don't get the joke/context)
82 21:39:45 <robla> Phase 1: hopefully ends 2016-06-30, Phase 1.1: hopefully ends 2016-09-30, Phase 2: hopefully ends by 2016-12-31, Phase 3: hopefully ends by 2017-03-31
83 21:40:15 * greg-g nods
84 21:41:00 * robla looks for the RFC number for this RFC
85 21:41:16 <robla> T119908
86 21:41:16 <stashbot> T119908: [RfC]: Migrate code review / management to Phabricator from Gerrit - https://phabricator.wikimedia.org/T119908
87 21:41:44 <robla> #info T119908: Phase 1: hopefully ends 2016-06-30, Phase 1.1: hopefully ends 2016-09-30, Phase 2: hopefully ends by 2016-12-31, Phase 3: hopefully ends by 2017-03-31
88 21:41:44 <stashbot> T119908: [RfC]: Migrate code review / management to Phabricator from Gerrit - https://phabricator.wikimedia.org/T119908
89 21:42:26 <robla> alright should we talk about the other RFCs, or is that one the most interesting to get cleared up?
90 21:43:04 <greg-g> reminder of other topics: https://phabricator.wikimedia.org/E152
91 21:43:24 <greg-g> I think matt_flaschen had a question about T123753
92 21:43:25 <stashbot> T123753: Establish retrospective reports for #security and #performance incidents - https://phabricator.wikimedia.org/T123753
93 21:43:37 <greg-g> robla: anything else you want/curious about from ostriches and I?
94 21:43:44 <robla> greg-g: ah, right, thanks for the reminder
95 21:44:08 * robla doesn't have any followup right now for the Gerrit->Phab stuff
96 21:44:22 * greg-g nods
97 21:46:02 <robla> by the way, gwicke, Krinkle , and I discussed putting the mbstring requirement RFC into last call....I'll bring that up after I answer matt_flaschen 's question
98 21:46:08 <robla> matt_flaschen: your question?
99 21:47:19 <matt_flaschen> robla, what I mentioned on the task: How would these new retros relate to the Incident reports we have currently? https://wikitech.wikimedia.org/wiki/Incident_documentation
100 21:48:41 <robla> matt_flaschen: I'm hoping we figure out some social norms around this
101 21:48:44 <robla> so...
102 21:49:51 <robla> what I would envision happening is that the WMF Security Team being able to flag things as "this should have a retrospective"
103 21:50:11 <robla> key word being "should". I don't envision there would be 100% compliance
104 21:50:13 <Scott_WUaS> Hi Megan!
105 21:50:37 <robla> (WMF Performance Team would be able to do the same)
106 21:52:06 <robla> the point would be that it would not be socially ok to create many security issues and never write a retrospective. at the same time, if the WMF Security Team got really fussy, I wouldn't envision there being 100% of the retrospectives written they suggest.
107 21:52:45 <robla> (same holds true for Performance)
108 21:53:11 <robla> matt_flaschen: does that make sense?
109 21:53:21 <matt_flaschen> robla, do you think we should do them at https://wikitech.wikimedia.org/wiki/Incident_documentation ? Potential advantage: As I mentioned on the task, line between "really bad performance" and "outage" is not always clear cut.
110 21:53:44 <Krinkle> matt_flaschen: I don't think robla is asking for a duplicative reporting. Take the save-timing regression as example. When this happened it was mostly on the performance team to do the full investigation and (in later stages) (maybe) delegate some actionables to the relevant maintainers of the code in regression.
111 21:53:58 <Krinkle> That's not a healthy or maintainable way of working.
112 21:54:54 <matt_flaschen> Krinkle, so you are you saying "Incident documentation" should be for documenting the immediate response, and there should be a separate retrospective of the full solution?
113 21:55:56 <matt_flaschen> If a performance problem could also be considered an outage.
114 21:56:01 <matt_flaschen> Which depends on the severity.
115 21:56:20 <Krinkle> I imagine if the regression is result of regular deployment, it is subsequently reverted and the relevant author/merger/maintainer should do the investigation (probably on Phabricator). The deployer (if they notice the regression) could write an immediate response on wikitech, but I'm not sure it's all that useful. It depends on how big/obvious the
116 21:56:21 <Krinkle> regression is. In most cases (at least until we have better automated measurements) it will be noticed hours/days later, in which case I think using wikitech/incident is overkill.
117 21:56:49 <Krinkle> matt_flaschen: I agree, but I'd say the severity threshold is at "If the deployer observed it" (in logs/alerts etc.)
118 21:57:21 <Krinkle> Which will slowly become a lower threshold as our infrastructure improves
119 21:57:21 <robla> I think the credibility of the Security and Performance teams is tied up in how frequently they suggest postmortems are needed. It's very subjective, and that seems ok to me.
120 21:57:33 <Krinkle> +1
121 21:57:53 <gwicke> the issue with many of the big systemic issues is that it would take a lot of time to write a proper description & evaluate possible solutions
122 21:58:34 <matt_flaschen> Thanks, Krinkle, that answers my question. Basically, do an incident report for severe perf issues (if you notice immediately when deploying), and do a retrospective on Phabricator if the Performance team asks for it (I would add "or if your team thinks it's a good idea").
123 21:59:06 <robla> by the way, we're coming up on the end of our hour, so I feel bad about ending the official part right on the top of the hour. I may run over a couple minutes, but probably not more
124 22:00:04 <Krinkle> matt_flaschen: Yeah, I don't think it's worthwhile pursuing a really strict rule that one can autonomously follow. It's mostly a quest to adopt and accept this as a normal social behaviour going forward. And to not interpret it as an assignment of blame.
125 22:00:22 <robla> #info general discussion, most of the hour on T119908 , and then the end of the hour on T123753
126 22:00:23 <stashbot> T123753: Establish retrospective reports for #security and #performance incidents - https://phabricator.wikimedia.org/T123753
127 22:00:23 <stashbot> T119908: [RfC]: Migrate code review / management to Phabricator from Gerrit - https://phabricator.wikimedia.org/T119908
128 22:02:02 <robla> #info T129435 ( RFC: drop support for running without mbstring) is going to be heading into last call
129 22:02:02 <stashbot> T129435: RFC: drop support for running without mbstring - https://phabricator.wikimedia.org/T129435
130 22:02:38 <robla> thanks everyone!
131 22:02:43 <robla> #endmeeting

RobLa-WMF set Security to None.
In T123753#2161839, @Mattflaschen wrote:

What is the relationship between this and https://wikitech.wikimedia.org/wiki/Incident_documentation ? The line between "really bad performance" and "outage" is not always clear.

For code security issues, the two are pretty distinct. I would like to see retrospectives of any UBN or High Priority issues (https://www.mediawiki.org/wiki/Wikimedia_Security_Team/Prioritization_of_bugs) that affect a project/code that a WMF team is responsible for. If the vulnerability was found without it being exploited, there will not be an incident documented on wikitech.

For operational security issues, the documentation on Incident_documentation (or similar page on office wiki) is probably enough.

RobLa-WMF updated the task description. Jun 1 2016, 6:04 PM
RobLa-WMF raised the priority of this task from Normal to High. Jun 1 2016, 7:36 PM
RobLa-WMF added a subscriber: dpatrick.

I'd like to institute this as a practice for security issues by the end of FY16-17q1. There's potentially some management overhead associated with bootstrapping this process, which I'll take on (assuming @dpatrick agrees and wants the help).

I'd like to institute this as a practice for security issues by the end of FY16-17q1. There's potentially some management overhead associated with bootstrapping this process, which I'll take on (assuming @dpatrick agrees and wants the help).

I agree with instituting this by the end of FY16-17 Q1. And I concur with @csteipp's comment from April 4th regarding UBN and High Priority issues. We should discuss if/how the output resulting from security-centric retrospective differs from what's posted at https://wikitech.wikimedia.org/wiki/Incident_documentation currently. I would like to say that I really like what's already there and I'm looking forward to this process. And ideally, we could keep security retrospectives in the same location.

Wikimedia ArchCom can sign off on a revocation for technical or social reasons

I smell recursion in two different ways.

One anti-pattern we should avoid: dumping this on WMF specialist teams.

This is the second way.

I'd like to institute this as a practice for security issues

I do not have the time currently to word a sufficient response to this. However I'd like your help with Incident_documentation and their long term mitigation for things with user impact.

We should discuss if/how the output resulting from security-centric retrospective differs from what's posted at https://wikitech.wikimedia.org/wiki/Incident_documentation currently.

Yes that is probably still not sufficiently documented. Also we might need a discussion about responsible .

And ideally, we could keep security retrospectives in the same location.

I agree.

reinforcing appropriate incentives

faidon added a subscriber: faidon. Jun 3 2016, 5:05 PM

I honestly don't see how the ArchComm can be the responsible body for defining our security, performance or other operational processes. Calling retrospective reports of code deployments "architecture" is a stretch by any definition of those two terms, IMO.

I honestly don't see how the ArchComm can be the responsible body for defining our security, performance or other operational processes.

See https://www.mediawiki.org/wiki/+2#Revocation .

ori added a comment. Edited Jun 3 2016, 10:54 PM

I honestly don't see how the ArchComm can be the responsible body for defining our security, performance or other operational processes.

See https://www.mediawiki.org/wiki/+2#Revocation .

Yikes, let's not go there. I think a better way to frame it is to ask whether our community stands to benefit from having ArchComm deliberate over these processes and issue recommendations, and I think the answer to that is "yes". The question of whether ArchComm can back these recommendations with coercive authority is both unpleasant and unimportant. We should simply avoid any situation that would require settling that.

faidon added a comment. Jun 4 2016, 1:38 AM

I definitely do not agree that this discussion is in ArchComm's scope and that it thus has any authority on the matter. This policy/process-making power does not stem from anywhere or have any legitimacy — the committee's name (architecture) is pretty unambiguous and clear evidence of that.

The ability to formulate our operational processes seems fairly orthogonal to the right to revoke +2 rights, which, by the way, has always been awkwardly placed with the ArchComm, and is pretty much an unenforceable/meaningless power these days, IMHO. Even that page itself mentions at its very beginning "[t]his page documents a MediaWiki development policy, crafted over time by developer consensus (or sometimes by proclamation from a lead developer)". (emphasis mine).

All that said, like @ori, I think opening up a discussion about setting up our processes for postmortems to security and performance issues would be useful and fruitful. In fact, it might make sense to discuss improvements to our existing TechOps/RelEng-driven postmortems as part of that same discussion — they are conceptually similar and the lines between these three sources of issues aren't always very clear, anyway.

I'd personally be happy to participate in such an informal consensus-driven group between all the relevant stakeholders. The alternative of letting the ArchComm facilitate, deliberate and rubber-stamp such processes does not make any sense to me, both conceptually (not its mandate) and in practice (not the right people to decide).

I honestly don't see how the ArchComm can be the responsible body for defining our security, performance or other operational processes.

See https://www.mediawiki.org/wiki/+2#Revocation .

Yikes, let's not go there. [...] The question of whether ArchComm can back these recommendations with coercive authority is both unpleasant and unimportant. We should simply avoid any situation that would require settling that.

That's fair. The initial comment got under my skin, and I led with that in my haste to respond. My apologies for the haste.

Furthermore, I agree with you that getting into a spat here about ArchCom's authority (or non-authority) is unpleasant and hopefully isn't necessary. I would like to discuss this based on whether it's a good idea, and not veer into an authority debate.

I think a better way to frame it is to ask whether our community stands to benefit from having ArchComm deliberate over these processes and issue recommendations, and I think the answer to that is "yes".

Well put; thank you! I believe that ArchCom has a responsibility to lead with respect to software development norms. While both Security and Performance have expressed support for this, @dpatrick and @Bawolff particularly will need a lot of software developers to step up if they are going to keep up. This seems like a good tool to help all of us learn how to secure our software better, and I appreciate @dpatrick's willingness to work together on this in the coming quarter.

RobLa-WMF updated the task description. Jun 7 2016, 5:19 AM

After doing some thinking about the comments above and discussing this with folks, I've removed the following section:

TechCom's role would be as an enforcement and appeal board should committers in certain areas of the code demonstrate a longstanding lack of discipline, per the mw:+2 policy (as of 2016-03-31):

Anyone can propose a revocation discussion, the Wikimedia ArchCom can sign off on a revocation for technical or social reasons, and anyone authorized by Wikimedia Foundation's Board of Trustees (e.g. WMF's Director of Technical Operations) can sign off on a revocation for emergency security matters or obvious policy breaches.

"revocation for technical or social reasons" is basically another way of saying "this committer isn't trusted anymore". Why that might be:

[Commit privileges are] a big deal. Your merge could cause Wikipedia or other sites to fail. It could create a security vulnerability that allows attackers to delete or corrupt data, or to gain access to private information. And in the more common case, it could cause technical debt to increase if the code doesn't have tests, is poorly implemented or poorly socialized. You're therefore required to read this entire document and carefully review all the relevant links in it before using +2.

Still in my task backlog: having a corresponding wiki page to flesh out this proposal.

Peter added a subscriber: Peter. Jun 7 2016, 5:37 AM

From a MediaWiki perspective, operations/puppet seems to be the opposite of the direction we should be heading in. It won't improve code review, it will make some privileged people (yes, including me) able to bypass it, and it will work against us getting code from people who are not already privileged accepted.

(That said, the general retrospectives idea in this task seems like a good one.)

From a MediaWiki perspective, operations/puppet seems to be the opposite of the direction we should be heading in. It won't improve code review, it will make some privileged people (yes, including me) able to bypass it, and it will work against us getting code from people who are not already privileged accepted.

I've replied over at T114419#2373286 (to keep this task from moving offtopic)

@greg, thanks for pinging this task in T141287! The process you outline in T140207#2493211 looks really solid. Though nitpicking is possible, it seems a solid base to iteratively improve.

If we implement the proposal described at T140207#2493211, should we have a "Needs retrospective" column on the #Wikimedia-Incident workboard?

greg added a comment. Edited Jul 25 2016, 7:14 PM

@greg, thanks for pinging this task in T141287! The process you outline in T140207#2493211 looks really solid. Though nitpicking is possible, it seems a solid base to iteratively improve.

Thanks for the feedback! I'm still in the post-epiphany high phase :)

If we implement the proposal described at T140207#2493211, should we have a "Needs retrospective" column on the #Wikimedia-Incident workboard?

Good question. If I'm understanding the intent of "needs retrospective" correctly ("to have a meeting with the relevant people to review what worked, what didn't, what's still confusing, what's next"), I'm thinking there should be a "Needs follow-up" column after "Active Emergency". This indicates two things:

  1. Someone (me-ish) with the right permissions needs to create the milestone for the incident to collect follow-up tasks, and
  2. Someone (the first responders to the incident or someone they delegate to) needs to start working on the incident report

"Needs follow-up" is not a column a task should be in for very long (just until the milestone is created, mostly). It (the column) would be just to indicate that the immediate issue is fixed.

If it needs a retrospective in addition to the incident report, then that should probably be a task (e.g. "Perform retro for 20160725-Whatever incident") in the milestone for that incident. My guess is that the task would follow the process that comes out of this task.

If we implement the proposal described at T140207#2493211, should we have a "Needs retrospective" column on the #Wikimedia-Incident workboard?

Good question. If I'm understanding the intent of "needs retrospective" correctly ("to have a meeting with the relevant people to review what worked, what didn't, what's still confusing, what's next")

Thanks for checking. I realize now that the vocabulary is causing problems in this discussion. Let me try to offer definitions for each of these as I understand them:

  • "Incident report" - this is the longstanding process that Ops has after downtime or another problematic event on the Wikimedia cluster
  • "Retrospective" - a meeting often used by teams involved in some form of agile software development, typically held at the end of a project, a sprint, or an iteration of some variety.
  • "Postmortem" - frequently used by way of analogy. As of this writing, this word redirects to "autopsy" on enwiki (as does "Post-mortem"). enwiki:Post-mortem_(disambiguation) offers somewhat more helpful content, but I've usually heard "postmortem" used to describe "let's analyze what happened after something went very wrong" (like an "incident report") rather than generically "an after project assessment".
  • "Hotwash" - a nicer analogy than postmortem, and a term that is also in pretty widespread use. Much of my approach to running retrospective meetings comes from someone who prefers to call them "hot washes".

I've been avoiding the postmortem analogy because ... ewww. ;-) I've also been guilty of using "retrospective" and "incident report" interchangeably.

Even more confusing, there's a matrix here:

term           meeting                                          report
Incident       (no name, usually part of some other meeting)    Incident report
Retrospective  Retrospective meeting                            Retrospective report
Postmortem     Postmortem meeting                               Postmortem documentation
Hotwash        Hotwash                                          After action review

So....back to your question, what I meant by "needs retrospective" was "needs a report". Whether a meeting is helpful for arriving at the report is up to the person writing the report.

Aklapper removed RobLa-WMF as the assignee of this task. Nov 7 2016, 11:11 PM
daniel added a project: TechCom-RFC.
daniel moved this task from Under discussion to (unused) on the TechCom-RFC board. Nov 16 2016, 6:44 PM
daniel closed this task as Invalid. Dec 7 2016, 9:35 PM
daniel added a subscriber: daniel.

Out of present scope of ArchCom

MZMcBride reopened this task as Open. Dec 8 2016, 1:59 AM
MZMcBride removed projects: TechCom-RFC, Architecture.

Out of present scope of ArchCom

Perhaps. Earlier comments seem to suggest so as well. However, I don't think the architecture committee's (lack of) scope makes this task invalid. There's some decent discussion here that I don't want to get buried, so I'm going to re-open this task and remove some of the tags/projects.

Krinkle closed this task as Resolved. Jan 16 2018, 4:05 PM
Krinkle claimed this task.

Closing for now. The Performance Team keeps a record of performance regressions in Phabricator instead, and coordinates with RelEng and the train blocker task as needed.

Krinkle removed a project: RfC. Jan 16 2018, 4:05 PM