
Identify "first responders" for "all" "components" deployed on Wikimedia servers
Closed, Resolved · Public

Description

This came out of the Echo+CentralAuth failure to login incident's retrospective.

Note: The quotes in the task title should be taken as scare quotes in the sense that they are either aspirational ("all") or will be defined as we go ("first responders", "components").

Previous attempts (which may or may not be a solution to this problem):

  • mw:Developers/Maintainers
    • First (?) attempt started in 2012
    • collectively maintained by the developer community
    • Now (2017-03-23) canonical version, as far as @greg is concerned
  • Extension blame map (gdoc) - last edited July 26th, 2016 (as of 2016-08-23)
    • Initiated by @Jdforrester-WMF, @greg had previously reviewed.
    • High duplication with mw:Developers/Maintainers but with a different perspective (WMF team vs person)
    • Now (2017-03-23) most likely out of sync with the more up to date mw:Dev/Maint
  • Team specific attempts:

See also:

Event Timeline

greg created this task. Jul 21 2016, 11:13 PM
Restricted Application added a subscriber: Aklapper. Jul 21 2016, 11:13 PM
greg updated the task description. Jul 21 2016, 11:19 PM
greg added a subscriber: Jdforrester-WMF.
greg added a comment. Jul 22 2016, 12:16 AM

As this task is short without all of the context, I am going to paste here my answer to someone who already asked the most important (unanswered as of yet) question:

What does it mean for code to be owned in this context?

I actually don't want to use the word "owned" or any variations from now on in this work. It gives the wrong impression and caused a lot of consternation in the retrospective. It will again for almost anyone who reads anything about this work without all of the context and without knowing the things that were said but not written down in the retro. Oh imprecise English :).

For now I'm going with "responsible parties" but "first responders" is in fact closer to the intended purpose. Another proposal was "Fixers".

The intended purpose is basically "to whom should RelEng assign a task that is blocking deployment and/or causing problems in production and needs to be resolved quickly, whether they do the actual fixing themselves or instead coordinate with the right people who do?"

At some big website companies the RelEng team (or comparable) has a list of first responders for all things (again, even here I'm unsure of "all") that they can ping for these issues. That position is normally a rotating one within the team to alleviate the burden. I see this work as somehow making our own version of that but within the limitations that we have (i.e.: A, some things could have volunteers in that position, from whom we can't expect a very short SLA for response time, and B, boy does no one want to own MW Core).

greg added a comment. Jul 22 2016, 12:21 AM

Scare quoting "components" as well just to call it out that defining what is and isn't included here is also a part of it.

My initial take is: MW extensions, services (eg: parsoid or rcstream), and some dissection of MW Core itself.

greg renamed this task from Identify "responsible parties" for "all" components deployed on Wikimedia servers to Identify "responsible parties" for "all" "components" deployed on Wikimedia servers. Jul 22 2016, 12:22 AM
greg updated the task description.
greg added a comment. Jul 22 2016, 12:28 AM

Comments for the record :)

The pro-activeness of Reading team to create their page is awesome and I greatly appreciate it and wanted to extend it to other teams when I first saw it.

Editing/VisualEditor did the same with that spreadsheet and is/was greatly appreciated as well.

greg updated the task description. Jul 22 2016, 12:34 AM
greg moved this task from Backlog to In Progress on the User-greg board. Jul 22 2016, 4:19 PM
greg added a comment. Jul 22 2016, 5:30 PM

And https://www.mediawiki.org/wiki/Talk:Developers/Maintainers#Kill_extensions.27_.22Maintainers.22_data points out that for MW Extensions some of this information is maintained on the relevant Extension: page. I note, however, in some cases "Author(s)" and "First Responder" will not be the same.

greg updated the task description. Jul 22 2016, 5:46 PM

Removing that subtask as it is already part of a chain of work that overlaps with this but I don't want to disturb things already in progress.

I'll need to sync with Quim after he is back from vacation.

My personal next step is to talk with Technology Management on Tuesday (7/26) about this generally. Not necessarily this task, but the work it represents (I don't care which task it is, though I do wish we'd get away from the word "owners" in those other tasks (see above)).

greg renamed this task from Identify "responsible parties" for "all" "components" deployed on Wikimedia servers to Identify "first responders" for "all" "components" deployed on Wikimedia servers. Jul 25 2016, 6:31 PM
greg updated the task description.
GWicke updated the task description. Jul 26 2016, 4:09 PM

I'm still taking this proposal in, but a few preliminary notes:

"First responders" or "fixers" confuses me a little bit. A few of us -including all of ops- are de facto first responders on a 24x7 basis.

Is what we're looking for the second-tier/subsystem maintainer, the one that has deeper domain knowledge and to whom the first responders will route a more serious issue? Or perhaps a replacement for first responders? If it's the latter (and let me say upfront that I wouldn't object to that), we'd need to make some serious organizational adjustments to maintain 24x7x365 availability and to make sure there are no silos and someone owns the overall investigation when multiple components are involved.

@ori mentioned that in a recent serious site issue he observed what was basically the "bystander effect": several people knew there was an issue but did not take care of it because they thought someone else must be (but no one did). If that's the problem we're trying to solve, then I don't believe "fixers" is the right way to solve it.

I think we need to at some point open up the "ownership" discussion. No component in production should be orphaned, and no one should feel obligated to respond or deploy a short-term fix when there is no one who is going to look at the larger, long-term picture. In my view, this short/long-term split of responsibilities would eventually just add up to technical debt, raise frustrations and prolong the life of otherwise buggy products or subsystems (e.g. OCG) where no one is incentivized to own them and the "fixers" aren't empowered to either properly fix/rearchitect them or kill them. It also won't incentivize teams to build good products, knowing they might just get reassigned to something else and not be responsible after they're, say, done working on them for 6 months. (See also T122825 about a "service ownership and maintenance" discussion.)

In any case and whichever way we go, I don't believe that this discussion should at any point include specific individuals rather than teams. Individuals come and go, travel, go on vacation, get sick, change roles (and should thus be free to let go of their past commitments). That's why the Foundation exists in many ways and that's why we operate in structured teams.

Anomie added a subscriber: Anomie. Jul 26 2016, 6:08 PM
Halfak added a subscriber: Halfak. Jul 26 2016, 7:25 PM

In any case and whichever way we go, I don't believe that this discussion should at any point include specific individuals rather than teams. Individuals come and go, travel, go on vacation, get sick, change roles (and should thus be free to let go of their past commitments). That's why the Foundation exists in many ways and that's why we operate in structured teams.

This seems to echo a point that @Dzahn made last year on T115852:

I think we should aim at replacing individual names in those wiki tables with direct links to tags/projects in phab. This encourages people to use tickets rather than pinging individuals on IRC and is more effective. People interested in an area tend to subscribe to that specific tag. There are usually multiple people who would reply to a ticket rather than just one owner. Information on the wiki page will also be regularly outdated (look at it right before JohnLewis just made an update :)). Trying to find a single owner for everything can be counter-productive, adds unnecessary SPOFs and encourages working in private messages or emails rather than having group input. We should encourage people to just assign tickets to general teams/projects/tags. It usually works better.

I agree with this. @Legoktm has T128370 assigned to him (which calls for updating [[mw:Developers/Maintainers]]). Lego, should Greg postpone any work he plans on this task until you are done with your planned update to that page?

Jay8g added a subscriber: Jay8g. Jul 27 2016, 4:17 AM
greg added a comment. Aug 1 2016, 10:32 PM

I'm still taking this proposal in, but a few preliminary notes:

"First responders" or "fixers" confuses me a little bit. A few of us -including all of ops- are de facto first responders on a 24x7 basis.

Yeah, "First responders other than ops unless it's an ops owned thing" is just too long ;) Another name welcome.

Is what we're looking for the second-tier/subsystem maintainer, the one that has deeper domain knowledge and to whom the first responders will route a more serious issue? Or perhaps a replacement for first responders? If it's the latter (and let me say upfront that I wouldn't object to that), we'd need to make some serious organizational adjustments to maintain 24x7x365 availability and to make sure there are no silos and someone owns the overall investigation when multiple components are involved.

Based on the retrospective that this came out of: the former. The latter would be very interesting and a great goal as well, but... we'd need to at least identify the former first :)

@ori mentioned that in a recent serious site issue he observed what was basically the "bystander effect": several people knew there was an issue but did not take care of it because they thought someone else must be (but no one did). If that's the problem we're trying to solve, then I don't believe "fixers" is the right way to solve it.

That's another outcome of that retrospective, but it is not this task exactly.

In any case and whichever way we go, I don't believe that this discussion should at any point include specific individuals rather than teams. Individuals come and go, travel, go on vacation, get sick, change roles (and should thus be free to let go of their past commitments). That's why the Foundation exists in many ways and that's why we operate in structured teams.

Agree about 99%, modulo the things which are deployed that are community owned. I don't know what the absolute number (or even relative percentage) is for those, however. That is another thing that will be identified during this process, is my guess. (see also a few links from the description)

greg added a comment. Aug 2 2016, 3:44 PM

"First responders" or "fixers" confuses me a little bit. A few of us -including all of ops- are de facto first responders on a 24x7 basis.

Yeah, "First responders other than ops unless it's an ops owned thing" is just too long ;) Another name welcome.

A suggestion from IRC was "dev(eloper) first responders for all dev(eloper) components deployed"

greg updated the task description. Aug 23 2016, 5:51 PM
greg updated the task description. Mar 23 2017, 9:41 PM
greg closed this task as Resolved. Mar 23 2017, 9:43 PM

Since the last comment on this task there have been a lot of positive edits on mw:Dev/Maint:
https://www.mediawiki.org/w/index.php?title=Developers%2FMaintainers&type=revision&diff=2428822&oldid=2210661

Notably, all of my "high priority" components have been claimed. That was mostly a mental list, sorry, but effectively the bits that the really-near-future MediaWiki Platform team claimed. There are still some empty spots in there, and we need to address those as much as possible, but at this time I'm calling this task closed since I (and others!) accomplished the essence of what I think was needed in response to the incident that spurred this task.

greg moved this task from In Progress to Done on the User-greg board. Mar 23 2017, 9:51 PM
mobrovac changed the status of subtask T122825: Service Ownership and Maintenance from Open to Stalled. Aug 8 2017, 10:53 PM
Krinkle changed the status of subtask T122825: Service Ownership and Maintenance from Stalled to Open. Jan 24 2018, 9:24 PM