Consider alternative processes for Unbreak Now bugs, especially those which cross-cut components
Closed, Resolved (Public)

Authored By: Mattflaschen-WMF, Jul 13 2016, 3:15 AM

Description

Do we need a better process to prevent long-running unbreak now bugs, especially those that recur, where the cause is not obvious, and where it cross-cuts multiple components?

Live list of Unbreak Now! (UBN) tasks

Event Timeline

I think a key question to answer here is how we want to treat the "unbreak now!" status in Phabricator Maniphest. Depending on the answer to that question, we may need better reporting and monitoring of tasks in this specific state. Maybe it's okay that tasks sit in the "unbreak now!" status for weeks or months in this Phabricator installation, but currently I don't think there's a clear guideline or consensus about whether this should be the case.

In my opinion, Unbreak Now! should be reserved for serious emergencies. To make this explicit, I think we should have notifications set up to ping relevant IRC channels periodically when these bugs remain unfixed. Notifications could get very distracting if the frequency is too high, but something like every hour or two would help to keep Unbreak Now! tasks from being forgotten.
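
A minimal sketch of the reminder logic described above, assuming a two-hour interval and a hypothetical in-memory `Task` record. None of these names come from Phabricator's API; fetching the tasks and sending the IRC message are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed interval; the comment above only suggests "every hour or two".
REMINDER_INTERVAL = timedelta(hours=2)


@dataclass
class Task:
    """Hypothetical task record, not a Phabricator data structure."""
    task_id: str
    priority: str
    is_open: bool
    last_reminded: Optional[datetime] = None


def tasks_needing_reminder(
    tasks: List[Task],
    now: datetime,
    interval: timedelta = REMINDER_INTERVAL,
) -> List[Task]:
    """Return open UBN tasks never reminded, or last reminded at least `interval` ago."""
    due = []
    for task in tasks:
        # Closed tasks and lower priorities never trigger a ping.
        if not task.is_open or task.priority != "Unbreak Now!":
            continue
        if task.last_reminded is None or now - task.last_reminded >= interval:
            due.append(task)
    return due
```

Keeping the interval as a parameter makes it easy to tune the frequency if the pings turn out to be too distracting.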

At a previous job we had Phabricator customized so that the page header turned red whenever there was an UBN task open. This helped to encourage a sort of all-hands-on-deck attitude about these issues and kept everyone in the loop about current serious issues. It wouldn't be difficult to implement something like this in Phabricator; the main challenge is resolving existing UBN! tasks and keeping that status reserved for really serious problems.

https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities currently states

Unbreak Now! - Something is broken and needs to be fixed immediately, setting anything else aside.

Unclear maintainership responsibilities probably come into play here.

I agree with Mukunda: I think UBN priority is used too much by some teams (and a number of UBN tasks have stayed open for weeks due to 'sign-off' team workflows, making it harder for me to identify 'real' UBN tasks to nag about).


(Personal thoughts, slightly offtopic: Assuming that https://phabricator.wikimedia.org/T119736 was part of the motivation to create this task, I'm wondering if I (as a bugwrangler) failed to escalate at some point. There has always been some activity on the task, hence it never really felt 'stuck'; it just took too long. So while it was on my watchlist, I failed to identify a moment at which to escalate. Hmm.)

ksmith subscribed.

We in TPG don't feel like this falls directly within our current scope. We're willing to help try to find a clear owner, and then we are willing to work with that owner to understand whether there are tasks that TPG should take on. We understand that finding an owner may be a challenge, and may involve some difficult conversations.

As a starting point, I have added tags for a couple other teams who we think might be in a better position to take this on.

There are 16 current UBN! tasks in Phabricator: https://phabricator.wikimedia.org/maniphest/query/dluWYbAa5ben/#R

That query sorts them by last updated, latest first. It's actually much better than when we initially migrated to Phabricator (I believe initially "Highest" in BZ was mapped to "UBN!" in Phab, which kinda made sense, but then we had something in the realm of hundreds of UBN! tasks, which made it less than ideal).

I think @mmodell 's suggestion is the right path forward here. The only caveat is that Phabricator is a sort of "mixed environment" where an UBN! in a deployed extension means something very different than an UBN! in, say, pywikibot. (IOW: unless the cause is on Wikimedia's side, an "all hands on deck" response from WMF RelEng or Ops or other development teams isn't practical).

16 UBN! tasks is still a lot if we immediately switch to @mmodell's proposal, but I think we would manage (by reducing priority of some, and just dang fixing the rest).

https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities currently states

Unbreak Now! - Something is broken and needs to be fixed immediately, setting anything else aside.

Unclear maintainership responsibilities probably come into play here.

Yeah, this will be an issue when I or the train deployer needs to tap someone on the shoulder and say "fix this now" when the code is old and not 'owned' by any current team. That usually falls on "the usual suspects" (IOW: ex-MW Core Team members....).

Adding ArchCom, in case they want to participate in the conversation, since this topic has such wide effects.

If we use one of these proposals, we have to notify all users not to set that priority.

I can imagine that we have auto-alarms; T139234 might be an example, if we take a look outside of Phabricator.

Example: Someone finds a security bug and posts it publicly, for example on a forum. This is a big risk. So someone can add Security and set UBN => the Security-Team gets an alert.

Or: Prod is down on a weekend: SRE and UBN => SRE gets an alert.

This would be a very good way for users like me: when I noticed T139234, I tried to inform the right people to upgrade, but I don't have the right addresses, unlike a staff member, for example. So users can notify the right people if there is an emergency.

The only caveat is that Phabricator is a sort of "mixed environment" where an UBN! in a deployed extension means something very different than an UBN! in, say, pywikibot. (IOW: unless the cause is on Wikimedia's side, an "all hands on deck" response from WMF RelEng or Ops or other development teams isn't practical).

Yeah, UBN tasks can be "the most important" without being even in the top 100 issues for WMF.

For example, T110451: Update ConfirmAccount to use AuthManager blocks use of ConfirmAccount on MW 1.27+, but it's not deployed to WMF production so we are unlikely to drop everything to fix it (and indeed adding more cooks to the fire may not be very helpful).

Similarly, T138781: Show an email dialog to users when we encounter an error starting the Proxy server is a feature blocker to the next release of the iOS Wikipedia app, but it's (a) not possible for anyone without the release keys to fix, and (b) probably isn't looking for more input.

Even in deployed extensions (assuming we could narrow the alert system down that much without missing stuff), UBN is murky. T138725: Special:NewItem and Special:NewProperty allow creation of items with term language as any string was UBN and was deployed (then un-deployed when wmf.9 was rolled back, and is now re-deployed) but it's still open to deal with clean-up. T138673: VisualEditor should retrieve content of all Page: pages wikitext editor field when switching from WikiEditor is an ugly content-corruption bug in the experimental Wikisource VE code and won't get better faster with more eyes. There are a couple of Fundraising ones that no doubt are messy but have lower impact. Etc.

Adding an "Are you really sure? UBN means people around the world will get paged to fix this bug!" modal might reduce alternative use patterns, but it also forces everyone using Phabricator to use it in the "one true way", which… isn't great.

Maybe we could create a new priority value. Something like one of the following:

  1. Production Emergency
  2. The sky is falling
  3. Drop Everything!
  4. Higher than Highest
  5. ZOMGWTFBBQ!

This is the intended meaning of unbreak now! So maybe we just need to create a new, lower priority for serious bugs that are not having a far-reaching production impact and move existing long-lived UBN bugs to that priority?

I don't know. Maybe we don't need to use priority for this but it IS the intended purpose of UBN!

I don't know. Maybe we don't need to use priority for this but it IS the intended purpose of UBN!

A tag for "ProductionEmergency" that auto-CCs RelEng and Greg would be much more Phabricator-like than a magical meaning of the highest priority, I think.

A tag for "ProductionEmergency" that auto-CCs RelEng and Greg would be much more Phabricator-like than a magical meaning of the highest priority, I think.

That doesn't necessarily fix all of the world's problems but it does sound like a good idea to me.

I'm still trying to figure out how to use that project effectively. And I/RelEng probably shouldn't be the only auto-ping people, someone in Ops (and Ops?) should also be subscribed. There's things RelEng can't do that are indeed productionemergencies ;)

And I/RelEng probably shouldn't be the only auto-ping people, someone in Ops (and Ops?) should also be subscribed. There's things RelEng can't do that are indeed productionemergencies ;)

Oh, indeed, and I'd stalk it too, but I was scratching the itch of "how do RelEng know there's something that might be wrong?". :-)

A view from outside sharing experiences from my past involvements in bugtracking:

Unbreak Now! is treated / understood by many people as "This really annoys me, fix it ASAP" or "I am putting all my ten fingers to vote on this to be solved soon".
Production Emergency (or a similarly descriptive name) wouldn't be used by those people, because they wouldn't see 4##, 5## errors.

It's all about semantics and being general/vague vs. specific.

I am inclined to have the extra level called Production Emergency or so, but in conjunction with some Herald rule (or other limiting mechanism) that would allow this priority to be set only by a reasonable set of people, as setting such a priority would clearly also be a highly spamming action that would buzz relevant people in the middle of the night...

Adding an "Are you really sure? UBN means people around the world will get paged to fix this bug!" modal might reduce alternative use patterns, but it also forces everyone using Phabricator to use it in the "one true way", which… isn't great.

IMO any usage pattern should be consistent with the name; it's called "Unbreak now" for a reason. Just like using "Low" for urgent tasks and "High" for less urgent ones is not a reasonable usage pattern, using "Unbreak now" to mean "this should be fixed in the next few months" is not one either. Anything that does not need to be fixed within a few days at most should not be UBN.

JAufrecht subscribed.

As per TPG practice, since a TPGer is following this directly, removing the Team-Practices tag.

Since this task was an outcome of the https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCentralAuth incident and we just had that incident's retrospective on Tuesday, I thought I would update here with some relevant information.

Retro notes: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCentralAuth/Retrospective

  • One thing that was suggested that we do was to identify all Wikimedia-deployed component "first responders". This would be for all services and MW extensions and some dissection of MW Core itself. That's a huge undertaking and has been tried before. I have given my initial thoughts in T141066 (please continue that conversation there) but there is more I need to do/write as I review, eg, the mw:Developers/Maintainers talk page.
  • Another suggestion was for me to do a weekly review of all UBN! tasks in hopes that I can identify 'true' UBN! issues and correctly action (sic) on them quickly. To that end I have scheduled a 30 minute block in my calendar at 10:30am Pacific (17:30 UTC, during this current daylight savings period).
    • I believe though, if this is to have a chance, this triage should be a public one with, for instance, a Phabricator event with associated IRC channel where at least I think out loud (in text) as I review the list of UBN!s.
    • As I/we review those tasks, however, we'll need consistent criteria by which to judge them and action (sic) on them. That is where this task is important. I/we will need to be able to review those UBN! tasks and either:
      1. Identify them as, yes, UBRTFN! and assign a "first responder" (see: T141066)/follow-up appropriately (and quickly)
      2. Identify them as not UBRTFN! and either:
        A. Reduce their priority to 'High' or lower, or
        B. Keep them at UBN! since UBN! is not special in our Phabricator instance (NB: And, implicitly this means a higher cognitive load on Greg/us as some UBN! tasks will sit that way for a while yet be not actionable in the same way)

We need to resolve this task to choose between A or B :)

Also, if the above triage turns out to be useful it will probably become one of the standard duties of that week's MW Train deployer (a duty that is rotated, every two weeks, within the Release Engineering team).

Keep them at UBN! since UBN! is not special in our Phabricator instance

That sounds super confusing. If we decide not to use UBN for emergencies that must (and can) be resolved quickly, why not just create a Wikimedia-Emergency or similar tag and use that?

Also, weekly triage sounds like a good way for getting rid of UBNs that are not actually UBNs, but for responding to real emergencies, it seems rather low-frequency. (Or do you mean "weekly" as in "every day of the week"?)

Keep them at UBN! since UBN! is not special in our Phabricator instance

That sounds super confusing.

It is :/

If we decide not to use UBN for emergencies that must (and can) be resolved quickly, why not just create a Wikimedia-Emergency or similar tag and use that?

I'm hesitant because it doesn't feel right to me, honestly.

Thinking about it more (trying to figure out how to keep it clean/useful/high signal) it could have columns for "To Triage (default)", "Active Emergency", and "Immediate fix deployed, follow up needed" (NB: need a better name for that).

This will obviously overlap with Incident Reports, which, because everything is related, there is a proposal to improve the management of them in Phabricator at T140202.

EPIPHANY! (as I'm typing this out)

#Wikimedia-Incident will be the parent project for the specific incident milestones (see T140202 for details). #Wikimedia-Incident could also be the solution to this problem. Something like this:

  1. See something horrible
  2. File task
  3. Add #Wikimedia-Incident to alert appropriate people
    1. It goes in the "To Triage (default)" column
  4. It is triaged into "Active Emergency"
  5. Fix it! (don't worry about column moving, just fix it, Greg/someone can triage it)
  6. Does it need an incident report with follow-up tasks? Create a milestone eg #Wikimedia-Incident-20160725-Whatever and move that task to it.
  7. See T140202 for how those milestones would be used

#Wikimedia-Incident would then become the project that is watched by myself and responsible people from Operations (though, don't count on us to see the bug mail, a ping on IRC/email is vastly preferred).
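
The workflow above can be sketched as a small forward-only progression through the columns. The column names are the ones proposed earlier in this comment; the forward-only rule and the `move` helper are assumptions for illustration, not something a Phabricator workboard enforces.

```python
from enum import IntEnum


class Column(IntEnum):
    """Proposed #Wikimedia-Incident workboard columns, in workflow order."""
    TO_TRIAGE = 0
    ACTIVE_EMERGENCY = 1
    FOLLOW_UP_NEEDED = 2


# Human-readable column titles as proposed in the comment.
LABELS = {
    Column.TO_TRIAGE: "To Triage (default)",
    Column.ACTIVE_EMERGENCY: "Active Emergency",
    Column.FOLLOW_UP_NEEDED: "Immediate fix deployed, follow up needed",
}


def move(current: Column, target: Column) -> Column:
    """Move a task forward; assumed rule: tasks never move back to an earlier column."""
    if target <= current:
        raise ValueError(
            f"cannot move from {LABELS[current]!r} to {LABELS[target]!r}"
        )
    return target
```

Skipping a column is allowed here on purpose, since step 5 says to fix the problem first and let triage catch up on the column afterwards.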

I... think that's a good solution, no?

Also, weekly triage sounds like a good way for getting rid of UBNs that are not actually UBNs, but for responding to real emergencies, it seems rather low-frequency. (Or do you mean "weekly" as in "every day of the week"?)

It's for the first part ("this has been UBN! for 2 weeks now, seriously?"), not really the second (responding to emergencies). Traditionally people haven't simply set to UBN! and then done nothing. They've usually pinged relevant people (myself, or someone in Ops, or the person who can fix it, etc).

greg triaged this task as High priority. Jul 25 2016, 6:32 PM

#Wikimedia-Incident will be the parent project for the specific incident milestones (see T140202 for details). #Wikimedia-Incident could also be the solution to this problem. Something like this:

  1. See something horrible
  2. File task
  3. Add #Wikimedia-Incident to alert appropriate people
    1. It goes in the "To Triage (default)" column
  4. It is triaged into "Active Emergency"
  5. Fix it! (don't worry about column moving, just fix it, Greg/someone can triage it)
  6. Does it need an incident report with follow-up tasks? Create a milestone eg #Wikimedia-Incident-20160725-Whatever and move that task to it.
  7. See T140202 for how those milestones would be used

#Wikimedia-Incident would then become the project that is watched by myself and responsible people from Operations (though, don't count on us to see the bug mail, a ping on IRC/email is vastly preferred).

+1. Hysterical Raisins.

Is there consensus that UBN! has a common, global meaning, rather than a per-project meaning? E.g., Unbreak Now! for gerrit or Phabricator is really important but not as important as Unbreak Now! for production Wikipedia; and Unbreak Now! for labs projects is lower than either. Either way raises some implications that I haven't seen thought through and documented (for example, if Unbreak Now! is only for a handful of the most critical systems, then it probably shouldn't be allowed for non-critical systems, and how would that be enforced, etc.). What is the appropriate body to make these decisions?

Is there consensus that UBN! has a common, global meaning, rather than a per-project meaning? E.g., Unbreak Now! for gerrit or Phabricator is really important but not as important as Unbreak Now! for production Wikipedia; and Unbreak Now! for labs projects is lower than either. Either way raises some implications that I haven't seen thought through and documented (for example, if Unbreak Now! is only for a handful of the most critical systems, then it probably shouldn't be allowed for non-critical systems, and how would that be enforced, etc.). What is the appropriate body to make these decisions?

I think my current proposal (T140207#2493211) side-steps this thorny issue with advanced jiu jitsu :)

(In other words: People/teams/projects are free to use UBN! as they see fit in their projects, it just makes my query that I still plan to review weekly (see: T141130) a bit longer, no huge deal.)

greg claimed this task.

Wikimedia-Incident created. Use that for all production emergencies.