
Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages
Open, Normal, Public

Description

Based on T219450, I believe the response could have been faster if the workflow had been better.

These are things that maybe happened (and could be done better?):

  • The ticket filed, T219450, was unclear:
    • It was not UBN (should it have been?)
    • The title did not make clear what was going on
    • It had no tags (which ones should it have had?)
  • Product owners did not monitor recent Phabricator reports after a deploy (should they?)
  • The issue generated a small number of errors (gallery editing) compared to many others, so it went unnoticed after deployment
  • ...
  • What else?

The goal is to detect gaps in, and possible improvements to, the workflow. The deliverable is a set of recommendations or amendments to the workflow for reporting and responding on Phabricator (and/or other channels that may be used to report errors, such as IRC or email), with advice for users reporting issues and for those reacting to them.

In particular, we were told some people thought of making the ticket "UBN", but didn't because they were told only the developers should, so maybe some exceptions could be made on https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities

This is not an easy task, as it has to balance making enough noise to be noticed against overstepping on bug-fixing priorities.

Event Timeline

jcrespo created this task. Mar 29 2019, 7:25 AM
jcrespo edited projects, added Wikimedia-Incident; removed Wikimedia-production-error.
jcrespo moved this task from To Triage to Follow-up/Actionables on the Wikimedia-Incident board.

Adding Release Engineering, not because they should own this, but so they have a voice in improving best practices on the software side (e.g. "product owners should look at Phabricator after a deploy").

Adding Operations, as they are now technically known as Site Reliability Engineering, and providing input on this could also be in their realm.

However, I am expecting anyone on the Wikimedia development side to provide suggestions (maybe this should be an RFC?).

Cirdan added a subscriber: Cirdan. Mar 29 2019, 8:04 AM
CDanis added a subscriber: CDanis. Mar 29 2019, 1:47 PM
Joe added a subscriber: Joe. Mar 29 2019, 2:20 PM
Yann added a subscriber: Yann. Mar 31 2019, 7:29 PM
Yann added a comment. Mar 31 2019, 7:36 PM

Hi, There should be a clear way for users and bug reporters to indicate how high or low priority is an issue.

Hi, There should be a clear way for users and bug reporters to indicate how high or low priority is an issue.

You're implying that users could judge "priority" (or did you mean "urgency"?). But often they cannot.

Peachey88 added a comment. Edited Apr 1 2019, 6:54 AM

I would prefer to see someone over-prioritize a task so it shows up more easily (UBNs show up differently on IRC); someone else with more experience might then see it and deprioritize it as needed. Whereas if it is not prioritized, it could easily be missed.

Just because someone could do something wrong doesn't mean we should stop them from trying to do something.

Yann added a comment. Apr 1 2019, 10:03 PM

Hi, There should be a clear way for users and bug reporters to indicate how high or low priority is an issue.

You're implying that users could judge "priority" (or did you mean "urgency"?). But often they cannot.

Not "judge", but indicate how important the issue is to them. That they can do.

@Yann: No, because the "Priority" field is not for users to express how important things are to them. Also see https://www.mediawiki.org/wiki/Bug_management/Development_prioritization

@Yann and @Aklapper please stop discussing that (or at least stop discussing it on this ticket), as it is off-topic here (I am not agreeing or disagreeing with either of you).

This is not about prioritization of bugs. This is about what caused an outage to go unnoticed for a long time, and what advice we can give to outage responders and outage reporters to avoid that specific issue. It is off-topic because the task, as several colleagues I asked confirmed, should have been, and was, triaged UBN; so the issue is not bug prioritization. Unless, Aklapper, you are suggesting that the fix is for you to work 24 hours a day looking at untriaged tickets on Phabricator :-D

I would like to hear other stakeholders' voices, especially the point of view of Jdforrester-WMF and of Greg or a colleague of Greg.

jcrespo updated the task description. Apr 2 2019, 8:02 AM
greg added a comment. Apr 2 2019, 7:03 PM

This is a great example of an almost-worst-case scenario, sadly.

Things that I do in my role as Release Manager: I have all UBN! priority tasks on my Phabricator homepage/dashboard. I try to review those as needed throughout the week. You'll sometimes see me commenting on them asking "what's the status here? it's been UBN! for $alongtime" ;). I (though mostly that week's train conductor) also obviously watch any tasks which are set as subtasks of the weekly train task (see: Train Deployments).

This didn't show up in either place, sadly.

The part that did work was @Lucas_Werkmeister_WMDE (not adding as subscriber) finding it and setting as UBN!. After that point most things happened in a reasonable time frame. Thanks Lucas. Shortening that initial timeframe, however, should be a goal. How to do that without omniscience (in this case, where things were unclear and not tagged/prioritized in a way to show up on many people's radars) isn't clear to me.

I'm open to ideas, though.

@greg I don't have a clear suggestion yet, but some possibilities include adding a stronger guideline to https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Don't_leave_town such as "if you deploy to production (as product owner), monitor Phabricator 2 hours later to see if there are error reports related to your deploy". We know you do it and other people do it, and some probably don't, but the point is to document good practices (IF that is what we think can be improved). 0:-)

Another suggestion (not incompatible) is to improve the reporting documentation by suggesting the use of a tag (Wikimedia-production-error, Operations Wikimedia-Incident ?) or a specific Phabricator form for outages. Or "make the ticket title clearer, such as 'Editing commonswiki namespace produces an error', instead of just the error name". Or maybe we can teach and allow certain power users we trust to use UBN or other contact info in a way that we (releng and ops) would be happy with, or suggest the use of IRC instead for time-sensitive issues.

There can be a lot of issues with these proposals, which is why I want to hear from more people. Having only 1 or 2 people monitoring Phabricator 24 hours a day is, of course, not a realistic solution. :-D

For example, https://www.mediawiki.org/wiki/How_to_report_a_bug is a very well crafted page, but I think it is more geared towards "regular software bugs". I think we could create a new section for a more time-sensitive kind of bug, the "outage", which is more related to WMF infrastructure (for ops, releng) than to software problems. Again, just throwing out some ideas.

Aklapper added a comment. Edited Apr 4 2019, 2:33 AM

In particular, we were told some people thought of making the ticket "UBN", but didn't because they were told only the developers should, so maybe some exceptions could be made on https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities

https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities already says that "priority should normally be set by [...] experienced community members". Not sure what else to add, because the problem (as correctly described) is that especially new task reporters often like to set the Priority field to High or Normal for a not necessarily important enhancement request.

When it comes to getting notified about UBN tasks, Herald could be used (see e.g. https://phabricator.wikimedia.org/H315 ) to send emails in addition. This would require maintaining the list of project tags for which such emails get sent. That in turn would require a human to set the right project tags, so you'd turn the 'not-UBN' problem into a 'not the right project tag' problem. All these things depend on humans [not] doing something. Humans are error-prone.

I'd naively want to believe more in monitoring instead. If more things get monitored (Logstash etc.) and the volume of an issue passes certain thresholds (which will again require finding a good balance), I'd expect maybe paging, maybe automatically creating tasks under Wikimedia-production-error with a specific priority. Also see T185155: Improve error reporting / integration between Kibana and Phabricator.
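To make the threshold idea a bit more concrete, here is a minimal sketch in Python. It is not how Logstash, Kibana or Phabricator work today; it only assumes that per-signature error counts can be obtained for two equally long time windows (before and after a deploy). The function name, the threshold values and the error names in the usage example are all hypothetical. The point it tries to illustrate is that a brand-new error signature can be flagged even when its absolute volume is tiny compared to the usual noise, which is what happened with the gallery editing errors.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would need tuning against actual log volume.
NEW_SIGNATURE_MIN_COUNT = 5   # signature unseen before the deploy, seen at least this often
SPIKE_RATIO = 3.0             # post-deploy count at least this many times the baseline
PAGE_MIN_COUNT = 1000         # volume at which a human is paged instead of a task being filed


@dataclass
class Alert:
    signature: str
    action: str   # "page" or "file_task"
    reason: str


def detect_regressions(baseline, current):
    """Compare per-signature error counts from two equal-length windows.

    `baseline` and `current` map an error signature to its count
    before and after the deploy, respectively.
    """
    alerts = []
    for signature, count in sorted(current.items()):
        before = baseline.get(signature, 0)
        if before == 0:
            # New signature: flag it even at low volume, but ignore one-offs.
            if count < NEW_SIGNATURE_MIN_COUNT:
                continue
            reason = "new error signature after deploy"
        elif count / before >= SPIKE_RATIO:
            reason = f"rate rose {count / before:.1f}x over baseline"
        else:
            continue
        action = "page" if count >= PAGE_MIN_COUNT else "file_task"
        alerts.append(Alert(signature, action, reason))
    return alerts


if __name__ == "__main__":
    # Made-up numbers: a noisy existing error and a small, new one.
    baseline = {"DBQueryError": 900, "TimeoutException": 40}
    current = {"DBQueryError": 950, "TimeoutException": 45, "GalleryEditFatal": 12}
    for alert in detect_regressions(baseline, current):
        print(alert)
```

With these made-up numbers only the new, low-volume signature is reported, and as a task rather than a page. Finding thresholds that avoid both missed outages and alert fatigue is the "good balance" problem mentioned above.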

Yann added a comment. Apr 4 2019, 12:16 PM

Hi, Sorry, but where to discuss this if not here?

I agree that many users lack the technical knowledge to define the priority, but some do have it.
Actually I would support another field specifically for users, but I don't know Phabricator well enough.

Yann, this task is about detecting large regressions and outages. Please discuss general prioritizing of tasks (which is not UBN specific) or adding "another field" in a better suited place. Also note that the "but some do" users can already set Priority. Thanks!

fgiunchedi triaged this task as Normal priority. Apr 9 2019, 8:38 AM