Improve visibility of incoming operations tasks
Open, Medium, Public

Description

I’d like to propose a change (what I think would be an improvement) to the handling of operations clinic-duty incoming tasks.

In the ops clinic duty runbook section called “review incoming tasks” (https://wikitech.wikimedia.org/wiki/Ops_Clinic_Duty#Review_incoming_tasks) there are a few mentions of actions to take against “incoming tasks”, and a link to a dashboard (https://phabricator.wikimedia.org/dashboard/view/45/) that is intended to visualize them.

Thing is, currently the ops clinic duty dashboard displays hundreds of tasks spanning many years. This makes it quite cumbersome to differentiate incoming tasks from existing tasks, and triaged from un-triaged. There is sort by “needs triage”, but tasks created with a priority will circumvent “needs triage” and are sent into the haystack. And even priority is a questionable way to sort in this case as there are, for instance, at least a dozen unassigned “highs” on the dashboard that are several years old.

I think we are in need of a way to quickly visualize the issues that are truly “incoming” so that the on-duty person has a queue to work each day/week until it is empty.

To accomplish that I propose we enable the operations project workboard (https://phabricator.wikimedia.org/project/board/1025/) and create columns representing task statuses as they relate to clinic-duty. Something like:

  • Untriaged (Backlog) (Default) - Tasks that are “incoming” and have not yet been reviewed by clinic duty.
  • Acknowledged - Triaged tasks that operations is responsible for fulfilling and for which the "review incoming tasks" steps have been completed.
  • Radar - Triaged tasks that operations is not responsible for fulfilling, but wishes to keep an eye on.

The “on-duty” responsibilities would remain virtually the same. The runbook would simply be updated to include a link to the workboard to monitor for incoming tasks, along with instructions for moving tasks into the appropriate column as they are triaged.

When clinic duty shifts change the backlog is handed over empty, and the process continues.

Thoughts, feedback, etc. are welcome of course.

Event Timeline

herron triaged this task as Medium priority. Jun 18 2018, 5:35 PM
herron created this task.

(Just for completeness: the upper right corner of project workboards also allows filtering for the most recent tasks via Advanced Filter...Created After.)

I'm +1 for switching to a board for clinic duty, with the added bonus of displaying the task status beside SRE when browsing tasks.

Vvjjkkii renamed this task from Improve visibility of incoming operations tasks to rpaaaaaaaa. Jul 1 2018, 1:03 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from rpaaaaaaaa to Improve visibility of incoming operations tasks. Jul 2 2018, 11:54 AM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
Dzahn claimed this task.
Dzahn added a subscriber: Dzahn.

I would like it if we could have a new task status besides Resolved/Open/etc. The status that I feel is missing would be "New", a state before "Open".

In our former ticket system all new tickets had status "new", and once somebody made the first edit (leaving a comment, assigning, setting priority) they would be switched, automatically or manually, to status "open". So "open" was the equivalent of the suggested "Acknowledged" column and "new" was the equivalent of "needs triage".

I think it would help if we could have that status before "open/ack" again, one way or another.

Also, we once had a distinction between tickets that we created ourselves in our team and requests from others outside our team (core-ops vs. ops-requests). That meant an on-duty triager could focus first on the external requests, which might also be helpful nowadays.

Dzahn removed Dzahn as the assignee of this task.

(closed by accident, oops)

Projects are free to use workboard columns for such customizations. No need for more global statuses IMO.

Let's go forward with this. The workboard is enabled, and a few columns have been created (acknowledged and radar). Backlog is still "backlog" for the time being since all operations tasks currently sit there and it might be confusing to show all existing operations tasks as untriaged.

@Aklapper how would you suggest transitioning the ~1400 existing tasks currently in "backlog" on the workboard to "acknowledged" without loads of manual work and triggering notifications? FWIW I see the bulk edit option in phab, but don't have permission to use it.

See https://phabricator.wikimedia.org/project/profile/13/ for how to get permissions; or (probably better) I could use https://phabricator.wikimedia.org/p/Phabricator_maintenance/ once someone fixes T205258...

The columns exist on https://phabricator.wikimedia.org/project/board/1025/ and I'm tempted to silently mass-move, but I have no idea whether I'd run into T205258 (Mass-edits via @Phabricator_maintenance account stop after 11 tasks) again.
But I'd like to try (and find out, and maybe fail, or maybe find some workaround, or maybe pester Mukunda to investigate T205258).

So I am explicitly asking if this has been discussed and agreed on in SRE, and if I may [try to] go forward here? ([ ] Yes [ ] No [ ] Whatever.) TIA!

Great! This has been circulated a couple times via the operations mailing list, and raised during a few Ops/SRE meetings as well. There hasn't been any feedback in opposition to enabling the SRE workboard that I'm aware of, so afaict the answer is "Yes/Whatever".

With that said, I'll raise it once more at the SRE infra foundations meeting at noon today for last feedback and follow up this afternoon (Eastern TZ)

No Objections. Ready to go forward with this!

Sigh. Cannot. Phab broken.

(I logged in as @Phabricator_maintenance, I clicked "Move Tasks to Column..." in the dropdown of the "Backlog" column header on the Operations workboard, chose "Operations" and then "Move to Column: Acknowledged", and clicked the "Move Tasks" button. That was 90 minutes ago and nothing has happened since then.)

As there is no progress in T205258, I guess next step would be using the API to see if that could work around T205258. Something like

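#!/bin/bash
# Move each listed task to the Acknowledged column via the Conduit API.
# Replace the Txxxx placeholders with real task IDs.
for i in Txxxx Txxxx Txxxx
do
  # Build the maniphest.edit request for task $i and pipe it to arc.
  printf '{"transactions": [{"type": "column", "value": ["PHID-PCOL-knvdos3w6r5kdmycbgir"]}], "objectIdentifier": "%s"}\n' "$i" \
    | /var/www/html/phab/arcanist/bin/arc call-conduit maniphest.edit
  sleep 3  # pause between calls
done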

(PHID-PCOL-knvdos3w6r5kdmycbgir is currently the Acknowledged column on the SRE workboard. To get the PHID of a column, go to https://phabricator.wikimedia.org/conduit/method/project.column.search/ and enter {"projects":["PHID-PROJ-5hj6ygnanfu23mmnlvmd"]} in constraints.)
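For reference, here is a rough sketch of how the same column lookup and the per-task move request could be built on the command line instead of in the web console. The two function names are made up for illustration and are not part of arc or Phabricator; the sketch also assumes that arc is on the PATH and that ~/.arcrc already holds a valid token. The functions only print the JSON request bodies, so nothing is sent unless the output is piped to arc.

```shell
#!/bin/bash
# Hypothetical helper (not part of arc): print the project.column.search
# request body for one project PHID.
column_search_payload() {
  printf '{"constraints": {"projects": ["%s"]}}\n' "$1"
}

# Hypothetical helper (not part of arc): print the maniphest.edit request body
# that moves task $1 (e.g. T12345) to the column with PHID $2.
move_task_payload() {
  printf '{"transactions": [{"type": "column", "value": ["%s"]}], "objectIdentifier": "%s"}\n' "$2" "$1"
}

# Example usage (not executed here; assumes arc is configured):
#   column_search_payload PHID-PROJ-5hj6ygnanfu23mmnlvmd | arc call-conduit project.column.search
#   move_task_payload Txxxx PHID-PCOL-knvdos3w6r5kdmycbgir | arc call-conduit maniphest.edit
```

Keeping the payload construction separate from the arc call makes it possible to inspect the requests first, as a kind of dry run, before anything touches real tasks.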

Note to myself: One would have to edit ~/.arcrc accordingly after getting the token of the @Phabricator_maintenance account after logging into it.

Doing this now by moving public open tasks from the Backlog to the Acknowledged column on the SRE workboard using the Conduit API. Note that @Phabricator_maintenance rightfully can't deal with access-restricted tasks (ex: WMF-NDA, Security tasks, S4, S6), so those will need to be moved manually.

Not sure if updates are needed for https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty or https://phabricator.wikimedia.org/dashboard/view/45/

Done; terminated the API token of @Phabricator_maintenance. I stopped at T213918 as I'm not sure up to which point recent tasks have actually been acknowledged.

transitioning the ~1400 existing tasks currently in "backlog" on the workboard to "acknowledged"

So I just got 250 surprise notifications overnight, which makes it hard to see the actual notifications I would like to read.

@Dzahn: Which exact type of notifications are you referring to? If it's from Phab itself: @Phabricator_maintenance activity itself should be silent (in theory).

@Aklapper they were all of the "Phabricator_maintenance moved ... " type.

@Dzahn: handled in T216867

Wondering how to proceed with https://phabricator.wikimedia.org/project/board/1025/

In my understanding:

  1. SRE needs to define a date / task ID threshold, up to which task ID to move tasks from the Backlog to Acknowledged column.
  2. Decide and agree on which Phab user account(s) will move the tasks. This is related to task access restrictions. In my understanding:
    • Public tasks could be moved by @Phabricator_maintenance. This will still trigger notifications due to T216867. That account cannot access or move restricted tasks.
    • Some non-public tasks (and public tasks) could be moved by @Aklapper or someone else with sufficient access. @Aklapper could access and move Security tasks, WMF-NDA tasks, and tasks in Spaces like S4. @Aklapper cannot access or move tasks in some other Spaces like S6. In any case, these actions will also trigger notifications.
    • In any case, if T205258 turns out to still be an issue, the acting account could still manually drag and drop from the Backlog to Acknowledged column.
    • All this still leaves the question of who should move other tasks, such as those in S6 (if tasks exist in that Space, I can only assume): Temporarily provide @Aklapper or @Phab_maintenance access/membership to S6? Have someone else with access do that (who)?
  3. Someone (singular or plural) to actually move tasks, as agreed on in step 2.
  4. Document workflow in https://wikitech.wikimedia.org/wiki/SRE_Clinic_Duty#Review_incoming_tasks
  5. Communicate workflow (ops@, team meeting, etc?) and have SRE apply the workflow consistently.
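
To illustrate the task ID threshold from step 1, here is a small sketch of how a list of task IDs could be cut off at an agreed ID before being fed to a mass-move script. The function name tasks_up_to is made up for this example, and T213918 is used only because it is the ID mentioned earlier in this task; the real threshold is for SRE to decide.

```shell
#!/bin/bash
# Hypothetical helper: read task IDs (one per line, e.g. T213918) from stdin
# and print only those whose numeric part is at or below the threshold ID.
tasks_up_to() {
  local threshold="${1#T}"   # strip the leading "T" from the threshold
  local id num
  while IFS= read -r id; do
    num="${id#T}"
    # Skip blank or malformed lines rather than guessing what they mean.
    case "$num" in
      ''|*[!0-9]*) continue ;;
    esac
    if [ "$num" -le "$threshold" ]; then
      printf '%s\n' "$id"
    fi
  done
}

# Example usage (the input file name is an assumption):
#   tasks_up_to T213918 < backlog_task_ids.txt
```

Filtering on the numeric part of the ID works as a stand-in for a creation date because task IDs are assigned in increasing order.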

moving public open tasks from Backlog to Acknowledged

I think I may have a lack of understanding here, but if a bot or somebody outside the team moves tasks to "acknowledged", doesn't that negate the whole point of them being acknowledged by the team?

That's about the existing backlog and an action to perform once. See #1 in my previous comment.

transitioning the ~1400 existing tasks currently in "backlog" on the workboard to "acknowledged" without loads of manual work and triggering notifications?

I see... though this assumes all of those have actually been acknowledged by somebody, and I kind of doubt that is the case. I'm also a bit skeptical about what additional benefit we get from moving tasks between the 2 columns. So far the only one I see is that a ticket creator gets some kind of feedback that their task has been seen by somebody, but it does not mean it will be resolved any quicker than without doing that. This might lead to wrong expectations and more frustration. I'm also a bit worried that the clinic duty person will just routinely move tasks to the new column (more notifications) without any change in how quickly anything is resolved.

I do think that we would need to be consistent about what constitutes "Acknowledged" (or a similar column name). IMO the workboard transition action would indicate that the clinic duty task triage work was done. The task has been prioritized, relevant people/groups/tags have been added, etc.

The benefits from my perspective are an easy to read todo list for the clinic duty person, quick feedback to users that the task has been seen, and hopefully looping in relevant parties faster.

(For me it's up to what works best for SRE. Just tell me how / if I can help to get this task closer to resolved status... :) If my previous understanding is wrong and there is no need for me/someone to mass-move older open tasks from Backlog to Acknowledged and there's nothing left to do in this task for me, also good!)