
Create #site-incident tag and use it for incident reports
Closed, Declined (Public)

Description

We currently track incidents on wikitech, but track the follow-ups here on phabricator. I think it's worth considering tracking both on phab.

Benefits over wikitech:

  • see follow-up status inline
  • ability to follow all incidents by watching the site-incident tag
  • ability to add CCs and tags / projects
  • create sub-tasks for follow-ups, close the incident once all follow-ups are addressed
    • makes it relatively easy to see which incidents are ongoing / have outstanding follow-up
  • easy to mention the incident in other issues, with automatic linking of the mention in the original issue

Disadvantages compared to wikitech:

  • disruption of moving
  • site search on wikitech won't find phabricator issues (but the reverse is also true right now: phab search won't find wikitech pages)

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke set Security to None.
GWicke added subscribers: GWicke, greg.
GWicke added a subscriber: ssastry.

There's https://wikitech.wikimedia.org/wiki/Incident_documentation but also https://phabricator.wikimedia.org/project/query/all/?after=incid with two existing tags so far.
But true, that might be harder to search if you wanted to get a list of all incidents.
The topic seems to need some policy and agreement so that we don't end up with things in several places.

See the previous discussion in T929. Might want to close this as a dup and argue there instead, to keep the discussion in one place?

@Aklapper, let's use this issue to discuss the new proposal. I added a pointer in T929.

meta-comment: I want to make sure SRE is on board with any change given the incident report process came out of their group explicitly. So, let's bring this up in one of the weekly ops meetings?

In T85889#966759, @greg wrote:

meta-comment: I want to make sure SRE is on board with any change given the incident report process came out of their group explicitly. So, let's bring this up in one of the weekly ops meetings?

Has that happened yet?
Assuming "no", who plans to bring this up and should be the task assignee here? ;)

@Aklapper, we did bring it up & heard some positive noises, but I think a final decision was not yet made. @greg should know more.

@Aklapper, we did bring it up & heard some positive noises, but I think a final decision was not yet made. @greg should know more.

What he said. Next Ops meeting I guess.

Aklapper changed the task status from Open to Stalled. Feb 10 2015, 1:30 AM

Setting status to "stalled" - feel free to set back to open once this has been discussed and decided.

GWicke changed the task status from Stalled to Open. Mar 13 2015, 4:51 AM

We now have several individual incidents on phabricator, but no project / tag to unify them in.

Any objections against creating this tag?

Yes, two objections.

  1. I'd prefer someone to first push the actual *social* part (acceptance of proposed workflow!) of the process instead of creating more projects in Phabricator. Looking at https://wikitech.wikimedia.org/wiki/Incident_documentation I see enough recent outages that have no corresponding Phab project at all. Maybe that's even fine, don't know, but initial sentence of this task is: "We currently track incidents on wikitech, but track the follow-ups here on phabricator. I think it's worth considering tracking both on phab." We're far away from that it seems, with some additional tag or without that additional tag.
  2. The ticket is stalled because of T85889#990027, which has not been handled yet (or its outcome is missing from this ticket).

We did talk about it & everybody seemed to be in favor, but never made a final decision. @greg, could you chime in on current status?

I prefer to stay with what we have now. I'm short on time and can't always keep all the various wikis and projects in alignment. I have been creating per-incident projects when I'm involved with the incident (contrary to popular belief, I'm not always :) ).

What we have now is: wikitech pages and per-incident projects.

I might just replace the wikitech page with the phab project, e.g. https://phabricator.wikimedia.org/project/profile/1125/ (the description, while short, is basically the outage report; this was a quick and obvious outage with clear follow-up, hence the report not being very extensive).

Aklapper claimed this task.

I'm declining this as per the last comment (the task can be reopened at any time).