New project: "Wikimedia Incident Response"
Closed, Resolved, Public

Description

Useful for tagging things that are filed as the result of a site outage, etc. These will usually also be filed under operations or a relevant bug category, but this project is needed for tracking incidents collectively. Cf. the incident docs on wikitechwiki.

Willing to bikeshed on the name a little if people have a better one. Otherwise if no one objects I'll go ahead and create this one.

Event Timeline

demon claimed this task.
demon raised the priority of this task from to Needs Triage.
demon updated the task description.
demon added projects: Phabricator, acl*sre-team.
demon changed Security from none to None.
demon added a subscriber: greg.
demon added a subscriber: demon.

Question: who else's incident response would it be? (No, really asking.) We have often talked about the need to minimize "wikimedia" in all names, as it is implied. Curious if this is one of those cases.

Tagging all tickets related to a certain incident (like the Search issue today with T928 and https://bugzilla.wikimedia.org/show_bug.cgi?id=72559 ) or what's the scope?
Projects would have some naming scheme that includes some calendar date I assume?

I don't see a reason to rush so I'd like to give this a bit more thought.

Qgil triaged this task as Medium priority. Oct 28 2014, 9:17 AM
Qgil added a subscriber: Qgil.

Tagging all tickets related to a certain incident (like the Search issue today with T928 and https://bugzilla.wikimedia.org/show_bug.cgi?id=72559 ) or what's the scope?
Projects would have some naming scheme that includes some calendar date I assume?

We can probably just copy the very verbose method I used on the incident response wiki pages: YYYYMMDD-ServiceName, probably preceded by Incident- to make it clear what's going on. Or if something else (simpler) is thought of, let's go with that :)

So, #Incident-YYYYMMDD-CamelCaseServiceName, while long, is probably fine.

This (something like this) is something I've wanted for a long time. I hate maintaining that wiki page (and subpages).

I don't see a reason to rush so I'd like to give this a bit more thought.

Yeah, most issues that we're going to face should probably still be reported in BZ to reduce split-brain, but I think we should be ready for this.

I forget how locked down projects are right now and how locked down they are planned to be in the future, but at least I should be added to the group of people who can create projects so I can create these ASAP (relative to the incident).

In T929#16146, @greg wrote:

We can probably just copy the very verbose method I used on the incident response wiki pages: YYYYMMDD-ServiceName, probably preceded by Incident- to make it clear what's going on. Or if something else (simpler) is thought of, let's go with that :)

Could you share example URLs for those not familiar with this incident response problem, please?

So, #Incident-YYYYMMDD-CamelCaseServiceName, while long, is probably fine.

The good thing about just one project (called Incident-Response or similar) is that people usually involved in these situations can join and watch this project. This, plus the Unbreak Now! priority, should be a clear signal that reaches the right people quickly. Tasks filed urgently under a new project will be less likely to be detected.

Your example might be more useful if there is a rather complex incident, in terms of tasks involved and time required to fix them. In these situations it might make sense to create an additional specific project to track them.

I forget how locked down projects are right now and how locked down they are planned to be in the future, but at least I should be added to the group of people who can create projects so I can create these ASAP (relative to the incident).

You are now CCed to T706: Requests for addition to the #acl*Project-Admins group (in comments)

In T929#19819, @Qgil wrote:
In T929#16146, @greg wrote:

We can probably just copy the very verbose method I used on the incident response wiki pages: YYYYMMDD-ServiceName, probably preceded by Incident- to make it clear what's going on. Or if something else (simpler) is thought of, let's go with that :)

Could you share example URLs for those not familiar with this incident response problem, please?

All linked from https://wikitech.wikimedia.org/wiki/Incident_documentation

eg: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140619-parsercache

So, #Incident-YYYYMMDD-CamelCaseServiceName, while long, is probably fine.

The good thing about just one project (called Incident-Response or similar) is that people usually involved in these situations can join and watch this project. This, plus the Unbreak Now! priority, should be a clear signal that reaches the right people quickly. Tasks filed urgently under a new project will be less likely to be detected.

Your example might be more useful if there is a rather complex incident, in terms of tasks involved and time required to fix them. In these situations it might make sense to create an additional specific project to track them.

Except that's not how we use them :)

Incident response pages are for after the incident is done, not during. They're "what the hell happened, how did we fix it, and (most importantly) what are we going to do to make it not happen in the future." That last bit is what we want to track with these projects (plural). See, eg:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20140619-parsercache

Those are long term things and we want to track them for completion as they relate to the outage.

Please, let us create per-incident projects for this. They'll be small (probably no more than 5 or so tasks, if history is to be believed) but they're useful to track follow-up work (and waaaaaaaay better than the crappy wiki pages we're using now).

We might change how we do things in the future (to only using one project for all incident follow-up) but I don't see that happening now or in the near future (I'm not convinced it will be useful, I'm convinced it will be a regression in functionality).

In T929#19942, @greg wrote:

Except that's not how we use them :)

Context matters! Thank you for the explanation. Then yes, your proposal makes total sense.

If @Aklapper also agrees, let's do it this way the next time an incident occurs (or if you want to document past incidents after the Bugzilla migration). Meanwhile, your nearest Phabricator team member can create the tags without going through new requests using the formula you propose:

#Incident-YYYYMMDD-CamelCaseServiceName

(And T706 permitting, you should get those permissions eventually.)

In T929#19952, @Qgil wrote:

Context matters! Thank you for the explanation. Then yes, your proposal makes total sense.

:) :)

In T929#19952, @Qgil wrote:

If @Aklapper also agrees, let's do it this way the next time an incident occurs

He agrees.

Meanwhile, your nearest Phabricator team member can create the tags without going through new requests using the formula you propose:

#Incident-YYYYMMDD-CamelCaseServiceName

Was wondering whether there's some way to express that "incident" refers to things going wrong on WMF servers but could not find other stuff in Wikimedia that I'd call incidents and don't want to make that name even longer. So let's do it like that!

Since I am in the Project-Creators group, I guess we can call this done. I'll start creating projects for incidents as they happen (I don't have it in me right now to create the old ones).
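
For anyone creating these by hand, the agreed formula (#Incident-YYYYMMDD-CamelCaseServiceName) could be generated along these lines. This is only a rough sketch: the helper name `incident_tag` and the rule of splitting the service name on non-alphanumeric characters are assumptions made for illustration, not anything Phabricator provides or anything decided in this task.

```python
import re
from datetime import date


def incident_tag(day: date, service: str) -> str:
    """Build a tag of the form #Incident-YYYYMMDD-CamelCaseServiceName.

    Hypothetical helper: the word-splitting rule is an assumption.
    """
    # Split the service name into alphanumeric words (assumed convention).
    words = re.findall(r"[A-Za-z0-9]+", service)
    if not words:
        raise ValueError("service name needs at least one alphanumeric character")
    # Capitalize the first letter of each word and keep the rest unchanged.
    camel = "".join(word[0].upper() + word[1:] for word in words)
    return f"#Incident-{day:%Y%m%d}-{camel}"


# Date taken from the 20140619-parsercache wiki page linked above.
print(incident_tag(date(2014, 6, 19), "parser cache"))
```

A service name written as a single word (such as "parsercache") would only get its first letter capitalized, so whoever creates the project still has to decide where the word boundaries are.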

A follow-up proposal to move all of the incident response to phab is at T85889. Please chime in.