New project: "Wikimedia Incident Response"
Closed, Resolved · Public

Assigned To

Authored By

	• demon
	Oct 27 2014, 5:49 PM

Description

Useful for tagging tasks that are filed as the result of a site outage, etc. These will usually also be filed under operations or a relevant bug category, but this project is needed for tracking incidents collectively. Cf. the incident docs on wikitech.

Willing to bikeshed on the name a little if people have a better one. Otherwise if no one objects I'll go ahead and create this one.

Related Objects

Mentioned In: T140202: Create #Wikimedia-Incident for tracking Wikimedia incident actionables
T94275: BoilerPlate extension should incorporate best test and CI practices
T85889: Create #site-incident tag and use it for incident reports
Mentioned Here: T85889: Create #site-incident tag and use it for incident reports
T706: Requests for addition to the #acl*Project-Admins group (in comments)
T928: Search in Phabricator does not work

Event Timeline

• demon created this task. Oct 27 2014, 5:49 PM

• demon claimed this task.

• demon raised the priority of this task from to Needs Triage.

• demon updated the task description.

• demon added projects: Phabricator, acl*sre-team.

• demon changed Security from none to None.

• demon added a subscriber: greg.

• demon subscribed.

Question: who else's incident response would it be? (no really asking) We have talked often of the need to minimize "wikimedia" in all names as it is implied. Curious if this is one of those cases.

Legit point :)

Tagging all tickets related to a certain incident (like the Search issue today with T928 and https://bugzilla.wikimedia.org/show_bug.cgi?id=72559 ) or what's the scope?
Projects would have some naming scheme that includes some calendar date I assume?

I don't see a reason to rush so I'd like to give this a bit more thought.

• demon removed a project: acl*sre-team. Oct 27 2014, 9:27 PM

Qgil triaged this task as Medium priority. Oct 28 2014, 9:17 AM

Qgil subscribed.

In T929#15867, @Aklapper wrote:

Tagging all tickets related to a certain incident (like the Search issue today with T928 and https://bugzilla.wikimedia.org/show_bug.cgi?id=72559 ) or what's the scope?
Projects would have some naming scheme that includes some calendar date I assume?

We can probably just copy the very verbose method I used on the incident response wiki pages: YYYYMMDD-ServiceName, probably preceded by Incident- to make it clear what's going on. Or if something else (simpler) is thought of, let's go with that :)

So, #Incident-YYYYMMDD-CamelCaseServiceName, while long, is probably fine.

This (something like this) is something I've wanted for a long time. I hate maintaining that wiki page (and subpages).

I don't see a reason to rush so I'd like to give this a bit more thought.

Yeah, most issues that we're going to face should probably still be reported in BZ to reduce split-brain, but I think we should be ready for this.

I forget how locked down projects are right now and how locked down they are planned to be in the future, but at least I should be added to the group of people who can create projects so I can create these ASAP (relative to the incident).

Qgil added a subtask: T706: Requests for addition to the #acl*Project-Admins group (in comments). Oct 28 2014, 4:44 PM

In T929#16146, @greg wrote:

We can probably just copy the very verbose method I used on the incident response wiki pages: YYYYMMDD-ServiceName, probably preceded by Incident- to make it clear what's going on. Or if something else (simpler) is thought of, let's go with that :)

Could you share example URLs for those not familiar with this incident response problem, please?

So, #Incident-YYYYMMDD-CamelCaseServiceName, while long, is probably fine.

The good thing about just one project (called Incident-Response or similar) is that people usually involved in these situations can join and watch this project. This, plus the Unbreak Now! priority should be a clear signal arriving to the right people soon. Tasks filed urgently under a new project will be less likely to be detected.

Your example might be more useful if there is a rather complex incident, in terms of tasks involved and time required to fix them. In these situations it might make sense to create an additional specific project to track them.

I forget how locked down projects are right now and how locked down they are planned to be in the future, but at least I should be added to the group of people who can create projects so I can create these ASAP (relative to the incident).

You are now CCed to T706: Requests for addition to the #acl*Project-Admins group (in comments)

In T929#19819, @Qgil wrote:

In T929#16146, @greg wrote:

We can probably just copy the very verbose method I used on the incident response wiki pages: YYYYMMDD-ServiceName, probably preceded by Incident- to make it clear what's going on. Or if something else (simpler) is thought of, let's go with that :)

Could you share example URLs for those not familiar with this incident response problem, please?

All linked from https://wikitech.wikimedia.org/wiki/Incident_documentation

eg: https://wikitech.wikimedia.org/wiki/Incident_documentation/20140619-parsercache

So, #Incident-YYYYMMDD-CamelCaseServiceName, while long, is probably fine.

The good thing about just one project (called Incident-Response or similar) is that people usually involved in these situations can join and watch this project. This, plus the Unbreak Now! priority should be a clear signal arriving to the right people soon. Tasks filed urgently under a new project will be less likely to be detected.

Your example might be more useful if there is a rather complex incident, in terms of tasks involved and time required to fix them. In these situations it might make sense to create an additional specific project to track them.

Except that's not how we use them :)

Incident response pages are for after the incident is done, not during. They're "what the hell happened, how did we fix it, and (most importantly) what are we going to do to make it not happen in the future." That last bit is what we want to track with these projects (plural). See, eg:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20140619-parsercache

Those are long term things and we want to track them for completion as they relate to the outage.

Please, let us create per-incident projects for this. They'll be small (probably no more than 5 or so tasks, if history is to be believed) but they're useful to track follow-up work (and waaaaaaaay better than the crappy wiki pages we're using now).

We might change how we do things in the future (to only using one project for all incident follow-up) but I don't see that happening now or in the near future (I'm not convinced it will be useful, I'm convinced it will be a regression in functionality).

In T929#19942, @greg wrote:

Except that's not how we use them :)

Context matters! Thank you for the explanation. Then yes, your proposal makes total sense.

If @Aklapper also agrees, let's do it this way the next time an incident occurs (or if you want to document past incidents after the Bugzilla migration). Meanwhile, your nearest Phabricator team member can create the tags without going through new requests using the formula you propose:

#Incident-YYYYMMDD-CamelCaseServiceName

(And T706 permitting, you should get those permissions eventually.)
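For illustration, a tag following the formula above could be generated with a small helper. This is only a sketch: the function name `incident_tag` and the rule for turning a service name into CamelCase are assumptions, not something agreed on in this task.

```python
import re
from datetime import date


def incident_tag(day: date, service: str) -> str:
    """Build a tag of the form #Incident-YYYYMMDD-CamelCaseServiceName.

    Hypothetical helper: splitting the service name on non-alphanumeric
    characters and capitalizing each piece is an assumed convention.
    """
    words = re.findall(r"[A-Za-z0-9]+", service)
    # Capitalize only the first letter so existing inner capitals survive.
    camel = "".join(word[:1].upper() + word[1:] for word in words)
    return f"#Incident-{day:%Y%m%d}-{camel}"


print(incident_tag(date(2014, 6, 19), "parser cache"))
```

The date comes first after the prefix so that tags sort chronologically, matching the YYYYMMDD-ServiceName pattern of the existing wiki page titles.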

In T929#19952, @Qgil wrote:

Context matters! Thank you for the explanation. Then yes, your proposal makes total sense.

:) :)

Qgil moved this task from To Triage to Need discussion on the Phabricator board. Nov 9 2014, 11:01 PM

In T929#19952, @Qgil wrote:

If @Aklapper also agrees, let's do it this way the next time an incident occurs

He agrees.

Meanwhile, your nearest Phabricator team member can create the tags without going through new requests using the formula you propose:
#Incident-YYYYMMDD-CamelCaseServiceName

Was wondering whether there's some way to express that "incident" refers to things going wrong on WMF servers, but I could not find other stuff in Wikimedia that I'd call incidents, and I don't want to make that name even longer. So let's do it like that!

Qgil closed subtask T706: Requests for addition to the #acl*Project-Admins group (in comments) as Resolved. Nov 25 2014, 12:15 PM

• demon added a project: Project-Admins. Nov 25 2014, 3:13 PM

Since I am in the Project-Creators group, I guess we can call this done. I'll start creating projects for incidents as they happen (I don't have it in me right now to create the old ones).

Aklapper mentioned this in T85889: Create #site-incident tag and use it for incident reports. Jan 7 2015, 3:52 PM

A follow-up proposal to move all of the incident response to phab is at T85889. Please chime in.

• Spage mentioned this in T94275: BoilerPlate extension should incorporate best test and CI practices. Mar 27 2015, 10:13 PM

doctaxon reopened subtask T706: Requests for addition to the #acl*Project-Admins group (in comments) as Open. Oct 13 2015, 5:26 PM

Restricted Application added a subscriber: scfc. Oct 13 2015, 5:26 PM

Aklapper closed subtask T706: Requests for addition to the #acl*Project-Admins group (in comments) as Resolved. Oct 13 2015, 5:32 PM

greg added a project: Essential-Work. Jan 11 2016, 10:50 PM

Restricted Application added a subscriber: Luke081515. Jan 11 2016, 10:50 PM

• mmodell reopened subtask T706: Requests for addition to the #acl*Project-Admins group (in comments) as Stalled. Mar 1 2016, 10:38 PM

Danny_B moved this task from Incoming to Projects to create on the Project-Admins board. May 20 2016, 10:19 PM

Restricted Application added a subscriber: TerraCodes. May 20 2016, 10:20 PM

Danny_B mentioned this in T140202: Create #Wikimedia-Incident for tracking Wikimedia incident actionables. Jul 13 2016, 12:53 PM

Danny_B removed a subtask: T706: Requests for addition to the #acl*Project-Admins group (in comments). Jul 30 2016, 5:10 PM
