
Revise phabricator.wikimedia.org service level
Closed, Declined · Public

Description

As per T76381: Determine phabricator.wikimedia.org service level, we need to revise Operations' phabricator.wikimedia.org service level whenever we start planning the Gerrit migration.

Related Objects

Status     Assigned
Resolved   Dzahn
Resolved   Cmjohnson
Resolved   Dzahn
Resolved   Danny_B
Resolved   Paladox
Open       Dzahn
Resolved   demon
Resolved   demon
Resolved   Paladox
Resolved   Nemo_bis
Resolved   demon
Resolved   Paladox
Resolved   Krenair
Resolved   mmodell
Invalid    None
Declined   None
Resolved   demon
Invalid    None
Invalid    None
Resolved   Qgil
Declined   None
Duplicate  None
Declined   greg

Event Timeline

Qgil raised the priority of this task from to Low.
Qgil updated the task description. (Show Details)
Qgil added a project: Gerrit-Migration.
Qgil changed Security from none to None.
Qgil added subscribers: Aklapper, Qgil, greg and 16 others.

See also: https://secure.phabricator.com/T4209 (Multiserver / High-Availability Configuration)

@chasemp: pinging you because I like you. Can you give me an example of what "service level" means here, from Ops perspective? Feel free to punt to someone else :)

In T76381: Determine phabricator.wikimedia.org service level there wasn't much discussion about what that phrase meant, beyond "like bugzilla" or "more than bugzilla" (or suggestions that were too specific, like "redundant switch on the rack").

Greg, stop liking me. :) I don't know if there is a general Ops perspective here, but I can give you my perspective.

When I use the language it relates to the following things:

  • How much downtime we expect to incur per year, and how long it should take to recover from a single failure (MTTR). If we are OK with 4 hours of downtime every year or so from some hardware failure, that's one thing. If we expect never to lose Phab for more than an hour at a time, that's another.
  • What level of response to expect when the service is having issues, and for whom we should set that expectation.
  • At what point during an outage we should cut and run and set up a second box. Sometimes this is called something like "recovery escalation": e.g. if Phab has been down for 2 hours and we think it will take 8 hours to reinstall on a new box versus 10 to fix the existing one, setting up on a new host means more aggregate hours of work but brings the service back online 2 hours sooner. Who makes that call, and when? (A rough worked example of that arithmetic is sketched just below this list.)
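
To make the arithmetic above concrete, here is a minimal illustrative sketch in Python. All of the numbers are hypothetical, lifted from the examples in this comment (a roughly 4-hour-per-year downtime budget, a 2-hour-old outage, 8 hours to rebuild versus 10 to repair); nothing here reflects an actual commitment for phabricator.wikimedia.org.

```python
# Illustrative sketch only: the availability targets and recovery-time numbers
# below are hypothetical, taken from the examples in the comment above.

HOURS_PER_YEAR = 365.25 * 24  # ~8766 hours

def downtime_budget_hours(availability: float) -> float:
    """Allowed downtime per year, in hours, for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability)

# "OK with ~4 hours of downtime a year" corresponds to roughly 99.95% availability.
for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} availability -> "
          f"{downtime_budget_hours(target):.1f} h/year of allowed downtime")

# Recovery escalation example: the outage is already 2 hours old, rebuilding on
# a new box is estimated at 8 more hours, fixing the existing box at 10 more.
elapsed_h = 2
rebuild_h = 8
fix_in_place_h = 10

print(f"Rebuild on new host: service restored after {elapsed_h + rebuild_h} h of total outage")
print(f"Fix existing host:   service restored after {elapsed_h + fix_in_place_h} h of total outage")
# Rebuilding costs more aggregate work hours but restores service 2 hours sooner;
# the "service level" question is who is empowered to make that call, and when.
```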

Things I usually mean that probably apply less here:

  • What level and method of escalation should we employ when the above aren't being met
  • Who is the highest stakeholder and when to involve them
  • What level of resources should we reserve to make sure the above are met

It's bureaucratic speak, my man, and it's somewhat disturbing that I find it comforting.

greg claimed this task.

Thanks man, that helps.

Given that this is not a framework in common use by WMF Ops (for better or worse), I'm going to close this task for now.

I'm happy to revisit this later with Ops as needed, but I'll let them initiate that conversation.

Much love.