
Revise service level
Closed, DeclinedPublic


As per T76381: Determine service level, we need to revise Operations' service level whenever we start planning the Gerrit migration.

Event Timeline

Qgil created this task.Dec 2 2014, 10:00 AM
Qgil raised the priority of this task from to Low.
Qgil updated the task description.
Qgil added a project: Gerrit-Migration.
Qgil changed Security from none to None.
Qgil added subscribers: Aklapper, Qgil, greg and 16 others.
hashar removed a subscriber: hashar.Dec 2 2014, 10:20 AM
bd808 added a comment.Dec 2 2014, 4:02 PM

See also: (Multiserver / High-Availability Configuration)

demon removed a subscriber: demon.Dec 16 2014, 6:10 PM
greg added a comment.Sep 3 2015, 6:00 PM

@chasemp: pinging you because I like you. Can you give me an example of what "service level" means here, from Ops perspective? Feel free to punt to someone else :)

In T76381: Determine service level, there wasn't much discussion of what that term meant beyond "like bugzilla" or "more than bugzilla" (and then it got too specific, like "redundant switch on the rack").

@chasemp: pinging you because I like you. Can you give me an example of what "service level" means here, from Ops perspective? Feel free to punt to someone else :)

Greg, stop liking me. :) I don't know if there is a general Ops perspective here, but I can give you my perspective.

When I use the language it relates to the following things:

  • How much downtime we expect to incur per year, and how long recovery takes during a failure (MTTR). If we are OK with 4 hours of downtime every year or so for some hardware failure, that's one thing. If we expect never to lose Phab for more than an hour at a time, that's another.
  • What level of response to expect (and for whom we should set that expectation) when the service is having issues.
  • At what point during an outage we should cut and run to set up a second box. Sometimes this is called something like "recovery escalation". For example, say Phab has been down for 2 hours and we think it will take 8 hours to reinstall on a new box and 10 to fix the existing box. Setting up a new host is going to be more aggregate hours of work, but it will bring the service online 2 hours faster. Who makes that call, and when?
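To make the arithmetic in these points concrete, here is a minimal Python sketch. It is only an illustration, not anything Ops actually uses: the function names `availability` and `fastest_recovery` are hypothetical, and the only inputs are the figures quoted above (4 hours of downtime a year, 8 hours to reinstall, 10 hours to repair).

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(downtime_hours_per_year):
    """Fraction of the year the service is up, given yearly downtime in hours."""
    return 1 - downtime_hours_per_year / HOURS_PER_YEAR


def fastest_recovery(estimates):
    """Return (option, hours) for the option with the shortest estimated time to restore."""
    option = min(estimates, key=estimates.get)
    return option, estimates[option]


# Estimates taken from the example in the comment (hypothetical labels).
estimates = {"reinstall on new host": 8.0, "repair existing host": 10.0}
choice, hours = fastest_recovery(estimates)
hours_saved = max(estimates.values()) - hours

if __name__ == "__main__":
    print(f"4 h/year of downtime is {availability(4):.4%} availability")
    print(f"Fastest option: {choice} (~{hours:g} h, {hours_saved:g} h sooner)")
```

The sketch only compares time to restore; it deliberately leaves out the extra aggregate work a reinstall costs and the question of who makes the call, which is the part a service level would have to spell out.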

Things I usually mean that probably apply less here:

  • What level and method of escalation should we employ when the above aren't being met
  • Who is the highest stakeholder and when to involve them
  • What level of resources should we reserve to make sure the above are met

It's bureaucratic speak, my man, and it's somewhat disturbing that I find it comforting.

greg closed this task as Declined.Sep 3 2015, 6:30 PM
greg claimed this task.

Thanks man, that helps.

Given that this is not a common framework in use by WMF Ops (for better or worse), I'm going to close this task for now.

I'm happy to revisit this later with Ops as needed, however. But I'll let them initiate that conversation.

Much love.

Krenair added a subscriber: Krenair.Sep 3 2015, 6:33 PM
Restricted Application added a project: User-greg.Sep 24 2015, 11:35 PM
greg moved this task from Backlog to Done on the User-greg board.Sep 24 2015, 11:36 PM