@chasemp: pinging you because I like you. Can you give me an example of what "service level" means here, from Ops perspective? Feel free to punt to someone else :)
In T76381: Determine phabricator.wikimedia.org service level, there wasn't much discussion about what that term meant other than "like bugzilla" or "more than bugzilla" (and then getting too specific, like "redundant switch on the rack").
Greg, stop liking me. :) I don't know if there is a general Ops perspective here, but I can give you my perspective.
When I use the language it relates to the following things:
- How much downtime we expect to incur per year, and how long recovery takes after a failure (MTTR). If we are OK with 4 hours of downtime every year or so for some hardware failure, that's one thing. If we expect never to lose Phab for more than an hour at a time, that's another.
- What level of response to expect (and for whom we should set that expectation) when the service is having issues.
- At what point during an outage we should cut and run to set up a second box. Sometimes this is called something like "recovery escalation", e.g. Phab has been down for 2 hours and we think it will take 8 to reinstall on a new box and 10 to fix the existing box. Setting up a new host is going to be more aggregate hours of work but will bring the service online 2 hours faster. Who makes that call, and when?
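To put rough numbers on the first and last points, here is a minimal sketch of the arithmetic. The function names are hypothetical and the inputs are just the example figures above (4 hours of downtime a year, 8 hours to reinstall versus 10 to fix in place); it is not an agreed-upon Ops formula.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(downtime_hours_per_year):
    """Fraction of the year the service is up for a given downtime budget."""
    return 1 - downtime_hours_per_year / HOURS_PER_YEAR


def faster_recovery_path(hours_to_fix_in_place, hours_to_rebuild):
    """Return the option that restores service sooner and the hours it saves.

    Only time-to-restore is compared; aggregate staff hours are not modelled.
    """
    if hours_to_rebuild < hours_to_fix_in_place:
        return "rebuild", hours_to_fix_in_place - hours_to_rebuild
    return "fix_in_place", hours_to_rebuild - hours_to_fix_in_place


# Example figures from the points above.
print(f"{availability(4):.4%}")       # uptime with 4 hours down per year
print(faster_recovery_path(10, 8))    # fix existing box vs. reinstall on a new one
```

The numbers only show which path is faster. Who makes the call, and when, still has to be decided by people.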
Things I usually mean that probably apply less here:
- What level and method of escalation should we employ when the above aren't being met
- Who is the highest stakeholder and when to involve them
- What level of resources should we reserve to make sure the above are met
It's bureaucratic speak, my man, and it's somewhat disturbing that I find it comforting.
Thanks man, that helps.
Given that this is not a common framework in use by WMF Ops (for better or worse), I'm going to close this task for now.
I'm happy to revisit this later with Ops as needed, however. But I'll let them initiate that conversation.