
Revise phabricator.wikimedia.org service level
Closed, Declined · Public

Description

As per T76381: Determine phabricator.wikimedia.org service level, we need to revise Operations' phabricator.wikimedia.org service level whenever we start planning the Gerrit migration.

Related Objects

Status     Assigned
Resolved   Dzahn
Resolved   Cmjohnson
Resolved   Dzahn
Resolved   Danny_B
Resolved   Paladox
Open       Dzahn
Resolved   demon
Resolved   demon
Resolved   Paladox
Resolved   Nemo_bis
Resolved   demon
Resolved   Paladox
Resolved   Krenair
Resolved   mmodell
Invalid    None
Declined   None
Resolved   demon
Invalid    None
Invalid    None
Resolved   Qgil
Declined   None
Duplicate  None
Declined   greg

Event Timeline

Qgil raised the priority of this task from to Low.
Qgil updated the task description. (Show Details)
Qgil added a project: Gerrit-Migration.
Qgil changed Security from none to None.
Qgil added subscribers: Aklapper, Qgil, greg and 16 others.

See also: https://secure.phabricator.com/T4209 (Multiserver / High-Availability Configuration)

@chasemp: pinging you because I like you. Can you give me an example of what "service level" means here, from Ops perspective? Feel free to punt to someone else :)

In T76381: Determine phabricator.wikimedia.org service level there wasn't much discussion about what that phrase meant, beyond "like bugzilla" or "more than bugzilla" (or suggestions that were too specific, like "redundant switch on the rack").

Greg, stop liking me. :) I don't know if there is a general Ops perspective here, but I can give you my perspective.

When I use the language it relates to the following things:

  • How much downtime we expect to incur per year, and how long it should take to recover from a single failure (MTTR). If we are OK with 4 hours of downtime every year or so from some hardware failure, that's one thing. If we expect never to lose Phab for more than an hour at a time, that's another.
  • What level of response to expect when the service is having issues, and for whom we should set that expectation.
  • At what point during an outage we should cut and run and set up a second box. Sometimes this is called something like "recovery escalation": e.g. if Phab has been down for 2 hours and we think it will take 8 hours to reinstall on a new box versus 10 to fix the existing one, setting up on a new host means more aggregate hours of work but brings the service back online 2 hours sooner. Who makes that call, and when? (A rough worked example of that arithmetic is sketched just below this list.)
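
To make the arithmetic above concrete, here is a minimal illustrative sketch in Python. All of the numbers are hypothetical, lifted from the examples in this comment (a roughly 4-hour-per-year downtime budget, a 2-hour-old outage, 8 hours to rebuild versus 10 to repair); nothing here reflects an actual commitment for phabricator.wikimedia.org.

```python
# Illustrative sketch only: the availability targets and recovery-time numbers
# below are hypothetical, taken from the examples in the comment above.

HOURS_PER_YEAR = 365.25 * 24  # ~8766 hours

def downtime_budget_hours(availability: float) -> float:
    """Allowed downtime per year, in hours, for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability)

# "OK with ~4 hours of downtime a year" corresponds to roughly 99.95% availability.
for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} availability -> "
          f"{downtime_budget_hours(target):.1f} h/year of allowed downtime")

# Recovery escalation example: the outage is already 2 hours old, rebuilding on
# a new box is estimated at 8 more hours, fixing the existing box at 10 more.
elapsed_h = 2
rebuild_h = 8
fix_in_place_h = 10

print(f"Rebuild on new host: service restored after {elapsed_h + rebuild_h} h of total outage")
print(f"Fix existing host:   service restored after {elapsed_h + fix_in_place_h} h of total outage")
# Rebuilding costs more aggregate work hours but restores service 2 hours sooner;
# the "service level" question is who is empowered to make that call, and when.
```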

Things I usually mean that probably apply less here:

  • What level and method of escalation should we employ when the above aren't being met
  • Who is the highest stakeholder and when to involve them
  • What level of resources should we reserve to make sure the above are met

It's bureaucratic speak, my man, and it's somewhat disturbing that I find it comforting.

greg claimed this task.

Thanks man, that helps.

Given that this is not a framework in common use by WMF Ops (for better or worse), I'm going to close this task for now.

I'm happy to revisit this later with Ops as needed, but I'll let them initiate that conversation.

Much love.