Expand sc-admins to provide sufficient coverage for sc* clusters
Closed, Resolved, Public

Description

As of T134251, @mobrovac is the only person in Services with access to the SC* clusters. This is not sufficient to provide emergency coverage for services owned or co-managed by the services team.

To remedy this, I think we should expand access to include most or all of the services team.

Event Timeline

As a personal note: Marko was granted the right to run/disable/manage puppet because he is performing non-emergency coverage and regularly doing deployments; we don't really need "emergency coverage" and I have seen in the past repeated abuses of the ability to disable puppet in production.

If the focus is to evenly distribute the load of managing deployments between different people in the team, we can discuss this, but emergency coverage is not really the reason Marko was given the right to manage puppet in the first place.

Joe removed Joe as the assignee of this task. May 20 2016, 6:27 AM

I am currently the default go-to guy when it comes to SC* services. This is becoming a bottleneck and, more generally, is not a sustainable solution. With the creation of the sc-admins group, I think it makes sense to expand it to:

  • allow all users in that group to manage services on the hosts (basically sudo service *)
  • add the services team's members to it
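As a rough illustration of the first point (not the actual configuration), here is a minimal Python sketch that prints sudoers-style rules letting a group run `service <name> ...`. The service names, the `/usr/sbin/service` path, and the `NOPASSWD` tag are assumptions made for the example only.

```python
def sudoers_rules(group, services):
    """Return one sudoers line per service for members of `group`."""
    rules = []
    for name in services:
        # The trailing '*' lets the rule match any arguments
        # (start, stop, restart, status, ...).
        # Path and NOPASSWD are assumptions, not taken from this task.
        rules.append(
            f"%{group} ALL = (root) NOPASSWD: /usr/sbin/service {name} *"
        )
    return rules


if __name__ == "__main__":
    # Hypothetical service names, for illustration only.
    for line in sudoers_rules("sc-admins", ["example-service-a", "example-service-b"]):
        print(line)
```

Listing services explicitly, rather than allowing a bare `service *`, keeps the grant limited to the services the group is meant to manage.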

This would allow other members of the team not only to help fix breakages in emergency situations, but also to take a share of the day-to-day duties that come with it (configs, cleanups, investigations, etc.).

An important thing to note here is that currently @Eevans and @Pchelolo don't have any kind of access to SC*, which I don't think is a good state to be in.

@mobrovac I agree in principle; also I guess the "puppet disabling" will not be needed anymore once we move every service fully to scap3?

Having a lot of people able to disable puppet for long stretches while doing testing is what slightly worries me, out of sour experiences we've had in quite a few cases.

> @mobrovac I agree in principle; also I guess the "puppet disabling" will not be needed anymore once we move every service fully to scap3?

If by fully you mean that config deploys are also done via Scap3 then, yes, I don't see how or why Puppet would be needed in this context.

> Having a lot of people able to disable puppet for long stretches while doing testing is what slightly worries me, out of sour experiences we've had in quite a few cases.

I can relate to your concern, but I think we should put this into perspective: it is unlikely to be feasible to keep Puppet disabled for extended periods on SC*, simply because of the number of services running there. I think we can all agree that disabling Puppet for up to a couple of hours, i.e. while working on them, is OK when slowly rolling out potentially harmful changes and/or testing parameter changes that are known to affect only production (for example proxy config, rate limiting, etc.). Disabling Puppet and leaving it disabled because "it works like this now, so let's leave it" is not.

I'm not 100% clear on the result of today's meeting (my hangouts decided to lag and jitter). It seems there was no objection, but I'll double-check with @mark tomorrow morning to confirm.

@Joe: I'm not clear whether your objection has been withdrawn following the explanation or still stands. Please advise.

@RobH yes it was approved in the meeting and I was actually sponsoring this :)

@GWicke: This was approved in the operations team meeting, but to ensure I ONLY add the right folks, can we confirm the users to be added to sc-admins:

I have the following: @GWicke, @Eevans, and @Pchelolo. @mobrovac already has this access.

Is there anyone I'm missing?

Change 290491 had a related patch set uploaded (by RobH):
expanding sc-admins rights and members

https://gerrit.wikimedia.org/r/290491

Change 290491 merged by RobH:
expanding sc-admins rights and members

https://gerrit.wikimedia.org/r/290491

I got the service names for inclusion from @mobrovac (it seems Gabriel is out today, and there is no need to stall this just for his input when Marko knows what is needed). Marko also confirmed that no services team members were missing from the list.

As this was already approved in the meeting, and I have both my review and @Joe's review on the patch, I've merged it live.

Please note that while this is live, it is up to the services team to coordinate service issues and administration with operations. (Basically, everyone who just got rights should continue to do what @mobrovac is already doing with these sc-admins rights: work with ops.)

Thanks!