Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts
Open, Medium · Public

Description

https://gerrit.wikimedia.org/r/c/operations/puppet/+/621343 proposes granting the wmcs-admin group access to run wmcs-prefixed cookbooks on the cumin hosts. That group is the WMCS engineering manager's special access for managing wiki replica host scripts, because they are part of our clinic duty rotation. We also have a wmcs-roots group that the same people are part of, but that group grants root on all other cloud* hosts, whereas wmcs-admin grants only restricted sudo on the wiki-replicas hosts.

Right now, that group provides a list of scripts that can be run as root on the wiki replica servers. This change adds secure-cookbook wmcs.* to that list on the requisite cumin server, so the cookbooks can be run from there. This would unblock some parts of the redesign of the wikireplicas, because we are moving from 4 to 8 servers and to multi-instance. The redesign will require more complex manual interactions that don't scale well without being able to use cumin, spicerack or a similar framework.
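
As a rough illustration of the prefix restriction described above (not the actual Puppet/sudo change in the patch), here is a minimal Python sketch. It assumes a hypothetical allowlist of glob patterns, and the cookbook names in the usage note are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist: only cookbooks under the "wmcs." prefix are permitted.
# This mirrors the intent of "secure-cookbook wmcs.*", not its real implementation.
ALLOWED_PATTERNS = ["wmcs.*"]


def is_cookbook_allowed(cookbook_name, patterns=ALLOWED_PATTERNS):
    """Return True if the cookbook name matches any allowed glob pattern."""
    return any(fnmatchcase(cookbook_name, pattern) for pattern in patterns)
```

With this sketch, a name such as `wmcs.wikireplicas.add_wiki` (hypothetical) would be accepted, while anything outside the prefix, such as `sre.hosts.reboot` (also hypothetical), would be rejected.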

Event Timeline

Bstorm created this task. Mon, Aug 24, 4:53 PM
Restricted Application added a project: Operations. Mon, Aug 24, 4:53 PM

Change 621343 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cumin: for new wmcs. prefix for cookbooks, grant access to wmcs-admins

https://gerrit.wikimedia.org/r/621343

Bstorm removed Bstorm as the assignee of this task. Mon, Aug 24, 4:57 PM
Bstorm moved this task from Backlog to Wiki replicas on the Data-Services board.
Bstorm moved this task from Inbox to Watching on the cloud-services-team (Kanban) board.
jijiki claimed this task. Mon, Aug 24, 10:20 PM
jijiki triaged this task as Medium priority.

@Volans @MoritzMuehlenhoff @jbond Please advise whether we should move this forward or put it on the agenda for the next Monday meeting.

I'd say to discuss it in the meeting, but it was already announced by Brooke in the last one and didn't get much attention.
The existing dcops exception is not a real one, as it was then decided to grant additional privileges to their members to overcome some limitations.
I think that this feedback is important to push more for the general work of running cookbooks as non-root (the one started in T244840). I was also asked by the Search team about the same possibility.

As I commented in the CR my main worry is that the cumin hosts are currently considered ops-only and as such "safe" and might contain sensitive data.

bd808 updated the task description. Thu, Aug 27, 8:21 PM
bd808 added a comment. Thu, Aug 27, 8:31 PM

As I commented in the CR my main worry is that the cumin hosts are currently considered ops-only and as such "safe" and might contain sensitive data.

The wmcs-admin group will only ever contain trusted technical contributors who hold elevated shell privileges in other parts of the production infrastructure. Perhaps one path forward, if the current cumin hosts are considered tainted, would be to set up a new place to run cumin from that is not used by roots? My thinking here is along the lines of the bastion/bastion-restricted split in the Cloud VPS environment.

jijiki removed jijiki as the assignee of this task. Mon, Aug 31, 2:17 PM
jijiki added a subscriber: jijiki.
Bstorm added a comment. Edited Mon, Aug 31, 3:00 PM

Are DCops considered part of ops only? They already have access there, restricted to server install scripts. I'm mostly asking because I want to be clear that access currently does include a small restricted set of people outside of the "ops" group--which doesn't necessarily mean the same thing. T249916

Perhaps one path forward, if the current cumin hosts are considered tainted, would be to set up a new place to run cumin from that is not used by roots?

Yes, that's an option, and that's also what is planned for T244840. Until we have root-less command execution in place, we can also restrict the Ferm rules allowing Cumin command execution to the WMCS servers in question. Ultimately it adds technical debt which accrues until T244840 is solved for good, but if it's needed in the short term (and a full implementation of T244840 will not be available for more than 6 months, depending on priority setting), that's a tradeoff to be made.

jbond added a comment. Fri, Sep 4, 11:51 AM

Further to the comment from Moritz, it would be useful to know what the priority is on this and what the blockers are on the WMCS side. If we can wait until rootless cumin, that would be great. However, if this is needed yesterday, then perhaps we need to re-prioritise things and/or get more creative. Further, it sounds like it might be useful to have a cloud common instance on the production environment regardless of this specific issue?

We are currently trying to make it more straightforward to manage the wikireplicas. I wrote a spicerack cookbook that allows us to "add a wiki" to the wikireplicas with a single command, instead of three commands on each server plus one on our core control server in the middle of all that, all in a specified order. One of the quarterly goals for us and the DBAs is redesigning the wikireplicas service to allow multi-instance replicas and expanding the server set to 8 servers (each with only particular instances) instead of three servers with multisource replication as the model (the actual multisource is the upstream servers to the wikireplica servers, but that's the simplified version). That will make the manual version of the task far more error prone and a somewhat absurd amount of toil.

The people who are in this group (my engineering manager and my former engineering manager--who this group was specifically created to contain) assist in managing the replicas as part of our clinic duty rotation and hold root on our other physical servers.

In the very near term (this month), this will dramatically simplify one of the clinic duty procedures for the entire team, which is good but not a huge blocker. When the new wikireplica servers are set up (they were delivered last week and the racking task is underway), this will make remote management of the servers technically feasible. Without the ability to remotely run root-level commands on those servers and our cloudcontrol cluster, it will require a heavy amount of work for them to take part in the rotation at all.

That is the general summary of blockers and reasons.

Further it sounds like it might be useful to have a cloud common instance on the production environment regardless of this specific issue?

That makes some sense except that the entire WMCS team has ops privileges except the engineering manager. Is that worth provisioning a server? Maybe!

faidon added a subscriber: faidon. Wed, Sep 16, 5:52 PM

Hey - this was brought to my attention, and we discussed it today at the I/F meeting. The outcome of our conversation was that @Volans and @jbond will do a final review pass and merge r621343 ~by the end of this week.

We consider this a stopgap that hopefully unblocks the immediate needs of the WMCS team. The security guarantees are not going to be great here, and we don't feel comfortable with having this as a permanent solution, or with expanding it to other teams. It's now on I/F to figure out the cleaner path forward, however :) Non-root Cumin is in fact something that has been identified as a need and has been on our roadmap for some time now (we had a quarterly goal a few quarters back to experiment with it). The intention is to work on this again in one of the upcoming quarters and hopefully have it as a production feature during this FY. Thanks for the patience, and I hope that merging the change above provides you with the kind of relief you were looking for in the short term!

Hope this helps and happy to discuss this more if there are further comments or needs.

Change 621343 merged by Jbond:
[operations/puppet@production] cumin: for new wmcs. prefix for cookbooks, grant access to wmcs-admins

https://gerrit.wikimedia.org/r/621343