Give access to ml-serve* to the non-ops members of the ML team
Closed, Resolved (Public)

Description

So far, only members of ops have access (due to a blanket rule).

Additionally, at least Chris Albon, Andy Craze and Kevin Bazira need to be able to log in and become root.

More context for the request: the ML team is bootstrapping an entirely new stack, and rapid development/testing is key to speeding up the work. The plan is to add Kubernetes and Kubeflow on top, and of course to translate the work to Puppet as soon as possible (likely as the team reaches milestones). Eventually the hosts will be reimaged and full root shouldn't be needed anymore (but some more specific sudo rules will probably be necessary).

Event Timeline

Change 657864 had a related patch set uploaded (by Klausman; owner: Klausman):
[operations/puppet@production] admin: Let the ml-team admins become root on ml-serve*

https://gerrit.wikimedia.org/r/657864

Change 657864 merged by Klausman:
[operations/puppet@production] admin: Let the ml-team admins become root on ml-serve*

https://gerrit.wikimedia.org/r/657864

klausman triaged this task as High priority.
elukey updated the task description.
elukey updated the task description.

kbazira needs access too

Yep, Tobias added Kevin as well, all covered!

elukey lowered the priority of this task from High to Medium.
elukey updated the task description.

Reopening to get this request validated by the SRE team during today's team meeting :)

@klausman (adding a comment here in case it was missed from the meeting) when this access is revoked and the hacking is over, we should rebuild all the machines using the re-image script so that we can ensure they can be automatically installed and no manual steps were missed.

Agreed, that was always the plan.

This was approved in today's SRE meeting with the following notes:

  • Approved in recognition of exceptional circumstances: the need for quick iteration and the difficulty of testing and iterating anywhere else.
  • Access should be time-limited, i.e. after date X, root access for non-ops is removed
  • If possible, testing/iteration using Cloud VPS instances is preferable to doing it in production
  • Needing to push the envelope/rules a bit here is recognized, and will be kept in mind
  • The team members will be clearly instructed that these hosts need to still be treated like production hosts in a secure environment (avoid wildly insecure things like “curl | bash”)

I believe the only thing left to do here is to determine when the time-limited access should expire.

I'm closing this task since the group has been created. @klausman will reopen the task (or create a new one) once the installation is fleshed out and has been transferred/reimaged to a regular production system.