Requesting write access to ml-staging-codfw for ML team
Closed, Resolved, Public

Description

As a member of the ML team, I would like to have admin/write access to the experimental namespace on ml-staging-codfw, so that I can debug Lift Wing deployments more easily. The team is working on deploying and improving model inference on GPUs and needs to be able to attach to a running pod (or edit resources) and make changes on the fly, without going through the whole CI/CD pipeline, so that we can experiment and iterate faster.

Requesting this access for the following team members: @kevinbazira, @AikoChou, @calbon, and @isarantopoulos. The SREs on the team already have access to the aforementioned resources.

Event Timeline

isarantopoulos renamed this task from "Requesting" to "Requesting write access to ml-staging-codfw for ML team". Jan 8 2024, 9:37 AM

Hi. I'm the clinician on duty this week. I'm afraid I'm not quite clear what sort of access you are requesting here (ml-staging-codfw isn't a group I can see in puppet, nor is it an LDAP group)?

[the answer may be to get one of your SRE colleagues to explain it to me!]

Sorry for the delay.

I think we can remove the SRE-Access-Requests tag, since this likely can be entirely covered on the k8s permission level.

From my understanding so far, and according to the k8s docs, we need to create a Role, since we only want access to a single namespace (otherwise we would use a ClusterRole). We then have to create a RoleBinding to assign that Role to a group or service account.
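For illustration, the Role plus RoleBinding pairing described above could look roughly like the following. This is a minimal sketch, not the patch that was eventually merged: the object names (`experimental-pod-debug`, `experimental-pod-debug-binding`), the group name (`ml-debug-group`), and the exact rule list are hypothetical. The script only builds the two manifests and prints them as JSON, which `kubectl apply -f` also accepts.

```python
import json

NAMESPACE = "experimental"          # namespace named in this task
ROLE_NAME = "experimental-pod-debug"  # hypothetical name
GROUP_NAME = "ml-debug-group"         # hypothetical group name


def make_role(name, namespace, rules):
    """Build a namespaced Role manifest (a ClusterRole would be cluster-wide)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": rules,
    }


def make_role_binding(name, namespace, role_name, group):
    """Build a RoleBinding that grants the Role to a group in one namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": name, "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
        "subjects": [
            {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "Group",
                "name": group,
            }
        ],
    }


# Assumed rule set: read pods, and create exec sessions into them.
# Core resources such as pods live in the empty ("") API group.
rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]},
    {"apiGroups": [""], "resources": ["pods/exec"], "verbs": ["create"]},
]

role = make_role(ROLE_NAME, NAMESPACE, rules)
binding = make_role_binding(
    ROLE_NAME + "-binding", NAMESPACE, ROLE_NAME, GROUP_NAME
)

print(json.dumps(role, indent=2))
print(json.dumps(binding, indent=2))
```

Because both objects carry `metadata.namespace`, the grant would stay confined to the experimental namespace, which is the reason a Role is preferred over a ClusterRole here.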

Change 991292 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] admin_ng: allow write access to pods in experimental ns in ml-staging

https://gerrit.wikimedia.org/r/991292

I started a patch for the above. I haven't found a way to do this only for ml-staging-codfw. From our side there is no issue (it may be preferred) if we can also do that on ml-serve (again only in experimental namespace) but I don't know if SREs would approve this kind of access in production.

Change 991309 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] helmfile/rbac: Allow deploy users to debug pods in experimental

https://gerrit.wikimedia.org/r/991309

klausman moved this task from Ready To Go to In Progress on the Machine-Learning-Team board.

Change 992152 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] ml-serve/staging: Add group to allow debugging operations on

https://gerrit.wikimedia.org/r/992152

Change 992152 merged by Klausman:

[operations/puppet@production] ml-serve/staging: Add group to allow debugging operations on

https://gerrit.wikimedia.org/r/992152

Change 991309 merged by jenkins-bot:

[operations/deployment-charts@master] helmfile/rbac: Allow deploy users to debug pods in experimental

https://gerrit.wikimedia.org/r/991309

This has been solved for now, though it needs better docs and possibly simplification, as an extra step is needed:

$ kube_env experimental ml-staging-codfw
$ export KUBECONFIG="/etc/kubernetes/experimental-debug-ml-staging-codfw.config"
$ kubectl exec -it ...

Change 992764 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-serve: Drop explicit list of deployExtraClusterRoles

https://gerrit.wikimedia.org/r/992764

Change 992764 merged by jenkins-bot:

[operations/deployment-charts@master] ml-serve: Drop explicit list of deployExtraClusterRoles

https://gerrit.wikimedia.org/r/992764

Change 994117 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] admin_ng: elevate ml users experimental permissions

https://gerrit.wikimedia.org/r/994117

Change 991292 abandoned by Ilias Sarantopoulos:

[operations/deployment-charts@master] WIP - admin_ng: allow write access to pods in experimental ns in ml-staging

Reason:

has already been dealt with in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/991309

https://gerrit.wikimedia.org/r/991292

Change 994117 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: elevate ml users experimental permissions

https://gerrit.wikimedia.org/r/994117

I tried to delete a revision and an inferenceservice in the experimental namespace, and it seems that I don't have access:

$ kubectl delete revision revertrisk-wikidata-predictor-default-00014
Error from server (Forbidden): revisions.serving.knative.dev "revertrisk-wikidata-predictor-default-00014" is forbidden: User "experimental-debug" cannot delete resource "revisions" in API group "serving.knative.dev" in the namespace "experimental"
$ kubectl delete isvc revertrisk-wikidata
Error from server (Forbidden): inferenceservices.serving.kserve.io "revertrisk-wikidata" is forbidden: User "experimental-debug" cannot delete resource "inferenceservices" in API group "serving.kserve.io" in the namespace "experimental"

The API groups seem to have been configured properly in the latest patch, so I'm out of ideas about what the issue is at the moment.

Change 998330 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] admin_ng: drop version on apiGroups perms for exp NS in LiftWing

https://gerrit.wikimedia.org/r/998330

Change 998330 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: drop version on apiGroups perms for exp NS in LiftWing

https://gerrit.wikimedia.org/r/998330

After dropping the version specifiers (/v...) at the end of the apiGroups directives, this is now working properly.
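To illustrate why the version suffix mattered: in an RBAC rule, `apiGroups` entries are compared against the bare API group name of the request (for example `serving.kserve.io`), not against a `group/version` string. The sketch below is a deliberately simplified model of that comparison, not the actual Kubernetes authorizer, and the `/vX` suffix is a placeholder for whatever version string the earlier patch contained.

```python
def rule_allows(rule, api_group, resource, verb):
    """Simplified model of an RBAC rule check: exact match or "*" wildcard."""

    def matches(allowed, value):
        return "*" in allowed or value in allowed

    return (
        matches(rule["apiGroups"], api_group)
        and matches(rule["resources"], resource)
        and matches(rule["verbs"], verb)
    )


# Rule written with a version suffix ("/vX" is a placeholder, not a real version).
versioned_rule = {
    "apiGroups": ["serving.kserve.io/vX"],
    "resources": ["inferenceservices"],
    "verbs": ["delete"],
}

# Same rule with the bare group name, as in the fix described above.
unversioned_rule = {
    "apiGroups": ["serving.kserve.io"],
    "resources": ["inferenceservices"],
    "verbs": ["delete"],
}

# The request seen by the authorizer carries only the group name, as shown
# in the Forbidden message: API group "serving.kserve.io".
request = ("serving.kserve.io", "inferenceservices", "delete")

print(rule_allows(versioned_rule, *request))    # False
print(rule_allows(unversioned_rule, *request))  # True
```

Under this model the versioned entry never matches any request, which is consistent with the Forbidden errors quoted earlier even though the group names looked correct.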

Change 1006528 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] LiftWing: add missing entry for article-desc certs

https://gerrit.wikimedia.org/r/1006528