Requesting write access to ml-staging-codfw for ML team
Closed, Resolved · Public

Assigned To

Authored By

	isarantopoulos
	Jan 8 2024, 9:16 AM

Description

As a member of the ML team,

I would like to have admin/write access to the experimental namespace on ml-staging-codfw, so that I can debug Lift Wing deployments more easily. The team is working on deploying and improving model inference on GPUs and needs to be able to attach to a running pod (or edit resources) and make changes on the fly, without going through the whole CI/CD pipeline, in order to experiment and iterate faster.

Requesting this access for the following team members: @kevinbazira @AikoChou @calbon @isarantopoulos as SREs in the team already have access to the aforementioned resources.

Details

Subject	Repo	Branch	Lines +/-
admin_ng: drop version on apiGroups perms for exp NS in LiftWing	operations/deployment-charts	master	+2 -2
admin_ng: elevate ml users experimental permissions	operations/deployment-charts	master	+6 -0
WIP - admin_ng: allow write access to pods in experimental ns in ml-staging	operations/deployment-charts	master	+24 -0
ml-serve: Drop explicit list of deployExtraClusterRoles	operations/deployment-charts	master	+0 -3
helmfile/rbac: Allow deploy users to debug pods in experimental	operations/deployment-charts	master	+29 -0
ml-serve/staging: Add group to allow debugging operations on	operations/puppet	production	+3 -0


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T353337 Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models
		Resolved		klausman	T354516 Requesting write access to ml-staging-codfw for ML team

Event Timeline

isarantopoulos created this task. Jan 8 2024, 9:16 AM

Restricted Application added a subscriber: Aklapper. Jan 8 2024, 9:16 AM

Maintenance_bot added a project: SRE. Jan 8 2024, 9:29 AM

isarantopoulos renamed this task from Requesting to Requesting write access to ml-staging-codfw for ML team. Jan 8 2024, 9:37 AM

Hi. I'm the clinician on duty this week. I'm afraid I'm not quite clear what sort of access you are requesting here (ml-staging-codfw isn't a group I can see in puppet, nor is it an LDAP group)?

[the answer may be to get one of your SRE colleagues to explain it to me!]

Sorry for the delay.

MatthewVernon moved this task from Untriaged to Awaiting User Input on the SRE-Access-Requests board. Jan 9 2024, 11:17 AM

calbon assigned this task to klausman. Jan 9 2024, 3:19 PM

I think we can remove the SRE-Access-Requests tag, since this likely can be entirely covered on the k8s permission level.

MatthewVernon unsubscribed. Jan 9 2024, 3:27 PM

calbon moved this task from Unsorted to Ready To Go on the Machine-Learning-Team board. Jan 9 2024, 3:42 PM

isarantopoulos added a parent task: T353337: Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models. Jan 9 2024, 3:44 PM

From my understanding so far, and according to the k8s docs, we need to create a Role, since we only want these permissions in a single namespace (otherwise we would use a ClusterRole). Then we have to create a RoleBinding to assign that Role to a group or service account.
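For illustration, the Role plus RoleBinding pairing described in this comment can be sketched as a small Python script that emits the two manifests as JSON, which kubectl also accepts. This is only a sketch under assumptions: the role name `pod-debug`, the group name `experimental-debug`, and the exact resources and verbs are hypothetical and are not taken from the patches linked in this task.

```python
import json

NAMESPACE = "experimental"         # namespace named in the task
ROLE_NAME = "pod-debug"            # hypothetical role name
GROUP_NAME = "experimental-debug"  # hypothetical group name

RBAC_API_GROUP = "rbac.authorization.k8s.io"


def build_role(namespace, name):
    """Namespaced Role: permissions apply only inside `namespace`."""
    return {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            # Core API group is the empty string; verbs here are illustrative.
            {"apiGroups": [""], "resources": ["pods"],
             "verbs": ["get", "list", "watch", "delete"]},
            # `kubectl exec` goes through the pods/exec subresource.
            {"apiGroups": [""], "resources": ["pods/exec"],
             "verbs": ["create"]},
        ],
    }


def build_role_binding(namespace, role_name, group):
    """RoleBinding: grants the Role to a group within the same namespace."""
    return {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-binding", "namespace": namespace},
        "subjects": [
            {"kind": "Group", "name": group, "apiGroup": RBAC_API_GROUP},
        ],
        "roleRef": {"kind": "Role", "name": role_name,
                    "apiGroup": RBAC_API_GROUP},
    }


def build_manifest_list(namespace, role_name, group):
    """Wrap both objects in a v1 List so they can be applied together."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            build_role(namespace, role_name),
            build_role_binding(namespace, role_name, group),
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_manifest_list(NAMESPACE, ROLE_NAME, GROUP_NAME),
                     indent=2))
```

Because both objects carry `metadata.namespace`, the permissions stay confined to the experimental namespace, which is the reason a Role is preferred here over a ClusterRole.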

Change 991292 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] admin_ng: allow write access to pods in experimental ns in ml-staging

https://gerrit.wikimedia.org/r/991292

gerritbot added a project: Patch-For-Review. Jan 17 2024, 9:51 AM

I started a patch for the above. I haven't found a way to do this only for ml-staging-codfw. From our side there is no issue (it may be preferred) if we can also do that on ml-serve (again only in experimental namespace) but I don't know if SREs would approve this kind of access in production.

Change 991309 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] helmfile/rbac: Allow deploy users to debug pods in experimental

https://gerrit.wikimedia.org/r/991309

klausman triaged this task as High priority. Jan 18 2024, 10:31 AM

klausman moved this task from Ready To Go to In Progress on the Machine-Learning-Team board.

Change 992152 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] ml-serve/staging: Add group to allow debugging operations on

https://gerrit.wikimedia.org/r/992152

Change 992152 merged by Klausman:

[operations/puppet@production] ml-serve/staging: Add group to allow debugging operations on

https://gerrit.wikimedia.org/r/992152

Change 991309 merged by jenkins-bot:

[operations/deployment-charts@master] helmfile/rbac: Allow deploy users to debug pods in experimental

https://gerrit.wikimedia.org/r/991309

This has been solved for now, though it needs better docs and possibly simplification, as an extra step is needed:

$ kube_env experimental ml-staging-codfw
$ export KUBECONFIG="/etc/kubernetes/experimental-debug-ml-staging-codfw.config"
$ kubectl exec -it ...

isarantopoulos awarded a token. Jan 23 2024, 5:34 PM

Change 992764 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] ml-serve: Drop explicit list of deployExtraClusterRoles

https://gerrit.wikimedia.org/r/992764

Change 992764 merged by jenkins-bot:

[operations/deployment-charts@master] ml-serve: Drop explicit list of deployExtraClusterRoles

https://gerrit.wikimedia.org/r/992764

Change 994117 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] admin_ng: elevate ml users experimental permissions

https://gerrit.wikimedia.org/r/994117

Change 991292 abandoned by Ilias Sarantopoulos:

[operations/deployment-charts@master] WIP - admin_ng: allow write access to pods in experimental ns in ml-staging

Reason:

has already been dealt with in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/991309

https://gerrit.wikimedia.org/r/991292

Change 994117 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: elevate ml users experimental permissions

https://gerrit.wikimedia.org/r/994117

I tried to delete a revision and an inferenceservice in the experimental namespace, and it seems that I don't have access:

kubectl delete revision revertrisk-wikidata-predictor-default-00014
Error from server (Forbidden): revisions.serving.knative.dev "revertrisk-wikidata-predictor-default-00014" is forbidden: User "experimental-debug" cannot delete resource "revisions" in API group "serving.knative.dev" in the namespace "experimental"

kubectl delete isvc revertrisk-wikidata
Error from server (Forbidden): inferenceservices.serving.kserve.io "revertrisk-wikidata" is forbidden: User "experimental-debug" cannot delete resource "inferenceservices" in API group "serving.kserve.io" in the namespace "experimental"

The API groups seem to have been configured properly in the latest patch, so I'm out of ideas on what the issue is at the moment.

Maintenance_bot removed a project: Patch-For-Review. Feb 6 2024, 4:31 PM

Change 998330 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] admin_ng: drop version on apiGroups perms for exp NS in LiftWing

https://gerrit.wikimedia.org/r/998330

gerritbot added a project: Patch-For-Review. Feb 7 2024, 10:41 AM

Change 998330 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: drop version on apiGroups perms for exp NS in LiftWing

https://gerrit.wikimedia.org/r/998330

After dropping the version specifiers (/v...) at the end of the apiGroups directives, this is now working properly.
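To illustrate why the version suffix mattered: an RBAC rule's apiGroups entries are compared against the bare API group of the request (for example `serving.kserve.io`), so an entry that still carries a version never matches. The following is a deliberately simplified model of that comparison, not the actual Kubernetes authorizer (it ignores subresources and resource names), and `vX` is a placeholder because the task does not state which version suffix was in the original rule.

```python
def matches(allowed, value):
    """True if the rule list contains the value or the '*' wildcard."""
    return "*" in allowed or value in allowed


def rule_allows(rule, api_group, resource, verb):
    """Simplified RBAC check: group, resource and verb must all match."""
    return (
        matches(rule["apiGroups"], api_group)
        and matches(rule["resources"], resource)
        and matches(rule["verbs"], verb)
    )


def group_of(api_version):
    """Strip the version from a manifest apiVersion; core 'v1' gives ''."""
    return api_version.rpartition("/")[0]


# "vX" is a placeholder: the task does not say which version suffix was used.
broken_rule = {
    "apiGroups": ["serving.kserve.io/vX"],
    "resources": ["inferenceservices"],
    "verbs": ["delete"],
}
fixed_rule = dict(broken_rule, apiGroups=["serving.kserve.io"])

# The request from the error message above: delete an inferenceservice
# in API group "serving.kserve.io".
request = ("serving.kserve.io", "inferenceservices", "delete")

if __name__ == "__main__":
    print("versioned apiGroups allows delete:",
          rule_allows(broken_rule, *request))
    print("bare apiGroups allows delete:",
          rule_allows(fixed_rule, *request))
```

In this model the versioned rule does not allow the delete while the bare group name does, which is consistent with the Forbidden errors seen before the fix. The `group_of` helper shows how to derive the value for apiGroups from a manifest's apiVersion string.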

klausman moved this task from In Progress to 2023-2024 Q3 Done on the Machine-Learning-Team board. Feb 7 2024, 2:00 PM

Maintenance_bot removed a project: Patch-For-Review. Feb 7 2024, 2:30 PM

Change 1006528 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/deployment-charts@master] LiftWing: add missing entry for article-desc certs

https://gerrit.wikimedia.org/r/1006528

gerritbot added a project: Patch-For-Review. Feb 26 2024, 1:43 PM

klausman closed this task as Resolved. Mar 5 2024, 11:50 AM
