Discussion: dedicated directory in the deployment-chart repository for ML services
Closed, Resolved (Public)

Authored By

	elukey
	Jul 16 2021, 3:29 PM

Description

While chatting with @JMeybohm about the ML cluster we decided to open this task to decide where to place ML-related service definitions in the deployment-charts repository.

The main problem is that the services directory, which contains the helmfile configs, is currently tailored for the ServiceOps use case, so adding ML-related service definitions to it may not be a good idea. For example, if one of the main ServiceOps k8s clusters needs to be bootstrapped from scratch, it is sufficient to helmfile sync all the dirs under services. If we add ML-specific helmfile configs there, an operator would need to know which service runs on which cluster, which is not really straightforward.

We should probably create a separate ml-services directory, in which we'll place (initially) the Kfserving config and its InferenceService definitions.
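
To make the bootstrap concern concrete, here is a minimal Python sketch of how an operator could derive "what to sync" purely from the directory a service lives in. It is only an illustration: the `helmfile.d/<dir>/<service>/helmfile.yaml` layout, the group-to-directory mapping, and the `sync_plan` helper are assumptions made for this example, not something taken from the repository.

```python
from pathlib import Path

# Assumed mapping from cluster group to the helmfile.d subdirectory holding
# its services. The directory names come from the discussion; the exact
# layout is a guess made for this sketch.
GROUP_DIRS = {"main": "services", "ml-serve": "ml-services"}


def sync_plan(repo_root, cluster_group, environment):
    """Return (working_dir, argv) pairs to sync every service of one group."""
    base = Path(repo_root) / "helmfile.d" / GROUP_DIRS[cluster_group]
    plan = []
    for service_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        # Only directories that carry a helmfile are treated as services.
        if (service_dir / "helmfile.yaml").is_file():
            plan.append((service_dir, ["helmfile", "-e", environment, "sync"]))
    return plan
```

With one directory per cluster group, bootstrapping a cluster stays a matter of "sync everything under this directory", and nobody has to remember which service belongs to which cluster.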

Details

Subject	Repo	Branch	Lines +/-
helmfile.d: lower the min cpu limit for ml-serve	operations/deployment-charts	master	+8 -1
helmfile.d: skip helm3 namespace creation for ml-services	operations/deployment-charts	master	+3 -1
helmfile.d: add user deploy-kserve	operations/deployment-charts	master	+34 -1
role::deployment_server: add revscoring-editquality-deploy k8s user	operations/puppet	production	+7 -0
kubernetes: add the revscoring-editquality-deploy fake user/token	labs/private	master	+4 -0
Rakefile: change HELMFILE_GLOB to include ml-services	operations/deployment-charts	master	+2 -2
Add revscoring-editquality as first ml-service to helmfile.d	operations/deployment-charts	master	+155 -1
helmfile.d: move private dirs to the new format	operations/deployment-charts	master	+61 -48
Add missing dir declaration to helmfile private configurations	operations/puppet	production	+7 -0
kubernetes: add revscoring-editquality in the services configs	operations/puppet	production	+520 -474
helmfile.d/admin make tiller components configurable per environment	operations/deployment-charts	master	+12 -2
kubeflow-kfserving-inference: avoid repetitions with multi-models	operations/deployment-charts	master	+111 -14

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		None	T272917 Lift Wing proof of concept
		Resolved		elukey	T286791 Discussion: dedicated directory in the deployment-chart repository for ML services

Event Timeline

elukey created this task. Jul 16 2021, 3:29 PM

Restricted Application added a subscriber: Aklapper. Jul 16 2021, 3:29 PM

We should probably create a separate ml-services directory, in which we'll place (initially) the Kfserving config and its InferenceService definitions.

I think this sounds reasonable. I took a quick glance at the services directory and there seemed to be some slight differences compared to how we have been discussing the structure of our InferenceService definitions. I am wondering: if we have our own directory, would we then have our own ml-serve operator that would manage kfserving/isvcs?

In T286791#7218022, @ACraze wrote:

We should probably create a separate ml-services directory, in which we'll place (initially) the Kfserving config and its InferenceService definitions.

I think this sounds reasonable. I took a quick glance at the services directory and there seemed to be some slight differences compared to how we have been discussing the structure of our InferenceService definitions. I am wondering: if we have our own directory, would we then have our own ml-serve operator that would manage kfserving/isvcs?

The idea is to have the same structure as the services directory (so managing everything via helmfile etc.), with just a separation of concerns to ease the work of SREs. Depending on how we create the Kfserving helm chart, we should be able to deploy InferenceServices with a simple yaml config deployed via helmfile (the details are still not clear, but it should be possible).

I think having a separate directory is the way to go. I'll leave it to you to actually pick a name.

But please consider: you need to add this directory to our CI pipeline. This also means that I would be very careful with modifying the general structure and workflow if you are going to use helmfile for your deployments (which is what I'd recommend).

In T286791#7220181, @Joe wrote:

I think having a separate directory is the way to go. I'll leave it to you to actually pick a name.

But please consider: you need to add this directory to our CI pipeline. This also means that I would be very careful with modifying the general structure and workflow if you are going to use helmfile for your deployments (which is what I'd recommend).

Unsurprisingly, I would agree to that. :)

It turns out that even for the admin_ng dir it is a problem, see for example early attempts of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/707408

A possible solution could be to split admin_ng as well, not really DRY but there seems to be no other good solution :(

elukey mentioned this in T289834: Add network policies to the ML k8s clusters. Aug 30 2021, 1:31 PM

elukey mentioned this in T251305: Migrate to helm v3. Aug 30 2021, 4:42 PM

Coming back to this task :)

For the admin_ng directory Joe came up with this trick to add new functionality as opt-in for clusters: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/708475. We used the same solution for both knative and kfserving, and it worked nicely (so only ml-serve clusters will opt in for them).
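
As a rough illustration of the opt-in idea (not the actual implementation in the linked change, which is not reproduced here), the pattern boils down to "off by default, switched on per environment". The Python sketch below assumes hypothetical environment names and a hypothetical `enabled_components` helper:

```python
# Shared defaults: optional components are disabled everywhere.
DEFAULTS = {"knative": False, "kfserving": False}

# Hypothetical per-environment overrides; the environment names are made up
# for this sketch. Only the ml-serve cluster opts in.
ENV_OVERRIDES = {
    "ml-serve-eqiad": {"knative": True, "kfserving": True},
    "eqiad": {},
}


def enabled_components(environment):
    """Merge defaults with the environment overrides and list what is on."""
    settings = {**DEFAULTS, **ENV_OVERRIDES.get(environment, {})}
    return sorted(name for name, enabled in settings.items() if enabled)
```

Because the defaults are all off, a cluster that never mentions knative or kfserving is unaffected by their addition.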

The next step is to figure out how the ml-services directory will work. We'll need to use the kubeflow-kfserving-inference chart for each service, which basically deploys the following (for each group of ML models):

an instance of the InferenceService CRD (provided by Kfserving)
a secret containing the credentials to access the swift buckets

The initial goal is to create a subdir for each group of models that ORES provides (mainly 4 categories), and instantiate one or more InferenceService resources (one for each model). We need a deploy user able to read the secret and spin up pods (plus some other things).
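
The "one secret plus one InferenceService per model" shape can be sketched in Python as plain dictionaries. This is not what the kubeflow-kfserving-inference chart renders: the `render_model_group` helper, the credential keys, the container name, the `STORAGE_URI` variable, and the overall layout under `spec` are assumptions chosen for illustration.

```python
def render_model_group(namespace, secret_name, credentials, models):
    """Build one Secret and one InferenceService per model as plain dicts."""
    secret = {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": secret_name, "namespace": namespace},
        "type": "Opaque",
        # Hypothetical credential keys; real key names are not in the task.
        "stringData": dict(credentials),
    }
    services = []
    for model in models:
        services.append({
            "apiVersion": "serving.kubeflow.org/v1beta1",
            "kind": "InferenceService",
            "metadata": {"name": model["name"], "namespace": namespace},
            # Illustrative predictor spec: one custom container per model.
            "spec": {"predictor": {"containers": [{
                "name": "kfserving-container",
                "image": model["image"],
                "env": [{"name": "STORAGE_URI", "value": model["storage_uri"]}],
            }]}},
        })
    return [secret] + services
```

A group with two models would therefore yield three resources: the shared secret and two InferenceService objects in the same namespace.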

Change 719128 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add revscoring-editquality as first ml-service to helmfile.d

https://gerrit.wikimedia.org/r/719128

gerritbot added a project: Patch-For-Review. Sep 6 2021, 3:10 PM

Change 719515 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow-kfserving-inference: avoid repetitions with multi-models

https://gerrit.wikimedia.org/r/719515

elukey mentioned this in T288829: Implement Pod Security Policies for Istio/Knative/Kfserving. Sep 8 2021, 1:29 PM

Change 719522 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Rakefile: change HELMFILE_GLOB to include ml-services

https://gerrit.wikimedia.org/r/719522

Change 720048 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] kubernetes: add revscoring-editquality in the services configs

https://gerrit.wikimedia.org/r/720048

Change 719515 merged by Elukey:

[operations/deployment-charts@master] kubeflow-kfserving-inference: avoid repetitions with multi-models

https://gerrit.wikimedia.org/r/719515

Change 720342 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] helmfile.d/admin make tiller components configurable per environment

https://gerrit.wikimedia.org/r/720342

Three code reviews are pending to implement the new ml-services dir:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/720048/ - puppet change to split tokens and secrets between main and ml clusters.
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/719128 - first service in deployment-charts under the ml-services dir.
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/719522/ - CI changes to check the new dir.
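
For readers unfamiliar with the CI side: the check has to discover helmfiles in the new directory as well as the old one. The real change edits HELMFILE_GLOB in the Rakefile (Ruby) and its exact pattern is not quoted in this task; the Python sketch below only shows the idea, with the patterns and the `find_helmfiles` helper being assumptions.

```python
from pathlib import Path

# Assumed patterns: one per top-level directory that CI should validate.
HELMFILE_PATTERNS = (
    "helmfile.d/services/*/helmfile.yaml",
    "helmfile.d/ml-services/*/helmfile.yaml",
)


def find_helmfiles(repo_root):
    """Return the repo-relative paths of every helmfile matched by CI."""
    root = Path(repo_root)
    found = []
    for pattern in HELMFILE_PATTERNS:
        found.extend(root.glob(pattern))
    # Sort for a stable order so CI output is reproducible.
    return sorted(p.relative_to(root).as_posix() for p in found)
```

Without the second pattern, services under ml-services would be merged without ever being linted or rendered by CI.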

Change 720342 merged by jenkins-bot:

[operations/deployment-charts@master] helmfile.d/admin make tiller components configurable per environment

https://gerrit.wikimedia.org/r/720342

Change 722276 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: move private dirs to the new format

https://gerrit.wikimedia.org/r/722276

Change 720048 merged by Elukey:

[operations/puppet@production] kubernetes: add revscoring-editquality in the services configs

https://gerrit.wikimedia.org/r/720048

Change 722818 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add missing dir declaration to helmfile private configurations

https://gerrit.wikimedia.org/r/722818

Change 722818 merged by Elukey:

[operations/puppet@production] Add missing dir declaration to helmfile private configurations

https://gerrit.wikimedia.org/r/722818

Change 722276 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: move private dirs to the new format

https://gerrit.wikimedia.org/r/722276

Current status - part of the helmfile private dir refactoring is done, and services are now split by cluster group (main and ml-serve for the moment). What remains is the admin_ng use case, which will probably be added as a separate hiera config to avoid mixing.

Change 719128 merged by Elukey:

[operations/deployment-charts@master] Add revscoring-editquality as first ml-service to helmfile.d

https://gerrit.wikimedia.org/r/719128

Change 719522 merged by Elukey:

[operations/deployment-charts@master] Rakefile: change HELMFILE_GLOB to include ml-services

https://gerrit.wikimedia.org/r/719522

Change 723073 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] kubernetes: add the revscoring-editquality-deploy fake user/token

https://gerrit.wikimedia.org/r/723073

Change 723073 merged by Elukey:

[labs/private@master] kubernetes: add the revscoring-editquality-deploy fake user/token

https://gerrit.wikimedia.org/r/723073

elukey mentioned this in rLPRI8bedbaa03dfa: kubernetes: add the revscoring-editquality-deploy fake user/token. Sep 23 2021, 10:20 AM

Change 723077 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::deployment_server: add revscoring-editquality-deploy k8s user

https://gerrit.wikimedia.org/r/723077

Change 723077 abandoned by Elukey:

[operations/puppet@production] role::deployment_server: add revscoring-editquality-deploy k8s user

Reason:

https://gerrit.wikimedia.org/r/723077

Change 724448 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: add user deploy-kserve

https://gerrit.wikimedia.org/r/724448

Change 724448 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: add user deploy-kserve

https://gerrit.wikimedia.org/r/724448

Change 724757 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: skip helm3 namespace creation for ml-services

https://gerrit.wikimedia.org/r/724757

Change 724757 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: skip helm3 namespace creation for ml-services

https://gerrit.wikimedia.org/r/724757

Change 724956 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: lower the min cpu limit for ml-serve

https://gerrit.wikimedia.org/r/724956

Change 724956 merged by Elukey: