Discussion: dedicated directory in the deployment-chart repository for ML services
Closed, Resolved, Public

Description

While chatting with @JMeybohm about the ML cluster, we agreed to open this task to decide where to place ML-related service definitions in the deployment-charts repository.

The main problem is that the services directory, which contains the helmfile configs, is tailored to the ServiceOps use case, so it may not be a good place for ML-related service definitions. For example, if one of the main ServiceOps k8s clusters needs to be bootstrapped from scratch, it is sufficient to helmfile sync all the dirs under services. If we add ML-specific helmfile configs there, an operator would need to know which service runs on which cluster, which is not straightforward.
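To make the bootstrap concern concrete, here is a minimal Python sketch of "sync every dir under one root". It only builds the commands and does not run them. The directory layout (one `helmfile.yaml` per service subdirectory) and the helper names `helmfile_dirs` and `sync_commands` are assumptions for illustration, not the repository's actual tooling.

```python
from pathlib import Path


def helmfile_dirs(root: Path) -> list[Path]:
    """Return every direct subdirectory of `root` that holds a helmfile.yaml."""
    return sorted(
        d for d in root.iterdir()
        if d.is_dir() and (d / "helmfile.yaml").is_file()
    )


def sync_commands(root: Path, environment: str) -> list[list[str]]:
    """Build (but do not execute) one `helmfile sync` command per service dir.

    Assumption: each service is synced on its own with `-e <environment>`.
    """
    return [
        ["helmfile", "-e", environment, "-f", str(d / "helmfile.yaml"), "sync"]
        for d in helmfile_dirs(root)
    ]
```

With a single root per cluster group, the operator only has to pick the root; no per-service knowledge of which cluster runs what is needed.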

We should probably create a separate ml-services directory, in which we'll place (initially) the Kfserving config and its InferenceService definitions.

Event Timeline

We should probably create a separate ml-services directory, in which we'll place (initially) the Kfserving config and its InferenceService definitions.

I think this sounds reasonable. I took a quick glance at the services directory and there seem to be some slight differences compared to how we have been discussing the structure of our InferenceService definitions. I am wondering: if we have our own directory, would we also have our own ml-serve operator to manage kfserving/isvcs?

The idea is to have the same structure as the services directory (so managing everything via helmfile etc.), with just a separation of concerns to ease the work of SREs. Depending on how we create the Kfserving helm chart, we should be able to deploy InferenceServices with a simple yaml config deployed via helmfile (the details are still not clear, but it should be possible).

I think having a separate directory is the way to go. I'll leave it to you to actually pick a name.

But please consider: you need to add this directory to our CI pipeline. This also means that I would be very careful with modifying the general structure and workflow if you are going to use helmfile for your deployments (which is what I'd recommend).

Unsurprisingly, I would agree to that. :)

It turns out that even for the admin_ng dir it is a problem, see for example early attempts of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/707408

A possible solution could be to split admin_ng as well, not really DRY but there seems to be no other good solution :(

Coming back to this task :)

For the admin_ng directory, Joe came up with this trick to add new functionality as opt-in for clusters: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/708475. We used the same solution for both knative and kfserving, and it worked nicely (so only ml-serve clusters will opt in to them).
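The linked change is not reproduced here; the following is only a minimal Python sketch of the general opt-in idea, where optional components default to off and a cluster enables them through its own overrides. The cluster names (`ml-serve-a`, `main-a`), the dictionaries, and the `enabled_components` helper are all hypothetical.

```python
# Hypothetical defaults: optional components stay disabled unless a cluster opts in.
DEFAULTS = {"knative": False, "kfserving": False}

# Hypothetical per-cluster overrides; only the ml-serve cluster opts in.
CLUSTER_OVERRIDES = {
    "ml-serve-a": {"knative": True, "kfserving": True},
    "main-a": {},
}


def enabled_components(cluster: str) -> list[str]:
    """Merge the defaults with a cluster's overrides and list what is enabled."""
    merged = {**DEFAULTS, **CLUSTER_OVERRIDES.get(cluster, {})}
    return sorted(name for name, enabled in merged.items() if enabled)
```

A cluster with no overrides (or an unknown cluster) gets nothing extra, so existing clusters are unaffected when a new optional component is added to the defaults.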

The next step is to figure out how the ml-services directory will work. We'll need to use the kubeflow-kfserving-inference chart for each service, which basically deploys the following (for each group of ML models):

  • an instance of the InferenceService CRD (provided by Kfserving)
  • a secret containing the credentials to access the Swift buckets

The initial goal is to create a subdir for each group of models that ORES provides (mainly 4 categories), and instantiate one or more InferenceService resources (one for each model). We need a deploy user able to read the secret and spin up pods (plus some other things).
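As a rough illustration of "one secret plus one InferenceService per model" for a group, here is a minimal Python sketch that builds the resources as plain dicts. It is not the chart's actual template: the `serving.kubeflow.org/v1beta1` API version and the predictor layout are assumed from the KFServing v1beta1 API and should be checked against the deployed version, and the secret key names, image, model names, and all helper names are hypothetical.

```python
def swift_secret(group: str, access_key: str, secret_key: str) -> dict:
    """Secret with the bucket credentials (key names are hypothetical)."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": f"{group}-swift", "namespace": group},
        "type": "Opaque",
        "stringData": {"access_key": access_key, "secret_key": secret_key},
    }


def inference_service(group: str, model: str, image: str) -> dict:
    """One InferenceService per model (spec layout is an assumption)."""
    return {
        "apiVersion": "serving.kubeflow.org/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": model, "namespace": group},
        "spec": {"predictor": {"containers": [{"name": model, "image": image}]}},
    }


def group_resources(group: str, models: list[str], image: str,
                    access_key: str, secret_key: str) -> list[dict]:
    """All resources for one model group: the secret first, then the models."""
    resources = [swift_secret(group, access_key, secret_key)]
    resources += [inference_service(group, m, image) for m in models]
    return resources
```

For example, `group_resources("revscoring-editquality", ["model-a", "model-b"], "example/image:tag", "key", "secret")` yields three resources: one Secret and two InferenceServices in the same namespace.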

Change 719128 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Add revscoring-editquality as first ml-service to helmfile.d

https://gerrit.wikimedia.org/r/719128

Change 719515 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kubeflow-kfserving-inference: avoid repetitions with multi-models

https://gerrit.wikimedia.org/r/719515

Change 719522 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Rakefile: change HELMFILE_GLOB to include ml-services

https://gerrit.wikimedia.org/r/719522

Change 720048 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] kubernetes: add revscoring-editquality in the services configs

https://gerrit.wikimedia.org/r/720048

Change 719515 merged by Elukey:

[operations/deployment-charts@master] kubeflow-kfserving-inference: avoid repetitions with multi-models

https://gerrit.wikimedia.org/r/719515

Change 720342 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/deployment-charts@master] helmfile.d/admin make tiller components configurable per environment

https://gerrit.wikimedia.org/r/720342

Three code reviews are pending to implement the new ml-services dir:

Change 720342 merged by jenkins-bot:

[operations/deployment-charts@master] helmfile.d/admin make tiller components configurable per environment

https://gerrit.wikimedia.org/r/720342

Change 722276 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: move private dirs to the new format

https://gerrit.wikimedia.org/r/722276

Change 720048 merged by Elukey:

[operations/puppet@production] kubernetes: add revscoring-editquality in the services configs

https://gerrit.wikimedia.org/r/720048

Change 722818 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add missing dir declaration to helmfile private configurations

https://gerrit.wikimedia.org/r/722818

Change 722818 merged by Elukey:

[operations/puppet@production] Add missing dir declaration to helmfile private configurations

https://gerrit.wikimedia.org/r/722818

Change 722276 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: move private dirs to the new format

https://gerrit.wikimedia.org/r/722276

Current status: part of the helmfile private dir refactoring is done, and services are now split by cluster group (main and ml-serve for the moment). The admin_ng use case remains; it will probably be added as a separate hiera config to avoid mixing.

Change 719128 merged by Elukey:

[operations/deployment-charts@master] Add revscoring-editquality as first ml-service to helmfile.d

https://gerrit.wikimedia.org/r/719128

Change 719522 merged by Elukey:

[operations/deployment-charts@master] Rakefile: change HELMFILE_GLOB to include ml-services

https://gerrit.wikimedia.org/r/719522

Change 723073 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] kubernetes: add the revscoring-editquality-deploy fake user/token

https://gerrit.wikimedia.org/r/723073

Change 723073 merged by Elukey:

[labs/private@master] kubernetes: add the revscoring-editquality-deploy fake user/token

https://gerrit.wikimedia.org/r/723073

Change 723077 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::deployment_server: add revscoring-editquality-deploy k8s user

https://gerrit.wikimedia.org/r/723077

Change 723077 abandoned by Elukey:

[operations/puppet@production] role::deployment_server: add revscoring-editquality-deploy k8s user

Reason:

https://gerrit.wikimedia.org/r/723077

Change 724448 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: add user deploy-kserve

https://gerrit.wikimedia.org/r/724448

Change 724448 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: add user deploy-kserve

https://gerrit.wikimedia.org/r/724448

Change 724757 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: skip helm3 namespace creation for ml-services

https://gerrit.wikimedia.org/r/724757

Change 724757 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: skip helm3 namespace creation for ml-services

https://gerrit.wikimedia.org/r/724757

Change 724956 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: lower the min cpu limit for ml-serve

https://gerrit.wikimedia.org/r/724956

Change 724956 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: lower the min cpu limit for ml-serve

https://gerrit.wikimedia.org/r/724956

elukey claimed this task.

This has been implemented, and we deployed the first model/service via Helm3.