
Upgrade ml clusters to kserve 0.8
Closed, Resolved (Public)

Description

The kserve 0.8 release is out: https://github.com/kserve/kserve/releases/tag/v0.8.0

There are some interesting new features; we should try the upgrade in staging and check that everything works as expected.

Some notes:

  • The recommended knative version is now 1.0, while we run 0.18. In https://github.com/kserve/kserve/issues/2292 upstream suggested that everything should work fine on our platform too, but we'd need to test.
  • As explained in the changelog, several Python classes got changed from 0.7 to 0.8, so we'd need to change our code as well.
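
To give an idea of the kind of change involved, here is a minimal sketch of a helper that rewrites the old class names in a model.py file. It assumes the renames are KFModel to Model, KFServer to ModelServer and KFModelRepository to ModelRepository; that mapping is my reading of the 0.8 changelog and should be checked against the release notes before relying on it. The helper name and the sample source are made up for illustration.

```python
import re

# Assumed 0.7 -> 0.8 class renames (verify against the kserve 0.8.0 release notes).
RENAMES = {
    "KFModelRepository": "ModelRepository",
    "KFModel": "Model",
    "KFServer": "ModelServer",
}

# Match whole identifiers only, so "KFModel" never matches inside "KFModelRepository".
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_source(source):
    """Return `source` with the old kserve class names replaced by the new ones."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source)


# Hypothetical 0.7-style snippet, used only to show the effect.
old_source = (
    "class EditQuality(kserve.KFModel):\n"
    "    pass\n"
    "kserve.KFServer().start([EditQuality('enwiki-goodfaith')])\n"
)
new_source = migrate_source(old_source)
print(new_source)
```

A text rewrite like this only covers the renames; any signature or behaviour changes between the two versions still need a manual review of each model.py.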

High level list of things to do:

  1. Prepare the new docker image in production-images (kserve controller).
  2. File code changes for inference-service model.py files and create the new Docker images (bump kserve's pypi dependency as well).
  3. Import the new kserve yaml config in deployment-charts and update the related chart.
  4. Deploy everything to staging and test.
  5. Deploy everything to prod and test.

Event Timeline

Change 810841 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] Upgrade kserve images to upstream release 0.8

https://gerrit.wikimedia.org/r/810841

Change 810841 merged by Elukey:

[operations/docker-images/production-images@master] Upgrade kserve images to upstream release 0.8

https://gerrit.wikimedia.org/r/810841

root@build2001:/srv/images/production-images# build-production-images 
== Step 0: scanning /srv/images/production-images/images ==
Will build the following images:
* docker-registry.discovery.wmnet/kserve-build:0.8.0-1
* docker-registry.discovery.wmnet/kserve-controller:0.8.0-1
* docker-registry.discovery.wmnet/kserve-agent:0.8.0-1
* docker-registry.discovery.wmnet/kserve-storage-initializer:0.8.0-1
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/kserve-build:0.8.0-1
* Built image docker-registry.discovery.wmnet/kserve-controller:0.8.0-1
* Built image docker-registry.discovery.wmnet/kserve-agent:0.8.0-1
* Built image docker-registry.discovery.wmnet/kserve-storage-initializer:0.8.0-1
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/kserve-controller:0.8.0-1
Successfully published image docker-registry.discovery.wmnet/kserve-agent:0.8.0-1
Successfully published image docker-registry.discovery.wmnet/kserve-build:0.8.0-1
Successfully published image docker-registry.discovery.wmnet/kserve-storage-initializer:0.8.0-1
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/istio ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/cert-manager ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log

I ran into a dependency issue when building the Docker image after bumping kserve’s dependencies in requirements.txt:

  • kserve==0.8.0
  • ray==1.9.0
  • numpy==1.19.2 (kserve 0.8.0 depends on numpy~=1.19.2)
  • boto3==1.20.24 (kserve 0.8.0 depends on boto3==1.20.24)
  • botocore==1.23.24 (boto3 1.20.24 depends on botocore<1.24.0 and >=1.23.24)
#18 174.4 ERROR: Cannot install -r model-server/requirements.txt (line 103), -r model-server/requirements.txt (line 108), -r model-server/requirements.txt (line 40), -r model-server/requirements.txt (line 62), -r model-server/requirements.txt (line 63) and numpy==1.19.2 because these package versions have conflicting dependencies.
#18 174.4 
#18 174.4 The conflict is caused by:
#18 174.4     The user requested numpy==1.19.2
#18 174.4     gensim 3.8.3 depends on numpy>=1.11.3
#18 174.4     konlpy 0.5.2 depends on numpy>=1.6
#18 174.4     kserve 0.8.0 depends on numpy~=1.19.2
#18 174.4     ray 1.9.0 depends on numpy>=1.16; python_version < "3.9"
#18 174.4     revscoring 2.11.4 depends on numpy<1.18.999 and >=1.18.4
#18 174.4 
#18 174.4 To fix this you could try to:
#18 174.4 1. loosen the range of package versions you've specified
#18 174.4 2. remove package versions to allow pip attempt to solve the dependency conflict
#18 174.4 
#18 174.4 ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
------
executor failed running [/bin/sh -c python3.7 "-m" "pip" "wheel" "-r" "model-server/requirements.txt" && python3.7 "-m" "pip" "install" "--target" "/opt/lib/python/site-packages" "-r" "model-server/requirements.txt"]: exit code: 1

We need to check whether we can bump revscoring's numpy dependency to 1.19.2.
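
The conflict in the pip output above comes down to two numpy ranges that do not overlap. The sketch below reproduces that with plain version tuples: the kserve and revscoring ranges are taken from the log, while the relaxed revscoring range (>=1.18.4, <1.20) is only a hypothetical example of what a bump could look like. It is a simplification of pip's real version handling (no pre-releases, epochs or wildcards).

```python
# Ranges are half-open intervals [lower, upper) over version tuples.

def parse(version):
    """Turn '1.19.2' into (1, 19, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def compatible_release(version):
    """Range for '~=X.Y.Z': at least X.Y.Z, but below X.(Y+1)."""
    lower = parse(version)
    upper = lower[:-2] + (lower[-2] + 1,)
    return lower, upper


def intersect(a, b):
    """Return the overlap of two ranges, or None if they do not overlap."""
    lower = max(a[0], b[0])
    upper = min(a[1], b[1])
    return (lower, upper) if lower < upper else None


def contains(version_range, version):
    """True if `version` falls inside the half-open range."""
    return version_range[0] <= parse(version) < version_range[1]


# From the pip output: kserve 0.8.0 depends on numpy~=1.19.2
kserve_numpy = compatible_release("1.19.2")
# From the pip output: revscoring 2.11.4 depends on numpy<1.18.999 and >=1.18.4
revscoring_numpy = (parse("1.18.4"), parse("1.18.999"))

conflict = intersect(kserve_numpy, revscoring_numpy)
print("current overlap:", conflict)

# Hypothetical relaxed revscoring range, for illustration only.
relaxed_revscoring_numpy = (parse("1.18.4"), parse("1.20"))
resolved = intersect(kserve_numpy, relaxed_revscoring_numpy)
print("overlap after relaxing:", resolved)
print("numpy 1.19.2 allowed:", contains(resolved, "1.19.2"))
```

The other numpy requirements in the log (gensim, konlpy, ray) only set lower bounds below 1.19.2, so revscoring is the only package that has to change.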

@achou thanks a lot! I have tested revscoring on stat1004 in the following way:

  • cloned the revscoring repo
  • pip-installed the dependencies and updated numpy to 1.19.2
  • ran my script for T309623#8084829, including getting a score from the model.

Everything worked nicely, so I think we can safely upgrade the requirements.txt file in the revscoring repo. Since we have some things pending (MWAPICache, a community member fixing tests, the numpy upgrade, etc.), I think we can wait for everything to be merged and then cut a new release (probably 2.11.5).

I also ran all the unit tests: only 3 fail, and that is not a regression (they already fail today).

Change 815691 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve: upgrade to upstream release 0.8

https://gerrit.wikimedia.org/r/815691

Change 815721 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] Update Python model servers and requirements to KServe 0.8

https://gerrit.wikimedia.org/r/815721

Change 815721 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] Update Python model servers and requirements to KServe 0.8

https://gerrit.wikimedia.org/r/815721

Change 816710 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: test the first kserve 0.8 Docker images

https://gerrit.wikimedia.org/r/816710

Change 816710 merged by Elukey:

[operations/deployment-charts@master] ml-services: test the first kserve 0.8 Docker images

https://gerrit.wikimedia.org/r/816710

Tested all Docker images locally (and added documentation on Wikitech). Merged the change and updated the isvc images for articlequality and editquality in staging; all tests passed.

It seems possible, at least between these two kserve versions, to upgrade the Docker images separately from the KServe control plane.

Change 816716 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update Docker images to KServe 0.8 in staging

https://gerrit.wikimedia.org/r/816716

Change 816716 merged by Elukey:

[operations/deployment-charts@master] ml-services: update Docker images to KServe 0.8 in staging

https://gerrit.wikimedia.org/r/816716

Change 815691 merged by Elukey:

[operations/deployment-charts@master] kserve: upgrade to upstream release 0.8

https://gerrit.wikimedia.org/r/815691

Change 816762 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve: add service account to StatefulSet

https://gerrit.wikimedia.org/r/816762

Change 816792 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] kserve: apply upstream fix to storage-initializer

https://gerrit.wikimedia.org/r/816792

Change 816792 merged by Elukey:

[operations/docker-images/production-images@master] kserve: apply upstream fix to storage-initializer

https://gerrit.wikimedia.org/r/816792

Change 816807 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve: update storage-initializer's docker image version

https://gerrit.wikimedia.org/r/816807

Change 816762 merged by Elukey:

[operations/deployment-charts@master] kserve: add service account to StatefulSet

https://gerrit.wikimedia.org/r/816762

Change 816807 merged by Elukey:

[operations/deployment-charts@master] kserve: update storage-initializer's docker image version

https://gerrit.wikimedia.org/r/816807

The new storage-initializer image works! KServe 0.8 is deployed in staging and so far everything works fine. The next step is to plan and execute the deployment to production.

High level details:

  • Move all KServe pods to the new Docker images.
  • Upgrade the KServe control plane on K8s to 0.8.

Change 817732 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: move prod docker images to KServe 0.8

https://gerrit.wikimedia.org/r/817732

Change 817732 merged by Elukey:

[operations/deployment-charts@master] ml-services: move prod docker images to KServe 0.8

https://gerrit.wikimedia.org/r/817732

ml-serve-codfw upgraded, all good so far. Waiting a day and some deployments before proceeding with eqiad as well.

ml-serve-eqiad completed as well.