Update kserve to v0.15.2* on ML clusters
Open, High, Public, 8 Estimated Story Points

Description

  • Figure out the best version to update to. v0.15.2 seems like a good candidate.
  • Update prod-images to build kserve images (Similar to change 1046617)
  • Update helm charts (note: moving to the upstream chart should be done later for simplicity)
  • Update staging-codfw to use new kserve and test functionality
  • Update prod-codfw to use new kserve and test functionality
  • Update prod-eqiad to use new kserve and test functionality

The last two items should be done with the non-active DC going first.
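
The ordering above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `rollout_order` and the idea of passing the active DC as a parameter are hypothetical and not part of any existing tooling; only the cluster names come from the list above.

```python
# Hypothetical helper: the names below are illustrative, not existing tooling.
STAGING = "staging-codfw"
PROD = {"codfw": "prod-codfw", "eqiad": "prod-eqiad"}


def rollout_order(active_dc: str) -> list[str]:
    """Return clusters in upgrade order: staging, non-active prod DC, active prod DC."""
    if active_dc not in PROD:
        raise ValueError(f"unknown datacenter: {active_dc!r}")
    # Every prod cluster that is not in the active DC goes before the active one.
    passive = [cluster for dc, cluster in PROD.items() if dc != active_dc]
    return [STAGING, *passive, PROD[active_dc]]


if __name__ == "__main__":
    print(rollout_order("eqiad"))
```

Which DC is active has to be checked at rollout time, so it is passed in rather than hard-coded.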

Event Timeline

klausman set the point value for this task to 8. Nov 26 2024, 3:17 PM

@klausman Shall we rename this task and switch to a newer version? A candidate could be the latest version 0.15.2

Yeah, sounds good. I'll make the edit.

klausman renamed this task from Update kserve to v0.13.0 on ML clusters to Update kserve to v0.15.2* on ML clusters. Jul 8 2025, 11:59 AM
klausman updated the task description.

Also posting here something that would be useful for us to try:
We can use https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1154293 to spin up a cluster and try out changes in the admin_ng namespace.
If this proves useful, we should also update our documentation.

cc: @BWojtowicz-WMF

I've managed to spin up a local cluster with minikube, following our documentation. The documentation is a little outdated, so I'll be updating it this week with the improvements I discovered.
On my local cluster, I installed the new kserve version directly from the kserve GitHub charts and could successfully deploy our services, which suggests there are no dependency conflicts between the new kserve version and our current setup.

As suggested by @elukey, I've looked through the KServe release notes for possible conflicts with old Istio versions. Although I have not found any mention of breaking changes, there were several Istio bumps in the KServe chart, which now uses version 1.20.4. Istio will also be updated on our end in the next Kubernetes upgrade to version 1.24.x, but the Kubernetes upgrade needs to be scheduled by the ML team first. @isarantopoulos It might make sense to update the KServe chart only once we bump the K8s and Istio versions, but it's not required.


Next steps

  1. I will prepare a patch to build the Wikimedia prod images for kserve==0.15.2.
  2. Once the images are built, we need to update the Helm charts according to the README. We need to discuss whether we do this before or after migrating to the new K8s/Istio versions.
  3. Apply and test the new charts successively in staging-codfw, prod-codfw and prod-eqiad.
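
For the "test" part of step 3, a minimal smoke check against a deployed model could look like the sketch below. It assumes the service exposes the KServe V1 inference protocol path (`/v1/models/<name>:predict`). The function names, the optional `Host` header parameter, and any base URL, model name or payload passed in are hypothetical placeholders, not values taken from this task.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, payload, host_header=None):
    """Build a POST request for the KServe V1 predict endpoint (no network I/O)."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    headers = {"Content-Type": "application/json"}
    if host_header:
        # Assumption: the ingress routes on the Host header; omit if not needed.
        headers["Host"] = host_header
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def smoke_test(request, timeout=10.0):
    """Send the request; return True on HTTP 200.

    urllib raises HTTPError for 4xx/5xx responses, so failures surface as exceptions.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200
```

Running the same check in each cluster, in the rollout order, before moving on to the next one keeps the comparison between old and new kserve consistent.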

Sorry for the drive-by: I've created a script (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/kind.sh) that can spin up production-like wikikube clusters with "kind". Maybe there's an opportunity to move the steps in https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/ML-Sandbox/Configuration towards that, so we can use the same code and tools.
I should read more carefully - sorry.