Upgrade the inference-services repo codebase to kserve 0.10 (fastapi)
Closed, Resolved (Public)

Description

During the upgrade of the revscoring model servers to Bullseye and Python 3.9 we had to upgrade their kserve dependency to 0.10 to use a more up-to-date version of numpy. The upgrade to kserve 0.10 is a lot more invasive than the other ones, since it involves dropping tornado in favor of fastapi.

The main problem in our codebase is that we raise tornado-specific exceptions when an error condition occurs, to return a specific HTTP response. With the drop of tornado we'll have to use something different, and at the moment for revscoring we opted for some nice built-in exceptions: InvalidInput and InferenceError. Due to how fastapi works, it is also possible to return JSON responses with a specific status code; we'll see what's best when we work on this task.
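To illustrate the idea of framework-agnostic exceptions, here is a minimal sketch using only the standard library. The `InvalidInput` and `InferenceError` classes below are local stand-ins named after the exceptions mentioned above, so the snippet runs without kserve installed. The `preprocess`, `to_http_response` and `handle` helpers, the `rev_id` field and the 400/500 mapping are assumptions made for illustration, not the actual model server code.

```python
import json
from http import HTTPStatus


class InvalidInput(ValueError):
    """Local stand-in for the exception of the same name (client-side error)."""


class InferenceError(RuntimeError):
    """Local stand-in for the exception of the same name (server-side error)."""


# Assumed mapping for this sketch; in a real deployment the serving
# framework decides which status code each exception produces.
STATUS_BY_EXCEPTION = {
    InvalidInput: HTTPStatus.BAD_REQUEST,
    InferenceError: HTTPStatus.INTERNAL_SERVER_ERROR,
}


def preprocess(payload: dict) -> int:
    """Hypothetical validation step: require an integer rev_id."""
    rev_id = payload.get("rev_id")
    if not isinstance(rev_id, int):
        raise InvalidInput('Expected "rev_id" to be an integer.')
    return rev_id


def to_http_response(exc: Exception) -> tuple:
    """Translate an exception into a (status code, JSON body) pair."""
    status = STATUS_BY_EXCEPTION.get(type(exc), HTTPStatus.INTERNAL_SERVER_ERROR)
    return int(status), json.dumps({"error": str(exc)})


def handle(payload: dict) -> tuple:
    """Run validation and turn known errors into HTTP-style responses."""
    try:
        rev_id = preprocess(payload)
    except (InvalidInput, InferenceError) as exc:
        return to_http_response(exc)
    return 200, json.dumps({"rev_id": rev_id})
```

The model code only raises plain Python exceptions, and the translation to an HTTP response happens in one place, which is what makes swapping tornado for fastapi less painful.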

Since we have common modules, like events.py, we decided to raise RuntimeError exceptions in there, as it seemed a good compromise that works with both tornado and fastapi. Once we have everything on kserve 0.10 we should also update the common modules to use more meaningful exceptions (if needed).
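A rough sketch of how the RuntimeError compromise can work: the shared module raises a generic exception, and each model server converts it into whatever its framework expects. Everything here is hypothetical (the `send_event` helper, its arguments, the `predict` wrapper and the local `InferenceError` stand-in); it only shows the pattern, not the real events.py code.

```python
class InferenceError(Exception):
    """Local stand-in for a framework-specific server error."""


def send_event(event: dict, stream: str) -> None:
    """Hypothetical helper living in a shared module such as events.py.

    It knows nothing about tornado or fastapi, so it raises RuntimeError.
    """
    if not stream:
        raise RuntimeError("No event stream configured.")
    # A real implementation would send the event to the stream here.


def predict(event: dict, stream: str) -> dict:
    """Hypothetical model server method that calls the shared helper."""
    try:
        send_event(event, stream)
    except RuntimeError as exc:
        # Translate the generic error at the server boundary, keeping
        # the original exception as the cause for debugging.
        raise InferenceError(f"Failed to send event: {exc}") from exc
    return {"sent": True}
```

With this split, replacing RuntimeError with more meaningful exceptions later only touches the shared module and the `except` clause at the boundary.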

To recap:

  • Upgrade nsfw to Kserve 0.10 - see T331416
  • Upgrade revert risk to Kserve 0.10
  • Upgrade outlink to Kserve 0.10
  • Check whether common modules like events.py need better exceptions.

Event Timeline

@elukey Do these model servers also need to be upgraded to bullseye and python 3.9?

@achou yep yep, there should be some tasks related to that move in the backlog. We can probably couple the migrations in one, I'll let people decide :)

In this context, after we upgrade we can check if there is a Swagger UI available for the model servers (which comes bundled with FastAPI: https://fastapi.tiangolo.com/features/).

Change 894004 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] revscoring: remove unnecessary aiohttp session cleanup

https://gerrit.wikimedia.org/r/894004

Change 894005 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] revert-risk: upgrade to Kserve 0.10

https://gerrit.wikimedia.org/r/894005

Change 894006 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] nsfw: upgrade to Kserve 0.10

https://gerrit.wikimedia.org/r/894006

Change 894007 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] outlink: upgrade to Kserve 0.10

https://gerrit.wikimedia.org/r/894007

Change 894004 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revscoring: remove unnecessary aiohttp session cleanup

https://gerrit.wikimedia.org/r/894004

Change 894005 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revert-risk: upgrade to Kserve 0.10

https://gerrit.wikimedia.org/r/894005

Change 894663 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] blubber: add python3-distutils to revert-risk configs

https://gerrit.wikimedia.org/r/894663

Change 894663 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] blubber: add python3-distutils to revert-risk configs

https://gerrit.wikimedia.org/r/894663

Change 894006 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] nsfw: upgrade to Kserve 0.10

https://gerrit.wikimedia.org/r/894006

Change 894695 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] nsfw,revert-risk: update method signature for Kserve 0.10

https://gerrit.wikimedia.org/r/894695

Change 894007 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] outlink: upgrade to Kserve 0.10

https://gerrit.wikimedia.org/r/894007

Change 894695 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] nsfw,revert-risk: update method signature for Kserve 0.10

https://gerrit.wikimedia.org/r/894695

Change 895065 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: update docker images for kserve 0.10

https://gerrit.wikimedia.org/r/895065

Change 895065 merged by Elukey:

[operations/deployment-charts@master] ml-services: update docker images for kserve 0.10

https://gerrit.wikimedia.org/r/895065

Change 895132 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] python: remove FIXME and refactor some exceptions

https://gerrit.wikimedia.org/r/895132

I opened T331416 for nsfw, since the model's predict function seems to hang with the new version of Kserve. Since it is an experimental model it shouldn't block upgrades.

Change 895132 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] python: remove FIXME and refactor some exceptions

https://gerrit.wikimedia.org/r/895132

Change 895706 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: upgrade docker images

https://gerrit.wikimedia.org/r/895706

Change 895706 merged by Elukey:

[operations/deployment-charts@master] ml-services: upgrade docker images

https://gerrit.wikimedia.org/r/895706

Task completed, all clusters upgraded to kserve 0.10. The nsfw model doesn't work, but since it is experimental we'll follow up in T331416.

elukey updated the task description.