Deploy logo-detection model-server to LiftWing staging
Open, Needs Triage, Public

Description

In this task we are going to deploy the logo-detection model-server, recently published to the Wikimedia docker registry (T362598), to LiftWing staging for user acceptance testing by the Structured Content team. The team will test functionality using Upload Stash URLs, identify and report any edge cases encountered, and collaborate with the ML team to resolve them, so that the logo-detection inference service is ready for production use.

Event Timeline

Change #1020706 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental namespace

https://gerrit.wikimedia.org/r/1020706

Change #1020710 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] logo-detection: specify model name

https://gerrit.wikimedia.org/r/1020710

Change #1020710 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] logo-detection: specify model name

https://gerrit.wikimedia.org/r/1020710

Change #1020706 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental namespace

https://gerrit.wikimedia.org/r/1020706

I have deployed the logo-detection model-server in the experimental namespace on LiftWing staging. On checking the pod, I noticed it was not starting successfully due to a CrashLoopBackOff error:

$ kubectl get pods
.
.
.
NAME                                                              READY   STATUS             RESTARTS      AGE
logo-detection-predictor-00001-deployment-7f98bb54f7-hpnf4        1/3     CrashLoopBackOff   5 (36s ago)   4m31s
.
.
.
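
As a minimal sketch (not part of the actual deployment tooling), output like the above can be scanned for crash-looping pods. The function name crashlooping_pods is hypothetical, and the parsing assumes the default `kubectl get pods` table layout, with columns separated by two or more spaces.

```python
import re


def crashlooping_pods(kubectl_output: str) -> list[str]:
    """Return the names of pods whose STATUS column is CrashLoopBackOff.

    Assumes the default `kubectl get pods` table layout:
    NAME, READY, STATUS, RESTARTS, AGE, separated by 2+ spaces.
    """
    names = []
    for line in kubectl_output.splitlines():
        # Split on runs of 2+ spaces so "5 (36s ago)" stays one column.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) >= 3 and cols[2] == "CrashLoopBackOff":
            names.append(cols[0])
    return names


sample = """\
NAME                                                              READY   STATUS             RESTARTS      AGE
logo-detection-predictor-00001-deployment-7f98bb54f7-hpnf4        1/3     CrashLoopBackOff   5 (36s ago)   4m31s
"""
print(crashlooping_pods(sample))
```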

I have reviewed the logs and found that the CrashLoopBackOff error is occurring because the model-server lacks the necessary permissions to load the logo-detection model:

$ kubectl describe pod logo-detection-predictor-00001-deployment-7f98bb54f7-hpnf4
.
.
.
Traceback (most recent call last):
  File "/srv/logo_detection/model_server/model.py", line 210, in <module>
    model = LogoDetectionModel(model_name)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/logo_detection/model_server/model.py", line 33, in __init__
    self.model = self.load()
                 ^^^^^^^^^^^
  File "/srv/logo_detection/model_server/model.py", line 36, in load
    model = keras.models.load_model(self.model_path)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/keras/src/saving/saving_api.py", line 176, in load_model
    return saving_lib.load_model(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/keras/src/saving/saving_lib.py", line 139, in load_model
    with file_utils.File(filepath, mode="r+b") as gfile_handle, zipfile.ZipFile(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/keras/src/utils/file_utils.py", line 436, in File
    return open(path, mode=mode)
           ^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/mnt/models/logo_max_all.keras'
.
.
.

I am going to reach out to SREs to assist in resolving this permission issue.

PermissionError: [Errno 13] Permission denied: '/mnt/models/logo_max_all.keras'

I've fetched the image locally and run a shell in it like so:

docker run -it --entrypoint /bin/bash docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-logo-detection:stable

I wanted to check the permissions on /mnt/models/, and it turns out the directory doesn't exist (though /mnt does). So my guess is that the Blubberfile needs to be adjusted to create that directory (with the right permissions).

Change #1021397 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] logo-detection: downgrade bookworm to bullseye

https://gerrit.wikimedia.org/r/1021397

Change #1021397 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] logo-detection: downgrade bookworm to bullseye

https://gerrit.wikimedia.org/r/1021397

Change #1021398 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental ns

https://gerrit.wikimedia.org/r/1021398

Change #1021398 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental ns

https://gerrit.wikimedia.org/r/1021398

The directory is created by the storage-initializer container when the pod initializes.
I attached a shell to the running pod and checked the permissions on the file, and they seem OK:

-rw-r--r-- 1 nobody daemon 71333909 Apr 19 09:10 logo_max_all.keras

For comparison, these are the permissions on the revertrisk-wikidata model:

-rw-r--r-- 1 nobody daemon 1019107337 Apr 12 15:54 model.pkl

The same holds for the directories when I execute
ls -ld /mnt /mnt/models

drwxr-xr-x 1 root root   4096 Apr 12 15:54 /mnt
drwxrwsrwx 2 root daemon 4096 Apr 12 15:54 /mnt/models
drwxr-xr-x 1 root root   4096 Apr 19 09:10 /mnt
drwxrwsrwx 2 root daemon 4096 Apr 19 09:10 /mnt/models

This is a bug in keras: it tries to open the file with mode r+b (read and write, binary), but since the file is owned by another user (nobody vs. somebody) and is writable only by its owner (-rw-r--r--), the call fails. Why keras would need to be able to write to the file, I don't know.
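
The failure mode can be reproduced outside keras with a short sketch. It simulates the pod's situation by removing the write bit from a temporary file (in the pod, the process simply is not the file's owner), then tries both open modes. The file name and contents are placeholders, and the helper names can_open and demo are hypothetical. Note that when run as root, the "r+b" open succeeds, because root bypasses permission bits.

```python
import os
import stat
import tempfile


def can_open(path: str, mode: str) -> bool:
    """Return True if open(path, mode) succeeds, False on PermissionError."""
    try:
        with open(path, mode):
            return True
    except PermissionError:
        return False


def demo() -> dict[str, bool]:
    """Try "rb" and "r+b" on a file the current user may only read."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "placeholder_model.keras")
        with open(path, "wb") as f:
            f.write(b"placeholder")
        # Read-only for everyone (0o444), standing in for a file
        # that the current user is not allowed to write to.
        os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        result = {mode: can_open(path, mode) for mode in ("rb", "r+b")}
        # Restore write permission so cleanup works on every platform.
        os.chmod(path, 0o644)
    return result


print(demo())
```

For a non-root user, "rb" opens fine while "r+b" is rejected with a PermissionError, which matches the traceback above.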

Change #1021883 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] logo-detection: bump keras to 3.2.1

https://gerrit.wikimedia.org/r/1021883

Filed a patch to test the latest keras version (3.2.1), which opens the file in "rb" mode:
https://github.com/keras-team/keras/blob/master/keras/src/saving/saving_lib.py#L151
keras 3.2.1:

with open(filepath, "rb") as f:
    return _load_model_from_fileobj(
        f, custom_objects, compile, safe_mode
    )

vs keras 3.0.4 (current version) https://github.com/keras-team/keras/blob/v3.0.4/keras/saving/saving_lib.py#L139

with file_utils.File(filepath, mode="r+b") as gfile_handle, zipfile.ZipFile(
    gfile_handle, "r"
) as zf:
    with zf.open(_CONFIG_FILENAME, "r") as f:
        config_json = f.read()

Change #1021883 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] logo-detection: bump keras to 3.2.1

https://gerrit.wikimedia.org/r/1021883

Upgrading to keras==3.2.1 resolved the above issue. Nice catch @klausman!
Now, for requests to work, we need to give the model server connectivity to the Commons upload stash so that it can download the images.

Thank you for your help in troubleshooting and resolving the permissions issue, @klausman and @isarantopoulos!

@mfossati, when a model-server is deployed within the WMF k8s infrastructure, it has to be configured to access external resources like wikimedia, wikipedia, and wikidata (see details here). Is it possible for the Structured Content team to provide sample URLs from the commons upload stash? This will enable us to configure the logo-detection model-server to access them from LiftWing. Thanks in advance.

Change #1021908 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update keras version in logo detection

https://gerrit.wikimedia.org/r/1021908

Change #1021399 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] logo-detection: upgrade bullseye to bookworm

https://gerrit.wikimedia.org/r/1021399

Change #1021399 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] logo-detection: upgrade bullseye to bookworm

https://gerrit.wikimedia.org/r/1021399

Change #1021401 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: upgrade OS in logo-detection

https://gerrit.wikimedia.org/r/1021401

Change #1021402 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: upgrade OS in logo-detection

https://gerrit.wikimedia.org/r/1021402

Change #1021908 merged by Ilias Sarantopoulos:

[operations/deployment-charts@master] ml-services: update keras version in logo detection

https://gerrit.wikimedia.org/r/1021908

Change #1021401 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: upgrade OS in logo-detection

https://gerrit.wikimedia.org/r/1021401

Hey @kevinbazira, here's what a public stash URL would look like: https://commons.wikimedia.org/wiki/Special:UploadStash/file/1avpfxdmdb4c.deuia.10893556.png. The only variable would be the file key, i.e., 1avpfxdmdb4c.deuia.10893556.png.
Not 100% sure, but I guess that you can go for http://localhost:6500/wiki/Special:UploadStash/file/1avpfxdmdb4c.deuia.10893556.png, with commons.wikimedia.org as the host header.
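
A minimal sketch of that suggestion, using only the Python standard library. It builds the request without sending it. The http://localhost:6500 endpoint is the unconfirmed guess from the comment above, and the function name build_stash_request is hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: local proxy endpoint guessed in the comment above, not confirmed.
LOCAL_ENDPOINT = "http://localhost:6500"
COMMONS_HOST = "commons.wikimedia.org"


def build_stash_request(file_key: str, endpoint: str = LOCAL_ENDPOINT) -> Request:
    """Build (but do not send) a GET request for an upload stash file.

    The request targets the local endpoint while the Host header
    names commons.wikimedia.org.
    """
    url = f"{endpoint}/wiki/Special:UploadStash/file/{quote(file_key, safe='')}"
    return Request(url, headers={"Host": COMMONS_HOST})


req = build_stash_request("1avpfxdmdb4c.deuia.10893556.png")
print(req.full_url)
print(req.get_header("Host"))
```

Only the file key varies between requests, so it is the single parameter; it is percent-encoded in case a key ever contains characters that are not URL-safe.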

Thank you for sharing an example of the public stash URL, @mfossati! In T363449, we are going to configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons.