Deploy logo-detection model-server to LiftWing staging
Open, Needs Triage, Public

Authored By

	kevinbazira
	Wed, Apr 17, 7:19 AM

Description

In this task, we are going to deploy the logo-detection model-server, recently published to the Wikimedia docker registry (T362598), to LiftWing staging for user acceptance testing by the Structured Content team. The team will test functionality using Upload Stash URLs, identify and report any edge cases encountered, and collaborate with the ML team to resolve them, so that the logo-detection inference service is ready for production use.

Details

Subject	Repo	Branch	Lines +/-
ml-services: upgrade OS in logo-detection	operations/deployment-charts	master	+1 -1
ml-services: update keras version in logo detection	operations/deployment-charts	master	+1 -1
ml-services: upgrade OS in logo-detection	operations/deployment-charts	master	+1 -1
logo-detection: upgrade bullseye to bookworm	machinelearning/liftwing/inference-services	main	+1 -1
logo-detection: bump keras to 3.2.1	machinelearning/liftwing/inference-services	main	+1 -1
ml-services: add logo-detection isvc to experimental ns	operations/deployment-charts	master	+1 -1
logo-detection: downgrade bookworm to bullseye	machinelearning/liftwing/inference-services	main	+1 -1
ml-services: add logo-detection isvc to experimental namespace	operations/deployment-charts	master	+17 -0
logo-detection: specify model name	machinelearning/liftwing/inference-services	main	+1 -1

Related Objects

Status	Assigned	Task
Open	None	T349641 [EPIC] Logo machine detection on Commons
In Progress	kevinbazira	T358676 Host a logo detection model for Commons images
Open	kevinbazira	T362749 Deploy logo-detection model-server to LiftWing staging
Open	kevinbazira	T363449 Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons

Event Timeline

kevinbazira created this task. · Wed, Apr 17, 7:19 AM

Restricted Application added a subscriber: Aklapper. · Wed, Apr 17, 7:19 AM

kevinbazira mentioned this in T362598: Prepare docker image for hosting the logo-detection model-server on LiftWing. · Wed, Apr 17, 7:29 AM

Change #1020706 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental namespace

https://gerrit.wikimedia.org/r/1020706

gerritbot added a project: Patch-For-Review. · Wed, Apr 17, 8:24 AM

mfossati subscribed. · Wed, Apr 17, 8:54 AM

Change #1020710 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] logo-detection: specify model name

https://gerrit.wikimedia.org/r/1020710

Change #1020710 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] logo-detection: specify model name

https://gerrit.wikimedia.org/r/1020710

kevinbazira mentioned this in rMLIS4553d60c2dc0: logo-detection: specify model name. · Wed, Apr 17, 11:53 AM

Change #1020706 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental namespace

https://gerrit.wikimedia.org/r/1020706

I have deployed the logo-detection model-server in the experimental namespace on LiftWing staging. On checking the pod, I noticed it was not starting successfully due to a CrashLoopBackOff error:

$ kubectl get pods
.
.
.
NAME                                                              READY   STATUS             RESTARTS      AGE
logo-detection-predictor-00001-deployment-7f98bb54f7-hpnf4        1/3     CrashLoopBackOff   5 (36s ago)   4m31s
.
.
.

I have reviewed the logs and found that the CrashLoopBackOff error is occurring because the model-server lacks the necessary permissions to load the logo-detection model:

$ kubectl describe pod logo-detection-predictor-00001-deployment-7f98bb54f7-hpnf4
.
.
.
Traceback (most recent call last):
  File "/srv/logo_detection/model_server/model.py", line 210, in <module>
    model = LogoDetectionModel(model_name)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/logo_detection/model_server/model.py", line 33, in __init__
    self.model = self.load()
                 ^^^^^^^^^^^
  File "/srv/logo_detection/model_server/model.py", line 36, in load
    model = keras.models.load_model(self.model_path)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/keras/src/saving/saving_api.py", line 176, in load_model
    return saving_lib.load_model(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/keras/src/saving/saving_lib.py", line 139, in load_model
    with file_utils.File(filepath, mode="r+b") as gfile_handle, zipfile.ZipFile(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/keras/src/utils/file_utils.py", line 436, in File
    return open(path, mode=mode)
           ^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/mnt/models/logo_max_all.keras'
.
.
.

I am going to reach out to SREs to assist in resolving this permission issue.

In T362749#9727438, @kevinbazira wrote:

PermissionError: [Errno 13] Permission denied: '/mnt/models/logo_max_all.keras'

I've fetched the image locally and run a shell in it like so:

docker run -it --entrypoint /bin/bash docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-logo-detection:stable

I wanted to check the permissions on /mnt/models/, and it turns out the directory doesn't exist (/mnt does). So my guess is that the Blubberfile needs to be adjusted to create that directory (with the right permissions).

Change #1021397 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] logo-detection: downgrade bookworm to bullseye

https://gerrit.wikimedia.org/r/1021397

Change #1021397 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] logo-detection: downgrade bookworm to bullseye

https://gerrit.wikimedia.org/r/1021397

kevinbazira mentioned this in rMLIS6575d655b7cb: logo-detection: downgrade bookworm to bullseye. · Fri, Apr 19, 8:35 AM

Change #1021398 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental ns

https://gerrit.wikimedia.org/r/1021398

Change #1021398 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental ns

https://gerrit.wikimedia.org/r/1021398

The directory is created by the storage-initializer container when the pod initializes.
I attached a shell to the running pod and checked the permissions on the file, and they seem OK:

-rw-r--r-- 1 nobody daemon 71333909 Apr 19 09:10 logo_max_all.keras

For comparison, these are the permissions on the revertrisk-wikidata model:

-rw-r--r-- 1 nobody daemon 1019107337 Apr 12 15:54 model.pkl

The same holds for the directories when I execute
ls -ld /mnt /mnt/models

drwxr-xr-x 1 root root   4096 Apr 12 15:54 /mnt
drwxrwsrwx 2 root daemon 4096 Apr 12 15:54 /mnt/models

drwxr-xr-x 1 root root   4096 Apr 19 09:10 /mnt
drwxrwsrwx 2 root daemon 4096 Apr 19 09:10 /mnt/models
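
The listings above show that the file is world-readable but writable only by its owner. A minimal Python sketch for making the same check from inside a pod follows; describe_access is a hypothetical helper name, not part of the model-server code, and a POSIX filesystem is assumed:

```python
import os
import stat
import tempfile


def describe_access(path: str) -> dict:
    """Summarise the mode bits of `path` and what the current process may do."""
    st = os.stat(path)
    return {
        "mode": stat.filemode(st.st_mode),      # e.g. "-rw-r--r--"
        "uid": st.st_uid,                       # numeric owner
        "gid": st.st_gid,                       # numeric group
        "readable": os.access(path, os.R_OK),   # may this process read it?
        "writable": os.access(path, os.W_OK),   # may this process write it?
    }


if __name__ == "__main__":
    # Stand-in for /mnt/models/logo_max_all.keras so the sketch runs anywhere.
    fd, path = tempfile.mkstemp(suffix=".keras")
    os.close(fd)
    try:
        os.chmod(path, 0o644)
        print(describe_access(path))
    finally:
        os.remove(path)
```

When the process runs as a user other than the file owner, a mode of -rw-r--r-- would be expected to report the file as readable but not writable.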

This is a bug in keras: it tries to open the file with mode r+b (read and write, binary), but since the file is owned by another user (nobody vs. somebody) and only the owner has write permission, the call fails. Why keras would need to be able to write to the file, I don't know.
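
The difference between the two open modes can be reproduced outside keras. The following is a minimal sketch, assuming a throwaway temporary file stands in for the model file; can_open and demo are hypothetical helper names:

```python
import os
import stat
import tempfile


def can_open(path: str, mode: str) -> bool:
    """Return True if open(path, mode) succeeds, False on PermissionError."""
    try:
        with open(path, mode):
            return True
    except PermissionError:
        return False


def demo() -> dict:
    # Create a temporary file and strip every write bit, mimicking a model
    # file that the current process may read but not modify.
    fd, path = tempfile.mkstemp(suffix=".keras")
    os.close(fd)
    try:
        os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # r--r--r--
        return {
            "rb": can_open(path, "rb"),            # read-only open
            "r+b": can_open(path, "r+b"),          # read/write open
            "writable": os.access(path, os.W_OK),  # what the OS reports
        }
    finally:
        # Restore owner write permission so the file can be removed everywhere.
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Run as an unprivileged user, the "rb" open should succeed and the "r+b" open should fail with Errno 13, matching the traceback above. Run as root, both opens succeed, because root bypasses the mode bits.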

Change #1021883 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[machinelearning/liftwing/inference-services@main] logo-detection: bump keras to 3.2.1

https://gerrit.wikimedia.org/r/1021883

Filed a patch to test the latest keras version (3.2.1), which opens the file in "rb" mode:
https://github.com/keras-team/keras/blob/master/keras/src/saving/saving_lib.py#L151
keras 3.2.1

with open(filepath, "rb") as f:
    return _load_model_from_fileobj(
        f, custom_objects, compile, safe_mode
    )

vs keras 3.0.4 (current version) https://github.com/keras-team/keras/blob/v3.0.4/keras/saving/saving_lib.py#L139

with file_utils.File(filepath, mode="r+b") as gfile_handle, zipfile.ZipFile(
    gfile_handle, "r"
) as zf:
    with zf.open(_CONFIG_FILENAME, "r") as f:
        config_json = f.read()

Change #1021883 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] logo-detection: bump keras to 3.2.1

https://gerrit.wikimedia.org/r/1021883

isarantopoulos mentioned this in rMLIS6834a49bfbb3: logo-detection: bump keras to 3.2.1. · Fri, Apr 19, 10:34 AM

Upgrading to keras==3.2.1 resolved the above issue. Nice catch @klausman!
Now, for requests to work, we need to give the model server connectivity to the Commons upload stash so that it can download the images.

Thank you for your help in troubleshooting and resolving the permissions issue, @klausman and @isarantopoulos!

@mfossati, when a model-server is deployed within the WMF k8s infrastructure it has to be configured to enable it to access external resources like wikimedia, wikipedia, and wikidata (see details here). Is it possible for the Structured content team to provide sample URLs from the commons upload stash? This will enable us to configure the logo-detection model-server to access them from LiftWing. Thanks in advance.

Change #1021908 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):

[operations/deployment-charts@master] ml-services: update keras version in logo detection

https://gerrit.wikimedia.org/r/1021908

Change #1021399 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] logo-detection: upgrade bullseye to bookworm

https://gerrit.wikimedia.org/r/1021399

Change #1021399 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] logo-detection: upgrade bullseye to bookworm

https://gerrit.wikimedia.org/r/1021399

kevinbazira mentioned this in rMLISc7f820b95ebd: logo-detection: upgrade bullseye to bookworm. · Fri, Apr 19, 12:46 PM

Change #1021401 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: upgrade OS in logo-detection

https://gerrit.wikimedia.org/r/1021401

Change #1021402 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: upgrade OS in logo-detection

https://gerrit.wikimedia.org/r/1021402

Change #1021908 merged by Ilias Sarantopoulos:

[operations/deployment-charts@master] ml-services: update keras version in logo detection

https://gerrit.wikimedia.org/r/1021908

Change #1021401 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: upgrade OS in logo-detection

https://gerrit.wikimedia.org/r/1021401

In T362749#9729294, @kevinbazira wrote:

@mfossati, when a model-server is deployed within the WMF k8s infrastructure it has to be configured to enable it to access external resources like wikimedia, wikipedia, and wikidata (see details here). Is it possible for the Structured content team to provide sample URLs from the commons upload stash? This will enable us to configure the logo-detection model-server to access them from LiftWing. Thanks in advance.

Hey @kevinbazira, here's what a public stash URL would look like: https://commons.wikimedia.org/wiki/Special:UploadStash/file/1avpfxdmdb4c.deuia.10893556.png. The only variable would be the file key, i.e., 1avpfxdmdb4c.deuia.10893556.png.
Not 100% sure, but I guess that you can go for http://localhost:6500/wiki/Special:UploadStash/file/1avpfxdmdb4c.deuia.10893556.png, with commons.wikimedia.org as the host header.
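
The suggestion above can be sketched in Python as follows. This is only an illustration: the endpoint http://localhost:6500 and the Host header value come from the comment (which is itself a guess), build_stash_request is a hypothetical helper name, and the request is built but not sent:

```python
from urllib.parse import quote
from urllib.request import Request


def build_stash_request(
    file_key: str,
    endpoint: str = "http://localhost:6500",   # assumed local endpoint
    host: str = "commons.wikimedia.org",       # assumed Host header value
) -> Request:
    """Build (but do not send) a GET request for an Upload Stash file."""
    # The file key is the only variable part of the URL; percent-encode it
    # so that unexpected characters cannot alter the path.
    url = f"{endpoint}/wiki/Special:UploadStash/file/{quote(file_key, safe='')}"
    return Request(url, headers={"Host": host})


if __name__ == "__main__":
    req = build_stash_request("1avpfxdmdb4c.deuia.10893556.png")
    print(req.full_url)
    print(req.get_header("Host"))
```

Passing the returned object to urllib.request.urlopen would send the request with the overridden Host header; whether the model server is allowed to reach that endpoint depends on the configuration discussed in this thread.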

mfossati awarded a token. · Thu, Apr 25, 9:05 AM

kevinbazira mentioned this in T363449: Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons. · Thu, Apr 25, 9:21 AM

Thank you for sharing an example of the public stash URL, @mfossati! In T363449, we are going to configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons.
