In this task we are going to deploy the logo-detection model-server, recently published to the Wikimedia docker registry (T362598), to LiftWing staging for user acceptance testing by the Structured Content team. The team will test functionality using Upload Stash URLs, identify and report any edge cases encountered, and collaborate with the ML team to resolve these issues so that the logo-detection inference service is ready for production use.
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T349641 [EPIC] Logo machine detection on Commons
In Progress | | kevinbazira | T358676 Host a logo detection model for Commons images
Open | | kevinbazira | T362749 Deploy logo-detection model-server to LiftWing staging
Open | | kevinbazira | T363449 Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons
Event Timeline
Change #1020706 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental namespace
Change #1020710 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[machinelearning/liftwing/inference-services@main] logo-detection: specify model name
Change #1020710 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] logo-detection: specify model name
Change #1020706 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental namespace
I have deployed the logo-detection model-server in the experimental namespace on LiftWing staging. On checking the pod, I noticed it was not starting successfully due to a CrashLoopBackOff error:
$ kubectl get pods
. . .
NAME                                                         READY   STATUS             RESTARTS      AGE
logo-detection-predictor-00001-deployment-7f98bb54f7-hpnf4   1/3     CrashLoopBackOff   5 (36s ago)   4m31s
. . .
I have reviewed the logs and found that the CrashLoopBackOff error is occurring because the model-server lacks the necessary permissions to load the logo-detection model:
$ kubectl describe pod logo-detection-predictor-00001-deployment-7f98bb54f7-hpnf4
. . .
Traceback (most recent call last):
  File "/srv/logo_detection/model_server/model.py", line 210, in <module>
    model = LogoDetectionModel(model_name)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/srv/logo_detection/model_server/model.py", line 33, in __init__
    self.model = self.load()
                 ^^^^^^^^^^^
  File "/srv/logo_detection/model_server/model.py", line 36, in load
    model = keras.models.load_model(self.model_path)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/keras/src/saving/saving_api.py", line 176, in load_model
    return saving_lib.load_model(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/keras/src/saving/saving_lib.py", line 139, in load_model
    with file_utils.File(filepath, mode="r+b") as gfile_handle, zipfile.ZipFile(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/lib/python/site-packages/keras/src/utils/file_utils.py", line 436, in File
    return open(path, mode=mode)
           ^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/mnt/models/logo_max_all.keras'
. . .
I am going to reach out to SREs to assist in resolving this permission issue.
I've fetched the image locally and run a shell in it like so:
docker run -it --entrypoint /bin/bash docker-registry.wikimedia.org/wikimedia/machinelearning-liftwing-inference-services-logo-detection:stable
I wanted to check the permissions on /mnt/models/, and it turns out, the directory doesn't exist (/mnt does). So my guess is that the Blubberfile needs to be adjusted to create that directory (with the right permissions).
Change #1021397 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[machinelearning/liftwing/inference-services@main] logo-detection: downgrade bookworm to bullseye
Change #1021397 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] logo-detection: downgrade bookworm to bullseye
Change #1021398 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental ns
Change #1021398 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: add logo-detection isvc to experimental ns
The directory is being created by the storage-initializer container when the pod initializes.
I attached a shell to the running pod and checked the permissions on the file, and they seem OK:
-rw-r--r-- 1 nobody daemon 71333909 Apr 19 09:10 logo_max_all.keras
For comparison, these are the permissions of the revertrisk-wikidata model:
-rw-r--r-- 1 nobody daemon 1019107337 Apr 12 15:54 model.pkl
The same holds for the directories when I execute
ls -ld /mnt /mnt/models
drwxr-xr-x 1 root root   4096 Apr 12 15:54 /mnt
drwxrwsrwx 2 root daemon 4096 Apr 12 15:54 /mnt/models
drwxr-xr-x 1 root root   4096 Apr 19 09:10 /mnt
drwxrwsrwx 2 root daemon 4096 Apr 19 09:10 /mnt/models
This is a bug in keras: it tries to open the file with mode r+b (read and write, binary), but since the file is owned by another user (nobody vs. somebody) and is writable only by its owner, the call fails. Why keras would need write access to the file, I don't know.
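The failure mode described above can be reproduced outside the pod. The following is a minimal sketch, not the code keras runs: it simulates the "readable but not writable" situation by stripping the write bits from a temporary file (creating a file owned by another user would need root), and the helper names `can_open` and `make_read_only_file` are made up for illustration.

```python
import os
import stat
import tempfile


def can_open(path: str, mode: str) -> bool:
    """Return True if open(path, mode) succeeds, False on PermissionError."""
    try:
        with open(path, mode):
            return True
    except PermissionError:
        return False


def make_read_only_file() -> str:
    """Create a temporary file whose permission bits allow reading only."""
    fd, path = tempfile.mkstemp(suffix=".keras")
    with os.fdopen(fd, "wb") as handle:
        handle.write(b"placeholder bytes, not a real model")
    # r--r--r--: nobody may write, which mimics what a non-owner sees
    # on a -rw-r--r-- file.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path


if __name__ == "__main__":
    path = make_read_only_file()
    try:
        print("rb :", can_open(path, "rb"))
        # Fails for a regular user; root bypasses permission checks.
        print("r+b:", can_open(path, "r+b"))
    finally:
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
        os.remove(path)
```

Note that a process running as root would be able to open the file in either mode, so the difference only shows up for an unprivileged user, as in the pod.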
Change #1021883 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] logo-detection: bump keras to 3.2.1
Filed a patch to test the latest keras version (3.2.1), which opens the file in "rb" mode:
https://github.com/keras-team/keras/blob/master/keras/src/saving/saving_lib.py#L151
keras 3.2.1
with open(filepath, "rb") as f:
    return _load_model_from_fileobj(
        f, custom_objects, compile, safe_mode
    )
vs keras 3.0.4 (current version) https://github.com/keras-team/keras/blob/v3.0.4/keras/saving/saving_lib.py#L139
with file_utils.File(filepath, mode="r+b") as gfile_handle, zipfile.ZipFile(
    gfile_handle, "r"
) as zf:
    with zf.open(_CONFIG_FILENAME, "r") as f:
        config_json = f.read()
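To illustrate why a plain "rb" handle is enough for this step, here is a small sketch using only the standard library. It assumes that a `.keras` file is a zip archive containing a `config.json` member, which is what both snippets above read; the archive built here is a stand-in and not a real model, and the function names are hypothetical.

```python
import json
import os
import stat
import tempfile
import zipfile


def write_read_only_archive(config: dict) -> str:
    """Write a zip archive with a config.json member, then make it read-only."""
    fd, path = tempfile.mkstemp(suffix=".keras")
    os.close(fd)
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("config.json", json.dumps(config))
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # r--r--r--
    return path


def read_config(path: str) -> dict:
    """Read config.json through a read-only binary handle."""
    with open(path, "rb") as handle, zipfile.ZipFile(handle, "r") as zf:
        with zf.open("config.json", "r") as f:
            return json.loads(f.read())


if __name__ == "__main__":
    path = write_read_only_archive({"class_name": "Sequential"})
    try:
        print(read_config(path))
    finally:
        os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
        os.remove(path)
```

Reading the archive never writes to it, so no write permission on the file is needed.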
Change #1021883 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] logo-detection: bump keras to 3.2.1
Upgrading to keras==3.2.1 resolved the above issue. Nice catch @klausman!
Now in order for requests to work we'd need to give the model server connectivity to the commons upload stash so that the model server can download the images.
Thank you for your help in troubleshooting and resolving the permissions issue, @klausman and @isarantopoulos!
@mfossati, when a model-server is deployed within the WMF k8s infrastructure it has to be configured to enable it to access external resources like wikimedia, wikipedia, and wikidata (see details here). Is it possible for the Structured content team to provide sample URLs from the commons upload stash? This will enable us to configure the logo-detection model-server to access them from LiftWing. Thanks in advance.
Change #1021908 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/deployment-charts@master] ml-services: update keras version in logo detection
Change #1021399 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[machinelearning/liftwing/inference-services@main] logo-detection: upgrade bullseye to bookworm
Change #1021399 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] logo-detection: upgrade bullseye to bookworm
Change #1021401 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[operations/deployment-charts@master] ml-services: upgrade OS in logo-detection
Change #1021402 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[operations/deployment-charts@master] ml-services: upgrade OS in logo-detection
Change #1021908 merged by Ilias Sarantopoulos:
[operations/deployment-charts@master] ml-services: update keras version in logo detection
Change #1021401 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: upgrade OS in logo-detection
Hey @kevinbazira, here's what a public stash URL would look like: https://commons.wikimedia.org/wiki/Special:UploadStash/file/1avpfxdmdb4c.deuia.10893556.png. The only variable would be the file key, i.e., 1avpfxdmdb4c.deuia.10893556.png.
Not 100% sure, but I guess that you can go for http://localhost:6500/wiki/Special:UploadStash/file/1avpfxdmdb4c.deuia.10893556.png, with commons.wikimedia.org as the host header.
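A rough sketch of what such a request could look like from Python, using only the standard library. The endpoint `http://localhost:6500` and the Host header value are taken from the guess above and are unconfirmed; the function name `build_stash_request` is made up for illustration. The request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request

# Path prefix taken from the public stash URL quoted in this thread.
STASH_PATH = "/wiki/Special:UploadStash/file/"


def build_stash_request(
    file_key: str,
    endpoint: str = "http://localhost:6500",  # assumption: local proxy endpoint
    host: str = "commons.wikimedia.org",  # assumption: value for the Host header
) -> urllib.request.Request:
    """Build (but do not send) a GET request for a stashed file."""
    url = f"{endpoint}{STASH_PATH}{urllib.parse.quote(file_key)}"
    return urllib.request.Request(url, headers={"Host": host})


if __name__ == "__main__":
    req = build_stash_request("1avpfxdmdb4c.deuia.10893556.png")
    print(req.get_method(), req.full_url)
    print("Host:", req.get_header("Host"))
```

Sending it would be a matter of passing the request to `urllib.request.urlopen`, which will only succeed from an environment where that endpoint actually exists.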