Configure the logo-detection model-server hosted on LiftWing to process images from Wikimedia Commons
Closed, Resolved (Public)

Description

When a model-server is deployed within the WMF k8s infrastructure, it has to be explicitly configured before it can access external resources such as wikimedia.org, wikipedia.org, and wikidata.org. In T362749, we deployed the logo-detection model-server to LiftWing staging, and the Structured Content team provided sample URLs from the Commons UploadStash. In this task we are going to:

  • restrict image processing to trusted domains that host Wikimedia Commons images.
  • configure the model-server to access Commons images from LiftWing.
NOTE: Both the Structured Content team and the ML team agreed to use an internal endpoint to prevent the isvc from processing images from untrusted sources. The model-server now processes base64 encoded image objects sent by the Wikimedia Commons MediaDetection API.

Event Timeline

Change #1023542 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] logo-detection: restrict image processing to trusted domains

https://gerrit.wikimedia.org/r/1023542

A restriction has been added to the model-server so that it only processes images from trusted domains that host Wikimedia Commons images, as shown in the requests below:

1. Request A: passes because it uses a trusted domain

$ curl -s localhost:8080/v1/models/logo-detection:predict -X POST -d '{"instances": [ { "filename": "Cambia_logo.png", "url": "http://commons.wikimedia.org/w/index.php?title=Special:FilePath&file=Cambia_logo.png&width=224", "target": "logo" } ] }' -i --header "Content-type: application/json"
HTTP/1.1 200 OK
date: Fri, 26 Apr 2024 08:24:54 GMT
server: uvicorn
content-length: 101
content-type: application/json

{"predictions":[{"filename":"Cambia_logo.png","target":"logo","prediction":1.0,"out_of_domain":0.0}]}

2. Request B: fails because it uses an untrusted domain

$ curl -s localhost:8080/v1/models/logo-detection:predict -X POST -d '{"instances": [ { "filename": "Cambia_logo.png", "url": "https://phab.wmfusercontent.org/file/data/mb6wynlvf3bdfw5e443f/PHID-FILE-wc27fvtkl6yv4rjdlqzn/Cambia_logo.png", "target": "logo" } ] }' -i --header "Content-type: application/json"
HTTP/1.1 500 Internal Server Error
date: Fri, 26 Apr 2024 08:25:16 GMT
server: uvicorn
content-length: 142
content-type: application/json

{"error":"Requests to phab.wmfusercontent.org are not allowed.                         Only images from the Wikimedia Commons are processed."}
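The check behind the two requests above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the `TRUSTED_HOSTS` set and the `validate_image_url` helper are hypothetical names, and the allow-list contains only the one host that the requests above show as trusted. The error text mirrors the response in Request B.

```python
from urllib.parse import urlparse

# Hypothetical allow-list: only the host shown as trusted in Request A.
# The real list used by the model-server is not shown in this task.
TRUSTED_HOSTS = {"commons.wikimedia.org"}


def validate_image_url(url: str) -> str:
    """Return the URL unchanged if its host is trusted, else raise ValueError."""
    parsed = urlparse(url)
    # urlparse lowercases the hostname and strips any port or credentials.
    host = parsed.hostname or ""
    if parsed.scheme not in {"http", "https"} or host not in TRUSTED_HOSTS:
        raise ValueError(
            f"Requests to {host} are not allowed. "
            "Only images from the Wikimedia Commons are processed."
        )
    return url
```

Comparing the parsed hostname exactly, rather than searching the URL string for a substring, is what keeps look-alike hosts such as `commons.wikimedia.org.example.org` from passing.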

Hi @elukey, following the recent switch from api-ro to mw-api-int-ro in T362316, suppose we wanted to enable the logo-detection model-server hosted on LiftWing to access the external URL below:

http://commons.wikimedia.org/w/index.php?title=Special:FilePath&file=Cambia_logo.png&width=224

Please confirm whether the following URL would be the correct k8s internal endpoint:

http://mw-api-int-ro.discovery.wmnet:4680/w/index.php?title=Special:FilePath&file=Cambia_logo.png&width=224

when used with the commons.wikimedia.org host header. Thanks!

Hi Kevin! You have two options. The first is to use the new "transparent proxy" config, calling http://commons.wikimedia.org/w/index.php?title=Special:FilePath&file=Cambia_logo.png&width=224 directly without any special proxy/port set. Note the http protocol rather than https, for the reasons explained during the last presentation. The second is the option that you proposed, but I'd suggest trying the new way first since it should require less config.

A compromise could be: we define a special configuration (env variable etc.) for the protocol://endpoint:port part (like http://commons.wikimedia.org or http://mw-api-int-ro.discovery.wmnet:4680), which can be varied depending on what we want to use/test. Lemme know what you think about it!
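A rough sketch of that compromise, assuming two hypothetical environment variables (`COMMONS_BASE_URL` and `COMMONS_HOST_HEADER`, neither of which is defined anywhere in this task) and a hypothetical helper `build_file_request`. It only builds the URL and headers, so it can be tried without network access:

```python
import os
from urllib.parse import urlencode


def build_file_request(filename, width=224, env=os.environ):
    """Build the Special:FilePath URL and headers from configurable parts."""
    # Hypothetical variable: the protocol://endpoint:port part. The default is
    # the transparent proxy style; it could instead be set to
    # http://mw-api-int-ro.discovery.wmnet:4680 for the internal endpoint.
    base = env.get("COMMONS_BASE_URL", "http://commons.wikimedia.org").rstrip("/")
    # Hypothetical variable: the Host header sent with the request.
    host_header = env.get("COMMONS_HOST_HEADER", "commons.wikimedia.org")
    query = urlencode({"title": "Special:FilePath", "file": filename, "width": width})
    return f"{base}/w/index.php?{query}", {"Host": host_header}
```

Note that `urlencode` percent-encodes the colon in `Special:FilePath` as `%3A`, which is an equivalent encoding of the same query value. Switching between the two options then only requires changing the environment, not the code.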

Thank you for the confirmation, @elukey! Since we are working towards implementing the new transparent proxy config for all isvcs, the logo-detection isvc can start using it straightaway and keep the mw-api-int-ro config as a contingency.

I looked around for an example of how the transparent proxy config is to be implemented and found T353622#9419330, but when I looked at the recent config file in https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/admin_ng/values/ml-staging-codfw/values.yaml
these configurations had been removed. To which file(s) should the eventgate configurations for virtual service (routing), service entry, and destination rule (post-routing) be added?

Hi Kevin! So https://github.com/wikimedia/operations-deployment-charts/blob/master/helmfile.d/admin_ng/values/ml-serve.yaml#L340 is the place to add the new config; I'd say adding commons.wikimedia.org should suffice. The endpoint is served by MediaWiki Appservers, so in my opinion we can just expand the list of available/allowed Host headers safely.

@klausman leaving the decision to you :) You can file a patch anytime; today we'll roll out the transparent proxy changes for eqiad and then we'll be ok to proceed with the new commons host header. Before proceeding I'd suggest checking whether calls to the MW API accept the commons host header and URI paths. I don't think any rewrite is happening in upper layers, but better safe than sorry!

I think that is the right approach. Commons being visible to everything else using that entry is fine, and so we should proceed.

Change #1027484 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] admin_ng: add commons host header

https://gerrit.wikimedia.org/r/1027484

Change #1027484 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: add commons host header

https://gerrit.wikimedia.org/r/1027484

Change #1028937 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] logo-detection: use cookie to access stash images

https://gerrit.wikimedia.org/r/1028937

Work on this task has been paused for now while the ML team, Structured Content team, and Data Persistence team discuss the best way to access images from the Wikimedia Commons UploadStash without compromising privacy, reliability, or security.

In T363506#9794170, it was agreed that the logo-detection model-server should process base64 encoded image objects instead of image URLs to avoid compromising privacy, reliability, and security. The model-server has been updated and deployed on LiftWing staging. Below are the results of a test request we made using base64 images in the payload: https://phabricator.wikimedia.org/P62581
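For readers unfamiliar with the approach, a base64 payload can be built and decoded along these lines. This is only a sketch under assumptions: the actual request schema is in the paste linked above and is not reproduced here, so the `image` field name and the helper names are hypothetical, and reusing `filename` and `target` from the earlier URL-based requests is also an assumption.

```python
import base64
import binascii


def encode_image(raw: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string for a JSON payload."""
    return base64.b64encode(raw).decode("ascii")


def decode_image(b64: str) -> bytes:
    """Decode a base64 string back to bytes, rejecting malformed input."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        return base64.b64decode(b64, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("image is not valid base64") from exc


def build_payload(filename: str, raw: bytes, target: str = "logo") -> dict:
    # Hypothetical schema: "image" is an assumed field name.
    return {
        "instances": [
            {"filename": filename, "image": encode_image(raw), "target": target}
        ]
    }
```

With this shape the model-server never fetches a URL itself, which is the property the teams agreed on: the caller decides which image bytes are sent.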

Change #1028937 abandoned by Kevin Bazira:

[machinelearning/liftwing/inference-services@main] logo-detection: use cookie to access stash images

Reason:

The model-server now processes base64 encoded image objects sent by the Wikimedia Commons MediaDetection API:
https://phabricator.wikimedia.org/T363449#9841426

https://gerrit.wikimedia.org/r/1028937

Change #1023542 abandoned by Kevin Bazira:

[machinelearning/liftwing/inference-services@main] logo-detection: restrict image processing to trusted domains

Reason:

Both the Structured Content team and the ML team agreed to use an internal endpoint to prevent the isvc from processing images from untrusted sources.
https://phabricator.wikimedia.org/T370757#10027062

https://gerrit.wikimedia.org/r/1023542

kevinbazira updated the task description.
kevinbazira moved this task from In Progress to 2024-2025 Q2 Done on the Machine-Learning-Team board.