
Host a logo detection model for Commons images
Open, In Progress, Needs Triage | Public | 2 Estimated Story Points

Description

From https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Hosting_a_model:

  • What use case is the model going to support/resolve?

Detection of potentially copyrighted image uploads on Commons. T340546: [XL] Analysis of deletion requests on Commons highlighted that logos account for a significant chunk of files that undergo a deletion request, then get deleted from Commons. See also T340546#9583762.

Not yet.

  • What team created/trained/etc.. the model?

Structured Content, main developer @mfossati.

  • What tools and frameworks have you used?

Keras 3.0.4
TensorFlow 2.15.0 backend
KerasCV 0.8.1
EfficientNet V2 backbone, B0 variant pre-trained on ImageNet. See https://keras.io/api/keras_cv/models/#backbone-presets

  • What kind of data was the model trained with, and what kind of data the model is going to need in production (for example, calls to internal/external services, special datasources for features, etc..) ?

Commons images. The model will require an image file and will output whether the given file is a logo or not.

  • If you have a minimal codebase that you used to run the first tests with the model, could you please share it?

Training;
evaluation;
demo classification.

  • State what team will own the model and please share some main point of contacts (see more info in Ownership of a model).

Structured Content, main point of contact @mfossati, tech lead @Cparle.

  • What is the current latency and throughput of the model, if you have tested it? We don't need anything precise at this stage, just some ballpark numbers to figure out how the model performs with the expected inputs. For example, does the model take ms/seconds/etc. to respond to queries? How does it react when 1/10/20/etc. requests are made in parallel? If you don't have these numbers, don't worry: open the task and we'll figure something out while we discuss next steps!

Not there yet.
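For gathering the ballpark numbers asked about above, a minimal sketch along these lines could be a starting point. It times a callable at 1, 10 and 20 parallel callers. `fake_predict`, `timed_call` and `benchmark` are hypothetical names, and the 10 ms sleep stands in for a real request to the model server, which does not exist yet at this point in the task.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fake_predict(_payload):
    """Hypothetical stand-in for a call to the model server."""
    time.sleep(0.01)  # pretend that inference takes about 10 ms
    return {"prediction": 0.5}


def timed_call(predict, payload):
    """Return the wall-clock seconds that a single call takes."""
    start = time.perf_counter()
    predict(payload)
    return time.perf_counter() - start


def benchmark(predict, payload, parallelism, calls=20):
    """Measure mean latency and throughput at one parallelism level."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        latencies = list(
            pool.map(lambda _: timed_call(predict, payload), range(calls))
        )
    elapsed = time.perf_counter() - started
    return {
        "parallelism": parallelism,
        "mean_latency_s": mean(latencies),
        "throughput_rps": calls / elapsed,
    }


# One result per parallelism level mentioned in the question.
results = [benchmark(fake_predict, {}, n) for n in (1, 10, 20)]
```

Swapping `fake_predict` for a function that actually queries the model would give real figures; the numbers produced by the stub only reflect the artificial sleep.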

  • Is there an expected frequency in which the model will have to be retrained with new data?

To be discussed. Note that current performance is already high, which may lower the priority of re-training. See T352748: [SPIKE] Image classifier prototype.

  • What are the resources required to train the model and what was the dataset size?

Training on CPUs with 36 threads was reasonable: I haven't measured the training time exactly, but it took a few hours.
Scaling it up to GPUs would be nice to have.
The dataset size is 1.1 GB with 23,325 training samples.

  • Have you checked if the output of your model is safe from a human rights point of view? Is there any risk of it being offensive for somebody? Even if you have any slight worry or corner case, please tell us!

The output is a confidence score of an image being a logo, so I think it's safe. As a side note, we discussed something similar with Legal as part of T350020: Access request to deleted image files in the production Swift cluster.

Details

Other Assignee
mfossati
Title | Reference | Author | Source Branch | Dest Branch
lw_prototype: add LogoDetectionModel class | mfossati/scriptz!9 | kevinbazira | lw_prototype_LogoDetectionModel_class | main
lw_prototype: image download error handling | mfossati/scriptz!8 | kevinbazira | lw_prototype_image_download_error_handling | main
lw_prototype: validate input data | mfossati/scriptz!7 | kevinbazira | lw_prototype_validate_input_data | main
Improve functionality | mfossati/scriptz!6 | mfossati | T358676 | main

Event Timeline

calbon set the point value for this task to 2.

Thank you for providing details about the logo detection project, @mfossati! The ML team is excited to explore hosting it on LiftWing.

We have reviewed the demo you provided and would like to ask a few questions to help us effectively build, host, and provide an API to query the logo detection model server:

1. API input and preprocessing:
In the demo, the model accepts an image directory as input and preprocesses the images within specific subdirectories like 'logo' and 'out_of_domain' (see directory structure in screenshot below).

logo-detection-image-dir-structure.png (234×242 px, 12 KB)

Could you please clarify how the images will be sent to the API? Will they be sent as one image or multiple images? Will you provide image links or serialized image objects? Additionally, what is the expected size of the images that will be sent?

2. API output:
The demo currently visualizes prediction results in a plotted grid as shown in the screenshot below.

logo-detection-prediction.png (270×828 px, 242 KB)

LiftWing typically returns API responses as JSON objects (see example). Could you please specify the expected response format from the API?


Cool cool @kevinbazira , thanks for picking this up!
To give you more context, the interaction with the model will happen within Commons Upload Wizard, most likely at its Upload step.

Could you please clarify how the images will be sent to the API? Will they be sent as one image or multiple images?

Multiple images. Users can upload multiple images, but typically it's only one.

Will you provide image links or serialized image objects?

Serialized image objects. More specifically, the Upload Wizard currently uses chunked uploading.

Additionally, what is the expected size of the images that will be sent?

The image size is arbitrary: it depends on what users upload through Upload Wizard. The current upper bound is 5.37 GB, but this is usually for video files.
In any case, the model requires inputs of size 224 x 224 pixels, and the pre-processing step will take care of rescaling to that size.
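As a rough illustration of the rescaling step mentioned above, the sketch below resizes an image, represented as a nested list of pixel values, to 224 x 224 with nearest-neighbour sampling. The function name, the nested-list representation, and the interpolation method are assumptions made for brevity; the actual pre-processing presumably relies on Keras/TensorFlow utilities and may interpolate differently.

```python
MODEL_INPUT_SIZE = 224  # input side length required by the model, per this thread


def resize_nearest(pixels, size=MODEL_INPUT_SIZE):
    """Resize a 2D grid of pixels to size x size with nearest-neighbour sampling.

    `pixels` is a list of rows; each row is a list of pixel values
    (for example RGB tuples). Hypothetical helper for illustration only.
    """
    src_height = len(pixels)
    src_width = len(pixels[0])
    return [
        [
            # Map each target coordinate back to its source coordinate.
            pixels[(y * src_height) // size][(x * src_width) // size]
            for x in range(size)
        ]
        for y in range(size)
    ]


# A tiny 2 x 2 "image" with one distinct value per pixel.
tiny_image = [["a", "b"], ["c", "d"]]
resized = resize_nearest(tiny_image)
```

Whatever the input dimensions are, the output always has the fixed shape that the model expects, which is why arbitrary upload sizes are not a concern for the model itself.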

LiftWing typically returns API responses as JSON objects (see example). Could you please specify the expected response format from the API?

The essentials are predictions with their probability scores. I think that a JSON array of per-file objects is a good fit:

[
    {
        "filename": "my_uploaded_file.jpg",
        "target": "logo",
        "prediction": 0.999,
        "out_of_domain": 0.001
    },

    ...

]

In the example you provided, I also see the latency key, which looks like a nice to have.
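A minimal sketch of how a server could assemble the response proposed above is given here. `format_predictions` is a hypothetical helper, and it assumes that the model's scores are already available as a pair of probabilities per file (target class first, out-of-domain second); how those scores are computed is outside the sketch.

```python
import json


def format_predictions(scores, target="logo"):
    """Build the per-file response objects.

    `scores` maps a filename to a pair of probabilities:
    (target_probability, out_of_domain_probability).
    """
    return [
        {
            "filename": filename,
            "target": target,
            "prediction": round(p_target, 3),
            "out_of_domain": round(p_out_of_domain, 3),
        }
        for filename, (p_target, p_out_of_domain) in scores.items()
    ]


# Serialize the array of per-file objects as the JSON response body.
response_body = json.dumps(
    format_predictions({"my_uploaded_file.jpg": (0.999, 0.001)}),
    indent=4,
)
```

An optional latency key could be added at the point where the body is serialized, without changing the per-file objects.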

Thank you for providing more context @mfossati. I shared this information with the team, and they have a few more questions to clarify the implementation details for the logo detection API:

Could you please clarify how the images will be sent to the API? Will they be sent as one image or multiple images?

Multiple images. Users can upload multiple images, but typically it's only one.

To prevent potential DoS vulnerabilities, we need to establish a limit on the number of images that can be sent to the API in a single request. Currently, the Upload Wizard restricts uploads to 50 files. Would you like us to maintain this limit for the API as well?

To give you more context, the interaction with the model will happen within Commons Upload Wizard, most likely at its Upload step.

Will you provide image links or serialized image objects?

Serialized image objects. More specifically, the Upload Wizard currently uses chunked uploading.

The Upload Wizard documentation warns against processing images from the Upload Stash during the upload step due to potential security risks. Will you implement security checks for serialized image objects before sending them to the API?

LiftWing typically returns API responses as JSON objects (see example). Could you please specify the expected response format from the API?

The essentials are predictions with their probability scores. I think that a JSON array of per-file objects is a good fit.
...
In the example you provided, I also see the latency key, which looks like a nice to have.

Great. Could you please provide a sample API input that specifies the expected parameters and the encoding format for serialized images?

To prevent potential DoS vulnerabilities, we need to establish a limit on the number of images that can be sent to the API in a single request. Currently, the Upload Wizard restricts uploads to 50 files. Would you like us to maintain this limit for the API as well?

Yes, please.

Will you provide image links or serialized image objects?

Serialized image objects. More specifically, the Upload Wizard currently uses chunked uploading.

The Upload Wizard documentation warns against processing images from the Upload Stash during the upload step due to potential security risks. Will you implement security checks for serialized image objects before sending them to the API?

I spoke with my team and we think that consuming the stash URL doesn't pose security risks. As a result, instead of sending image objects to the LiftWing API, we'll send URLs.

Great. Could you please provide a sample API input that specifies the expected parameters and the encoding format for serialized images?

I think we can send a POST request with the following JSON body:

[
    {
        "filename": "my_uploaded_file.jpg",
        "url": "https://commons.wikimedia.org/wiki/Special:UploadStash/file/my_stash_filekey.png",
        "target": "logo"
    },

    ...

]

Please let me know if you prefer form-encoded data instead.
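For illustration, a client could build the POST request described above roughly as follows. The endpoint URL is a placeholder, not a real LiftWing address, and `build_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Placeholder endpoint: the real address is not specified in this thread.
ENDPOINT = "https://example.org/logo-detection"


def build_request(items, endpoint=ENDPOINT):
    """Wrap the list of per-file objects in a JSON POST request."""
    body = json.dumps(items).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


items = [
    {
        "filename": "my_uploaded_file.jpg",
        "url": "https://commons.wikimedia.org/wiki/Special:UploadStash/file/my_stash_filekey.png",
        "target": "logo",
    }
]
request = build_request(items)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out because the endpoint above is only a placeholder.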

Thank you for sharing this information, @mfossati. Based on the requirements you've shared so far, we have worked on a first pass of the prototype that takes the input JSON you specified, preprocesses it similarly to the way you did in the demo, and returns the output JSON you specified (see P58917#237712). Please test it and let us know whether we've captured the key requirements correctly before we proceed to work on input validation and sanitization, image limits, error handling, etc.
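The input validation and image limit mentioned above could look something like the sketch below. It is not the prototype's actual code: `validate_request` is a hypothetical helper, the 50-image cap is the limit agreed earlier in this thread, and the https-only rule for URLs is an added assumption.

```python
from urllib.parse import urlparse

MAX_IMAGES_PER_REQUEST = 50  # limit agreed in this thread, mirroring Upload Wizard
REQUIRED_KEYS = ("filename", "url", "target")


def validate_request(items):
    """Return a list of error messages; an empty list means the body is valid."""
    if not isinstance(items, list) or not items:
        return ["body must be a non-empty JSON array"]
    if len(items) > MAX_IMAGES_PER_REQUEST:
        return [
            f"at most {MAX_IMAGES_PER_REQUEST} images per request, got {len(items)}"
        ]
    errors = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append(f"item {index}: expected a JSON object")
            continue
        for key in REQUIRED_KEYS:
            if not isinstance(item.get(key), str) or not item[key]:
                errors.append(f"item {index}: '{key}' must be a non-empty string")
        url = item.get("url")
        # Assumption: only https URLs are accepted.
        if isinstance(url, str) and url and urlparse(url).scheme != "https":
            errors.append(f"item {index}: 'url' must use https")
    return errors
```

Returning every problem found, rather than stopping at the first one, lets a caller fix a malformed request in a single round trip.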

Hey @kevinbazira , I went through P58917, took the liberty of versioning it, and added some changes. Please have a look at https://gitlab.wikimedia.org/mfossati/scriptz/-/merge_requests/6.
Key changes involve:

  • JSON output
  • input dataset, which shouldn't have labels
  • label mode, which shouldn't be binary, since the model is actually multiclass

@kevinbazira: I'm hitting this ignored exception when running the code:

Exception ignored in: <function AtomicFunction.__del__ at 0x157402660>
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/cnn/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/atomic_function.py", line 291, in __del__
TypeError: 'NoneType' object is not subscriptable

Not harmful, but worth a check: are you hitting that, too?

Thank you for versioning the liftwing_prototype and making changes @mfossati! I tested the changes locally and got results that I shared in P58917#237822. Please have a look whenever you get a minute.


I am not able to reproduce this issue. I've tried running the prototype in two environments:

  1. Google Colab with Python 3.10.12, keras==3.0.4, keras-cv==0.8.1, and tensorflow==2.15.0.
  2. local venv with Python 3.9.2, keras==3.0.4, keras-cv==0.8.1, and tensorflow==2.16.1.

Could this be caused by an incompatibility with the Python version shown in the error message (python3.11)?

Thank you for versioning the liftwing_prototype and making changes @mfossati! I tested the changes locally and got results that I shared in P58917#237822. Please have a look whenever you get a minute.

See answer in P58917#237851.


Interesting, looks like I'm hitting it with both Python 3.10.12 and 3.12.2 in a mamba environment, on both a Linux and a Mac machine. I'm also hitting it with 3.11.5 in a venv environment on a Mac machine.
Not a blocker, but I suggest resolving this before going to production.

mfossati changed the task status from Open to In Progress. Wed, Mar 27, 6:57 PM
mfossati moved this task from Incoming to Doing on the Structured-Data-Backlog (Current Work) board.
mfossati updated Other Assignee, added: mfossati.

Hi @mfossati! Thanks a lot for all this great work!
I was wondering if you had tried to train the same model using PyTorch as a Keras backend instead of TensorFlow. The reason I'm asking is totally unrelated to the model itself, but has to do with the technical challenges of maintaining multiple images and backends. There is ongoing work on our side to provide better support for PyTorch (related task).
This is more of a question, so that we can provide better support, than a request from our side, as we'd be supporting Keras/TensorFlow models as well.

I was wondering if you had tried to train the same model using PyTorch as a Keras backend instead of TensorFlow.

Hey @isarantopoulos: no, I haven't.

The prototype looks good to me, I'm excited to see this effort move to the next level!
@kevinbazira, I've especially appreciated the tightness of our development iterations 😄 .

Thanks @mfossati! <3
It's great to hear you're excited about moving to the next milestone.
Rest assured, in T361803, we'll maintain the tight development iterations and ensure you're kept in the loop at every key milestone as we work towards hosting the logo-detection model-server on LiftWing.