Page MenuHomePhabricator

Send 160x160 thumbnails to photo DNA instead of real files
Closed, ResolvedPublic

Description

We need to thumbnail images before sending them to the PhotoDNA cloud service. Per their documentation, the image size must be in the following range: minimum size is 160x160 pixels and maximum size is 4MB.

Event Timeline

Pchelolo created this task.Mar 4 2020, 5:04 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 4 2020, 5:04 PM
Peter.ovchyn added a comment.EditedMar 10 2020, 7:31 PM

@Pchelolo File::createThumb will create another one copy of the file. is that OK? Should it be removed after processing?

Is there a way to transform image in-memory. Maybe it worth implementing it in the extension?

@Pchelolo File::createThumb will create another one copy of the file. is that OK? Should it be removed after processing?

FileDeleteBatch is clearing out the thumbnails afaik, so we should be fine. However such a small thumb will probably never be used again, so storing it more permanently doesn't make much sense indeed.

Is there a way to transform image in-memory. Maybe it worth implementing it in the extension?

I'm not sure about in-memory - it seems like all the MediaHandler subclasses implement transformation writing into a temp file, so changing that behavior might be too complex and too far out of the scope. However, this code we probably don't want to execute. How about adding a RENDER_TMP option into file render options and returning the early from File::generateAndSaveThumb based on that flag?

I offer not to use thumbnails functionality from core, but implement it inline in extension using imagick.

I am not sure that's a good idea. The code in core that does thumbnailing is battle tested for years and years, has support for whole bunch of image types, afaik works with or without imagick extension.

I am not sure what benefit will reimplementing it provide apart of doing it completely in-memory? I think that the cost of reimplementing thumbnailing functionality will be too high. Having some tmp files seems fine to me.

I just remember that even keeping info in queue was considered as problematic. Only this concern:)

Change 579392 had a related patch set uploaded (by Peter.ovchyn; owner: Peter.ovchyn):
[mediawiki/extensions/MediaModeration@master] Send 160x160 thumbnails to photo DNA instead of real files

https://gerrit.wikimedia.org/r/579392

Change 579557 had a related patch set uploaded (by Peter.ovchyn; owner: Peter.ovchyn):
[mediawiki/core@master] Add possibility to render thumbnails without saving it to storage.

https://gerrit.wikimedia.org/r/579557

Helga_sf removed Peter.ovchyn as the assignee of this task.May 6 2020, 2:28 PM

It turns out it is not just a suggestion, but a requirement. Files over that 160x160px/4M are being rejected by PhotoDNA.

Helga_sf raised the priority of this task from Medium to High.Jul 1 2020, 3:43 PM
CCicalese_WMF updated the task description. (Show Details)Jul 1 2020, 4:51 PM

Change 579557 merged by jenkins-bot:
[mediawiki/core@master] Add possibility to render thumbnails without saving it to storage.

https://gerrit.wikimedia.org/r/579557

Krinkle added a subscriber: Krinkle.EditedJul 10 2020, 5:13 PM

Please note that MediaWiki's built-in media handler should not be used anymore in Wikimedia production. And ad-hoc use of ImageMagic is indeed also not acceptable (wouldn't work for many types and be potentially insecure or not scalable to our requirements).

All relevant entry points are proxied through Swift and/or Thumbor. This also includes private wiki traffic and ad-hoc generation via thumb.php. There is a configuration setting that thus automatically makes this code bypassed. In production, the decision to store the result is made by Thumbor, not MediaWiki.

Also, why does it need to be temporary? Users can request arbitrary sizes and we store those as well. Should be fine to just store without additional long-term complexity and maintenance cost being added to the under/unowned media stack in MediaWiki. Did SRE suggest to avoid storing them?

Pchelolo added a comment.EditedJul 10 2020, 5:25 PM

All relevant entry points are proxied through Swift and/or Thumbor.

In the current setup, we need to POST image bytes into the PhotoDNA cloud service to get the result. Is it possible to achieve with thumbor? This could be potentially changed to the model where we give them the image thumbnail URL instead though, but the pull model would make us cache potentially illegal thumbnail, thus complicating the problem of erasing it completely later.

Also, why does it need to be temporary?

This processing is run on every image upload, so we would need to store a thumb for all images on all wikis, which sounded like a sizable resource use for one-time use thumbnail. Additionally, in the initial requirements, we needed to remove all physical copies on the image from everywhere in our infrastructure in case a match is found, so creating more persistent representations sounded like a bad idea.

Krinkle added a comment.EditedJul 10 2020, 5:30 PM

Does it run synchronously during upload, or async afterwards?

If during upload, does it operate on a real File object and then try to rollback? Or does it operate on the UploadStashFile? As far as I know we support private thumnails for UploadStashFile already, and these are shown during the upload process. I do not know how that works with thumbor currently. I suspect it is able to proxy those as well, but are they are kept in a place where only the uploader can see them for a limited time. Presumably we already have a way to garbage collect those.

I understand and agree with not wanting to generate more copies. If this runs asynchronous after upload then I would not worry about one extra thumbnail, we already generate several dozen unconditionally directly after upload. Whatever purge/delete mechanism would cover this one as well.

I think the bottom line though is that I think we absolutely should not do ad-hoc image processing on app servers and jobrunners (we moved Thumbor out for a reason), and also not deploy new features that rely on the legacy built-in handler which should be disabled in the future (with the relevant packages uninstalled).

It runs asynchronously via the job queue, so it operates on already uploaded files for two reasons:

  1. We want to process retroactively files that already were uploaded
  2. We didn't want to mess with the upload process

I think the bottom line though is that I think we absolutely should not do ad-hoc image processing on app servers and jobrunners

Ok, this makes totally sense, thank you. The better's an enemy of the good, and in attempt of over optimizing this I made the design incorrect.

We should correct:

  1. Revert the https://gerrit.wikimedia.org/r/c/mediawiki/core/+/579557
  2. Change course to submit a thumbnail URL to PhotoDNA service and rely on standard thumbnail generation logic.

Thank you @Krinkle

Change 611220 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/core@master] Revert "Add possibility to render thumbnails without saving it to storage."

https://gerrit.wikimedia.org/r/611220

Change 611220 merged by jenkins-bot:
[mediawiki/core@master] Revert "Add possibility to render thumbnails without saving it to storage."

https://gerrit.wikimedia.org/r/611220

Krinkle removed a subscriber: Krinkle.Jul 16 2020, 5:00 PM

@eprodromou could you please review the task and close it if it is done?

@eprodromou could you please review the task and close it if it is done?

Sure thing. I'll ask @CCicalese_WMF to run the script in production on the test image as well as some other images to make sure it's working correctly. Once we know it works there, I'll close it.

eprodromou closed this task as Resolved.Jul 27 2020, 1:57 PM
eprodromou claimed this task.

@Pchelolo I think this is complete, since we ran a successful scan of larger images last week. I'm going to mark this as resolved, but correct me if I'm wrong.