We need to thumbnail images before sending them to the PhotoDNA cloud service. Per their documentation, the image size must be in the following range: minimum size is 160x160 pixels and maximum size is 4MB.
|Open||None||T247977 Implement Hash Checking of Media Files|
|Resolved||Peter.ovchyn||T245595 MediaModeration extension MVP|
|Resolved||• eprodromou||T246915 Send 160x160 thumbnails to photo DNA instead of real files|
FileDeleteBatch is clearing out the thumbnails afaik, so we should be fine. However such a small thumb will probably never be used again, so storing it more permanently doesn't make much sense indeed.
Is there a way to transform image in-memory. Maybe it worth implementing it in the extension?
I'm not sure about in-memory - it seems like all the MediaHandler subclasses implement transformation writing into a temp file, so changing that behavior might be too complex and too far out of the scope. However, this code we probably don't want to execute. How about adding a RENDER_TMP option into file render options and returning the early from File::generateAndSaveThumb based on that flag?
I am not sure that's a good idea. The code in core that does thumbnailing is battle tested for years and years, has support for whole bunch of image types, afaik works with or without imagick extension.
I am not sure what benefit will reimplementing it provide apart of doing it completely in-memory? I think that the cost of reimplementing thumbnailing functionality will be too high. Having some tmp files seems fine to me.
Please note that MediaWiki's built-in media handler should not be used anymore in Wikimedia production. And ad-hoc use of ImageMagic is indeed also not acceptable (wouldn't work for many types and be potentially insecure or not scalable to our requirements).
All relevant entry points are proxied through Swift and/or Thumbor. This also includes private wiki traffic and ad-hoc generation via thumb.php. There is a configuration setting that thus automatically makes this code bypassed. In production, the decision to store the result is made by Thumbor, not MediaWiki.
Also, why does it need to be temporary? Users can request arbitrary sizes and we store those as well. Should be fine to just store without additional long-term complexity and maintenance cost being added to the under/unowned media stack in MediaWiki. Did SRE suggest to avoid storing them?
All relevant entry points are proxied through Swift and/or Thumbor.
In the current setup, we need to POST image bytes into the PhotoDNA cloud service to get the result. Is it possible to achieve with thumbor? This could be potentially changed to the model where we give them the image thumbnail URL instead though, but the pull model would make us cache potentially illegal thumbnail, thus complicating the problem of erasing it completely later.
Also, why does it need to be temporary?
This processing is run on every image upload, so we would need to store a thumb for all images on all wikis, which sounded like a sizable resource use for one-time use thumbnail. Additionally, in the initial requirements, we needed to remove all physical copies on the image from everywhere in our infrastructure in case a match is found, so creating more persistent representations sounded like a bad idea.
Does it run synchronously during upload, or async afterwards?
If during upload, does it operate on a real File object and then try to rollback? Or does it operate on the UploadStashFile? As far as I know we support private thumnails for UploadStashFile already, and these are shown during the upload process. I do not know how that works with thumbor currently. I suspect it is able to proxy those as well, but are they are kept in a place where only the uploader can see them for a limited time. Presumably we already have a way to garbage collect those.
I understand and agree with not wanting to generate more copies. If this runs asynchronous after upload then I would not worry about one extra thumbnail, we already generate several dozen unconditionally directly after upload. Whatever purge/delete mechanism would cover this one as well.
I think the bottom line though is that I think we absolutely should not do ad-hoc image processing on app servers and jobrunners (we moved Thumbor out for a reason), and also not deploy new features that rely on the legacy built-in handler which should be disabled in the future (with the relevant packages uninstalled).
It runs asynchronously via the job queue, so it operates on already uploaded files for two reasons:
- We want to process retroactively files that already were uploaded
- We didn't want to mess with the upload process
I think the bottom line though is that I think we absolutely should not do ad-hoc image processing on app servers and jobrunners
Ok, this makes totally sense, thank you. The better's an enemy of the good, and in attempt of over optimizing this I made the design incorrect.
We should correct:
- Revert the https://gerrit.wikimedia.org/r/c/mediawiki/core/+/579557
- Change course to submit a thumbnail URL to PhotoDNA service and rely on standard thumbnail generation logic.
Thank you @Krinkle