Scan images in chronological order
Closed, Resolved | Public | 13 Estimated Story Points

Description

When scanning existing images, we should scan in chronological order, starting with the oldest images.

If we scan in alphabetical order and need to split the scan into multiple runs to fit within our 10M/month limit, new files will be uploaded into name ranges that previous runs have already covered. For example, if we start with files beginning with 'A', any files beginning with 'A' that are uploaded afterwards won't be part of that scan.

Scanning in chronological order helps with that. New uploads will get scanned last.
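To make the argument concrete, here is a minimal sketch that simulates a scan split into several runs, with one upload arriving between runs. The file names, timestamps, and the helper `scan_in_batches` are made up for illustration and are not part of the MediaModeration script.

```python
def scan_in_batches(files, sort_key, batch_size, uploads_between_runs):
    """Scan files in sort_key order, batch_size per run.

    uploads_between_runs[i] is a list of files uploaded after run i.
    Returns the names of files that were never scanned.
    """
    files = list(files)  # copy so the caller's list is not modified
    scanned = set()
    cursor = None  # sort key of the last file scanned so far
    run = 0
    while True:
        # Each run resumes strictly after the last scanned sort key.
        pending = sorted(
            (f for f in files if cursor is None or sort_key(f) > cursor),
            key=sort_key,
        )
        batch = pending[:batch_size]
        if not batch:
            break
        scanned.update(f["name"] for f in batch)
        cursor = sort_key(batch[-1])
        # Simulate uploads that happen while we wait for the next run.
        if run < len(uploads_between_runs):
            files.extend(uploads_between_runs[run])
        run += 1
    return sorted(f["name"] for f in files if f["name"] not in scanned)


existing = [
    {"name": "Apple.jpg", "ts": 1},
    {"name": "Banana.jpg", "ts": 2},
    {"name": "Cherry.jpg", "ts": 3},
    {"name": "Date.jpg", "ts": 4},
]
# Uploaded after the first run; sorts before everything already scanned by name.
late_uploads = [[{"name": "Aardvark.jpg", "ts": 5}]]

missed_by_name = scan_in_batches(existing, lambda f: f["name"], 2, late_uploads)
missed_by_time = scan_in_batches(
    existing, lambda f: (f["ts"], f["name"]), 2, late_uploads
)
print(missed_by_name)  # ['Aardvark.jpg']
print(missed_by_time)  # []
```

In the alphabetical scan the late upload sorts before the resume point and is skipped; in the chronological scan it sorts after everything else and is picked up by a later run.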

I believe there are some edge cases: for example, older revisions of files that get undeleted. I'm not sure if that's even possible or how frequent it is, but my guess is that it's much less frequent than new uploads.

Event Timeline

Image metadata is stored in two tables: image and oldimage. The former holds current revisions of the files, while the latter holds older revisions.

The script has two modes, 'old' and 'new', for scanning old revisions and new revisions respectively.

The image table has an index on img_timestamp, so we can efficiently query the images in chronological order. This is better than going by name: we would never need to re-scan, because newer uploads enter the image table with later timestamps.
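One way such a chronological, resumable query could look is sketched below. This is not the code from the patch: it uses an in-memory SQLite table as a stand-in for the wiki database, and breaking timestamp ties on img_name is an assumption made here so that the resume point is unambiguous.

```python
import sqlite3


def fetch_batch(conn, last_ts, last_name, limit):
    """Return up to `limit` rows that come strictly after (last_ts, last_name)."""
    return conn.execute(
        """
        SELECT img_name, img_timestamp FROM image
        WHERE img_timestamp > ?
           OR (img_timestamp = ? AND img_name > ?)
        ORDER BY img_timestamp, img_name
        LIMIT ?
        """,
        (last_ts, last_ts, last_name, limit),
    ).fetchall()


# Stand-in for the real image table; only the two columns used here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (img_name TEXT PRIMARY KEY, img_timestamp TEXT)")
conn.execute("CREATE INDEX image_timestamp_idx ON image (img_timestamp)")
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [
        ("B.jpg", "20200101000000"),
        ("A.jpg", "20200301000000"),
        ("C.jpg", "20200101000000"),
        ("D.jpg", "20200201000000"),
    ],
)

scanned = []
resume = ("", "")  # (timestamp, name) of the last scanned image
first_run = True
while True:
    batch = fetch_batch(conn, resume[0], resume[1], 2)
    if not batch:
        break
    scanned.extend(name for name, _ in batch)
    resume = (batch[-1][1], batch[-1][0])
    if first_run:
        # A new upload arrives between runs; its later timestamp puts it last.
        conn.execute(
            "INSERT INTO image VALUES (?, ?)", ("0new.jpg", "20200401000000")
        )
        first_run = False

print(scanned)  # ['B.jpg', 'C.jpg', 'D.jpg', 'A.jpg', '0new.jpg']
```

The file uploaded mid-scan is still reached even though its name sorts before every other file.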

For oldimage, there's no index on upload time, so we can't efficiently order by timestamp. During the initial scan, some images might be deleted and moved out of the still-unscanned part of the image table, so we might miss some. But if we first scan the image table and then scan oldimage ordered by name, we minimize the number of images that could fall through the cracks.

Currently the --start parameter takes the name of the image before the first image to scan. The script reports the last image that was scanned so that the next run of the script can use that image for --start. For testing, it is tricky to find the image before the one that you want to test - which is necessary if you are trying to scan one of the test images that will trigger a positive. The images are currently ordered alphabetically by name, which helps with that, but once they are sorted chronologically, that will be even more difficult. It seems it would be helpful to add an option to scan a single named file.
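The difference between the two ways of selecting files can be sketched as follows. Only `--start` is named in this task; the `--file` option name and the helpers `build_parser` and `select_files` are hypothetical and only illustrate the idea, not the actual script.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of scan selection options")
    group = parser.add_mutually_exclusive_group()
    # --start is exclusive: scanning resumes with the image AFTER this one.
    group.add_argument("--start", help="name of the image before the first to scan")
    # Hypothetical option name for scanning exactly one named file.
    group.add_argument("--file", help="scan only this named file")
    return parser


def select_files(ordered_names, start=None, file=None):
    """Pick the files to scan from a list that is already in scan order."""
    if file is not None:
        return [name for name in ordered_names if name == file]
    if start is None:
        return list(ordered_names)
    # Raises ValueError if `start` is not in the list.
    return ordered_names[ordered_names.index(start) + 1:]


names = ["A.jpg", "B.jpg", "Test_positive.jpg", "Z.jpg"]

# With --start, the tester must know which image precedes the target.
args = build_parser().parse_args(["--start", "B.jpg"])
resumed = select_files(names, args.start, args.file)

# With a single-file option, only the target's own name is needed.
args = build_parser().parse_args(["--file", "Test_positive.jpg"])
single = select_files(names, args.start, args.file)

print(resumed)  # ['Test_positive.jpg', 'Z.jpg']
print(single)   # ['Test_positive.jpg']
```

Once the scan order is chronological, the predecessor needed for `--start` can no longer be guessed from the name, which is why a single-file selector makes testing easier.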

Helga_sf set the point value for this task to 13. (Jul 7 2020, 12:36 PM)
Helga_sf raised the priority of this task from Medium to High. (Jul 14 2020, 2:06 PM)

Change 614704 had a related patch set uploaded (by Art.tsymbar; owner: arttsymbar):
[mediawiki/extensions/MediaModeration@master] Scan images in chronological order

https://gerrit.wikimedia.org/r/614704

Change 614704 merged by jenkins-bot:
[mediawiki/extensions/MediaModeration@master] Scan images in chronological order

https://gerrit.wikimedia.org/r/614704

Change 615511 had a related patch set uploaded (by Art.tsymbar; owner: arttsymbar):
[mediawiki/extensions/MediaModeration@master] Add additional option to scan single file by name

https://gerrit.wikimedia.org/r/615511

Change 615511 merged by jenkins-bot:
[mediawiki/extensions/MediaModeration@master] Add additional option to scan single file by name

https://gerrit.wikimedia.org/r/615511

@eprodromou could you please review the task and resolve it if it is done?

@Pchelolo I think you ran a bunch of scans on images last Wednesday in chronological order. Is that the case? I can close this ticket then.

Did the testing also include scanning a single named image?

@Pchelolo I think you ran a bunch of scans on images last Wednesday in chronological order. Is that the case? I can close this ticket then.

Yes.

Did the testing also include scanning a single named image?

Not yet. That change wasn't deployed yet.

That seems to be a separate issue. If the code is already written, is it worthwhile for me to create a new ticket for it?

I tested scanning a single file on beta, and it worked, so no new ticket necessary.