Scan images in chronological order
Closed, Resolved · Public · 13 Estimated Story Points

Description

When scanning existing images, we should scan in chronological order, starting with the oldest images.

If we scan in alphabetical order and have to split the scan into multiple runs to stay within our 10M/month limit, new files will be uploaded into name ranges that earlier runs have already covered. For example, if we start with files beginning with 'A', any files beginning with 'A' that are uploaded afterwards won't be part of that scan.

Scanning in chronological order helps with that. New uploads will get scanned last.
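
To illustrate the difference, here is a small simulation. It is only a sketch under assumed conditions: the file names, timestamps, batch size, and the `scan_in_batches` helper are all made up for illustration and are not part of the extension.

```python
def scan_in_batches(files, key, batch_size, uploads_between_runs):
    """Scan files in `key` order, `batch_size` per run.

    One file from `uploads_between_runs` is uploaded after each run.
    Returns the set of file names that were scanned.
    """
    files = list(files)                    # do not mutate the caller's list
    pending = list(uploads_between_runs)
    scanned = set()
    cursor = None                          # sort key of the last scanned file
    while True:
        remaining = sorted(
            (f for f in files if cursor is None or key(f) > cursor), key=key
        )
        if not remaining:
            break
        batch = remaining[:batch_size]
        scanned.update(f["name"] for f in batch)
        cursor = key(batch[-1])
        if pending:                        # a new upload arrives between runs
            files.append(pending.pop(0))
    return scanned


# Hypothetical data: "ts" stands in for the upload timestamp.
EXISTING = [
    {"name": "Apple.jpg", "ts": 3},
    {"name": "Banana.jpg", "ts": 1},
    {"name": "Cherry.jpg", "ts": 2},
    {"name": "Date.jpg", "ts": 4},
]
UPLOADS = [{"name": "Apricot.jpg", "ts": 5}, {"name": "Avocado.jpg", "ts": 6}]
ALL_NAMES = {f["name"] for f in EXISTING + UPLOADS}

alphabetical_missed = ALL_NAMES - scan_in_batches(
    EXISTING, lambda f: f["name"], 2, UPLOADS
)
chronological_missed = ALL_NAMES - scan_in_batches(
    EXISTING, lambda f: f["ts"], 2, UPLOADS
)
print("missed (alphabetical): ", sorted(alphabetical_missed))
print("missed (chronological):", sorted(chronological_missed))
```

In this toy setup the alphabetical scan never sees the two late uploads because their names sort before the cursor, while the chronological scan picks them up at the end.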

I believe there are some edge cases: for example, older revisions of files that get undeleted. I'm not sure if that's even possible or how frequent it is, but my guess is that it's much less frequent than new uploads.

Details

	Subject	Repo	Branch	Lines +/-
	Add additional option to scan single file by name	mediawiki/extensions/MediaModeration	master	+102 -6
	Scan images in chronological order	mediawiki/extensions/MediaModeration	master	+13 -6


Related Objects

Status	Assigned	Task
Declined	None	T247977 Implement Hash Checking of Media Files
Resolved	Peter.ovchyn	T245595 MediaModeration extension MVP
Resolved	Art.tsymbar	T254499 Scan images in chronological order

Event Timeline

• eprodromou created this task.Jun 4 2020, 4:49 PM

Image metadata is stored in two tables: image and oldimage. The former holds current revisions of the files, while the latter holds older revisions.

The script has two modes - 'old' and 'new' for scanning old revisions and new revisions.

The image table has an index on img_timestamp, so we can efficiently query the images in chronological order. This is better than going by name: we would never need to re-scan, because newer uploads are added to the image table with later timestamps.
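
A rough sketch of what resumable chronological batching could look like, using an in-memory SQLite table as a stand-in for the real database. The simplified schema, the `next_batch` and `scan_all` helpers, the sample rows, and the use of img_name as a tie-breaker for equal timestamps are assumptions for illustration; this is not the query used in the patch.

```python
import sqlite3


def next_batch(conn, after_timestamp, after_name, limit):
    """Return up to `limit` (img_name, img_timestamp) rows after the cursor."""
    return conn.execute(
        """
        SELECT img_name, img_timestamp FROM image
        WHERE img_timestamp > ?
           OR (img_timestamp = ? AND img_name > ?)
        ORDER BY img_timestamp ASC, img_name ASC
        LIMIT ?
        """,
        (after_timestamp, after_timestamp, after_name, limit),
    ).fetchall()


def scan_all(conn, cursor=("", ""), limit=2):
    """Scan batch by batch from `cursor`; return (scanned names, new cursor)."""
    scanned = []
    while True:
        rows = next_batch(conn, cursor[0], cursor[1], limit)
        if not rows:
            return scanned, cursor
        scanned.extend(name for name, _ts in rows)
        last_name, last_ts = rows[-1]
        cursor = (last_ts, last_name)


# Simplified stand-in schema; timestamps are 14-digit strings, so they
# compare correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image (img_name TEXT PRIMARY KEY, img_timestamp TEXT NOT NULL)"
)
conn.execute("CREATE INDEX image_img_timestamp ON image (img_timestamp)")
conn.executemany(
    "INSERT INTO image VALUES (?, ?)",
    [
        ("B.png", "20200101000000"),
        ("A.png", "20200301000000"),
        ("C.png", "20200101000000"),
        ("D.png", "20200201000000"),
    ],
)

first_run, cursor = scan_all(conn)

# A file uploaded after the first run has a later timestamp, so a second
# run resuming from the saved cursor finds it without any re-scan.
conn.execute("INSERT INTO image VALUES (?, ?)", ("AA.png", "20200401000000"))
second_run, cursor = scan_all(conn, cursor)
```

The cursor is the (timestamp, name) pair of the last scanned row, so a later run only needs that pair to continue where the previous one stopped.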

For oldimage, there is no index on upload time, so we can't efficiently order by timestamp. During the initial scan, some images might be deleted and moved out of the not-yet-scanned part of the 'image' table, so we might miss some. If we scan the 'image' table first and then scan 'oldimage' ordered by name, we minimize the number of images that could fall through the cracks.
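
The overall ordering described above could be sketched like this. The `scan_order` helper and the sample rows are hypothetical, and sorting old revisions by oi_name with oi_archive_name as a tie-breaker is an assumption about what "ordering by name" means here.

```python
def scan_order(image_rows, oldimage_rows):
    """Yield (mode, identifier) pairs in the proposed scan order."""
    # Phase 1 ('new' mode): current revisions, oldest upload first.
    for row in sorted(
        image_rows, key=lambda r: (r["img_timestamp"], r["img_name"])
    ):
        yield ("new", row["img_name"])
    # Phase 2 ('old' mode): older revisions, ordered by name because there
    # is no timestamp index to lean on.
    for row in sorted(
        oldimage_rows, key=lambda r: (r["oi_name"], r["oi_archive_name"])
    ):
        yield ("old", row["oi_archive_name"])


# Made-up sample rows.
image_rows = [
    {"img_name": "Zebra.jpg", "img_timestamp": "20190101000000"},
    {"img_name": "Ant.jpg", "img_timestamp": "20200101000000"},
]
oldimage_rows = [
    {"oi_name": "Zebra.jpg", "oi_archive_name": "20180101000000!Zebra.jpg"},
    {"oi_name": "Ant.jpg", "oi_archive_name": "20191201000000!Ant.jpg"},
]

order = list(scan_order(image_rows, oldimage_rows))
```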

Helga_sf triaged this task as Medium priority.Jul 1 2020, 3:47 PM

Helga_sf edited projects, added Platform Team Workboards (S&F Workboard); removed Platform Team Workboards (User Stories).

Currently the --start parameter takes the name of the image before the first image to scan. The script reports the last image that was scanned so that the next run of the script can use that image for --start. For testing, it is tricky to find the image before the one that you want to test - which is necessary if you are trying to scan one of the test images that will trigger a positive. The images are currently ordered alphabetically by name, which helps with that, but once they are sorted chronologically, that will be even more difficult. It seems it would be helpful to add an option to scan a single named file.
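
To make the current behaviour and the proposed option concrete, here is a minimal sketch. Only `--start` is taken from the description above; the `--file` option name, the `select_files` helper, and the sample file names are hypothetical and do not reflect the actual script.

```python
import argparse


def select_files(names, start=None, only=None, limit=None):
    """Pick the files to scan from `names`, in alphabetical order.

    `start` is exclusive: scanning begins with the first name after it,
    matching the "image before the first image to scan" behaviour.
    `only` restricts the scan to a single named file.
    """
    ordered = sorted(names)
    if only is not None:
        return [n for n in ordered if n == only]
    if start is not None:
        ordered = [n for n in ordered if n > start]
    return ordered[:limit]


def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative scan options")
    parser.add_argument("--start", help="name of the image BEFORE the first one to scan")
    parser.add_argument("--file", help="hypothetical option: scan only this file")
    return parser


names = ["Cherry.jpg", "Apple.jpg", "Banana.jpg"]

# With --start, the tester has to know the name that sorts just before the target.
args = build_parser().parse_args(["--start", "Apple.jpg"])
from_start = select_files(names, start=args.start, limit=1)

# With a single-file option, the target is named directly.
args = build_parser().parse_args(["--file", "Banana.jpg"])
single = select_files(names, only=args.file)
```

Both calls select the same file, but the second does not depend on the sort order, which is the point of the proposed option.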

Peter.ovchyn removed Peter.ovchyn as the assignee of this task.Jul 2 2020, 4:54 PM

Art.tsymbar claimed this task.Jul 3 2020, 9:35 AM

Art.tsymbar moved this task from Backlog to In Progress/Doing on the Platform Team Workboards (S&F Workboard) board.Jul 3 2020, 10:30 AM

Helga_sf set the point value for this task to 13.Jul 7 2020, 12:36 PM

Helga_sf raised the priority of this task from Medium to High.Jul 14 2020, 2:06 PM

Change 614704 had a related patch set uploaded (by Art.tsymbar; owner: arttsymbar):
[mediawiki/extensions/MediaModeration@master] Scan images in chronological order

https://gerrit.wikimedia.org/r/614704

gerritbot added a project: Patch-For-Review.Jul 20 2020, 9:36 AM

Art.tsymbar added a comment.Jul 20 2020, 12:44 PM

This comment was removed by Art.tsymbar.

Art.tsymbar moved this task from In Progress/Doing to In Progress/In Review on the Platform Team Workboards (S&F Workboard) board.Jul 20 2020, 12:47 PM

Change 614704 merged by jenkins-bot:
[mediawiki/extensions/MediaModeration@master] Scan images in chronological order

https://gerrit.wikimedia.org/r/614704

ReleaseTaggerBot added a project: MW-1.36-notes (1.36.0-wmf.1; 2020-07-21).Jul 20 2020, 8:00 PM

Maintenance_bot removed a project: Patch-For-Review.Jul 20 2020, 8:11 PM

ReleaseTaggerBot edited projects, added MW-1.36-notes (1.36.0-wmf.2; 2020-07-28); removed MW-1.36-notes (1.36.0-wmf.1; 2020-07-21).Jul 21 2020, 6:01 PM

Change 615511 had a related patch set uploaded (by Art.tsymbar; owner: arttsymbar):
[mediawiki/extensions/MediaModeration@master] Add additional option to scan single file by name

https://gerrit.wikimedia.org/r/615511

gerritbot added a project: Patch-For-Review.Jul 22 2020, 3:25 PM

Change 615511 merged by jenkins-bot:
[mediawiki/extensions/MediaModeration@master] Add additional option to scan single file by name

https://gerrit.wikimedia.org/r/615511

Maintenance_bot removed a project: Patch-For-Review.Jul 22 2020, 5:10 PM

mdaniels5757 subscribed.Jul 22 2020, 7:44 PM

Helga_sf moved this task from In Progress/In Review to PM Sign Off on the Platform Team Workboards (S&F Workboard) board.Jul 23 2020, 12:07 PM

@eprodromou could you please review the task and resolve it if it is done?

• AMooney added a project: Platform Team Sprints Board (Sprint 0).Jul 24 2020, 2:12 PM

@Pchelolo I think you ran a bunch of scans on images last Wednesday in chronological order. Is that the case? I can close this ticket then.

Did the testing also include scanning a single named image?

> @Pchelolo I think you ran a bunch of scans on images last Wednesday in chronological order. Is that the case? I can close this ticket then.

Yes.

> Did the testing also include scanning a single named image?

Not yet. That change wasn't deployed yet.

In T254499#6338493, @Pchelolo wrote:

> > Did the testing also include scanning a single named image?
>
> Not yet. That change wasn't deployed yet.

That seems to be a separate issue. If the code is already written, is it worthwhile for me to create a new ticket for it?

I tested scanning a single file on beta, and it worked, so no new ticket necessary.
