Create a maintenance script to automatically scan files listed in mediamoderation_scan
Closed, Resolved · Public · 3 Estimated Story Points

Description

A maintenance script is needed that automatically scans images listed in the mediamoderation_scan table. This script should meet several requirements:

  • It should be able to choose which files to process and/or prioritise based on the last time each file was scanned
  • It should reliably obtain an appropriately sized thumbnail of the file being scanned
  • Any file that does not have a thumbnail, and for which one cannot be generated on demand, should be left for a future scan run
  • Any errors or warnings in the maintenance script should be properly logged
  • An event should be emitted to statsd whenever a check is performed, so that the number of requests to the API can be monitored per wiki
  • The script should send an email to a specified address if a file is determined to be a match (T351407)
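
As an illustration of the first three requirements, here is a minimal Python sketch of the selection and skip logic. It is not the code of the maintenance script: `ScanRow`, `pick_batch`, `run_scan`, and the `last_checked` and `is_match` fields are hypothetical names, and the thumbnail lookup and the API check are passed in as plain callables.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ScanRow:
    """Hypothetical stand-in for one row of mediamoderation_scan."""
    sha1: str
    last_checked: Optional[date] = None  # None means the file was never scanned
    is_match: Optional[bool] = None


def pick_batch(rows, batch_size):
    """Put never-scanned rows first, then the least recently scanned ones."""
    ordered = sorted(
        rows,
        key=lambda r: (r.last_checked is not None, r.last_checked or date.min),
    )
    return ordered[:batch_size]


def run_scan(rows, batch_size, get_thumbnail, check, today):
    """Scan one batch and return the SHA-1 values that had to be skipped."""
    skipped = []
    for row in pick_batch(rows, batch_size):
        thumbnail = get_thumbnail(row.sha1)
        if thumbnail is None:
            # No thumbnail available: leave the row untouched for a future run.
            skipped.append(row.sha1)
            continue
        row.is_match = check(thumbnail)
        row.last_checked = today
    return skipped


# Example run: "bbb" was never scanned but has no thumbnail in this toy setup.
rows = [
    ScanRow("aaa", date(2023, 12, 1)),
    ScanRow("bbb"),
    ScanRow("ccc", date(2023, 11, 1)),
]
skipped = run_scan(
    rows,
    batch_size=2,
    get_thumbnail=lambda sha1: None if sha1 == "bbb" else b"thumbnail-bytes",
    check=lambda thumbnail: False,
    today=date(2023, 12, 19),
)
```

Because a skipped row keeps its old scan date, it stays at the front of the ordering and is picked up again on the next run.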

This will replace the existing maintenance script so that it can be easily run automatically.

Acceptance criteria
  • Ensure the requirements for the maintenance script are met (except the last, which will be done in a different task)
  • Ensure the maintenance script is well tested

Event Timeline

Change 983505 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/MediaModeration@master] [WIP] Add scanFilesInScanTable.php

https://gerrit.wikimedia.org/r/983505

Change 984169 had a related patch set uploaded (by Kosta Harlan; author: Dreamy Jazz):

[mediawiki/extensions/MediaModeration@wmf/1.42.0-wmf.10] Add maintenance script to scan files in the mediamoderation_scan table

https://gerrit.wikimedia.org/r/984169

Change 983505 merged by jenkins-bot:

[mediawiki/extensions/MediaModeration@master] Add maintenance script to scan files in the mediamoderation_scan table

https://gerrit.wikimedia.org/r/983505

Change 984169 merged by jenkins-bot:

[mediawiki/extensions/MediaModeration@wmf/1.42.0-wmf.10] Add maintenance script to scan files in the mediamoderation_scan table

https://gerrit.wikimedia.org/r/984169

Mentioned in SAL (#wikimedia-operations) [2023-12-19T14:17:05Z] <lucaswerkmeister-wmde@deploy2002> Started scap: Backport for [[gerrit:984166|Send PhotoDNA the mime type of the thumbnail and not original file (T351401)]], [[gerrit:984169|Add maintenance script to scan files in the mediamoderation_scan table (T351399)]]

Mentioned in SAL (#wikimedia-operations) [2023-12-19T14:18:37Z] <lucaswerkmeister-wmde@deploy2002> lucaswerkmeister-wmde and kharlan: Backport for [[gerrit:984166|Send PhotoDNA the mime type of the thumbnail and not original file (T351401)]], [[gerrit:984169|Add maintenance script to scan files in the mediamoderation_scan table (T351399)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-12-19T14:24:59Z] <lucaswerkmeister-wmde@deploy2002> Finished scap: Backport for [[gerrit:984166|Send PhotoDNA the mime type of the thumbnail and not original file (T351401)]], [[gerrit:984169|Add maintenance script to scan files in the mediamoderation_scan table (T351399)]] (duration: 07m 53s)

QA for this maintenance script requires a local wiki, because custom config needs to be defined (which cannot be done on betawikis) and a maintenance script needs to be run (which cannot be done on patch demo).

Suggested QA steps for a local environment follow. If you are unsure about the output the maintenance script produces in step 7, please feel free to send it to me for verification:

  1. Install MediaModeration (if required)
  2. If you don't have many testing images, use the PopulateImageTables.php maintenance script to populate the wiki with a variety of them (this can be run using ./maintenance/run MediaModeration:dev/PopulateImageTables.php, which will add 100 images).
  3. If possible, get the MediaModeration PhotoDNA API key and use it as the value of the $wgMediaModerationPhotoDNASubscriptionKey config in LocalSettings.php. If this is possible, skip ahead to step 6. If this is not possible, continue on to step 4. Using the actual API key makes the QA testing more comprehensive; otherwise, some of the code used in production does not get tested.
  4. Choose a variety of the images for the next step.
  5. Define the following config in LocalSettings.php, using a variety of image names as keys (without the File: prefix but with the file extension). This configuration mocks the responses from the PhotoDNA API and allows testing the maintenance script's behaviour under a variety of conditions. The values in the first array can be true to indicate that a file is a match and false to indicate that it is not. The values in the second array are the response codes. Anything other than 3000 will cause the scan to be considered failed for that file.
$wgMediaModerationPhotoDNAMockServiceFiles = [
	'FilesToIsMatchMap' => [
		// change to false if you want IsMatch to be false
		'File.jpg' => true
	],
	'FilesToStatusCodeMap' => [
		// change to 3000 if you want the status code to be "OK"
		// use a different file for each code: PHP keeps only the last value for a repeated key
		'File2.jpg' => 3004,
		'File3.jpg' => 3002,
		'File4.jpg' => 3206,
		'File5.jpg' => 3208,
		'File6.jpg' => 3209,
	]
];
  6. Run ./maintenance/run MediaModeration:importExistingFilesToScanTable.php
  7. Run ./maintenance/run MediaModeration:scanFilesInScanTable.php --verbose
    a. If using the PhotoDNA API key, verify that the scanning results look as expected. The error "ArchivedFile instances cannot be processed yet." is expected and should be ignored for the time being. Other errors are likely problems with the scanning script.
    b. If not using the PhotoDNA API key, verify that the errors shown and the match statuses line up with the mock responses defined in the config added in step 5.
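
To make the expected results for the mock configuration in step 5 concrete, the following Python sketch maps a file name to the outcome a tester should expect. It is a reading aid, not the extension's code: `mock_outcome` is a hypothetical helper, and it assumes that a file absent from both maps is treated as status 3000 with no match, which is consistent with the output posted later in this task but is not stated in the steps.

```python
OK_STATUS = 3000  # per step 5, any other status code means the scan failed

# Python mirror of a small $wgMediaModerationPhotoDNAMockServiceFiles value.
mock_config = {
    "FilesToIsMatchMap": {"File.jpg": True},
    "FilesToStatusCodeMap": {"File2.jpg": 3004, "File3.jpg": 3002},
}


def mock_outcome(file_name, config):
    """Return the result text expected for file_name under the mock config."""
    # Assumption: files not listed in the status map get the "OK" status.
    status = config.get("FilesToStatusCodeMap", {}).get(file_name, OK_STATUS)
    if status != OK_STATUS:
        return "Scan failed."
    # Assumption: files not listed in the match map are not a match.
    is_match = config.get("FilesToIsMatchMap", {}).get(file_name, False)
    return "Positive match." if is_match else "No match."


for name in ("File.jpg", "File2.jpg", "File3.jpg", "Unlisted.jpg"):
    print(name, "->", mock_outcome(name, mock_config))
```

The status code is checked before the match flag, so a file listed in both maps with a non-3000 code is expected to be reported as failed.
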
Djackson-ctr subscribed.

I have verified that the new code has been implemented and is functioning and displaying as expected. Thank you for the QA steps, @Dreamy_Jazz.

Below is the configuration I used for $wgMediaModerationPhotoDNAMockServiceFiles, and then below that are the results after running step 7b:


$wgMediaModerationPhotoDNAMockServiceFiles = [
    'FilesToIsMatchMap' => [
        // change to false if you want IsMatch to be false
        'Feeding_(12533550343).jpg' => true
       
    ],
    'FilesToStatusCodeMap' => [
        // change to 3000 if you want the status code to be "OK"
        'LL-Q8097_(tel)-ప్రశాంతి-గంగపండగ.wav' => 3004,
        'Holger_Krisp.jpg' => 3002,
        'Groen_parkeren,_Zeist_(50346604211).jpg' => 3206,
        '86225_"Hardwicke"_at_Stafford.jpg' => 3208,
        '16_Simons.jpg' => 3209,
    ]
];

SHA-1 3fkfxskkihcp9v4inui6pj6rmaxyy7v: Positive match.
SHA-1 kfvxs5in8nxbs7xss19gh05e014sobi: No match.
SHA-1 lvadkf8ap30aur1znpkg1fehbqezavd
...3209: Request Size Exceeded
SHA-1 lvadkf8ap30aur1znpkg1fehbqezavd: Scan failed.
SHA-1 d653egj7e9wefpjmahe7q44uuuhppd6: No match.
SHA-1 7xvjsd1f0owdennu873q5xpdrfjubpq: No match.
SHA-1 fnh3b2mrz7ac18q4ugjcjyfw5ny61z6
...3206: The given file could not be verified as an image
SHA-1 fnh3b2mrz7ac18q4ugjcjyfw5ny61z6: Scan failed.
SHA-1 lntjwyqvm8b9lf47jq9ajb6nqlf997l: No match.
SHA-1 pftu02opuronogyhft0cse0gaxp0a47: No match.
SHA-1 fp2h4s3pi8nexr55617hl16uso1xgqt
...3002: Invalid or missing request parameter(s)
SHA-1 fp2h4s3pi8nexr55617hl16uso1xgqt: Scan failed.
SHA-1 jhax5st52hu6l2g68nv7bqjsh0u293g: No match.
SHA-1 hy68umbp1pktr838demzrniuhvhg2g2: No match.
SHA-1 r7wojzhbxy9k8ibsn5xfxx9vqumqhv4
...3208: Image size in pixels is not within allowed range
SHA-1 r7wojzhbxy9k8ibsn5xfxx9vqumqhv4: Scan failed.
SHA-1 adve2o0sageviju1f5p6ep195wk0su5: No match.
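
For longer runs, output like the above can be tallied rather than read line by line. The Python sketch below is one way to do that; the two regular expressions are inferred from the lines shown in this comment and are an assumption about the output format, and `summarise` is a hypothetical helper.

```python
import re
from collections import Counter

# Patterns inferred from the posted output, not from the script's source.
RESULT_LINE = re.compile(r"^SHA-1 (\w+): (Positive match|No match|Scan failed)\.$")
ERROR_LINE = re.compile(r"^\.\.\.(\d+): (.+)$")


def summarise(output):
    """Count result lines and collect the error codes found in the output."""
    outcomes = Counter()
    error_codes = []
    for line in output.splitlines():
        line = line.strip()
        if m := RESULT_LINE.match(line):
            outcomes[m.group(2)] += 1
        elif m := ERROR_LINE.match(line):
            error_codes.append(int(m.group(1)))
    return outcomes, error_codes


# First five lines of the output posted above.
sample = """\
SHA-1 3fkfxskkihcp9v4inui6pj6rmaxyy7v: Positive match.
SHA-1 kfvxs5in8nxbs7xss19gh05e014sobi: No match.
SHA-1 lvadkf8ap30aur1znpkg1fehbqezavd
...3209: Request Size Exceeded
SHA-1 lvadkf8ap30aur1znpkg1fehbqezavd: Scan failed.
"""
outcomes, error_codes = summarise(sample)
print(dict(outcomes), error_codes)
```

The collected error codes can then be compared against the values configured in FilesToStatusCodeMap.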