A database table to store the status of scans over images on wiki needs to be created. This database table needs to be able to store an identifier for the image, when the image was last checked and if it was a match.
The number of rows in the table is the sum of the rows of image, oldimage, and filearchive tables. On Commons, this is about 113M rows (of which 93M are in image).
This database table needs to be private and would be on an external store.
The creation of this table allows an easy reference as to what images have been scanned.
Answers to questions for the "Creating new tables" process
Questions from https://wikitech.wikimedia.org/wiki/Creating_new_tables
- Should this table be replicated to wiki replicas (does it not contain private data)? Contains private data through the mms_is_match and mms_last_checked columns.
- Will you be doing cross-joins with the wiki metadata? Yes, to the image, fileimage, and filearchive tables.
- Size of the table (number of rows expected). On commons 113M rows. Significantly smaller on all other wikis, e.g. ~5M for enwiki. Sum of the rows in image, oldimage, and filearchive for a wiki.
- Expected growth per year (number of rows). Size of table grows with the increase of growth in image, oldimage, and filearchive.
- Expected amount of queries, both writes and reads (per minute, per hour...per day, any of those are ok).
- UPDATE operations on rows when maintenance script is running. Will affect 5 rows per second on average. Limited by external API limits which may be increased. No update operations when maintenance script is stopped.
- INSERT operations on every insert to the image table. Average maximum of 1 new row per second - depends on the rate that files are being uploaded, which for everything but imports of files are limited to 1 per second (see Manual:Image_table#img_timestamp for why).
- SELECT operations on rows when maintenance script is running. Probably will select multiple rows at once for processing, but this would equate to 5 rows per second being read with WHERE conditions that would use indexes. Occasional read operations when the maintenance script is stopped to update graphs indicating progress.
- Examples of queries that will be using the table.
- Update table with scan result - UPDATE mediamoderation_scan SET mms_last_checked = <current timestamp>, mms_is_match = <check result> WHERE mms_sha1 = <sha1-value>;
- Insert on file upload - INSERT INTO mediamoderation_scan (mms_sha1) VALUES (<sha1 of image>);
- Selecting images that have not been marked as scanned - SELECT mms_sha1 FROM mediamoderation_scan WHERE mms_last_checked IS NULL LIMIT X;
- Selecting images that were scanned as negative and were scanned longest ago - SELECT mms_sha1 FROM mediamoderation_scan WHERE mms_last_checked < <cutoff month and year> AND mms_is_match = 0 LIMIT X;
- The release plan for the feature (are there specific wikis you'd like to test first etc). Planning for this table to be used on commonswiki first, but will likely expand to other wikis.
Acceptance criteria
- Add this table to the extension
- Get this table structure reviewed by DBA
- Verify that an external store DB can be used.
- Add this table to WMF wikis
QA Results - Visual Studio
AC | Status | Details |
---|---|---|
1 | ✅ | https://phabricator.wikimedia.org/T350321 here |