Page MenuHomePhabricator

[M] Create database table to store status of scans
Closed, ResolvedPublic

Description

A database table to store the status of scans over images on wiki needs to be created. This database table needs to be able to store an identifier for the image, when the image was last checked and if it was a match.

The number of rows in the table is the sum of the rows of image, oldimage, and filearchive tables. On Commons, this is about 113M rows (of which 93M are in image).

This database table needs to be private and would be on an external store.

The creation of this table allows an easy reference as to what images have been scanned.

Answers to questions for the "Creating new tables" process

Questions from https://wikitech.wikimedia.org/wiki/Creating_new_tables

  1. Should this table be replicated to wiki replicas (does it not contain private data)? Contains private data through the mms_is_match and mms_last_checked columns.
  2. Will you be doing cross-joins with the wiki metadata? Yes, to the image, fileimage, and filearchive tables.
  3. Size of the table (number of rows expected). On commons 113M rows. Significantly smaller on all other wikis, e.g. ~5M for enwiki. Sum of the rows in image, oldimage, and filearchive for a wiki.
  4. Expected growth per year (number of rows). Size of table grows with the increase of growth in image, oldimage, and filearchive.
  5. Expected amount of queries, both writes and reads (per minute, per hour...per day, any of those are ok).
    1. UPDATE operations on rows when maintenance script is running. Will affect 5 rows per second on average. Limited by external API limits which may be increased. No update operations when maintenance script is stopped.
    2. INSERT operations on every insert to the image table. Average maximum of 1 new row per second - depends on the rate that files are being uploaded, which for everything but imports of files are limited to 1 per second (see Manual:Image_table#img_timestamp for why).
    3. SELECT operations on rows when maintenance script is running. Probably will select multiple rows at once for processing, but this would equate to 5 rows per second being read with WHERE conditions that would use indexes. Occasional read operations when the maintenance script is stopped to update graphs indicating progress.
  6. Examples of queries that will be using the table.
    1. Update table with scan result - UPDATE mediamoderation_scan SET mms_last_checked = <current timestamp>, mms_is_match = <check result> WHERE mms_sha1 = <sha1-value>;
    2. Insert on file upload - INSERT INTO mediamoderation_scan (mms_sha1) VALUES (<sha1 of image>);
    3. Selecting images that have not been marked as scanned - SELECT mms_sha1 FROM mediamoderation_scan WHERE mms_last_checked IS NULL LIMIT X;
    4. Selecting images that were scanned as negative and were scanned longest ago - SELECT mms_sha1 FROM mediamoderation_scan WHERE mms_last_checked < <cutoff month and year> AND mms_is_match = 0 LIMIT X;
  7. The release plan for the feature (are there specific wikis you'd like to test first etc). Planning for this table to be used on commonswiki first, but will likely expand to other wikis.
Acceptance criteria
  • Add this table to the extension
  • Get this table structure reviewed by DBA
  • Verify that an external store DB can be used.
  • Add this table to WMF wikis

QA Results - Visual Studio

Related Objects

StatusSubtypeAssignedTask
OpenNone
OpenNone
OpenNone
OpenNone
Resolvedkostajh
ResolvedNone
ResolvedTchanders
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedTchanders
OpenNone
ResolvedBUG REPORTDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedNone
ResolvedDreamy_Jazz
Resolvedkostajh
Resolvedkostajh
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz
ResolvedDreamy_Jazz

Event Timeline

Dreamy_Jazz renamed this task from Create database table to store status of scans to [M] Create database table to store status of scans.Nov 2 2023, 12:05 AM

Change 970886 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/MediaModeration@master] [WIP] Add mediamoderation_scan table

https://gerrit.wikimedia.org/r/970886

Dreamy_Jazz added a project: DBA.

Adding DBA for visibility (item 4.2 in preparation for creating a new table at https://wikitech.wikimedia.org/wiki/Creating_new_tables)

The schema looks good to me, my only concern is about adding mms_uploaded_timestamp, it adds a binary(14) column which considering its index and the total size of the table can amount to a large size. Do you really need an upload timestamp? Yes is a valid answer.

The schema looks good to me, my only concern is about adding mms_uploaded_timestamp, it adds a binary(14) column which considering its index and the total size of the table can amount to a large size. Do you really need an upload timestamp? Yes is a valid answer.

Short answer: Yes

Long answer:
The ability to uniquely identify the content of a file by it's sha1 is possible, but not if the file is duplicated or if two files that are different images have the same sha1.

What the timestamp column does here is ensures that if the same image is uploaded again after scanning, the newly uploaded image will get scanned separately. This avoids any problems caused by collisions and duplications by using the using the uniqueness of the timestamp column to limit this to only imported images uploaded at the same time as another image with the same sha1.

Dreamy_Jazz moved this task from Refine to Triage on the DBA board.EditedNov 7 2023, 1:19 PM

Moving back to Triage per https://wikitech.wikimedia.org/wiki/Creating_new_tables. Now requesting DBA signoff (point 5).

Thoughts about whether this table can be on an external store would be welcome. No web request will trigger a read on this table, and only on a file upload does an insert operation occur. Such a insert can be delayed or throttled for a short time without concern (as long as the insert definitely will occur). Updates on all rows in the table may occur, so no rows can be set to be read only.

Change 972384 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/WikimediaMaintenance@master] Add MediaModeration to createExtensionTables.php

https://gerrit.wikimedia.org/r/972384

Change 972387 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/WikimediaMaintenance@master] Add MediaModeration to addWiki.php

https://gerrit.wikimedia.org/r/972387

The schema looks good to me, my only concern is about adding mms_uploaded_timestamp, it adds a binary(14) column which considering its index and the total size of the table can amount to a large size. Do you really need an upload timestamp? Yes is a valid answer.

Short answer: Yes

Long answer:
The ability to uniquely identify the content of a file by it's sha1 is possible, but not if the file is duplicated or if two files that are different images have the same sha1.

What the timestamp column does here is ensures that if the same image is uploaded again after scanning, the newly uploaded image will get scanned separately. This avoids any problems caused by collisions and duplications by using the using the uniqueness of the timestamp column to limit this to only imported images uploaded at the same time as another image with the same sha1.

I understand but the chance of hash collision is basically zero as hashes are designed to minimize collisions (do you have an example of two different image having the same hash in our production?). If you're talking about images that are different pages but the exact same file, we still want to treat them the same no matter what.

I want you to see this as a matter of trade-off and cost. Adding 150M (and counting) binary(14) adds a lot of cost to our infra, in terms of maintenance, storage, memory, replication, network, update cost, locking, and so on (specially since it's also used in an index too).

So if you're concern is on hash collision, It's not worth the cost it adds but if you have some other usecase, we could look at it and modify it to be more compact rather than a large string for each image.

The same goes for mms_last_checked as well but for that column I can see a clear and important usecase but we also might be able to reduce its size (e.g. by using an integer and holding only year and month? 202310 for example)

For other things, as long as it's stored in x1 and you don't write too fast on it (e.g. 100 rows a second continuously), it should be fine.

The schema looks good to me, my only concern is about adding mms_uploaded_timestamp, it adds a binary(14) column which considering its index and the total size of the table can amount to a large size. Do you really need an upload timestamp? Yes is a valid answer.

Short answer: Yes

Long answer:
The ability to uniquely identify the content of a file by it's sha1 is possible, but not if the file is duplicated or if two files that are different images have the same sha1.

What the timestamp column does here is ensures that if the same image is uploaded again after scanning, the newly uploaded image will get scanned separately. This avoids any problems caused by collisions and duplications by using the using the uniqueness of the timestamp column to limit this to only imported images uploaded at the same time as another image with the same sha1.

I understand but the chance of hash collision is basically zero as hashes are designed to minimize collisions (do you have an example of two different image having the same hash in our production?). If you're talking about images that are different pages but the exact same file, we still want to treat them the same no matter what.

I understand that this is a matter of trade-off, and I'm happy to remove the upload timestamp. An scenario, which I probably should have stated here before, is below which is why I thought that the timestamp was needed:

  1. File:A is uploaded
  2. File:A is scanned and determined to be a match
  3. This match status is saved to the DB and an email is sent detailing that a revision of File:A is a match
  4. File:B is uploaded which is a duplicate of File:A
  5. The match status is detected as a match for File:B
  6. The email which is sent contains both a revision of File:A and a revision of File:B, even though File:A was previously reported. This is because you wouldn't have the DB storing the upload timestamps to filter for just the new file.

This isn't likely to be a common occurrence though, and I had not realised that using an uploaded timestamp column would be a big issue. Therefore I will remove the column from the patch.

Until now I had not seen a way around this of just sending a new email if the sha1 is marked as a match in the DB and skipping changing the DB at all. I was caught up with the idea of having to save everything to the DB for reliability. The only issue with this is if the emailing fails, it would be more complicated to determine what rows need to be re-sent via email.

I want you to see this as a matter of trade-off and cost. Adding 150M (and counting) binary(14) adds a lot of cost to our infra, in terms of maintenance, storage, memory, replication, network, update cost, locking, and so on (specially since it's also used in an index too).

So if you're concern is on hash collision, It's not worth the cost it adds but if you have some other usecase, we could look at it and modify it to be more compact rather than a large string for each image.

The same goes for mms_last_checked as well but for that column I can see a clear and important usecase but we also might be able to reduce its size (e.g. by using an integer and holding only year and month? 202310 for example)

For the last_checked column, I don't think that reducing the specificity is a issue. I think a date would probably be enough. As the abstract schema structure doesn't seem to allow use of mediumint, adding the day won't affect the size of the column as integer already had to be used. Open to going back to year + month if mediumint can be used for mariadb/mysql.

For other things, as long as it's stored in x1 and you don't write too fast on it (e.g. 100 rows a second continuously), it should be fine.

Thanks for the feedback. Hopefully my thoughts have made sense.

If you want a TLDR of the above (I'm aware I've been lengthy in my above comment):

  • I've removed the uploaded timestamp column
  • I've updated the last checked column to be an integer storing year, month and day as the integer type is the option that is needed to store at least the year and month but can also store the day.
  • If the mediumint type can be used for mariadb/mysql, then happy to change this back to year and month to reduce size by one byte - Doesn't seem to be supported by the abstract schema.

I therefore probably have addressed the feedback in the last comment and will wait further thoughts.

If you want a TLDR of the above (I'm aware I've been lengthy in my above comment):

  • I've removed the uploaded timestamp column

Fine with me, but want to check on the following:

I understand but the chance of hash collision is basically zero as hashes are designed to minimize collisions (do you have an example of two different image having the same hash in our production?).

Per https://en.wikipedia.org/wiki/SHA-1#Attacks, I think we want to avoid the scenario of:

  • Attacker uploads an image with SHA-1 that isn't a match in PhotoDNA (PhotoDNA is using its own hashing to compare with files in its DB, it doesn't receive our SHA-1 or care about it)
  • We update the mediamoderation_scan table to say the image was scanned and there's no match in PhotoDNA
  • Attacker uploads a second image with the same SHA-1 as previous image
  • We don't re-scan the image, because mediamoderation_scan table says the SHA-1 was already checked, and not a match with PhotoDNA

That's why using the timestamp of the image from image, oldimage, or filearchive in mediamoderation_scan is useful: we would be able to tell that the second image in the example above needs to be scanned, because even though the SHA-1 exists in mediamoderation_scan, it belongs to a different timestamp.

As of 2023, the SHA-1 collision attack seems to be improbable in practice, but would be nice to future-proof for a couple of years.

This isn't likely to be a common occurrence though, and I had not realised that using an uploaded timestamp column would be a big issue. Therefore I will remove the column from the patch.

Thanks!

Until now I had not seen a way around this of just sending a new email if the sha1 is marked as a match in the DB and skipping changing the DB at all. I was caught up with the idea of having to save everything to the DB for reliability. The only issue with this is if the emailing fails, it would be more complicated to determine what rows need to be re-sent via email.

That's a valid issue, I wonder if we can find ways to get around this that's easier to implement, e.g. a private log being injected in logging table, a logstash entry, etc.?

For the last_checked column, I don't think that reducing the specificity is a issue. I think a date would probably be enough. As the abstract schema structure doesn't seem to allow use of mediumint, adding the day won't affect the size of the column as integer already had to be used. Open to going back to year + month if mediumint can be used for mariadb/mysql.

mediumint isn't really cross-platform compatible and on top, an int is much much smaller than string but the difference between mediumint and int isn't that big so the gain will be negligible so don't worry.

Per https://en.wikipedia.org/wiki/SHA-1#Attacks, I think we want to avoid the scenario of:

That attack is more viable to ever happen in security-related pieces, e.g. passowrds and could leak access and that's why it's not recommended for authentication related parts but that's not really possible to do for cases like ours, it requires at least 2^68 operations which could be viable for obtaining a password by someone who has access to a super computer or vast amount of computational powers (APTs), that's not really the threat model here.

As of 2023, the SHA-1 collision attack seems to be improbable in practice, but would be nice to future-proof for a couple of years.

That feels a bit of YAGNI to me, once we see a case like this, we can revisit and add a column but again, I highly highly doubt that could become a viable vector in less than two decades.

As of 2023, the SHA-1 collision attack seems to be improbable in practice, but would be nice to future-proof for a couple of years.

That feels a bit of YAGNI to me, once we see a case like this, we can revisit and add a column but again, I highly highly doubt that could become a viable vector in less than two decades.

Yeah, fair enough! Let's not worry about that for now, then.

Until now I had not seen a way around this of just sending a new email if the sha1 is marked as a match in the DB and skipping changing the DB at all. I was caught up with the idea of having to save everything to the DB for reliability. The only issue with this is if the emailing fails, it would be more complicated to determine what rows need to be re-sent via email.

That's a valid issue, I wonder if we can find ways to get around this that's easier to implement, e.g. a private log being injected in logging table, a logstash entry, etc.?

Probably a logstash event with type error or above would make sense here. Thanks.

For the last_checked column, I don't think that reducing the specificity is a issue. I think a date would probably be enough. As the abstract schema structure doesn't seem to allow use of mediumint, adding the day won't affect the size of the column as integer already had to be used. Open to going back to year + month if mediumint can be used for mariadb/mysql.

mediumint isn't really cross-platform compatible and on top, an int is much much smaller than string but the difference between mediumint and int isn't that big so the gain will be negligible so don't worry.

Thanks. I will leave it with the integer type then.

Change 973157 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[operations/mediawiki-config@master] MediaModeration: Define virtual domains mapping config

https://gerrit.wikimedia.org/r/973157

Change 973157 merged by jenkins-bot:

[operations/mediawiki-config@master] MediaModeration: Define virtual domains mapping config

https://gerrit.wikimedia.org/r/973157

Mentioned in SAL (#wikimedia-operations) [2023-11-09T14:43:29Z] <kharlan@deploy2002> Started scap: Backport for [[gerrit:973157|MediaModeration: Define virtual domains mapping config (T350321)]]

Mentioned in SAL (#wikimedia-operations) [2023-11-09T14:44:51Z] <kharlan@deploy2002> kharlan and dreamyjazz: Backport for [[gerrit:973157|MediaModeration: Define virtual domains mapping config (T350321)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-11-09T14:50:35Z] <kharlan@deploy2002> Finished scap: Backport for [[gerrit:973157|MediaModeration: Define virtual domains mapping config (T350321)]] (duration: 07m 07s)

Change 970886 merged by jenkins-bot:

[mediawiki/extensions/MediaModeration@master] Add mediamoderation_scan table

https://gerrit.wikimedia.org/r/970886

Remember DBAs do not create tables in production. That can be done during the train or by someone with deployment access.

Yup and thanks for the reminder. Was intending to ask @kostajh to do this once the the createExtensionTables.php change is merged (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/972384/).

To confirm, this table goes to x1, right?

To confirm, this table goes to x1, right?

Yes. That script should add it to x1 with the configuration in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/973157.

Change 972384 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMaintenance@master] Add MediaModeration to createExtensionTables.php

https://gerrit.wikimedia.org/r/972384

Change 972387 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMaintenance@master] Add MediaModeration to addWiki.php

https://gerrit.wikimedia.org/r/972387

Change 973345 had a related patch set uploaded (by Urbanecm; author: Dreamy Jazz):

[mediawiki/extensions/WikimediaMaintenance@wmf/1.42.0-wmf.4] Add MediaModeration to createExtensionTables.php

https://gerrit.wikimedia.org/r/973345

Change 973786 had a related patch set uploaded (by Urbanecm; author: Dreamy Jazz):

[mediawiki/extensions/WikimediaMaintenance@wmf/1.42.0-wmf.4] Add MediaModeration to addWiki.php

https://gerrit.wikimedia.org/r/973786

Change 973786 merged by Urbanecm:

[mediawiki/extensions/WikimediaMaintenance@wmf/1.42.0-wmf.4] Add MediaModeration to addWiki.php

https://gerrit.wikimedia.org/r/973786

Change 973345 merged by Urbanecm:

[mediawiki/extensions/WikimediaMaintenance@wmf/1.42.0-wmf.4] Add MediaModeration to createExtensionTables.php

https://gerrit.wikimedia.org/r/973345

Mentioned in SAL (#wikimedia-operations) [2023-11-13T14:15:30Z] <urbanecm@deploy2002> Started scap: Backport for [[gerrit:973786|Add MediaModeration to addWiki.php (T350321)]], [[gerrit:973345|Add MediaModeration to createExtensionTables.php (T350321)]]

Mentioned in SAL (#wikimedia-operations) [2023-11-13T14:16:48Z] <urbanecm@deploy2002> urbanecm: Backport for [[gerrit:973786|Add MediaModeration to addWiki.php (T350321)]], [[gerrit:973345|Add MediaModeration to createExtensionTables.php (T350321)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-11-13T14:22:29Z] <urbanecm@deploy2002> Finished scap: Backport for [[gerrit:973786|Add MediaModeration to addWiki.php (T350321)]], [[gerrit:973345|Add MediaModeration to createExtensionTables.php (T350321)]] (duration: 06m 58s)

Change 973788 had a related patch set uploaded (by Urbanecm; author: Dreamy Jazz):

[mediawiki/extensions/MediaModeration@wmf/1.42.0-wmf.4] Add mediamoderation_scan table

https://gerrit.wikimedia.org/r/973788

Change 973788 merged by jenkins-bot:

[mediawiki/extensions/MediaModeration@wmf/1.42.0-wmf.4] Add mediamoderation_scan table

https://gerrit.wikimedia.org/r/973788

Mentioned in SAL (#wikimedia-operations) [2023-11-13T14:31:11Z] <urbanecm@deploy2002> Started scap: Backport for [[gerrit:973784|Deploy Reader Demographics 2 survey (T345951)]], [[gerrit:973788|Add mediamoderation_scan table (T350321)]]

Change 973809 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[operations/puppet@production] Add mediamoderation_scan to $private_tables

https://gerrit.wikimedia.org/r/973809

Mentioned in SAL (#wikimedia-operations) [2023-11-13T14:32:28Z] <urbanecm@deploy2002> urbanecm and dani: Backport for [[gerrit:973784|Deploy Reader Demographics 2 survey (T345951)]], [[gerrit:973788|Add mediamoderation_scan table (T350321)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-11-13T14:40:25Z] <urbanecm@deploy2002> Finished scap: Backport for [[gerrit:973784|Deploy Reader Demographics 2 survey (T345951)]], [[gerrit:973788|Add mediamoderation_scan table (T350321)]] (duration: 09m 13s)

Mentioned in SAL (#wikimedia-operations) [2023-11-13T14:43:54Z] <urbanecm> mwmaint2002: foreachwiki extensions/WikimediaMaintenance/createExtensionTables.php MediaModeration (T350321)

Everything for this task should be done (some follow-up tasks are needed which I will create tickets for).

Unsure if QA is needed for this ticket, but would suggest:

  1. Install MediaModeration
  2. Run update.php
  3. Check that the DB for the wiki has the mediamoderation_scan table

Feel free to move out of QA if no QA is needed.

Change 973809 merged by Ladsgroup:

[operations/puppet@production] Add mediamoderation_scan to $private_tables

https://gerrit.wikimedia.org/r/973809

This puppet actually requires mysql threads being rebooted in sanitarium hosts but since this is not really needed and is just adding extra layers, I let it flow naturally.

@Dreamy_Jazz mediamoderation_scan table is showing as seen in the screenshots below. I will move this to Done and look out for follow-up tasks. Thanks for all your work!

Status: ✅PASS
Environment: Visual Studio
OS: macOS Sonoma 14.0
Browser: Chrome 119
Skins. N/A
Device: MBA M2
Emulated Device:: n/a
Test Links:
N/A

✅AC1: https://phabricator.wikimedia.org/T350321#9326751

2023-11-16_10-17-21.png (213×1 px, 40 KB)

2023-11-16_10-17-40.png (1×678 px, 158 KB)