
Provide a mechanism for detecting duplicate files in commons and a local wiki
Open, Medium, Public

Description

Currently, it is possible for a user to upload a file to Commons that is identical to one on a local project, or vice versa. We can detect this through analytical queries on the Cloud Services wiki replica servers, by matching the img_sha1 values of the files in the image table of each of those two databases.
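
For illustration, this is roughly the kind of query involved: a minimal sketch, assuming the fawiki_p and commonswiki_p replica databases are both reachable from a single connection (as on the current multi-source replica hosts).

SELECT l.img_name AS local_name,
       c.img_name AS commons_name,
       l.img_sha1
FROM fawiki_p.image AS l
JOIN commonswiki_p.image AS c
  ON c.img_sha1 = l.img_sha1;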

This query may fail once the changes to the Wiki Replicas are implemented, as it won't be possible to join tables cross-wiki as is done here (see https://lists.wikimedia.org/pipermail/cloud/2020-November/001290.html and https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign).

This query is useful because in some cases the redundant file has a different name (so T267992 will not find it). The real goal here is to (a) remove redundant photos, and (b) in cases where one file is identified as a copyright violation, notify the other community about it too. The latter happens sometimes for fawiki, for instance: when I find a copyvio issue with a file that was also uploaded to Commons, I tag it for deletion on Commons as well.

Note that this doesn't need a real-time replica; one option might be to create a separate data store that contains a daily or weekly dump of the image table for all wikis.
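
Purely as a hypothetical sketch of that option (the all_images table, its columns, and the wiki values below are made up for illustration, not an existing schema), such a store and the duplicate check against it could be as simple as:

CREATE TABLE all_images (
  wiki     varbinary(64)  NOT NULL,  -- e.g. 'fawiki', 'commonswiki'
  img_name varbinary(255) NOT NULL,
  img_sha1 varbinary(32)  NOT NULL,
  PRIMARY KEY (wiki, img_name),
  KEY (img_sha1)
);

-- Files present on both a local wiki and Commons, regardless of name:
SELECT a.img_name AS local_name, b.img_name AS commons_name, a.img_sha1
FROM all_images AS a
JOIN all_images AS b ON b.img_sha1 = a.img_sha1
WHERE a.wiki = 'fawiki' AND b.wiki = 'commonswiki';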

Event Timeline

Huji renamed this task from Provide mechanism for detecig duplicate files in commons and a local wiki to Provide a mechanism for detecting duplicate files in commons and a local wiki.Nov 19 2020, 3:33 PM
Huji updated the task description.

Hey @Huji, thanks for opening this task about your use case.

We are exploring something similar in T267992: Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's.

If it is possible to check only the images uploaded since the last time the bot ran, by limiting the time period, then it is quite feasible to query one wiki for the newly uploaded images and their sha1s, and then query the other wiki to get info about those sha1s if they are there.
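
Roughly like this (a sketch; the cutoff timestamp and the sha1 list are placeholders that the bot would fill in):

-- Step 1, on the local wiki (e.g. fawiki_p): images uploaded since the last run.
SELECT img_name, img_sha1
FROM image
WHERE img_timestamp > '20201101000000';

-- Step 2, on commonswiki_p: do any of those sha1s already exist there?
SELECT img_name, img_sha1
FROM image
WHERE img_sha1 IN ('sha1_from_step_1', '...');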

Here is the comment with links: https://phabricator.wikimedia.org/T267992#6631548. I think in your case the queries would be smaller. I'll take a stab at this once I'm done with that one; feel free to have a go at it yourself if you can.

That is a fair point. The one caveat is that for projects (like fawiki) that have been doing this for a while, you can expect them to deal with all historical cases before the DB Redesign happens, and then only handle "recently uploaded files" going forward. But for projects that have never done this, if their very first attempt comes after the DB Redesign, they will have a hard time pulling it off (the script would need to check a very large number of files by running a large number of queries).

That is why I am suggesting an alternative approach: having a select few tables from Commons available in a separate analytical environment. This aligns with some of the recent suggestions by @jcrespo on the Cloud listserv that we should have a data lake and bring relevant data fields together based on use cases, as opposed to trying to use MySQL as an OLAP solution, which is not sustainable.

That is a fair point. The one caveat is that for projects (like fawiki) that have been doing this for a while, you can expect them to deal with all historical cases before the DB Redesign happens, and then only handle "recently uploaded files" going forward. But for projects that have never done this, if their very first attempt comes after the DB Redesign, they will have a hard time pulling it off (the script would need to check a very large number of files by running a large number of queries).

Definitely, they would need a long initial script run and could then rely on incremental checks. I posted a summary of the approach for the other shadows bot on enwiki in T267992#6637250; AntiComposite's notebook outlines some approaches. Based on their estimates, a full (initial) run between enwiki and commonswiki would take ~6 hours, and I think that would be our biggest use case. Any other wiki should be smaller than that, so it should be possible to use this approach.

That is why I am suggesting an alternative approach: having a select few tables from Commons available in a separate analytical environment. This aligns with some of the recent suggestions by @jcrespo on the Cloud listserv that we should have a data lake and bring relevant data fields together based on use cases, as opposed to trying to use MySQL as an OLAP solution, which is not sustainable.

Definitely. We have also created T215858: Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema, because there is a clear need for certain use cases to do these kinds of joins, but it can't come at the expense of destabilizing the Wiki Replicas for everyone else. So I've documented these use cases there, and I hope we can come to a proposal in the near future for solving things that are becoming very hard with the new architecture but that are valuable for tool developers and editor communities.

Andrew triaged this task as Medium priority.Dec 8 2020, 5:23 PM
Andrew moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

FYI, as a byproduct of T262668, I have created the technology to keep a database of images and small amounts of metadata for all images of all wikis, if someone is interested in that.

@jcrespo For a few of the use cases unsupported by the new wikireplicas architecture, the most important metadata for finding intersections between images across wikis are page_title, img_name and img_sha1. Something to query by date would be useful too, like page_touched maybe, as well as the wiki project hosting the file.

Will these be available in that DB? If so, we could think about the best way to expose this to tool developers so they can fix their tools when the new wikireplicas replace the old ones.

@Jhernandez Yes, some or all of those fields are planned to be gathered. Here is the current version of the main table of source files for backups:

files table
CREATE TABLE `files` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `wiki` int(10) unsigned NOT NULL,
  `upload_name` varbinary(255) DEFAULT NULL,
  `swift_container` int(10) unsigned DEFAULT NULL,
  `swift_name` varbinary(270) DEFAULT NULL,
  `file_type` tinyint(3) unsigned DEFAULT NULL,
  `status` tinyint(3) unsigned DEFAULT NULL,
  `sha1` varbinary(40) DEFAULT NULL,
  `md5` varbinary(32) DEFAULT NULL,
  `size` int(10) unsigned DEFAULT NULL,
  `upload_timestamp` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `archived_timestamp` timestamp NULL DEFAULT NULL,
  `deleted_timestamp` timestamp NULL DEFAULT NULL,
  `backup_status` tinyint(3) unsigned DEFAULT 1,
  PRIMARY KEY (`id`),
  KEY `sha1` (`sha1`),
  KEY `file_type` (`file_type`),
  KEY `status` (`status`),
  KEY `backup_status` (`backup_status`),
  KEY `wiki` (`wiki`),
  KEY `swift_container` (`swift_container`),
  KEY `upload_name` (`upload_name`,`status`),
  KEY `upload_timestamp` (`upload_timestamp`),
  CONSTRAINT `files_ibfk_1` FOREIGN KEY (`file_type`) REFERENCES `file_types` (`id`),
  CONSTRAINT `files_ibfk_2` FOREIGN KEY (`status`) REFERENCES `file_status` (`id`),
  CONSTRAINT `files_ibfk_3` FOREIGN KEY (`wiki`) REFERENCES `wikis` (`id`),
  CONSTRAINT `files_ibfk_4` FOREIGN KEY (`backup_status`) REFERENCES `backup_status` (`id`),
  CONSTRAINT `files_ibfk_5` FOREIGN KEY (`swift_container`) REFERENCES `swift_containers` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4549852 DEFAULT CHARSET=binary

The latest version is at: https://phabricator.wikimedia.org/diffusion/OSWB/browse/media-backups/wmfbackups/media/mediabackups.sql
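
For example, the check this task asks for could be expressed against that table roughly like this (a sketch; @local_wiki_id and @commons_wiki_id stand in for the numeric IDs from the wikis lookup table the foreign key points to):

SELECT l.upload_name AS local_name,
       c.upload_name AS commons_name,
       l.sha1
FROM files AS l
JOIN files AS c ON c.sha1 = l.sha1
WHERE l.wiki = @local_wiki_id
  AND c.wiki = @commons_wiki_id
  AND l.sha1 IS NOT NULL;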

However, there are a few caveats. The main focus of my current work is to produce media backups; I am not focusing on third-party access at the moment. This means public and private data is mixed, it lives in the production network, and it may have periods of unavailability during reloads or extended lag. None of these problems is impossible to overcome with the right additional work, but I cannot justify that work myself. More work would be required to make it public-ready, and I am happy to collaborate on integrating patches, but that will have to be the focus of the analytics, technical-engagement or media dumps people (or any other third party that wants to collaborate on making this publicly available).

@Jhernandez Another option would be to reuse the tooling but create a duplicate "public only" database. In any case, I am more than happy to integrate patches that fit your needs.

@Jhernandez Yes, some or all of those fields are planned to be gathered. Here is the current version of the main table of source files for backups:

Great! I think this would be useful to get a join of image names or hashes across projects within a time period or globally.

The latest version is at: https://phabricator.wikimedia.org/diffusion/OSWB/browse/media-backups/wmfbackups/media/mediabackups.sql

However, there are a few caveats. The main focus of my current work is to produce media backups; I am not focusing on third-party access at the moment. This means public and private data is mixed, it lives in the production network, and it may have periods of unavailability during reloads or extended lag.

Of course, that makes a lot of sense. From what we have heard, the few bots needing this kind of query run weekly or every two weeks, so the requirements can probably be different from those for the replicas. This needs more discussion across the board.

None of these problems is impossible to overcome with the right additional work, but I cannot justify that work myself. More work would be required to make it public-ready, and I am happy to collaborate on integrating patches, but that will have to be the focus of the analytics, technical-engagement or media dumps people (or any other third party that wants to collaborate on making this publicly available).

I'll bring this up for discussion with the Cloud Services team, and in the sync we have with Analytics about the replicas, to talk about how we could incorporate this work into a solution. Thanks for expanding and commenting!

I'm sure there are other uses for the functionality described here, but…

…isn't detecting exact duplicates of media between Commons and client wikis functionality that should be built into the platform, rather than handled by query-and-compare-after-the-fact tools? (Both for identical files and identical filenames, à la T267992.)

I was actually a little surprised this wasn't already in place, since I could have sworn a local file description page once told me it was a duplicate of a named file on Commons, and it is such an obvious function to have. It also seems to me that this is most naturally solved in MediaWiki itself, and exposed through file description page notices and maintenance categories on the local wiki. If there was any need for bot work, it could then pull its list of files ready-made from that category rather than constructing it from first principles itself.

I'm probably just being dumb and misunderstanding something here, but figured I'd throw it out there all the same. Feel free to ignore me. :)

I was actually a little surprised this wasn't already in place, since I could have sworn a local file description page once told me it was a duplicate of a named file on Commons, and it is such an obvious function to have. It also seems to me that this is most naturally solved in MediaWiki itself, and exposed through file description page notices and maintenance categories on the local wiki.

The file usage section on file pages lists duplicates, including from Commons. However, there is no way to find these since Special:ListDuplicatedFiles only lists local duplicates.
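
(For a single known file, the Action API does expose this via prop=duplicatefiles, which as far as I know also lists shared-repo matches, e.g. https://fa.wikipedia.org/w/api.php?action=query&prop=duplicatefiles&titles=File:Example.jpg, but there is no bulk listing across a whole wiki.)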

The file usage section on file pages lists duplicates, including from Commons. However, there is no way to find these since Special:ListDuplicatedFiles only lists local duplicates.

Right. But that means MediaWiki already knows about them, so it should be possible within reasonable limits of effort (fsvo "reasonable", caveat architectural limitations I'm not aware of) to expose that knowledge in a category somewhere. There are plenty of other such things that are stuffed into a category (or listed on a special page like Special:UnusedFiles). Wouldn't that be a more sensible focus for solving this specific problem than bugging the Cloud Services team to build their own custom data lake? Even if it was only updated on a schedule like UnusedFiles (but probably not capped at 5k for this), it should cover this use case.

(Again, there are plenty of other reasons why cross-wiki joins or a functional replacement would be nice, so I'm only talking about this very narrow and specific application of it, and mostly because I'd kind of like to have such a category or special page to track this issue on my project without needing to bot it.)

In fact, looking at the code in SpecialFileDuplicateSearch.php, it looks like querying for Commons media isn't particularly more complicated than querying for local media from inside core, and T175088 suggests Special:ListDuplicatedFiles should be on the monthly "expensive query pages" cron job in any case. In that context, is there any particular reason SpecialListDuplicatedFiles.php for a given project couldn't do a (very specialised version of a) cross-wiki join itself and stuff the results in a category?

@Bawolff @Reedy Thoughts?