
Provide a mechanism to detect name-clashed media between Commons and a local project, without needing to join tables across wiki DBs
Open, Medium, Public

Description

What's the problem:

Currently it is possible to upload a file with the same name to Commons and to a local project. This results in the file on the local project appearing instead of the one on Commons (this is as-designed behaviour).

On English Wikipedia there are bots that check for this and mark affected files with {{shadows commons}} or an appropriate equivalent template.

One of the bots that currently does this relies on an SQL query: https://github.com/greencardamom/Shadows/blob/master/shadows.sql

This query may fail once the changes to the Wiki Replicas are implemented, as it won't be possible to join tables cross-wiki as is done here (see https://lists.wikimedia.org/pipermail/cloud/2020-November/001290.html and https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign).

What is desired:

A mechanism to do the same checking for 'shadowed' content, using two queries that don't need the relevant tables to be joined cross-wiki.
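In sketch form, the two-query approach could look something like the following (illustrative only, not the bot's actual code; it assumes the python3 "toolforge" helper library for wiki replica connections and batches the Commons lookups):

```
# Illustrative sketch only (not the bot's actual code): one query per wiki,
# combined in the application. Assumes the python3 "toolforge" helper
# library for wiki replica connections.
import toolforge


def local_enwiki_files():
    """All locally uploaded file names on enwiki (returned as bytes by the replicas)."""
    conn = toolforge.connect("enwiki")
    with conn.cursor() as cur:
        cur.execute("SELECT img_name FROM image")
        return [row[0] for row in cur.fetchall()]


def shadowed_on_commons(names, batch_size=1000):
    """Which of those names also exist in Commons' image table."""
    conn = toolforge.connect("commonswiki")
    shadowed = set()
    with conn.cursor() as cur:
        for i in range(0, len(names), batch_size):
            batch = names[i:i + batch_size]
            placeholders = ",".join(["%s"] * len(batch))
            cur.execute(
                f"SELECT img_name FROM image WHERE img_name IN ({placeholders})",
                batch,
            )
            shadowed.update(row[0] for row in cur.fetchall())
    return shadowed


if __name__ == "__main__":
    for name in sorted(shadowed_on_commons(local_enwiki_files())):
        print(name.decode("utf-8"))
```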

Event Timeline

Bstorm added subscribers: Bstorm, Jhernandez.

Adding this to the WMCS workboard because this may be a good opportunity to generate a clear example of how to do this for a common use case of cross-database joins.

Technical question:

Would it be feasible to have a locally cached table on the relevant Tools servers that represents what would otherwise be a cross-wiki table join? It would be updated periodically (or when a bot specifically asks for a refresh), with the relevant data created programmatically via a suitable API based on queries to each database separately, and the bots would then query for entries in THAT cached table as opposed to doing the current cross-wiki join.

TL;DR: Can we make a table that does the join using API calls rather than in SQL directly?
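Very roughly, the cached-table idea might look like this (sqlite3 is used purely as a stand-in for a table on ToolsDB, the refresh source is left open, and the table and function names are made up):

```
# Very rough sketch of the cached-join idea; sqlite3 is a stand-in for a
# table on ToolsDB, and the names here are made up. Some refresh job
# (API- or replica-fed) repopulates the table; bots query it instead of
# joining across wiki databases.
import sqlite3

conn = sqlite3.connect("shadow_cache.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS shadowed_files (
        title        TEXT PRIMARY KEY,  -- File: title without the namespace prefix
        last_checked TEXT               -- when the refresh job last verified it
    )
""")


def refresh(titles):
    """Replace the cache contents; 'titles' comes from the refresh job."""
    with conn:
        conn.execute("DELETE FROM shadowed_files")
        conn.executemany(
            "INSERT INTO shadowed_files (title, last_checked) "
            "VALUES (?, datetime('now'))",
            ((t,) for t in titles),
        )


def is_shadowed(title):
    """What a bot would call instead of doing the cross-wiki join itself."""
    row = conn.execute(
        "SELECT 1 FROM shadowed_files WHERE title = ?", (title,)
    ).fetchone()
    return row is not None
```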

From that notebook, it seems the way to do this is not to query both tables and combine them, but to make a query against one and then nest short queries in a loop over the results, or something like that, maybe? This won't work well if we try to use Python to do the join directly, since we most likely aren't going to have the RAM for the combined datasets even if we get it working. I'll see if I can find a way to demonstrate something like that to see how feasible it is.

That's basically what that notebook is doing in the paged subquery version. I'm just trying to reason through whether there's anything I could add to that.

I wonder if we could leverage the recent_changes views to limit the initial dataset. The query would not need to review all history and records that haven't been touched, right?

Here is a pywikibot version of the search: https://public.paws.wmcloud.org/User:BDavis_(WMF)/ShadowedMedia.ipynb

It's not fast by any means, but it is generator based, so it should not have significant runtime memory issues.
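For a rough idea of the shape (this is not the notebook's actual code, just a minimal generator-based pywikibot sketch):

```
# Not the notebook's actual code: just a rough generator-based pywikibot
# sketch that streams enwiki's local files and tests each title against
# Commons, so nothing large is ever held in memory.
import pywikibot

enwiki = pywikibot.Site("en", "wikipedia")
commons = pywikibot.Site("commons", "commons")


def shadowed_files():
    """Yield enwiki file titles that also exist as files on Commons."""
    for image in enwiki.allimages():  # locally uploaded files only
        twin = pywikibot.FilePage(commons, image.title())
        if twin.exists():
            yield image.title(with_ns=False)


for title in shadowed_files():
    print(title)
```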

Shadows bot currently runs every 24 hours, so the AntiCompositeNumber SQL method, which takes under 6 hours, is workable in that sense, though slower than the 90-second join method. Six hours a day of a SQL query paging through seems like a lot, and it is more error prone given the time exposure and networking. I can confirm that an API method would take around 70 hours; the pywikibot method would probably be similar if it is API based. Estimating the data size of 65 million page titles at 1-3 gigabytes, that's a lot to request, retrieve and process daily, even in the same datacenter.

@Green_Cardamom Is it required to examine the entire set of records, or just recent ones? That's why I was asking if there was some way to leverage the recentchanges views. I don't know if the API can do the same things. It seems reasonable that only things changed in the past month or so need to be checked, right? I don't know how to filter for that yet, but if I can find time, I'll try it, if that isn't a bad idea.

@bd808 Do you know the runtime on that version of it? It seemed to generate a much longer list than one of the SQL versions I checked.

OK, thinking it through... for example, monitor recentchanges on the Commons side (the larger) and compare against the full corpus on the enwiki side. This is great because it reduces the size of Commons to a few entries. Thus if the file pre-exists on enwiki and is then added to Commons, it works. But if the file pre-exists on Commons and is then added to enwiki, recentchanges on Commons would not see it. So there would have to be recentchanges checks for both enwiki and Commons, in both directions. But at the same time, you would also need the full corpus in both directions, leading back to the original problem of a very large list for Commons. Unless the idea is to build that list from scratch on day 1, save it to disk, and each day add to it from recentchanges. Or maybe I am missing something.

> @bd808 Do you know the runtime on that version of it? It seemed to generate a much longer list than one of the SQL versions I checked.

I started it and went to bed. It was done when I got up. :) The main speed limit in using the API like I did is the batch size limit of 50 titles per action=query&titles=... request. An account with a bot right on commons can use batches of 500, which would help out a lot (a 10x reduction in secondary queries).
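As a minimal sketch of that batching (requests-based, illustrative only; titles must include the File: prefix, and the API may normalize them):

```
# Minimal sketch of the batching (requests-based, illustrative only).
# Input titles must include the "File:" prefix; note the API normalizes
# titles (e.g. underscores become spaces) in its response.
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def existing_on_commons(titles, batch_size=50):
    """Return the subset of the given titles that exist on Commons."""
    session = requests.Session()
    found = set()
    titles = list(titles)
    for i in range(0, len(titles), batch_size):
        batch = titles[i:i + batch_size]
        resp = session.get(COMMONS_API, params={
            "action": "query",
            "titles": "|".join(batch),
            "format": "json",
            "formatversion": "2",
        }).json()
        for page in resp["query"]["pages"]:
            # formatversion=2 marks non-existent pages with "missing": true
            if not page.get("missing"):
                found.add(page["title"])
    return found
```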

The major benefit of using the Action API over manual SQL queries for this kind of thing is that the continuation method in the Action API is reentrant. Holding open an SQL select for hours (or even minutes) while you page through it for secondary lookups (the in-memory join, as it were) is prone to a lot of interruption vectors.
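A sketch of what that reentrancy buys you: the continue token can be checkpointed to disk so an interrupted run resumes where it left off (illustrative only; the state file and its handling are made up):

```
# Sketch of what that reentrancy allows (illustrative; the state file and
# its handling are made up): checkpoint the continue token after each
# batch, so a crashed or interrupted run resumes instead of starting over.
import json
import os

import requests

API = "https://en.wikipedia.org/w/api.php"
STATE_FILE = "allimages.continue.json"


def iter_all_images():
    params = {"action": "query", "list": "allimages",
              "ailimit": "max", "format": "json"}
    # Resume from a saved continuation token if a previous run was interrupted.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            params.update(json.load(f))
    while True:
        data = requests.get(API, params=params).json()
        yield from (img["name"] for img in data["query"]["allimages"])
        if "continue" not in data:
            break
        params.update(data["continue"])
        with open(STATE_FILE, "w") as f:
            json.dump(data["continue"], f)  # checkpoint after each batch
    if os.path.exists(STATE_FILE):
        os.remove(STATE_FILE)  # finished cleanly, no resume needed
```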

@Green_Cardamom I was thinking of tracking recent_changes on the enwiki side and searching for specific items on the commons side as they are found from recent_changes in enwiki. That doesn't require loading the entire corpus of either into memory at any point, and it would reduce the number of queries to commons (because going through the entire list of enwiki candidates and checking commons for all of them is slow). The inner joins are why I think that might be valid, and we are only tagging things in enwiki, right? If we are also looking for things to tag in commons, then the approach would not be valid.

@Green_Cardamom Maybe, like @Bstorm suggested, you could query for new/updated images since you last ran the bot, and do the subqueries to commons with that smaller subset?

I'm not sure what the best way is. In the page and image tables I can see page_touched and img_timestamp; img_timestamp seems to be the upload date, so that could work.

For example, you could run this one on enwiki: https://quarry.wmflabs.org/query/49931. Last week's uploads are ~4,600 (if I didn't mess something up), while the whole query returns ~900,000 results.

Then you would batch a few queries like this: https://quarry.wmflabs.org/query/49933 to get the data from commons, which for the first 1,900 results only gives back one file.

Then, in the application code, you would finally compare the enwiki results with the commons results and check that the sha1 is different, which is the only part of the original query that got left out of the two separate queries.
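In sketch form, that application-side comparison could be as simple as this, assuming each of the two queries returns (img_name, img_sha1) rows:

```
# Sketch of the application-side step, assuming both per-wiki queries
# return rows of (img_name, img_sha1). A title is reported only if it
# exists on both wikis with a *different* sha1.

def find_shadows(enwiki_rows, commons_rows):
    """enwiki_rows / commons_rows: iterables of (title, sha1) pairs."""
    commons_sha1 = dict(commons_rows)  # title -> sha1 on Commons
    for title, sha1 in enwiki_rows:
        if title in commons_sha1 and commons_sha1[title] != sha1:
            yield title


# Toy example:
enwiki_rows = [("Foo.jpg", "aaa"), ("Bar.jpg", "bbb"), ("Baz.jpg", "ccc")]
commons_rows = [("Foo.jpg", "zzz"), ("Baz.jpg", "ccc")]
print(list(find_shadows(enwiki_rows, commons_rows)))  # ['Foo.jpg']
```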

I'll try to put this together in PAWS to see it all working together. I'm not great at Python and I'm new to PAWS, so it may take me a bit, though.

@Green_Cardamom I was also thinking that the enwiki query is "dominant" here because we are tagging things there and don't need to act unless there is a file there. That would suggest you could search enwiki's recent changes for such files and then only check commons for those particular files (which is slow when searching the entire set, but it wouldn't be for a recent subset). I may be misunderstanding the ultimate goal, though.

@Bstorm what happens when a new File: is uploaded to Commons that overlaps with a pre-existing File: on enwiki for example one uploaded years ago?

@Green_Cardamom Now I get what you mean. Thank you for explaining it.

I think I may have missed that detail too.

If any new upload to commons would need to check enwiki, or any new upload to enwiki would need to check commons, then in my exploration I only did the enwiki-uploads-then-check-commons direction. I think it would be possible to then do the reverse, by checking the new uploads on commons since the last bot run and doing the manual join to enwiki.

So to reduce the dataset for time, it would have to go both ways: two queries, one on changes to commons and one on changes to enwiki, each going back and checking for existing images on the other. That could end up more efficient, but it would also introduce complications, etc. Worth experimenting with, maybe.

Originally posted this comment in this (incorrect) ticket:
https://phabricator.wikimedia.org/T215858#6631859

RE: @Bstorm's comment T267992#6631541 about listening to recent changes:

We could start from a known good snapshot (like the last mediawiki history dataset), transform it into a useful shape, and listen to all the relevant page events, which in this case are:

  • page create
  • page move (rename)
  • page delete
  • page restore

The "useful shape" could be something really simple like:

| page title | list of wikis |
| --- | --- |
| File:Cat | enwiki, commonswiki |
| File:Bat | commonswiki |
| File:Aristocrat | enwiki |

Each event would issue a quick update; we would probably use the page ID. The basic idea is that we have the events we need, and we could create a table that's fast/easy to query. We can also evolve this into a more comprehensive page schema that would support other use cases. It obviously requires some work and is not something we can do for free; just thinking out loud that some of the pieces are already there from a data point of view.
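As a very rough sketch of that table and its event-driven updates (sqlite3 stands in for the real store, and the event handling is illustrative rather than tied to the actual event schemas):

```
# Very rough sketch; sqlite3 stands in for the real store, and the event
# handling below is illustrative rather than tied to the actual event
# schemas (page-create, page-move, page-delete, page-undelete).
import sqlite3

conn = sqlite3.connect("file_titles.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS file_titles (
        title TEXT NOT NULL,  -- e.g. 'File:Cat'
        wiki  TEXT NOT NULL,  -- e.g. 'enwiki', 'commonswiki'
        PRIMARY KEY (title, wiki)
    )
""")


def apply_event(kind, wiki, title, new_title=None):
    """Apply one page event (File namespace only) to the table."""
    with conn:
        if kind in ("create", "restore"):
            conn.execute("INSERT OR IGNORE INTO file_titles VALUES (?, ?)",
                         (title, wiki))
        elif kind == "delete":
            conn.execute("DELETE FROM file_titles WHERE title = ? AND wiki = ?",
                         (title, wiki))
        elif kind == "move":
            conn.execute("UPDATE OR REPLACE file_titles SET title = ? "
                         "WHERE title = ? AND wiki = ?",
                         (new_title, title, wiki))


def shadowed_titles():
    """Titles that exist on commonswiki and on at least one other wiki."""
    return [row[0] for row in conn.execute("""
        SELECT title FROM file_titles
        GROUP BY title
        HAVING COUNT(DISTINCT wiki) > 1
           AND SUM(wiki = 'commonswiki') > 0
    """)]
```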

Summary of current explorations:

  • AntiComposite's notebook outlines a bunch of different approaches for doing the full check across the DBs
    • Runtime is estimated at ~6h for some of them
  • bd808's notebook shows an API-based approach using pywikibot that checks enwiki images against commons
    • From what I understand, it doesn't include the check of commons images against enwiki
  • I've made a notebook too, exploring my approach of limiting the time period the checks run over
    • This involves something like AntiComposite's approaches, but setting limits using img_timestamp, which could make the queries more manageable
    • enwiki to commons is easily done and quite fast if you limit the time period, but
    • checking new commons images against existing enwiki images becomes very cumbersome, since there are so many new uploads to commons (182K in the last week, for example). It could still be doable using some of AntiComposite's approaches with pagination; I haven't gotten that far

I think this makes it apparent (also in T268240, T268242) that there is a very clear use case for matching commons images/files with other wikis' images/files, and to do it performantly we would need some sort of explicit way to do this kind of cross join.

We can keep exploring solutions, but I'll also post this in T215858: Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema to keep track of the use cases.

Reposting bd808's comment: "Holding open an SQL select for hours (or even minutes) while you page through it for secondary lookups (the in-memory join as it were) is prone to a lot of interruption vectors." An SQL query that takes hours to complete, and is run daily, might run into aborts and lost data?

Hey @Milimetric I did see it, I had to collect my thoughts. We should probably continue discussion on T215858: Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema about this if it isn't something that is going to help this task in the short term.

I think that kind of table/querying would be very useful. AFAICT it wouldn't need to be up to date in real time, so even if the data is a bit stale, like updated weekly or monthly, it would still be useful, from what we've heard up until now.

Regarding the sub-tasks we have now with the table you propose:

  • T267992: for this it would help to get a subset of pages that could then be queried against commons and enwiki a lot more easily. Hopefully the subsets wouldn't be so big that they make querying the DBs separately very hard. We would need to check the size of the intersection of file images between enwiki and commons to get a better idea.
  • T268240: this one looks like a variation of the previous one, T267992, so it would help too
  • T268242: for this one (syncing fair-use images' license changes or deletions from enwiki to smaller wikis) I'm not sure it would help. This query may be better suited to being changed into something incremental; I still have to chime in there.
  • T268244: for this one (links to non-existent images) it wouldn't help, as it is based on imagelinks. Although this query may be more doable incrementally anyway, I still have to chime in there too.

I'll copy these comments to the OLAP task to continue there (T215858#6649771)

> it wouldn't need to be up to date in real time, so even if the data is a bit stale, like updated weekly or monthly

In the case of shadowbot it runs daily.

Given the recent changes idea, there could be 4 lists:

enwiki

A. Recent additions
B. All but recent additions

Commons

C. Recent additions
D. All but recent additions

A and C will be small, B will be a million, and D will be 65 million.

List comparisons would be made as follows:

  1. Compare C with A+B
  2. Compare D with A
  3. Compare A with C+D
  4. Compare B with C

The 2 and 3 comparisons are very large because D is 65 million. However, that is OK; we have tools that can handle it without loading everything into memory. The challenge is simply how to generate the list of 65 million daily and reliably. This is solved by separating it into two lists: C and D. Once the D list is generated, it never needs to be created again, only appended to each time C is generated. Deletions and renames might be handled by regenerating D every X days as a background process (preferably via API), or, with more complexity, by attempting to maintain a reliable D list via recentchanges information.
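In sketch form, with plain Python sets standing in for the on-disk lists (illustrative only):

```
# Illustrative only: plain Python sets standing in for the on-disk lists.

def shadow_candidates(A, B, C, D):
    """A/C: recent enwiki/Commons additions; B/D: everything older.

    Returns titles present on both wikis, found by only comparing the
    combinations that involve at least one "recent" list.
    """
    return (
        (C & (A | B))    # 1. new Commons uploads vs. all enwiki files
        | (D & A)        # 2. old Commons files  vs. new enwiki uploads
        | (A & (C | D))  # 3. new enwiki uploads vs. all Commons files
        | (B & C)        # 4. old enwiki files   vs. new Commons uploads
    )


# Toy example:
A = {"New_en.jpg", "Both_new.jpg"}
B = {"Old_en.jpg"}
C = {"Both_new.jpg", "Old_en.jpg"}
D = {"New_en.jpg"}
print(shadow_candidates(A, B, C, D))
# {'New_en.jpg', 'Both_new.jpg', 'Old_en.jpg'} (set order may vary)
```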

I took another try at reimplementing this query, and have had much better results. https://github.com/AntiCompositeNumber/AntiCompositeBot/blob/2d13e62c0b4205682326316c6270db214347fe86/src/shadows.py is the code I used, which implements the "batched subquery method" I had previously proposed. Results are below:

Paul_Miller.jpg
Wellington_bomber.jpg
The_Director.jpg
Jye_Bolton.png
Stick_figure.png
And_Babies.jpg
Danylyszyn,_Andrew.jpg
Banorte_logo.svg
Roscoe_Conkling.jpg
1914_senegalese_tiralleurs.jpg
Ronnie_radke.jpg
Syam_Sudhakar.jpg
Voronoi.png
Holba.jpg
Mike_Stern.jpg
BeritaSatu.png
KMC_2.jpg
Carrillo.jpg
Kozłówka.jpg
Is_she_not_passing_fair_song_by_Elgar_cover_1908.jpg
Galapagos2.jpg
MoneySuperMarket_logo.svg
Rose_petals.jpg
Keyboard_21.jpg
Vivekananda_Rock_Memorial,_Kanyakumari,_India_2.jpg

This took about 1 minute to complete, compared to the 6 hours I was seeing with my previous attempts. This is probably mostly the result of improvements to the wikireplicas, but I also optimized the query.

@Green_Cardamom, are you interested in picking GreenC bot Job 10 back up or should I add it to AntiCompositeBot?

This is awesome. Confirming: AntiCompositeNumber's shadows.py is deployed to produce the list, and GreenC bot Job 10 ("shadows.awk") is back running, having just tagged 25 pages.

https://en.wikipedia.org/w/index.php?title=File:KMC_2.jpg&diff=prev&oldid=1054272421