First of all, I want to apologize in advance for the length of the message.
I intend to run my tool ("wikidata-redirects-conflicts-reports"; "wrcr") periodically (every month or every two months, perhaps?) over every wiki that can be linked to through Wikidata wikilinks.
It looks for Wikidata items that link to main-namespace pages which redirect to other pages that have their own, different Wikidata item. The interface prevents (as it should) some modifications of either Wikidata item in such cases, and they actually represent an inconsistency in Wikidata's data.
The DB query in its current form is:
SELECT
    source.ips_item_id AS sourceitemid,
    source.ips_site_page AS sourcelink,
    COALESCE( CONCAT( redirect.rd_title, "#", redirect.rd_fragment ), redirect.rd_title ) AS redirecttarget,
    target.ips_site_page AS targetlink,
    target.ips_item_id AS targetitemid
FROM wikidatawiki_p.wb_items_per_site source -- item pointing to the redirecting wikipage
INNER JOIN %1$s_p.page sourcepage -- {$wiki}
    ON REPLACE( source.ips_site_page, " ", "_" ) = sourcepage.page_title -- identifying the redirecting wikipage locally
    -- I know REPLACE() breaks indexes, but sadly I don't know of a workaround for the space/underscore mismatch
INNER JOIN %1$s_p.redirect redirect -- {$wiki}
    ON sourcepage.page_id = redirect.rd_from -- identifying the redirect target locally
INNER JOIN wikidatawiki_p.wb_items_per_site target
    ON redirect.rd_title = REPLACE( target.ips_site_page, " ", "_" ) -- item the redirect target corresponds to
    -- as stated earlier, I know REPLACE() breaks indexes, but sadly I don't know of a workaround for the space/underscore mismatch
WHERE source.ips_site_id = "%1$s" -- {$wiki}
    AND target.ips_site_id = "%1$s" -- {$wiki}
    AND sourcepage.page_namespace = 0
where "%1$s" is replaced by the name of the wiki, which means "%1$s_p" is replaced by the name of the database for that wiki.
The results are written directly into a TSV file.
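For the record, here is a stripped-down sketch of that current process; the file names, paths, and replica host naming are just placeholders of mine, not the tool's actual code. PHP only substitutes the wiki name for %1$s and hands the query to the mysql client (whose --batch output is tab-separated), so PHP itself exits before the query even starts running:

<?php
$wiki     = 'euwiki';                                   // example target wiki
$template = file_get_contents(__DIR__ . '/query.sql');  // the query shown above
$query    = sprintf($template, $wiki);                  // every %1$s -> wiki name

file_put_contents("/tmp/wrcr-$wiki.sql", $query);
exec(sprintf(
    'mysql --defaults-file=$HOME/replica.my.cnf -h %s --batch < %s > %s',
    escapeshellarg("$wiki.labsdb"),                     // assumed replica host name
    escapeshellarg("/tmp/wrcr-$wiki.sql"),
    escapeshellarg("reports/$wiki.tsv")
));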
While this proves functional and surprisingly fast for most smaller wikis (10 minutes or less for wikis with up to 250k content pages), the only wiki with more than 400k content pages I've tried it on (arwiki) gives a somewhat unexpected result: it takes a little over an hour, roughly 30 times the 2min30 needed for a 230k-page wiki (euwiki).
As jynus pointed out on IRC, it might prove better to "cut" the one big query into several smaller ones.
What I have in mind for that is running three prepared statements, via PHP's PDO::prepare(), against three tables in two databases for each wiki (a rough PHP sketch follows the list below):
- in wikidatawiki_p:
SELECT ips_item_id, ips_site_page FROM wb_items_per_site WHERE ips_site_id = :siteid
- for each result, store the first column (the Wikidata item id) and bind the second (the wikilink from Wikidata to the wiki) by value to :wikilink in the following query, run against (site id)_p (the database passed to PDO::__construct() for the second PDO object):
SELECT redirect.rd_title AS target, redirect.rd_fragment AS fragment FROM page INNER JOIN redirect ON page.page_id = redirect.rd_from WHERE REPLACE( :wikilink, " ", "_" ) = page.page_title
- if a result comes back, bind it to :target and run the following on wikidatawiki_p, with the site id bound to :siteid (back on the first PDO object):
SELECT ips_item_id FROM wb_items_per_site WHERE ips_site_id = :siteid AND ips_site_page = :target
- and finally, append the results to the current report using PHP's fputcsv().
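A rough sketch of what that loop could look like in PHP; host names, credentials, and output paths are placeholders rather than the tool's actual code, and the columns written to the report are only illustrative:

<?php
// Placeholder credentials; normally these would come from replica.my.cnf.
$user   = 's12345';
$pass   = 'secret';
$siteId = 'euwiki'; // example target wiki

// One connection to wikidatawiki_p, one to the target wiki's database.
$wd   = new PDO("mysql:host=wikidatawiki.labsdb;dbname=wikidatawiki_p", $user, $pass);
$wiki = new PDO("mysql:host=$siteId.labsdb;dbname={$siteId}_p", $user, $pass);

// 1) every wikilink from Wikidata to this wiki
$links = $wd->prepare(
    'SELECT ips_item_id, ips_site_page FROM wb_items_per_site WHERE ips_site_id = :siteid'
);
// 2) is the linked page a main-namespace redirect, and where does it point?
//    (the page_namespace = 0 filter mirrors the one-query version)
$redir = $wiki->prepare(
    'SELECT redirect.rd_title AS target, redirect.rd_fragment AS fragment
       FROM page INNER JOIN redirect ON page.page_id = redirect.rd_from
      WHERE page.page_title = REPLACE( :wikilink, " ", "_" ) AND page.page_namespace = 0'
);
// 3) does the redirect target have its own item?
$item = $wd->prepare(
    'SELECT ips_item_id FROM wb_items_per_site
      WHERE ips_site_id = :siteid AND ips_site_page = :target'
);

$report = fopen("reports/$siteId.tsv", 'w');
$links->execute(array(':siteid' => $siteId));
while ($row = $links->fetch(PDO::FETCH_NUM)) {
    list($sourceItemId, $sourceLink) = $row;
    $redir->execute(array(':wikilink' => $sourceLink));
    $redirect = $redir->fetch(PDO::FETCH_ASSOC);
    if ($redirect === false) {
        continue; // not a redirect, nothing to report
    }
    // rd_title uses underscores while sitelinks use spaces, hence the conversion
    $item->execute(array(
        ':siteid' => $siteId,
        ':target' => str_replace('_', ' ', $redirect['target']),
    ));
    $targetItemId = $item->fetchColumn();
    if ($targetItemId !== false) {
        fputcsv($report, array($sourceItemId, $sourceLink, $redirect['target'],
            $redirect['fragment'], $targetItemId), "\t");
    }
}
fclose($report);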
However, running this might take fairly long, and it would keep connections to the two databases open for longer than the "massively" joined query above does, not to mention keeping PHP running for quite a while (in the current process, PHP is only responsible for building the query and passing it to mysql, which means it stops before the query even runs). And I'm afraid I can't think of a way to know in advance exactly how long it would take.
I obviously don't intend to run the process concurrently for every wiki, but sequentially, hoping of course that the sum of all runs doesn't take more than a reasonable period (as stated earlier, one or two months?).
Since it's not so demanding on small wikis (at worst 10min07 for 112k pages on bewiki, usually much less), the first approach seems well suited to them.
Assuming I'm eventually allowed to run either of them on bigger wikis, which do you think would be better?