
What would be the preferred way to run a series of read queries on the DB replicas, each on two *_p databases?
Closed, Resolved · Public

Description

First of all, I want to apologize in advance for the length of the message.

I intend to run my tool ("wikidata-redirects-conflicts-reports"; "wrcr") periodically (every month or every two months, perhaps?) over every wiki that can be linked to through Wikidata wikilinks.
It looks for Wikidata items that link to main-namespace pages which redirect to other pages that themselves have another Wikidata item. The interface prevents (as it should) some modifications of either Wikidata item, and such cases actually represent inconsistencies in Wikidata's data.

The DB query in its current form is:

SELECT
        source.ips_item_id AS sourceitemid,
        source.ips_site_page AS sourcelink,
        COALESCE( CONCAT( redirect.rd_title, "#", redirect.rd_fragment ), redirect.rd_title ) AS redirecttarget,
        target.ips_site_page AS targetlink,
        target.ips_item_id AS targetitemid
    FROM
        wikidatawiki_p.wb_items_per_site source -- item pointing to the redirecting wikipage
        INNER JOIN %1$s_p.page sourcepage -- {$wiki}
            ON REPLACE( source.ips_site_page, " " , "_" ) = sourcepage.page_title -- identifying the redirecting wikipage locally
            -- I know REPLACE() breaks indexes, but sadly I don't know of a workaround for the space/underscore mismatch
        INNER JOIN %1$s_p.redirect redirect -- {$wiki}
            ON sourcepage.page_id = redirect.rd_from -- identifying the redirect target locally
        INNER JOIN wikidatawiki_p.wb_items_per_site target
            ON redirect.rd_title = REPLACE( target.ips_site_page, " ", "_" ) -- item the redirect target corresponds to
            -- as stated earlier, I know REPLACE() breaks indexes, but sadly I don't know of a workaround for the space/underscore mismatch
    WHERE
        source.ips_site_id = "%1$s" -- {$wiki}
        AND target.ips_site_id = "%1$s" -- {$wiki}
        AND sourcepage.page_namespace = 0

where "%1$s" is replaced by the name of the wiki, which means "%1$s_p" is replaced by the name of the database for that wiki.
The results are written directly into a TSV file.
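
For reference, the PHP side of the current process amounts to little more than filling in the template and handing it to mysql, roughly like this (the script and file names are placeholders, as is the replica host in the shell line below):

<?php
// buildquery.php (placeholder name): the PHP side only prints the filled-in query,
// so it has already exited by the time mysql starts executing it.
$wiki     = $argv[1];                                      // e.g. "euwiki"
$template = file_get_contents(__DIR__ . '/query.tpl.sql'); // the joined query above, with its %1$s placeholders
echo sprintf($template, $wiki);                            // every %1$s becomes "euwiki", every %1$s_p "euwiki_p"

and on the shell side something along the lines of php buildquery.php euwiki | mysql --defaults-file=$HOME/replica.my.cnf -h euwiki.labsdb --batch > reports/euwiki.tsv (mysql's batch mode is what produces the tab-separated output).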

While this proves functional and surprisingly fast for most smaller wikis (10 minutes or less for wikis with up to 250k content pages), the only wiki with more than 400k content pages I've tried it on (arwiki) gives a somewhat unexpected result, taking a little over an hour, roughly 30 times more than the 2min30s taken for a 230k-page wiki (euwiki).

As jynus pointed out on IRC, it might prove better to "cut" the one big query into several smaller ones.
What I have in mind for that is running three prepared statements side by side, using PHP's PDO::prepare(), on three tables in two databases for each wiki (a rough PHP sketch follows the list):

  • in wikidatawiki_p:
SELECT ips_item_id, ips_site_page FROM wb_items_per_site WHERE ips_site_id = :siteid
  • for each result, store the first element (the Wikidata item id) and bind the second (the wikilink from Wikidata to the wiki) by value to :wikilink in the following query, run on (site id)_p (the database passed to PDO::__construct() for the second PDO object):
SELECT redirect.rd_title AS target, redirect.rd_fragment AS fragment
    FROM
        page
        INNER JOIN redirect
            ON page.page_id = redirect.rd_from
    WHERE
        REPLACE( :wikilink, " ", "_" ) = page.page_title
  • if that returns a result, bind its target to :target and the site id to :siteid, and run the following on wikidatawiki_p (back on the first PDO object):
SELECT ips_item_id FROM wb_items_per_site WHERE ips_site_id = :siteid AND ips_site_page = :target
  • and finally, append the results to the current report using PHP's fputcsv().
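
Roughly, in PHP, that would look something like this (host names, credential handling and output columns below are only illustrative, not actual wrcr code):

<?php
// Rough sketch of the three-statement plan above; hosts, paths and columns are placeholders.
$creds = parse_ini_file(getenv('HOME') . '/replica.my.cnf'); // [client] user= / password=
$wiki  = 'euwiki';

// two connections: one to wikidatawiki_p, one to the client wiki's *_p database
$wd    = new PDO("mysql:host=wikidatawiki.labsdb;dbname=wikidatawiki_p", $creds['user'], $creds['password']);
$local = new PDO("mysql:host={$wiki}.labsdb;dbname={$wiki}_p", $creds['user'], $creds['password']);

// 1) every sitelink from Wikidata to this wiki
$sitelinks = $wd->prepare(
    'SELECT ips_item_id, ips_site_page FROM wb_items_per_site WHERE ips_site_id = :siteid'
);
// 2) is that page a main-namespace redirect, and to what? (same namespace-0 restriction as the joined query)
$redirects = $local->prepare(
    'SELECT redirect.rd_title AS target, redirect.rd_fragment AS fragment
       FROM page INNER JOIN redirect ON page.page_id = redirect.rd_from
      WHERE page.page_title = REPLACE( :wikilink, " ", "_" ) AND page.page_namespace = 0'
);
// 3) does the redirect target have its own item?
$targetitems = $wd->prepare(
    'SELECT ips_item_id FROM wb_items_per_site WHERE ips_site_id = :siteid AND ips_site_page = :target'
);

$report = fopen("reports/{$wiki}.tsv", 'w');
$sitelinks->execute([ ':siteid' => $wiki ]);
// statements 1 and 3 share the $wd connection; this relies on PDO's default buffered
// queries for MySQL so that statement 1's result set stays readable in the meantime
while ($row = $sitelinks->fetch(PDO::FETCH_NUM)) {
    list($sourceitemid, $sourcelink) = $row;
    $redirects->execute([ ':wikilink' => $sourcelink ]);
    if ($redirect = $redirects->fetch(PDO::FETCH_ASSOC)) {
        // rd_title uses underscores while ips_site_page uses spaces, hence the conversion back
        $targetlink = str_replace('_', ' ', $redirect['target']);
        $targetitems->execute([ ':siteid' => $wiki, ':target' => $targetlink ]);
        if ($target = $targetitems->fetch(PDO::FETCH_NUM)) {
            // mirror the columns of the joined query, fragment included when present
            $redirecttarget = ($redirect['fragment'] !== null && $redirect['fragment'] !== '')
                ? $redirect['target'] . '#' . $redirect['fragment']
                : $redirect['target'];
            fputcsv($report, [ $sourceitemid, $sourcelink, $redirecttarget, $targetlink, $target[0] ], "\t");
        }
    }
}
fclose($report);

The point of the split being that the second and third statements are prepared once and then only re-executed with new bindings for every sitelink, instead of everything going through one big join.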

However, running this might take fairly long, and it would keep connections to the two databases open for longer than running the "massively" joined query above; not to mention running PHP itself for a fairly long time (in the current process, PHP is only responsible for building the query and passing it to $ mysql, which means it stops before the query even runs). And I'm afraid I can't think of a way to know in advance exactly how long.

I obviously don't intend to run the process concurrently for every wiki, but sequentially, hoping of course that the sum of all runs doesn't take more than a reasonable period (as stated earlier, one or two months?).

Since it's not so demanding on small wikis (at worst 10min07s for 112k pages with bewiki, usually much less), it seems the first approach is suitable for them.

Assuming I'm allowed to eventually run either one of them on bigger wikis, which do you think would be better?

Event Timeline

Alphos raised the priority of this task to Medium.
Alphos updated the task description.
Alphos added projects: DBA, Toolforge.
Alphos subscribed.
Restricted Application added a subscriber: Aklapper.

Additional info: I think I may have found a decent approximation for the complexity of the queries I've run up to this point.

It appears the time is roughly proportional to (pages on the wiki) × (results found on the wiki), as this scatterplot shows. It includes what I previously thought were outliers, which were in fact wikis with lots of positives.
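
(In other words, roughly t ≈ c × (pages on the wiki) × (results found on the wiki), with c more or less constant across wikis; that's just my reading of the scatterplot, not a fitted model.)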

Hope this helps.

I do not know if you are looking for advice or for permission. I cannot give you much advice, but regarding permission, there are not many rules enforced in advance.

However, if queries affect other users because they are too taxing, they will be killed and the user will be warned and expected to take action.

I really am looking for both.

On another note, I'm going through a bit of a tough time at the moment (for the past few weeks, just another strong bout of a fairly serious medical condition), so I'm not really able to work on this tool or any other just yet. Hopefully, a new course of treatment (higher dose and different form factor) I've only just started will make things easier on me and let me resume work soon.

Hi @Alphos

I hope you are well. I was wondering if this task is still needed?

Going to mark this as resolved for now, as it is an old one and, as Jaime said, there was not much we could do about it. If @Alphos still considers this a valid task, please reopen it :-)