
Labs query that involves Commons DB takes a long time
Closed, Declined · Public

Description

The following is an analytical query that we used to run every week on fawiki but now it takes too long (more than 30 minutes) and I need help optimizing the query. The query aims to identify pages in which there exists a link to a non-existing file (nearly always, an image). Obviously, the query should exclude links to files that don't exist locally but are present on Commons.

Adding Bryan and MA per a previous discussion on Cloud-l in which they suggested that bringing such queries to their attention could help with the creation of new indexes that would benefit many more users.

SELECT
  page_title,
  il_to
FROM page
JOIN imagelinks
  ON il_from = page_id
WHERE
  page_namespace = 0
  AND il_to NOT IN (
    SELECT page_title
    FROM page
    WHERE page_namespace = 6
  )
  AND il_to NOT IN (
    SELECT page_title
    FROM commonswiki_p.page
    WHERE page_namespace = 6
  )
LIMIT 10000

Event Timeline

Huji renamed this task from "Labs query takes a long time" to "Labs query that involves Commons DB takes a long time". May 4 2020, 6:02 PM

The two sub-selects in this query will return ALL of the File namespace rows, first for fawiki and then for commons. These large sets of data must be held in memory and then scanned repeatedly, once for each of the titles present in fawiki's imagelinks table.

(u3518@commonswiki.analytics.db.svc.eqiad.wmflabs) [commonswiki_p]> select count(*) from page where page_namespace = 6;
+----------+
| count(*) |
+----------+
| 62624200 |
+----------+
1 row in set (33 min 29.03 sec)

It seems very likely that there is no index on the page_namespace column alone; one could probably help quite a bit. But anything that requires holding 62M rows in memory or in a temp table is going to be horribly slow on the Wiki Replicas. There is just not enough RAM for this kind of thing in a shared environment.
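
As an aside, whether the namespace filter can use an index at all can be checked by explaining the sub-select on its own against each replica. A minimal sketch of that check (illustrative only, not something run for this task):

-- Show which index, if any, the optimizer would pick for the namespace filter,
-- first on the local replica and then on the Commons replica.
EXPLAIN SELECT page_title FROM page WHERE page_namespace = 6;
EXPLAIN SELECT page_title FROM commonswiki_p.page WHERE page_namespace = 6;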

Thanks for looking into this. A few follow-up questions:

  1. At least for the first sub-query, it should use the name_title index (which is on page_namespace, page_title), no? Possibly for the second sub-query as well? At least that is how I interpret the explain results for this query; am I wrong?
  2. Could the issue be that, even using indexes, the number of matches for the first sub-query is so large that it still takes forever to join them with the commons DB?
  3. Lastly, do you think using a NOT EXISTS approach (instead of NOT IN) would be any better? Please see this alternative strategy; a rough sketch of the idea follows below. Does that look any better?
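
For illustration, the kind of rewrite I have in mind looks roughly like this (a sketch only; the table aliases are arbitrary and the linked paste may differ in detail):

SELECT
  page_title,
  il_to
FROM page
JOIN imagelinks
  ON il_from = page_id
WHERE
  page_namespace = 0
  -- no local File page with this title
  AND NOT EXISTS (
    SELECT 1
    FROM page AS localfile
    WHERE localfile.page_namespace = 6
      AND localfile.page_title = il_to
  )
  -- and no Commons File page with this title either
  AND NOT EXISTS (
    SELECT 1
    FROM commonswiki_p.page AS commonsfile
    WHERE commonsfile.page_namespace = 6
      AND commonsfile.page_title = il_to
  )
LIMIT 10000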

Thanks for looking into this. A few follow-up questions:

  1. At least for the first sub-query, it should use the name_title index (which is on page_namespace, page_title), no? Possibly for the second sub-query as well? At least that is how I interpret the explain results for this query; am I wrong?

I think you are correct. My naive count() didn't involve page_title, which could explain why it took over 30 minutes just to count the rows.

  2. Could the issue be that, even using indexes, the number of matches for the first sub-query is so large that it still takes forever to join them with the commons DB?

Yes. And in fact, if you look at the plan, the first thing it shows is a scan of 25,222,635 rows through the index.

  3. Lastly, do you think using a NOT EXISTS approach (instead of NOT IN) would be any better? Please see this alternative strategy. Does that look any better?

Yes, that looks marginally better: it only scans the 25M rows from commons and not also the 1.6M rows from fawiki. Oops, I just noticed that your explains were actually run against enwiki rather than fawiki. The numbers for the local File namespace will go down a lot in reality when comparing against fawiki, which has far fewer File namespace pages. Ultimately, though, the computational cost is going to be driven by the size of the commonswiki File namespace, which is an ever-growing number of rows.

Your LIMIT 10000 restriction probably doesn't do what you expect either. That will tell the database that it can stop after it has produced 10,000 result rows. But to find those 10,000 results it still has to look at all the rows in the commons File ns 10,000 times.

I honestly think that finding a way to fix Special:WantedFiles so that it filtered out media on commons ("Files from foreign repositories") would be the 'best' fix for this class of problem. The bots and scripts that folks make to work around gaps in MediaWiki are important, but fixing MediaWiki helps everyone.

Your LIMIT 10000 restriction probably doesn't do what you expect either. That will tell the database that it can stop after it has produced 10,000 result rows. But to find those 10,000 results it still has to look at all the rows in the commons File ns 10,000 times.

The limit is not there for efficiency reasons; it is there because we don't want the actual on-wiki report to be too large (or else pywikibot will fail to save the page, and all the time spent retrieving the data will be wasted).

I honestly think that finding a way to fix Special:WantedFiles so that it filtered out media on commons ("Files from foreign repositories") would be the 'best' fix for this class of problem. The bots and scripts that folks make to work around gaps in MediaWiki are important, but fixing MediaWiki helps everyone.

True; but wouldn't that require a similar query that runs through the entire commons DB? Wouldn't that still be a super slow and costly query, which may not even pass approval of our DBAs? Could that be why people have shied away from fixing it in MediaWiki itself?

I think T8220 is what you asked for... with no action since 2016.

True; but wouldn't that require a similar query that runs through the entire commons DB? Wouldn't that still be a super slow and costly query, which may not even pass approval of our DBAs?

It would be super slow and costly, but it could be reviewed and updated by a lot of folks. Production also has a LOT more database capacity than the Wiki Replicas. The results for "expensive" maintenance pages like Special:WantedFiles are computed by cron jobs that the DBAs have some control over and that run against secondary database instances.

Could that be why people have shied away from fixing it in MediaWiki itself?

Maybe, but maybe it is just not fixed because nobody who cared about this report has actually tried to write the code and get it merged. With a quick search of Phabricator I found T8220: Shared repositories support for Special:WantedFiles and https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/GlobalUsage/+/143835/ rotting away.

I conclude that there is no way for us to run this analytical query on Labs. I rebased the patch associated with T8220 and added a few additional reviewers. I guess that is the extent of what I can do right now.

Thanks for taking the time to review this! Marking it as declined, since this task in itself did not lead to any changes to the query or to the Wiki Replica DBs.