Page MenuHomePhabricator

Aggregate query on page, revision and langlinks takes a long time to run
Open, Needs TriagePublic

Description

The query attempts to calculate an "h-index" type metric for each user, but instead of the count of citations, it looks at the count of interwiki links. The complete query can be found here and used to run in about 15 minutes. It started to time out around the time the actor table was introduced. It can certainly benefit from using some alternative views. However, I think the actual issue lies somewhere else and has nothing to do with the actor table.

At its core, this query depends on a subquery like this:

SELECT
  actor_name,
  rev_actor,
  rev_page,
  COUNT(ll_lang) AS iwlinks
FROM revision_userindex
JOIN actor_revision
  on rev_actor = actor_id
JOIN page
  ON rev_page = page_id
  AND page_namespace = 0
  AND page_is_redirect = 0
JOIN langlinks
  ON ll_from = rev_page
WHERE
  rev_parent_id = 0
GROUP BY
  actor_name,
  rev_actor,
  rev_page
HAVING COUNT(ll_lang) > 5

It will then join that with itself and apply some additional logic and aggregations, all of which can also be done in the python script that runs the query. The only part that we absolutely need to be in SQL is the short query above. However, even that query does not finish within 30 minutes. Alas, even the following version which doesn't even join actor does not finish within 30 minutes:

SELECT
  rev_actor,
  rev_page,
  COUNT(ll_lang) AS iwlinks
FROM revision_userindex
JOIN page
  ON rev_page = page_id
  AND page_namespace = 0
  AND page_is_redirect = 0
JOIN langlinks
  ON ll_from = rev_page
WHERE
  rev_parent_id = 0
GROUP BY
  rev_actor,
  rev_page
HAVING COUNT(ll_lang) > 5

Do you have any advice on how to optimize the query?

Also, this is one of the several queries we run which looks at who created an article (i.e. joins page and revision and filters on page_namespace and rev_parent_id). I wonder if we should create a table called page_creator that just contains the information (perhaps with three columns page_id, create_actor_id, create_rev_id)

Event Timeline

Huji updated the task description. (Show Details)

As I thought more about this, I think the issue is really that the langlinks table is being scanned nearly in its entirety. The much shorter query below ends up using filesort.

SELECT
  ll_from,
  COUNT(ll_lang) AS iwlinks
FROM langlinks
GROUP BY ll_from
HAVING COUNT(ll_lang) > 5

The able only has one index which is not useful for the query above:

tables.sql
-- Index for ApiQueryLangbacklinks
CREATE INDEX /*i*/ll_lang ON /*_*/langlinks (ll_lang, ll_title);

Should we consider creating additional indexes in wiki replicas?

As I thought more about this, I think the issue is really that the langlinks table is being scanned nearly in its entirety. The much shorter query below ends up using filesort.

SELECT
  ll_from,
  COUNT(ll_lang) AS iwlinks
FROM langlinks
GROUP BY ll_from
HAVING COUNT(ll_lang) > 5

COUNT(ll_lang) AS iwlinks -- this is functionally equivalent to COUNT(*) because the ll_lang column is defined as NOT NULL. COUNT(<column>) only ever differs from COUNT(*) if the column is nullable. If it is nullable then it counts the non-null rows which in turn forces a full table scan. See https://www.percona.com/blog/2007/04/10/count-vs-countcol/

COUNT(DISTINCT <column>) may be what you are actually wanting to ask if the intent is to know how many different ll_lang values are in the table for each ll_from value.

Should we consider creating additional indexes in wiki replicas?

I think the GROUP BY might be sped up by an index on ll_from, but I think only in this reduced case where langlinks is not joined to any other tables. "This is possible under the following conditions: The query is over a single table." -- https://dev.mysql.com/doc/refman/8.0/en/group-by-optimization.html#loose-index-scan

COUNT(DISTINCT <column>) may be what you are actually wanting to ask if the intent is to know how many different ll_lang values are in the table for each ll_from value.

Right. Although, because langlinks does not have any duplicates for the ll_from, ll_lang pair, the results should be the same, correct? Would adding DISTINCT have any benefit from an optimization perspective?

Should we consider creating additional indexes in wiki replicas?

I think the GROUP BY might be sped up by an index on ll_from, but I think only in this reduced case where langlinks is not joined to any other tables. "This is possible under the following conditions: The query is over a single table." -- https://dev.mysql.com/doc/refman/8.0/en/group-by-optimization.html#loose-index-scan

What if I rewrote the query as a nested query, with the innermost nest being a query on langlinks alone? One example is shown below; would this be treated any differently by the DB engine? That is, would an index on ll_from be helpful here?

different nesting approach
SELECT
  rev_actor,
  iwlinks
FROM page
JOIN revision
  ON rev_page = page_id
JOIN (
  SELECT
    ll_from,
    COUNT(ll_lang) AS iwlinks
  FROM langlinks
  GROUP BY ll_from
  HAVING COUNT(ll_lang) > 5
) ll
  ON ll_from = page_id
WHERE
  page_namespace = 0
  AND rev_parent_id = 0

PS: The query that only uses langlinks table seems to run in about 15 minutes now. I am giving the different nesting approach a try to see how that works out.

PPS: The query with different nesting approach was killed after 30 minutes.