
ApiQueryAllPages is slow on langlinks edgecase
Closed, Declined · Public · Production Error

Description

Spotted in production:

May  1 15:45:18 mw1223:  SlowTimer [59999ms] at runtime/ext_mysql: slow query:
SELECT /* ApiQueryAllPages::run Pywikibot-test */  /*! STRAIGHT_JOIN */
    page_namespace,page_title,page_id,page_content_model,page_restrictions,page_is_redirect,
    page_is_new,page_touched,page_latest,page_len 
FROM
    `page`,`langlinks`
WHERE
    page_namespace = '0' AND (page_title>='!') AND (page_id=ll_from)  
GROUP BY
    page_title,page_id ORDER BY page_title LIMIT 6

The explain for the query is deceptively simple:

mysql:wikiadmin@db1052 [enwiki]> explain SELECT /* ApiQueryAllPages::run Pywikibot-test */ /*! STRAIGHT_JOIN */ page_namespace,page_title,page_id,page_content_model,page_restrictions,page_is_redirect,page_is_new,page_touched,page_latest,page_len FROM `page`,`langlinks` WHERE page_namespace = '0' AND (page_title>='!') AND (page_id=ll_from) GROUP BY page_title,page_id ORDER BY page_title LIMIT 6;
+------+-------------+-----------+------+--------------------+------------+---------+---------------------+----------+------------------------------------+
| id   | select_type | table     | type | possible_keys      | key        | key_len | ref                 | rows     | Extra                              |
+------+-------------+-----------+------+--------------------+------------+---------+---------------------+----------+------------------------------------+
|    1 | SIMPLE      | page      | ref  | PRIMARY,name_title | name_title | 4       | const               | 18727509 | Using index condition; Using where |
|    1 | SIMPLE      | langlinks | ref  | ll_from            | ll_from    | 4       | enwiki.page.page_id | 2        | Using index                        |
+------+-------------+-----------+------+--------------------+------------+---------+---------------------+----------+------------------------------------+
2 rows in set (0.08 sec)

Per @Anomie on IRC:

"It's choosing a good index and not filesorting. The only bit that looks suspicious to me is the join with langlinks, if langlinks is mostly empty on that wiki, which would make it have to go through a lot of rows."

On a wiki like enwiki, this query is fast. On something like wikidatawiki, where there are basically no langlinks in NS0, this edge case is extremely slow: 60 seconds or more.
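The pathology can be reproduced in miniature. Because the join is an inner join, the engine must walk `page` rows in title order until it finds ones with a matching `langlinks` row; when almost no page in the namespace has a langlink, nearly the whole table is scanned before the LIMIT can be satisfied. A minimal SQLite sketch (table and column names mirror the MediaWiki schema, but the data set is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY,"
            " page_namespace INT, page_title TEXT)")
cur.execute("CREATE TABLE langlinks (ll_from INT, ll_lang TEXT, ll_title TEXT)")

# Many pages in NS0, but only one of them has a langlink.
cur.executemany("INSERT INTO page VALUES (?, 0, ?)",
                [(i, f"Page_{i:05d}") for i in range(1, 10001)])
cur.execute("INSERT INTO langlinks VALUES (9999, 'de', 'Seite')")

# The ApiQueryAllPages-style query: LIMIT 6 is requested, but only one
# row can ever match, so the scan cannot stop early after six page rows.
rows = cur.execute("""
    SELECT page_title, page_id
    FROM page, langlinks
    WHERE page_namespace = 0 AND page_title >= '!' AND page_id = ll_from
    GROUP BY page_title, page_id ORDER BY page_title LIMIT 6
""").fetchall()
print(rows)
```

The query returns a single row despite `LIMIT 6`; on a real wiki that means the executor visited most of `page` to produce almost nothing.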

An idea was floated:

No idea. Removing the straight join might work for wikidatawiki since langlinks is so small, but still wouldn't help the general case where other namespaces have lots of langlinks. Could add a "ll_from_namespace" column, I suppose.

But:

That could work, but would require backfilling to be useful.
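The `ll_from_namespace` idea plus backfill, sketched in SQLite (the column and index names here are assumptions drawn from the comment above, not necessarily what any eventual schema change used):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY,"
            " page_namespace INT, page_title TEXT)")
cur.execute("CREATE TABLE langlinks (ll_from INT, ll_lang TEXT, ll_title TEXT)")
cur.executemany("INSERT INTO page VALUES (?, ?, ?)",
                [(1, 0, 'Main'), (2, 4, 'Project_page')])
cur.executemany("INSERT INTO langlinks VALUES (?, ?, ?)",
                [(1, 'de', 'Haupt'), (2, 'fr', 'Projet')])

# 1. Add the denormalized column (hypothetical name).
cur.execute("ALTER TABLE langlinks ADD COLUMN ll_from_namespace INT")

# 2. Backfill from page -- the expensive one-off step the comment
#    says is required before the column is useful.
cur.execute("""
    UPDATE langlinks
    SET ll_from_namespace = (SELECT page_namespace FROM page
                             WHERE page_id = ll_from)
""")

# 3. Index it, so a query can start from langlinks rows in the
#    requested namespace instead of scanning page.
cur.execute("CREATE INDEX ll_from_namespace_idx"
            " ON langlinks (ll_from_namespace, ll_from)")

rows = cur.execute("SELECT ll_from, ll_from_namespace FROM langlinks"
                   " ORDER BY ll_from").fetchall()
print(rows)
```

With the column in place, the planner can drive the join from `langlinks` filtered by namespace, which is cheap precisely in the case that is currently pathological (few langlinks in the namespace).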

Event Timeline

demon raised the priority of this task from to Medium.
demon updated the task description. (Show Details)
demon added subscribers: demon, Anomie.

It's basically one of those "Go through all pages in the namespace to find the tiny fraction that pass the filter" issues.

If nothing can be done to have the filtering done in SQL, I suppose we could select the $limit rows with a left join and filter them in PHP when $wgMiserMode is set (returning fewer than $limit, possibly 0, to the client), like we do in a few other places.
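That suggestion amounts to: fetch `$limit` rows cheaply with a LEFT JOIN (which never discards page rows, so it stops after `$limit`), then drop the rows without a langlink in application code. A Python/SQLite stand-in for the described PHP-side filtering (the miser-mode behavior is as described in the comment; the code itself is illustrative, not MediaWiki's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT)")
cur.execute("CREATE TABLE langlinks (ll_from INT)")
cur.executemany("INSERT INTO page VALUES (?, ?)",
                [(i, f"Page_{i}") for i in range(1, 8)])
cur.execute("INSERT INTO langlinks VALUES (3)")

limit = 6
# Cheap query: the LEFT JOIN keeps every page row, so the scan stops
# as soon as `limit` rows have been produced.
rows = cur.execute("""
    SELECT page.page_id, page.page_title, langlinks.ll_from
    FROM page LEFT JOIN langlinks ON page_id = ll_from
    ORDER BY page_title LIMIT ?
""", (limit,)).fetchall()

# Filter in application code, as the comment suggests doing in PHP
# under $wgMiserMode; the client may get fewer than `limit` results.
filtered = [(pid, title) for pid, title, ll in rows if ll is not None]
print(filtered)
```

The trade-off is exactly the one the comment names: the response may contain fewer than `$limit` rows, possibly zero, and the client must continue from the returned continuation point.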

hashar set Security to None.
hashar moved this task from Untriaged to Dec2019/1.35.wmf.10+ on the Wikimedia-production-error board.
This comment was removed by Krinkle.
mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:12 PM
Aklapper changed the task status from Stalled to Open.Oct 19 2020, 4:34 PM

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status.

(Smallprint, as general orientation for task management: If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead. If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks… → Edit Subtasks. If this task is stalled on an upstream project, then the Upstream tag should be added. If this task requires info from the task reporter, then there should be instructions which info is needed. If this task needs retesting, then the TestMe tag should be added. If this task is either out of scope and nobody should ever work on this, or nobody else managed to reproduce the problem described in this task, then this task should have the "Declined" status. If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)

Closing, as this is an old timeout-related task without clear steps to reproduce the issue. It may have been intermittent, or since been solved or made obsolete. There has not been any on-topic comment since its creation five years ago.

On this task specifically, I suspect it may have stopped timing out thanks to hardware improvements, or been obsoleted by the move of this data to Wikidata for many wikis.