
Special:ActiveUsers queries result in fatal error (due to query/request timeout)
Closed, Resolved · Public

Description

On https://commons.wikimedia.org/w/index.php?title=Special:ActiveUsers&limit=1&username=Wolfman12405

MediaWiki internal error.

Original exception: [W0HB1QpAICAAABpvRNYAAACC] 2018-07-08 07:49:37: Fatal exception of type "Wikimedia\Rdbms\DBQueryTimeoutError"

Exception caught inside exception handler.

Set $wgShowExceptionDetails = true; and $wgShowDBErrorBacktrace = true; at the bottom of LocalSettings.php to show detailed debugging information.

Details

Related Gerrit Patches:
  mediawiki/core (master): Improve performance of ActiveUsersPager query (https://gerrit.wikimedia.org/r/444666)

Event Timeline

Josve05a created this task. Jul 8 2018, 7:56 AM
Restricted Application added a subscriber: Aklapper. Jul 8 2018, 7:56 AM
Josve05a updated the task description. (Show Details)
Josve05a moved this task from Backlog to Tasks to follow on the User-Josve05a board.
Marostegui added a subscriber: Marostegui.

This looks like the offending query when I try to open the link. It is probably being killed by the query killer, as it takes longer than 60 seconds to execute (https://logstash.wikimedia.org/goto/5970062daa09e98d04b189ff7399cd1c):

root@db1091[commonswiki]> SELECT  qcc_title,qcc_title AS `user_name`,MAX(user_id) AS `user_id`,COUNT(*) AS `recentedits`  FROM `querycachetwo`,`user`,`recentchanges`    WHERE qcc_type = 'activeusers' AND qcc_namespace = '2' AND (user_name = qcc_title) AND (rc_user_text = qcc_title) AND (rc_type != '5') AND (rc_type != '6') AND (rc_log_type IS NULL OR rc_log_type != 'newusers') AND (rc_timestamp >= 'redacted') AND (qcc_title >= 'redacted') AND (NOT EXISTS (SELECT  1  FROM `ipblocks`    WHERE (ipb_user=user_id) AND ipb_deleted = '1'  ))  GROUP BY qcc_title ORDER BY qcc_title LIMIT 2  ;

redacted result 

2 rows in set (2 min 3.38 sec)

And this is the EXPLAIN:

+------+--------------+---------------+--------+---------------------------------+--------------+---------+-------------------------------------+--------+------------------------------------------------------------------------+
| id   | select_type  | table         | type   | possible_keys                   | key          | key_len | ref                                 | rows   | Extra                                                                  |
+------+--------------+---------------+--------+---------------------------------+--------------+---------+-------------------------------------+--------+------------------------------------------------------------------------+
|    1 | PRIMARY      | querycachetwo | range  | qcc_type,qcc_title,qcc_titletwo | qcc_title    | 295     | NULL                                |   1931 | Using where; Using index; Using temporary; Using filesort              |
|    1 | PRIMARY      | user          | eq_ref | user_name                       | user_name    | 257     | commonswiki.querycachetwo.qcc_title |      1 | Using where; Using index                                               |
|    1 | PRIMARY      | recentchanges | range  | rc_timestamp,rc_user_text       | rc_user_text | 273     | NULL                                | 882230 | Using index condition; Using where; Using join buffer (flat, BNL join) |
|    2 | MATERIALIZED | ipblocks      | ALL    | ipb_user                        | NULL         | NULL    | NULL                                |  63099 | Using where                                                            |
+------+--------------+---------------+--------+---------------------------------+--------------+---------+-------------------------------------+--------+------------------------------------------------------------------------+
4 rows in set (0.01 sec)
Anomie added a subscriber: Anomie. Jul 9 2018, 4:23 PM

The part of the query on recentchanges seems to be the problem here. It has to go through every user from querycachetwo that meets the other conditions in order to find the two rows for the result, and since it's doing a filesort it can't stop early as soon as it finds two that pass.

If we add an index on something like (rc_user_text,rc_type,rc_log_type,rc_timestamp), that might improve it. That would take some experimenting, and I'm not too confident it would make enough of a difference. It might help more if we used it in subquery fashion, as below, with an existence check inside the subquery.
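
Purely as an illustration, a minimal DDL sketch of such an index; the index name is hypothetical and this was not part of any proposed schema change:

-- Hypothetical index (name is illustrative only, untested against the production schema)
ALTER TABLE recentchanges
  ADD INDEX rc_user_text_type_log_timestamp (rc_user_text, rc_type, rc_log_type, rc_timestamp);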

Or we could trust querycachetwo more and accept the possibility that the special page might return some users who're not "active" anymore but were the last time querycachetwo was updated:

SELECT qcc_title,qcc_title AS `user_name`,MAX(user_id) AS `user_id`,COUNT(*) AS `recentedits`
 FROM
  (SELECT qcc_title,user_id 
    FROM `querycachetwo`
     JOIN `user` ON((user_name = qcc_title))
    WHERE qcc_type = 'activeusers'
     AND qcc_namespace = '2'
     AND (qcc_title >= 'redacted')
     AND (NOT EXISTS (SELECT  1  FROM `ipblocks`    WHERE (ipb_user=user_id) AND ipb_deleted = '1'  ))
    ORDER BY qcc_title LIMIT 2
  ) AS `qcc_users`
  JOIN `recentchanges` ON ((rc_user_text = qcc_title))
 WHERE (rc_type != '5') AND (rc_type != '6') AND (rc_log_type IS NULL OR rc_log_type != 'newusers')
  AND (rc_timestamp >= '20180608074837')
 GROUP BY qcc_title ORDER BY qcc_title
+------+--------------------+---------------+----------------+---------------------------------+--------------+---------+-------------------------------------+------+------------------------------------+
| id   | select_type        | table         | type           | possible_keys                   | key          | key_len | ref                                 | rows | Extra                              |
+------+--------------------+---------------+----------------+---------------------------------+--------------+---------+-------------------------------------+------+------------------------------------+
|    1 | PRIMARY            | <derived2>    | ALL            | NULL                            | NULL         | NULL    | NULL                                |    2 | Using temporary; Using filesort    |
|    1 | PRIMARY            | recentchanges | ref            | rc_timestamp,rc_user_text       | rc_user_text | 257     | qcc_users.qcc_title                 |   78 | Using index condition; Using where |
|    2 | DERIVED            | querycachetwo | range          | qcc_type,qcc_title,qcc_titletwo | qcc_title    | 295     | NULL                                | 1918 | Using where; Using index           |
|    2 | DERIVED            | user          | eq_ref         | user_name                       | user_name    | 257     | commonswiki.querycachetwo.qcc_title |    1 | Using where; Using index           |
|    3 | DEPENDENT SUBQUERY | ipblocks      | index_subquery | ipb_user                        | ipb_user     | 4       | func                                |    1 | Using where                        |
+------+--------------------+---------------+----------------+---------------------------------+--------------+---------+-------------------------------------+------+------------------------------------+

That way only two users will have to be queried from recentchanges instead of an unbounded number of them.

Actually doing this in ActiveUsersPager will require overriding buildQueryInfo() in addition to getQueryInfo(), but it seems straightforward enough.

Change 444666 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Improve performance of ActiveUsersPager query

https://gerrit.wikimedia.org/r/444666

Krinkle triaged this task as High priority. Jul 11 2018, 1:38 AM
Krinkle added a subscriber: Krinkle.

As of this month, the PHP timeout is now shorter than the MySQL timeout, so the error is now a fatal (with the WMF error page) instead of a database exception rendered on a normal wiki page.

PHP Fatal Error: entire web request took longer than 60 seconds and timed out

#3 /srv/mediawiki/php-1.32.0-wmf.20/includes/libs/rdbms/database/Database.php(1655): Wikimedia\Rdbms\Database->query(string, string)
#4 /srv/mediawiki/php-1.32.0-wmf.20/includes/pager/IndexPager.php(368): Wikimedia\Rdbms\Database->select(array, array, array, string, array, array)
#5 /srv/mediawiki/php-1.32.0-wmf.20/includes/pager/IndexPager.php(225): IndexPager->reallyDoQuery(string, integer, boolean)
#6 /srv/mediawiki/php-1.32.0-wmf.20/includes/pager/IndexPager.php(422): IndexPager->doQuery()
#7 /srv/mediawiki/php-1.32.0-wmf.20/includes/specials/SpecialActiveusers.php(62): IndexPager->getBody()
#8 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPage.php(569): SpecialActiveUsers->execute(NULL)
#9 /srv/mediawiki/php-1.32.0-wmf.20/includes/specialpage/SpecialPageFactory.php(581): SpecialPage->run(NULL)
#10 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(288): MediaWiki\Special\SpecialPageFactory->executePath(Title, RequestContext)
#11 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(868): MediaWiki->performRequest()
#12 /srv/mediawiki/php-1.32.0-wmf.20/includes/MediaWiki.php(525): MediaWiki->main()
#13 /srv/mediawiki/php-1.32.0-wmf.20/index.php(42): MediaWiki->run()
Krinkle renamed this task from Special:ActiveUsers results in fatal exception of type "Wikimedia\Rdbms\DBQueryTimeoutError" to Special:ActiveUsers queries result in fatal error (due to query/request timeout). Sep 17 2018, 5:13 PM
Krinkle changed Risk Rating from N/A to default.
daniel added a subscriber: daniel. Jan 23 2019, 8:42 AM

It seems to me this is blocked on the fact that the new code is not trivial to review.

@Anomie, could you post some example queries that result from the new code you wrote, so they can be reviewed by DBAs? I can personally poke at the PHP code, but I can't really assess whether the new queries are better, or may cause new fun problems.

daniel assigned this task to Anomie. Jan 23 2019, 9:32 AM

See T199044#4408658 above; the patch implements the "or we could trust querycachetwo more" bit. Variations include, in the inner query (see the combined sketch after this list),

  • Once actor migration proceeds, a JOIN actor ON (actor_user = user_id), including actor_id in the inner query's selected fields, and changing the outer query's recentchanges join condition to rc_actor = actor_id.
    • In this situation, the existing code in master instead joins actor on rc_actor = actor_id and recentchanges on actor_name = qcc_title to the same effect.
  • A possible JOIN user_groups AS ug1 ON (ug1.ug_user = user_id) plus WHERE conditions ug1.ug_group IN (...) AND (ug1.ug_expiry IS NULL OR ug1.ug_expiry >= 'timestamp').
    • There's a similar condition in the existing code in master, lacking the ug1 alias.
  • A possible LEFT JOIN user_groups AS ug2 ON (ug2.ug_user = user_id AND ug2.ug_group IN (...) AND (ug2.ug_expiry IS NULL OR ug2.ug_expiry >= 'timestamp')) plus WHERE condition ug2.ug_user IS NULL.
    • In the existing code in master the same effect is achieved with a NOT EXISTS(...) subquery WHERE condition.
  • Omitting the ipblocks check, if the user using the special page has the ability to see suppressed usernames.
    • This possibility exists in the existing code too.
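
To make these variations concrete for review, here is a rough combined SQL sketch of the inner query with the actor join and both user_groups variants applied at once. The group names, timestamps, and the ug1/ug2 aliases are placeholders drawn from the description above, not the exact output of the patch:

-- Placeholder values: 'sysop', 'bot', '20190123000000', 'redacted'
SELECT qcc_title, user_id, actor_id
 FROM `querycachetwo`
  JOIN `user` ON (user_name = qcc_title)
  JOIN `actor` ON (actor_user = user_id)                -- actor-migration variant
  JOIN `user_groups` AS ug1 ON (ug1.ug_user = user_id)  -- "only in group(s)" filter
  LEFT JOIN `user_groups` AS ug2 ON (ug2.ug_user = user_id
    AND ug2.ug_group IN ('bot')
    AND (ug2.ug_expiry IS NULL OR ug2.ug_expiry >= '20190123000000'))
 WHERE qcc_type = 'activeusers'
  AND qcc_namespace = '2'
  AND (qcc_title >= 'redacted')
  AND ug1.ug_group IN ('sysop')
  AND (ug1.ug_expiry IS NULL OR ug1.ug_expiry >= '20190123000000')
  AND ug2.ug_user IS NULL                               -- "not in excluded group(s)"
  AND (NOT EXISTS (SELECT 1 FROM `ipblocks` WHERE (ipb_user = user_id) AND ipb_deleted = '1'))
 ORDER BY qcc_title LIMIT 2

With the actor variant, the outer query would then join recentchanges on rc_actor = actor_id instead of rc_user_text = qcc_title.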

We're trying to determine what the next step is here. Is there someone already involved in the ticket who can review the SQL query that @Anomie posted above? @Marostegui?

Is it worthwhile to have @daniel review the PHP code in parallel?

Change 444666 merged by jenkins-bot:
[mediawiki/core@master] Improve performance of ActiveUsersPager query

https://gerrit.wikimedia.org/r/444666

Anomie closed this task as Resolved. Feb 7 2019, 8:04 PM

The fix should be deployed to Wikimedia wikis with 1.33.0-wmf.17; see https://www.mediawiki.org/wiki/MediaWiki_1.33/Roadmap for the schedule.

aaron added a subscriber: aaron. Mar 4 2019, 4:52 AM

Here is a form of the query without filesorts:

EXPLAIN SELECT qcc_title, qcc_title AS `user_name`, MAX(user_id) AS `user_id`,
  (SELECT COUNT(*) FROM `recentchanges`
    WHERE (rc_user_text = qcc_title) AND (rc_type != '5') AND (rc_type != '6')
     AND (rc_log_type IS NULL OR rc_log_type != 'newusers')
     AND (rc_timestamp >= '20190101010101')
  ) AS `recentedits`
 FROM `querycachetwo`
  INNER JOIN `user` ON (user_name = qcc_title)
 WHERE qcc_type = 'activeusers' AND qcc_namespace = '2' AND qcc_title >= 'redacted'
  AND (NOT EXISTS (SELECT 1 FROM `ipblocks` WHERE (ipb_user = user_id) AND ipb_deleted = '1'))
 GROUP BY qcc_title HAVING `recentedits` > 0 ORDER BY qcc_title LIMIT 2;

Which results in:

+------+--------------------+---------------+----------------+---------------------------------+--------------+---------+-------------------------------------+------+------------------------------------+
| id   | select_type        | table         | type           | possible_keys                   | key          | key_len | ref                                 | rows | Extra                              |
+------+--------------------+---------------+----------------+---------------------------------+--------------+---------+-------------------------------------+------+------------------------------------+
|    1 | PRIMARY            | querycachetwo | range          | qcc_type,qcc_title,qcc_titletwo | qcc_title    | 295     | NULL                                | 1298 | Using where; Using index           |
|    1 | PRIMARY            | user          | eq_ref         | user_name                       | user_name    | 257     | commonswiki.querycachetwo.qcc_title |    1 | Using where; Using index           |
|    3 | DEPENDENT SUBQUERY | ipblocks      | index_subquery | ipb_user                        | ipb_user     | 4       | func                                |    1 | Using where                        |
|    2 | DEPENDENT SUBQUERY | recentchanges | ref            | rc_timestamp,rc_user_text       | rc_user_text | 257     | func                                |   26 | Using index condition; Using where |
+------+--------------------+---------------+----------------+---------------------------------+--------------+---------+-------------------------------------+------+------------------------------------+
mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:09 PM