Page MenuHomePhabricator

Special:Log/Fanghong results in fatal exception of type "Wikimedia\Rdbms\DBQueryTimeoutError"
Closed, ResolvedPublicPRODUCTION ERROR

Description

Might be the same issue as T199044, however, I didn't get a "MediaWiki internal error" page this time, only a pink square on Commons.


On https://commons.wikimedia.org/w/index.php?title=Special:Log/Fanghong&offset=20180625020905&type=&user=Fanghong

[W03fbgpAAEEAAAPdxnYAAADI] 2018-07-17 12:23:06: Fatal exception of type "Wikimedia\Rdbms\DBQueryTimeoutError"

Event Timeline

Marostegui added subscribers: Anomie, jcrespo, Marostegui.

I have narrowed this to this query:

SELECT /* IndexPager::buildQueryInfo (LogPager) */  log_id,log_type,log_action,log_timestamp,log_namespace,log_title,log_params,log_deleted,user_id,user_name,user_editcount,COALESCE( comment_log_comment.comment_text, log_comment ) AS `log_comment_text`,comment_log_comment.comment_data AS `log_comment_data`,comment_log_comment.comment_id AS `log_comment_cid`,log_user,log_user_text,NULL AS `log_actor`,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`    WHERE ct_log_id=log_id  ) AS `ts_tags`  FROM `logging` LEFT JOIN `comment` `comment_log_comment` ON ((comment_log_comment.comment_id = log_comment_id)) LEFT JOIN `user` ON ((user_id=log_user))   WHERE (log_type NOT IN ('spamblacklist','titleblacklist','abusefilterprivatedetails','suppress')) AND ((log_user = 'redacted')) AND ((log_deleted & 4) = 0) AND (log_type != 'thanks') AND (log_type != 'patrol') AND (log_type != 'tag') AND (log_timestamp<'20180625020905')  ORDER BY log_timestamp DESC LIMIT 51;

This query works when it hits db1097 and takes no time:

+------+--------------------+---------------------+--------+----------------------------------------------+-----------+---------+------------------------------------+-------+-------------+
| id   | select_type        | table               | type   | possible_keys                                | key       | key_len | ref                                | rows  | Extra       |
+------+--------------------+---------------------+--------+----------------------------------------------+-----------+---------+------------------------------------+-------+-------------+
|    1 | PRIMARY            | logging             | range  | type_time,user_time,times,log_user_type_time | user_time | 20      | NULL                               | 12252 | Using where |
|    1 | PRIMARY            | comment_log_comment | eq_ref | PRIMARY                                      | PRIMARY   | 8       | commonswiki.logging.log_comment_id |     1 |             |
|    1 | PRIMARY            | user                | eq_ref | PRIMARY                                      | PRIMARY   | 4       | commonswiki.logging.log_user       |     1 |             |
|    2 | DEPENDENT SUBQUERY | change_tag          | ref    | ct_log_id,change_tag_log_tag                 | ct_log_id | 5       | commonswiki.logging.log_id         |     1 | Using index |
+------+--------------------+---------------------+--------+----------------------------------------------+-----------+---------+------------------------------------+-------+-------------+
4 rows in set (0.00 sec)

However, when it hits db1103 (the other rc slave) it takes longer and get killed by the query killer. This is the explain:

+------+--------------------+---------------------+--------+----------------------------------------------+-----------+---------+------------------------------------+--------+-------------+
| id   | select_type        | table               | type   | possible_keys                                | key       | key_len | ref                                | rows   | Extra       |
+------+--------------------+---------------------+--------+----------------------------------------------+-----------+---------+------------------------------------+--------+-------------+
|    1 | PRIMARY            | logging             | index  | type_time,user_time,times,log_user_type_time | user_time | 20      | NULL                               | 644509 | Using where |
|    1 | PRIMARY            | comment_log_comment | eq_ref | PRIMARY                                      | PRIMARY   | 8       | commonswiki.logging.log_comment_id |      1 |             |
|    1 | PRIMARY            | user                | eq_ref | PRIMARY                                      | PRIMARY   | 4       | commonswiki.logging.log_user       |      1 |             |
|    2 | DEPENDENT SUBQUERY | change_tag          | ref    | ct_log_id,change_tag_log_tag                 | ct_log_id | 5       | commonswiki.logging.log_id         |      1 | Using index |
+------+--------------------+---------------------+--------+----------------------------------------------+-----------+---------+------------------------------------+--------+-------------+
4 rows in set (0.00 sec)

I just realised that the logging table on db1103 is not partitioned whereas the one on db1097 is:

+------------+
| @@hostname |
+------------+
| db1097     |
+------------+
*************************** 1. row ***************************
       Table: logging
Create Table: CREATE TABLE `logging` (
  `log_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `log_type` varbinary(32) NOT NULL DEFAULT '',
  `log_action` varbinary(32) NOT NULL DEFAULT '',
  `log_timestamp` varbinary(14) NOT NULL DEFAULT '19700101000000',
  `log_user` int(10) unsigned NOT NULL DEFAULT '0',
  `log_namespace` int(11) NOT NULL DEFAULT '0',
  `log_title` varbinary(255) NOT NULL DEFAULT '',
  `log_comment` varbinary(255) NOT NULL DEFAULT '',
  `log_comment_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  `log_params` blob NOT NULL,
  `log_deleted` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `log_user_text` varbinary(255) NOT NULL DEFAULT '',
  `log_actor` bigint(20) unsigned NOT NULL DEFAULT '0',
  `log_page` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`log_id`,`log_user`),
  KEY `type_time` (`log_type`,`log_timestamp`),
  KEY `user_time` (`log_user`,`log_timestamp`),
  KEY `page_time` (`log_namespace`,`log_title`,`log_timestamp`),
  KEY `times` (`log_timestamp`),
  KEY `log_page_id_time` (`log_page`,`log_timestamp`),
  KEY `log_user_text_time` (`log_user_text`(16),`log_timestamp`),
  KEY `log_title_time` (`log_title`(16),`log_timestamp`),
  KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
  KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),
  KEY `actor_time` (`log_actor`,`log_timestamp`),
  KEY `log_actor_type_time` (`log_actor`,`log_type`,`log_timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=273088926 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
/*!50100 PARTITION BY RANGE (log_user)
(PARTITION p1 VALUES LESS THAN (1) ENGINE = InnoDB,
 PARTITION p5000 VALUES LESS THAN (5000) ENGINE = InnoDB,
 PARTITION p10000 VALUES LESS THAN (10000) ENGINE = InnoDB,
 PARTITION p20000 VALUES LESS THAN (20000) ENGINE = InnoDB,
 PARTITION p30000 VALUES LESS THAN (30000) ENGINE = InnoDB,
 PARTITION p40000 VALUES LESS THAN (40000) ENGINE = InnoDB,
 PARTITION p50000 VALUES LESS THAN (50000) ENGINE = InnoDB,
 PARTITION p75000 VALUES LESS THAN (75000) ENGINE = InnoDB,
 PARTITION p100000 VALUES LESS THAN (100000) ENGINE = InnoDB,
 PARTITION p150000 VALUES LESS THAN (150000) ENGINE = InnoDB,
 PARTITION p200000 VALUES LESS THAN (200000) ENGINE = InnoDB,
 PARTITION p250000 VALUES LESS THAN (250000) ENGINE = InnoDB,
 PARTITION p300000 VALUES LESS THAN (300000) ENGINE = InnoDB,
 PARTITION p350000 VALUES LESS THAN (350000) ENGINE = InnoDB,
 PARTITION p400000 VALUES LESS THAN (400000) ENGINE = InnoDB,
 PARTITION p500000 VALUES LESS THAN (500000) ENGINE = InnoDB,
 PARTITION p750000 VALUES LESS THAN (750000) ENGINE = InnoDB,
 PARTITION p1000000 VALUES LESS THAN (1000000) ENGINE = InnoDB,
 PARTITION p1500000 VALUES LESS THAN (1500000) ENGINE = InnoDB,
 PARTITION p2000000 VALUES LESS THAN (2000000) ENGINE = InnoDB,
 PARTITION p3000000 VALUES LESS THAN (3000000) ENGINE = InnoDB,
 PARTITION p4000000 VALUES LESS THAN (4000000) ENGINE = InnoDB,
 PARTITION p5000000 VALUES LESS THAN (5000000) ENGINE = InnoDB,
 PARTITION pMAXVALUE VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
+------------+
| @@hostname |
+------------+
| db1103     |
+------------+
*************************** 1. row ***************************
       Table: logging
Create Table: CREATE TABLE `logging` (
  `log_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `log_type` varbinary(32) NOT NULL DEFAULT '',
  `log_action` varbinary(32) NOT NULL DEFAULT '',
  `log_timestamp` varbinary(14) NOT NULL DEFAULT '19700101000000',
  `log_user` int(10) unsigned NOT NULL DEFAULT '0',
  `log_namespace` int(11) NOT NULL DEFAULT '0',
  `log_title` varbinary(255) NOT NULL DEFAULT '',
  `log_comment` varbinary(255) NOT NULL DEFAULT '',
  `log_comment_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  `log_params` blob NOT NULL,
  `log_deleted` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `log_user_text` varbinary(255) NOT NULL DEFAULT '',
  `log_actor` bigint(20) unsigned NOT NULL DEFAULT '0',
  `log_page` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`log_id`),
  KEY `type_time` (`log_type`,`log_timestamp`),
  KEY `user_time` (`log_user`,`log_timestamp`),
  KEY `page_time` (`log_namespace`,`log_title`,`log_timestamp`),
  KEY `times` (`log_timestamp`),
  KEY `log_page_id_time` (`log_page`,`log_timestamp`),
  KEY `log_user_text_time` (`log_user_text`(16),`log_timestamp`),
  KEY `log_title_time` (`log_title`(16),`log_timestamp`),
  KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
  KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),
  KEY `actor_time` (`log_actor`,`log_timestamp`),
  KEY `log_actor_type_time` (`log_actor`,`log_type`,`log_timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=273088926 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8

I guess this explain the problem. I will work on partitioning the logging table on db1103.

/*!50100 PARTITION BY RANGE (log_user)

What's going to happen there when we eventually drop log_user in favor of log_actor?

/*!50100 PARTITION BY RANGE (log_user)

What's going to happen there when we eventually drop log_user in favor of log_actor?

We should probably reevalute this (and similar queries) as this looks like that it does benefit from the partitions

Looking at the queries, I see SHOW EXPLAIN says it's using an even worse plan:

wikiadmin@10.64.0.164(commonswiki)> show explain for 3441450538;
+------+--------------------+---------------------+--------+----------------------------------------------+-----------+---------+------------------------------------+----------+-----------------------------+
| id   | select_type        | table               | type   | possible_keys                                | key       | key_len | ref                                | rows     | Extra                       |
+------+--------------------+---------------------+--------+----------------------------------------------+-----------+---------+------------------------------------+----------+-----------------------------+
|    1 | PRIMARY            | logging             | index  | type_time,user_time,times,log_user_type_time | user_time | 20      | NULL                               | 77328774 | Using where; Using filesort |
|    1 | PRIMARY            | comment_log_comment | eq_ref | PRIMARY                                      | PRIMARY   | 8       | commonswiki.logging.log_comment_id |        1 |                             |
|    1 | PRIMARY            | user                | eq_ref | PRIMARY                                      | PRIMARY   | 4       | commonswiki.logging.log_user       |        1 |                             |
|    2 | DEPENDENT SUBQUERY | change_tag          | ref    | ct_log_id,change_tag_log_tag                 | ct_log_id | 5       | commonswiki.logging.log_id         |        1 | Using index                 |
+------+--------------------+---------------------+--------+----------------------------------------------+-----------+---------+------------------------------------+----------+-----------------------------+

Looks like it's scanning the whole table and filesorting, WTF.

On a hunch I decided to see if this was like T197486#4293781. And it looks like it is: ignoring an index that it's supposedly not using anyway makes the query suddenly much faster:

wikiadmin@db1103(commonswiki)> explain SELECT  log_id,log_type,log_action,log_timestamp,log_namespace,log_title,log_params,log_deleted,user_id,user_name,user_editcount,COALESCE( comment_log_comment.comment_text, log_comment ) AS `log_comment_text`,comment_log_comment.comment_data AS `log_comment_data`,comment_log_comment.comment_id AS `log_comment_cid`,log_user,log_user_text,NULL AS `log_actor`,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`    WHERE ct_log_id=log_id  ) AS `ts_tags`  FROM `logging` ignore index(log_user_type_time) LEFT JOIN `comment` `comment_log_comment` ON ((comment_log_comment.comment_id = log_comment_id)) LEFT JOIN `user` ON ((user_id=log_user))   WHERE (log_type NOT IN ('spamblacklist','titleblacklist','abusefilterprivatedetails','suppress')) AND ((log_user = '784')) AND ((log_deleted & 12) != 9) AND (log_type != 'thanks') AND (log_type != 'patrol') AND (log_type != 'tag') AND (log_timestamp<'20180625020905')  ORDER BY log_timestamp DESC LIMIT 51;
+------+--------------------+---------------------+--------+------------------------------+-----------+---------+------------------------------------+-------+-------------+
| id   | select_type        | table               | type   | possible_keys                | key       | key_len | ref                                | rows  | Extra       |
+------+--------------------+---------------------+--------+------------------------------+-----------+---------+------------------------------------+-------+-------------+
|    1 | PRIMARY            | logging             | range  | type_time,user_time,times    | user_time | 20      | NULL                               | 11876 | Using where |
|    1 | PRIMARY            | comment_log_comment | eq_ref | PRIMARY                      | PRIMARY   | 8       | commonswiki.logging.log_comment_id |     1 |             |
|    1 | PRIMARY            | user                | eq_ref | PRIMARY                      | PRIMARY   | 4       | commonswiki.logging.log_user       |     1 |             |
|    2 | DEPENDENT SUBQUERY | change_tag          | ref    | ct_log_id,change_tag_log_tag | ct_log_id | 5       | commonswiki.logging.log_id         |     1 | Using index |
+------+--------------------+---------------------+--------+------------------------------+-----------+---------+------------------------------------+-------+-------------+
wikiadmin@db1103(commonswiki)> SELECT  log_id,log_type,log_action,log_timestamp,log_namespace,log_title,log_params,log_deleted,user_id,user_name,user_editcount,COALESCE( comment_log_comment.comment_text, log_comment ) AS `log_comment_text`,comment_log_comment.comment_data AS `log_comment_data`,comment_log_comment.comment_id AS `log_comment_cid`,log_user,log_user_text,NULL AS `log_actor`,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`    WHERE ct_log_id=log_id  ) AS `ts_tags`  FROM `logging` ignore index(log_user_type_time) LEFT JOIN `comment` `comment_log_comment` ON ((comment_log_comment.comment_id = log_comment_id)) LEFT JOIN `user` ON ((user_id=log_user))   WHERE (log_type NOT IN ('spamblacklist','titleblacklist','abusefilterprivatedetails','suppress')) AND ((log_user = '784')) AND ((log_deleted & 12) != 9) AND (log_type != 'thanks') AND (log_type != 'patrol') AND (log_type != 'tag') AND (log_timestamp<'20180625020905')  ORDER BY log_timestamp DESC LIMIT 51;
[...]
51 rows in set (0.00 sec)

db1103 and db1097 are both using 10.1.31.
I will upgrade db1103 to 10.1.34 but I don't think that will help anyways - I guess we have to partition the table and then later decide how to overcome the future once log_user is dropped.

I will be running this on db1103 to let it with the same schema as the rest of rc slaves in production across all the sections - I will also double check if we have more rc servers with this similar issue:

ALTER TABLE logging
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (log_id, log_user)
  PARTITION BY RANGE (log_user) (
  PARTITION p1 VALUES LESS THAN (1),
  PARTITION p10000 VALUES LESS THAN (10000),
  PARTITION p20000 VALUES LESS THAN (20000),
  PARTITION p30000 VALUES LESS THAN (30000),
  PARTITION p40000 VALUES LESS THAN (40000),
  PARTITION p50000 VALUES LESS THAN (50000),
  PARTITION p100000 VALUES LESS THAN (100000),
  PARTITION p150000 VALUES LESS THAN (150000),
  PARTITION p200000 VALUES LESS THAN (200000),
  PARTITION p300000 VALUES LESS THAN (300000),
  PARTITION p400000 VALUES LESS THAN (400000),
  PARTITION p500000 VALUES LESS THAN (500000),
  PARTITION p750000 VALUES LESS THAN (750000),
  PARTITION p1000000 VALUES LESS THAN (1000000),
  PARTITION pMAXVALUE VALUES LESS THAN MAXVALUE );

Change 446598 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1103:3314

https://gerrit.wikimedia.org/r/446598

Change 446598 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1103:3314

https://gerrit.wikimedia.org/r/446598

Mentioned in SAL (#wikimedia-operations) [2018-07-18T14:42:47Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 for alter table - T199790 (duration: 00m 52s)

Mentioned in SAL (#wikimedia-operations) [2018-07-18T14:47:31Z] <marostegui> Partition commonswiki.logging table on db1103:3314 - T199790

Marostegui claimed this task.

The partitioning has finished and the query now finishes instantly, like in db1097. The explain is now fine:

+------+--------------------+---------------------+--------+------------------------------+-----------+---------+------------------------------------+-------+-------------+
| id   | select_type        | table               | type   | possible_keys                | key       | key_len | ref                                | rows  | Extra       |
+------+--------------------+---------------------+--------+------------------------------+-----------+---------+------------------------------------+-------+-------------+
|    1 | PRIMARY            | logging             | range  | type_time,user_time,times    | user_time | 20      | NULL                               | 11832 | Using where |
|    1 | PRIMARY            | comment_log_comment | eq_ref | PRIMARY                      | PRIMARY   | 8       | commonswiki.logging.log_comment_id |     1 |             |
|    1 | PRIMARY            | user                | eq_ref | PRIMARY                      | PRIMARY   | 4       | commonswiki.logging.log_user       |     1 |             |
|    2 | DEPENDENT SUBQUERY | change_tag          | ref    | ct_log_id,change_tag_log_tag | ct_log_id | 5       | commonswiki.logging.log_id         |     1 | Using index |
+------+--------------------+---------------------+--------+------------------------------+-----------+---------+------------------------------------+-------+-------------+

I am going to upgrade MySQL there as well so we have the newest 10.1 version anyways.
This isn now resolved.
Thanks for reporting this @Josve05a

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM