Page MenuHomePhabricator

Wikidata produces a lot of failed requests for recentchanges API
Closed, ResolvedPublic

Description

Recently, WDQS servers started to get a lot of failed requests from www.wikidata.org for recentchanges API:

Aug 24 17:01:37 wdqs1003 wdqs-updater[26210]: Caused by: org.apache.http.NoHttpResponseException: www.wikidata.org:443 failed to respond

This slows down the update process and leads to WDQS lags. Also may indicate something is wrong with Wikidata servers, as this hasn't happened before.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Krinkle claimed this task.Sep 10 2018, 6:44 PM
Krinkle moved this task from Inbox to Doing on the Performance-Team board.

I'll investigate a bit to try to determine whether MediaWiki recentchanges may have regressed more generally or whether this might be specific to the Wikidata setup.

Gehel added a comment.Sep 11 2018, 5:59 PM

The issue as seen from WDQS can be followed on logstash.

Gehel added a comment.Sep 11 2018, 6:16 PM

It looks like there is a correlation between bot activity on wikidata query service (T202765) and the rate of those errors. This would tend to indicate that cause of this issue is load on wdqs and not slowdown on wikidata. I don't have any explanation of the causality chain except the correlation, so that might be completely wrong.

Doesn't seem to be WDQS related entirely - e.g. if I call 'https://www.wikidata.org/w/api.php?format=json&action=query&list=recentchanges&rcdir=newer&rcprop=title%7Cids%7Ctimestamp&rcnamespace=0%7C120&rclimit=100&continue=&rcstart=2018-09-14T00%3A00%3A00Z' - i.e. try to load 100 items from start of the day today - it takes 29 seconds:

real 0m29.500s
user 0m0.023s
sys 0m0.011s

I don't think it's normal for this operation to take this long. Same with different timestamp: 'https://www.wikidata.org/w/api.php?format=json&action=query&list=recentchanges&rcdir=newer&rcprop=title%7Cids%7Ctimestamp&rcnamespace=0%7C120&rclimit=100&continue=&rcstart=2018-09-14T14%3A00%3A00Z'

takes 0.3s. No idea why.

Change 460613 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] debug: Allow the DBQuery channel to be used

https://gerrit.wikimedia.org/r/460613

This query:

SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;

on mwmaint1001 takes 0.00 sec, on mwmaint2001 takes 56.50 sec!

Explains:

codfw:

1+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
2| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
3+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
4| 1 | SIMPLE | recentchanges | index | rc_timestamp,rc_ns_usertext,tmp_3,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp | rc_timestamp | 16 | NULL | 2654 | Using where |
5+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+

eqiad:
1+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------+--------------+---------+------+--------+------------------------------------+
2| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
3+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------+--------------+---------+------+--------+------------------------------------+
4| 1 | SIMPLE | recentchanges | range | rc_timestamp,rc_ns_usertext,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp | rc_timestamp | 16 | NULL | 833000 | Using index condition; Using where |
5+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------+--------------+---------+------+--------+------------------------------------+

Looks like codfw one does not use index. @jcrespo do you have any idea why that could happen?

Krinkle removed Krinkle as the assignee of this task.Sep 15 2018, 12:32 AM
Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.
Reedy added a subscriber: Reedy.Sep 15 2018, 12:51 AM

What db host are you running it against? There are specific hosts for recent changes as per the db-*.php files

		'db1099:3318' => 1, # B2 3.6TB 512GB # rc, log: s1 and s8
		'db1101:3318' => 1, # C2 3.6TB 512GB # rc, log: s7 and s8
		'db2085:3318' => 1,   # A5 3.3TB 512GB # rc, log: s1 and s8
		'db2086:3318' => 1,   # B1 3.3TB 512GB # rc, log: s7 and s8

There's some minor variation in mariadb versions; 1099 10.1.34, 1101 10.1.35, 2085 10.1.35, 2086 10.1.32

However, in all cases, I see Using where not Using index condition; Using where

I do see Using index condition; Using where in eqiad when no host is specified

EQIAD:

reedy@deploy1001:~$ sql wikidatawiki --host=db1101:3318
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 513658047
Server version: 10.1.35-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [wikidatawiki]> explain SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
| id   | select_type | table         | type  | possible_keys                                                                                               | key          | key_len | ref  | rows | Extra       |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
|    1 | SIMPLE      | recentchanges | index | rc_timestamp,rc_ns_usertext,tmp_3,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp | rc_timestamp | 16      | NULL | 2507 | Using where |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
1 row in set (0.02 sec)

MariaDB [wikidatawiki]> Bye
reedy@deploy1001:~$ sql wikidatawiki --host=db1099:3318
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 905507227
Server version: 10.1.34-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [wikidatawiki]> explain SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
| id   | select_type | table         | type  | possible_keys                                                                                               | key          | key_len | ref  | rows | Extra       |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
|    1 | SIMPLE      | recentchanges | index | rc_timestamp,rc_ns_usertext,tmp_3,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp | rc_timestamp | 16      | NULL | 2557 | Using where |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
1 row in set (0.03 sec)

MariaDB [wikidatawiki]> Bye
reedy@deploy1001:~$ sql wikidatawiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3385663373
Server version: 10.1.34-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [wikidatawiki]> explain SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------+--------------+---------+------+--------+------------------------------------+
| id   | select_type | table         | type  | possible_keys                                                                                         | key          | key_len | ref  | rows   | Extra                              |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------+--------------+---------+------+--------+------------------------------------+
|    1 | SIMPLE      | recentchanges | range | rc_timestamp,rc_ns_usertext,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp | rc_timestamp | 16      | NULL | 832070 | Using index condition; Using where |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------+--------------+---------+------+--------+------------------------------------+
1 row in set (0.00 sec)

MariaDB [wikidatawiki]> show variables like 'hostname';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| hostname      | db1104 |
+---------------+--------+
1 row in set (0.01 sec)

MariaDB [wikidatawiki]>

CODFW:

reedy@deploy2001:~$ sql wikidatawiki --host=db2086:3318
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 62186542
Server version: 10.1.32-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [wikidatawiki]> explain SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
| id   | select_type | table         | type  | possible_keys                                                                                               | key          | key_len | ref  | rows | Extra       |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
|    1 | SIMPLE      | recentchanges | index | rc_timestamp,rc_ns_usertext,tmp_3,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp | rc_timestamp | 16      | NULL | 2653 | Using where |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
1 row in set (0.01 sec)

MariaDB [wikidatawiki]> Bye
reedy@deploy2001:~$ sql wikidatawiki --host=db2085:3318
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 56689485
Server version: 10.1.35-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [wikidatawiki]> SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
^CCtrl-C -- query killed. Continuing normally.
ERROR 1317 (70100): Query execution was interrupted
MariaDB [wikidatawiki]> explain SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
| id   | select_type | table         | type  | possible_keys                                                                                               | key          | key_len | ref  | rows | Extra       |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
|    1 | SIMPLE      | recentchanges | index | rc_timestamp,rc_ns_usertext,tmp_3,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp | rc_timestamp | 16      | NULL | 2600 | Using where |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
1 row in set (0.00 sec)

MariaDB [wikidatawiki]> Bye
reedy@deploy2001:~$ sql wikidatawiki
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 167966653
Server version: 10.1.32-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [wikidatawiki]> explain SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
| id   | select_type | table         | type  | possible_keys                                                                                               | key          | key_len | ref  | rows | Extra       |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
|    1 | SIMPLE      | recentchanges | index | rc_timestamp,rc_ns_usertext,tmp_3,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp | rc_timestamp | 16      | NULL | 2498 | Using where |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
1 row in set (0.00 sec)

MariaDB [wikidatawiki]> show variables like 'hostname';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| hostname      | db2082 |
+---------------+--------+
1 row in set (0.00 sec)

MariaDB [wikidatawiki]>

There's definitely index differences between the hosts. The ones without Use index condition seem to have indexes called tmp_2 and tmp_3. The ones using Using index condition; Using where don't have those two

EQIAD

MariaDB [wikidatawiki]> show variables like 'hostname';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| hostname      | db1104 |
+---------------+--------+
1 row in set (0.01 sec)

MariaDB [wikidatawiki]> show indexes from recentchanges;
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name                         | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| recentchanges |          0 | PRIMARY                          |            1 | rc_id        | A         |    19266664 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_timestamp                     |            1 | rc_timestamp | A         |     4816666 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_cur_id                        |            1 | rc_cur_id    | A         |    19266664 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            1 | rc_new       | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            2 | rc_namespace | A         |        3367 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            3 | rc_timestamp | A         |     6422221 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ip                            |            1 | rc_ip        | A         |      152910 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_usertext                   |            1 | rc_namespace | A         |        1691 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_usertext                   |            2 | rc_user_text | A         |      100872 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_user_text                     |            1 | rc_user_text | A         |      170501 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_user_text                     |            2 | rc_timestamp | A         |    19266664 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            1 | rc_namespace | A         |        2514 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            2 | rc_type      | A         |        3353 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            3 | rc_patrolled | A         |        3353 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            4 | rc_timestamp | A         |     9633332 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_actor                      |            1 | rc_namespace | A         |        2277 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_actor                      |            2 | rc_actor     | A         |        2277 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_actor                         |            1 | rc_actor     | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_actor                         |            2 | rc_timestamp | A         |     4816666 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            1 | rc_namespace | A         |        2145 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            2 | rc_title     | A         |    19266664 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            3 | rc_timestamp | A         |    19266664 |     NULL | NULL   |      | BTREE      |         |               |
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
22 rows in set (0.00 sec)

MariaDB [wikidatawiki]> Bye
reedy@deploy1001:~$ sql wikidatawiki --host=db1099:3318
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 905507480
Server version: 10.1.34-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [wikidatawiki]> show indexes from recentchanges;
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name                         | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| recentchanges |          0 | PRIMARY                          |            1 | rc_id        | A         |    19255601 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_timestamp                     |            1 | rc_timestamp | A         |     6418533 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_cur_id                        |            1 | rc_cur_id    | A         |    19255601 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            1 | rc_new       | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            2 | rc_namespace | A         |        3483 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            3 | rc_timestamp | A         |     6418533 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ip                            |            1 | rc_ip        | A         |      163183 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_usertext                   |            1 | rc_namespace | A         |        1695 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_usertext                   |            2 | rc_user_text | A         |      103524 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_user_text                     |            1 | rc_user_text | A         |      207049 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_user_text                     |            2 | rc_timestamp | A         |    19255601 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_2                            |            1 | rc_bot       | A         |         972 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_2                            |            2 | rc_timestamp | A         |     9627800 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_3                            |            1 | rc_namespace | A         |        1783 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_3                            |            2 | rc_timestamp | A         |     6418533 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            1 | rc_namespace | A         |        2483 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            2 | rc_type      | A         |        3313 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            3 | rc_patrolled | A         |        4966 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            4 | rc_timestamp | A         |     9627800 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_actor                      |            1 | rc_namespace | A         |        2432 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_actor                      |            2 | rc_actor     | A         |        2432 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_actor                         |            1 | rc_actor     | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_actor                         |            2 | rc_timestamp | A         |     6418533 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            1 | rc_namespace | A         |        3218 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            2 | rc_title     | A         |    19255601 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            3 | rc_timestamp | A         |    19255601 |     NULL | NULL   |      | BTREE      |         |               |
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
26 rows in set (0.00 sec)

MariaDB [wikidatawiki]> Bye
reedy@deploy1001:~$

All in codfw show the same...

CODFW

MariaDB [wikidatawiki]> show variables like 'hostname';
+---------------+--------+
| Variable_name | Value  |
+---------------+--------+
| hostname      | db2082 |
+---------------+--------+
1 row in set (0.00 sec)

MariaDB [wikidatawiki]> show indexes from recentchanges;
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name                         | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| recentchanges |          0 | PRIMARY                          |            1 | rc_id        | A         |    19705640 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_timestamp                     |            1 | rc_timestamp | A         |     6568546 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_cur_id                        |            1 | rc_cur_id    | A         |    19705640 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            1 | rc_new       | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            2 | rc_namespace | A         |        1720 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            3 | rc_timestamp | A         |     6568546 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ip                            |            1 | rc_ip        | A         |       59176 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_usertext                   |            1 | rc_namespace | A         |        2385 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_usertext                   |            2 | rc_user_text | A         |      410534 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_user_text                     |            1 | rc_user_text | A         |      298570 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_user_text                     |            2 | rc_timestamp | A         |    19705640 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_2                            |            1 | rc_bot       | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_2                            |            2 | rc_timestamp | A         |     9852820 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_3                            |            1 | rc_namespace | A         |        1740 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_3                            |            2 | rc_timestamp | A         |     6568546 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            1 | rc_namespace | A         |        2507 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            2 | rc_type      | A         |        2507 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            3 | rc_patrolled | A         |        3341 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            4 | rc_timestamp | A         |     9852820 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_actor                      |            1 | rc_namespace | A         |        2246 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_actor                      |            2 | rc_actor     | A         |        2246 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_actor                         |            1 | rc_actor     | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_actor                         |            2 | rc_timestamp | A         |     6568546 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            1 | rc_namespace | A         |        1560 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            2 | rc_title     | A         |    19705640 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            3 | rc_timestamp | A         |    19705640 |     NULL | NULL   |      | BTREE      |         |               |
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
26 rows in set (0.00 sec)

MariaDB [wikidatawiki]> Bye
reedy@deploy2001:~$ sql wikidatawiki --host=db2085:3318
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 56759524
Server version: 10.1.35-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [wikidatawiki]> show indexes from recentchanges;
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name                         | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| recentchanges |          0 | PRIMARY                          |            1 | rc_id        | A         |    19754005 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_timestamp                     |            1 | rc_timestamp | A         |     4938501 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_cur_id                        |            1 | rc_cur_id    | A         |    19754005 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            1 | rc_new       | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            2 | rc_namespace | A         |        3414 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | new_name_timestamp               |            3 | rc_timestamp | A         |     9877002 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ip                            |            1 | rc_ip        | A         |      221955 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_usertext                   |            1 | rc_namespace | A         |        2359 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_usertext                   |            2 | rc_user_text | A         |      470333 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_user_text                     |            1 | rc_user_text | A         |      263386 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_user_text                     |            2 | rc_timestamp | A         |    19754005 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_2                            |            1 | rc_bot       | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_2                            |            2 | rc_timestamp | A         |     9877002 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_3                            |            1 | rc_namespace | A         |        2616 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | tmp_3                            |            2 | rc_timestamp | A         |     6584668 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            1 | rc_namespace | A         |        2443 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            2 | rc_type      | A         |        3258 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            3 | rc_patrolled | A         |        3258 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_name_type_patrolled_timestamp |            4 | rc_timestamp | A         |     6584668 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_actor                      |            1 | rc_namespace | A         |        1102 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_ns_actor                      |            2 | rc_actor     | A         |        1102 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_actor                         |            1 | rc_actor     | A         |           2 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_actor                         |            2 | rc_timestamp | A         |     4938501 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            1 | rc_namespace | A         |        1557 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            2 | rc_title     | A         |    19754005 |     NULL | NULL   |      | BTREE      |         |               |
| recentchanges |          1 | rc_namespace_title_timestamp     |            3 | rc_timestamp | A         |    19754005 |     NULL | NULL   |      | BTREE      |         |               |
+---------------+------------+----------------------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
26 rows in set (0.00 sec)

MariaDB [wikidatawiki]> 
`

@Reedy I am not sure which host, I just logged in to maintenance host for eqiad and codfw. Lookups show db2082.codfw.wmnet and db1092.eqiad.wmnet.

I tried db2085:3318 and the result the same as other codfw host. So if that's what actual API is using, that could be the reason why it is so slow.

Reedy added a comment.Sep 15 2018, 1:09 AM

I tried db2085:3318 and the result the same as other codfw host. So if that's what actual API is using, that could be the reason why it is so slow.

It should be the rc hosts it's using for those queries

Krinkle added a comment.EditedSep 15 2018, 3:37 AM

Earlier when I was investigating with Stas, performing the slow recentchanges api request with X-Wikimedia-Debug: log shows the following in Logstash (also: Xhgui profile):

[DEBUG] [DBConnection] Wikimedia\Rdbms\LoadBalancer::getReaderIndex: using server db2080 for group 'api'

Which indeed shows that it is using a dbhost from the "api" group instead of the "rc" group.

The actual query:

[DEBUG] [DBQuery]

SELECT rc_id,rc_timestamp, rc_namespace,rc_title,rc_cur_id, rc_type,rc_deleted, rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101

runtime: 27.811768054962 seconds

I do note that while I can reproduce maybe 8 of 10 times with the codfw mwdebug hosts, I could not on the eqiad mwdebug hosts.

Stas found the same when executing it from the command-line, that it is slow from mwmaint2001.codfw but normal from mwmaint1001.eqiad.

I haven't looked further, but I would assume that either means that in eqiad it is correctly picking the "rc" group, or that in eqiad the query is fast on dbhosts from the "api" group.

Marostegui added subscribers: Banyek, Marostegui.EditedSep 15 2018, 6:03 AM

So taking a quick look at the schema differences between rc groups in different DCs, they look the same:

The rc slaves in codfw:

root@neodymium:~# diff -u <(mysql.py -hdb2085:3318 -e "show create table recentchanges\G" wikidatawiki) <(mysql.py -hdb2086:3318 -e "show create table recentchanges\G" wikidatawiki)
root@neodymium:~#

The rc slaves in eqiad:

root@neodymium:~# diff -u <(mysql.py -hdb1101:3318 -e "show create table recentchanges\G" wikidatawiki) <(mysql.py -hdb1099:3318 -e "show create table recentchanges\G" wikidatawiki)
root@neodymium:~#

eqiad vs codfw:

root@neodymium:~# diff -u <(mysql.py -hdb1101:3318 -e "show create table recentchanges\G" wikidatawiki) <(mysql.py -hdb2085:3318 -e "show create table recentchanges\G" wikidatawiki)
root@neodymium:~#

API slaves eqiad:

root@neodymium:~# diff -u <(mysql.py -hdb1104 -e "show create table recentchanges\G" wikidatawiki) <(mysql.py -hdb1092 -e "show create table recentchanges\G" wikidatawiki)
root@neodymium:~#

API slaves codfw:

root@neodymium:~# diff -u <(mysql.py -hdb2081 -e "show create table recentchanges\G" wikidatawiki) <(mysql.py -hdb2080 -e "show create table recentchanges\G" wikidatawiki)
root@neodymium:~#

And as @Reedy points out, there are indexes differences between eqiad and codfw for the API hosts
eqiad vs codfw:

root@neodymium:~# diff -u <(mysql.py -hdb2081 -e "show create table recentchanges\G" wikidatawiki) <(mysql.py -hdb1104 -e "show create table recentchanges\G" wikidatawiki)
--- /dev/fd/63	2018-09-15 05:31:53.629312288 +0000
+++ /dev/fd/62	2018-09-15 05:31:53.629312288 +0000
@@ -34,10 +34,8 @@
   KEY `rc_ip` (`rc_ip`),
   KEY `rc_ns_usertext` (`rc_namespace`,`rc_user_text`),
   KEY `rc_user_text` (`rc_user_text`,`rc_timestamp`),
-  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
-  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
   KEY `rc_name_type_patrolled_timestamp` (`rc_namespace`,`rc_type`,`rc_patrolled`,`rc_timestamp`),

Those two indexes tmp2 and tmp3 (T202167) are present on codfw but not in eqiad, maybe they are messing up the query plan?

What is interesting to note is that the query plan for codfw is supposedly to better than the one in eqiad by looking at the number of rows for the API slaves:

root@db2081.codfw.wmnet[wikidatawiki]> explain SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
| id   | select_type | table         | type  | possible_keys                                                                                               | key          | key_len | ref  | rows | Extra       |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
|    1 | SIMPLE      | recentchanges | index | rc_timestamp,rc_ns_usertext,tmp_3,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp | rc_timestamp | 16      | NULL | 1767 | Using where |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------------+--------------+---------+------+------+-------------+
1 row in set (0.04 sec)

And:

root@db1092.eqiad.wmnet[wikidatawiki]> explain SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------+--------------+---------+------+---------+------------------------------------+
| id   | select_type | table         | type  | possible_keys                                                                                         | key          | key_len | ref  | rows    | Extra                              |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------+--------------+---------+------+---------+------------------------------------+
|    1 | SIMPLE      | recentchanges | range | rc_timestamp,rc_ns_usertext,rc_name_type_patrolled_timestamp,rc_ns_actor,rc_namespace_title_timestamp | rc_timestamp | 16      | NULL | 1079546 | Using index condition; Using where |
+------+-------------+---------------+-------+-------------------------------------------------------------------------------------------------------+--------------+---------+------+---------+------------------------------------+
1 row in set (0.00 sec)

Even for the same index chosen rc_timestamp. Even if the tmp_2 and tmp_3 are not chosen (or looks like they wouldn't be chosen as per the explain above), those tmp_2, and tmp_3 indexes are messing up with the optimizer. So maybe they are actually being chosen by the optimizer.
The query runtimes are:

eqiad

101 rows in set (0.01 sec)

codfw:

101 rows in set (27.94 sec)

But what if we ignore those tmp_2 and tmp_3 indexes in codfw? Even though they are not supposed to be used (as per the explain)?:
Then the query runs as fast as it does in eqiad:

root@db2081.codfw.wmnet[wikidatawiki]> SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges` IGNORE INDEX(tmp_2,tmp_3)    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
+-----------+----------------+--------------+-----------+-----------+---------+------------+---------------+---------------+
| rc_id     | rc_timestamp   | rc_namespace | rc_title  | rc_cur_id | rc_type | rc_deleted | rc_this_oldid | rc_last_oldid |
+-----------+----------------+--------------+-----------+-----------+---------+------------+---------------+---------------+

<snip>
101 rows in set (0.04 sec)

The solution then is to drop those two indexes on recentchanges table, which are not even on tables.sql (or on eqiad).
@Reedy @Krinkle @Smalyshev I assume nothing else in code is forced to use those indexes, as that would have been failing in eqiad since the indexes aren't present there but, maybe it is worth a quick grep in code just in case?. Specially for recentchanges related queries (as those indexes do exist on eqiad rc group slaves).
I could delete those two indexes from one of the API hosts on Monday, check that nothing really breaks and then go ahead and kill it from the other one?
And then we should also check and delete them from eqiad too?. These are the hosts with the indexes:

root@neodymium:/home/marostegui# ./section s8 | while read host port ; do echo "$host:$port" ; mysql.py -h$host:$port -e "show create table recentchanges\G" wikidatawiki | egrep "tmp_2|tmp_3" ; done
labsdb1011.eqiad.wmnet:3306
labsdb1010.eqiad.wmnet:3306
labsdb1009.eqiad.wmnet:3306
dbstore2001.codfw.wmnet:3318
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
dbstore1002.eqiad.wmnet:3306
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db2094.codfw.wmnet:3318
db2086.codfw.wmnet:3318
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db2085.codfw.wmnet:3318
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db2083.codfw.wmnet:3306
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db2082.codfw.wmnet:3306
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db2081.codfw.wmnet:3306
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db2080.codfw.wmnet:3306
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db2079.codfw.wmnet:3306
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db2045.codfw.wmnet:3306
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db1124.eqiad.wmnet:3318
db1109.eqiad.wmnet:3306
db1104.eqiad.wmnet:3306
db1101.eqiad.wmnet:3318 -----> RC group
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db1099.eqiad.wmnet:3318 -----> RC group
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db1092.eqiad.wmnet:3306
db1087.eqiad.wmnet:3306
db1071.eqiad.wmnet:3306
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),

Doing a quick check on one of rc slaves sys schema shows that that index might actually being used (this is not 100% real, but it is a pretty good indication):

root@db2085.codfw.wmnet[sys]> select * from schema_unused_indexes where index_name like '%tmp%';
Empty set (0.06 sec)

On an API host it is the same:

root@db2081.codfw.wmnet[sys]> select * from schema_unused_indexes where index_name like '%tmp%';
Empty set (0.06 sec)

My proposal for now would be to leave codfw as eqiad.
That is, leaving only that index on the rc replicas until we can better analyze if it is really used there.

So we'd drop the index from all codfw except db2085 and db2086 (also drop it from db1071 (eqiad master))

@Marostegui So to confirm, recentchanges db hosts are the same within and between eqiad/codfw. But the api db hosts are different, right? Only api dbhosts in codfw have the two extra indexes. Is that right?

Also, are there differences between recentchanges db hosts and api db hosts?

Marostegui added a comment.EditedSep 16 2018, 6:49 PM

@Marostegui So to confirm, recentchanges db hosts are the same within and between eqiad/codfw. But the api db hosts are different, right? Only api dbhosts in codfw have the two extra indexes. Is that right?

Correct, recentchanges hosts are the same between both DCs.
Correct, API db hosts in codfw have two indexes that eqiad ones do not have.

Also, are there differences between recentchanges db hosts and api db hosts?

Yes, recentchanges hosts (in both DCs) have the two indexes whereas API hosts only in codfw have the two indexes.

To sum up:

codfw
All hosts in have the indexes.

eqiad:
Only recentchanges hosts and the master have the indexes

jcrespo moved this task from Triage to In progress on the DBA board.Sep 17 2018, 4:11 PM

The plan is for us dbas to test setting up a single API with the same structure than eqiad and do all assuming that fixies it, and later we will have to evaluate what is the right long-term status, given some unknowns and related tasks such as T202167:

One note: This index might be already existing but under name of tmp_1 or tmp_2 or tmp_3. I don't know how complex renaming an index is but that'd be useful.

The plan is for us dbas to test setting up a single API with the same structure than eqiad and do all assuming that fixies it, and later we will have to evaluate what is the right long-term status, given some unknowns and related tasks such as T202167:

To be more precise, my idea is to leave eqiad and codfw exactly the same (that is to leave the rc codfw replicas with the indexes, and remove it from everywhere in codfw).
After that, we will create a task to see what's the future of those indexes, as they are spread across other wikis and hosts.

I am a bit confused by now - is the original problem because recentchanges is using a wrong host, or it's using right host and the indexes there are wrong, or something else? And how can it be fixed? WDQS poller depends on RC API, and having it take 30+ seconds instead of usual sub-second response time is a serious issue.

I am a bit confused by now - is the original problem because recentchanges is using a wrong host, or it's using right host and the indexes there are wrong, or something else? And how can it be fixed? WDQS poller depends on RC API, and having it take 30+ seconds instead of usual sub-second response time is a serious issue.

The problem is with the API hosts in codfw. They have an index that the eqiad slaves do not have, and looks like they are messing up with the optimizer and getting the query to run for long.
We are going to start getting rid of it, to leave eqiad and codfw with the same schema.

Change 461025 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2080

https://gerrit.wikimedia.org/r/461025

Change 461025 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2080

https://gerrit.wikimedia.org/r/461025

Mentioned in SAL (#wikimedia-operations) [2018-09-18T05:08:25Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool db2080 - T202764 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2018-09-18T05:08:56Z] <marostegui> Drop tmp_2 and tmp_3 index from wikidatawiki.recentchanges on db2080 - T202764

Mentioned in SAL (#wikimedia-operations) [2018-09-18T05:14:49Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool db2080 - T202764 (duration: 00m 49s)

Change 461027 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2081

https://gerrit.wikimedia.org/r/461027

Change 461027 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2081

https://gerrit.wikimedia.org/r/461027

Mentioned in SAL (#wikimedia-operations) [2018-09-18T05:22:45Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Depool db2081 - T202764 (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2018-09-18T05:22:53Z] <marostegui> Drop tmp_2 and tmp_3 index from wikidatawiki.recentchanges on db2081 - T202764

@Smalyshev - the indexes have been removed from the API hosts.
The queries on those two servers (db2080 and db2081) now take around 0.05 sec to run. Can you check if this makes a difference from your end?.
Keep in mind that the indexes still exists on other hosts (recentchanges and main traffic ones).

Mentioned in SAL (#wikimedia-operations) [2018-09-18T05:28:56Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Repool db2081 - T202764 (duration: 00m 49s)

The API requests for recentchanges now seem to be faster, but I still get exceptions in the log :( I also get a bunch of errors for Wikidata URLs like: https://www.wikidata.org/wiki/Special:EntityData/Q33799921.ttl?nocache=1537250691109&flavor=dump
These are supposed to be pretty fast but still produce "no response" sometimes. I'll try to see what else can be causing those. Individual requests that I am testing seem to be fine, but I wonder if it's possible that the request still occasionally uses the DB host with wrong index?

The API requests for recentchanges now seem to be faster, but I still get exceptions in the log :( I also get a bunch of errors for Wikidata URLs like: https://www.wikidata.org/wiki/Special:EntityData/Q33799921.ttl?nocache=1537250691109&flavor=dump
These are supposed to be pretty fast but still produce "no response" sometimes. I'll try to see what else can be causing those. Individual requests that I am testing seem to be fine, but I wonder if it's possible that the request still occasionally uses the DB host with wrong index?

Not sure how's the logic behind that,and if they only use API hosts in db-codfw.php file, but maybe for some reason they are failing back to some main traffic hosts, and those still have the indexes.
I am going to go ahead now and leave s8 codfw exactly as eqiad is. In the meantime, if you can dig a bit more and try to find out which hosts are those...that'd be good.

Thanks

Mentioned in SAL (#wikimedia-operations) [2018-09-18T06:21:10Z] <marostegui> Drop tmp_2 and tmp_3 index from wikidatawiki.recentchanges on dbstore2001, db2079, db2082,db2083 - T202764

Marostegui added a comment.EditedSep 18 2018, 6:25 AM

@Smalyshev eqiad and codfw are now the same.
The index only exists on recentchanges replicas and the masters (you can ignore dbstore1002, it is not used in production).

root@neodymium:/home/marostegui# ./section s8  | while read host port ; do echo "$host:$port" ; mysql.py -h$host:$port -e "show create table recentchanges\G" wikidatawiki | egrep "tmp_2|tmp_3" ; done
labsdb1011.eqiad.wmnet:3306
labsdb1010.eqiad.wmnet:3306
labsdb1009.eqiad.wmnet:3306
dbstore2001.codfw.wmnet:3318
dbstore1002.eqiad.wmnet:3306
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db2094.codfw.wmnet:3318
db2086.codfw.wmnet:3318
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db2085.codfw.wmnet:3318
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db2083.codfw.wmnet:3306
db2082.codfw.wmnet:3306
db2081.codfw.wmnet:3306
db2080.codfw.wmnet:3306
db2079.codfw.wmnet:3306
db2045.codfw.wmnet:3306
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db1124.eqiad.wmnet:3318
db1109.eqiad.wmnet:3306
db1104.eqiad.wmnet:3306
db1101.eqiad.wmnet:3318
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db1099.eqiad.wmnet:3318
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
db1092.eqiad.wmnet:3306
db1087.eqiad.wmnet:3306
db1071.eqiad.wmnet:3306
  KEY `tmp_2` (`rc_bot`,`rc_timestamp`),
  KEY `tmp_3` (`rc_namespace`,`rc_timestamp`),
Addshore moved this task from incoming to monitoring on the Wikidata board.Sep 18 2018, 2:37 PM

@Smalyshev have you noticed any improvements since the above comment was done, and the index is gone from everywhere but the recetchanges slaves (like in eqiad)?

Looking at logstash: https://logstash.wikimedia.org/goto/39a6fe9edd787798129b66ae9d61ed90 there's definitely a drop in timeouts, but there are still present, so I will monitor this further.

Looking at logstash: https://logstash.wikimedia.org/goto/39a6fe9edd787798129b66ae9d61ed90 there's definitely a drop in timeouts, but there are still present, so I will monitor this further.

By looking at that graph but expanded for the last 15 days the amount of errors is similar that the one eqiad had before we switched it over. Which is consistent with the schema now codfw has.
I would consider this "resolved" as in, it is now in the same state as eqiad is.

I will create a task to follow up and decide what to do with tmp_2 and tmp_3 indexes in general across the cluster, as they are appearing on some other wikis and hosts.

Krinkle assigned this task to Smalyshev.Sep 19 2018, 3:15 PM
Krinkle moved this task from Watching to Perf issue on the Performance-Team (Radar) board.

Considering this done from our team, but keeping open for now because the WDQS issues appear unresolved still. Assigning back to Stas. Let me know if there are another angles in MW we need to investigate (either for perf or other teams).

Marostegui moved this task from In progress to Done on the DBA board.Sep 19 2018, 3:16 PM

300 errors in the last 24h, I think we are good?

Smalyshev lowered the priority of this task from High to Normal.Sep 20 2018, 11:34 PM

Vast majority of remaining failures is from wdqs2003: https://logstash.wikimedia.org/goto/fe077467d39c2ee03ce8127bdca517ae

I am not sure why it is so anomalous, might be something that warrants further investigation but so far not a huge deal for the rest of the service. So I'll keep it open for now but everybody that is not interested in the particular anomaly please feel free to unsubscribe.

Smalyshev closed this task as Resolved.Oct 3 2018, 7:21 AM

Looks like the problems are gone for now.

Smalyshev reopened this task as Open.Oct 13 2018, 8:06 PM
Smalyshev raised the priority of this task from Normal to High.

This started happening again - the events jumped starting Oct 11 and are continuing now.

eqiad db API hosts return the query in less than a second as they have the correct schema:

root@db1104.eqiad.wmnet[wikidatawiki]> select @@hostname;
+------------+
| @@hostname |
+------------+
| db1104     |
+------------+
1 row in set (0.00 sec)

root@db1104.eqiad.wmnet[wikidatawiki]> SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
+-----------+----------------+--------------+-----------+-----------+---------+------------+---------------+---------------+
| rc_id     | rc_timestamp   | rc_namespace | rc_title  | rc_cur_id | rc_type | rc_deleted | rc_this_oldid | rc_last_oldid |
+-----------+----------------+--------------+-----------+-----------+---------+------------+---------------+---------------+
| 781488474 | 20180914110000 |            0 | Q11614745 |  12755105 |       0 |          0 |     745861888 |     745695128 |
| 781488475 | 20180914110000 |            0 | Q49559730 |  50562117 |       0 |          0 |     745861890 |     708599303 |
| 781488476 | 20180914110000 |            0 | Q49909619 |  50909548 |       0 |          0 |     745861889 |     745861887 |
| 781488477 | 20180914110000 |            0 | Q34205543 |  35645575 |       0 |          0 |     745861891 |     743199425 |
<snip>
101 rows in set (0.00 sec)





root@db1092.eqiad.wmnet[wikidatawiki]> select @@hostname;
+------------+
| @@hostname |
+------------+
| db1092     |
+------------+
1 row in set (0.00 sec)

root@db1092.eqiad.wmnet[wikidatawiki]> SELECT  rc_id,rc_timestamp,rc_namespace,rc_title,rc_cur_id,rc_type,rc_deleted,rc_this_oldid,rc_last_oldid  FROM `recentchanges`    WHERE (rc_timestamp>='20180914110000') AND rc_namespace IN ('0','120')  AND rc_type IN ('0','1','3','6')   ORDER BY rc_timestamp ASC,rc_id ASC LIMIT 101  ;
+-----------+----------------+--------------+-----------+-----------+---------+------------+---------------+---------------+
| rc_id     | rc_timestamp   | rc_namespace | rc_title  | rc_cur_id | rc_type | rc_deleted | rc_this_oldid | rc_last_oldid |
+-----------+----------------+--------------+-----------+-----------+---------+------------+---------------+---------------+
| 781488474 | 20180914110000 |            0 | Q11614745 |  12755105 |       0 |          0 |     745861888 |     745695128 |
| 781488475 | 20180914110000 |            0 | Q49559730 |  50562117 |       0 |          0 |     745861890 |     708599303 |
| 781488476 | 20180914110000 |            0 | Q49909619 |  50909548 |       0 |          0 |     745861889 |     745861887 |
| 781488477 | 20180914110000 |            0 | Q34205543 |  35645575 |       0 |          0 |     745861891 |     743199425 |
| 781488478 | 20180914110000 |            0 | Q49909619 |  50909548 |       0 |          0 |     745861892 |     745861889 |
<snip>

101 rows in set (0.01 sec)
Smalyshev removed Smalyshev as the assignee of this task.Oct 23 2018, 11:19 PM

Change 460613 merged by jenkins-bot:
[mediawiki/core@master] debug: Allow the DBQuery channel to be used

https://gerrit.wikimedia.org/r/460613

Can someone add me to T202765, mentioned above by @Gehel? The issue that @Smalyshev created last week seems to involve requests that are never getting to wikidata, which suggests a wdqs related issue of some sort (and thus potentially related to spikes in resource use or something like that.)

*is also interested in what T202765 is*

It's so restricted that even with access to all security issues, I still don't have access to it.

Gehel added a comment.Nov 1 2018, 7:57 PM

For context, T202765 is about a bot sending annoying and somewhat expensive requests. That specific issue is now resolved.

T202765 is about specific bot misbehavior, and was restricted because it contains samples of the logs and details about specific bot, which can contain PII. It is OK to see those for people under NDA, but it should not be public.

Change 471737 had a related patch set uploaded (by Gehel; owner: Gehel):
[wikidata/query/rdf@master] Evict idle connections from HTTP pool.

https://gerrit.wikimedia.org/r/471737

Change 471982 had a related patch set uploaded (by Gehel; owner: Gehel):
[wikidata/query/rdf@master] Instrument HTTP connection manager to expose JMX metrics.

https://gerrit.wikimedia.org/r/471982

Change 471737 merged by jenkins-bot:
[wikidata/query/rdf@master] Evict idle connections from HTTP pool.

https://gerrit.wikimedia.org/r/471737

Change 471982 merged by jenkins-bot:
[wikidata/query/rdf@master] Instrument HTTP connection manager to expose JMX metrics.

https://gerrit.wikimedia.org/r/471982

@Gehel have the patches referenced above been deployed?

@Imarlier yes the patches have been deployed though we don't use RC API now for production so T207718 is more important than this one.

Trying to run Updater on labs (for T206636) where there's no Kafka, I still get these errors all the time. So the problem does not seem to be solved.

Smalyshev closed this task as Resolved.Feb 20 2019, 11:44 PM
Smalyshev claimed this task.

Doesn't seem to happen anymore, so closing it.