list=logevents slow for users with last log action long time ago
Open, NormalPublic

Description

Loading https://de.wikipedia.org/w/api.php?action=query&list=logevents&lelimit=1&leprop=timestamp&leuser=Gardini&format=jsonfm takes 114 seconds for me (append some bogus parameter if you hit the cache), and I see similar results for other users who have been inactive for a long time. OTOH https://de.wikipedia.org/wiki/Spezial:Logbuch/Gardini is fast, even though it provides more information.


Version: 1.24rc
Severity: normal
URL: https://de.wikipedia.org/w/api.php?action=query&list=logevents&lelimit=1&leprop=timestamp&leuser=Gardini&format=jsonfm

Details

Reference
bz69222
bzimport raised the priority of this task to Normal.
bzimport set Reference to bz69222.
Schnark created this task.Aug 7 2014, 10:23 AM
Anomie added a comment.EditedAug 7 2014, 12:35 PM

Sean, this seems to be a database issue. The database query here is:

SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2

This seems to run fine on db1026, db1021, and db1045; EXPLAIN says it's using the user_time index, which makes sense. On db1005, db1037, and db1049, EXPLAIN says it uses the times index and executes slowly.
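The user_time plan is the sensible one because an index whose leading column matches the equality filter (log_user) and whose next column matches the ORDER BY (log_timestamp) can answer the query with a single seek and no sort. A minimal sketch of the same shape, using SQLite as a stand-in for MariaDB with a simplified, hypothetical schema (not the production one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logging (
    log_id        INTEGER PRIMARY KEY,
    log_user      INTEGER,
    log_timestamp TEXT
);
-- analogue of MariaDB's user_time index
CREATE INDEX user_time ON logging (log_user, log_timestamp);
-- analogue of the times index the slow replicas picked
CREATE INDEX times ON logging (log_timestamp);
""")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT log_id FROM logging"
    " WHERE log_user = 103726"
    " ORDER BY log_timestamp DESC LIMIT 2"
).fetchall()

# The planner seeks into user_time on log_user and reads rows already
# in log_timestamp order, so no separate sort step appears in the plan.
details = " ".join(row[-1] for row in plan)
print(details)
```

The equivalent bad plan (walking the whole times index until two matching rows turn up) is cheap for active users but scans nearly the entire index for users whose last log action was years ago, which matches the reported symptom.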

db1005
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => index
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => times
    [key_len] => 16
    [ref] => 
    [rows] => 6716
    [Extra] => Using where
)
db1026
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => ref
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => user_time
    [key_len] => 4
    [ref] => const
    [rows] => 21108
    [Extra] => Using where
)
db1021
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => ref
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => user_time
    [key_len] => 4
    [ref] => const
    [rows] => 21108
    [Extra] => Using where
)
db1037
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => index
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => times
    [key_len] => 16
    [ref] => 
    [rows] => 6194
    [Extra] => Using where
)
db1045
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => ref
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => user_time
    [key_len] => 4
    [ref] => const
    [rows] => 21108
    [Extra] => Using where
)
db1049
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => index
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => times
    [key_len] => 16
    [ref] => 
    [rows] => 6497
    [Extra] => Using where
)
Anomie moved this task from Unsorted to Non-Code on the MediaWiki-API board.Feb 19 2015, 6:33 PM

Now I get an error for the query (and for other similar [= where a user has been inactive for a long time] queries):

{
    "servedby": "mw1203",
    "error": {
        "code": "internal_api_error_DBQueryError",
        "info": "[c0f53a23] Database query error"
    }
}

That's because a database timeout was added at some point, to kill such long-running queries rather than allowing them to continue to consume resources.
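As an illustration of the idea only (the WMF setup uses a server-side statement time limit in MariaDB, not this mechanism), a watchdog that kills a runaway query can be sketched with SQLite's progress handler:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The handler runs every N virtual-machine instructions; returning a
# non-zero value aborts the current statement with an "interrupted"
# OperationalError, analogous to the server killing a slow query.
budget = {"ticks": 200}
def watchdog():
    budget["ticks"] -= 1
    return 1 if budget["ticks"] <= 0 else 0

conn.set_progress_handler(watchdog, 1000)

killed = False
try:
    # A deliberately expensive statement standing in for the slow
    # logging query: count ten million generated rows.
    conn.execute(
        "WITH RECURSIVE t(x) AS"
        " (SELECT 1 UNION ALL SELECT x + 1 FROM t LIMIT 10000000)"
        " SELECT count(*) FROM t"
    ).fetchone()
except sqlite3.OperationalError:
    killed = True

print("query killed:", killed)
```

The API then surfaces the aborted statement as the internal_api_error_DBQueryError shown above, rather than letting the query tie up a database thread for minutes.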

Krenair added a subscriber: Krenair.
Restricted Application added a subscriber: Aklapper.Oct 4 2015, 2:54 AM
Krenair removed Springle as the assignee of this task.Oct 4 2015, 2:54 AM

Has anybody fixed this recently? The above query and similar ones are now much faster than they were some days ago. I regularly test my user scripts, one of which calls the API with the above query. While this was consistently slow ever since I reported the issue, starting last week (still before the update to 1.28.0-wmf.2) it became fast.

Nothing was done on the PHP side. I see @jcrespo recently (7d2f575) switched s5 API queries from db1045 (which currently chooses the wrong index for the query, according to EXPLAIN) to mostly db1070 (which currently chooses the right index). db1071 (the backup api node) is currently choosing the wrong index, though.

Poking at db1045 and db1071 a bit more, even simple queries like SELECT * FROM logging WHERE log_user = 103726 ORDER BY log_timestamp LIMIT 10 that seem like a perfect fit for the user_time index aren't using it.

Volans added a subscriber: Volans.May 23 2016, 5:11 PM

@jcrespo I took a quick look and db1070 has a few fewer indexes than db1071:

KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`)

And the EXPLAIN on db1071 uses log_user_type_time instead of user_time for the query @Anomie reported, which forces a filesort for the sorting instead of reading rows in index order.
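The filesort follows from the column order: in (log_user, log_type, log_timestamp), log_type sits between the equality filter and the sort column, so for a fixed log_user the index does not yield rows in log_timestamp order. A sketch of the same effect using SQLite's INDEXED BY hint on a hypothetical minimal schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logging (log_user INT, log_type TEXT, log_timestamp TEXT);
CREATE INDEX user_time ON logging (log_user, log_timestamp);
CREATE INDEX log_user_type_time
    ON logging (log_user, log_type, log_timestamp);
""")

def plan(index):
    # INDEXED BY pins the planner to one index so the two plans
    # can be compared directly.
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT log_timestamp FROM logging"
        f" INDEXED BY {index} WHERE log_user = 1"
        " ORDER BY log_timestamp DESC LIMIT 2"
    ).fetchall()
    return " ".join(r[-1] for r in rows)

good = plan("user_time")          # rows arrive pre-sorted by the index
bad = plan("log_user_type_time")  # needs a separate sort step
print(good)
print(bad)
```

SQLite reports the extra step as "USE TEMP B-TREE FOR ORDER BY", its equivalent of MariaDB's "Using filesort".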

The above commit was made because an s5 server got fried; I reorganized the remaining servers in the way that made the most sense. If this fixes things we need to understand why (it could be the MariaDB 10 upgrade, which is now exposed to a certain amount of new queries, or it could be the special partitioning).

The problem with this is that for each query that is fixed, others are "broken", so we need a more scientific method for this; I will try to add some monitoring in the future.

As a short-term measure, I will try to understand the differences between db1045, db1071 and the others (probably it is the partitioning needed to make other queries go fast).

Anomie merged a task: Restricted Task.Jul 12 2018, 4:05 PM
Anomie added subscribers: hoo, Ladsgroup, dpatrick.

This is most likely due to T71127 - it is pretty much fixed everywhere thanks to all the recent reimages, cloning, index fixing and all that.
The only two hosts where the query is using the wrong index are db1100 and db2052 (codfw master) - should be easy to fix.

This is yet another case of the optimizer doing different things:

root@neodymium:/home/marostegui# diff -u <(mysql.py -hdb1100 dewiki -e "show create table logging\G") <(mysql.py -hdb2075 dewiki -e "show create table logging\G")
root@neodymium:/home/marostegui#
root@db2075.codfw.wmnet[dewiki]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.1.33-MariaDB |
+-----------------+
1 row in set (0.04 sec)

root@db2075.codfw.wmnet[dewiki]> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys                          | key       | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | type_time,user_time,log_user_type_time | user_time | 4       | const | 21070 | Using where |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
1 row in set (0.04 sec)
root@db1100.eqiad.wmnet[dewiki]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.1.34-MariaDB |
+-----------------+
1 row in set (0.00 sec)

root@db1100.eqiad.wmnet[dewiki]> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+-------+----------------------------------------+-------+---------+------+------+-------------+
| id   | select_type | table   | type  | possible_keys                          | key   | key_len | ref  | rows | Extra       |
+------+-------------+---------+-------+----------------------------------------+-------+---------+------+------+-------------+
|    1 | SIMPLE      | logging | index | type_time,user_time,log_user_type_time | times | 16      | NULL | 7964 | Using where |
+------+-------------+---------+-------+----------------------------------------+-------+---------+------+------+-------------+
1 row in set (0.00 sec)

Maybe worth trying an ANALYZE on db1100 to see if it has any effect (or upgrading db2075.codfw.wmnet to 10.1.34 to see if it then behaves the same way as db1100)
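For background on what the ANALYZE buys: it refreshes the per-index statistics the optimizer uses to estimate row counts, which is exactly what can differ between hosts that pick different plans for an identical schema. The same idea is directly visible in SQLite, where ANALYZE writes one statistics row per index into sqlite_stat1 (illustrative schema, not the production one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logging (log_user INT, log_timestamp TEXT);
CREATE INDEX user_time ON logging (log_user, log_timestamp);
INSERT INTO logging VALUES (1, '20090317133931'),
                           (1, '20090317133915'),
                           (2, '20140807102300');
""")

# Before ANALYZE the planner works from built-in default estimates;
# afterwards sqlite_stat1 holds measured cardinalities per index.
conn.execute("ANALYZE")
stats = conn.execute(
    "SELECT tbl, idx, stat FROM sqlite_stat1"
).fetchall()
print(stats)
```

In MariaDB/InnoDB the analogous statistics are the index cardinality estimates that ANALYZE TABLE recomputes, which is why stale or sampled-differently statistics can flip a plan between user_time and times.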

As expected, nothing to do with versions - 10.1.34 behaves the same way as 10.1.33:

root@neodymium:/home/marostegui# diff -u <(mysql.py -hdb1100 dewiki -e "show create table logging\G") <(mysql.py -hdb2038 dewiki -e "show create table logging\G")
root@db2038.codfw.wmnet[dewiki]> select @@version; explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+-----------------+
| @@version       |
+-----------------+
| 10.1.34-MariaDB |
+-----------------+
1 row in set (0.03 sec)

+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys                          | key       | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | type_time,user_time,log_user_type_time | user_time | 4       | const | 21070 | Using where |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
1 row in set (0.03 sec)

It is interesting to see that across all of s5 this query only misbehaves on db2052 and db1100, both on 10.1.34.
db1100 is the only 10.1.34 host on s5 eqiad.
db1100 is the only host in eqiad (apart from rc slaves) that has:

KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),

db2052 and db2038 are the only 10.1.34 hosts in codfw. As shown above, the query runs fine on db2038.
All hosts in codfw have:

KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),

And only db2052 in codfw misbehaves.

db1100 is the only host in eqiad (apart from rc slaves) that has:

KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),

Interestingly (and reminiscent of T197486#4293781), it suddenly runs fast if we ignore log_user_type_time:

wikiadmin@10.64.32.197(dewiki)> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` ignore index (log_user_type_time) LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+------+---------------------+-----------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys       | key       | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+---------------------+-----------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | type_time,user_time | user_time | 4       | const | 22014 | Using where |
+------+-------------+---------+------+---------------------+-----------+---------+-------+-------+-------------+
1 row in set (0.00 sec)

wikiadmin@10.64.32.197(dewiki)> SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` ignore index (log_user_type_time) LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+---------+----------+------------+----------------+-------------+
| log_id  | log_type | log_action | log_timestamp  | log_deleted |
+---------+----------+------------+----------------+-------------+
| 9776086 | delete   | delete     | 20090317133931 |           0 |
| 9776077 | delete   | delete     | 20090317133915 |           0 |
+---------+----------+------------+----------------+-------------+
2 rows in set (0.00 sec)

OTOH, it uses a similar plan but still runs very slowly if we ignore times instead:

wikiadmin@10.64.32.197(dewiki)> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` ignore index (times) LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+------+-------------+
| id   | select_type | table   | type  | possible_keys                          | key       | key_len | ref  | rows | Extra       |
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+------+-------------+
|    1 | SIMPLE      | logging | index | type_time,user_time,log_user_type_time | user_time | 20      | NULL | 7964 | Using where |
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+------+-------------+
wikiadmin@10.64.32.197(dewiki)> show explain for 586438762;
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+----------+-----------------------------+
| id   | select_type | table   | type  | possible_keys                          | key       | key_len | ref  | rows     | Extra                       |
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+----------+-----------------------------+
|    1 | SIMPLE      | logging | index | type_time,user_time,log_user_type_time | user_time | 20      | NULL | 86583297 | Using where; Using filesort |
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+----------+-----------------------------+

db2052 seems to show the same behavior. I have no good guesses as to why db2038 doesn't, although I see that the EXPLAIN output for forcing user_time versus forcing log_user_type_time shows the former with fewer estimated rows on db2038 but the latter on db2052 and db1100.

db1100 is the only host in eqiad (apart from rc slaves) that has:

KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),

Interestingly (and reminiscent of T197486#4293781), it suddenly runs fast if we ignore log_user_type_time:

Exactly, but even without forcing the ignore it runs fine on the codfw hosts, and they have exactly the same table definition as db1100. Maybe because those hosts receive no activity the buffer pool isn't changing constantly?
I will use the ANALYZE bullet on Monday and test again on db1100.

Moreover, db1100 is the only host in eqiad (apart from the rc ones) that has the keys as defined in tables.sql - the rest are lacking those indexes. I wonder if adding them to another host in eqiad will also make the query slower there. That is another test we can do.

-- Special:Log filter by performer and type
CREATE INDEX /*i*/log_user_type_time ON /*_*/logging (log_user, log_type, log_timestamp);
CREATE INDEX /*i*/log_actor_type_time ON /*_*/logging (log_actor, log_type, log_timestamp);