Page MenuHomePhabricator

list=logevents slow for users with last log action long time ago
Closed, ResolvedPublic

Description

Loading https://de.wikipedia.org/w/api.php?action=query&list=logevents&lelimit=1&leprop=timestamp&leuser=Gardini&format=jsonfm takes for me 114 seconds (try to append some bogus parameter if you hit the cache), and I have similar results for other users which are inactive for a long time. OTOH https://de.wikipedia.org/wiki/Spezial:Logbuch/Gardini is fast, though it provides even more information.


Version: 1.24rc
Severity: normal
URL: https://de.wikipedia.org/w/api.php?action=query&list=logevents&lelimit=1&leprop=timestamp&leuser=Gardini&format=jsonfm

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 3:41 AM
bzimport set Reference to bz69222.

Sean, this seems to be a database issue. The database query here is:

SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2

This seems to run fine on db1026, db1021, and db1045; EXPLAIN says it's using the user_time index, which makes sense. On db1005, db1037, and db1049, EXPLAIN says it uses the times index and executes slowly.

db1005
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => index
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => times
    [key_len] => 16
    [ref] => 
    [rows] => 6716
    [Extra] => Using where
)
db1026
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => ref
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => user_time
    [key_len] => 4
    [ref] => const
    [rows] => 21108
    [Extra] => Using where
)
db1021
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => ref
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => user_time
    [key_len] => 4
    [ref] => const
    [rows] => 21108
    [Extra] => Using where
)
db1037
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => index
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => times
    [key_len] => 16
    [ref] => 
    [rows] => 6194
    [Extra] => Using where
)
db1045
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => ref
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => user_time
    [key_len] => 4
    [ref] => const
    [rows] => 21108
    [Extra] => Using where
)
db1049
stdClass Object
(
    [id] => 1
    [select_type] => SIMPLE
    [table] => logging
    [type] => index
    [possible_keys] => type_time,user_time,log_user_type_time
    [key] => times
    [key_len] => 16
    [ref] => 
    [rows] => 6497
    [Extra] => Using where
)

Now I get an error for the query (and for other similar [= where a user has been inactive for a long time] queries):

{
    "servedby": "mw1203",
    "error": {
        "code": "internal_api_error_DBQueryError",
        "info": "[c0f53a23] Database query error"
    }
}

That's because a database timeout was added at some point, to kill such long-running queries rather than allowing them to continue to consume resources.

Has anybody fixed this recently? The above query and similar ones are now much faster than they were some days ago. I regularly test my user scripts, one of them calling the API with the above query. While this was consistently slow ever since I reported the issue, starting last week (still before the update to 1.28.0-wmf.2) it became fast.

Nothing was done on the PHP side. I see @jcrespo recently (7d2f575) switched s5 API queries from db1045 (which currently chooses the wrong index for the query, according to EXPLAIN) to mostly db1070 (which currently chooses the right index). db1071 (the backup api node) is currently choosing the wrong index, though.

Poking at db1045 and db1071 a bit more, even simple queries like SELECT * FROM logging WHERE log_user = 103726 ORDER BY log_timestamp LIMIT 10 that seem like a perfect fit for the user_time index aren't using it.

@jcrespo I took a quick look and db1070 has few indexes less than db1071:

KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`)

And the explain on db1071 uses log_user_type_time instead of user_time for the query @Anomie reported. This causes it to use filesort for the sorting instead of the index.

The above commit was done due to the loss of an s5 server, that got fried. I reorganized the servers among the ones that made more sense, if this fixes things we need to understand why (it could be the MariaDB 10 upgrade, that is now exposed to certain amount of new queries, or it could be the special partitioning).

The problem with this is that for each query that is fixed, others are "broken", so we need to have a more scientific method for this- I will try to add some monitoring in the future.

As a short term thing, I will try to understand the differences between db1045 and db1071 an the others (probably it is the partitioning needed for other queries to go fast).

Anomie added subscribers: hoo, Ladsgroup, dpatrick.

This is most likely due to: T71127 - this is pretty much fixed everywhere due to all the recent reimages, cloning, indexes fixing and all that.
The only two hosts where the query is using the wrong indexes are: db1100 and db2052 (codfw master) - should be easy to fix.

This is yet another case of the optimizer doing different things:

root@neodymium:/home/marostegui# diff -u <(mysql.py -hdb1100 dewiki -e "show create table logging\G") <(mysql.py -hdb2075 dewiki -e "show create table logging\G")
root@neodymium:/home/marostegui#
root@db2075.codfw.wmnet[dewiki]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.1.33-MariaDB |
+-----------------+
1 row in set (0.04 sec)

root@db2075.codfw.wmnet[dewiki]> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys                          | key       | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | type_time,user_time,log_user_type_time | user_time | 4       | const | 21070 | Using where |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
1 row in set (0.04 sec)
root@db1100.eqiad.wmnet[dewiki]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.1.34-MariaDB |
+-----------------+
1 row in set (0.00 sec)

root@db1100.eqiad.wmnet[dewiki]> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+-------+----------------------------------------+-------+---------+------+------+-------------+
| id   | select_type | table   | type  | possible_keys                          | key   | key_len | ref  | rows | Extra       |
+------+-------------+---------+-------+----------------------------------------+-------+---------+------+------+-------------+
|    1 | SIMPLE      | logging | index | type_time,user_time,log_user_type_time | times | 16      | NULL | 7964 | Using where |
+------+-------------+---------+-------+----------------------------------------+-------+---------+------+------+-------------+
1 row in set (0.00 sec)

Maybe worth trying an analyze on db1100 to see if has any effect (or upgrade db2075.codfw.wmnet to 10.1.34 and see if it behaves the same way of db1100)

As expected nothing to do with versions, 10.1.34 behaves the same way as 10.1.33 - so something related to 10.:

root@neodymium:/home/marostegui# diff -u <(mysql.py -hdb1100 dewiki -e "show create table logging\G") <(mysql.py -hdb2038 dewiki -e "show create table logging\G")
root@db2038.codfw.wmnet[dewiki]> select @@version; explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+-----------------+
| @@version       |
+-----------------+
| 10.1.34-MariaDB |
+-----------------+
1 row in set (0.03 sec)

+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys                          | key       | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | type_time,user_time,log_user_type_time | user_time | 4       | const | 21070 | Using where |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
1 row in set (0.03 sec)

It is interesting to see that this query only misbehave on db2052 and db1100 across all s5, both 10.1.34.
db1100 is the only 10.1.34 on s5 eqiad.
db1100 is the only host in eqiad (apart from rc slaves) that has:

KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),

db2052 and db2038 are the only 10.1.34 hosts in codfw. As shown above, the query runs fine on db2038.
All hosts in codfw have:

KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),

And only db2052 in codfw misbehaves.

db1100 is the only host in eqiad (apart from rc slaves) that has:

KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),

Interestingly (and reminiscent of T197486#4293781), it suddenly runs fast if we ignore log_user_type_time:

wikiadmin@10.64.32.197(dewiki)> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` ignore index (log_user_type_time) LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+------+---------------------+-----------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys       | key       | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+---------------------+-----------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | type_time,user_time | user_time | 4       | const | 22014 | Using where |
+------+-------------+---------+------+---------------------+-----------+---------+-------+-------+-------------+
1 row in set (0.00 sec)

wikiadmin@10.64.32.197(dewiki)> SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` ignore index (log_user_type_time) LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+---------+----------+------------+----------------+-------------+
| log_id  | log_type | log_action | log_timestamp  | log_deleted |
+---------+----------+------------+----------------+-------------+
| 9776086 | delete   | delete     | 20090317133931 |           0 |
| 9776077 | delete   | delete     | 20090317133915 |           0 |
+---------+----------+------------+----------------+-------------+
2 rows in set (0.00 sec)

OTOH, it uses a similar plan but still runs very slowly if we ignore times instead:

wikiadmin@10.64.32.197(dewiki)> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` ignore index (times) LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+------+-------------+
| id   | select_type | table   | type  | possible_keys                          | key       | key_len | ref  | rows | Extra       |
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+------+-------------+
|    1 | SIMPLE      | logging | index | type_time,user_time,log_user_type_time | user_time | 20      | NULL | 7964 | Using where |
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+------+-------------+
wikiadmin@10.64.32.197(dewiki)> show explain for 586438762;
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+----------+-----------------------------+
| id   | select_type | table   | type  | possible_keys                          | key       | key_len | ref  | rows     | Extra                       |
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+----------+-----------------------------+
|    1 | SIMPLE      | logging | index | type_time,user_time,log_user_type_time | user_time | 20      | NULL | 86583297 | Using where; Using filesort |
+------+-------------+---------+-------+----------------------------------------+-----------+---------+------+----------+-----------------------------+

db2052 seems to show the same behavior. I have no good guesses as to why db2038 doesn't, although I see that the EXPLAIN output for forcing user_time versus forcing log_user_type_time shows the former with fewer estimated rows on db2038 but the latter on db2052 and db1100.

db1100 is the only host in eqiad (apart from rc slaves) that has:

KEY `log_user_type_time` (`log_user`,`log_type`,`log_timestamp`),
KEY `log_title_type_time` (`log_title`(16),`log_type`,`log_timestamp`),

Interestingly (and reminiscent of T197486#4293781), it suddenly runs fast if we ignore log_user_type_time:

Exactly, but without even forcing the ignore, on codfw hosts it runs fine and they have exactly the same table definition than db1100. Maybe because those have no activity the buffer pool isn't changing constantly?
I will use the analyze bullet on Monday and test again on db1100.

Moreover, db1100 is the only host in eqiad (apart from rc ones) that have the keys as defined on tables.sql - the rest are lacking those indexes. I wonder if adding them to another host in eqiad will also make the query slower. That is another test we can do.

-- Special:Log filter by performer and type
CREATE INDEX /*i*/log_user_type_time ON /*_*/logging (log_user, log_type, log_timestamp);
CREATE INDEX /*i*/log_actor_type_time ON /*_*/logging (log_actor, log_type, log_timestamp);

Looking back at this, db1100 still uses the times index for the original query, and it uses the right plan when told to ignore log_user_type_time. On the other hand, it now also uses the right plan when told to ignore times, unlike what was the case in T71222#4456488.

I also ran an EXPLAIN over the query using log_actor instead of log_user that we'll start using soon. I don't see any pattern in the results.

HostVersionIndexesindex for log_userindex for log_actortable options
db107010.1.36-MariaDBnot alluser_timeactor_timeENGINE=InnoDB AUTO_INCREMENT=115010915 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
db108210.1.38-MariaDBalluser_timeactor_timeENGINE=InnoDB AUTO_INCREMENT=115010916 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED
db110010.1.37-MariaDBalltimestimesENGINE=InnoDB AUTO_INCREMENT=115010919 DEFAULT CHARSET=binary
db111010.1.37-MariaDBnot alluser_timetimesENGINE=InnoDB AUTO_INCREMENT=115010919 DEFAULT CHARSET=binary
db1113:331510.1.37-MariaDBnot alluser_timetimesENGINE=InnoDB AUTO_INCREMENT=115010922 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
db205210.1.37-MariaDBalltimestimesENGINE=InnoDB AUTO_INCREMENT=115010911 DEFAULT CHARSET=binary
db203810.1.37-MariaDBalluser_timeactor_timeENGINE=InnoDB AUTO_INCREMENT=115010911 DEFAULT CHARSET=binary
db205910.1.37-MariaDBalluser_timetimesENGINE=InnoDB AUTO_INCREMENT=115010911 DEFAULT CHARSET=binary
db206610.1.37-MariaDBalluser_timetimesENGINE=InnoDB AUTO_INCREMENT=115010912 DEFAULT CHARSET=binary
db207510.1.37-MariaDBalluser_timetimesENGINE=InnoDB AUTO_INCREMENT=115010913 DEFAULT CHARSET=binary
db1096:331510.1.37-MariaDBpartitionedlog_user_type_timeactor_timeENGINE=InnoDB AUTO_INCREMENT=115010917 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
db1097:331510.1.37-MariaDBpartitionedlog_user_type_timeactor_timeENGINE=InnoDB AUTO_INCREMENT=115010919 DEFAULT CHARSET=binary
db2084:331510.1.37-MariaDBpartitioneduser_timeactor_timeENGINE=InnoDB AUTO_INCREMENT=115010914 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
db2089:331510.1.38-MariaDBpartitioneduser_timeactor_timeENGINE=InnoDB AUTO_INCREMENT=115010914 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8

What's the difference between db2038 and db2059 that makes one use the good plan for log_actor and the other not?

I can try to run an analyze on Monday (the table is 71G) and see if that changes the plan.
What is the query you are using, @Anomie?

explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_actor = '154300' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;

Mentioned in SAL (#wikimedia-operations) [2019-03-04T06:06:36Z] <marostegui> Run analyze table logging on db2038 and db2059 - T71222

After the analyze, they (db2038 and db2059) now use the same plan:

root@db2038.codfw.wmnet[dewiki]> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys                          | key       | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | type_time,user_time,log_user_type_time | user_time | 4       | const | 21070 | Using where |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
1 row in set (0.04 sec)

root@db2038.codfw.wmnet[dewiki]> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_actor = '154300' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+-------+------------------------------------------+-------+---------+------+------+-------------+
| id   | select_type | table   | type  | possible_keys                            | key   | key_len | ref  | rows | Extra       |
+------+-------------+---------+-------+------------------------------------------+-------+---------+------+------+-------------+
|    1 | SIMPLE      | logging | index | type_time,actor_time,log_actor_type_time | times | 16      | NULL | 8591 | Using where |
+------+-------------+---------+-------+------------------------------------------+-------+---------+------+------+-------------+
1 row in set (0.04 sec)
root@db2059.codfw.wmnet[dewiki]> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys                          | key       | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | type_time,user_time,log_user_type_time | user_time | 4       | const | 21070 | Using where |
+------+-------------+---------+------+----------------------------------------+-----------+---------+-------+-------+-------------+
1 row in set (0.04 sec)

root@db2059.codfw.wmnet[dewiki]> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_actor = '154300' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+-------+------------------------------------------+-------+---------+------+------+-------------+
| id   | select_type | table   | type  | possible_keys                            | key   | key_len | ref  | rows | Extra       |
+------+-------------+---------+-------+------------------------------------------+-------+---------+------+------+-------------+
|    1 | SIMPLE      | logging | index | type_time,actor_time,log_actor_type_time | times | 16      | NULL | 8591 | Using where |
+------+-------------+---------+-------+------------------------------------------+-------+---------+------+------+-------------+
1 row in set (0.03 sec)

Unfortunately that made db2038 start using times rather than actor_time for the second query. I was hoping it would go the other way, because we want actor_time for this query.

BTW, the poor plan shows up with a simplified query where it should be very obvious the actor_time index should be used:

wikiadmin@10.192.32.9(dewiki)> explain SELECT *  FROM `logging`  WHERE log_actor = 154300  ORDER BY log_timestamp DESC LIMIT 2;
+------+-------------+---------+-------+--------------------------------+-------+---------+------+------+-------------+
| id   | select_type | table   | type  | possible_keys                  | key   | key_len | ref  | rows | Extra       |
+------+-------------+---------+-------+--------------------------------+-------+---------+------+------+-------------+
|    1 | SIMPLE      | logging | index | actor_time,log_actor_type_time | times | 16      | NULL | 8755 | Using where |
+------+-------------+---------+-------+--------------------------------+-------+---------+------+------+-------------+

The planner thinks using the right index will have to touch many more rows for some reason:

wikiadmin@10.192.32.9(dewiki)> explain SELECT *  FROM `logging` ignore index(times) WHERE log_actor = 154300  ORDER BY log_timestamp DESC LIMIT 2;
+------+-------------+---------+-------+--------------------------------+------------+---------+------+-------+-------------+
| id   | select_type | table   | type  | possible_keys                  | key        | key_len | ref  | rows  | Extra       |
+------+-------------+---------+-------+--------------------------------+------------+---------+------+-------+-------------+
|    1 | SIMPLE      | logging | range | actor_time,log_actor_type_time | actor_time | 8       | NULL | 21780 | Using where |
+------+-------------+---------+-------+--------------------------------+------------+---------+------+-------+-------------+

Other strangeness in that plan is that the type is "range" rather than "ref". And oddly (but as noted before), if we have it ignore log_actor_type_time instead then the planner suddenly gets "ref" right and doesn't have to be told not to use times:

wikiadmin@10.192.32.9(dewiki)> explain SELECT *  FROM `logging` ignore index(log_actor_type_time) WHERE log_actor = 154300  ORDER BY log_timestamp DESC LIMIT 2; 
+------+-------------+---------+------+---------------+------------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys | key        | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+---------------+------------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | actor_time    | actor_time | 8       | const | 21780 | Using where |
+------+-------------+---------+------+---------------+------------+---------+-------+-------+-------------+

And that ^ also happens with 10.1.38, so I guess it is one again one of those "features" of the optimizer.

It is unbelievable the optimizer chooses times:

root@db2038.codfw.wmnet[dewiki]> show explain for 3580770;
+------+-------------+---------+-------+--------------------------------+-------+---------+-------+----------+-----------------------------+
| id   | select_type | table   | type  | possible_keys                  | key   | key_len | ref   | rows     | Extra                       |
+------+-------------+---------+-------+--------------------------------+-------+---------+-------+----------+-----------------------------+
|    1 | SIMPLE      | logging | index | actor_time,log_actor_type_time | times | 16      | const | 91195480 | Using where; Using filesort |
+------+-------------+---------+-------+--------------------------------+-------+---------+-------+----------+-----------------------------+
1 row in set, 1 warning (0.04 sec)

And the other query using actor_time takes 0.04 on db2038...

I think we've analyzed this to the point where we can say that the best short-term fix will be to have the API alter its query to add IGNORE INDEX (times) to avoid this problem occurring more frequently when we start using log_actor rather than log_user in the near future. Then, as usual, we should review this again when we upgrade MySQL/MariaDB to see if the optimizer has been fixed.

Change 494256 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] ApiQueryLogEvents: Add IGNORE INDEX to avoid MariaDB optimizer bug

https://gerrit.wikimedia.org/r/494256

I think we've analyzed this to the point where we can say that the best short-term fix will be to have the API alter its query to add IGNORE INDEX (times) to avoid this problem occurring more frequently when we start using log_actor rather than log_user in the near future. Then, as usual, we should review this again when we upgrade MySQL/MariaDB to see if the optimizer has been fixed.

+1

Please make sure times exist everywhere and with that name, as I know Manuel is in the middle of matching the indexes on that table. Ignore fails even if index doesn't exist. https://bugs.mysql.com/bug.php?id=89805

Please make sure times exist everywhere and with that name, as I know Manuel is in the middle of matching the indexes on that table. Ignore fails even if index doesn't exist. https://bugs.mysql.com/bug.php?id=89805

I am not touching that index, but I am going to double check it is on every host on every wiki :)

Confirmed, that index exists everywhere.

Change 494256 merged by jenkins-bot:
[mediawiki/core@master] ApiQueryLogEvents: Add IGNORE INDEX to avoid MariaDB optimizer bug

https://gerrit.wikimedia.org/r/494256

Anomie removed a project: Patch-For-Review.

The query with IGNORE INDEX should be deployed next week with 1.33.0-wmf.21. I'll leave it to DBAs to decide whether to close this or keep it open for future re-review on MariaDB/MySQL upgrades.

Marostegui changed the task status from Open to Stalled.Mar 11 2019, 6:16 AM
Marostegui moved this task from Backlog to Blocked external/Not db team on the DBA board.

L'et stall it so we don't forget about it, and we can test the query plan with the following mariadb releases

This weird query plan still happens with the new release 10.1.39

I wanted to test this issue with 10.3 on db1114.
I copied logging page and user tables from dewiki from one of the hosts that have the weird plans and placed in this 10.3 server.

root@db1114.eqiad.wmnet[(none)]> select @@version;
+---------------------+
| @@version           |
+---------------------+
| 10.3.16-MariaDB-log |
+---------------------+
1 row in set (0.00 sec)

The optimizer looks like it is behaving correctly on 10.3, obviously it is hard to know if it is because of the full import (mysqldump) or why, as the behaviour we have seen it is not very consistent with versions, analyze, optimize as we have seen before.
Anyways, 10.3 on db1114 looks like working fine:

root@db1114.eqiad.wmnet[dewiki]> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_user = '103726' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+------+--------------------------------------------------------+-----------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys                                          | key       | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+--------------------------------------------------------+-----------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | type_time,user_time,log_user_type_time,log_type_action | user_time | 4       | const | 19994 | Using where |
+------+-------------+---------+------+--------------------------------------------------------+-----------+---------+-------+-------+-------------+
1 row in set (0.02 sec)

root@db1114.eqiad.wmnet[dewiki]> explain SELECT log_id,log_type,log_action,log_timestamp,log_deleted  FROM `logging` LEFT JOIN `user` ON ((user_id=log_user)) LEFT JOIN `page` ON ((log_namespace=page_namespace) AND (log_title=page_title))  WHERE (log_type != 'suppress') AND log_actor = '154300' AND ((log_deleted & 12) != 12)  ORDER BY log_timestamp DESC,log_id DESC LIMIT 2;
+------+-------------+---------+------+----------------------------------------------------------+------------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys                                            | key        | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+----------------------------------------------------------+------------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | type_time,actor_time,log_actor_type_time,log_type_action | actor_time | 8       | const | 18426 | Using where |
+------+-------------+---------+------+----------------------------------------------------------+------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)

root@db1114.eqiad.wmnet[dewiki]> explain SELECT *  FROM `logging`  WHERE log_actor = 154300  ORDER BY log_timestamp DESC LIMIT 2;
+------+-------------+---------+------+--------------------------------+------------+---------+-------+-------+-------------+
| id   | select_type | table   | type | possible_keys                  | key        | key_len | ref   | rows  | Extra       |
+------+-------------+---------+------+--------------------------------+------------+---------+-------+-------+-------------+
|    1 | SIMPLE      | logging | ref  | actor_time,log_actor_type_time | actor_time | 8       | const | 18426 | Using where |
+------+-------------+---------+------+--------------------------------+------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)

I think I am going to close this, as the task was really fixed with T71222#5012278. 10.1 will for sure not fix this issue as they are not going to make changes to the optimizer as per my last chat with MariaDB Dev Team. So we'll see if 10.3 got this fixed or it was just because we re-imported the table (although we didn't see that making any difference on 10.1 as well).

Marostegui assigned this task to Anomie.
Marostegui removed Anomie as the assignee of this task.

I think I am going to close this, as the task was really fixed with T71222#5012278. 10.1 will for sure not fix this issue as they are not going to make changes to the optimizer as per my last chat with MariaDB Dev Team. So we'll see if 10.3 got this fixed or it was just because we re-imported the table (although we didn't see that making any difference on 10.1 as well).

If 10.3 really does fix it, do we want to remove the IGNORE INDEX once 10.3 is everywhere?

@Anomie it is probably the old decision of Mediawiki vs Wikimedia. I don't have visibility of what is the impact outside of WMF, but I would suggest to set up a lower priority, generic task to review optimizer hints and document them or remove the unnecessary ones, starting by this one.