Page MenuHomePhabricator

Error: 2013 Lost connection to MySQL server during query on NewFilesPager
Closed, ResolvedPublic

Description

https://commons.wikimedia.org/wiki/Special:NewFiles?hidepatrolled=1&limit=50&offset=

A database query error has occurred. This may indicate a bug in the software.
Function: IndexPager::buildQueryInfo (NewFilesPager)
Error: 2013 Lost connection to MySQL server during query (10.64.32.26)

Related Objects

Event Timeline

Josve05a raised the priority of this task from to Needs Triage.
Josve05a updated the task description. (Show Details)
Josve05a subscribed.

This is new from rMWc53f647658ad: Allow patrol of uploads and only occurs when "Hide patrolled uploads" is checked. I've just noticed that the page can be pretty slow and sometimes times out, as it happened here.

The query is:

SELECT *
FROM `image`
LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user))
INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (rc_timestamp = img_timestamp))  
WHERE (ug_group IS NULL) AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0'
ORDER BY img_timestamp DESC
LIMIT 51

Explain:

+------+-------------+---------------+--------+-----------------------+---------+---------+-----------------------------------------+---------+----------------------------------------------+
| id   | select_type | table         | type   | possible_keys         | key     | key_len | ref                                     | rows    | Extra                                        |
+------+-------------+---------------+--------+-----------------------+---------+---------+-----------------------------------------+---------+----------------------------------------------+
|    1 | SIMPLE      | recentchanges | ALL    | rc_timestamp          | NULL    | NULL    | NULL                                    | 3980662 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | image         | eq_ref | PRIMARY,img_timestamp | PRIMARY | 257     | commonswiki.recentchanges.rc_title      |       1 | Using where                                  |
|    1 | SIMPLE      | user_groups   | eq_ref | PRIMARY,ug_group      | PRIMARY | 261     | commonswiki.recentchanges.rc_user,const |       1 | Using where; Using index; Not exists         |
+------+-------------+---------------+--------+-----------------------+---------+---------+-----------------------------------------+---------+----------------------------------------------+

I fiddled with it a bit, and apparently just changing the ORDER BY field to rc_timestamp makes it magically fast. The values of that and img_timestamp are the same.

SELECT *
FROM `image`
LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user))
INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (rc_timestamp = img_timestamp))  
WHERE (ug_group IS NULL) AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0'
ORDER BY rc_timestamp DESC
LIMIT 51
+------+-------------+---------------+--------+-----------------------+--------------+---------+-----------------------------------------+------+--------------------------------------+
| id   | select_type | table         | type   | possible_keys         | key          | key_len | ref                                     | rows | Extra                                |
+------+-------------+---------------+--------+-----------------------+--------------+---------+-----------------------------------------+------+--------------------------------------+
|    1 | SIMPLE      | recentchanges | index  | rc_timestamp          | rc_timestamp | 16      | NULL                                    |   51 | Using where                          |
|    1 | SIMPLE      | image         | eq_ref | PRIMARY,img_timestamp | PRIMARY      | 257     | commonswiki.recentchanges.rc_title      |    1 | Using where                          |
|    1 | SIMPLE      | user_groups   | eq_ref | PRIMARY,ug_group      | PRIMARY      | 261     | commonswiki.recentchanges.rc_user,const |    1 | Using where; Using index; Not exists |
+------+-------------+---------------+--------+-----------------------+--------------+---------+-----------------------------------------+------+--------------------------------------+

Also, since we're using rc_title in the query, we should probably specify rc_namespace too (which is going to be 6, NS_FILE).

I'm pretty sure that query shouldn't even need to query the image table, just recentchanges

Change 269356 had a related patch set uploaded (by Bartosz Dziewoński):
Special:NewFiles: Make 'hidepatrolled' query less slow

https://gerrit.wikimedia.org/r/269356

Thanks for this. This changes are tricky because this kind of special pages have lots of filters and orders.

Joining to image makes sure that there is an actual image uploaded. I wonder if there is would be cases where something it is in the uploaded log but not in the image (e.g. image has been deleted). I agree that if that wasn't a problem, dropping image in this specific case and using the namespace would be wise. But we should do an extensive testing of both queries, on serveral wikis and several servers.

Sadly, I have found several instances where the new query is actually slower than the original one:

Original:
51 rows in set (0.36 sec)
New:
51 rows in set (2 min 14.38 sec)

MariaDB DBSTORE localhost enwiki > EXPLAIN SELECT * FROM `image` 
LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) 
INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) 
AND (rc_timestamp = img_timestamp))   
WHERE (ug_group IS NULL) AND rc_type = '3' AND rc_log_type = 'upload' AND 
rc_patrolled = '0' ORDER BY rc_timestamp DESC LIMIT 51;
+------+-------------+---------------+--------+-----------------------+--------------+---------+-----------------------------+--------+--------------
| id   | select_type | table         | type   | possible_keys         | key          | key_len | ref                         | rows   | Extra        
+------+-------------+---------------+--------+-----------------------+--------------+---------+-----------------------------+--------+--------------
|    1 | SIMPLE      | image         | ALL    | PRIMARY,img_timestamp | NULL         | NULL    | NULL                        | 911549 | Using tempora
|    1 | SIMPLE      | user_groups   | eq_ref | PRIMARY,ug_group      | PRIMARY      | 261     | enwiki.image.img_user,const |      1 | Using where; 
|    1 | SIMPLE      | recentchanges | ref    | rc_timestamp          | rc_timestamp | 16      | enwiki.image.img_timestamp  |      2 | Using where  
+------+-------------+---------------+--------+-----------------------+--------------+---------+-----------------------------+--------+--------------
3 rows in set (0.14 sec)

MariaDB DBSTORE localhost enwiki > EXPLAIN SELECT * FROM `image` 
LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) 
INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) 
AND (rc_timestamp = img_timestamp))   
WHERE (ug_group IS NULL) AND rc_type = '3' AND rc_log_type = 'upload' 
AND rc_patrolled = '0' ORDER BY img_timestamp DESC LIMIT 51;
+------+-------------+---------------+--------+-----------------------+---------------+---------+-----------------------------+------+---------------
| id   | select_type | table         | type   | possible_keys         | key           | key_len | ref                         | rows | Extra         
+------+-------------+---------------+--------+-----------------------+---------------+---------+-----------------------------+------+---------------
|    1 | SIMPLE      | image         | index  | PRIMARY,img_timestamp | img_timestamp | 18      | NULL                        |   25 |               
|    1 | SIMPLE      | user_groups   | eq_ref | PRIMARY,ug_group      | PRIMARY       | 261     | enwiki.image.img_user,const |    1 | Using where; U
|    1 | SIMPLE      | recentchanges | ref    | rc_timestamp          | rc_timestamp  | 16      | enwiki.image.img_timestamp  |    2 | Using where   
+------+-------------+---------------+--------+-----------------------+---------------+---------+-----------------------------+------+---------------
3 rows in set (0.00 sec)

The query worked fine and was fast up until now. What could explain that this no longer works ?

I'm pretty sure that query shouldn't even need to query the image table, just recentchanges

Maybe we should try that.

The query worked fine and was fast up until now. What could explain that this no longer works ?

The number 1 reason usually is "the data changed". The optimizer bases its queries on the index and internal heuristical statistics. If the data grows considerably, those numbers can change. That is why sometimes a query works well on the English Wikipedia but not on smaller wikis, and viceversa, even on the same code. See, for example, that in a couple of tests on production, I am still getting the old (good) behavior.

I am also regularly upgrading/restarting the databases, which can affect slightly the optimizer decisions and the generation of those stats. Usually very small changes, but that can be very visible in a high-thoughput environment.

My EXPLAINs were done on commonswiki… are you saying the tables in the two databases (commonswiki and enwiki) have different indexes? Or is MySQL just deciding to execute them differently because the data is different?

These are the differences, some already tracked and pending to be fixed, some maintenance that is low priority:

$ mysql -h s4-master commonswiki -e "SHOW CREATE TABLE user_groups\G SHOW CREATE TABLE recentchanges\G SHOW CREATE TABLE image\G" > ~/commons.sql
$ mysql -h s1-master enwiki -e "SHOW CREATE TABLE user_groups\G SHOW CREATE TABLE recentchanges\G SHOW CREATE TABLE image\G" > ~/enwiki.sql
$ diff ~/commons.sql ~/enwiki.sql 
38c38
<   `rc_params` blob,
---
>   `rc_params` blob NOT NULL,
46a47
>   KEY `tmp_1` (`rc_this_oldid`),
49c50
< ) ENGINE=InnoDB AUTO_INCREMENT=195466880 DEFAULT CHARSET=binary
---
> ) ENGINE=InnoDB AUTO_INCREMENT=795613037 DEFAULT CHARSET=binary
61c62
<   `img_minor_mime` varbinary(32) NOT NULL DEFAULT 'unknown',
---
>   `img_minor_mime` varbinary(100) NOT NULL DEFAULT 'unknown',

So the answer is "is MySQL just deciding to execute them differently because the data is different?". In particular, the size of recentchanges vs. image.

Commons needing special treatment for images should not be surprising. Now the question is how to implement that in the cleanest way possible (data-dependent code). Until now there is some things already implemented in terms of server-side code rewrite (which I do not like, because it is "hidden") and slaves with special partitioning.

There is one last thing I would like to try before that, which is regenerating the table statistics.

The query works fine at entwiki, I don't think it ever failed like at commons. A big difference in data is that at commons the ratio of autopatrolled uploads is much higher. At enwiki, most recent uploads are unpatrolled but at commons it's the opposite.

I've tried it again a few times at commons and it worked but was very slow.
Maybe a very large number of files were recently uploaded by one or several autopatrolled users, which badly impacted the query for unpatrolled files.

After a lot of reading docs and StackOverflow…

Both queries can be made fast is MySQL is forced to first query the table from which the ORDER BY column is. (Note how both my commonswiki examples query recentchanges first and image later (slow with img_timestamp, fast with rc_timestamp), and both of your enwiki examples query image first and recentchanges later (fast with img_timestamp, slow with rc_timestamp).

So here's the original query with a STRAIGHT_JOIN, thanks to StackOverflow: http://stackoverflow.com/questions/3456134/is-there-a-way-to-force-mysql-execution-order

mysql:research@analytics-store.eqiad.wmnet [enwiki]> explain SELECT * FROM `image` LEFT JOIN `user_groups` ON (ug_group
 = 'bot' AND (ug_user = img_user)) STRAIGHT_JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND
 (rc_timestamp = img_timestamp))   WHERE (ug_group IS NULL) AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patroll
ed = '0' ORDER BY img_timestamp DESC LIMIT 51;
+------+-------------+---------------+--------+-----------------------+---------------+---------+-----------------------------+------+--------------------------------------+
| id   | select_type | table         | type   | possible_keys         | key           | key_len | ref                         | rows | Extra                                |
+------+-------------+---------------+--------+-----------------------+---------------+---------+-----------------------------+------+--------------------------------------+
|    1 | SIMPLE      | image         | index  | PRIMARY,img_timestamp | img_timestamp | 18      | NULL                        |   51 |                                      |
|    1 | SIMPLE      | user_groups   | eq_ref | PRIMARY,ug_group      | PRIMARY       | 261     | enwiki.image.img_user,const |    1 | Using where; Using index; Not exists |
|    1 | SIMPLE      | recentchanges | ref    | rc_timestamp          | rc_timestamp  | 16      | enwiki.image.img_timestamp  |    1 | Using where                          |
+------+-------------+---------------+--------+-----------------------+---------------+---------+-----------------------------+------+--------------------------------------+
3 rows in set (0.00 sec)

mysql:research@analytics-store.eqiad.wmnet [commonswiki]> explain SELECT * FROM `image` LEFT JOIN `user_groups` ON (ug_
group = 'bot' AND (ug_user = img_user)) STRAIGHT_JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user
) AND (rc_timestamp = img_timestamp))   WHERE (ug_group IS NULL) AND rc_type = '3' AND rc_log_type = 'upload' AND rc_pa
trolled = '0' ORDER BY img_timestamp DESC LIMIT 51;
+------+-------------+---------------+--------+-----------------------+---------------+---------+----------------------------------+------+--------------------------------------+
| id   | select_type | table         | type   | possible_keys         | key           | key_len | ref                              | rows | Extra                                |
+------+-------------+---------------+--------+-----------------------+---------------+---------+----------------------------------+------+--------------------------------------+
|    1 | SIMPLE      | image         | index  | PRIMARY,img_timestamp | img_timestamp | 18      | NULL                             |   51 |                                      |
|    1 | SIMPLE      | user_groups   | eq_ref | PRIMARY,ug_group      | PRIMARY       | 261     | commonswiki.image.img_user,const |    1 | Using where; Using index; Not exists |
|    1 | SIMPLE      | recentchanges | ref    | rc_timestamp          | rc_timestamp  | 16      | commonswiki.image.img_timestamp  |    1 | Using where                          |
+------+-------------+---------------+--------+-----------------------+---------------+---------+----------------------------------+------+--------------------------------------+
3 rows in set (0.00 sec)

mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT * FROM `image` LEFT JOIN `user_groups` ON (ug_group = 'bot'
 AND (ug_user = img_user)) STRAIGHT_JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (rc_tim
estamp = img_timestamp))   WHERE (ug_group IS NULL) AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0'
 ORDER BY img_timestamp DESC LIMIT 51;
51 rows in set (2.48 sec)

mysql:research@analytics-store.eqiad.wmnet [commonswiki]> SELECT * FROM `image` LEFT JOIN `user_groups` ON (ug_group =
'bot' AND (ug_user = img_user)) STRAIGHT_JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (r
c_timestamp = img_timestamp))   WHERE (ug_group IS NULL) AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled
= '0' ORDER BY img_timestamp DESC LIMIT 51;
51 rows in set (1.08 sec)

Updated the proposed patch.

(There's apparently a lot of STRAIGHT_JOIN in MediaWiki, referencing tasks like T106682 and T113901.)

STRAIGHT_JOIN like FORCE_INDEX should be avoided in general. While I approved doing that on those cases, it was because there was no other way. Please let me attempt what I mentioned before. See my comment on T113901:

"to see if we can avoid the STRAIGHT_JOIN (because it may bite us in the future, see T101502)."

If it doesn't work, it is probably the same bug as T113901 and we should just apply it, but we should track all of this with #optimizer-workaround or something similar here, and heavily comment the code referring to this bug.

Regenerating stats does not work, this is probably https://mariadb.atlassian.net/browse/MDEV-8880 Go ahead with the STRAIGHT_JOIN.

Right, I realize it kind of sucks and I have very limited knowledge of MySQL internals. I'm just trying to solve the immediate problem of code that I merged and that I'm supposed to support generating queries slow enough to routinely time out, and hoping someone like you will step in to turn that into something maintainable later ;) Can you merge the patch?

matmarex lowered the priority of this task from High to Medium.Feb 18 2016, 5:01 PM

OK, this doesn't seem to be high priority for anyone else.

Also, I think part of the problem might be using * in the select field. In the worst case, that can select up to 800 mb of data (img_metadata sucks)

I left a comment on T11501 before reading this - I think maybe we should try to use the new_name_timestamp index - since we know that rc_new = 0 for log actions always, rc_namespace = 6 always, and then we're sorting by rc_timestamp

Change 269356 merged by jenkins-bot:
Special:NewFiles: Make 'hidepatrolled' query less slow

https://gerrit.wikimedia.org/r/269356

So alternatively, I think doing a query like:

SELECT /* IndexPager::buildQueryInfo (NewFilesPager) Bawolff */   img_name,img_user,img_timestamp  FROM `recentchanges` FORCE INDEX( new_name_timestamp) INNER JOIN `image` ON ((rc_title = img_name) )  WHERE rc_bot = 0 AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0' AND rc_namespace = '6' and rc_new = 0   ORDER BY rc_timestamp DESC LIMIT 51;

might be more optimal. But not 100% sure. (I mean, if the recently merged patch doesn't work out)

Also, I think part of the problem might be using * in the select field. In the worst case, that can select up to 800 mb of data (img_metadata sucks)

https://gerrit.wikimedia.org/r/#/c/269429/ (already merged).