Page MenuHomePhabricator

Error: 2013 Lost connection to MySQL server during query on NewFilesPager (again)
Open, HighPublic

Description

https://commons.wikimedia.org/wiki/Special:NewFiles?hidepatrolled=1&limit=50&offset=20160601 and https://commons.wikimedia.org/wiki/Special:NewFiles?hidepatrolled=1&limit=50&offset=20160601&dir=prev are producing the same error as
https://commons.wikimedia.org/wiki/Special:NewFiles?hidepatrolled=1&limit=50&offset= did in T124205.

Specifically, twice now consecutively I had received

A database query error has occurred. This may indicate a bug in the software.

    Function: IndexPager::buildQueryInfo (NewFilesPager)
    Error: 2013 Lost connection to MySQL server during query (10.64.16.102)

@jcrespo mentions causes and potential fixes: T141765#2511539

Event Timeline

10.64.16.102 is db1084. I do not see anything wrong with the server itself, although I have seen since 6:47 today triple the bytes_sent from the db server, with no significat increase in traffic (number of requests). That, however, is probably irrelevant.

This is most likely a query optimization issue:

EXPLAIN SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp FROM `image` LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (rc_timestamp = img_timestamp)) WHERE (ug_group IS NULL) AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0' AND rc_namespace = '6' AND (img_timestamp<'20160601') ORDER BY img_timestamp DESC LIMIT 51\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: image
         type: range
possible_keys: PRIMARY,img_timestamp
          key: img_timestamp
      key_len: 16
          ref: NULL
         rows: 27983996
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: user_groups
         type: eq_ref
possible_keys: PRIMARY,ug_group
          key: PRIMARY
      key_len: 261
          ref: commonswiki.image.img_user,const
         rows: 1
        Extra: Using where; Using index; Not exists
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: recentchanges
         type: ref
possible_keys: rc_timestamp,rc_namespace_title,rc_ns_usertext,tmp_3,rc_name_type_patrolled_timestamp
          key: rc_name_type_patrolled_timestamp
      key_len: 22
          ref: const,const,const,commonswiki.image.img_timestamp
         rows: 1
        Extra: Using where
3 rows in set (0.00 sec)
jcrespo raised the priority of this task from Medium to High.Aug 1 2016, 1:44 PM

Indeed, AND (img_timestamp<'20160601') makes the query go from 0 seconds to infinity. The impact is low in numbers (not many requests/hour), but it should be fast to fix and it makes the query consistently fail, so I am marking it as high.

The reason being that results with rc_patrolled = '0' are very rare on old results, so the query ends up reading all (or a huge number) of rows in the images table.

One solution is to limit the amount of time/rows that can be queried like this, e.g. "New files with the given properties, but only withing the last 6 months/100000 last images".

SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp FROM `image` LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (rc_timestamp = img_timestamp)) WHERE (ug_group IS NULL) AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0' AND rc_namespace = '6' AND (img_timestamp<'20160601') AND (img_timestamp>'20160501') ORDER BY img_timestamp DESC LIMIT 51;

takes 3 seconds, probably still too much, but at least is finishes.

A better solution is to drop the STRAIGHT join in this case:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: recentchanges
         type: range
possible_keys: rc_timestamp,rc_namespace_title,rc_ns_usertext,tmp_3,rc_name_type_patrolled_timestamp
          key: rc_timestamp
      key_len: 16
          ref: NULL
         rows: 1
        Extra: Using index condition; Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: image
         type: eq_ref
possible_keys: PRIMARY,img_timestamp
          key: PRIMARY
      key_len: 257
          ref: commonswiki.recentchanges.rc_title
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: user_groups
         type: eq_ref
possible_keys: PRIMARY,ug_group
          key: PRIMARY
      key_len: 261
          ref: commonswiki.recentchanges.rc_user,const
         rows: 1
        Extra: Using where; Using index; Not exists
3 rows in set (0.00 sec)

This takes more time to order, but it returns in 0 seconds.

A third option is to disallow hidepatrolled=1 + offset=* - you can hide patrolled edits on the latest changes, but not on old ones that would be mostly all patrolled.

A third option is to disallow hidepatrolled=1 + offset=* - you can hide patrolled edits on the latest changes, but not on old ones that would be mostly all patrolled.

It is a common usage pattern to patrol from the back of the queue, to pick up the ones that have slipped through the cracks of patrollers working on the front of the queue.

I'll note that there will not even be any results older than $wgRCMaxAge seconds, because recentchanges table entries are deleted after this time. I think this defaults to 30 days in WMF config.

A third option is to disallow hidepatrolled=1 + offset=* - you can hide patrolled edits on the latest changes, but not on old ones that would be mostly all patrolled.

It is a common usage pattern to patrol from the back of the queue, to pick up the ones that have slipped through the cracks of patrollers working on the front of the queue.

It is also how paging is implemented, so definitely not an option.

I want to hear @Anomie's take on this, he is aware of many issues regarding STRAIGHT_JOINs on mediawiki (when it is needed and when it is not, special cases found, etc.).

I'll note that there will not even be any results older than $wgRCMaxAge seconds, because recentchanges table entries are deleted after this time. I think this defaults to 30 days in WMF config.

Indeed. I was using &dir=prev initially, but forgot to mention that.

I'm on vacation at the moment, but off the top of my head I note the straight join here was just added for T124205: Error: 2013 Lost connection to MySQL server during query on NewFilesPager.

The thing to do might be to add a more targeted index (maybe rc_namespace,rc_timestamp or rc_type,rc_namespace,rc_timestamp, or maybe something even more targeted) to cut down the number of matching recentchanges rows, change it to order by rc_timestamp, and switch the table order so straight-join always does recentchanges first. That should at least fix both of these bugs.

After all the MySQL upgrades and all that, this bug is still valid and the two reported URLs are still failing.

kostajh subscribed.

Growth-Team discussed in triage this week and we won't have time to prioritize this in the near-to-medium term.