Page MenuHomePhabricator

SpecialWhatLinksHere::showIndirectLinks - Lost connection to MySQL server during query (10.64.16.28)
Closed, ResolvedPublic

Description

Going to throws https://meta.wikimedia.org/wiki/Special:WhatLinksHere/Meta:Babel

A database query error has occurred. This may indicate a bug in the software.

Function: SpecialWhatLinksHere::showIndirectLinks
Error: 2013 Lost connection to MySQL server during query (10.64.16.28)

Details

Related Gerrit Patches:

Event Timeline

Glaisher created this task.Jul 23 2015, 1:18 PM
Glaisher raised the priority of this task from to High.
Glaisher updated the task description. (Show Details)
Glaisher added a subscriber: Glaisher.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 23 2015, 1:18 PM

@Glaisher: I can not repeat it with a cachable request (please report if you can repeat it every time).

Sadly, there are certain combinations of queries, wikis and tables that sometimes run for too long and end up being killed- which is "self inflicted error". That doesn't mean that they are not problems to fix, but we have to classify them by frequency (as usually that means table structure changes). See for example, T101502.

The frequency of this error seems to be 462 in the last 7 days, with more incidence on metawiki, wikinews and commons. I will be able to provide the details of a sample of the errors at some point.

I agree that this derived table is a bit ugly:

SELECT page_id,page_namespace,page_title,rd_from,page_is_redirect FROM 
(
  SELECT pl_from,rd_from FROM `pagelinks` 
  INNER JOIN `page` ON ((pl_from = page_id)) 
  LEFT JOIN `redirect` ON ((rd_from = pl_from) AND 
  rd_title = 'Babel' AND 
  (rd_interwiki = '' OR rd_interwiki IS NULL) AND rd_namespace = '4') 
  WHERE pl_namespace = '4' AND pl_title = 'Babel' ORDER BY pl_from LIMIT 102 ) `temp_backlink_range` 
INNER JOIN `page` ON ((pl_from = page_id)) 
ORDER BY page_id LIMIT 51

I could repeat it several times when I created this task but it doesn't seem to be happening now.

That is a very common thing: once a db server is "hot" or it is on varnish cache, it does not happen again. However, for the server it is faster to produce error as soon as possible and let the user try again. (Again, that doesn't mean it is not something to solve)

So the priority of this task can be lowered?

Umherirrender set Security to None.Jul 24 2015, 4:56 PM
Umherirrender added a subscriber: aaron.

Priority should be handled by devs in this case, but if you want my opinion, I would put it as higher than average for the following reason:

This is creating a full tablescan + filesort on page, which is incredible slow in almost every wiki:

mysql> EXPLAIN SELECT page_id,page_namespace,page_title,rd_from,page_is_redirect FROM 
(SELECT tl_from,rd_from FROM `templatelinks` INNER JOIN `page` ON ((tl_from = page_id)) 
LEFT JOIN `redirect` ON ((rd_from = tl_from) AND rd_title = 'Information' AND 
(rd_interwiki = '' OR rd_interwiki IS NULL) AND rd_namespace = '10') 
WHERE tl_namespace = '10' AND tl_title = 'Information' AND (tl_from >= 317) 
ORDER BY tl_from LIMIT 102 ) `temp_backlink_range` 
INNER JOIN `page` ON ((tl_from = page_id)) 
ORDER BY page_id LIMIT 51 \G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 102
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: temp_backlink_range.tl_from
         rows: 1
        Extra: 
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: page
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 19423644
        Extra: Using where; Using index; Using temporary; Using filesort
*************************** 4. row ***************************
           id: 2
  select_type: DERIVED
        table: templatelinks
         type: eq_ref
possible_keys: tl_from,tl_namespace
          key: tl_from
      key_len: 265
          ref: commonswiki.page.page_id,const,const
         rows: 1
        Extra: Using where; Using index
*************************** 5. row ***************************
           id: 2
  select_type: DERIVED
        table: redirect
         type: eq_ref
possible_keys: PRIMARY,rd_ns_title
          key: PRIMARY
      key_len: 4
          ref: commonswiki.page.page_id
         rows: 1
        Extra: Using where
5 rows in set (0.00 sec)

This only happens, however, on Templates (Pages?) with many transclusions (Links?). See, for comparison:

mysql> EXPLAIN SELECT page_id,page_namespace,page_title,rd_from,page_is_redirect FROM 
(SELECT tl_from,rd_from FROM `templatelinks` INNER JOIN `page` ON ((tl_from = page_id)) 
LEFT JOIN `redirect` ON ((rd_from = tl_from) AND rd_title = 'Abstract' AND 
(rd_interwiki = '' OR rd_interwiki IS NULL) AND rd_namespace = '10') 
WHERE tl_namespace = '10' AND tl_title = 'Abstract' AND (tl_from >= 317) 
ORDER BY tl_from LIMIT 102 ) `temp_backlink_range`
INNER JOIN `page` ON ((tl_from = page_id)) 
ORDER BY page_id LIMIT 51 \G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: temp_backlink_range.tl_from
         rows: 1
        Extra: 
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: templatelinks
         type: range
possible_keys: tl_from,tl_namespace
          key: tl_namespace
      key_len: 265
          ref: NULL
         rows: 1
        Extra: Using where; Using index
*************************** 4. row ***************************
           id: 2
  select_type: DERIVED
        table: redirect
         type: eq_ref
possible_keys: PRIMARY,rd_ns_title
          key: PRIMARY
      key_len: 4
          ref: commonswiki.templatelinks.tl_from
         rows: 1
        Extra: Using where
*************************** 5. row ***************************
           id: 2
  select_type: DERIVED
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: commonswiki.templatelinks.tl_from
         rows: 1
        Extra: Using index
5 rows in set (0.00 sec)

I will try to see if something can be done at pure MySQL level (statistics) or changing the query, but if not, I'm afraid that the refactoring on the job creating the table will not be easy.

So , I made some query rewriting magic and this is why the query plan change, if we force the regular query plan when there are more templatelinks per page, we get:

mysql> EXPLAIN SELECT page_id,page_namespace,page_title,rd_from,page_is_redirect FROM 
(SELECT STRAIGHT_JOIN tl_from,rd_from FROM `templatelinks` 
LEFT JOIN `redirect` ON ((rd_from = tl_from) AND rd_title = 'Information' AND 
(rd_interwiki = '' OR rd_interwiki IS NULL) AND rd_namespace = '10' ) 
INNER JOIN `page` ON ((tl_from = page_id))  
WHERE tl_namespace = '10' AND tl_title = 'Information' AND (tl_from >= 317) 
ORDER BY tl_from LIMIT 102 ) `temp_backlink_range` 
INNER JOIN `page` ON ((tl_from = page_id)) 
ORDER BY page_id LIMIT 51 \G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 102
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: temp_backlink_range.tl_from
         rows: 1
        Extra: 
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: templatelinks
         type: range
possible_keys: tl_from,tl_namespace
          key: tl_namespace
      key_len: 265
          ref: NULL
         rows: 54103560 <------------------------------ OPTIMIZER DOES NOT LIKE THIS, BUT WE ARE FORCING IT
        Extra: Using where; Using index
*************************** 4. row ***************************
           id: 2
  select_type: DERIVED
        table: redirect
         type: eq_ref
possible_keys: PRIMARY,rd_ns_title
          key: PRIMARY
      key_len: 4
          ref: commonswiki.templatelinks.tl_from
         rows: 1
        Extra: Using where
*************************** 5. row ***************************
           id: 2
  select_type: DERIVED
        table: page
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: commonswiki.templatelinks.tl_from
         rows: 1
        Extra: Using index
5 rows in set (0.01 sec)

However, this last execution does:

51 rows in set (0.00 sec)

While I had to kill the original query because it took more than 5 minutes.

I will talk with @Springle (as he may be more experienced than me on handling mediawiki index stats) to see if we can avoid the STRAIGHT_JOIN (because it may bite us in the future, see T101502).

jcrespo claimed this task.Jul 25 2015, 12:25 PM

FYI, most hits on this query seem to come from crawling bots (Google, Yahoo, Baidu).

After talking to @Springle, he agreed that the best fix is to use STRAIGHT_JOIN.

So, converting:

SELECT page_id,page_namespace,page_title,rd_from,page_is_redirect FROM 
(
  SELECT pl_from,rd_from 
  FROM `pagelinks` 
  INNER JOIN `page` ON ((pl_from = page_id)) 
  LEFT JOIN `redirect` ON ((rd_from = pl_from) AND rd_title = 'Writing_an_article' AND (rd_interwiki = '' OR rd_interwiki IS NULL) AND rd_namespace = '4') 
  WHERE pl_namespace = '4' AND pl_title = 'Writing_an_article' ORDER BY pl_from LIMIT 102 
) `temp_backlink_range` 
INNER JOIN `page` ON ((pl_from = page_id)) 
ORDER BY page_id LIMIT 51;

into

SELECT page_id,page_namespace,page_title,rd_from,page_is_redirect FROM 
(
  SELECT STRAIGHT_JOIN pl_from,rd_from 
  FROM `pagelinks` 
  LEFT JOIN `redirect` ON ((rd_from = pl_from) AND rd_title = 'Writing_an_article' AND (rd_interwiki = '' OR rd_interwiki IS NULL) AND rd_namespace = '4') 
  INNER JOIN `page` ON ((pl_from = page_id))
  WHERE pl_namespace = '4' AND pl_title = 'Writing_an_article' ORDER BY pl_from LIMIT 102 
) `temp_backlink_range` 
INNER JOIN `page` ON ((pl_from = page_id)) 
ORDER BY page_id LIMIT 51;

We need a patch, although it will take me a while unless someone wants to help me.

Change 227878 had a related patch set uploaded (by Aaron Schulz):
Use STRAIGHT_JOIN in SpecialWhatlinkshere query

https://gerrit.wikimedia.org/r/227878

Change 227878 merged by jenkins-bot:
Use STRAIGHT_JOIN in SpecialWhatlinkshere query

https://gerrit.wikimedia.org/r/227878

Nice. Thanks all!