
A query suddenly takes much too long – something corrupt?
Open, Medium, Public

Description

In the dewiki_p database (connected to labsdb1010), the following query takes about 3 hours:

SELECT DISTINCT pl_title FROM pagelinks LEFT JOIN page ON page_title = pl_title AND page_namespace = pl_namespace WHERE pl_from_namespace = 0 AND pl_namespace = 0 AND page_id IS NULL AND pl_title LIKE 'P%';

When I change the last part to a different letter, like

SELECT DISTINCT pl_title FROM pagelinks LEFT JOIN page ON page_title = pl_title AND page_namespace = pl_namespace WHERE pl_from_namespace = 0 AND pl_namespace = 0 AND page_id IS NULL AND pl_title LIKE 'O%';

it is much faster:
72625 rows in set (1 min 29.13 sec)

It may be relevant that two days ago, on May 28th, that long-running query finished within a minute. So something odd happened here.

Event Timeline

jcrespo subscribed.

Please read the announcements made on the cloud list. I am NOT saying it is due to that, but AFAIK some changes to the underlying tables were mentioned that may require query tuning; please verify those changes are unrelated.

Also, you didn't include the EXPLAIN plan (obtainable with tools like Quarry), which would help us research this.

I just re-read your ticket: "When I replace the last part to a different letter" – that is not unusual; please have a look at: https://www.slideshare.net/jynus/query-optimization-with-mysql-80-and-mariadb-103-the-basics/53 [WARNING, link to an external site].

The above mentions that because data is dynamic, query plans are dynamic too, which means the number of rows selected may have increased recently and caused the query to perform badly; please consider using a smaller range instead.
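The "smaller range" suggestion could be sketched like this (the split points are hypothetical examples, not a recommendation of specific boundaries):

```sql
-- Sketch: break the big LIKE 'P%' scan into narrower slices, so each
-- query touches fewer rows and makes better use of memory buffers.
SELECT DISTINCT pl_title
FROM pagelinks
LEFT JOIN page
  ON page_title = pl_title AND page_namespace = pl_namespace
WHERE pl_from_namespace = 0
  AND pl_namespace = 0
  AND page_id IS NULL
  AND pl_title >= 'P' AND pl_title < 'Pe';
-- ...then repeat with 'Pe' to 'Pj', 'Pj' to 'Po', and so on.
```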

Explain when the slow query runs:

MariaDB [dewiki_p]> show explain FOR 248311371;
+------+-------------+-----------+--------+-------------------------------------+--------------+---------+---------------------------------+----------+-----------------------------------------------------+
| id   | select_type | table     | type   | possible_keys                       | key          | key_len | ref                             | rows     | Extra                                               |
+------+-------------+-----------+--------+-------------------------------------+--------------+---------+---------------------------------+----------+-----------------------------------------------------+
|    1 | SIMPLE      | pagelinks | ref    | pl_namespace,pl_backlinks_namespace | pl_namespace | 4       | const                           | 11090664 | Using index condition; Using where; Using temporary |
|    1 | SIMPLE      | page      | eq_ref | name_title                          | name_title   | 261     | const,dewiki.pagelinks.pl_title |        1 | Using where; Using index; Not exists; Distinct      |
+------+-------------+-----------+--------+-------------------------------------+--------------+---------+---------------------------------+----------+-----------------------------------------------------+
2 rows in set, 1 warning (0.01 sec)

Explain when the faster one (LIKE 'O%') runs:

MariaDB [dewiki_p]> show explain FOR 290160886;
+------+-------------+-----------+--------+-------------------------------------+--------------+---------+---------------------------------+----------+-----------------------------------------------------+
| id   | select_type | table     | type   | possible_keys                       | key          | key_len | ref                             | rows     | Extra                                               |
+------+-------------+-----------+--------+-------------------------------------+--------------+---------+---------------------------------+----------+-----------------------------------------------------+
|    1 | SIMPLE      | pagelinks | ref    | pl_namespace,pl_backlinks_namespace | pl_namespace | 4       | const                           | 13985414 | Using index condition; Using where; Using temporary |
|    1 | SIMPLE      | page      | eq_ref | name_title                          | name_title   | 261     | const,dewiki.pagelinks.pl_title |        1 | Using where; Using index; Not exists; Distinct      |
+------+-------------+-----------+--------+-------------------------------------+--------------+---------+---------------------------------+----------+-----------------------------------------------------+
2 rows in set, 1 warning (0.01 sec)

I am fully aware that slightly different queries can take different times. These two have about the same number of rows to inspect: about 11 million for the extremely slow one and ~14 million for the faster one.

However, a difference between 1.5 minutes and 3 hours is not within the range of expected variation.

That is strange indeed; I may be able to take a deeper look later.

As promised, I had a look at this; the following queries take around the same time:

[dewiki_p]> SELECT count(DISTINCT pl_title) FROM pagelinks LEFT JOIN page ON page_title = pl_title AND page_namespace = pl_namespace WHERE pl_from_namespace = 0 AND pl_namespace = 0 AND page_id IS NULL AND pl_title LIKE 'P%';
+--------------------------+
| count(DISTINCT pl_title) |
+--------------------------+
|                   202037 |
+--------------------------+
1 row in set (17.21 sec)

[dewiki_p]> SELECT count(DISTINCT pl_title) FROM pagelinks LEFT JOIN page ON page_title = pl_title AND page_namespace = pl_namespace WHERE pl_from_namespace = 0 AND pl_namespace = 0 AND page_id IS NULL AND pl_title LIKE 'O%'; 
+--------------------------+
| count(DISTINCT pl_title) |
+--------------------------+
|                    72635 |
+--------------------------+
1 row in set (5.82 sec)

The sizes are not dissimilar, so I wonder if there is a tipping point in query planning, an optimizer bug, or something else.

Actually, I cannot reproduce the slowness of the original query on labsdb1009 (it is normal for it to take a few seconds more due to extra memcopy/IO):

202038 rows in set (20.45 sec)

I will test on the other servers.

Labsdb1010:
202038 rows in set (16.99 sec)

labsdb1011 indeed takes longer:

ERROR 1317 (70100): Query execution was interrupted

It was stuck on "Copying to tmp table on disk".

root@labsdb1011[(none)]> show explain for 13199292\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pagelinks
         type: ref
possible_keys: pl_namespace,pl_backlinks_namespace
          key: pl_namespace
      key_len: 4
          ref: const
         rows: 10839430
        Extra: Using index condition; Using where; Using temporary
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: name_title
          key: name_title
      key_len: 261
          ref: const,dewiki.pagelinks.pl_title
         rows: 1
        Extra: Using where; Using index; Not exists; Distinct
2 rows in set, 1 warning (0.00 sec)

vs the other fast query:

root@labsdb1011[(none)]> show explain for 13230505\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: pagelinks
         type: range
possible_keys: pl_namespace,pl_backlinks_namespace
          key: pl_backlinks_namespace
      key_len: 265
          ref: NULL
         rows: 9359682
        Extra: Using where; Using index; Using temporary
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: name_title
          key: name_title
      key_len: 261
          ref: const,dewiki.pagelinks.pl_title
         rows: 1
        Extra: Using where; Using index; Not exists; Distinct
2 rows in set, 1 warning (0.00 sec)
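The plans above differ in the index chosen for pagelinks: the slow plan uses pl_namespace (ref), the fast one pl_backlinks_namespace (range). As a workaround sketch only (my assumption, not an official recommendation from this ticket), the good index could be pinned with an index hint so a plan flip from stale statistics cannot reoccur:

```sql
-- Workaround sketch: force the index the fast plan picked, so the
-- optimizer cannot fall back to the slow pl_namespace plan.
SELECT DISTINCT pl_title
FROM pagelinks FORCE INDEX (pl_backlinks_namespace)
LEFT JOIN page
  ON page_title = pl_title AND page_namespace = pl_namespace
WHERE pl_from_namespace = 0
  AND pl_namespace = 0
  AND page_id IS NULL
  AND pl_title LIKE 'P%';
```

The downside of FORCE INDEX is that it also prevents the optimizer from picking a genuinely better plan if the data changes, so it is best kept as a temporary measure.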

Until I check what is different about labsdb1011, I suggest using labsdb1009 (the wikireplica-web entry point) for that query (or making sure you query labsdb1010 on the analytics one). But obviously something seems different there; I will keep researching.

I am going to depool labsdb1011 and run an index statistics update.
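For context, this is the kind of maintenance meant (the exact commands are my assumption, not a record of what was actually run on labsdb1011):

```sql
-- Assumed sketch: refresh the index statistics the optimizer consults
-- when choosing between pl_namespace and pl_backlinks_namespace.
ANALYZE TABLE dewiki.pagelinks;
ANALYZE TABLE dewiki.page;
```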

After the above maintenance, this should work now as intended:

202038 rows in set (40.36 sec)

One piece of advice: returning so many rows may not be optimal for the wikireplicas, where buffers are limited due to the large number of users. Limiting results and paging by an index may reduce the overall latency, even if more queries are needed (by taking better advantage of memory buffers).
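The paging idea could be sketched as keyset pagination (the chunk size and the :last_title placeholder are illustrative assumptions):

```sql
-- Sketch: fetch the result in index-ordered chunks instead of all at
-- once. :last_title starts as '' and is set to the last pl_title of the
-- previous chunk on each iteration, until a chunk comes back empty.
SELECT DISTINCT pl_title
FROM pagelinks
LEFT JOIN page
  ON page_title = pl_title AND page_namespace = pl_namespace
WHERE pl_from_namespace = 0
  AND pl_namespace = 0
  AND page_id IS NULL
  AND pl_title > :last_title
ORDER BY pl_title
LIMIT 10000;
```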

Please confirm everything looks OK now and close if resolved. Due to the IO, it is not unusual for row/index stats to go stale and not refresh frequently, although it would likely have fixed itself after some time. This is why pl_backlinks_namespace was not used for the 'P%' version of the query.

Some background: There are a lot of "redlinks" in articles, like [[Ac Aceca]] where an article [[AC Aceca]] exists – an example of an uppercase/lowercase difference. The same is true for letters with accents and similar additions, like a redlink [[Diana Castano Sarrias]] next to an existing [[Diana Castaño Sarrias]].

Therefore I try to find all redlinks with that query, then match them against all existing articles with some magic PHP code and print out a list of possibly resolvable redlinks. There are ~25,000 such redlinks in dewiki.
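The matching step (done in PHP here) could also be sketched in SQL. This is a hypothetical illustration only: `redlinks` is an assumed holding table filled with the pl_title values found above, not a real table on the replicas, and how far accents fold to their base letters depends on the collation:

```sql
-- Hypothetical sketch: titles are stored as binary strings, so convert
-- them to utf8mb4 and compare under a case- and (mostly) accent-
-- insensitive collation, so that 'Diana Castano Sarrias' can match
-- 'Diana Castaño Sarrias'.
SELECT r.title AS redlink, p.page_title AS existing_article
FROM redlinks r
JOIN page p
  ON p.page_namespace = 0
 AND CONVERT(p.page_title USING utf8mb4)
     = CONVERT(r.title USING utf8mb4) COLLATE utf8mb4_general_ci;
```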

Doing all this in one single query seems to be impossible; I tried it, but a timeout was the only result. That would indeed return far fewer records. Maybe it works on small wikis, but my interest is dewiki :-/

Until I started this phab entry, I had one single query for all the data, so there was no "AND pl_title LIKE 'P%'" clause. That query took about half an hour, returned close to 4 million records, and worked fine for about 2 months. No problem.

Suddenly it stopped working: I got a timeout after about 3 hours. So I split it into single letters and found that astonishing difference in runtime. The next day it could not be reproduced. One more day later, the same query with "AND pl_title LIKE 'P%'" again took 1 hour 55 minutes. Another day later, 99 seconds.

But! I did not log which host I was connected to, so maybe the two slow runs were on one host and the fast ones on another. I added that to the logfile and will watch.

Off-topic, but since you recommended it: EXPLAIN with Quarry usually does not work, at least on dewiki_p (I did not try other Wikipedias). It fails because there are at least two servers, labsdb1010 and labsdb1011: when the query runs on one server and Quarry looks on the other server for the EXPLAIN, it fails, and it fails almost every time. Yes, some months (or more) ago, EXPLAIN with Quarry used to work fine, but things changed :-(

Interesting, but just to be clear: the issue right now is gone, right? Your other comments are useful, but I would file them separately (e.g. I cannot help with Quarry).

Just to be clear, there was indeed a problem with labsdb1011 (not on 9 or 10, which is why it may have been intermittent). It has now been solved on 11 too. However, I cannot guarantee it won't ever happen again; we do have several bugs open with MariaDB regarding the optimizer. The best way is to file a new bug when you see something wrong, and I can manually refresh the index statistics.

I will watch it for a few days and close this one when nothing unexpected happens.


Regarding the Quarry issue: that is T205214: EXPLAIN is broken because the new analytics wiki replica cluster contains multiple servers.

The last days:
2019-06-05 connected to labsdb1010, 1 large query "SELECT DISTINCT pl_title FROM pagelinks LEFT JOIN page ON page_title = pl_title AND page_namespace = pl_namespace WHERE pl_from_namespace = 0 AND pl_namespace = 0 AND page_id IS NULL" took ~55 minutes, which is okay.

2019-06-06 labsdb1010, same query as above, 94 minutes. Okay too.

Fixed something in the script, so I ran it again:
2019-06-06 labsdb1010, 45 minutes, okay.

2019-06-07 this time labsdb1011, query as above, 100 minutes. Okay too.

2019-06-08 labsdb1011 hit a timeout after 240 minutes, so as a fallback I used single queries for each letter, like "SELECT DISTINCT pl_title FROM pagelinks LEFT JOIN page ON page_title = pl_title AND page_namespace = pl_namespace WHERE pl_from_namespace = 0 AND pl_namespace = 0 AND page_id IS NULL AND pl_title LIKE 'A%'". Each of those single queries was pretty fast; the overall time for all 27 queries was 30 minutes, which is fine too.

2019-06-09 labsdb1011 again timed out after 240 minutes; split into single queries, it took 60 minutes.

Going on vacation; will not work on this at the moment.