Page MenuHomePhabricator

Quarry runs thousands times slower in last months
Closed, ResolvedPublic

Description

Could you fix this, please? Thank you.

Event Timeline

Could only be "fixed" when assuming that everybody else would share the same understanding of "slow" and everybody would run into the very same problem with any query.
See https://mediawiki.org/wiki/How_to_report_a_bug

Hi. Slow is just slow. Takes more time. I run every week dozens of queries. Each took less than 5 seconds in the past. It takes at least a couple of minutes each now.

I'm also noticing that most queries stay queued now. Running queries directly against LabsDBs results in far better performance so it might be something in Quarry. Celery maybe.

This task is essentially "Invalid" unless some clear steps-to-reproduce are provided.

How can I do this? I know that it happens, but can't bring you the same run before and after without a time machine.

@zhuyifei1999, did you do something now? The speed is better thousands of times. I hope it's not just for today. For example, I run every month two queries, each one many times for different namespaces. Today they run seconds in place of minutes, and for file namespace first query a second in place of 20 minutes, and the second run first time without be killed, and took a second.

Which query? Can you give a link?

Quarry switched to the new servers a while ago (T176694). It could be related.

Which query? Can you give a link?

Quarry switched to the new servers a while ago (T176694). It could be related.

4835 and 5845.

EXPLAINS:

4835 on analytics:

+------+--------------------+---------------+----------------+--------------------------------------------------+--------------+---------+---------------------------------------------------+-------+-------------------------------------------+
| id   | select_type        | table         | type           | possible_keys                                    | key          | key_len | ref                                               | rows  | Extra                                     |
+------+--------------------+---------------+----------------+--------------------------------------------------+--------------+---------+---------------------------------------------------+-------+-------------------------------------------+
|    1 | PRIMARY            | categorylinks | ref            | PRIMARY,cl_timestamp,cl_sortkey                  | cl_timestamp | 257     | const                                             | 28746 | Using where; Using index; Using temporary |
|    1 | PRIMARY            | page          | eq_ref         | PRIMARY,name_title                               | PRIMARY      | 4       | hewiki.categorylinks.cl_from                      |     1 | Using where                               |
|    1 | PRIMARY            | pagelinks     | ref            | PRIMARY,pl_namespace                             | pl_namespace | 261     | hewiki.page.page_namespace,hewiki.page.page_title |     6 | Using where; Using index                  |
|    1 | PRIMARY            | page          | eq_ref         | PRIMARY,name_title,page_redirect_namespace_len   | PRIMARY      | 4       | hewiki.pagelinks.pl_from                          |     1 | Using where                               |
|    4 | DEPENDENT SUBQUERY | templatelinks | index_subquery | PRIMARY,tl_namespace,tl_backlinks_namespace      | PRIMARY      | 4       | func                                              |     7 | Using index; Using where                  |
|    3 | MATERIALIZED       | categorylinks | ref            | PRIMARY,cl_timestamp,cl_sortkey,cl_collation_ext | cl_timestamp | 257     | const                                             | 28746 | Using where; Using index                  |
+------+--------------------+---------------+----------------+--------------------------------------------------+--------------+---------+---------------------------------------------------+-------+-------------------------------------------+

on labsdb:

+------+--------------+---------------+--------+--------------------------------------------------+--------------+---------+---------------------------------------------------------+----------+-------------------------------------------+
| id   | select_type  | table         | type   | possible_keys                                    | key          | key_len | ref                                                     | rows     | Extra                                     |
+------+--------------+---------------+--------+--------------------------------------------------+--------------+---------+---------------------------------------------------------+----------+-------------------------------------------+
|    1 | PRIMARY      | pagelinks     | index  | PRIMARY,pl_namespace                             | pl_namespace | 265     | NULL                                                    | 30242335 | Using where; Using index; Using temporary |
|    1 | PRIMARY      | page          | eq_ref | PRIMARY,name_title,page_redirect_namespace_len   | PRIMARY      | 4       | hewiki.pagelinks.pl_from                                |        1 | Using where                               |
|    1 | PRIMARY      | page          | eq_ref | PRIMARY,name_title                               | name_title   | 261     | hewiki.pagelinks.pl_namespace,hewiki.pagelinks.pl_title |        1 | Using index                               |
|    1 | PRIMARY      | categorylinks | eq_ref | PRIMARY,cl_timestamp,cl_sortkey                  | PRIMARY      | 261     | hewiki.page.page_id,const                               |        1 | Using where; Using index; Distinct        |
|    4 | MATERIALIZED | templatelinks | index  | PRIMARY,tl_namespace,tl_backlinks_namespace      | tl_namespace | 265     | NULL                                                    |  8646389 | Using where; Using index                  |
|    3 | MATERIALIZED | categorylinks | ref    | PRIMARY,cl_timestamp,cl_sortkey,cl_collation_ext | cl_timestamp | 257     | const                                                   |     8280 | Using where; Using index                  |
+------+--------------+---------------+--------+--------------------------------------------------+--------------+---------+---------------------------------------------------------+----------+-------------------------------------------+

5845 on analytics:

+------+--------------------+---------------+----------------+--------------------------------------------------+--------------+---------+------------------------------+-------+-------------------------------------------+
| id   | select_type        | table         | type           | possible_keys                                    | key          | key_len | ref                          | rows  | Extra                                     |
+------+--------------------+---------------+----------------+--------------------------------------------------+--------------+---------+------------------------------+-------+-------------------------------------------+
|    1 | PRIMARY            | categorylinks | ref            | PRIMARY,cl_timestamp,cl_sortkey                  | cl_timestamp | 257     | const                        | 28746 | Using where; Using index; Using temporary |
|    1 | PRIMARY            | page          | eq_ref         | PRIMARY,name_title                               | PRIMARY      | 4       | hewiki.categorylinks.cl_from |     1 | Using where                               |
|    1 | PRIMARY            | redirect      | ref            | PRIMARY,rd_ns_title                              | rd_ns_title  | 261     | const,hewiki.page.page_title |     1 | Using index                               |
|    1 | PRIMARY            | page          | eq_ref         | PRIMARY                                          | PRIMARY      | 4       | hewiki.redirect.rd_from      |     1 | Using where                               |
|    1 | PRIMARY            | pagelinks     | ref            | PRIMARY,pl_namespace                             | pl_namespace | 261     | const,hewiki.page.page_title |     6 | Using where; Using index                  |
|    1 | PRIMARY            | page          | eq_ref         | PRIMARY,name_title,page_redirect_namespace_len   | PRIMARY      | 4       | hewiki.pagelinks.pl_from     |     1 | Using where                               |
|    4 | DEPENDENT SUBQUERY | templatelinks | index_subquery | PRIMARY,tl_namespace,tl_backlinks_namespace      | PRIMARY      | 4       | func                         |     7 | Using index; Using where                  |
|    3 | MATERIALIZED       | categorylinks | ref            | PRIMARY,cl_timestamp,cl_sortkey,cl_collation_ext | cl_timestamp | 257     | const                        | 28746 | Using where; Using index                  |
+------+--------------------+---------------+----------------+--------------------------------------------------+--------------+---------+------------------------------+-------+-------------------------------------------+

on labsdb:

+------+--------------+---------------+--------+--------------------------------------------------+--------------+---------+--------------------------------------------------+---------+-------------------------------------------+
| id   | select_type  | table         | type   | possible_keys                                    | key          | key_len | ref                                              | rows    | Extra                                     |
+------+--------------+---------------+--------+--------------------------------------------------+--------------+---------+--------------------------------------------------+---------+-------------------------------------------+
|    1 | PRIMARY      | categorylinks | ref    | PRIMARY,cl_timestamp,cl_sortkey                  | cl_timestamp | 257     | const                                            |    8064 | Using where; Using index; Using temporary |
|    1 | PRIMARY      | page          | eq_ref | PRIMARY,name_title                               | PRIMARY      | 4       | hewiki.categorylinks.cl_from                     |       1 | Using where                               |
|    1 | PRIMARY      | redirect      | ref    | PRIMARY,rd_ns_title                              | rd_ns_title  | 261     | const,hewiki.page.page_title                     |       1 | Using index                               |
|    1 | PRIMARY      | page          | eq_ref | PRIMARY                                          | PRIMARY      | 4       | hewiki.redirect.rd_from                          |       1 | Using where                               |
|    1 | PRIMARY      | page          | ref    | PRIMARY,name_title,page_redirect_namespace_len   | name_title   | 4       | const                                            |   49515 | Using where                               |
|    1 | PRIMARY      | pagelinks     | eq_ref | PRIMARY,pl_namespace                             | PRIMARY      | 265     | hewiki.page.page_id,const,hewiki.page.page_title |       1 | Using index; Distinct                     |
|    4 | MATERIALIZED | templatelinks | index  | PRIMARY,tl_namespace,tl_backlinks_namespace      | tl_namespace | 265     | NULL                                             | 8646389 | Using where; Using index                  |
|    3 | MATERIALIZED | categorylinks | ref    | PRIMARY,cl_timestamp,cl_sortkey,cl_collation_ext | cl_timestamp | 257     | const                                            |    8064 | Using where; Using index                  |
+------+--------------+---------------+--------+--------------------------------------------------+--------------+---------+--------------------------------------------------+---------+-------------------------------------------+

This looks like some terrible optimization plans to materialize 8646389 selected rows in a subquery.

The optimizer switch on the labsdb servers are:

index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=on,index_condition_pushdown=off,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=on,mrr_cost_based=on,mrr_sort_keys=on,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on

and the new analytics:

index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=off

The diff is https://commons.wikimedia.org/w/index.php?title=User:Zhuyifei1999/sandbox&oldid=262113567. However, I'm not too familiar with all the switches to determine if they are relevant to these queries. Maybe @jcrespo knows?

Quarry switched to the new servers on Tue, Sep 26 (less than 2 weeks ago), so if that has been happening "in last months", it is not that. The only change I can see, which applies to both servers, is the structure change of *links tables- which was done months ago. You will have to adapt the code to the new structure, which is the one appearing on the tables.sql mediawiki (old structure is unmaintained) OR propose a compatible but addition to the structure like in T177096#3647040.

optimizer_switch can be changed per session, even if there are global defaults, have you tried that? Have you tried using index hints such as force index or STRAIGHT_JOIN, as the tables seem to be using different indexes with different join order? If you have not tried that, the problem first lies on your query performance, not on infrastructure- a server restart or rows being written can modify the efficiency of your queries at any time without warning- you have to be prepared for that (we are on a dynamic system in which the proportion of rows changes, and queries have to be adapted for that).

Actually, running your first query,
takes:

8 rows in set (7.85 sec)

on the new analytics server,
while I it took >200 seconds on labsdb1001:

8 rows in set (3 min 52.05 sec)

You should make sure you just use the new servers, not the old ones to run your queries.

MariaDB 10.1 optimized probably got a bit better materializing subqueries. In any case, you should be the one in control of the query execution, adding hints/doing query rewrites where necessary.

I had to kill the second query after 1000 seconds on labsdb1001:

| Query   |     983 | Queried about 131240000 rows                                                | select distinct(con

It took only 1 second on new analytics:

7 rows in set (0.94 sec)

Hi, @jcrespo. All that you told me, I don't think you should it to me. I just use quarry, don't choose between analytics and something, I don't know what they are. You don't need to compare the queries, they do different things. And the improvement happened between Sep 11 and Oct 8, so Sep 26 fits exactly. All the rest I didn't understand. Thank you.

@IKhitron It is ok, quarry maintainers will know what to do with the information I have written :-)

Sure, @jcrespo, but you talked to me ("your query"), so I answered.

Dunno if anything needs to be done here.

Dunno if anything needs to be done here.

Do you mean it will stay work quickly like this for good?

Do you mean it will stay work quickly like this for good?

Quarry will use the new faster analytic servers, yes. But how the query executes is dependent on how you write your query and the optimizer switches, see T160188#3668422, and future changes may or may not impact the performance of your query.

Do you mean it will stay work quickly like this for good?

Quarry will use the new faster analytic servers, yes. But how the query executes is dependent on how you write your query and the optimizer switches, see T160188#3668422, and future changes may or may not impact the performance of your query.

May impact can be "may make slower"?

Or make faster :)

Make faster it's great, but if it's possible that it will make slower, then this task isn't closed forever, I'm affraid.

Make faster it's great, but if it's possible that it will make slower, then this task isn't closed forever, I'm affraid.

In that case, if one query runs slower than usual, it is possibly:

  • Some issue in Quarry itself with communicating with the databases, storing the results, scheduling the query runs, etc. -- This case is not this ticket; if that happens, please create another ticket.
  • Natural growth in required processing time due to increased data. -- There is literally nothing we can do about it except to ask the Ops for new / better servers. Data grows with time, but processing speed does not without a hardware upgrade.
  • Changes to default query plans / optimization. -- The query optimizer will change across different database server versions. If a new query plan affects you negatively, you are expected to either set the session optimizer switches or make the query plan more explicit in your SQL. In extreme cases, if some changes to default optimizer switches negatively affect near all queries and a global change should be done, you can file a ticket in Data-Services asking for so.
  • Something really bad / unexpected happens to the database replica servers (eg: the servers melt). -- This is unfortunately not foreseeable.

In any case, I don't see why this ticket should be kept open. There's nothing that can be done; and the only two reason your queries got faster is faster processing speeds with the new servers (around 5 times?), and different optimizer plans, which, if you want to stay fast, should make the plan more explicit, and not let the query optimizer be smart.

Not 5 times, at least 1800 times.

I see, thankmyou for this explanation, and for your help, @zhuyifei1999.

Query 4835 works in around 6 seconds for me. What's wrong?

See five top queries on my profile. They usually run 1-5 second. Now half of them run dozens of seconds.

Could be related to the load increase of the servers as a result of the recent *.labsdb redirection to the analytics database servers (rOPUP9c4e161).

Can't understand, but nevermind. Is this permanent?