Page MenuHomePhabricator

Odd performance issue on a toolforge DB replica
Closed, ResolvedPublicBUG REPORT

Description

I'm running into an odd issue with performance of some queries on the replica database from the toolforge servers (known to happen both from the sge bastions and the grid itself). The following query:

SELECT bp.page_title, bp.page_id, pr_index.pr_count, ifnull(sum(pp_value=0 and pp_propname='proofread_page_quality_level'), 0) as q_0
FROM page AS sp
INNER JOIN page AS bp ON(sp.page_title like concat(bp.page_title, '/%') and sp.page_namespace=104)
INNER JOIN page_props ON(sp.page_id=pp_page)
INNER JOIN pr_index ON(pr_page_id=bp.page_id)
WHERE bp.page_id=?

Works correctly and swiftly (in miliseconds) for roughly 95% of pages in the database, but stalls for anywhere from 15 to 60 seconds for the remaining 5%. There seems to be no pattern to which 5%, but it is consistent (that is, the same page_id will either be always quick or always slow). For instance:

[...] WHERE bp.page_id=2894727;

returns in miliseconds, whereas

[...] WHERE bp.page_id=2894947;

always takes over 16s and sometimes as much as 40s.

The query always returns exactly one row, and the stalled queries do not appear to relate to the number of subpages that are involved. Stalled queries are shown in the processlist as being in state 'Sending data'.

@Bstorm found that while the issue can be replicated from the toolforge bastions, making the same queries directly on the DBs have all queries return nearly instantly.

Event Timeline

(Forgot to mention that those queries are being run on the frwikisource database)

These would be hitting labsdb1010 and labsdb1011. On both servers, the queries were blazing fast and used indexes. Nothing seems wrong at the proxy level either for the servers. At the same time, I could easily confirm that the latter query took me 24 seconds from a Toolforge server.

My mind goes to:

  • Possible weird permissions thing on the DB servers related to indexing? This seems pretty unlikely. @Marostegui do you have any ideas?
  • DNS issues since this is hitting the cloud DNS setup before it hits wikimedia production DNS, and there have been issues there (but why would the other query work fast?)
  • A network issue? I still don't like the idea because of the above.
  • Something about the client side processing around the aggregate functions being limited in Toolforge -- cgroup limits are not across the environment (only on bastions), but we do limit vm overcommit. (possibly more likely).
  • Weird processing difference that happens on the dbproxy level in haproxy?

From an exec node (so no cgroup limits), you can clearly see the difference between the two queries:

MariaDB [frwikisource_p]> select bp.page_title, bp.page_id, pr_index.pr_count, sum(pp_value=0) as q_0, sum(pp_value=1) as q_1, sum(pp_value=2) as q_2, sum(pp_value=3) as q_3, sum(pp_value=4) as q_4 from page as sp inner join page as bp on (sp.page_title like concat(bp.page_title, '/%') and sp.page_namespace=104) inner join page_props on(sp.page_id=pp_page and pp_propname='proofread_page_quality_level') inner join pr_index on(pr_page_id=bp.page_id) where bp.page_id=2894947;
+------------------------------------+---------+----------+------+------+------+------+------+
| page_title                         | page_id | pr_count | q_0  | q_1  | q_2  | q_3  | q_4  |
+------------------------------------+---------+----------+------+------+------+------+------+
| La_Messaline_française,_1789.djvu  | 2894947 |       73 |    2 |    0 |    0 |    0 |   71 |
+------------------------------------+---------+----------+------+------+------+------+------+
1 row in set (28.84 sec)

MariaDB [frwikisource_p]> select bp.page_title, bp.page_id, pr_index.pr_count, sum(pp_value=0) as q_0, sum(pp_value=1) as q_1, sum(pp_value=2) as q_2, sum(pp_value=3) as q_3, sum(pp_value=4) as q_4 from page as sp inner join page as bp on (sp.page_title like concat(bp.page_title, '/%') and sp.page_namespace=104) inner join page_props on(sp.page_id=pp_page and pp_propname='proofread_page_quality_level') inner join pr_index on(pr_page_id=bp.page_id) where bp.page_id=2894727;
+------------------------------------+---------+----------+------+------+------+------+------+
| page_title                         | page_id | pr_count | q_0  | q_1  | q_2  | q_3  | q_4  |
+------------------------------------+---------+----------+------+------+------+------+------+
| Renard_-_Le_Péril_Bleu,_1911.djvu  | 2894727 |      374 |    7 |    0 |    0 |  178 |  189 |
+------------------------------------+---------+----------+------+------+------+------+------+
1 row in set (0.01 sec)

I can't run an explain from there, but on the server, where I can, the problem literally doesn't happen! They are both nearly instant. If the aggregate functions are seriously hampered by memory settings on the client side somehow (don't know that much of mysql internals), then I can tweak those settings. This doesn't seem to make sense, though.

It looks as if the query I ran second (which is the fast one always--it's the one listed first in the description of the ticket), has even more relations? Odd.

Explain via sql-optimizer for page_id=2894947 (slow in T222768#5166017):

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1.1SIMPLEpageconstPRIMARYPRIMARY4const1
1.2SIMPLEpr_indexconstPRIMARYPRIMARY4const1
1.3SIMPLEpage_propsrefPRIMARY, pp_propname_page, pp_propname_sortkey_pagepp_propname_page62const102996Using index condition
1.4SIMPLEpageeq_refPRIMARY, name_titlePRIMARY4frwikisource.page_props.pp_page1Using where

Explain via sql-optimizer for page_id=2894727 (fast in T222768#5166017):

idselect_typetabletypepossible_keyskeykey_lenrefrowsExtra
1.1SIMPLEpageconstPRIMARYPRIMARY4const1
1.2SIMPLEpr_indexconstPRIMARYPRIMARY4const1
1.3SIMPLEpagerangePRIMARY, name_titlename_title2617146Using where; Using index
1.4SIMPLEpage_propseq_refPRIMARY, pp_propname_page, pp_propname_sortkey_pagePRIMARY66frwikisource.page.page_id,const1Using where

It looks like the query optimizer is flipping back and forth between two different plans and this makes a lot of difference in the wall clock time. The fast plan uses the name_title index on the page table first, then filters on page_props. The slow plan reverses these tables putting page_props first and then page. This second slow plan looks like it ends up dealing with 2 orders of magnitude more rows.

Ah, I had suspected it did not use an index, rather than just different ordering. But the most interesting question in my mind is: why is this happening only when going through the proxy and not when doing the queries directly on the DB behind it?

(In both cases on the views and not on the underlying tables).

@bd808 I got the same result running explains on the server. The difference was that, while the query plan differs between the two for some reason, both were nearly instant in response. That part is what made me really scratch my head.

That said, I'd love to know why the plan switches too; there's an obvious inefficiency there even if it doesn't explain the odd discrepancy when the proxy is involved.

Unless the client is doing part of the work and gets 100K rows transmitted? But that seems completely implausible.

By looking at the explain for both queries (run locally), looks like the optimizer was choosing different indexes (which could be because: 1) the optimizer is being silly again - which is not new or 2) the indexes stats makes it think that it is faster to use a different one).
I have ran an analyze table on page and page_props to update it and now both queries are using the same plan and are equally fast:

root@labsdb1010.eqiad.wmnet[frwikisource]> select bp.page_title, bp.page_id, pr_index.pr_count, sum(pp_value=0) as q_0, sum(pp_value=1) as q_1, sum(pp_value=2) as q_2, sum(pp_value=3) as q_3, sum(pp_value=4) as q_4 from page as sp inner join page as bp on (sp.page_title like concat(bp.page_title, '/%') and sp.page_namespace=104) inner join page_props on(sp.page_id=pp_page and pp_propname='proofread_page_quality_level') inner join pr_index on(pr_page_id=bp.page_id) where bp.page_id=2894947;
+------------------------------------+---------+----------+------+------+------+------+------+
| page_title                         | page_id | pr_count | q_0  | q_1  | q_2  | q_3  | q_4  |
+------------------------------------+---------+----------+------+------+------+------+------+
| La_Messaline_française,_1789.djvu  | 2894947 |       73 |    2 |    0 |    0 |    0 |   71 |
+------------------------------------+---------+----------+------+------+------+------+------+
1 row in set (0.06 sec)

root@labsdb1010.eqiad.wmnet[frwikisource]> select bp.page_title, bp.page_id, pr_index.pr_count, sum(pp_value=0) as q_0, sum(pp_value=1) as q_1, sum(pp_value=2) as q_2, sum(pp_value=3) as q_3, sum(pp_value=4) as q_4 from page as sp inner join page as bp on (sp.page_title like concat(bp.page_title, '/%') and sp.page_namespace=104) inner join page_props on(sp.page_id=pp_page and pp_propname='proofread_page_quality_level') inner join pr_index on(pr_page_id=bp.page_id) where bp.page_id=2894727;
+------------------------------------+---------+----------+------+------+------+------+------+
| page_title                         | page_id | pr_count | q_0  | q_1  | q_2  | q_3  | q_4  |
+------------------------------------+---------+----------+------+------+------+------+------+
| Renard_-_Le_Péril_Bleu,_1911.djvu  | 2894727 |      374 |    7 |    0 |    0 |  178 |  189 |
+------------------------------------+---------+----------+------+------+------+------+------+
1 row in set (0.00 sec)

root@labsdb1010.eqiad.wmnet[frwikisource]> explain select bp.page_title, bp.page_id, pr_index.pr_count, sum(pp_value=0) as q_0, sum(pp_value=1) as q_1, sum(pp_value=2) as q_2, sum(pp_value=3) as q_3, sum(pp_value=4) as q_4 from page as sp inner join page as bp on (sp.page_title like concat(bp.page_title, '/%') and sp.page_namespace=104) inner join page_props on(sp.page_id=pp_page and pp_propname='proofread_page_quality_level') inner join pr_index on(pr_page_id=bp.page_id) where bp.page_id=2894727;
+------+-------------+------------+--------+---------------------------------------------------+------------+---------+-------------------------------+------+--------------------------+
| id   | select_type | table      | type   | possible_keys                                     | key        | key_len | ref                           | rows | Extra                    |
+------+-------------+------------+--------+---------------------------------------------------+------------+---------+-------------------------------+------+--------------------------+
|    1 | SIMPLE      | bp         | const  | PRIMARY                                           | PRIMARY    | 4       | const                         |    1 |                          |
|    1 | SIMPLE      | pr_index   | const  | PRIMARY                                           | PRIMARY    | 4       | const                         |    1 |                          |
|    1 | SIMPLE      | sp         | range  | PRIMARY,name_title                                | name_title | 261     | NULL                          | 7730 | Using where; Using index |
|    1 | SIMPLE      | page_props | eq_ref | PRIMARY,pp_propname_page,pp_propname_sortkey_page | PRIMARY    | 66      | frwikisource.sp.page_id,const |    1 | Using where              |
+------+-------------+------------+--------+---------------------------------------------------+------------+---------+-------------------------------+------+--------------------------+
4 rows in set (0.03 sec)

root@labsdb1010.eqiad.wmnet[frwikisource]> explain select bp.page_title, bp.page_id, pr_index.pr_count, sum(pp_value=0) as q_0, sum(pp_value=1) as q_1, sum(pp_value=2) as q_2, sum(pp_value=3) as q_3, sum(pp_value=4) as q_4 from page as sp inner join page as bp on (sp.page_title like concat(bp.page_title, '/%') and sp.page_namespace=104) inner join page_props on(sp.page_id=pp_page and pp_propname='proofread_page_quality_level') inner join pr_index on(pr_page_id=bp.page_id) where bp.page_id=2894727;
+------+-------------+------------+--------+---------------------------------------------------+------------+---------+-------------------------------+------+--------------------------+
| id   | select_type | table      | type   | possible_keys                                     | key        | key_len | ref                           | rows | Extra                    |
+------+-------------+------------+--------+---------------------------------------------------+------------+---------+-------------------------------+------+--------------------------+
|    1 | SIMPLE      | bp         | const  | PRIMARY                                           | PRIMARY    | 4       | const                         |    1 |                          |
|    1 | SIMPLE      | pr_index   | const  | PRIMARY                                           | PRIMARY    | 4       | const                         |    1 |                          |
|    1 | SIMPLE      | sp         | range  | PRIMARY,name_title                                | name_title | 261     | NULL                          | 7730 | Using where; Using index |
|    1 | SIMPLE      | page_props | eq_ref | PRIMARY,pp_propname_page,pp_propname_sortkey_page | PRIMARY    | 66      | frwikisource.sp.page_id,const |    1 | Using where              |
+------+-------------+------------+--------+---------------------------------------------------+------------+---------+-------------------------------+------+--------------------------+
4 rows in set (0.00 sec)

I have run the analyze on all the hosts

coren claimed this task.

Indeed, that solved the immediate issue entirely: the bot's run went from 195m down to its expected 5m or so.

My sysadmin fiber is still staring at the proxy with deep suspicion since I don't really understand why it would magnify the effect by over four orders of magnitude, but that's another story. :-)

I suspect that may have been related to how data transfer plays into it, perhaps. Running on the server itself, it doesn't need to. 🤔
If it's fixed, I'm happy, though!
Thanks @Marostegui !