Page MenuHomePhabricator

SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change
Closed, ResolvedPublic

Description

While deploying a schema change (T284375) on s8 master, with replication enabled, a few of the replicas reached too many connections.
The cause is the following queries piling up on codfw s8 replicas

1mysql:root@localhost [wikidatawiki]> explain SELECT /* LinkCache::fetchPageRow */ page_id,page_len,page_is_redirect,page_latest,page_restrictions,page_content_model,page_lang FROM `page` WHERE page_namespace = 0 AND page_title = 'REDIR' LIMIT 1;
2+------+-------------+-------+------+---------------+------+---------+------+----------+-------------+
3| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
4+------+-------------+-------+------+---------------+------+---------+------+----------+-------------+
5| 1 | SIMPLE | page | ALL | NULL | NULL | NULL | NULL | 92858311 | Using where |
6+------+-------------+-------+------+---------------+------+---------+------+----------+-------------+

These queries seem to be queries to monitor the mw status of the hosts, requesting Special:Blankpage from LVS

There are several things that probably need fixing here:

  • This query isn't cheap
  • We might want to rate limit this query somehow.
  • This makes impossible to deploy schema changes with replication on the standby DC for certain big tables.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

The query seems to come from https://gerrit.wikimedia.org/g/mediawiki/core/+/873118723cbe3c78e631bea44a66fb3659b9beab/includes/MediaWiki.php#1018
This code path is triggered as the health checks use http
Switching to https would avoid this code path and query for the health checks.
Though I guess we also want to look at why the query tries to scan the whole table?

This is not a Wikidata specific issue (nothing special here) and would occur on any mediawiki instance in theroy.

Though I guess we also want to look at why the query tries to scan the whole table?

It technically doesn't but s8 is under index rename, making this problematic in that case. It happens only in the standby DC.

Addshore renamed this task from SELECT query arriving to wikidatawiki db codfw hosts causing pile ups to SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change.Jun 15 2021, 8:30 AM

9:27 AM <marostegui> addshore: Maybe the issue is the schema change isn't being done o nthe same transaction, so while we drop the index and the queries arrive, there's no index until the new one is created
9:27 AM <marostegui> I can definitely change that for the next iterations

My guess is that as the schema change isn't made on the same transaction, the query arrives in between the drop+create and gets stuck with that huge full scan.
I have depooled the hosts that are giving too many connections (as they others have the schema change already done) and see if that helps reducing the load and get the schema change done.

I will change T284375 to make sure it is done on the same transaction for the next iterations.

All the hosts have been recovered.

jbond triaged this task as Medium priority.Jun 21 2021, 2:27 PM

We chose S:BP for those queries on the assumption that, by its nature, it would be a cheap page to monitor. Is there a better option we should be using, or is this ticket more about fixing inefficiencies in it?

Marostegui claimed this task.

I am going to consider this fixed as it never happened again.