SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change
Closed, Resolved · Public

Assigned To

Authored By

	• Marostegui
	Jun 15 2021, 8:24 AM

Description

While deploying a schema change (T284375) on s8 master, with replication enabled, a few of the replicas reached too many connections.
The cause is the following query piling up on the codfw s8 replicas:

P16523 (An Untitled Masterwork)

mysql:root@localhost [wikidatawiki]> explain SELECT /* LinkCache::fetchPageRow */ page_id,page_len,page_is_redirect,page_latest,page_restrictions,page_content_model,page_lang FROM `page` WHERE page_namespace = 0 AND page_title = 'REDIR' LIMIT 1;
+------+-------------+-------+------+---------------+------+---------+------+----------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+----------+-------------+
|    1 | SIMPLE      | page  | ALL  | NULL          | NULL | NULL    | NULL | 92858311 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+----------+-------------+

These appear to be monitoring queries that check the MediaWiki status of the hosts by requesting Special:Blankpage from LVS.

There are several things that probably need fixing here:

This query isn't cheap
We might want to rate limit this query somehow.
This makes it impossible to deploy schema changes with replication on the standby DC for certain big tables.

Related Objects

Mentioned Here: P16523 (An Untitled Masterwork)
T284375: Rename name_title index on page to page_name_title

Event Timeline

• Marostegui created this task. Jun 15 2021, 8:24 AM

Restricted Application added a project: [DEPRECATED] wdwb-tech. Jun 15 2021, 8:24 AM

Restricted Application added a subscriber: Aklapper.

• Marostegui added subscribers: Addshore, Ladsgroup. Jun 15 2021, 8:24 AM

The query seems to come from https://gerrit.wikimedia.org/g/mediawiki/core/+/873118723cbe3c78e631bea44a66fb3659b9beab/includes/MediaWiki.php#1018
This code path is triggered because the health checks use HTTP.
Switching them to HTTPS would avoid this code path, and therefore the query, for the health checks.
Though I guess we also want to look at why the query tries to scan the whole table?

This is not a Wikidata-specific issue (nothing special here) and would occur on any MediaWiki instance in theory.

Addshore moved this task from Inbox to External Realm on the [DEPRECATED] wdwb-tech board. Jun 15 2021, 8:29 AM

Addshore moved this task from incoming to monitoring on the Wikidata board.

Though I guess we also want to look at why the query tries to scan the whole table?

Technically it doesn't, but s8 is in the middle of an index rename, which makes the query problematic in this case. It happens only in the standby DC.

9:27 AM <marostegui> addshore: Maybe the issue is the schema change isn't being done on the same transaction, so while we drop the index and the queries arrive, there's no index until the new one is created
9:27 AM <marostegui> I can definitely change that for the next iterations

My guess is that, because the schema change isn't done in a single transaction, the query arrives between the drop and the create and gets stuck doing that huge full scan.
I have depooled the hosts that are hitting too many connections (the others already have the schema change applied) to see if that helps reduce the load and lets the schema change finish.

I will change T284375 to make sure it is done in a single transaction for the next iterations.

Maintenance_bot added a project: SRE. Jun 15 2021, 8:45 AM

All the hosts have been recovered.

jbond triaged this task as Medium priority. Jun 21 2021, 2:27 PM

BPirkle removed a project: Platform Engineering. Jun 22 2021, 9:15 PM

ArielGlenn subscribed. Jun 22 2021, 9:17 PM

We chose S:BP for those queries on the assumption that, by its nature, it would be a cheap page to monitor. Is there a better option we should be using, or is this ticket more about fixing inefficiencies in it?

I am going to consider this fixed as it never happened again.

• Marostegui edited projects, added DBA; removed SRE. May 19 2022, 8:31 AM
