Wiki Replicas are very slow and timing out
Closed, Declined (Public)

Description

The replica databases are currently very slow - https://quarry.wmflabs.org/query/28096 for instance was running in ~8-15 minutes last month, but the exact same query has been timing out and being killed for over two weeks now.

There were recent schema changes following the actor and comment storage refactoring. However, even simple queries that do not touch these tables, and that have an efficient plan (according to EXPLAIN), are running much slower than they used to.

In addition, replication lag is often quite bad, around 45 hours at the time of writing.
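For reference, this is roughly how the lag and the query plans were checked - a minimal sketch using pymysql and the heartbeat_p.heartbeat view documented on wikitech; the host name and the sample query are illustrative, not the exact Quarry query above.

```python
# Minimal sketch, not the exact Quarry query: check replication lag and a
# query plan from a Toolforge tool account. Assumes pymysql, the standard
# ~/replica.my.cnf credentials file, and the heartbeat_p.heartbeat view
# described on wikitech; the host name and the sample query are illustrative.
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.eqiad.wmflabs",  # analytics (long query) service
    database="enwiki_p",
    read_default_file="~/replica.my.cnf",
)

with conn.cursor() as cur:
    # Replication lag in seconds, per shard.
    cur.execute("SELECT shard, lag FROM heartbeat_p.heartbeat")
    for shard, lag in cur.fetchall():
        print(f"{shard}: {lag / 3600:.1f} hours behind")

    # A simple query that does not touch the actor/comment tables still has a
    # reasonable plan according to EXPLAIN, yet runs far slower than before.
    cur.execute("EXPLAIN SELECT COUNT(*) FROM page WHERE page_namespace = 0")
    for row in cur.fetchall():
        print(row)

conn.close()
```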

Event Timeline

This doesn't seem to be limited to Quarry. Queries (such as https://quarry.wmflabs.org/query/36665) that had been running in about 15 minutes both on Quarry and through a bot using https://wikitech.wikimedia.org/wiki/User:Legoktm/toolforge_library (which connects to the Toolforge DB replicas) are now timing out on both. The issue appears to be with the database servers themselves.
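For context, the bot side is just a plain connection through that library - roughly the sketch below (assuming the toolforge package's connect() helper described on the wikitech page linked above; the query shown is a stand-in, not the real one from query 36665).

```python
# Rough sketch of the bot's side of things. Assumes the `toolforge` Python
# package from the wikitech page linked above, whose connect() helper reads
# ~/replica.my.cnf and returns a pymysql connection; the query below is only
# a placeholder for the real (much heavier) one.
import toolforge

conn = toolforge.connect("enwiki_p")  # wiki replica of enwiki

with conn.cursor() as cur:
    cur.execute(
        "SELECT COUNT(*) FROM revision WHERE rev_timestamp >= %s",
        ("20190601000000",),
    )
    print(cur.fetchone())

conn.close()
```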

Framawiki subscribed.

This looks like it is not restricted to Quarry, hence removing that tag.

MusikAnimal renamed this task from "Quarry running slow and timing out" to "Toolforge replicas are very slow and timing out". Jun 27 2019, 12:25 AM
MusikAnimal updated the task description.
MusikAnimal awarded a token.
MusikAnimal subscribed.
bd808 renamed this task from "Toolforge replicas are very slow and timing out" to "Wiki Replicas are very slow and timing out". Jul 1 2019, 3:59 AM

Timeouts may have been at least partially related to T226297: ERROR 2013 (HY000): Lost connection to MySQL server during query on replicas.

Yeah, essentially we have 3 hosts. Usually only one of them is dedicated to the long queries (analytics) and two of them to the web service (fast queries), but due to the maintenance (T222978) we now have one host serving analytics which also serves a portion of the web traffic, and hence it is more loaded than normal.
This is the change: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/518029/
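For readers unfamiliar with the split: the web/analytics distinction is exposed to tools as two different service names per wiki, so a rough sketch of what "serving analytics" versus "serving web" means from the client side looks like the following. The service names are the ones in use around this time and should be treated as an assumption; the canonical list is on wikitech.

```python
# Illustrative only: each wiki is reachable under a "web" service name (fast,
# interactive queries) and an "analytics" service name (long-running queries).
# Service names are as used around 2019 and are an assumption here; check
# wikitech for the current ones.
import pymysql


def replica_connect(wiki: str, cluster: str = "web") -> pymysql.connections.Connection:
    """Connect to a wiki replica on the chosen cluster ('web' or 'analytics')."""
    return pymysql.connect(
        host=f"{wiki}.{cluster}.db.svc.eqiad.wmflabs",
        database=f"{wiki}_p",
        read_default_file="~/replica.my.cnf",
    )


web_conn = replica_connect("enwiki", "web")              # quick interactive queries
analytics_conn = replica_connect("enwiki", "analytics")  # long-running queries
```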

I think the balance of the 3 servers has changed again since @Marostegui wrote that comment on 2019-06-23.

Yes, the host that was out for maintenance, labsdb1011, was repooled. However, we still need to continue with the maintenance for T222978: Compress and defragment tables on labsdb hosts, so I am going to depool labsdb1011 again. I know this is unfortunate, but there is nothing else we can do to reduce disk space, and we have to do it no matter what, or else the replicas will get completely full.

Note that the original issue I filed this ticket for now seems to have been fixed - the Quarry query has run successfully for the last few days.

How long do you expect labsdb1011 to be depooled for? Is this going to be a regular thing?

> How long do you expect labsdb1011 to be depooled for? Is this going to be a regular thing?

It is not going to be a regular thing. labsdb1011 is about to finish, but then we will have to do the same with labsdb1009 and labsdb1010. Unfortunately, this needs to happen, as otherwise the replicas will get their disks filled up (T222978) and they'll become fully unusable.

I just wanted to say that over the past week or so, the replicas are back to being amazingly fast. I actually don't remember them ever being this fast. Queries that would normally time out aren't anymore. Whatever you did, thank you for doing it! :)

Glad to hear it, @MusikAnimal - we are trying a different approach whilst still compressing tables, which requires less depooling time. We will still, however, require depooling once it is time for the biggest wikis to be compressed (enwiki, commons, wikidata...), but hopefully for hours instead of days :)

Nothing actionable here.