Wiki Replicas are very slow and timing out
Closed, Declined (Public)

Description

The replica databases are currently very slow - https://quarry.wmflabs.org/query/28096 for instance was running in ~8-15 minutes last month, but the exact same query has been timing out and being killed for over two weeks now.

There were recent schema changes following the actor and comment storage refactoring. However, even simple queries that do not touch these tables, and that have an efficient plan (according to EXPLAIN), are running much slower than they used to.

In addition, replication lag is often quite bad, around 45 hours at the time of writing.
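For reference, this is roughly how the lag and the query plans were checked - a minimal sketch using pymysql and the heartbeat_p.heartbeat view documented on wikitech; the host name and the sample query are illustrative, not the exact Quarry query above.

```python
# Minimal sketch, not the exact Quarry query: check replication lag and a
# query plan from a Toolforge tool account. Assumes pymysql, the standard
# ~/replica.my.cnf credentials file, and the heartbeat_p.heartbeat view
# described on wikitech; the host name and the sample query are illustrative.
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.eqiad.wmflabs",  # analytics (long query) service
    database="enwiki_p",
    read_default_file="~/replica.my.cnf",
)

with conn.cursor() as cur:
    # Replication lag in seconds, per shard.
    cur.execute("SELECT shard, lag FROM heartbeat_p.heartbeat")
    for shard, lag in cur.fetchall():
        print(f"{shard}: {lag / 3600:.1f} hours behind")

    # A simple query that does not touch the actor/comment tables still has a
    # reasonable plan according to EXPLAIN, yet runs far slower than before.
    cur.execute("EXPLAIN SELECT COUNT(*) FROM page WHERE page_namespace = 0")
    for row in cur.fetchall():
        print(row)

conn.close()
```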

Event Timeline

This doesn't seem to be limited to Quarry. Queries (such as https://quarry.wmflabs.org/query/36665) that had been running in about 15 minutes both on Quarry and through a bot using https://wikitech.wikimedia.org/wiki/User:Legoktm/toolforge_library (which connects to the Toolforge DB replicas) are now timing out on both. The issue appears to be with the database servers themselves.
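For context, the bot side is just a plain connection through that library - roughly the sketch below (assuming the toolforge package's connect() helper described on the wikitech page linked above; the query shown is a stand-in, not the real one from query 36665).

```python
# Rough sketch of the bot's side of things. Assumes the `toolforge` Python
# package from the wikitech page linked above, whose connect() helper reads
# ~/replica.my.cnf and returns a pymysql connection; the query below is only
# a placeholder for the real (much heavier) one.
import toolforge

conn = toolforge.connect("enwiki_p")  # wiki replica of enwiki

with conn.cursor() as cur:
    cur.execute(
        "SELECT COUNT(*) FROM revision WHERE rev_timestamp >= %s",
        ("20190601000000",),
    )
    print(cur.fetchone())

conn.close()
```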

Framawiki subscribed.

This looks like it is not restricted to Quarry, hence removing that tag.

MusikAnimal renamed this task from "Quarry running slow and timing out" to "Toolforge replicas are very slow and timing out". Jun 27 2019, 12:25 AM
MusikAnimal updated the task description.
MusikAnimal awarded a token.
MusikAnimal subscribed.
bd808 renamed this task from "Toolforge replicas are very slow and timing out" to "Wiki Replicas are very slow and timing out". Jul 1 2019, 3:59 AM

Timeouts may have been at least partially related to T226297: ERROR 2013 (HY000): Lost connection to MySQL server during query on replicas.

Yeah, essentially we have 3 hosts. Usually only one of them is dedicated to the long queries (analytics) and two of them to the web service (fast queries), but due to the maintenance (T222978) we now have one host serving analytics which also serves a portion of the web traffic, and hence it is more loaded than normal.
This is the change: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/518029/
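For readers unfamiliar with the split: the web/analytics distinction is exposed to tools as two different service names per wiki, so a rough sketch of what "serving analytics" versus "serving web" means from the client side looks like the following. The service names are the ones in use around this time and should be treated as an assumption; the canonical list is on wikitech.

```python
# Illustrative only: each wiki is reachable under a "web" service name (fast,
# interactive queries) and an "analytics" service name (long-running queries).
# Service names are as used around 2019 and are an assumption here; check
# wikitech for the current ones.
import pymysql


def replica_connect(wiki: str, cluster: str = "web") -> pymysql.connections.Connection:
    """Connect to a wiki replica on the chosen cluster ('web' or 'analytics')."""
    return pymysql.connect(
        host=f"{wiki}.{cluster}.db.svc.eqiad.wmflabs",
        database=f"{wiki}_p",
        read_default_file="~/replica.my.cnf",
    )


web_conn = replica_connect("enwiki", "web")              # quick interactive queries
analytics_conn = replica_connect("enwiki", "analytics")  # long-running queries
```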

I think the balance of the 3 servers has changed again since @Marostegui wrote that comment on 2019-06-23.

Yes, the host that was out for maintenance, labsdb1011, was repooled. However, we still need to continue with the maintenance for T222978: Compress and defragment tables on labsdb hosts, so I am going to depool labsdb1011 again. I know this is unfortunate, but there is nothing else we can do to reduce disk space, and we have to do it no matter what, or else the replicas will get completely full.

Note that the original issue I filed this ticket for now seems to have been fixed - the Quarry query has run successfully for the last few days.

How long do you expect labsdb1011 to be depooled for? Is this going to be a regular thing?

> How long do you expect labsdb1011 to be depooled for? Is this going to be a regular thing?

It is not going to be a regular thing. labsdb1011 is about to finish, but then we will have to do the same with labsdb1009 and labsdb1010. Unfortunately, this needs to happen, as otherwise the replicas will get their disks filled up (T222978) and they'll become fully unusable.

I just wanted to say that over the past week or so, the replicas are back to being amazingly fast. I actually don't remember them ever being this fast. Queries that would normally time out aren't anymore. Whatever you did, thank you for doing it! :)

Glad to hear it, @MusikAnimal - we are trying a different approach whilst still compressing tables, which requires less depooling time. We will still, however, require depooling once it is time for the biggest wikis to be compressed (enwiki, commons, wikidata...), but hopefully for hours instead of days :)

Nothing actionable here.