Page MenuHomePhabricator

Quarry or the Analytics wikireplicas role creates lots of InnoDB Purge Lag
Closed, ResolvedPublic

Description

Recently we've had reports where Quarry was misbehaving (T247978 T246970).
We have decided to reimage labsdb1011 to Buster and 10.4 to see if that helps the CPU usage.
We depooled labsdb1011 and placed labsdb1010 on the Analytics role (which also servers Quarry): https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/592516/

Before the reimage, we noticed that labsdb1011 had a huuuuuge backlog of pending InnoDB purges (https://planet.mysql.com/entry/?id=5991415):

We placed labsdb1010 to replace it temporarily, and we observed how its pending purges started to increase as soon as it took over the analytics role:

We don't know whether that endless growing purge lag was the cause of labsdb1011 slowness and crashes, but it is definitely something not good and probably not something we can afford long-term.
It seems very specific to this slow role and/or the fact that they serve Quarry, as we can see with labsdb1010.

For what is worth, the query killer on labsdb1010 is now 3600 seconds - the normal query killer on the analytics role (labsdb1011) is 14400.

Event Timeline

Restricted Application edited projects, added cloud-services-team (Kanban); removed cloud-services-team. · View Herald TranscriptMay 4 2020, 5:33 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Huji moved this task from Backlog to Wiki replicas on the Data-Services board.May 4 2020, 6:27 PM
JHedden triaged this task as Medium priority.May 5 2020, 4:04 PM
bd808 added a comment.May 5 2020, 4:58 PM

From https://mariadb.com/kb/en/innodb-purge/

The old versions are kept at least until all transactions older than the transaction that updated the row are no longer open.

IANADBA, but I think this means that the SELECT volume against whatever instance handles the *.analytics Wiki Replica traffic and the very common slow queries that implies are a likely cause for the backlog. Grepping ops/puppet, I don't see innodb_max_purge_lag set anywhere. Does that mean that our instances never force queries to pause if things are getting too far behind?

The ideal solution, (other than modify client query patterns, which we are likely cannot) is not innodb_max_purge_lag (we already have a big slowdown and backlog) but by increasing the purging intensity with the number of parallel purgue_threads. We didn't want to add new configs while labsdb1011 was crashing hard, though.

In reality, probably the only good option is to load balance analytics queries so there is enough cpu for background processes. Making the server slower will only move the problem to the backlog (and break the amount of simultaneous connections, overload in that direction), not solve it. But we'll see. In other words, throwing hw at it.

Change 597098 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] wikireplica_analytics: Query killer decreased to 7200

https://gerrit.wikimedia.org/r/597098

Change 597098 merged by Marostegui:
[operations/puppet@production] wikireplica_analytics: Query killer decreased to 7200

https://gerrit.wikimedia.org/r/597098

This is a proof on how much CPU intensive the Analytics role is.
This is a CPU graph from labsdb1010 as soon as I depooled it from the analytics role

And the effect on purge and lag is huge too after it got rid of that role:

So far, having placed 2 hosts (1010 and 1011) on Analytics role seems to be keeping the InnoDB purge lag (as well as replication lag) under control.
Going to give it some more days before closing this.

Marostegui closed this task as Resolved.Jun 15 2020, 5:52 AM
Marostegui claimed this task.

I am going to close this as resolved, as looks like that placing labsdb1010 and labsdb1011 on the Analytics role, splitting the load a bit, has solved the InnoDB purge lag and the replication lag.

labsdb1010 (7 last days):


labsdb1011 (7 last days):