
Ignore one lagging replica in waitForReplication
Closed, Resolved · Public

Description

With the current number of replicas in production, there is a high chance that one replica goes down for any reason or becomes inaccessible. Sometimes they recover without intervention, sometimes they don't. MediaWiki should stop caring if one replica is down or lagging as long as the total number of replicas is above five. It can panic as much as it wants if that number reaches two.

The LB already ignores a lagging replica automatically when it tries to pick one (same as an inaccessible replica: it just moves on to the next), and it could probably reduce the log level as well when the total number of replicas is above five.

But the biggest problem is that waitForReplication() has a timeout of five seconds, and if that timeout is reached, MediaWiki logspams (and potentially triggers exceptions; TODO: double check). It should simply ignore the timeout if only one replica failed to catch up.

Using a percentage as the threshold doesn't make sense: if you have ten replicas, it doesn't follow that losing two is okay. The read pressure on s1 (with 10-11 pooled replicas in each dc) is much higher than on s5 (~7 pooled replicas). So tolerating the loss of one replica when the total number of them is above five should be just fine, and losing two replicas in one section for hardware reasons is really rare.
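The tolerance rule being proposed can be sketched as follows. This is an illustrative sketch, not MediaWiki code; the function name, the lag-list representation, and the fixed threshold of five are assumptions for illustration:

```python
def effective_max_lag(lags, min_pool_for_tolerance=5):
    """Return the lag value a waitForReplication-style check should act on.

    lags: per-replica replication lag in seconds.
    If the pool is large enough, drop the single worst replica before
    taking the maximum, so one stuck host cannot stall or logspam the
    whole maintenance script.
    """
    if len(lags) > min_pool_for_tolerance:
        lags = sorted(lags)[:-1]  # discard the single worst lag
    return max(lags)

# With six replicas, one stuck host is ignored:
# effective_max_lag([0.1, 0.2, 0.1, 0.3, 0.2, 900.0]) -> 0.3
# With a small pool, every replica still counts:
# effective_max_lag([0.1, 0.2, 900.0]) -> 900.0
```

Note this intentionally drops at most one replica regardless of pool size, matching the argument above that a percentage-based threshold would tolerate too much.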

Event Timeline

You're framing this as a tolerant response to downtime, but the trouble is, the maintenance script itself can drive an unlimited amount of replication lag. If the maintenance script goes at the speed of the slowest replica, they all stay in the pool. If it goes at the speed of the second-slowest replica, the slowest one will gain replication lag, potentially becoming lagged by a significant fraction of the total runtime of the script. If your script runs for a week, the slowest replica could end up lagged by a day or two.

The goal of waitForReplication() is not to reduce general load on the cluster, it's meant to reduce the write rate to match the speed of the slowest replica.
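The batched-write pattern described here can be sketched like this. The sketch is Python, not MediaWiki's actual PHP API; `write_batch` and `wait_for_replication` are stand-ins for the real Maintenance-script helpers:

```python
def run_batched_update(rows, batch_size, write_batch, wait_for_replication):
    """Write rows in batches, pausing after each batch until replicas catch up."""
    batches = 0
    for start in range(0, len(rows), batch_size):
        write_batch(rows[start:start + batch_size])
        # Blocks until the *slowest* pooled replica reaches the primary's
        # position (or a timeout is hit); this is what caps the write rate.
        wait_for_replication()
        batches += 1
    return batches
```

The key point of the comment above is the blocking call between batches: remove the slowest replica from that check and the script speeds up to the pace of the second-slowest, letting the slowest one fall arbitrarily far behind.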

If we want to declare failure of a replica and remove it from the pool, that's fine, but I think that it's not easy for a maintenance script to know whether a replica has failed for external reasons or whether the maintenance script is driving the failure itself by its high write rate. That's why we rely on human operators to depool slow servers.

I think my proposal didn't come through; I probably didn't explain it well.

I'm not asking for the slowest replica to be ignored. I'm asking for it to be ignored if that slowest one is lagging by more than five seconds (the script would still wait for five seconds between batches; the proposal is just to stop complaining about it in the logs). In the maint scripts I've been running over the past years, the worst case is that they cause 0.1s of replag, and if the slowest replica takes even one full second to catch up, that's still respected and the maint script continues to go as fast as the slowest replica (so that's not what I'm proposing here).

If a maint script is causing 5s of lag on the slowest replica, it means: 1) the lag on the second-slowest will be quite high anyway; 2) due to semi-sync replication, the whole cluster will be at a standstill regardless, because if the heavy writes are the source of the replag then most replicas are going to lag heavily too.

Do you know of any case in the past couple of years of a maint script constantly causing +5s of lag per batch? If so, then we need a different solution for that (transaction profiler, etc.), not waitForReplication.

That's why we rely on human operators to depool slow servers.

We are out-scaling that. We have three or four DBAs (who won't be around on weekends) and 300 DBs. That's why I made this proposal.

I actually realized something: waitForReplication() doesn't take the secondary datacenter, WMCS, and other replicas into account, which has in many cases led to maint scripts causing significant lag there when they ran for a long period of time. As a result, if a maint script runs for a long enough period, it must also have a --sleep option between batches (I introduced several of them myself). That means the idea that "The goal of waitForReplication() is [...] to reduce the write rate to match the speed of the slowest replica" is moot, because it hasn't been doing that for years (including for replicas that serve production traffic but sit in the secondary datacenter).

the proposal is to just stop complaining about it in logs

OK, you can change the log messages, I don't care about those.

mediawiki logspams (and potentially trigger exceptions, TODO: double check).

MysqlReplicationReporter::primaryPosWait() just logs a message and returns -1, so LoadBalancer::waitForAll() returns false, so LBFactory::waitForReplication() returns false, so Maintenance::waitForReplication() returns false. Then nothing ever checks that return value. There's no exception as far as I can see.
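The call chain traced above can be modeled minimally like this. The function names mirror the MediaWiki ones being discussed, but the bodies are an illustrative sketch of the return-value propagation, not the real implementation:

```python
def primary_pos_wait(timed_out):
    """Models MysqlReplicationReporter::primaryPosWait():
    on timeout it (logs and) returns -1, otherwise a non-negative value."""
    return -1 if timed_out else 0

def wait_for_all(timed_out):
    # Models LoadBalancer::waitForAll(): -1 becomes False.
    return primary_pos_wait(timed_out) >= 0

def lb_factory_wait_for_replication(timed_out):
    # Models LBFactory::waitForReplication(): passes the boolean through.
    return wait_for_all(timed_out)

def maintenance_wait_for_replication(timed_out):
    # Models Maintenance::waitForReplication(); its return value is
    # ignored by every caller, so a timeout never raises an exception.
    return lb_factory_wait_for_replication(timed_out)
```

The point of the model: a timeout only turns into a False that nobody reads, which is why the observable symptom is logspam rather than failures.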

if a maint script is running for long-enough period they must also have --sleep option between each batch (I introduced several one of them myself)

A hand-tuned sleep interval seems very tedious -- that's why we have waitForReplication in the first place. If I was responsible for hand-tuning those intervals, I probably would have automated it long ago. Have an HTTP endpoint on toolforge and have MW fetch it via a proxy or something.

the proposal is to just stop complaining about it in logs

OK, you can change the log messages, I don't care about those.

Awesome. Gonna make some changes there.

mediawiki logspams (and potentially trigger exceptions, TODO: double check).

MysqlReplicationReporter::primaryPosWait() just logs a message and returns -1, so LoadBalancer::waitForAll() returns false, so LBFactory::waitForReplication() returns false, so Maintenance::waitForReplication() returns false. Then nothing ever checks that return value. There's no exception as far as I can see.

Awesome. I will make it return true for the one-lagging-replica case, just in case.

if a maint script is running for long-enough period they must also have --sleep option between each batch (I introduced several one of them myself)

A hand-tuned sleep interval seems very tedious -- that's why we have waitForReplication in the first place. If I was responsible for hand-tuning those intervals, I probably would have automated it long ago. Have an HTTP endpoint on toolforge and have MW fetch it via a proxy or something.

Yeah, but OTOH we had a case where wikireplicas went down for a week, and the secondary datacenter might get depooled for maintenance. I think we should have two waitForReplication variants: one for jobs, short maint scripts, etc. that wouldn't care about the secondary dc, and one for maint scripts that take weeks or months to finish, which do need to take replication in the secondary dc into account. OTOH, just the latency between the two DCs is enough of a sleep interval.

Change #1014086 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Stop sending error logs when only one replica is lagging

https://gerrit.wikimedia.org/r/1014086

Change #1014086 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Stop sending error logs when only one replica is lagging

https://gerrit.wikimedia.org/r/1014086