Some slaves are impossible to check for replication lag in MediaWiki
Closed, Invalid (Public)

Description

As described in T187516#4033863, I ran a script on s7 that performed a lot of REPLACEs. Even though it's supposed to wait for slaves properly, there was a replication lag alert for dbstore1002. This server is not listed as a slave in mediawiki-config, and normal MW database credentials don't work on it, so it's hard to perform bulk writes safely.

Event Timeline

dbstore1002 is an analytics slave; it replicates all the shards, so its performance isn't great.
It has different credentials on purpose. I don't know how wfWaitForSlaves works, but this host isn't listed in any file under wmf-config, so I don't know why your script is waiting for it.
I don't know if it is hardcoded somewhere and being checked for historic reasons, but I don't think it should be checked for lag, as this host usually has some delay.

No, the issue is the opposite: lag checks can't see it, so more writes get piled up than it can handle.
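To make the mechanism concrete, here is a minimal sketch of a bulk-write loop that pauses between batches until the replicas it knows about have caught up. It is not MediaWiki's actual implementation; `run_batches`, `apply_batch`, and `lag_probes` are hypothetical names used only for illustration. The point is that a host missing from `lag_probes` is never consulted, so writes keep flowing regardless of how far behind it is.

```python
import time
from typing import Callable, Dict, Iterable, List


def run_batches(
    batches: Iterable[List[int]],
    apply_batch: Callable[[List[int]], None],
    lag_probes: Dict[str, Callable[[], float]],
    max_lag: float = 5.0,
    poll: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Apply each batch, then wait until every *listed* replica is within max_lag.

    lag_probes maps a replica name to a function returning its lag in seconds.
    Hosts that are not in this mapping are invisible to the throttle.
    """
    for batch in batches:
        apply_batch(batch)
        # Re-poll all known replicas until the worst one is acceptable.
        while max((probe() for probe in lag_probes.values()), default=0.0) > max_lag:
            sleep(poll)
```

With an empty `lag_probes` the loop never sleeps, which is effectively the situation for a host that the configuration does not list.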


Ah right - I didn't get that from the original task description. So you saw a bunch of "waiting for GTID etc etc etc"

Which user is used to check for this? wikiadmin? wikiuser?

jcrespo subscribed.

dbstore1002 is not a MediaWiki replica; it is not part of the production infrastructure. While it would be nice to wait for it, it currently has a different engine, which means that the same queries could perform really badly there, and that is OK: replication there is a best-effort process. The alerts on IRC can indicate problems such as breakages or other issues; lately it lags because of T175790, which is a Wikidata process that has nothing to do with replication.

If you need access to dbstore1002 for analytics/research purposes, you can ask for it, and you can check replication there using the heartbeat tables, but that is another process.
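As a rough sketch of what a heartbeat-based check computes: assuming a pt-heartbeat style table in which the master periodically writes its current UTC time as an ISO 8601 string, lag on the replica is the difference between the current time and the newest timestamp that has replicated. The function name and the timestamp format below are assumptions for illustration; the actual table layout on dbstore1002 is not described in this task.

```python
from datetime import datetime, timezone


def heartbeat_lag_seconds(heartbeat_ts: str, now: datetime) -> float:
    """Return the lag in seconds implied by a replicated heartbeat timestamp.

    heartbeat_ts: ISO 8601 string written by the master (assumed to be UTC
                  when it carries no explicit offset).
    now:          timezone-aware current time on the checking side.
    """
    written = datetime.fromisoformat(heartbeat_ts)
    if written.tzinfo is None:
        # Assumption: naive heartbeat timestamps are in UTC.
        written = written.replace(tzinfo=timezone.utc)
    # Clamp small negative values caused by clock skew.
    return max(0.0, (now - written).total_seconds())
```

In practice the timestamp would come from a query against the heartbeat table on the replica, and `now` from `datetime.now(timezone.utc)`.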

lag checks can't see it so more writes get piled than it can handle.

And that is OK. If many writes pile up, we could have a look, because something really wrong is going on, but I wouldn't check it until it has been like that for a week.

To give more context: until it broke, dbstore1001 was exactly 1 day behind for recovery purposes, and that was OK.