In normal operation the query returns very quickly, but under some conditions it can get stuck: for example, the replica is waiting to be stopped, but the stop is itself waiting for a large write to finish (which could in turn be blocked by metadata locking). While this scenario is very unlikely, it actually happened on codfw while performing maintenance on one of the pooled wikidata servers, making all MediaWiki requests fail, even those that were only loading enwiki's home page.
There are 3 things that could be done to mitigate this:
- Make sure SHOW SLAVE STATUS has an adequate timeout, in seconds rather than minutes, to avoid pileups. Consider the server dead (delayed) if the timeout is hit.
- Use pt-heartbeat exclusively for replication checks. This would avoid the problems with SHOW SLAVE STATUS, which is not "100% safe" because it requires some locking.
- Avoid a hard dependency between wikis and wikidata, so that users can still see "some content", or the request fails quickly, if the wikidata db is unavailable (this is not trivial and probably out of the scope of this ticket, but it is worth mentioning).
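
As a rough illustration of the first two points, here is a minimal Python sketch (not MediaWiki's actual code). It wraps an arbitrary lag check in a hard timeout and treats a timeout or error as "replica dead/delayed". It also derives lag from a heartbeat timestamp instead of SHOW SLAVE STATUS. The names get_lag_or_dead, lag_from_heartbeat and LAG_CHECK_TIMEOUT are hypothetical, the 2-second value is only an example, and the heartbeat timestamp is assumed to be an ISO 8601 string in UTC.

```python
import concurrent.futures
import time
from datetime import datetime, timezone

# Hypothetical default: "seconds, not minutes", as suggested above.
LAG_CHECK_TIMEOUT = 2.0


def get_lag_or_dead(probe, timeout=LAG_CHECK_TIMEOUT):
    """Run probe() (a callable returning lag in seconds) with a hard timeout.

    Returns the lag as a float, or None if the probe timed out, raised, or
    returned a non-numeric value (e.g. NULL lag). None means the caller
    should treat the replica as dead/delayed.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(probe)
    try:
        return float(future.result(timeout=timeout))
    except concurrent.futures.TimeoutError:
        return None
    except Exception:
        return None
    finally:
        # Do not wait for a stuck probe here; waiting would recreate the pileup.
        # The worker thread still runs to completion in the background, so a
        # real client should also set a connection/query timeout.
        executor.shutdown(wait=False)


def lag_from_heartbeat(ts, now=None):
    """Lag in seconds from the newest heartbeat timestamp.

    Assumes ts is an ISO 8601 string in UTC, as an illustration of a
    heartbeat-table based check.
    """
    beat = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (now - beat).total_seconds())


if __name__ == "__main__":
    # A fast probe returns its lag; a stuck one is reported as dead (None).
    print(get_lag_or_dead(lambda: 0.4))
    print(get_lag_or_dead(lambda: time.sleep(0.3) or 0, timeout=0.05))
```

The point of returning None rather than raising is that the caller can apply the same "skip this replica" logic to a timeout as to excessive lag, instead of letting requests queue up behind a blocked query.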