
For switchovers: A way to check if slaves are up to date
Closed, Resolved, Public

Description

When doing a DB master switchover, we need to check that all the slaves are up to date with the current master before we can proceed with the switchover.
This is not as easy as running show slave status: even if the new master is read_only, pt-heartbeat can still insert, so the binlog position keeps changing.

What was initially discussed as a way to check whether the slaves have caught up with the master is to check whether their Exec_Master_Log_Pos is equal to or higher than the position in the master's show master status output.
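The comparison discussed above could be sketched roughly as follows. This is only an illustration, not the code that ended up in the script: BinlogPosition, _file_index and is_caught_up are hypothetical names, and it assumes the replica side is read from Relay_Master_Log_File and Exec_Master_Log_Pos in show slave status, the master side from File and Position in show master status, and that binlog file names end in a numeric suffix after the last dot.

```python
from typing import NamedTuple


class BinlogPosition(NamedTuple):
    """A binlog coordinate: file name plus byte offset (hypothetical helper type)."""
    file: str
    pos: int


def _file_index(name: str) -> int:
    # Assumes names such as "db1052-bin.001234": the numeric suffix orders the files.
    return int(name.rsplit(".", 1)[1])


def is_caught_up(master: BinlogPosition, replica: BinlogPosition) -> bool:
    """Return True if the replica has executed at least up to the master position.

    The file index is compared first, because the offset alone is meaningless
    once the master has rotated to a new binlog file.
    """
    return (_file_index(replica.file), replica.pos) >= (
        _file_index(master.file),
        master.pos,
    )
```

Because pt-heartbeat keeps writing, the master position would need to be read first and the replica afterwards, so that a caught-up replica reports a position equal to or higher than the one captured on the master.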

Having a script that checks this for us, instead of a one-liner, would be nice. If it can run the checks in parallel, even better.
There will be snowflakes which would make this a bit more difficult.
Examples:

  • dbstore servers are multi-source, so we would need to pass an option saying which shard to check in the show slave status output
  • dbstore delayed replicas would need to be ignored.
  • db1069 (for as long as it remains alive) has several instances, so once we pass the shard we want to check, the script needs to connect to the specific port for that shard
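A rough sketch of how a parallel check could handle the snowflakes listed above. Everything here is hypothetical: the Replica fields, check_replicas, and the injected fetch_position callable are illustrative names, not the interface of the script that was eventually written. Positions are assumed to be already reduced to (binlog file index, byte offset) tuples, and the actual query against each host (including selecting the right replication connection on multi-source hosts) is left to fetch_position.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Position = Tuple[int, int]  # (binlog file index, byte offset)


@dataclass(frozen=True)
class Replica:
    host: str
    port: int = 3306                       # multi-instance hosts use one port per shard
    connection_name: Optional[str] = None  # multi-source hosts: which shard to inspect
    delayed: bool = False                  # delayed replicas are never expected to catch up


def check_replicas(
    master_pos: Position,
    replicas: List[Replica],
    fetch_position: Callable[[Replica], Position],
    max_workers: int = 8,
) -> Dict[str, bool]:
    """Query all non-delayed replicas in parallel and compare against master_pos."""
    targets = [r for r in replicas if not r.delayed]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with targets.
        positions = list(pool.map(fetch_position, targets))
    return {
        f"{r.host}:{r.port}": pos >= master_pos
        for r, pos in zip(targets, positions)
    }
```

Injecting fetch_position keeps the comparison logic independent of the transport, so the same function would work whether positions are fetched over a MySQL connection or over ssh.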

Event Timeline

Technically, volans already implemented a way: https://gerrit.wikimedia.org/r/343270. We only need to steal it and focus on the general master switchover, and optionally use mysql as a transport rather than clusterssh.

In the medium term I have a bunch of things in mind that should help in this direction. Feel free to ping me to talk about it.

Marostegui assigned this task to jcrespo.

This is now done with the first version of the script described at T199224. The tracking task for future improvements is T200306.

To add more context, this is implemented in the WMFReplication.py library, with the replica.is_caught_up_to_master(master) method. I will document everything on wikitech soon as part of those improvement tickets.