Implement (or refactor) a script to move slaves when the master is not available
Open, Medium, Public

Description

Right now we use repl.pl to move slaves around:

e.g. when a master failover is needed, we use it to move all the slaves under the new master.

However, this script doesn't work when the master is unavailable.

It would be a good start to either refactor repl.pl or create a new script that could move slaves under a different host when the master is unavailable.

e.g. the master has crashed and we have to move all the slaves to replicate from the candidate master during an emergency.

Event Timeline

Marostegui triaged this task as Medium priority. Jun 4 2018, 1:12 PM
Marostegui created this task.
Marostegui moved this task from Triage to Backlog on the DBA board.
Vvjjkkii renamed this task from Implement (or refactor) a script to move slaves when the master is not available to pobaaaaaaa. Jul 1 2018, 1:05 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Marostegui renamed this task from pobaaaaaaa to Implement (or refactor) a script to move slaves when the master is not available. Jul 2 2018, 5:15 AM
Marostegui lowered the priority of this task from High to Medium.
Marostegui updated the task description.

@jcrespo - I have been thinking about this ticket lately.
Given that switchover.py already works so well, do you think it would be doable to add an --emergency-slave-switch $new_master option (or whatever name) to move the slaves under a given host without checking the master?
This would allow us to do emergency failovers if a master isn't reachable. Obviously this needs to be executed carefully, but during an emergency it can simplify the process of having to run the change master to the preferred host on every slave.
A human should still check:

  1. Which host is the most advanced in terms of replication, so that one gets promoted (in case the hosts didn't all stop at the same position).
  2. That the preferred host is running with binlog format STATEMENT.
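
For the first check, a minimal sketch of how the most advanced host could be picked, assuming each replica's executed coordinates (`Relay_Master_Log_File` and `Exec_Master_Log_Pos` from `SHOW SLAVE STATUS`) have already been collected by hand or by another tool. The host names, binlog file names and helper functions below are hypothetical and are not part of switchover.py:

```python
from typing import Dict, Tuple

# (Relay_Master_Log_File, Exec_Master_Log_Pos) as reported by SHOW SLAVE STATUS
Coordinates = Tuple[str, int]


def coordinates_key(coords: Coordinates) -> Tuple[int, int]:
    """Turn ('master-bin.000123', 4567) into (123, 4567) so coordinates sort numerically."""
    log_file, log_pos = coords
    # Assumes the usual '<basename>.<sequence number>' binlog file naming.
    sequence = int(log_file.rsplit(".", 1)[1])
    return (sequence, log_pos)


def most_advanced(replicas: Dict[str, Coordinates]) -> str:
    """Return the replica that executed the furthest master binlog position."""
    if not replicas:
        raise ValueError("no replicas given")
    return max(replicas, key=lambda host: coordinates_key(replicas[host]))


# Hypothetical hosts and positions, for illustration only.
positions = {
    "replica-a": ("master-bin.000123", 900),
    "replica-b": ("master-bin.000124", 120),
    "replica-c": ("master-bin.000124", 80),
}
candidate = most_advanced(positions)  # "replica-b"
```

The file sequence number is compared as an integer rather than as a string, so the ordering does not depend on the file names being zero-padded to the same width. A human would still have to confirm the result before promoting anything.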

Sadly switchover.py wouldn't be reusable or helpful for an emergency (the replication and other libraries may be); it has to start from zero. switchover.py assumes all hosts are reachable and have very low lag, replication is working, etc., which won't be the case on a failover. A failover is a much harder case where every possibility of breakage has to be considered separately and some safe compromises have to be made (e.g. what to do if we detect that X amount of data has been lost).

Ah, I see!
Yeah, I was thinking about a very primitive way to do it (for now), which would require human intervention to decide which host is the most suitable to be the new master, and then have the script execute the batch of change master statements pointing at the new master host.
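
A rough sketch of that primitive approach, assuming a human has already chosen the new master and determined the binlog coordinates the slaves should start from (which, as a simplification, requires all slaves to have stopped at the same position). It only builds the statements so they can be reviewed before anything is run; the function names, host names and coordinates are hypothetical:

```python
import re
from typing import Dict, List

# Conservative patterns so nothing unexpected ends up inside the SQL text.
HOSTNAME_RE = re.compile(r"^[A-Za-z0-9.-]+$")
BINLOG_RE = re.compile(r"^[A-Za-z0-9._-]+$")


def change_master_statements(new_master: str, log_file: str, log_pos: int,
                             port: int = 3306) -> List[str]:
    """Build the statements to repoint one slave; nothing is executed here."""
    if not HOSTNAME_RE.match(new_master) or not BINLOG_RE.match(log_file):
        raise ValueError("unexpected characters in host or binlog file name")
    return [
        "STOP SLAVE",
        "CHANGE MASTER TO MASTER_HOST='{}', MASTER_PORT={:d}, "
        "MASTER_LOG_FILE='{}', MASTER_LOG_POS={:d}".format(
            new_master, port, log_file, log_pos),
        "START SLAVE",
    ]


def build_plan(slaves: List[str], new_master: str, log_file: str,
               log_pos: int) -> Dict[str, List[str]]:
    """Map every slave except the new master to the statements it would run."""
    return {
        slave: change_master_statements(new_master, log_file, log_pos)
        for slave in slaves
        if slave != new_master
    }


# Hypothetical hosts and coordinates, for illustration only.
plan = build_plan(["db-a", "db-b", "db-c"], "db-b", "db-b-bin.000042", 1337)
for slave, statements in plan.items():
    print(slave, "->", "; ".join(statements))
```

Printing the plan instead of running it keeps the human in the loop, which matches the idea of a script that only saves typing during an emergency.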

Yeah, I understood that you meant that (not a fully automated and autonomous script), but even that is not easy and still not reusable, as it would have to work without using the master, and that requires arbitrary master changes that neither GTID nor WMFReplication.move() allow yet. We would need to implement binlog position matching first, and a way to detect the replicas of a master that is down (tendril replacement "zarcillo" database?). All doable, but not immediate or reusable from existing code.

and a way to detect replicas from a master down (tendril replacement "zarcillo" database?).

Good point: with the master down there is no canonical place to find out which hosts are hanging from it, apart from tendril/zarcillo.

With the great work done by @Ladsgroup at T281249: Create or modify an existing tool that quickly shows the db replication status in case of master failure, I think we are a step closer to getting this done.
Once we have that script, we could implement another one based on it (rather than refactoring db-switchover) which, once passed the right candidate master, would simply configure replication on all the other replicas.

As a safety measure, the script should refuse hosts that match any of the following:

  • Multi-instance
  • Other slaves hanging
  • binlog format not STATEMENT
  • Not in the active DC
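
These checks could be expressed roughly as follows. This is only a sketch: the `HostFacts` fields are hypothetical and assumed to be gathered beforehand by some other means, and the host and data center names are placeholders:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostFacts:
    """Facts about one host, assumed to be collected before the script runs."""
    name: str
    multi_instance: bool
    replica_count: int      # number of slaves hanging from this host
    binlog_format: str      # e.g. "STATEMENT", "ROW", "MIXED"
    datacenter: str


def rejection_reasons(host: HostFacts, active_dc: str) -> List[str]:
    """Return every reason to refuse this host; an empty list means it passes."""
    reasons = []
    if host.multi_instance:
        reasons.append("multi-instance")
    if host.replica_count > 0:
        reasons.append("other slaves hanging")
    if host.binlog_format.upper() != "STATEMENT":
        reasons.append("binlog format not STATEMENT")
    if host.datacenter != active_dc:
        reasons.append("not in the active DC")
    return reasons


# Hypothetical hosts, for illustration only.
good = HostFacts("db-a", False, 0, "STATEMENT", "dc1")
bad = HostFacts("db-b", True, 2, "ROW", "dc2")

print(rejection_reasons(good, active_dc="dc1"))  # []
print(rejection_reasons(bad, active_dc="dc1"))
```

Returning all the reasons at once, rather than stopping at the first one, lets whoever is running the script see everything that is wrong with a host in a single pass.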

@Ladsgroup would you be ok working on this task?

Definitely. I can start next week.