To make the switchover as smooth and quick as possible.
Description
Details
Status | Assigned | Task
---|---|---
Resolved | jcrespo | T133337 Automate database datacenter switchover steps
Resolved | Volans | T133338 Create salt database groups
Resolved | jcrespo | T133339 Improve lag detection mechanism's reliability and agility
Resolved | jcrespo | T134480 Improve replication lag detection for multi-dc environment
Resolved | Volans | T134481 Refactor $master in mariadb::core Puppet class
Event Timeline
Shouldn't this be true for all of ops? Maybe converting it into a tracking ticket for all ops-related tasks?
Change 286303 had a related patch set uploaded (by Volans):
MariaDB: Set additional salt grains for core DBs
Additional grains added, documentation for them added here:
https://wikitech.wikimedia.org/wiki/MariaDB#Salt
Updated switchover documentation here:
https://wikitech.wikimedia.org/wiki/Switch_Datacenter
Change 287088 had a related patch set uploaded (by Volans):
MariaDB: set mysql_role to standalone for es1
Parsercaches are now full R/W on both datacenters at the same time, so I will delete the steps related to that from the documentation.
@Krinkle @Volans Now that parsercaches are always RW, I have moved the warmup earlier, but I would love someone else to check that it is right (maybe I am missing some blockers): https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Per-service_switchover_instructions
The next step is to eliminate the pt-heartbeat failover. I think we have a solution, but let me present it in person to the others to see if you agree.
Change 289177 had a related patch set uploaded (by Jcrespo):
Add interval parameter, and change the default to 1 beat per second
Change 289178 had a related patch set uploaded (by Jcrespo):
Enable heartbeat on all masters, even on the pasive datacenter
Change 289177 merged by Jcrespo:
Add interval parameter, and change the default to 1 beat per second
Mentioned in SAL [2016-05-18T13:55:28Z] <jynus> disabling puppet on all database masters to test replication monitoring change T133337
Change 289178 merged by Jcrespo:
Enable heartbeat on all masters, even on the pasive datacenter
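As a rough illustration of why a beat on every master helps: if each master writes a timestamp once per second, any host can estimate its lag against a given master by comparing the newest beat it has received from that master with the current time. The sketch below is not the code from these changes; the `heartbeats` mapping, the server ids, and the function name are hypothetical, and the beats are plain in-memory values rather than rows read from a database.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(heartbeats, master_id, now):
    """Return the seconds between `now` and the last beat from `master_id`.

    `heartbeats` is a hypothetical mapping of master server id to the
    timestamp of the newest heartbeat that has replicated to this host.
    Returns None when no beat from that master has arrived yet.
    """
    last_beat = heartbeats.get(master_id)
    if last_beat is None:
        return None
    # Clamp at zero so small clock differences never report negative lag.
    return max(0.0, (now - last_beat).total_seconds())


# Made-up example values: one beat per master, one master per datacenter.
now = datetime(2016, 5, 18, 14, 0, 0, tzinfo=timezone.utc)
beats = {
    101: now - timedelta(seconds=1.5),  # hypothetical active-datacenter master
    201: now - timedelta(seconds=0.4),  # hypothetical passive-datacenter master
}

lag_active = replication_lag(beats, 101, now)
lag_passive = replication_lag(beats, 201, now)
lag_unknown = replication_lag(beats, 999, now)
```

Because both masters keep beating, the caller only has to choose which master id to measure against; nothing has to be started or stopped on the masters at switchover time.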
Change 289442 had a related patch set uploaded (by Jcrespo):
Remove unneeded $heartbeat_enable variables
Solving the SPOF is desirable, but I do not think it is a hard blocker for this. The only things pending are to automate the READ ONLY steps and to document everything (which I am doing now).
@faidon All steps related to the database except one (Setting old masters in read write; setting new ones in read-only) have been eliminated: https://wikitech.wikimedia.org/wiki/Switch_Datacenter No more dependency on puppet or "silencing alerts".
It is my intention to create a script that does that and checks that replication is up to date. However, aside from that (which may not be needed at all if we plan to set up an active-active datacenter, or if we just leave it in RW all the time, as we already do with parsercache), that is almost as "automated" as the currently documented salt one-liner, which, thanks to @Volans, already uses the right salt groups.
The next step toward making it 100% automated would be to make it depend entirely on orchestration, but that is outside the bounds of pure database work. Aside from the previously mentioned script, should we close this and open a different ticket for confd-database integration (as we already planned to do with @Joe), or what should the scope be for closing this ticket?
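A minimal sketch of what such a script could look like, under stated assumptions: `FakeMaster` and `switch_writes` are hypothetical names, the hosts are in-memory objects rather than real connections, and the `lag` value stands in for whatever the replication check reports. On a real MariaDB host the flag would be changed with `SET GLOBAL read_only = 1` or `= 0`.

```python
class FakeMaster:
    """Hypothetical stand-in for a database master; not a real client."""

    def __init__(self, name, read_only, lag):
        self.name = name
        self.read_only = read_only
        self.lag = lag  # seconds behind, as a lag check would report it


def switch_writes(to_read_only, to_read_write, max_lag=0.0):
    """Move writes from one group of masters to another.

    Raises RuntimeError, leaving every host read-only, if any host that
    is about to accept writes is still behind by more than `max_lag`.
    """
    # 1. Stop writes first so both groups are never writable at once.
    for host in to_read_only:
        host.read_only = True

    # 2. Refuse to continue while a future writable host is still behind.
    behind = [h.name for h in to_read_write if h.lag > max_lag]
    if behind:
        raise RuntimeError("replication not caught up on: " + ", ".join(behind))

    # 3. Only then open the other group for writes.
    for host in to_read_write:
        host.read_only = False


# Made-up hosts for illustration.
stop = [FakeMaster("db-a", read_only=False, lag=0.0)]
start = [FakeMaster("db-b", read_only=True, lag=0.0)]
switch_writes(stop, start)
```

Checking lag between the two steps is the part the salt one-liner does not cover: if the check fails, everything stays read-only and the operator can retry once replication has caught up.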