To make the switchover as smooth and quick as possible.
|Resolved||jcrespo||T133337 Automate database datacenter switchover steps|
|Resolved||Volans||T133338 Create salt database groups|
|Resolved||jcrespo||T133339 Improve lag detection mechanism's reliability and agility|
|Resolved||jcrespo||T134480 Improve replication lag detection for multi-dc environment|
|Resolved||Volans||T134481 Refactor $master in mariadb::core Puppet class|
- Mentioned In
- rOPUP6c96ad5d623a: MariaDB: Set additional salt grains for core DBs
rOPUPb4978bb6845f: MariaDB: Set additional salt grains for core DBs
rOPUPb9a3910cb104: MariaDB: Set additional salt grains for core DBs
rOPUP08973bbadc56: MariaDB: set mysql_role to standalone for es1
rOPUP4a53d6d07fa4: Enable heartbeat on all masters, even on the pasive datacenter
rOPUPca0fad1bb863: Enable heartbeat on all masters, even on the pasive datacenter
rOPUP7bd372c8c8a5: Remove unneeded $heartbeat_enable variables
rOPUP1e6e8596255c: Remove unneeded $heartbeat_enabled variables
T134480: Improve replication lag detection for multi-dc environment
rOPUP203579b1a172: Enable heartbeat on all masters, even on the pasive datacenter
T133523: Decide how to improve parsercache replication, sharding and HA
T133338: Create salt database groups
- Mentioned Here
- T111266: Make LoadBalancer slave lag check and read-only mode more robust (for example, using pt-heartbeat)
T24923: MediaWiki should assume read-only mode if MySQL says DB is read-only
T134481: Refactor $master in mariadb::core Puppet class
Additional grains added, documentation for them added here:
Updated switchover documentation here:
@Krinkle @Volans Now that parsercaches are always RW, I have moved the warmup earlier, but I would love for someone else to check that it is right (maybe I am missing some blockers): https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Per-service_switchover_instructions
@faidon All steps related to the database except one (setting the old masters to read-only and the new ones to read-write) have been eliminated: https://wikitech.wikimedia.org/wiki/Switch_Datacenter No more dependency on puppet or "silencing alerts".
It is my intention to create a script that does that and checks that replication is up to date. However, aside from that (which may not be needed at all if we plan to set up an active-active datacenter, or if we just leave it in RW all the time as we already do with parsercache), that is almost as "automated" as the currently documented salt one-liner, which thanks to @Volans already uses the right salt groups.
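A rough sketch of what such a script could look like, only to make the ordering concrete. It assumes the usual direction (old masters go read-only first, new masters become writable only once they have caught up). Everything here is hypothetical: `switch_masters`, `set_read_only` and `get_lag` are placeholder names, and the two callables would have to be backed by whatever actually talks to the servers (for example the heartbeat-based lag check). This is not the real script.

```python
import time
from typing import Callable, Sequence


def switch_masters(
    old_masters: Sequence[str],
    new_masters: Sequence[str],
    set_read_only: Callable[[str, bool], None],  # hypothetical: toggles read_only on a host
    get_lag: Callable[[str], float],             # hypothetical: replication lag in seconds
    max_wait: float = 10.0,
    poll: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Flip read-only state between two sets of masters, in a safe order."""
    # 1. Stop writes on the old masters so no new changes are generated.
    for host in old_masters:
        set_read_only(host, True)

    # 2. Wait until every new master has replicated everything.
    waited = 0.0
    for host in new_masters:
        while get_lag(host) > 0:
            if waited >= max_wait:
                # Leave the new masters read-only; a human has to look at this.
                raise TimeoutError(f"{host} is still lagging after {waited:.1f}s")
            sleep(poll)
            waited += poll

    # 3. Only now allow writes on the new masters.
    for host in new_masters:
        set_read_only(host, False)
```

If the lag never reaches zero within `max_wait`, both sides stay read-only, which is the conservative failure mode: no writes are lost, but the switchover needs manual attention.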
The next step toward making it 100% automated would be to do it 100% depending on orchestration, but that is outside the bounds of pure database work. Aside from the previously mentioned script, should we close this and open a different ticket for confd-database integration (as we already planned to do with @Joe), or what should be the scope for closing this ticket?