Automate database datacenter switchover steps
Closed, Resolved, Public

Description

To make the switchover as smooth and quick as possible.

Event Timeline

Volans created this task. Apr 21 2016, 8:39 PM
Restricted Application added a subscriber: Aklapper. Apr 21 2016, 8:39 PM

Shouldn't this be true for all of ops? Maybe convert it into a tracking ticket for all ops-related tasks?

jcrespo triaged this task as Normal priority. Apr 22 2016, 1:32 PM
jcrespo moved this task from Triage to Backlog on the DBA board.

Change 286303 had a related patch set uploaded (by Volans):
MariaDB: Set additional salt grains for core DBs

https://gerrit.wikimedia.org/r/286303

faidon renamed this task from Automate datacenter switchover steps to Automate database datacenter switchover steps. May 4 2016, 3:22 PM
faidon added a project: codfw-rollout.

Change 286303 merged by Volans:
MariaDB: Set additional salt grains for core DBs

https://gerrit.wikimedia.org/r/286303

Additional grains added, documentation for them added here:
https://wikitech.wikimedia.org/wiki/MariaDB#Salt

Updated switchover documentation here:
https://wikitech.wikimedia.org/wiki/Switch_Datacenter
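
As an illustration of how the new grains could be used for targeting, here is a minimal sketch that builds a Salt compound matcher (`salt -C 'G@key:value and ...'`). Only the `mysql_role` grain name comes from the patches on this task; the value `master`, the `datacenter` grain and the remote command are assumptions made for the example, so check the MariaDB#Salt documentation for the real names.

```python
import shlex


def compound_target(grains):
    """Build a Salt compound matcher such as 'G@a:b and G@c:d' from a dict."""
    if not grains:
        raise ValueError("at least one grain is required")
    return " and ".join(f"G@{key}:{value}" for key, value in grains.items())


def salt_command(grains, remote_command):
    """Return the argv list for: salt -C <target> cmd.run <remote_command>."""
    return ["salt", "-C", compound_target(grains), "cmd.run", remote_command]


# Hypothetical grain values, used only to show the shape of the command.
argv = salt_command(
    {"mysql_role": "master", "datacenter": "codfw"},
    "mysql -e 'SELECT @@read_only'",
)
print(shlex.join(argv))  # shell-quoted form, ready to paste into a terminal
```

Building the command as an argument list and quoting it only for display avoids quoting mistakes when the remote command itself contains quotes.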

Change 287088 had a related patch set uploaded (by Volans):
MariaDB: set mysql_role to standalone for es1

https://gerrit.wikimedia.org/r/287088

Change 287088 merged by Volans:
MariaDB: set mysql_role to standalone for es1

https://gerrit.wikimedia.org/r/287088

Documentation updated given that T134481 is now resolved.

Parsercaches are now full R/W on both datacenters at the same time, so I will delete the steps related to that from the documentation.

@Krinkle @Volans Now that parsercaches are always RW, I have moved the warmup earlier, but I would love someone else to check that it is right (maybe I am missing some blockers): https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Per-service_switchover_instructions

The next step is to eliminate the pt-heartbeat failover. I think we have a solution, but let me present it in person to the others to see whether you agree.

jcrespo claimed this task. May 16 2016, 5:46 PM
jcrespo moved this task from Backlog to In progress on the DBA board.

Change 289177 had a related patch set uploaded (by Jcrespo):
Add interval parameter, and change the default to 1 beat per second

https://gerrit.wikimedia.org/r/289177

Change 289178 had a related patch set uploaded (by Jcrespo):
Enable heartbeat on all masters, even on the pasive datacenter

https://gerrit.wikimedia.org/r/289178

Change 289177 merged by Jcrespo:
Add interval parameter, and change the default to 1 beat per second

https://gerrit.wikimedia.org/r/289177

Mentioned in SAL [2016-05-18T13:55:28Z] <jynus> disabling puppet on all database masters to test replication monitoring change T133337

Change 289178 merged by Jcrespo:
Enable heartbeat on all masters, even on the pasive datacenter

https://gerrit.wikimedia.org/r/289178
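
With heartbeat running on every master, a replica can see one heartbeat timestamp per master, and lag can be measured against whichever master is currently active without failing pt-heartbeat over. The sketch below shows that calculation only. It assumes the heartbeat rows have already been read into a dict keyed by server id, and the server ids and function name are hypothetical, not taken from the patches.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(heartbeats, master_server_id, now):
    """Return how far behind the given master this replica is.

    heartbeats maps a master's server id to the last heartbeat timestamp
    replicated from that master. Small negative values caused by clock
    skew are clamped to zero.
    """
    if master_server_id not in heartbeats:
        raise KeyError(f"no heartbeat seen from server {master_server_id}")
    lag = now - heartbeats[master_server_id]
    return max(lag, timedelta(0))


# Hypothetical example: two masters, one per datacenter, each beating once per second.
now = datetime(2016, 5, 18, 13, 55, 30, tzinfo=timezone.utc)
heartbeats = {
    1001: now - timedelta(seconds=1.5),  # hypothetical id of the active master
    2001: now - timedelta(seconds=0.2),  # hypothetical id of the passive master
}
print(replication_lag(heartbeats, 1001, now).total_seconds())
```

Switching datacenters then only changes which server id the monitoring asks about; the heartbeat processes themselves stay untouched.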

Change 289442 had a related patch set uploaded (by Jcrespo):
Remove unneeded $heartbeat_enable variables

https://gerrit.wikimedia.org/r/289442

Change 289442 merged by Jcrespo:
Remove unneeded $heartbeat_enabled variables

https://gerrit.wikimedia.org/r/289442

Solving the SPOF is desirable, but I do not think it is a hard blocker for this. The only thing pending is to automate the READ ONLY steps and document everything (which I am doing now).

jcrespo added a subscriber: Joe. Edited May 19 2016, 8:12 AM

@faidon All steps related to the database except one (setting the old masters read-only and the new ones read-write) have been eliminated: https://wikitech.wikimedia.org/wiki/Switch_Datacenter No more dependency on puppet or "silencing alerts".

It is my intention to create a script that does that and checks that replication is up to date. However, aside from that (which may not be needed at all if we plan to set up an active-active datacenter, or if we just leave it in RW all the time as we already do with parsercache), that is almost as "automated" as the current documented salt one-liner, which thanks to @Volans already uses the right salt groups.

The next step toward making it 100% automated would be to make it depend entirely on orchestration, but that is outside the bounds of pure database work. Aside from the previously mentioned script, should we close this and open a different ticket for confd-database integration (as we already planned to do with @Joe), or what should be the scope for closing this ticket?
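
A minimal sketch of what the script mentioned above might look like, assuming the usual order for a switchover: make the old master read-only first, verify that the new master has applied everything, and only then open the new master for writes. The function names and the injected `query(sql)` callables are hypothetical, so the logic can be exercised without a real server; in practice they would wrap a database connection. The SQL statements and `SHOW SLAVE STATUS` fields are standard MySQL/MariaDB ones.

```python
class ReplicationNotCaughtUp(RuntimeError):
    """Raised when the new master has not applied all changes from the old one."""


def is_caught_up(slave_status_rows):
    """Return True if both replication threads run and zero lag is reported."""
    if not slave_status_rows:
        return False  # the server is not configured as a replica
    row = slave_status_rows[0]
    return (
        row.get("Slave_IO_Running") == "Yes"
        and row.get("Slave_SQL_Running") == "Yes"
        and row.get("Seconds_Behind_Master") == 0  # None (NULL) means broken
    )


def switch_masters(old_master, new_master):
    """Move writes from old_master to new_master.

    Both arguments are callables taking an SQL string and returning a list
    of dict rows. On failure both servers are left read-only, which is the
    safe state.
    """
    old_master("SET GLOBAL read_only = 1")
    if not is_caught_up(new_master("SHOW SLAVE STATUS")):
        raise ReplicationNotCaughtUp("new master is behind; both left read-only")
    new_master("SET GLOBAL read_only = 0")


def fake_server(status_rows, log):
    """Stand-in for a real connection: records statements, returns canned rows."""
    def query(sql):
        log.append(sql)
        return status_rows if sql == "SHOW SLAVE STATUS" else []
    return query


old_log, new_log = [], []
healthy = [{"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
            "Seconds_Behind_Master": 0}]
switch_masters(fake_server([], old_log), fake_server(healthy, new_log))
print(old_log, new_log)
```

A real script would probably poll until replication catches up instead of failing on the first check, and `Seconds_Behind_Master` could be replaced by a heartbeat-based measurement.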

Krinkle removed a subscriber: Krinkle. May 20 2016, 6:46 PM
jcrespo closed this task as Resolved. Edited Jul 14 2016, 2:28 PM

The pending steps will be tracked on T24923 and T111266.