⚓ T133337 Automate database datacenter switchover steps

Subject	Repo	Branch	Lines +/-
Remove unneeded $heartbeat_enabled variables	operations/puppet	production	+2 -4
Enable heartbeat on all masters, even on the pasive datacenter	operations/puppet	production	+3 -6
Add interval parameter, and change the default to 1 beat per second	operations/puppet/mariadb	master	+5 -3
MariaDB: set mysql_role to standalone for es1	operations/puppet	production	+8 -1
MariaDB: Set additional salt grains for core DBs	operations/puppet	production	+15 -1

Status	Assigned	Task
Resolved	jcrespo	T133337 Automate database datacenter switchover steps
Resolved	Volans	T133338 Create salt database groups
Resolved	jcrespo	T133339 Improve lag detection mechanism's reliability and agility
Resolved	jcrespo	T134480 Improve replication lag detection for multi-dc environment
Resolved	Volans	T134481 Refactor $master in mariadb::core Puppet class

Volans created this task.Apr 21 2016, 8:39 PM

Restricted Application added a subscriber: Aklapper.Apr 21 2016, 8:39 PM

Volans created subtask T133338: Create salt database groups.Apr 21 2016, 8:42 PM

Volans created subtask T133339: Improve lag detection mechanism's reliability and agility.Apr 21 2016, 8:47 PM

Volans added a subscriber: faidon.

Shouldn't this be true for all of ops? Maybe convert it into a tracking ticket for all ops-related tasks?

jcrespo triaged this task as Medium priority.Apr 22 2016, 1:32 PM

jcrespo added a subtask: T119626: Eliminate SPOF at the main database infrastructure.

jcrespo moved this task from Triage to Backlog on the DBA board.

Change 286303 had a related patch set uploaded (by Volans):
MariaDB: Set additional salt grains for core DBs

https://gerrit.wikimedia.org/r/286303

gerritbot added a project: Patch-For-Review.May 3 2016, 10:12 AM

faidon renamed this task from Automate datacenter switchover steps to Automate database datacenter switchover steps.May 4 2016, 3:22 PM

faidon added a project: codfw-rollout.

Change 286303 merged by Volans:
MariaDB: Set additional salt grains for core DBs

https://gerrit.wikimedia.org/r/286303

Volans created subtask T134480: Improve replication lag detection for multi-dc environment.May 5 2016, 12:17 PM

Volans created subtask T134481: Refactor $master in mariadb::core Puppet class.May 5 2016, 12:27 PM

Additional grains added, documentation for them added here:
https://wikitech.wikimedia.org/wiki/MariaDB#Salt

Updated switchover documentation here:
https://wikitech.wikimedia.org/wiki/Switch_Datacenter

Change 287088 had a related patch set uploaded (by Volans):
MariaDB: set mysql_role to standalone for es1

https://gerrit.wikimedia.org/r/287088

Change 287088 merged by Volans:
MariaDB: set mysql_role to standalone for es1

https://gerrit.wikimedia.org/r/287088

Volans mentioned this in T133338: Create salt database groups.May 10 2016, 11:04 AM

Volans closed subtask T133338: Create salt database groups as Resolved.

Volans closed subtask T134481: Refactor $master in mariadb::core Puppet class as Resolved.May 11 2016, 1:23 PM

Documentation updated given that T134481 is now resolved.

jcrespo mentioned this in T133523: Decide how to improve parsercache replication, sharding and HA.May 12 2016, 11:33 AM

Parsercaches are now fully R/W in both datacenters at the same time, so I will delete the related steps from the documentation.

@Krinkle @Volans Now that parsercaches are always RW, I have moved the warmup earlier, but I would love someone else to check that it is right (maybe I am missing some blockers): https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Per-service_switchover_instructions

Great work, @jcrespo!

The next step is to eliminate the pt-heartbeat failover. I think we have a solution, but let me present it to the others in person to see if you agree.

jcrespo claimed this task.May 16 2016, 5:46 PM

jcrespo moved this task from Backlog to In progress on the DBA board.

Change 289177 had a related patch set uploaded (by Jcrespo):
Add interval parameter, and change the default to 1 beat per second

https://gerrit.wikimedia.org/r/289177

Change 289178 had a related patch set uploaded (by Jcrespo):
Enable heartbeat on all masters, even on the pasive datacenter

https://gerrit.wikimedia.org/r/289178

Change 289177 merged by Jcrespo:
Add interval parameter, and change the default to 1 beat per second

https://gerrit.wikimedia.org/r/289177

Mentioned in SAL [2016-05-18T13:55:28Z] <jynus> disabling puppet on all database masters to test replication monitoring change T133337

Change 289178 merged by Jcrespo:
Enable heartbeat on all masters, even on the pasive datacenter

https://gerrit.wikimedia.org/r/289178

jcrespo mentioned this in rOPUP203579b1a172: Enable heartbeat on all masters, even on the pasive datacenter.May 18 2016, 1:59 PM

Change 289442 had a related patch set uploaded (by Jcrespo):
Remove unneeded $heartbeat_enable variables

https://gerrit.wikimedia.org/r/289442

jcrespo mentioned this in T134480: Improve replication lag detection for multi-dc environment.May 18 2016, 4:18 PM

jcrespo closed subtask T134480: Improve replication lag detection for multi-dc environment as Resolved.

Change 289442 merged by Jcrespo:
Remove unneeded $heartbeat_enabled variables

https://gerrit.wikimedia.org/r/289442

jcrespo mentioned this in rOPUP1e6e8596255c: Remove unneeded $heartbeat_enabled variables.May 18 2016, 4:27 PM

jcrespo closed subtask T133339: Improve lag detection mechanism's reliability and agility as Resolved.May 18 2016, 4:53 PM

Solving the SPOF is desirable, but I do not think it is a hard blocker for this. The only things pending are to automate the READ ONLY steps and to document everything (which I am doing now).

@faidon All steps related to the database except one (setting the old masters to read-only and the new ones to read-write) have been eliminated: https://wikitech.wikimedia.org/wiki/Switch_Datacenter No more dependency on puppet or "silencing alerts".

It is my intention to create a script that does that and checks that replication is up to date. However, aside from that (which may not be needed at all if we plan to set up an active-active datacenter, or if we just leave it in RW all the time as we already do with parsercache), that is almost as "automated" as the currently documented salt one-liner, which thanks to @Volans already uses the right salt groups.
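
A minimal sketch of what such a script could look like, not the script that was actually written. It assumes the usual ordering for a switchover: the old masters are made read-only first, and the new masters are opened for writes only once replication has caught up. The `Master` record, the `switch_masters` function and the `lag_seconds` callback are all hypothetical names introduced for illustration. A real script would set MariaDB's `read_only` variable on each host and would most likely read the lag from the heartbeat data rather than from an in-memory callback.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Master:
    """Hypothetical record for one master; not taken from the ticket."""
    host: str
    read_only: bool


def switch_masters(
    old: List[Master],
    new: List[Master],
    lag_seconds: Callable[[str], float],
    max_lag: float = 0.0,
) -> List[str]:
    """Flip the read-only flags in a safe order and return a log of the steps.

    1. Set every old master read-only so that no new writes arrive.
    2. Refuse to continue while any new master is still lagging.
    3. Only then open the new masters for writes.
    """
    log: List[str] = []
    for master in old:
        master.read_only = True
        log.append(f"{master.host}: read_only=ON")

    # lag_seconds is an assumed callback returning the replication lag of a host.
    lagging = [m.host for m in new if lag_seconds(m.host) > max_lag]
    if lagging:
        raise RuntimeError(f"replication not caught up on: {', '.join(lagging)}")

    for master in new:
        master.read_only = False
        log.append(f"{master.host}: read_only=OFF")
    return log


# Example run with placeholder host names and a fake lag source.
old_masters = [Master("old-master.example", read_only=False)]
new_masters = [Master("new-master.example", read_only=True)]
for step in switch_masters(old_masters, new_masters, lag_seconds=lambda host: 0.0):
    print(step)
```

If any new master reports lag above `max_lag`, the function raises before opening it for writes, so the old masters stay read-only and nothing accepts writes until an operator intervenes. That is a deliberate choice in this sketch: a short write outage is preferred over two datacenters accepting writes at once.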

The next step towards making it 100% automated would be to make it depend entirely on orchestration, but that is outside the bounds of pure database work. Aside from the previously mentioned script, should we close this and open a different ticket for confd-database integration (as we already planned to do with @Joe), or what should be the scope for closing this ticket?

Krinkle unsubscribed.May 20 2016, 6:46 PM

jcrespo mentioned this in rOPUP7bd372c8c8a5: Remove unneeded $heartbeat_enable variables.Jun 17 2016, 6:08 PM

jcrespo mentioned this in rOPUPca0fad1bb863: Enable heartbeat on all masters, even on the pasive datacenter.

jcrespo mentioned this in rOPUP4a53d6d07fa4: Enable heartbeat on all masters, even on the pasive datacenter.

Volans mentioned this in rOPUP08973bbadc56: MariaDB: set mysql_role to standalone for es1.Jun 17 2016, 6:10 PM

Volans mentioned this in rOPUPb9a3910cb104: MariaDB: Set additional salt grains for core DBs.

Volans mentioned this in rOPUPb4978bb6845f: MariaDB: Set additional salt grains for core DBs.

Volans mentioned this in rOPUP6c96ad5d623a: MariaDB: Set additional salt grains for core DBs.