
m1-master switch from db1001 to db1016
Closed, Resolved · Public

Description

db1001 needs maintenance, or even decommissioning like db100[2-7]. To do that we need to perform an m1-master rotation to db1016.

In the past this was a painful prospect due to the sheer number of services connecting directly to db1001, but the m1-master CNAME and dbproxy1001 have since been added and services migrated to use them, so it should now be easier.
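
As a quick sanity check before and after the switch, one can confirm where the name points and which backend actually answers; this is only a sketch, and the m1-master.eqiad.wmnet name is assumed from the usual naming scheme rather than stated in this task:

  # Resolve the m1-master CNAME; before the switch it should point at the
  # proxy / old master, afterwards at the new one.
  dig +short m1-master.eqiad.wmnet CNAME

  # Ask the server reached through that name which host it really is.
  mysql -h m1-master.eqiad.wmnet -e "SELECT @@hostname;"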

However, big stuff is affected:

  • puppet
  • bacula
  • etherpad
  • racktables
  • librenms

Needs planning or input from other Opsen.

Event Timeline

Springle raised the priority of this task to Needs Triage.
Springle updated the task description. (Show Details)
Springle added projects: acl*sre-team, DBA.
akosiaris triaged this task as Medium priority. Aug 25 2015, 12:40 PM
akosiaris subscribed.

We recently tried that for etherpad with @jcrespo. It failed the first time because db1016 did not have the same grants as db1001; on the next attempt everything went smoothly. I suppose grants are one area we need to check before everything else. Also, how about we pool db1016 in the db1001 proxy?
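
A minimal way to verify the grants beforehand, assuming Percona Toolkit (pt-show-grants) is available and the fully qualified host names are as guessed here; everything else is illustrative:

  # Dump the grants from both hosts and diff them; any account missing on
  # db1016 shows up in the diff and can be re-created before the failover.
  pt-show-grants --host db1001.eqiad.wmnet > /tmp/grants.db1001
  pt-show-grants --host db1016.eqiad.wmnet > /tmp/grants.db1016
  diff -u /tmp/grants.db1001 /tmp/grants.db1016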

Just a side note: the issues we experienced were probably caused by bypassing the proxy and doing the failover "manually", so they only affected etherpad and not the rest of the services.

Thanks for commenting, but I believe there are services here that you do not "own", and we still need feedback from other ops to prepare this. Is that right?

After the clean-up, these are the databases that are still there (aside from the system ones: mysql, information_schema and performance_schema); the query sketched after the list is one way to enumerate them:

bacula
etherpadlite
heartbeat
librenms
puppet
racktables
reviewdb
rt
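
A sketch of that enumeration query, run against the current m1 master (host name assumed, system schemas filtered out):

  mysql -h db1001.eqiad.wmnet -BN -e "
      SELECT schema_name FROM information_schema.schemata
      WHERE schema_name NOT IN
          ('mysql', 'information_schema', 'performance_schema');"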

The rest have been temporarily archived on es2001 under /srv/backup/m1 and will either be permanently deleted or archived properly on Bacula.
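
A hedged sketch of how such an archival dump can be taken; the target path comes from this task, while the database name, host FQDN and dump options are illustrative rather than a record of what was actually run:

  # Dump a decommissioned database ("olddb" is a placeholder) from the m1
  # master and ship the compressed dump to the archive location on es2001.
  mysqldump --single-transaction --routines --triggers olddb \
      | gzip > olddb.sql.gz
  scp olddb.sql.gz es2001.codfw.wmnet:/srv/backup/m1/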

@akosiaris @MoritzMuehlenhoff @fgiunchedi @demon @Krenair @Dzahn I intend to perform the failover on Wednesday 22, 16:00 UTC.

I do not really need you to do anything, and this should be a trivial process that needs no attention, but it would be great if you could check that every service listed above continues to work with no issues afterwards.

jcrespo moved this task from Triage to In progress on the DBA board.

I can check on RT ('rt') and racktables (the other rt) on Wednesday, maybe around 18:00 UTC, but I don't worry about it much since these services are only used by ops themselves.

That said, please note that "reviewdb" is Gerrit; it affects quite a few users and the RelEng team owns that one. Etherpad/Bacula/Librenms are, I believe, covered by Alex (?). I don't know about heartbeat and puppet.

I'd be happy to check the various services I've signed up for. But can we delay this by an hour or so, i.e. to 17:00 UTC, just to make sure I'll be around?

17:00 UTC works for me; if that's finalized, please grab the slot at https://wikitech.wikimedia.org/wiki/Deployments

These are the notes from the migration, to be documented on wiki:

  • bacula: sudo service bacula-director restart after the migration. I had already made sure no jobs were running with "status director". Tested afterwards with a "list media".
  • etherpadlite: it seems etherpad-lite crashed after the migration and systemd took care of restarting it. Etherpad crashes at least once a week anyway, if not more, so no big deal. Tested by opening a pad.
  • heartbeat: needs "manual migration", i.e. changing the master role in puppet.
  • librenms: required a manual kill of its connections.
  • puppet: required a manual kill of its connections (see the sketch after this list). This caused the most puppet spam. Either restart the puppetmasters or kill the connections as soon as the failover happens.
  • racktables: went fine, no problems.
  • reviewdb: not really on m1 anymore (it was migrated to m2). To delete.
  • rt: required a manual kill of its connections; restarted Apache for good measure. Tested by looking at a sample RT ticket.
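
For the wiki documentation, a minimal sketch of the "manual kill of connections" step, assuming shell access to the old master and a MySQL account with PROCESS and SUPER privileges; the database names are the affected ones from this task, everything else is illustrative:

  # On the old master, list client connections still attached to the
  # migrated databases, so clients reconnect through the proxy to the
  # new master once killed.
  mysql -e "SELECT id, user, host, db FROM information_schema.processlist
            WHERE db IN ('librenms', 'puppet', 'rt');"

  # Kill every thread found by the query above.
  for id in $(mysql -BN -e "SELECT id FROM information_schema.processlist
                            WHERE db IN ('librenms', 'puppet', 'rt');"); do
      mysql -e "KILL $id;"
  done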

Change 295562 had a related patch set uploaded (by Jcrespo):
Set temporarilly m1 haproxy to failover to itself (db1016)

https://gerrit.wikimedia.org/r/295562

Change 295562 merged by Jcrespo:
Set temporarily m1 haproxy to failover to itself (db1016)

https://gerrit.wikimedia.org/r/295562

Change 295563 had a related patch set uploaded (by Jcrespo):
Promote db1016 as the m1 shard master, set db1001 as a m1 slave

https://gerrit.wikimedia.org/r/295563

Change 295563 merged by Jcrespo:
Promote db1016 as the m1 shard master, set db1001 as a m1 slave

https://gerrit.wikimedia.org/r/295563
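
For the record, a hedged sketch of what the promotion itself typically involves at the MySQL level; the exact procedure depends on the local failover tooling, and the replication user and binlog coordinates below are placeholders, not values from this task:

  # 1. Make the old master read-only and let db1016 catch up.
  mysql -h db1001.eqiad.wmnet -e "SET GLOBAL read_only = 1;"

  # 2. Once db1016 has applied everything, stop replication on it and clear
  #    its slave configuration so it becomes a standalone, writable master.
  mysql -h db1016.eqiad.wmnet -e "STOP SLAVE; RESET SLAVE ALL;"
  mysql -h db1016.eqiad.wmnet -e "SET GLOBAL read_only = 0;"

  # 3. Point the old master at the new one so it becomes an m1 slave.
  #    Use the coordinates recorded on db1016, not these placeholders.
  mysql -h db1001.eqiad.wmnet -e "CHANGE MASTER TO
      MASTER_HOST='db1016.eqiad.wmnet',
      MASTER_USER='repl', MASTER_PASSWORD='...',
      MASTER_LOG_FILE='db1016-bin.000001', MASTER_LOG_POS=4;
      START SLAVE;"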

db1016 is the new master of m1.