
m1-master switch from db1001 to db1016
Closed, Resolved · Public

Description

db1001 needs maintenance, or even decommissioning like db100[2-7]. To do that we need to perform an m1-master rotation to db1016.

In the past this was a painful prospect due to the sheer number of services connecting directly to db1001, but the m1-master CNAME and dbproxy1001 have since been added and services migrated to use them, so it should now be easier.
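
As a quick sanity check before and after the switch, one can confirm where the name points and which backend actually answers; this is only a sketch, and the m1-master.eqiad.wmnet name is assumed from the usual naming scheme rather than stated in this task:

  # Resolve the m1-master CNAME; before the switch it should point at the
  # proxy / old master, afterwards at the new one.
  dig +short m1-master.eqiad.wmnet CNAME

  # Ask the server reached through that name which host it really is.
  mysql -h m1-master.eqiad.wmnet -e "SELECT @@hostname;"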

However, big stuff is affected:

  • puppet
  • bacula
  • etherpad
  • racktables
  • librenms

Needs planning or input from other Opsen.

Event Timeline

Springle raised the priority of this task to Needs Triage.
Springle updated the task description. (Show Details)
Springle added projects: acl*sre-team, DBA.
akosiaris triaged this task as Medium priority. Aug 25 2015, 12:40 PM
akosiaris subscribed.

We recently tried that for etherpad with @jcrespo. It failed the first time because db1016 did not have the same grants as db1001; on the next attempt everything went smoothly. I suppose grants are one area we need to check before everything else. Also, how about we pool db1016 in the db1001 proxy?
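
A minimal way to verify the grants beforehand, assuming Percona Toolkit (pt-show-grants) is available and the fully qualified host names are as guessed here; everything else is illustrative:

  # Dump the grants from both hosts and diff them; any account missing on
  # db1016 shows up in the diff and can be re-created before the failover.
  pt-show-grants --host db1001.eqiad.wmnet > /tmp/grants.db1001
  pt-show-grants --host db1016.eqiad.wmnet > /tmp/grants.db1016
  diff -u /tmp/grants.db1001 /tmp/grants.db1016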

Just a side note: the issues we experienced were probably caused by bypassing the proxy and doing the failover "manually", so they only affected etherpad and not the rest of the services.

Thanks for commenting, but I believe there are services here that you do not "own", and we still need feedback from other ops to prepare this. Is that right?

After the clean-up, these are the databases that are still there (aside from the system ones: mysql, information_schema and performance_schema); the query sketched after the list is one way to enumerate them:

bacula
etherpadlite
heartbeat
librenms
puppet
racktables
reviewdb
rt
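
A sketch of that enumeration query, run against the current m1 master (host name assumed, system schemas filtered out):

  mysql -h db1001.eqiad.wmnet -BN -e "
      SELECT schema_name FROM information_schema.schemata
      WHERE schema_name NOT IN
          ('mysql', 'information_schema', 'performance_schema');"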

The rest have been temporarily archived on es2001 under /srv/backup/m1 and will either be permanently deleted or archived properly on Bacula.
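
A hedged sketch of how such an archival dump can be taken; the target path comes from this task, while the database name, host FQDN and dump options are illustrative rather than a record of what was actually run:

  # Dump a decommissioned database ("olddb" is a placeholder) from the m1
  # master and ship the compressed dump to the archive location on es2001.
  mysqldump --single-transaction --routines --triggers olddb \
      | gzip > olddb.sql.gz
  scp olddb.sql.gz es2001.codfw.wmnet:/srv/backup/m1/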

@akosiaris @MoritzMuehlenhoff @fgiunchedi @demon @Krenair @Dzahn I intend to perform the failover on Wednesday 22, 16:00 UTC.

I do not really need you to do anything, and this should be a trivial process that needs no attention, but it would be great if you could check that every service listed above continues to work with no issues afterwards.

jcrespo moved this task from Triage to In progress on the DBA board.

I can check on RT ('rt') and racktables (the other rt) on Wednesday, maybe around 18:00 UTC, but I don't worry about it much since these services are only used by ops themselves.

That said, please note that "reviewdb" is Gerrit; it affects quite a few users and the RelEng team owns that one. Etherpad/Bacula/Librenms are, I believe, covered by Alex (?). I don't know about heartbeat and puppet.

I'd be happy to check the various services I've signed up for. But can we delay this by an hour or so, i.e. to 17:00 UTC, just to make sure I'll be around?

17:00 UTC works for me; if that's finalized, please grab the slot at https://wikitech.wikimedia.org/wiki/Deployments

These are the notes from the migration, to be documented on wiki:

  • bacula: sudo service bacula-director restart after the migration. I had already made sure no jobs were running with "status director". Tested afterwards with a "list media".
  • etherpadlite: it seems etherpad-lite crashed after the migration and systemd took care of restarting it. Etherpad crashes at least once a week anyway, if not more, so no big deal. Tested by opening a pad.
  • heartbeat: needs "manual migration", i.e. changing the master role in puppet.
  • librenms: required a manual kill of its connections.
  • puppet: required a manual kill of its connections (see the sketch after this list). This caused the most puppet spam. Either restart the puppetmasters or kill the connections as soon as the failover happens.
  • racktables: went fine, no problems.
  • reviewdb: not really on m1 anymore (it was migrated to m2). To delete.
  • rt: required a manual kill of its connections; restarted Apache for good measure. Tested by looking at a sample RT ticket.
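
For the wiki documentation, a minimal sketch of the "manual kill of connections" step, assuming shell access to the old master and a MySQL account with PROCESS and SUPER privileges; the database names are the affected ones from this task, everything else is illustrative:

  # On the old master, list client connections still attached to the
  # migrated databases, so clients reconnect through the proxy to the
  # new master once killed.
  mysql -e "SELECT id, user, host, db FROM information_schema.processlist
            WHERE db IN ('librenms', 'puppet', 'rt');"

  # Kill every thread found by the query above.
  for id in $(mysql -BN -e "SELECT id FROM information_schema.processlist
                            WHERE db IN ('librenms', 'puppet', 'rt');"); do
      mysql -e "KILL $id;"
  done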

Change 295562 had a related patch set uploaded (by Jcrespo):
Set temporarilly m1 haproxy to failover to itself (db1016)

https://gerrit.wikimedia.org/r/295562

Change 295562 merged by Jcrespo:
Set temporarily m1 haproxy to failover to itself (db1016)

https://gerrit.wikimedia.org/r/295562

Change 295563 had a related patch set uploaded (by Jcrespo):
Promote db1016 as the m1 shard master, set db1001 as a m1 slave

https://gerrit.wikimedia.org/r/295563

Change 295563 merged by Jcrespo:
Promote db1016 as the m1 shard master, set db1001 as a m1 slave

https://gerrit.wikimedia.org/r/295563
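
For the record, a hedged sketch of what the promotion itself typically involves at the MySQL level; the exact procedure depends on the local failover tooling, and the replication user and binlog coordinates below are placeholders, not values from this task:

  # 1. Make the old master read-only and let db1016 catch up.
  mysql -h db1001.eqiad.wmnet -e "SET GLOBAL read_only = 1;"

  # 2. Once db1016 has applied everything, stop replication on it and clear
  #    its slave configuration so it becomes a standalone, writable master.
  mysql -h db1016.eqiad.wmnet -e "STOP SLAVE; RESET SLAVE ALL;"
  mysql -h db1016.eqiad.wmnet -e "SET GLOBAL read_only = 0;"

  # 3. Point the old master at the new one so it becomes an m1 slave.
  #    Use the coordinates recorded on db1016, not these placeholders.
  mysql -h db1001.eqiad.wmnet -e "CHANGE MASTER TO
      MASTER_HOST='db1016.eqiad.wmnet',
      MASTER_USER='repl', MASTER_PASSWORD='...',
      MASTER_LOG_FILE='db1016-bin.000001', MASTER_LOG_POS=4;
      START SLAVE;"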

db1016 is the new master of m1.