Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC
Closed, Resolved (Public)

Description

db1063, the current m1 master, has been suffering disk failures quite often (about one per week), and Chris has let us know that two more disks are about to fail (T231199#5443385).
We need to find a date to swap this host with db1135.

m1 currently holds the following active databases:

bacula
etherpadlite
librenms
puppet (not in use - to be deleted)
racktables
rt

We don't have many dates left without switchovers already scheduled, so I am proposing the following day:
Tuesday 10th September at 16:00 UTC
@akosiaris @Dzahn @ayounsi you've helped in the past to verify the services affected by this failover (mostly by restarting them).
The failover shouldn't take longer than a few seconds (it is a matter of restarting HAProxy so that it points to the new host).

Event Timeline

Marostegui moved this task from Triage to Pending comment on the DBA board.

@akosiaris @Dzahn @jcrespo @ayounsi let me know if that proposed day and time would work for you.
Thanks!

Fine with me, speaking for RT and racktables. Note that RT is separate from OTRS which is more critical.

Correct! My bad, sorry! Amending...
OTRS is in m2 :)

Sep 10 16:00 UTC sounds okish to me as far as bacula goes. A couple of full backups will probably fail (it's the start of the month, when full backups happen, and they might not have finished by then), but we can pick them out from the list and reschedule them manually.

Note that the puppet db can be dropped now that servermon has been killed.

Thanks @akosiaris!
As spoken on IRC, no need to re-schedule because of bacula.

Marostegui renamed this task from "Switchover m1 primary master: db1063 to db1135" to "Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC". Aug 29 2019, 10:54 AM

Change 534386 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1135 to m1 master

https://gerrit.wikimedia.org/r/534386

Mentioned in SAL (#wikimedia-operations) [2019-09-04T08:26:06Z] <marostegui> Reboot db1135 to pick up new kernel - T231403

Trizek-WMF subscribed.

Added for Tech News, since the Etherpad service is heavily used and 16:00 UTC is a common meeting hour.

Oh - thank you!

Will enwiki be the only wiki affected by this failover?

enwiki will not be affected by this failover. No wikis will be affected in fact.

I have reserved the window on the Deployments page.

Mentioned in SAL (#wikimedia-operations) [2019-09-10T15:37:05Z] <marostegui> Start pre-switchover for m1 steps T231403

Change 534386 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1135 to m1 master

https://gerrit.wikimedia.org/r/534386

Mentioned in SAL (#wikimedia-operations) [2019-09-10T16:10:18Z] <marostegui> Failover m1 from db1063 to db1135 - T231403

This was done.
Read-only starts: Tue Sep 10 16:10:39 UTC 2019
Read-only stops: Tue Sep 10 16:10:45 UTC 2019

Total read-only time: 6 seconds
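
For anyone wanting to double-check the figure, the read-only window can be recomputed from the two timestamps above. This is only a minimal sketch: the helper name `read_only_seconds` is hypothetical and not part of any switchover tooling, and it assumes the timestamps are in the `date`-style layout shown here, parsed under an English locale.

```python
from datetime import datetime

# Timestamp layout as printed above, e.g. "Tue Sep 10 16:10:39 UTC 2019".
# Both values are in UTC, so naive datetimes are enough to take a difference.
DATE_FORMAT = "%a %b %d %H:%M:%S UTC %Y"


def read_only_seconds(start: str, stop: str) -> int:
    """Return the whole seconds elapsed between two date-style UTC timestamps."""
    started = datetime.strptime(start, DATE_FORMAT)
    stopped = datetime.strptime(stop, DATE_FORMAT)
    return int((stopped - started).total_seconds())


# Values taken from the read-only start/stop lines in this comment.
print(read_only_seconds("Tue Sep 10 16:10:39 UTC 2019",
                        "Tue Sep 10 16:10:45 UTC 2019"))  # prints 6
```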

Thanks everyone who helped out, closing this!