Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC
Closed, Resolved · Public

Description

db1063, the current m1 master, has been having disk failures quite often (about one per week), and Chris has let us know that two more disks are about to fail (T231199#5443385).
We need to find a date to swap this host with db1135.

m1 currently holds the following active databases:

bacula
etherpadlite
librenms
puppet (not in use - to be deleted)
racktables
rt

We don't have many dates left without switchovers already scheduled, so I am proposing the following day:
Tuesday 10th September at 16:00 UTC
@akosiaris @Dzahn @ayounsi you've helped in the past to verify the services affected by this failover (mostly by restarting them).
The failover shouldn't take longer than a few seconds (it is a matter of restarting an HAProxy to point to the new host).
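
For readers unfamiliar with this kind of switchover, here is a toy model of the general idea, not the actual procedure used for m1. The step order (old primary read-only, replica caught up, proxy repointed, new primary writable) is an assumption, and `Host`, `Proxy`, `position` and `switchover` are hypothetical names made up for illustration; nothing here talks to a real database or to HAProxy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str
    read_only: bool
    position: int = 0  # hypothetical stand-in for a replication position


@dataclass
class Proxy:
    backend: str  # name of the host the proxy currently sends traffic to


def switchover(old: Host, new: Host, proxy: Proxy) -> List[str]:
    """Simulate a primary switchover and return the ordered steps taken."""
    steps = []

    # Assumed step 1: stop writes on the old primary.
    old.read_only = True
    steps.append(f"set {old.name} read-only")

    # Assumed step 2: refuse to continue if the replica is behind.
    if new.position != old.position:
        raise RuntimeError(f"{new.name} has not caught up with {old.name}")
    steps.append(f"{new.name} caught up")

    # Assumed step 3: repoint the proxy (stands in for the HAProxy restart).
    proxy.backend = new.name
    steps.append(f"proxy now points to {new.name}")

    # Assumed step 4: allow writes on the new primary.
    new.read_only = False
    steps.append(f"set {new.name} read-write")
    return steps


if __name__ == "__main__":
    old = Host("db1063", read_only=False, position=10)
    new = Host("db1135", read_only=True, position=10)
    for step in switchover(old, new, Proxy(backend="db1063")):
        print(step)
```

In this model the read-only window is the time between the first and last step, which is why the interruption can be kept to a few seconds.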

Details

Related Gerrit Patches:
[operations/puppet@production] mariadb: Promote db1135 to m1 master

Event Timeline

Restricted Application added a subscriber: Scoopfinder. Aug 28 2019, 5:15 AM
Marostegui triaged this task as Medium priority. Aug 28 2019, 5:15 AM
Marostegui moved this task from Triage to Next on the DBA board.

@akosiaris @Dzahn @jcrespo @ayounsi let me know if that proposed day and time would work for you.
Thanks!

Fine with me, speaking for RT and racktables. Note that RT is separate from OTRS which is more critical.

Correct! My bad, sorry! Amending...
OTRS is in m2 :)

Marostegui updated the task description.
Marostegui removed subscribers: Krenair, Scoopfinder.

Sep 10 16:00 UTC sounds okish to me as far as bacula goes. It's the start of the month, when full backups happen, and they might not have finished by then, so a couple of full backups will probably fail. We can pick those out from the list and reschedule them manually.

Note that the puppet db can be dropped now that servermon has been killed.

Thanks @akosiaris!
As spoken on IRC, no need to re-schedule because of bacula.

Marostegui renamed this task from Switchover m1 primary master: db1063 to db1135 to Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC. Aug 29 2019, 10:54 AM
Marostegui moved this task from Next to In progress on the DBA board. Sep 3 2019, 2:56 PM

Change 534386 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1135 to m1 master

https://gerrit.wikimedia.org/r/534386

Mentioned in SAL (#wikimedia-operations) [2019-09-04T08:26:06Z] <marostegui> Reboot db1135 to pick up new kernel - T231403

Trizek-WMF added a subscriber: Trizek-WMF.

Added for Tech News, since the Etherpad service is heavily used and 16:00 UTC is a common meeting hour.

Oh - thank you!

Will enwiki be the only wiki affected by this failover?

enwiki will not be affected by this failover. No wikis will be affected in fact.

I have reserved the window on the Deployments page.

Marostegui updated the task description. Sep 6 2019, 5:37 AM

Mentioned in SAL (#wikimedia-operations) [2019-09-10T15:37:05Z] <marostegui> Start pre-switchover for m1 steps T231403

Change 534386 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1135 to m1 master

https://gerrit.wikimedia.org/r/534386

Mentioned in SAL (#wikimedia-operations) [2019-09-10T16:10:18Z] <marostegui> Failover m1 from db1063 to db1135 - T231403

This was done.
Read-only starts: Tue Sep 10 16:10:39 UTC 2019
Read-only stops: Tue Sep 10 16:10:45 UTC 2019

Total read-only time: 6 seconds
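
As a quick cross-check of that figure, the duration can be recomputed from the two timestamps above. This is only a small sketch: the format string assumes the timestamps are in `date -u` style with English day and month names, and `read_only_seconds` is a name made up for illustration.

```python
from datetime import datetime

# Assumed format of the timestamps in the report (output of `date -u`).
FMT = "%a %b %d %H:%M:%S UTC %Y"


def read_only_seconds(start: str, stop: str) -> int:
    """Return the whole seconds elapsed between two `date -u` style timestamps."""
    delta = datetime.strptime(stop, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


if __name__ == "__main__":
    print(read_only_seconds("Tue Sep 10 16:10:39 UTC 2019",
                            "Tue Sep 10 16:10:45 UTC 2019"))
```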

Marostegui closed this task as Resolved. Sep 10 2019, 4:22 PM

Thanks everyone who helped out, closing this!