T288720 Failover m5 master (db1128) to db1132 to upgrade its kernel

Subject	Repo	Branch	Lines +/-
wmnet: Restore TTL back to 5M for m5-master	operations/dns	master	+1 -1
dbproxy10{17,21}: Change m5 standby host	operations/puppet	production	+4 -4
mariadb: Promote db1132 to m5 master	operations/puppet	production	+14 -15
db1132: Enable notifications	operations/puppet	production	+0 -1
mariadb: Move db1132 to m5.	operations/puppet	production	+7 -6

Status	Assigned	Task
Resolved	• Marostegui	T288720 Failover m5 master (db1128) to db1132 to upgrade its kernel
Resolved	• Marostegui	T288093 Place m5 proxies in codfw and eqiad
Resolved	bd808	T294437 Add egress rules for dbproxy1017 & dbproxy1021
Resolved	• Marostegui	T295524 Upgrade m5 hosts to 10.4.21

Restricted Application added a subscriber: Aklapper. Aug 12 2021, 6:11 AM

• Marostegui triaged this task as Medium priority. Aug 12 2021, 6:11 AM

• Marostegui moved this task from Triage to Blocked on the DBA board.

• Marostegui mentioned this in T288197: Failover m3 (phabricator) master (db1132) to a different host to upgrade its kernel.

This switchover needs to be done AFTER we have moved wikitech out of m5 (T167973),
so I don't expect it to happen before October.

• Marostegui added a project: cloud-services-team. Aug 12 2021, 6:15 AM

• Marostegui updated the task description.

• Marostegui added subscribers: Ladsgroup, Legoktm.

Restricted Application edited projects, added cloud-services-team (Kanban); removed cloud-services-team. Aug 12 2021, 6:15 AM

@Legoktm @Ladsgroup cloud-services-team: I am adding you to this tag as you have services running here that will be affected by this failover.
The failover means around 1-2 minutes of read-only time (the DNS TTL for m5-master.eqiad.wmnet will be 1 minute).
Keep in mind that right now m5 doesn't use the proxies, which is why the failover needs to be done via a DNS change. I am in the process of re-enabling the proxies on m5 (T288093), but that cannot happen until wikitech is moved to s6, and it won't happen between that move and this failover anyway.
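To illustrate why the TTL matters for a DNS-based switch, here is a minimal Python sketch. It assumes a simplified client that caches the m5-master record for exactly the TTL and re-resolves only when the cache expires; the function name and the caching model are hypothetical and not part of any existing tooling.

```python
def seconds_until_refresh(cached_at: float, changed_at: float, ttl: float) -> float:
    """Seconds after the DNS change until a caching client re-resolves the name.

    Assumption (hypothetical model): the client caches the record for exactly
    `ttl` seconds and re-resolves as soon as the cache expires.
    """
    if ttl <= 0:
        raise ValueError("ttl must be positive")
    if changed_at < cached_at:
        raise ValueError("the change must happen after the record was cached")
    # Age of the cached record at the moment the DNS record is changed.
    age = (changed_at - cached_at) % ttl
    return ttl - age


# Worst case: the client cached the old record at the moment of the change.
one_minute_worst = seconds_until_refresh(cached_at=0, changed_at=0, ttl=60)
five_minute_worst = seconds_until_refresh(cached_at=0, changed_at=0, ttl=300)
```

Under this model a 1 minute TTL bounds the window in which a client still resolves the old master at 60 seconds, compared with 300 seconds for a 5 minute TTL.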

The wikitech move is scheduled for the 16th of Sept, and after that I am planning to take a long holiday, so I don't expect this to happen before October. I will try to set a date closer to the time.

Sounds good. We probably want to restart Mailman so it reconnects to the new database server. I don't really know how Mailman will behave when receiving emails if the DB is read-only, but we definitely have enough time to test it in Cloud beforehand. Worst case we just turn off Mailman for a minute or two.

Keep in mind that after the switch, I will kill all the connections on the old host, so it might reconnect to the new one itself. But yeah, a controlled restart doesn't hurt if it can be done.
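For services that do not reconnect on their own once the old connections are killed, a retry loop along these lines could help. This is a generic Python sketch: `connect_with_retry` and the `connect` callable are hypothetical placeholders, not the API of Mailman or of any real database client.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds or `attempts` is exhausted.

    `connect` is a hypothetical zero-argument callable that opens a new
    database connection and raises ConnectionError on failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[ConnectionError] = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            # Wait before the next try, but not after the final failure.
            if attempt < attempts:
                sleep(delay)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

A controlled restart of the service remains the simpler option where it is possible; the loop only covers clients that would otherwise fail on their first query after the switch.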

Change 712849 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move db1132 to m5.

https://gerrit.wikimedia.org/r/712849

gerritbot added a project: Patch-For-Review. Aug 13 2021, 5:07 AM

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1132.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108130509_marostegui_17104.log.

Completed auto-reimage of hosts:

['db1132.eqiad.wmnet']

and were ALL successful.

• Marostegui updated the task description. Aug 13 2021, 5:47 AM

db1132 is now replicating from db1128. If all goes OK, I will change the haproxy config to use it, so that if db1128 goes down we get a "free" swap instead of failing over to db1117:3325.

Change 712849 merged by Marostegui:

[operations/puppet@production] mariadb: Move db1132 to m5.

https://gerrit.wikimedia.org/r/712849

Maintenance_bot removed a project: Patch-For-Review. Aug 13 2021, 8:10 AM

Change 713090 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1132: Enable notifications

https://gerrit.wikimedia.org/r/713090

Change 713090 merged by Marostegui:

[operations/puppet@production] db1132: Enable notifications

https://gerrit.wikimedia.org/r/713090

Maintenance_bot removed a project: Patch-For-Review. Aug 16 2021, 5:10 AM

Added db1132 as standby host for m5 proxies (unused for now)

Let me know when the date and time are set so I can write a quick announcement and restart the Mailman services.

Will do! Thank you

Let's do this once T288093: Place m5 proxies in codfw and eqiad is completed, as it will be a lot easier and involve less downtime.

• Marostegui added a subtask: T288093: Place m5 proxies in codfw and eqiad. Oct 27 2021, 7:43 AM

• Marostegui closed subtask T288093: Place m5 proxies in codfw and eqiad as Resolved. Nov 10 2021, 2:14 PM

• Marostegui moved this task from Blocked to Ready on the DBA board. Nov 10 2021, 2:32 PM

db1132 needs restarting as it has the old my.cnf values.

Mentioned in SAL (#wikimedia-operations) [2021-11-11T08:13:06Z] <marostegui> Restart db1132 T288720

Mentioned in SAL (#wikimedia-operations) [2021-11-11T08:17:10Z] <marostegui> Upgrade db2078 T288720

• Marostegui mentioned this in T295524: Upgrade m5 hosts to 10.4.21. Nov 11 2021, 8:22 AM

• Marostegui updated the task description. Nov 11 2021, 8:26 AM

@bd808 @Andrew @Legoktm @Ladsgroup now that we have the proxy in place for m5 we can proceed with this. It should only be a few seconds of read-only time. I don't know if you want/need to be present for this maintenance. I was thinking about Tuesday 23rd at 14:00 UTC?

Let me know your thoughts!
Thanks.

I can be around for that time.

• Marostegui claimed this task. Nov 12 2021, 7:06 AM

• Marostegui moved this task from Ready to In progress on the DBA board.

In T288720#7497969, @Marostegui wrote:

I was thinking about Tuesday 23rd at 14:00 UTC?

This date/time works for me too.

In T288720#7500354, @bd808 wrote:

In T288720#7497969, @Marostegui wrote:

I was thinking about Tuesday 23rd at 14:00 UTC?

This date/time works for me too.