Failover m5 master (db1128) to db1132 to upgrade its kernel
Closed, Resolved (Public)

Description

db1128 (m5 master) needs its kernel upgraded, so let's fail it over to db1132.

  • Reimage db1132
  • Move db1132 to m5 as slave

Databases on m5 (excluding labswiki, as hopefully it won't be there once this is ready to happen):

labsdbaccounts
mailman3
mailman3web
striker
test_labsdbaccounts
toolhub

When: Tuesday 23rd Nov at 14:00 UTC

Failover process

OLD MASTER: db1128

NEW MASTER: db1132

  • Decrease m5-master TTL to 1M
  • Check configuration differences between new and old master

$ pt-config-diff h=db1128.eqiad.wmnet,F=/root/.my.cnf h=db1132.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1132

db-switchover --timeout=15 --only-slave-move db1128.eqiad.wmnet db1132.eqiad.wmnet

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m5 from db1128 to db1132 - T288720
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1128 db1132

  • Reload haproxies

dbproxy1017:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1021:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1128)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysql.sock
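
For the "Reload haproxies" step, the `show stat` output is CSV with a header line that begins with `# `. Below is a minimal sketch of a filter that reduces it to proxy, server, and status, which may make it easier to confirm which backend is active after the reload. The function name `haproxy_status` is made up for illustration and is not part of the original procedure; `pxname`, `svname`, and `status` are the standard HAProxy stats header fields.

```shell
# Summarize HAProxy `show stat` CSV as "<proxy> <server> <status>" lines.
# Column positions are looked up from the header line instead of being
# hard-coded, so the filter keeps working if columns are added.
haproxy_status() {
  awk -F, '
    /^# / {
      sub(/^# /, "")                      # strip the header marker; fields are re-split
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    NF > 1 && ("status" in col) {         # skip the trailing empty line
      print $(col["pxname"]), $(col["svname"]), $(col["status"])
    }
  '
}
```

It could be appended to the command from that step, for example `echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status`.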

Event Timeline

This switchover needs to be done AFTER we have moved wikitech out of m5 (T167973).
So I don't expect this switchover to be done before October.

@Legoktm @Ladsgroup cloud-services-team adding you to this tag as you have services running here that will be affected by this failover.
The failover means around 1-2 minutes of read-only time (the DNS TTL for m5-master.eqiad.wmnet will be 1 minute).
Keep in mind that right now m5 doesn't use the proxies, which is why this needs to be done via a DNS change. I am in the process of re-enabling the proxies on m5 (T288093), but that cannot happen until wikitech is moved to s6, and it won't happen between that move and this failover anyway.
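
Since this plan relies on the DNS TTL, it may help to confirm the lowered value before starting. Below is a small sketch, assuming `dig` is available on the host; the helper name `answer_ttl` is hypothetical. It reads a dig answer section and prints the TTL (the second field) of the first record.

```shell
# Print the TTL (in seconds) of the first record in a dig answer section.
# Expects input lines shaped like: "<name> <ttl> IN <type> <data>".
answer_ttl() {
  awk 'NF >= 4 { print $2; exit }'
}

# Example (needs network access, so it is left commented out):
#   dig +noall +answer m5-master.eqiad.wmnet | answer_ttl
```

When the query goes through a caching resolver, the value shown is the remaining TTL, so it can be lower than the configured one.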

The wikitech move is scheduled for the 16th of Sept, and after that I am planning to take a long holiday, so I don't expect this to happen before October. I will try to set a date closer to the time.

Sounds good. We probably want to restart Mailman so it reconnects to the new database server. I don't really know how Mailman will behave when receiving emails while the DB is read-only, but we definitely have enough time to test it in Cloud beforehand. Worst case, we just turn off Mailman for a minute or two.

Keep in mind that after the switch, I will kill all the connections on the old host, so it might reconnect to the new one itself. But yeah, a controlled restart doesn't hurt if it can be done.

Change 712849 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move db1132 to m5.

https://gerrit.wikimedia.org/r/712849

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1132.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108130509_marostegui_17104.log.

Completed auto-reimage of hosts:

['db1132.eqiad.wmnet']

and were ALL successful.

db1132 is now replicating from db1128. If all goes OK, I will change it in the haproxy config so that, in case db1128 goes down, we'll get a "free" swap instead of failing over to db1117:3325.

Change 712849 merged by Marostegui:

[operations/puppet@production] mariadb: Move db1132 to m5.

https://gerrit.wikimedia.org/r/712849

Change 713090 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1132: Enable notifications

https://gerrit.wikimedia.org/r/713090

Change 713090 merged by Marostegui:

[operations/puppet@production] db1132: Enable notifications

https://gerrit.wikimedia.org/r/713090

Added db1132 as standby host for m5 proxies (unused for now)

Let me know when the date and time are set so I can write a quick announcement and restart the Mailman services.

Let's do this once T288093: Place m5 proxies in codfw and eqiad is completed, as it will be a lot easier and with less downtime.

db1132 needs restarting as it has the old my.cnf values.

Marostegui added subscribers: Andrew, bd808.

@bd808 @Andrew @Legoktm @Ladsgroup now that we have the proxy in place for m5 we can proceed with this. It should only be a few seconds of read-only time. I don't know if you want/need to be present for this maintenance. I was thinking about Tuesday 23rd at 14:00 UTC?

Let me know your thoughts!
Thanks.

I was thinking about Tuesday 23rd at 14:00 UTC?

This date/time works for me too.

I was thinking about Tuesday 23rd at 14:00 UTC?

This date/time works for me too.

works for me!

I'll let Amir take my spot this time, and sleep in a bit :)

Thank you all! I will get this scheduled!

Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2021-11-19T06:55:33Z] <marostegui> Reboot db1132 to pick up new kernel T288720

Rebooted db1132 (future master) to pick up the latest kernel.

Change 740714 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1132 to m5 master

https://gerrit.wikimedia.org/r/740714

Change 740714 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1132 to m5 master

https://gerrit.wikimedia.org/r/740714

Mentioned in SAL (#wikimedia-operations) [2021-11-23T14:00:33Z] <marostegui> Failover m5 from db1128 to db1132 - T288720

The switchover was done
RO started: 14:01:00
RO finished: 14:01:17

Total RO time: 17 seconds
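
The total above is simply the difference between the two timestamps. Below is a sketch of that arithmetic, assuming GNU `date` (for `-d`) and assuming the times are UTC on 2021-11-23, the date of the SAL entry; `ro_seconds` is an illustrative name, not a tool used in the switchover.

```shell
# Print the number of seconds between two timestamps, interpreted as UTC.
# Requires GNU date for the -d option.
ro_seconds() {
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $((end - start))
}

ro_seconds "2021-11-23 14:01:00" "2021-11-23 14:01:17"
```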

Change 740839 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy10{17,21}: Change m5 standby host

https://gerrit.wikimedia.org/r/740839

Change 740839 merged by Marostegui:

[operations/puppet@production] dbproxy10{17,21}: Change m5 standby host

https://gerrit.wikimedia.org/r/740839

Change 740964 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Restore TTL back to 5M for m5-master

https://gerrit.wikimedia.org/r/740964

Change 740964 merged by Marostegui:

[operations/dns@master] wmnet: Restore TTL back to 5M for m5-master

https://gerrit.wikimedia.org/r/740964

Mentioned in SAL (#wikimedia-operations) [2021-11-24T06:05:54Z] <marostegui> Upgrade db1128's kernel T288720

Marostegui updated the task description.

This is all done!