Failover m2 master (db1107) to a different host to upgrade its kernel
Closed, ResolvedPublic

Description

db1107 needs its kernel upgraded. Let's fail it over to a different host and then move db1107 to m3 so that one can be failed over too (the same needs to happen on m3 and m5, each of which will have its own task).

Floating host: db1183

When: Thursday 5th August 2021 at 08:00 AM UTC.

Databases on m2:

debmonitor
iegreview
mwaddlink
mysql
otrs
recommendationapi
scholarships
sockpuppet
xhgui

Failover process

OLD MASTER: db1107

NEW MASTER: db1183

  • Check configuration differences between new and old master

$ pt-config-diff h=db1107.eqiad.wmnet,F=/root/.my.cnf h=db1183.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1183

db-switchover --timeout=15 --only-slave-move db1107.eqiad.wmnet db1183.eqiad.wmnet

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover

!log Failover m2 from db1107 to db1183 - T287852
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1107 db1183

  • Reload haproxies

dbproxy1013:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1015:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1107)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysql.sock

  • Restart puppet on old and new masters (for heartbeat): db1183 and db1107

puppet agent --enable && puppet agent -tv

  • Check affected services (otrs, debmonitor): DEBMONITOR and OTRS looking good
  • Clean the orchestrator heartbeat to remove the old master's entry, otherwise Orchestrator will show lag
  • Close this ticket and create a ticket to failover m3
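
The "show stat" output from the haproxy reload step is a wide CSV, so a small filter can make it easier to eyeball which backend each proxy considers up. The following is only a sketch: the helper name `haproxy_status` and the sample rows (including the proxy name `mariadb` and the UP/DOWN values) are invented for illustration and are not taken from the real dbproxy hosts. It assumes the standard HAProxy CSV layout, where field 1 is the proxy name, field 2 the server name and field 18 the status.

```shell
# Print proxy name, server name and status from HAProxy "show stat" CSV.
# Field positions (1=pxname, 2=svname, 18=status) follow the standard CSV header.
haproxy_status() {
    # Skip the "# pxname,..." header line and any short or empty lines.
    awk -F, '!/^#/ && NF >= 18 { print $1, $2, $18 }'
}

# On a proxy host the input would come from the stats socket, e.g.:
#   echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status

# Hypothetical sample input (values invented, not from the real proxies):
sample='# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight
mariadb,FRONTEND,,,0,1,5000,10,0,0,0,0,0,,,,,OPEN,
mariadb,db1183,0,0,0,1,,10,0,0,,0,,0,0,0,0,UP,1
mariadb,db1107,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN,1'

status_out=$(printf '%s\n' "$sample" | haproxy_status)
printf '%s\n' "$status_out"
```

With the sample input this prints one line per row, such as `mariadb db1183 UP`. On a real proxy the row names depend on the local haproxy configuration.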

Event Timeline

Marostegui renamed this task from Failover m1 master (db1159) to a different host to upgrade its kernel to Failover m2 master (db1107) to a different host to upgrade its kernel. Aug 2 2021, 11:11 AM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.
Marostegui added a parent task: Restricted Task. Aug 2 2021, 11:13 AM

Change 709410 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1183: Disable notifications

https://gerrit.wikimedia.org/r/709410

Change 709410 merged by Marostegui:

[operations/puppet@production] db1183: Disable notifications

https://gerrit.wikimedia.org/r/709410

Change 709412 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1183: Move it from m5 to m2

https://gerrit.wikimedia.org/r/709412

Change 709412 merged by Marostegui:

[operations/puppet@production] db1183: Move it from m5 to m2

https://gerrit.wikimedia.org/r/709412

Marostegui added subscribers: dpifke, hnowlan, kostajh and 3 others.

@MoritzMuehlenhoff @dpifke @Krinkle @bd808 @hnowlan @kostajh I am planning to failover this host on Thursday at 08:00 AM UTC.
This means there will be a few seconds of read only time (I expect between 5 and 10 seconds) if all goes fine.

Please let me know if there's something that requires action and Thursday isn't a doable date.
Thanks!

Sounds good, no action is needed for debmonitor.

Marostegui updated the task description.
Marostegui updated the task description.

iegreview and scholarships should handle the maintenance without major issue.

db1183 is now up and replicating from db1107

No objection for sockpuppet, thanks!

Thank you all for the fast replies!

Change 709673 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1013,dbproxy1015: Promote db1183 to master

https://gerrit.wikimedia.org/r/709673

All set for tomorrow's failover!

Change 710215 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1183: Enable notifications

https://gerrit.wikimedia.org/r/710215

Change 710215 merged by Marostegui:

[operations/puppet@production] db1183: Enable notifications

https://gerrit.wikimedia.org/r/710215

Change 710216 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1183 to m2 master.

https://gerrit.wikimedia.org/r/710216

Change 710216 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1183 to m2 master.

https://gerrit.wikimedia.org/r/710216

Change 709673 merged by Marostegui:

[operations/puppet@production] dbproxy1013,dbproxy1015: Promote db1183 to master

https://gerrit.wikimedia.org/r/709673

Mentioned in SAL (#wikimedia-operations) [2021-08-05T08:00:13Z] <marostegui> Failover m2 from db1107 to db1183 - T287852

Failover was done.
Read-only times:
Start: 08:00:29 AM UTC
Stop: 08:00:47 AM UTC

Total: 18 seconds
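
For anyone wanting to double-check the total, the window can be recomputed from the two timestamps. This is just a sketch that assumes GNU date (for the `-d` option); the calendar date is taken from the SAL entry above.

```shell
# Recompute the read-only window from the start and stop timestamps.
# Requires GNU date; -u makes both strings be interpreted as UTC.
start=$(date -u -d '2021-08-05 08:00:29' +%s)
stop=$(date -u -d '2021-08-05 08:00:47' +%s)
total=$((stop - start))
echo "Read-only window: ${total} seconds"
```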

Change 710217 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1107: Disable notifications

https://gerrit.wikimedia.org/r/710217

Change 710217 merged by Marostegui:

[operations/puppet@production] db1107: Disable notifications

https://gerrit.wikimedia.org/r/710217

Marostegui updated the task description.