
Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC
Closed, Resolved · Public

Description

db1159 has been cloned from db1117:3321.
db1080 needs to be decommissioned.

Let's give db1159 (running 10.4.18) a full week to make sure it is ok and then schedule a day to promote it to master.
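
A minimal sketch for that burn-in week (assumes client access to the host or running it locally against the socket; not part of the original plan): periodically confirm db1159 is replicating cleanly and not lagging before scheduling the promotion.

# Replication health check on the candidate master during the burn-in week
mysql -h db1159.eqiad.wmnet -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_.*Errno'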

Databases running on m1 master:

bacula9
cas
cas_staging
dbbackups
etherpadlite
librenms
pki
racktables
rddmarc
rt

Pre steps:

  • Upgrade all m1 hosts to 10.4.18 (see the version-check sketch after this list)
    • db2132
    • db2078
    • db1117
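
A minimal pre-check sketch (FQDNs assumed from the host names above; multi-instance hosts such as db1117 may need an explicit port or socket, and credentials are assumed to come from the usual client config): confirm every m1 host reports 10.4.18 before the switchover.

# Check the running MariaDB version on each m1 host
for host in db2132.codfw.wmnet db2078.codfw.wmnet db1117.eqiad.wmnet \
            db1080.eqiad.wmnet db1159.eqiad.wmnet; do
    echo "== $host =="
    mysql -h "$host" -e "SELECT @@version"
done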

Switchover steps:

OLD MASTER: db1080

NEW MASTER: db1159

Check configuration differences between new and old master

  • $ pt-config-diff h=db1159.eqiad.wmnet,F=/root/.my.cnf h=db1080.eqiad.wmnet,F=/root/.my.cnf
  • Silence alerts on all hosts
  • Topology changes: move everything under db1159

db-switchover --timeout=1 --only-slave-move db1080.eqiad.wmnet db1159.eqiad.wmnet
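
A hedged verification step, not part of the original checklist (SHOW SLAVE HOSTS only lists replicas that set report_host): after the replica move, db1159 should show all former siblings underneath it, while db1080 should only have db1159 left.

# Compare the replica lists on the new and old master after --only-slave-move
mysql -h db1159.eqiad.wmnet -e "SHOW SLAVE HOSTS"
mysql -h db1080.eqiad.wmnet -e "SHOW SLAVE HOSTS"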

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover

!log Failover m1 from db1080 to db1159 - T276448

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1080 db1159
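
A quick sanity check after the switchover (not in the original checklist; assumes client access from the cumin host): db1159 should now report read_only = 0 and db1080 read_only = 1.

# Confirm the read_only flag flipped as expected on both hosts
for host in db1159.eqiad.wmnet db1080.eqiad.wmnet; do
    echo "== $host =="
    mysql -h "$host" -e "SELECT @@hostname, @@read_only"
done
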
  • Reload haproxies
dbproxy1012:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1014:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  • kill connections on the old master (db1080)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysql.sock
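
A hedged follow-up, run locally on db1080: after pt-kill has cleared the application connections, only system and replication threads should remain.

# Count remaining connections per user on the old master
mysql -S /run/mysqld/mysql.sock -e \
  "SELECT user, COUNT(*) AS connections FROM information_schema.processlist GROUP BY user"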

  • Restart puppet on old and new masters (for heartbeat): db1080 and db1159

puppet agent --enable && puppet agent -tv
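
A hedged heartbeat check (heartbeat.heartbeat with ts and server_id follows the standard pt-heartbeat schema; anything beyond that is an assumption): the most recent rows should now be written by the new master, db1159.

# Verify pt-heartbeat is writing from the new master after the puppet run
mysql -h db1159.eqiad.wmnet -e \
  "SELECT server_id, ts FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 5"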

  • Check services affected (librenms, racktables, etherpad...)
  • Change the events for the query killer (see the sketch after this list): apply events_coredb_master.sql on the new master (db1159) and events_coredb_slave.sql on the new slave (db1080)
  • Clean up orchestrator heartbeat to remove the old master's entry.
  • Create decommissioning ticket for db1080: T280121
  • Update/resolve phabricator ticket about failover https://phabricator.wikimedia.org/T276448
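
A hedged check for the query-killer step above (the actual event names come from whatever events_coredb_master.sql / events_coredb_slave.sql define, so this only lists what is enabled): compare the events on db1159 and db1080 after the change.

# List scheduled events and their status on both hosts
for host in db1159.eqiad.wmnet db1080.eqiad.wmnet; do
    echo "== $host =="
    mysql -h "$host" -e \
      "SELECT event_schema, event_name, status FROM information_schema.events"
done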

Event Timeline

Marostegui triaged this task as Medium priority. Mar 4 2021, 12:36 PM
Marostegui moved this task from Triage to Blocked on the DBA board.

Change 668449 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbbackups: Update backup metadata host db1080->db1159

https://gerrit.wikimedia.org/r/668449

Marostegui added a subscriber: jcrespo.

@jcrespo I would like to do this on Wednesday 14th April - is this a good day or will it mess with the backups? I have no problem scheduling it on any other day

> @jcrespo I would like to do this on Wednesday 14th April - is this a good day or will it mess with the backups? I have no problem scheduling it on any other day

Once this is confirmed I will ping the rest of service owners

> @jcrespo I would like to do this on Wednesday 14th April - is this a good day or will it mess with the backups? I have no problem scheduling it on any other day

What time is this happening?

In the last months, the latest the backups have finished is 8 UTC (normally they finish by 5 UTC). As long as this happens after that, we will be ok.

> @jcrespo I would like to do this on Wednesday 14th April - is this a good day or will it mess with the backups? I have no problem scheduling it on any other day
>
> What time is this happening?
>
> In the last months, the latest the backups have finished is 8 UTC (normally they finish by 5 UTC). As long as this happens after that, we will be ok.

I can adapt to whatever works best for the backups

As long as it is not too early in the morning, the 14th will be ok. We may want to do it late in the morning so that etherpad and other owners are around? It should be ok as long as we merge the patch I prepared after the switchover.

What about 10 UTC? Would that work for the backups? I will ping the other owners if this works for you

> What about 10 UTC? Would that work for the backups? I will ping the other owners if this works for you

Sure.

Thank you Jaime.

@akosiaris would you be available tomorrow, 14th April, at around 10 AM UTC in case we need to restart etherpad?
@jbond @MoritzMuehlenhoff ok to restart mysql from cas and pki point of view tomorrow 14th April?
@ayounsi ok to restart mysql from librenms point of view tomorrow 14th April?

> Thank you Jaime.
>
> @akosiaris would you be available tomorrow, 14th April, at around 10 AM UTC in case we need to restart etherpad?

Yes

> @jbond @MoritzMuehlenhoff ok to restart mysql from cas and pki point of view tomorrow 14th April?

Sounds good

Marostegui renamed this task from Failover m1 master: db1080 -> db1159 to Failover m1 master: db1080 -> db1159 Wed 14th April at 10 AM UTC. Apr 13 2021, 9:01 AM

Change 678801 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1159 to m1 master

https://gerrit.wikimedia.org/r/678801

Moving this to 10:30 AM UTC as there's a power maintenance scheduled in my building which is supposed to end at 10:00 AM UTC, but just in case...

As a pre-step, everything has been moved under the new host.

[Attached screenshot: Captura de pantalla 2021-04-14 a las 11.45.10.png (304×1 px, 56 KB)]

Change 678801 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1159 to m1 master

https://gerrit.wikimedia.org/r/678801

Mentioned in SAL (#wikimedia-operations) [2021-04-14T10:30:33Z] <marostegui> Failover m1 from db1080 to db1159 - T276448

Change 668449 merged by Jcrespo:

[operations/puppet@production] dbbackups: Update backup metadata host db1080->db1159

https://gerrit.wikimedia.org/r/668449

Everything looks good, we are running some final checks to ensure backup infra is working fine after the swap.
The RO time was around 10 seconds.

Backup metadata looking good:

root@db1159.eqiad.wmnet[dbbackups]> select * FROM backups order by id desc limit 1\G
*************************** 1. row ***************************
        id: 11052
      name: snapshot.s7.2021-04-14--10-50-22
    status: ongoing
    source: db2100.codfw.wmnet:3317
      host: dbprov2002.codfw.wmnet
      type: snapshot
   section: s7
start_date: 2021-04-14 12:29:30
  end_date: NULL
total_size: 1151727406294
1 row in set (0.032 sec)

(end_date is NULL because the file copy finished, but postprocessing, probably compression, hasn't fully finished yet)

So everything is looking great from my side!

Thanks!
Closing this!