
Upgrade s6 to Debian Buster and MariaDB 10.4
Open, Medium, Public

Description

s6 should be the first section to be upgraded entirely to 10.4 and Buster.

Steps to upgrade:

Please read the procedure documentation for more details.

Related Objects

Event Timeline

Marostegui triaged this task as Medium priority. Wed, Apr 21, 5:29 AM
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.

@jcrespo this is the next section we are going to fully switch on both DCs. Following the procedure written on the doc.

I have db1140:s6 and db2141:s6, already on Buster, ready to substitute the Stretch instances db1139:s6 and db2097:s6, respectively, whenever you are ready. Will prepare a patch.

Thanks - Stevie will be driving this task, so let's coordinate with her.

Change 681621 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switchover s6 codfw database backups from db2097 to db2141

https://gerrit.wikimedia.org/r/681621

Change 681622 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switchover s6 codfw database backups from db2097 to db2141

https://gerrit.wikimedia.org/r/681622

db1165 is ready to take over db1085 in s6 as sanitarium master.
They both need to be stopped at the same time and move db1155 (sanitarium for s2) under it. db1125 (old sanitarium) doesn't need to be moved, as it will be repurposed somewhere else (T258361)

Kormat updated the task description.

Steps to swap db1085 -> db1165:

  • Silence db1085, db1165 & db1155, and wikireplicas:
sudo cookbook sre.hosts.downtime --minutes 60 -r "Replace db1085 with db1165 T280751"  'A:db-clouddb or A:db-labsdb or P{db1085* or db1155* or db1165*}'
  • Depool instances:
sudo dbctl instance db1085 depool
sudo dbctl instance db1165 depool
sudo dbctl config commit -m "Depooling for sanitarium master switch T280751"
  • db-stop-in-sync db1085 db1165, and wait for db1155:s6 to catch up.
  • Switch db1155 over:
master-pos db1165.eqiad.wmnet
sudo mysql.py -h db1155:s6
> stop slave;
> reset slave all;
> change master to ...;
> start slave;
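The change master to ...; line in the steps above is deliberately left elided, since the real coordinates come from the master-pos output at switch time. As a hypothetical sketch only, the full statement could be assembled like this (host, binlog file, and position below are placeholders, not real values from this task):

```shell
# Sketch: build the CHANGE MASTER TO statement for db1155:s6 from the
# coordinates of the new sanitarium master. All values are placeholders.
new_master="db1165.eqiad.wmnet"
master_log_file="db1165-bin.000042"   # placeholder binlog file
master_log_pos=1234                   # placeholder binlog position

change_master_sql() {
  # Emit the SQL statement for the given host, file, and position.
  printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" \
    "$1" "$2" "$3"
}

change_master_sql "$new_master" "$master_log_file" "$master_log_pos"
```

The emitted statement would then be run inside the mysql.py session on db1155:s6, between the stop slave and start slave steps.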

Steps to swap db1085 -> db1165:

  • Silence db1085, db1165 & db1155.
  • Depool instances:
sudo dbctl instance db1085 depool
sudo dbctl instance db1165 depool
sudo dbctl config commit -m "Depooling for sanitarium master switch T280751"
  • db-stop-in-sync db1085 db1165, and wait for db1155:s6 to catch up.
  • Switch db1155 over:
master-pos db1165.eqiad.wmnet
sudo mysql.py -h db1155:s6
> stop slave;
> change master to ...;
> start slave;
  • Start replication again on db1165: sudo mysql.py -h db1165 -e 'start slave'
  • Wait for db1165 to catch up, then repool it.
  • Merge the puppet commit with the hiera changes
  • Remove db1085 from dbctl?

This works for me, small notes:

  • You might also want to silence the clouddb* hosts, as those will also alert on IRC.
  • After the stop slave you might also want to run reset slave all
  • Once db1155:s6 has caught up enable GTID: STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;
  • Start replication on db1085
  • I would leave db1085 depooled (but still in dbctl) and, if everything goes well after a few days, decommission it!
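The GTID note above can be collapsed into a single mysql.py invocation. A dry-run sketch (the command line is only printed here, not executed against db1155):

```shell
# Sketch: re-enable GTID replication on db1155:s6 once it has caught up,
# per the note above. Printed rather than executed; run on a cumin host
# in production.
gtid_enable_sql="STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;"
echo "sudo mysql.py -h db1155:s6 -e \"$gtid_enable_sql\""
```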

Thanks!

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:34:36Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Replace db1085 with db1165 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:34:45Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Replace db1085 with db1165 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:35:38Z] <kormat@cumin1001> dbctl commit (dc=all): 'Depooling for sanitarium master switch T280751', diff saved to https://phabricator.wikimedia.org/P15714 and previous config saved to /var/cache/conftool/dbconfig/20210504-123537-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:46:48Z] <kormat@cumin1001> dbctl commit (dc=all): 'Repooling after sanitarium master switch T280751', diff saved to https://phabricator.wikimedia.org/P15715 and previous config saved to /var/cache/conftool/dbconfig/20210504-124647-kormat.json

Change 684895 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] mariadb: Remove obsolete comment.

https://gerrit.wikimedia.org/r/684895

Change 684895 merged by Kormat:

[operations/puppet@production] mariadb: Remove obsolete comment.

https://gerrit.wikimedia.org/r/684895

Turns out the candidate master for s6/codfw (db2114) is already running buster/10.4.

Change 685440 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] install_server: Switch db2129 to buster.

https://gerrit.wikimedia.org/r/685440

Change 685440 merged by Kormat:

[operations/puppet@production] install_server: Switch db2129 to buster.

https://gerrit.wikimedia.org/r/685440

Script wmf-auto-reimage was launched by kormat on cumin2001.codfw.wmnet for hosts:

['db2129.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105051315_kormat_6374.log.

Mentioned in SAL (#wikimedia-operations) [2021-05-05T13:19:49Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Reimage db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T13:19:57Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Reimage db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T14:28:47Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Reimage db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T14:28:56Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Reimage db2129 T280751

Completed auto-reimage of hosts:

['db2129.codfw.wmnet']

and were ALL successful.

Change 681621 merged by Jcrespo:

[operations/puppet@production] dbbackups: Switchover s6 codfw database backups from db2097 to db2141

https://gerrit.wikimedia.org/r/681621

Mentioned in SAL (#wikimedia-operations) [2021-05-05T15:10:01Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: Table check on db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T15:10:08Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: Table check on db2129 T280751

Change 685494 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switchover backup generation for s6 on eqiad from db1139 to db1140

https://gerrit.wikimedia.org/r/685494

^I've prepared the backup failover for eqiad :-)

mysqlcheck --all-databases completed successfully on db2129.
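For reference, the post-reimage table check run on db2129 above looks like this. The socket path is an assumption, and the command line is only assembled and printed here, since running it for real requires access to the host:

```shell
# Sketch: integrity check of all tables on a freshly reimaged replica,
# as done for db2129. Socket path is an assumed value.
socket="/run/mysqld/mysqld.sock"   # assumed path
cmd="mysqlcheck --all-databases --socket=${socket}"
echo "$cmd"   # on the host: sudo $cmd (can take hours on a large replica)
```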

Change 685726 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: remove db2097 s6 section for this codfw backup source

https://gerrit.wikimedia.org/r/685726

Change 685753 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] install_server: switch db1173 to buster

https://gerrit.wikimedia.org/r/685753

Change 685754 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1173: Disable notifications

https://gerrit.wikimedia.org/r/685754

Change 685753 merged by Kormat:

[operations/puppet@production] install_server: switch db1173 to buster

https://gerrit.wikimedia.org/r/685753

Change 685754 merged by Kormat:

[operations/puppet@production] db1173: Disable notifications

https://gerrit.wikimedia.org/r/685754

Mentioned in SAL (#wikimedia-operations) [2021-05-06T11:12:56Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 depooling: Reimage to buster T280751', diff saved to https://phabricator.wikimedia.org/P15824 and previous config saved to /var/cache/conftool/dbconfig/20210506-111256-kormat.json

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['db1173.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105061118_kormat_25433.log.

Completed auto-reimage of hosts:

['db1173.eqiad.wmnet']

and were ALL successful.

db1173 (candidate master in eqiad) reimaged to buster, mysqlcheck --all-databases running now.

Mentioned in SAL (#wikimedia-operations) [2021-05-07T11:33:56Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15853 and previous config saved to /var/cache/conftool/dbconfig/20210507-113355-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-07T11:49:01Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15854 and previous config saved to /var/cache/conftool/dbconfig/20210507-114859-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-07T12:04:04Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15855 and previous config saved to /var/cache/conftool/dbconfig/20210507-120404-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-07T12:19:08Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15856 and previous config saved to /var/cache/conftool/dbconfig/20210507-121908-kormat.json
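The SAL entries above show db1173 being repooled gradually at 25%, 50%, 75%, and 100%. A hypothetical sketch of that ramp-up; the dbctl subcommand for setting a pooling percentage is not shown in this task, so the loop only prints the intended commit messages:

```shell
# Sketch: gradual repool of a reimaged replica, stepping the pooled
# percentage in four increments as seen in the SAL log. Commands are
# printed, not executed.
host="db1173"
for pct in 25 50 75 100; do
  echo "dbctl commit -m '${host} (re)pooling @ ${pct}%: reimaged to buster T280751'"
  # in production: wait ~15 minutes between steps and watch replication lag
done
```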