
Upgrade s6 to Debian Buster and MariaDB 10.4
Closed, Resolved · Public

Description

s6 should be the first section to be upgraded entirely to 10.4 and Buster.

Steps to upgrade:

Please read the procedure documentation for more details.

Related Objects

Event Timeline


db1165 is ready to take over db1085 in s6 as sanitarium master.
Both need to be stopped at the same time so that db1155 (sanitarium for s2) can be moved under it. db1125 (the old sanitarium) doesn't need to be moved, as it will be repurposed somewhere else (T258361).

Kormat updated the task description.

Steps to swap db1085 -> db1165:

  • Silence db1085, db1165 & db1155, and wikireplicas:
sudo cookbook sre.hosts.downtime --minutes 60 -r "Replace db1085 with db1165 T280751"  'A:db-clouddb or A:db-labsdb or P{db1085* or db1155* or db1165*}'
  • Depool instances:
sudo dbctl instance db1085 depool
sudo dbctl instance db1165 depool
sudo dbctl config commit -m "Depooling for sanitarium master switch T280751"
  • db-stop-in-sync db1085 db1165, and wait for db1155:s6 to catch up.
  • Switch db1155 over:
master-pos db1165.eqiad.wmnet
sudo mysql.py -h db1155:s6
> stop slave;
> reset slave all;
> change master to ...;
> start slave;

Steps to swap db1085 -> db1165:

  • Silence db1085, db1165 & db1155.
  • Depool instances:
sudo dbctl instance db1085 depool
sudo dbctl instance db1165 depool
sudo dbctl config commit -m "Depooling for sanitarium master switch T280751"
  • db-stop-in-sync db1085 db1165, and wait for db1155:s6 to catch up.
  • Switch db1155 over:
master-pos db1165.eqiad.wmnet
sudo mysql.py -h db1155:s6
> stop slave;
> change master to ...;
> start slave;
  • Start replication again on db1165: sudo mysql.py -h db1165 -e 'start slave'
  • Wait for db1165 to catch up, then repool it.
  • Merge the puppet commit with the hiera changes
  • Remove db1085 from dbctl?
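The `change master to ...;` step in the plans above can be sketched as follows. This is only an illustrative expansion: the real binlog coordinates come from the `master-pos` output captured while replication is stopped, and every literal value below (port, log file, log position) is a hypothetical placeholder, not taken from this task.

```shell
#!/bin/sh
# Illustrative expansion of the "change master to ...;" step.
# All literal values are placeholders; the real coordinates come from
# the master-pos output taken while both hosts are stopped in sync.
NEW_MASTER="db1165.eqiad.wmnet"   # from this task
LOG_FILE="db1165-bin.000042"      # placeholder binlog file
LOG_POS=12345                     # placeholder binlog position

SQL="CHANGE MASTER TO
  MASTER_HOST='${NEW_MASTER}',
  MASTER_PORT=3306,
  MASTER_LOG_FILE='${LOG_FILE}',
  MASTER_LOG_POS=${LOG_POS};"

# Print the statement that would be pasted into the mysql.py session:
echo "$SQL"
```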

This works for me, small notes:

  • you might also want to silence the clouddb* hosts, as those will also alert on IRC.
  • After the stop slave, you might also want to run reset slave all
  • Once db1155:s6 has caught up enable GTID: STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;
  • Start replication on db1085
  • I would leave db1085 depooled (but still in dbctl) and, if everything goes well after a few days, decommission it!
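Put together, the follow-up amendments above could be scripted roughly as below. The GTID SQL is quoted from the comment; wrapping it in `mysql.py` invocations (printed here rather than executed) is an assumption.

```shell
#!/bin/sh
# Sketch of the post-switchover follow-ups from the notes above.
GTID_SQL="STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;"

# Once db1155:s6 has caught up, re-enable GTID there:
echo "sudo mysql.py -h db1155:s6 -e \"$GTID_SQL\""

# Restart replication on the old sanitarium master:
echo "sudo mysql.py -h db1085 -e 'START SLAVE;'"
# Leave db1085 depooled but still in dbctl; decommission after a few quiet days.
```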

Thanks!

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:34:36Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Replace db1085 with db1165 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:34:45Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Replace db1085 with db1165 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:35:38Z] <kormat@cumin1001> dbctl commit (dc=all): 'Depooling for sanitarium master switch T280751', diff saved to https://phabricator.wikimedia.org/P15714 and previous config saved to /var/cache/conftool/dbconfig/20210504-123537-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:46:48Z] <kormat@cumin1001> dbctl commit (dc=all): 'Repooling after sanitarium master switch T280751', diff saved to https://phabricator.wikimedia.org/P15715 and previous config saved to /var/cache/conftool/dbconfig/20210504-124647-kormat.json

Change 684895 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] mariadb: Remove obsolete comment.

https://gerrit.wikimedia.org/r/684895

Change 684895 merged by Kormat:

[operations/puppet@production] mariadb: Remove obsolete comment.

https://gerrit.wikimedia.org/r/684895

Turns out the candidate master for s6/codfw (db2114) is already running buster/10.4.

Change 685440 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] install_server: Switch db2129 to buster.

https://gerrit.wikimedia.org/r/685440

Change 685440 merged by Kormat:

[operations/puppet@production] install_server: Switch db2129 to buster.

https://gerrit.wikimedia.org/r/685440

Script wmf-auto-reimage was launched by kormat on cumin2001.codfw.wmnet for hosts:

['db2129.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105051315_kormat_6374.log.

Mentioned in SAL (#wikimedia-operations) [2021-05-05T13:19:49Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Reimage db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T13:19:57Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Reimage db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T14:28:47Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Reimage db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T14:28:56Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Reimage db2129 T280751

Completed auto-reimage of hosts:

['db2129.codfw.wmnet']

and were ALL successful.

Change 681621 merged by Jcrespo:

[operations/puppet@production] dbbackups: Switchover s6 codfw database backups from db2097 to db2141

https://gerrit.wikimedia.org/r/681621

Mentioned in SAL (#wikimedia-operations) [2021-05-05T15:10:01Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: Table check on db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T15:10:08Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: Table check on db2129 T280751

Change 685494 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switchover backup generation for s6 on eqiad from db1139 to db1140

https://gerrit.wikimedia.org/r/685494

^I've prepared the backup failover for eqiad :-)

mysqlcheck --all-databases completed successfully on db2129.

Change 685726 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: remove db2097 s6 section for this codfw backup source

https://gerrit.wikimedia.org/r/685726

Change 685753 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] install_server: switch db1173 to buster

https://gerrit.wikimedia.org/r/685753

Change 685754 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1173: Disable notifications

https://gerrit.wikimedia.org/r/685754

Change 685753 merged by Kormat:

[operations/puppet@production] install_server: switch db1173 to buster

https://gerrit.wikimedia.org/r/685753

Change 685754 merged by Kormat:

[operations/puppet@production] db1173: Disable notifications

https://gerrit.wikimedia.org/r/685754

Mentioned in SAL (#wikimedia-operations) [2021-05-06T11:12:56Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 depooling: Reimage to buster T280751', diff saved to https://phabricator.wikimedia.org/P15824 and previous config saved to /var/cache/conftool/dbconfig/20210506-111256-kormat.json

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['db1173.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105061118_kormat_25433.log.

Completed auto-reimage of hosts:

['db1173.eqiad.wmnet']

and were ALL successful.

db1173 (candidate master in eqiad) reimaged to buster, mysqlcheck --all-databases running now.

Mentioned in SAL (#wikimedia-operations) [2021-05-07T11:33:56Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15853 and previous config saved to /var/cache/conftool/dbconfig/20210507-113355-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-07T11:49:01Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15854 and previous config saved to /var/cache/conftool/dbconfig/20210507-114859-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-07T12:04:04Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15855 and previous config saved to /var/cache/conftool/dbconfig/20210507-120404-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-07T12:19:08Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15856 and previous config saved to /var/cache/conftool/dbconfig/20210507-121908-kormat.json
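The four SAL entries above follow a staged-repool pattern: bring the reimaged replica back in 25% increments, pausing between steps. A minimal sketch of that loop is below; the `dbctl` invocations are printed rather than run, and the ~15-minute pause is inferred from the log timestamps, not stated in the task.

```shell
#!/bin/sh
# Sketch of the staged repool pattern from the SAL entries above.
# Commands are echoed, not executed; pause length is an assumption.
HOST="db1173"
steps=""
for pct in 25 50 75 100; do
  steps="$steps $pct"
  echo "sudo dbctl instance ${HOST} pool -p ${pct}"
  echo "sudo dbctl config commit -m '${HOST} (re)pooling @ ${pct}%: reimaged to buster T280751'"
  # sleep 900  # wait ~15 minutes and watch replication lag before the next step
done
```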

Change 685494 merged by Jcrespo:

[operations/puppet@production] dbbackups: Switchover backup generation for s6 on eqiad from db1139 to db1140

https://gerrit.wikimedia.org/r/685494

Kormat updated the task description.

Change 681622 abandoned by Jcrespo:

[operations/puppet@production] dbbackups: Switchover s6 eqiad database backups from db1139 to db1140

Reason:

duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/685494

https://gerrit.wikimedia.org/r/681622

Change 692314 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] install_server: Switch db1131 to buster.

https://gerrit.wikimedia.org/r/692314

Change 692315 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1131: Disable notifications.

https://gerrit.wikimedia.org/r/692315

Change 692314 merged by Kormat:

[operations/puppet@production] install_server: Switch db1131 to buster.

https://gerrit.wikimedia.org/r/692314

Change 692315 merged by Kormat:

[operations/puppet@production] db1131: Disable notifications.

https://gerrit.wikimedia.org/r/692315

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['db1131.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105171309_kormat_32150.log.

Completed auto-reimage of hosts:

['db1131.eqiad.wmnet']

and were ALL successful.

db1131 (old s6 eqiad primary instance) has been reimaged to buster. It's now running mariadb-check -A.
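For reference, the post-reimage check used throughout this task is an all-databases consistency pass; `mariadb-check -A` is the MariaDB-named equivalent of the `mysqlcheck --all-databases` run mentioned for db2129 and db1173 earlier. A minimal sketch (printed rather than executed here):

```shell
#!/bin/sh
# Post-reimage table check, as run on the reimaged hosts in this task.
# mariadb-check -A is equivalent to mysqlcheck --all-databases.
CHECK_CMD="mariadb-check -A"
echo "sudo $CHECK_CMD"
```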

Change 692341 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Remove s6 stretch backup source instance on eqiad

https://gerrit.wikimedia.org/r/692341

Change 685726 merged by Jcrespo:

[operations/puppet@production] dbbackups: remove db2097 s6 section for this codfw backup source

https://gerrit.wikimedia.org/r/685726

db2097:s6 removed, only db1139:s6 left (to do next week).

Mentioned in SAL (#wikimedia-operations) [2021-05-18T09:01:57Z] <kormat@cumin1001> dbctl commit (dc=all): 'Remove s6 eqiad primary from 'api' group T280751', diff saved to https://phabricator.wikimedia.org/P16043 and previous config saved to /var/cache/conftool/dbconfig/20210518-090156-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-18T09:02:16Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1131 (re)pooling @ 25%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P16045 and previous config saved to /var/cache/conftool/dbconfig/20210518-090215-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-18T09:04:50Z] <kormat@cumin1001> dbctl commit (dc=all): 'Set db1131 to weight 400 in s6/eqiad T280751', diff saved to https://phabricator.wikimedia.org/P16046 and previous config saved to /var/cache/conftool/dbconfig/20210518-090449-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-18T09:17:26Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1131 (re)pooling @ 50%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P16049 and previous config saved to /var/cache/conftool/dbconfig/20210518-091725-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-18T09:32:29Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1131 (re)pooling @ 75%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P16050 and previous config saved to /var/cache/conftool/dbconfig/20210518-093228-kormat.json

db1131 completed the databases check successfully, and is now being repooled.

Mentioned in SAL (#wikimedia-operations) [2021-05-18T09:47:33Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1131 (re)pooling @ 100%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P16053 and previous config saved to /var/cache/conftool/dbconfig/20210518-094732-kormat.json

Change 692341 merged by Jcrespo:

[operations/puppet@production] dbbackups: Remove s6 stretch backup source instance on eqiad

https://gerrit.wikimedia.org/r/692341

AFAICS, all of s6 is now on buster/10.4:

Screenshot from 2021-05-26 10-38-40.png (587 KB)