
Upgrade s6 to Debian Buster and MariaDB 10.4
Open, Medium, Public

Description

s6 should be the first section to be upgraded entirely to 10.4 and Buster.

Steps to upgrade:

Please read the procedure documentation for more details.

Related Objects

Event Timeline

Marostegui triaged this task as Medium priority. Wed, Apr 21, 5:29 AM
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.

@jcrespo this is the next section we are going to fully switch on both DCs. Following the procedure written on the doc.

I have db1140:s6 and db2141:s6, already on Buster, ready to substitute the Stretch instances db1139:s6 and db2097:s6, respectively, whenever you are ready. Will prepare a patch.

Thanks - Stevie will be driving this task, so let's coordinate with her.

Change 681621 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switchover s6 codfw database backups from db2097 to db2141

https://gerrit.wikimedia.org/r/681621

Change 681622 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switchover s6 codfw database backups from db2097 to db2141

https://gerrit.wikimedia.org/r/681622

db1165 is ready to take over db1085 in s6 as sanitarium master.
They both need to be stopped at the same time and move db1155 (sanitarium for s2) under it. db1125 (old sanitarium) doesn't need to be moved, as it will be repurposed somewhere else (T258361)

Kormat updated the task description.

Steps to swap db1085 -> db1165:

  • Silence db1085, db1165 & db1155, and wikireplicas:
sudo cookbook sre.hosts.downtime --minutes 60 -r "Replace db1085 with db1165 T280751"  'A:db-clouddb or A:db-labsdb or P{db1085* or db1155* or db1165*}'
  • Depool instances:
sudo dbctl instance db1085 depool
sudo dbctl instance db1165 depool
sudo dbctl config commit -m "Depooling for sanitarium master switch T280751"
  • db-stop-in-sync db1085 db1165, and wait for db1155:s6 to catch up.
  • Switch db1155 over:
master-pos db1165.eqiad.wmnet
sudo mysql.py -h db1155:s6
> stop slave;
> reset slave all;
> change master to ...;
> start slave;
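The change master to ...; line in the steps above is deliberately left elided, since the real coordinates come from the master-pos output at switch time. As a hypothetical sketch only, the full statement could be assembled like this (host, binlog file, and position below are placeholders, not real values from this task):

```shell
# Sketch: build the CHANGE MASTER TO statement for db1155:s6 from the
# coordinates of the new sanitarium master. All values are placeholders.
new_master="db1165.eqiad.wmnet"
master_log_file="db1165-bin.000042"   # placeholder binlog file
master_log_pos=1234                   # placeholder binlog position

change_master_sql() {
  # Emit the SQL statement for the given host, file, and position.
  printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" \
    "$1" "$2" "$3"
}

change_master_sql "$new_master" "$master_log_file" "$master_log_pos"
```

The emitted statement would then be run inside the mysql.py session on db1155:s6, between the stop slave and start slave steps.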

Steps to swap db1085 -> db1165:

  • Silence db1085, db1165 & db1155.
  • Depool instances:
sudo dbctl instance db1085 depool
sudo dbctl instance db1165 depool
sudo dbctl config commit -m "Depooling for sanitarium master switch T280751"
  • db-stop-in-sync db1085 db1165, and wait for db1155:s6 to catch up.
  • Switch db1155 over:
master-pos db1165.eqiad.wmnet
sudo mysql.py -h db1155:s6
> stop slave;
> change master to ...;
> start slave;
  • Start replication again on db1165: sudo mysql.py -h db1165 -e 'start slave'
  • Wait for db1165 to catch up, then repool it.
  • Merge the puppet commit with the hiera changes
  • Remove db1085 from dbctl?

This works for me, small notes:

  • You might also want to silence the clouddb* hosts, as those will also alert on IRC.
  • After the stop slave you might also want to run reset slave all
  • Once db1155:s6 has caught up enable GTID: STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;
  • Start replication on db1085
  • I would leave db1085 depooled (but still in dbctl) and, if everything goes well after a few days, decommission it!
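The GTID note above can be collapsed into a single mysql.py invocation. A dry-run sketch (the command line is only printed here, not executed against db1155):

```shell
# Sketch: re-enable GTID replication on db1155:s6 once it has caught up,
# per the note above. Printed rather than executed; run on a cumin host
# in production.
gtid_enable_sql="STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;"
echo "sudo mysql.py -h db1155:s6 -e \"$gtid_enable_sql\""
```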

Thanks!

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:34:36Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 15 hosts with reason: Replace db1085 with db1165 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:34:45Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Replace db1085 with db1165 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:35:38Z] <kormat@cumin1001> dbctl commit (dc=all): 'Depooling for sanitarium master switch T280751', diff saved to https://phabricator.wikimedia.org/P15714 and previous config saved to /var/cache/conftool/dbconfig/20210504-123537-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-04T12:46:48Z] <kormat@cumin1001> dbctl commit (dc=all): 'Repooling after sanitarium master switch T280751', diff saved to https://phabricator.wikimedia.org/P15715 and previous config saved to /var/cache/conftool/dbconfig/20210504-124647-kormat.json

Change 684895 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] mariadb: Remove obsolete comment.

https://gerrit.wikimedia.org/r/684895

Change 684895 merged by Kormat:

[operations/puppet@production] mariadb: Remove obsolete comment.

https://gerrit.wikimedia.org/r/684895

Turns out the candidate master for s6/codfw (db2114) is already running buster/10.4.

Change 685440 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] install_server: Switch db2129 to buster.

https://gerrit.wikimedia.org/r/685440

Change 685440 merged by Kormat:

[operations/puppet@production] install_server: Switch db2129 to buster.

https://gerrit.wikimedia.org/r/685440

Script wmf-auto-reimage was launched by kormat on cumin2001.codfw.wmnet for hosts:

['db2129.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105051315_kormat_6374.log.

Mentioned in SAL (#wikimedia-operations) [2021-05-05T13:19:49Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Reimage db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T13:19:57Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Reimage db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T14:28:47Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Reimage db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T14:28:56Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Reimage db2129 T280751

Completed auto-reimage of hosts:

['db2129.codfw.wmnet']

and were ALL successful.

Change 681621 merged by Jcrespo:

[operations/puppet@production] dbbackups: Switchover s6 codfw database backups from db2097 to db2141

https://gerrit.wikimedia.org/r/681621

Mentioned in SAL (#wikimedia-operations) [2021-05-05T15:10:01Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 10 hosts with reason: Table check on db2129 T280751

Mentioned in SAL (#wikimedia-operations) [2021-05-05T15:10:08Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 10 hosts with reason: Table check on db2129 T280751

Change 685494 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switchover backup generation for s6 on eqiad from db1139 to db1140

https://gerrit.wikimedia.org/r/685494

^I've prepared the backup failover for eqiad :-)

mysqlcheck --all-databases completed successfully on db2129.
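For reference, the post-reimage table check run on db2129 above looks like this. The socket path is an assumption, and the command line is only assembled and printed here, since running it for real requires access to the host:

```shell
# Sketch: integrity check of all tables on a freshly reimaged replica,
# as done for db2129. Socket path is an assumed value.
socket="/run/mysqld/mysqld.sock"   # assumed path
cmd="mysqlcheck --all-databases --socket=${socket}"
echo "$cmd"   # on the host: sudo $cmd (can take hours on a large replica)
```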

Change 685726 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: remove db2097 s6 section for this codfw backup source

https://gerrit.wikimedia.org/r/685726

Change 685753 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] install_server: switch db1173 to buster

https://gerrit.wikimedia.org/r/685753

Change 685754 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] db1173: Disable notifications

https://gerrit.wikimedia.org/r/685754

Change 685753 merged by Kormat:

[operations/puppet@production] install_server: switch db1173 to buster

https://gerrit.wikimedia.org/r/685753

Change 685754 merged by Kormat:

[operations/puppet@production] db1173: Disable notifications

https://gerrit.wikimedia.org/r/685754

Mentioned in SAL (#wikimedia-operations) [2021-05-06T11:12:56Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 depooling: Reimage to buster T280751', diff saved to https://phabricator.wikimedia.org/P15824 and previous config saved to /var/cache/conftool/dbconfig/20210506-111256-kormat.json

Script wmf-auto-reimage was launched by kormat on cumin1001.eqiad.wmnet for hosts:

['db1173.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105061118_kormat_25433.log.

Completed auto-reimage of hosts:

['db1173.eqiad.wmnet']

and were ALL successful.

db1173 (candidate master in eqiad) reimaged to buster, mysqlcheck --all-databases running now.

Mentioned in SAL (#wikimedia-operations) [2021-05-07T11:33:56Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15853 and previous config saved to /var/cache/conftool/dbconfig/20210507-113355-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-07T11:49:01Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15854 and previous config saved to /var/cache/conftool/dbconfig/20210507-114859-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-07T12:04:04Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15855 and previous config saved to /var/cache/conftool/dbconfig/20210507-120404-kormat.json

Mentioned in SAL (#wikimedia-operations) [2021-05-07T12:19:08Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15856 and previous config saved to /var/cache/conftool/dbconfig/20210507-121908-kormat.json
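The SAL entries above show db1173 being repooled gradually at 25%, 50%, 75%, and 100%. A hypothetical sketch of that ramp-up; the dbctl subcommand for setting a pooling percentage is not shown in this task, so the loop only prints the intended commit messages:

```shell
# Sketch: gradual repool of a reimaged replica, stepping the pooled
# percentage in four increments as seen in the SAL log. Commands are
# printed, not executed.
host="db1173"
for pct in 25 50 75 100; do
  echo "dbctl commit -m '${host} (re)pooling @ ${pct}%: reimaged to buster T280751'"
  # in production: wait ~15 minutes between steps and watch replication lag
done
```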