We have finished our testing of Debian Bullseye and everything has been fine (T295965). dbstore100* hosts can be migrated to Bullseye (mariadb version isn't changing and we are keeping 10.4).
- dbstore1003
- dbstore1005
- dbstore1007
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T291916 Tracking task for Bullseye migrations in production
Resolved | | Marostegui | T298585 Upgrade WMF database-and-backup-related hosts to bullseye
Resolved | | razzi | T299481 Upgrade dbstore100* hosts to Bullseye
@odimitrijevic it is really up to you all. It only requires stopping all the mariadb instances, doing the reimage and then starting them back up.
I can provide the detailed commands if you need them, so you can proceed with the reimage on whatever day/time is most convenient for you.
I can get started on this one. Here's my plan; if it looks good we can announce downtime. I vote to do the upgrade next Tuesday, the 29th of March: I think all the reimages could be done in one day, and we'd have three days until Friday, April 1, when the next round of monthly statistics is computed. If that timeline is too short, we can wait until the week of April 4.
Here's what I'd do, using dbstore1003 as an example:
for sock in /var/run/mysqld/mysqld.s?.sock; do sudo mysql -S $sock -e 'stop slave'; done
tmux
sudo -i wmf-auto-reimage-host -p T299481 dbstore1003.eqiad.wmnet --os bullseye
ssh root@dbstore1003.mgmt.eqiad.wmnet
/admin1-> console com2
@elukey and I discussed using the reuse-parts-test.cfg recipe to confirm the /srv partition would be preserved, but netboot already uses partman/custom/reuse-db.cfg which I think will work. If it doesn't we can always restore the data from another database host.
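One way to double check which recipe a host is mapped to is to grep the netboot configuration in operations/puppet (the path below is from memory and may have moved, so treat it as an assumption):
grep dbstore modules/install_server/files/autoinstall/netboot.cfg   # assumed path; shows the partman recipe assigned to the dbstore hosts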
After the stop slave you also need to run systemctl stop mariadb@s*, as stop slave only stops replication but won't stop the daemon.
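For illustration, a sketch of that step, assuming the per-section units named above (the section list is only an example; the glob form also works for stop, because running units are loaded and therefore match):
# stop the mariadb daemon for every section on the host (example sections)
for section in s1 s2 s3; do sudo systemctl stop "mariadb@${section}"; done
# or, equivalently for units that are currently running:
sudo systemctl stop 'mariadb@s*'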
> - Run the reimage cookbook on cumin1001
> tmux
> sudo -i wmf-auto-reimage-host -p T299481 dbstore1003.eqiad.wmnet --os bullseye
> - Connect to the management console, watch the installation
> ssh root@dbstore1003.mgmt.eqiad.wmnet
> /admin1-> console com2
> - Once the installation has finished and ssh is back up and running, ssh in, start each mysql section systemd unit, and restart replication on each instance.
You'll also need to change the ownership under /srv so everything is mysql:mysql.
This would do the trick:
chown -R mysql. /srv/* ; systemctl set-environment MYSQLD_OPTS="--skip-slave-start" ; systemctl disable prometheus-mysqld-exporter.service ; systemctl reset-failed
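In case it's useful, here is the same one-liner broken out, with my reading of what each piece is for (same commands; the annotations are mine and assume the mariadb@ units pass MYSQLD_OPTS through to mysqld):
chown -R mysql. /srv/*   # the preserved data under /srv may no longer match the mysql uid/gid created by the reinstall
systemctl set-environment MYSQLD_OPTS="--skip-slave-start"   # keep replication stopped when the units come up, until mysql_upgrade has run
systemctl disable prometheus-mysqld-exporter.service   # on a multi-instance host this instance-less unit never starts, so it is disabled to avoid an Icinga alert
systemctl reset-failed   # clear any failed unit state left over from the reimage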
Once mysql is started on all sections, please run mysql_upgrade -S /run/mysqld/mysqld.sX.sock (X being the section), just in case.
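A sketch of a loop that would cover all the sections in one go (using a wildcard so that sections that don't follow the s<number> pattern, such as x1 or staging, are included too):
for sock in /run/mysqld/mysqld.*.sock; do sudo mysql_upgrade -S "$sock"; done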
> @elukey and I discussed using the reuse-parts-test.cfg recipe to confirm the /srv partition would be preserved, but netboot already uses partman/custom/reuse-db.cfg which I think will work. If it doesn't we can always restore the data from another database host.
It should work, but we should reimage only one host first to double check it is indeed working before going for the rest. Restoring the data from another database host is a very painful operation, so we should avoid it at all costs if we can.
FWIW I wrote this script (P23031) that did more than 100 bullseye upgrades in production. It works on basically any db except codfw masters or hosts that are both multi-instance and have replicas (i.e. it works on dbs that are multi-instance but don't have replicas, or that have replicas but are not multi-instance).
You run it, it shuts down mysql, etc., and then gives you the cookbook command to run; you copy-paste and run the cookbook, and then re-run the script with --after, which handles bringing mysql and the rest back up. The code is based on auto_schema so I'm not sure if you want to run this code per se, but I assume looking at it would give you some ideas on how this needs to be done. Later I will migrate most of this to a cookbook.
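Roughly, the flow described above (the script name and invocation below are made up for illustration; only the --after flag and the copy-pasted cookbook step come from the description):
./bullseye_upgrade dbstore1003   # hypothetical invocation: stops replication and mysql, then prints the reimage cookbook command
sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1003   # copy-paste and run the printed command
./bullseye_upgrade dbstore1003 --after   # hypothetical: brings mysql and replication back up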
Ok, thanks for chiming in @Marostegui and @Ladsgroup. I'm planning to kick this off a week from today, on April 5 at 15:00 UTC.
Here's my updated plan, with the steps you suggested added. For each host:
for sock in /var/run/mysqld/mysqld.s?.sock; do sudo mysql -S $sock -e 'stop slave'; done
systemctl stop 'mariadb@s*'
tmux
sudo -i wmf-auto-reimage-host -p T299481 dbstore1003.eqiad.wmnet --os bullseye
ssh root@dbstore1003.mgmt.eqiad.wmnet
/admin1-> console com2
Once the installation has finished and ssh is back up and running:
chown -R mysql. /srv/* ; systemctl set-environment MYSQLD_OPTS="--skip-slave-start" ; systemctl disable prometheus-mysqld-exporter.service ; systemctl reset-failed
systemctl start 'mariadb@s*'
for sock in /var/run/mysqld/mysqld.s?.sock; do mysql_upgrade -S $sock; done
for sock in /var/run/mysqld/mysqld.s?.sock; do sudo mysql -S $sock -e 'start slave'; done
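As an optional sanity check after the last step (not part of the plan above, just a sketch using the same socket glob), replication health could be verified on each instance:
for sock in /var/run/mysqld/mysqld.s?.sock; do echo "== $sock =="; sudo mysql -S $sock -e 'show slave status\G' | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'; done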
Let me know what you think; I think we're ready but we can always postpone.
> - start mysql service
> systemctl start 'mariadb@s*'
I don't think this will work, you'll probably need to start each service.
systemctl start mariadb@s1 etc
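(This is likely because systemctl only expands glob patterns against units already loaded in memory, and the stopped mariadb@ instances aren't loaded.) A sketch of starting them one by one; the section list below is only an example, use whatever sections exist on the host:
for section in s1 s2 s3; do sudo systemctl start "mariadb@${section}"; done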
Mentioned in SAL (#wikimedia-analytics) [2022-04-05T15:02:05Z] <razzi> set dbstore1003.eqiad.wmnet to downtime for upgrade T299481
Mentioned in SAL (#wikimedia-analytics) [2022-04-05T15:10:11Z] <razzi> razzi@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1003
Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host dbstore1003.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host dbstore1003.eqiad.wmnet with OS bullseye completed:
Looks like the reimage went fine; the Icinga warning is just that replication has not caught up yet, but I see Seconds_Behind_Master decreasing over time.
Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host dbstore1005.eqiad.wmnet with OS bullseye
Mentioned in SAL (#wikimedia-analytics) [2022-04-05T15:54:06Z] <razzi> razzi@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1005
Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host dbstore1005.eqiad.wmnet with OS bullseye completed:
Icinga downtime and Alertmanager silence (ID=27c2b587-9114-435a-8894-b5c96a8ee85b) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade dbstore1007 to bullseye
dbstore1007.eqiad.wmnet
Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host dbstore1007.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host dbstore1007.eqiad.wmnet with OS bullseye completed:
All the reimages are done. Thanks for your input @Marostegui and @Ladsgroup .
You were right @Marostegui that the start command doesn't work with globs :) I ran each start separately
For future reference, here's a full log of the commands to upgrade dbstore1005.eqiad.wmnet. A couple of the commands I had were missing sudo, and since there were sections that didn't follow the s<number> pattern I had to change a couple of commands.
razzi@cumin1001:~$ sudo cookbook sre.hosts.downtime dbstore1005.eqiad.wmnet -D 1 -r 'Upgrade dbstore1005 to bullseye' -t T299481

# on dbstore1005.eqiad.wmnet
for sock in /var/run/mysqld/mysqld.*.sock; do sudo mysql -S $sock -e 'stop slave'; done
sudo systemctl stop 'mariadb@*'

razzi@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1005

# on dbstore1005.eqiad.wmnet
sudo chown -R mysql. /srv/* ; sudo systemctl set-environment MYSQLD_OPTS="--skip-slave-start" ; sudo systemctl disable prometheus-mysqld-exporter.service ; sudo systemctl reset-failed
sudo systemctl start mariadb@s6
sudo systemctl start mariadb@s8
sudo systemctl start mariadb@staging
sudo systemctl start mariadb@x1
for sock in /var/run/mysqld/mysqld.*.sock; do sudo mysql_upgrade -S $sock; done
for sock in /var/run/mysqld/mysqld.*.sock; do sudo mysql -S $sock -e 'start slave'; done
2 notes:
The check "MariaDB Replica Lag: staging: OK slave_sql_state not a slave" is expected: staging is not a replica, so this is not a problem.
That is exactly why it needs to be disabled. As these hosts are multi-instance, that unit will never get to start and thus would alert on Icinga if not disabled.
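For context, a quick way to see those units on a host would be something like this (just a sketch; whether per-instance prometheus-mysqld-exporter@<section> units exist alongside the plain one is an assumption on my part):
systemctl list-unit-files 'prometheus-mysqld-exporter*'   # shows which exporter units exist and whether they are enabled or disabled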