
Upgrade dbstore100* hosts to Bullseye
Closed, Resolved · Public

Description

We have finished our testing of Debian Bullseye and everything has been fine (T295965). dbstore100* hosts can be migrated to Bullseye (mariadb version isn't changing and we are keeping 10.4).

  • dbstore1003
  • dbstore1005
  • dbstore1007

Event Timeline

@Marostegui let's coordinate the downtime; is it the same as for our cloud host?

cc @BTullis @razzi

@odimitrijevic it is really up to you all. It only requires stopping all mariadb instances, doing the reimage, and then starting them back up.
I can provide the commands in detail if you need them, so you can proceed with the reimage on whatever day/time is most convenient for you.

I can get started on this one. Here's my plan; if it looks good we can announce downtime. I vote to do the upgrade next Tuesday, the 29th of March. I think all the reimages could be done in one day, which would leave three days until Friday, April 1, when the next round of monthly statistics is computed. If that timeline is too short, we can wait until the week of April 4.

Here's what I'd do, using dbstore1003 as an example:

  • Turn off mysql replication for every section on the host
for sock in /var/run/mysqld/mysqld.s?.sock; do
  sudo mysql -S $sock -e 'stop slave'
done
  • Run the reimage cookbook on cumin1001
tmux
sudo -i wmf-auto-reimage-host -p T299481 dbstore1003.eqiad.wmnet --os bullseye
  • Connect to the management console, watch the installation
ssh root@dbstore1003.mgmt.eqiad.wmnet
/admin1-> console com2
  • Once the installation has finished and ssh is back up and running, ssh in, start each mysql section systemd unit, and restart replication on each instance.

@elukey and I discussed using the reuse-parts-test.cfg recipe to confirm the /srv partition would be preserved, but netboot already uses partman/custom/reuse-db.cfg which I think will work. If it doesn't we can always restore the data from another database host.
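A quick way to confirm the /srv partition really was preserved would be a look right after first boot (a minimal sketch; the per-section data directories under /srv are assumed from how these multi-instance hosts are laid out):

# Check the data partition is still mounted and its contents survived the reimage.
lsblk
df -h /srv
sudo ls -la /srv/    # the per-section data directories should still be present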


After the stop slave you also need to run systemctl stop mariadb@s* as stop slave will only stop replication but won't stop the daemon.
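Putting that together, the pre-reimage sequence would be roughly (a sketch; the mariadb@<section> unit naming is taken from this task, and the wider mysqld.*.sock glob is used so sections like staging and x1 are included too):

# Stop replication on every instance, then stop the daemons themselves;
# 'stop slave' alone leaves mysqld running.
for sock in /var/run/mysqld/mysqld.*.sock; do
  sudo mysql -S $sock -e 'stop slave'
done
sudo systemctl stop 'mariadb@*'
# Nothing should be listed as active before kicking off the reimage.
systemctl list-units 'mariadb@*'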


You'll also need to change the ownership of everything under /srv to mysql:mysql.
This would do the trick:

chown -R mysql. /srv/* ; systemctl set-environment MYSQLD_OPTS="--skip-slave-start" ; systemctl disable prometheus-mysqld-exporter.service ; systemctl reset-failed

Once mysql is started on all sections, please run mysql_upgrade -S /run/mysqld/mysqld.sX.sock (X being the section), just in case.


The reuse-db.cfg recipe should work, but we should reimage only one host first to double-check that it does before going for the rest. Restoring the data from another database host is a very painful operation, so we should avoid it at all costs if we can.

FWIW I wrote this script (P23031) that has done more than 100 bullseye upgrades in production. It works on basically any db except codfw masters or hosts that are both multiinstance and have replicas (i.e. it works on dbs that are multiinstance but don't have replicas, or that have replicas but are not multiinstance).

You run it, it shuts down mysql, etc., and then it gives you the cookbook command to run; you copy-paste and run the cookbook, then re-run the script with --after, which handles bringing mysql back and the rest. The code is based on auto_schema, so I'm not sure you want to run it per se, but I assume looking at it would give you some ideas on how this needs to be done. Later I will migrate most of this to a cookbook.

Ok, thanks for chiming in @Marostegui and @Ladsgroup. I'm planning to kick this off a week from today, on April 5 at 15:00 UTC.

Here's my updated plan, with the steps you suggested marked (new). For each host:

  • Turn off mysql replication for every section on the host
for sock in /var/run/mysqld/mysqld.s?.sock; do
  sudo mysql -S $sock -e 'stop slave'
done
  • (new) Stop mysql services
systemctl stop 'mariadb@s*'
  • Run the reimage cookbook on cumin1001
tmux
sudo -i wmf-auto-reimage-host -p T299481 dbstore1003.eqiad.wmnet --os bullseye
  • Connect to the management console, watch the installation

ssh root@dbstore1003.mgmt.eqiad.wmnet
/admin1-> console com2


Once the installation has finished and ssh is back up and running:

  • (new) change permissions for mariadb
chown -R mysql. /srv/* ; systemctl set-environment MYSQLD_OPTS="--skip-slave-start" ; systemctl disable prometheus-mysqld-exporter.service ; systemctl reset-failed
  • start mysql service
systemctl start 'mariadb@s*'
  • (new) run mysql_upgrade just in case
for sock in /var/run/mysqld/mysqld.s?.sock; do
  mysql_upgrade -S $sock
done
  • re-enable replication
for sock in /var/run/mysqld/mysqld.s?.sock; do
  sudo mysql -S $sock -e 'start slave'
done

Let me know what you think; I think we're ready but we can always postpone.

  • start mysql service
systemctl start 'mariadb@s*'

I don't think this will work; you'll probably need to start each service individually:
systemctl start mariadb@s1 etc.
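A small loop avoids typing each unit by hand (a sketch; systemctl start does not expand globs for units that are not loaded yet, and the section names below are the ones dbstore1005 turned out to have per the log further down, so they will differ per host):

# Section names differ per host; these are dbstore1005's.
for section in s6 s8 staging x1; do
  sudo systemctl start "mariadb@${section}"
done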

Mentioned in SAL (#wikimedia-analytics) [2022-04-05T15:02:05Z] <razzi> set dbstore1003.eqiad.wmnet to downtime for upgrade T299481

Mentioned in SAL (#wikimedia-analytics) [2022-04-05T15:10:11Z] <razzi> razzi@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1003

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host dbstore1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host dbstore1003.eqiad.wmnet with OS bullseye completed:

  • dbstore1003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204051510_razzi_3407779_dbstore1003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Looks like the reimage went fine; the Icinga status warning is because replication has not caught up yet, but I see Seconds_Behind_Master decreasing over time.
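For the record, the catch-up can be watched per instance with something like this (a sketch using the standard show slave status output and the socket paths used elsewhere in this task):

# Seconds_Behind_Master should trend towards 0 on every section.
for sock in /var/run/mysqld/mysqld.*.sock; do
  echo "== $sock =="
  sudo mysql -S $sock -e 'show slave status\G' | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
done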

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host dbstore1005.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-analytics) [2022-04-05T15:54:06Z] <razzi> razzi@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1005

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host dbstore1005.eqiad.wmnet with OS bullseye completed:

  • dbstore1005 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204051552_razzi_3454861_dbstore1005.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=27c2b587-9114-435a-8894-b5c96a8ee85b) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade dbstore1007 to bullseye

dbstore1007.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host dbstore1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host dbstore1007.eqiad.wmnet with OS bullseye completed:

  • dbstore1007 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204051636_razzi_3489480_dbstore1007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

All the reimages are done. Thanks for your input @Marostegui and @Ladsgroup .

You were right @Marostegui that the start command doesn't work with globs :) I ran each start separately

For future reference, here's a full log of the commands used to upgrade dbstore1005.eqiad.wmnet. A couple of the commands I had were missing sudo, and since there were sections that didn't follow the s<number> pattern, I had to change a couple of the globs.

razzi@cumin1001:~$ sudo cookbook sre.hosts.downtime dbstore1005.eqiad.wmnet -D 1 -r 'Upgrade dbstore1005 to bullseye' -t T299481

# on dbstore1005.eqiad.wmnet
for sock in /var/run/mysqld/mysqld.*.sock; do
  sudo mysql -S $sock -e 'stop slave'
done

sudo systemctl stop 'mariadb@*'

razzi@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1005

# on dbstore1005.eqiad.wmnet
sudo chown -R mysql. /srv/* ; sudo systemctl set-environment MYSQLD_OPTS="--skip-slave-start" ; sudo systemctl disable prometheus-mysqld-exporter.service ; sudo systemctl reset-failed

sudo systemctl start mariadb@s6
sudo systemctl start mariadb@s8
sudo systemctl start mariadb@staging
sudo systemctl start mariadb@x1

for sock in /var/run/mysqld/mysqld.*.sock; do
  sudo mysql_upgrade -S $sock
done

for sock in /var/run/mysqld/mysqld.*.sock; do
  sudo mysql -S $sock -e 'start slave'
done

2 notes:

  • staging gave an error when I tried to start replication, but on Icinga I see MariaDB Replica Lag: staging: OK slave_sql_state not a slave, so this is not a problem (a quick check for this is sketched after these notes)

  • since these databases were multiinstance, I'm not sure that the sudo systemctl disable prometheus-mysqld-exporter.service was appropriate, but following the reimage all the prometheus-mysqld-exporter@s6.service etc. units are currently running, so it looks like everything is in good shape.
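The quick check mentioned in the first note would be something like this (a sketch; the staging socket name is assumed from the per-section naming used on these hosts):

# An empty result from 'show slave status' means the instance has no replication
# configured, so 'start slave' failing there is expected.
sudo mysql -S /run/mysqld/mysqld.staging.sock -e 'show slave status\G'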

That is exactly why it needs to be disabled: as they are multiinstance, that unit will never get to start and would therefore alert on Icinga if not disabled.
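To double-check that the per-section exporters are the ones running and the bare unit stays disabled, something like this should do (a sketch using standard systemctl queries):

# The templated per-instance units should be active on a multiinstance host;
# the plain prometheus-mysqld-exporter.service should report 'disabled'.
systemctl list-units 'prometheus-mysqld-exporter@*'
systemctl is-enabled prometheus-mysqld-exporter.service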