
Upgrade dbstore100* hosts to Bullseye
Closed, Resolved · Public

Description

We have finished our testing of Debian Bullseye and everything has been fine (T295965). dbstore100* hosts can be migrated to Bullseye (mariadb version isn't changing and we are keeping 10.4).

  • dbstore1003
  • dbstore1005
  • dbstore1007

Event Timeline

@Marostegui let's coordinate the downtime; is it the same as for our cloud host?

cc @BTullis @razzi

@odimitrijevic it is really up to you all. It only requires stopping all mariadb instances, doing the reimage, and then starting them back up.
I can provide the commands in detail if you need them, so you can proceed with the reimage on whatever day/time is most convenient for you.

I can get started on this one. Here's my plan; if it looks good we can announce downtime. I vote to do the upgrade next Tuesday, the 29th of March. I think all the reimages could be done in one day, which would leave three days until Friday, April 1, when the next round of monthly statistics is computed. If that timeline is too short, we can wait until the week of April 4.

Here's what I'd do, using dbstore1003 as an example:

  • Turn off mysql replication for every section on the host
for sock in /var/run/mysqld/mysqld.s?.sock; do
  sudo mysql -S $sock -e 'stop slave'
done
  • Run the reimage cookbook on cumin1001
tmux
sudo -i wmf-auto-reimage-host -p T299481 dbstore1003.eqiad.wmnet --os bullseye
  • Connect to the management console, watch the installation
ssh root@dbstore1003.mgmt.eqiad.wmnet
/admin1-> console com2
  • Once the installation has finished and ssh is back up and running, ssh in, start each mysql section systemd unit, and restart replication on each instance.

@elukey and I discussed using the reuse-parts-test.cfg recipe to confirm the /srv partition would be preserved, but netboot already uses partman/custom/reuse-db.cfg which I think will work. If it doesn't we can always restore the data from another database host.
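A quick way to confirm the /srv partition really was preserved would be a look right after first boot (a minimal sketch; the per-section data directories under /srv are assumed from how these multi-instance hosts are laid out):

# Check the data partition is still mounted and its contents survived the reimage.
lsblk
df -h /srv
sudo ls -la /srv/    # the per-section data directories should still be present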


After the stop slave you also need to run systemctl stop mariadb@s* as stop slave will only stop replication but won't stop the daemon.
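Putting that together, the pre-reimage sequence would be roughly (a sketch; the mariadb@<section> unit naming is taken from this task, and the wider mysqld.*.sock glob is used so sections like staging and x1 are included too):

# Stop replication on every instance, then stop the daemons themselves;
# 'stop slave' alone leaves mysqld running.
for sock in /var/run/mysqld/mysqld.*.sock; do
  sudo mysql -S $sock -e 'stop slave'
done
sudo systemctl stop 'mariadb@*'
# Nothing should be listed as active before kicking off the reimage.
systemctl list-units 'mariadb@*'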


You'll also need to change the ownership of everything under /srv to mysql:mysql.
This would do the trick:

chown -R mysql. /srv/* ; systemctl set-environment MYSQLD_OPTS="--skip-slave-start" ; systemctl disable prometheus-mysqld-exporter.service ; systemctl reset-failed

Once mysql is started on all sections, please run mysql_upgrade -S /run/mysqld/mysqld.sX.sock (X being the section), just in case.


The reuse-db.cfg recipe should work, but we should reimage only one host first to double-check that it does before going for the rest. Restoring the data from another database host is a very painful operation, so we should avoid it at all costs if we can.

FWIW I wrote this script (P23031) that has done more than 100 bullseye upgrades in production. It works on basically any db except codfw masters or hosts that are both multiinstance and have replicas (i.e. it works on dbs that are multiinstance but don't have replicas, or that have replicas but are not multiinstance).

You run it, it shuts down mysql, etc., and then it gives you the cookbook command to run; you copy-paste and run the cookbook, then re-run the script with --after, which handles bringing mysql back and the rest. The code is based on auto_schema, so I'm not sure you want to run it per se, but I assume looking at it would give you some ideas on how this needs to be done. Later I will migrate most of this to a cookbook.

Ok, thanks for chiming in @Marostegui and @Ladsgroup. I'm planning to kick this off a week from today, on April 5 at 15:00 UTC.

Here's my updated plan, with the steps you suggested marked (new). For each host:

  • Turn off mysql replication for every section on the host
for sock in /var/run/mysqld/mysqld.s?.sock; do
  sudo mysql -S $sock -e 'stop slave'
done
  • (new) Stop mysql services
systemctl stop 'mariadb@s*'
  • Run the reimage cookbook on cumin1001
tmux
sudo -i wmf-auto-reimage-host -p T299481 dbstore1003.eqiad.wmnet --os bullseye
  • Connect to the management console, watch the installation

ssh root@dbstore1003.mgmt.eqiad.wmnet
/admin1-> console com2


Once the installation has finished and ssh is back up and running:

  • (new) change permissions for mariadb
chown -R mysql. /srv/* ; systemctl set-environment MYSQLD_OPTS="--skip-slave-start" ; systemctl disable prometheus-mysqld-exporter.service ; systemctl reset-failed
  • start mysql service
systemctl start 'mariadb@s*'
  • (new) run mysql_upgrade just in case
for sock in /var/run/mysqld/mysqld.s?.sock; do
  mysql_upgrade -S $sock
done
  • re-enable replication
for sock in /var/run/mysqld/mysqld.s?.sock; do
  sudo mysql -S $sock -e 'start slave'
done

Let me know what you think; I think we're ready but we can always postpone.

  • start mysql service
systemctl start 'mariadb@s*'

I don't think this will work; you'll probably need to start each service individually:
systemctl start mariadb@s1 etc.
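A small loop avoids typing each unit by hand (a sketch; systemctl start does not expand globs for units that are not loaded yet, and the section names below are the ones dbstore1005 turned out to have per the log further down, so they will differ per host):

# Section names differ per host; these are dbstore1005's.
for section in s6 s8 staging x1; do
  sudo systemctl start "mariadb@${section}"
done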

Mentioned in SAL (#wikimedia-analytics) [2022-04-05T15:02:05Z] <razzi> set dbstore1003.eqiad.wmnet to downtime for upgrade T299481

Mentioned in SAL (#wikimedia-analytics) [2022-04-05T15:10:11Z] <razzi> razzi@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1003

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host dbstore1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host dbstore1003.eqiad.wmnet with OS bullseye completed:

  • dbstore1003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204051510_razzi_3407779_dbstore1003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Looks like the reimage went fine; the Icinga status warning is because replication has not caught up yet, but I see Seconds_Behind_Master decreasing over time.
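For the record, the catch-up can be watched per instance with something like this (a sketch using the standard show slave status output and the socket paths used elsewhere in this task):

# Seconds_Behind_Master should trend towards 0 on every section.
for sock in /var/run/mysqld/mysqld.*.sock; do
  echo "== $sock =="
  sudo mysql -S $sock -e 'show slave status\G' | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
done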

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host dbstore1005.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-analytics) [2022-04-05T15:54:06Z] <razzi> razzi@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1005

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host dbstore1005.eqiad.wmnet with OS bullseye completed:

  • dbstore1005 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204051552_razzi_3454861_dbstore1005.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=27c2b587-9114-435a-8894-b5c96a8ee85b) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade dbstore1007 to bullseye

dbstore1007.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host dbstore1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host dbstore1007.eqiad.wmnet with OS bullseye completed:

  • dbstore1007 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204051636_razzi_3489480_dbstore1007.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

All the reimages are done. Thanks for your input @Marostegui and @Ladsgroup .

You were right @Marostegui that the start command doesn't work with globs :) I ran each start separately

For future reference, here's a full log of the commands used to upgrade dbstore1005.eqiad.wmnet. A couple of the commands I had were missing sudo, and since there were sections that didn't follow the s<number> pattern, I had to change a couple of the globs.

razzi@cumin1001:~$ sudo cookbook sre.hosts.downtime dbstore1005.eqiad.wmnet -D 1 -r 'Upgrade dbstore1005 to bullseye' -t T299481

# on dbstore1005.eqiad.wmnet
for sock in /var/run/mysqld/mysqld.*.sock; do
  sudo mysql -S $sock -e 'stop slave'
done

sudo systemctl stop 'mariadb@*'

razzi@cumin1001:~$ sudo cookbook sre.hosts.reimage --os bullseye -t T299481 dbstore1005

# on dbstore1005.eqiad.wmnet
sudo chown -R mysql. /srv/* ; sudo systemctl set-environment MYSQLD_OPTS="--skip-slave-start" ; sudo systemctl disable prometheus-mysqld-exporter.service ; sudo systemctl reset-failed

sudo systemctl start mariadb@s6
sudo systemctl start mariadb@s8
sudo systemctl start mariadb@staging
sudo systemctl start mariadb@x1

for sock in /var/run/mysqld/mysqld.*.sock; do
  sudo mysql_upgrade -S $sock
done

for sock in /var/run/mysqld/mysqld.*.sock; do
  sudo mysql -S $sock -e 'start slave'
done

2 notes:

  • staging gave an error when I tried to start replication, but on Icinga I see MariaDB Replica Lag: staging: OK slave_sql_state not a slave, so this is not a problem (a quick check for this is sketched after these notes)

  • since these databases were multiinstance, I'm not sure that the sudo systemctl disable prometheus-mysqld-exporter.service was appropriate, but following the reimage all the prometheus-mysqld-exporter@s6.service etc. units are currently running, so it looks like everything is in good shape.
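The quick check mentioned in the first note would be something like this (a sketch; the staging socket name is assumed from the per-section naming used on these hosts):

# An empty result from 'show slave status' means the instance has no replication
# configured, so 'start slave' failing there is expected.
sudo mysql -S /run/mysqld/mysqld.staging.sock -e 'show slave status\G'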

That is exactly why it needs to be disabled: as they are multiinstance, that unit will never get to start and would therefore alert on Icinga if not disabled.
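To double-check that the per-section exporters are the ones running and the bare unit stays disabled, something like this should do (a sketch using standard systemctl queries):

# The templated per-instance units should be active on a multiinstance host;
# the plain prometheus-mysqld-exporter.service should report 'disabled'.
systemctl list-units 'prometheus-mysqld-exporter@*'
systemctl is-enabled prometheus-mysqld-exporter.service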