
Upgrade clouddb* hosts to Bullseye
Closed, Resolved · Public

Description

We have finished testing Debian Bullseye and found no issues (T295965). The clouddb* hosts can now be migrated to Bullseye. The MariaDB version is not changing: we are keeping 10.4.

  • clouddb1013
  • clouddb1014
  • clouddb1015
  • clouddb1016
  • clouddb1017
  • clouddb1018
  • clouddb1019
  • clouddb1020
  • clouddb1021

Event Timeline

Restricted Application added a subscriber: Aklapper.

I'll do this next week. To my knowledge these hosts are pretty much the same as the dbstore hosts I did this week for https://phabricator.wikimedia.org/T299481, except that there can be no downtime if I depool the hosts first.

Yes, it is pretty much the same. Remember that there are two hosts per section, so you could depool half of them at a time and leave the others serving.
For example, clouddb1013 and clouddb1017 both serve s1.
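
To illustrate the "two hosts per section" point, here is a minimal Python sketch of how depool batches could be planned so that every section keeps one host pooled. Only the s1 pairing comes from this task; the other section names and pairings are hypothetical placeholders, and the sketch assumes each host serves exactly one section (which would not hold for a multi-section host).

```python
from collections import defaultdict

# Only the s1 pairing is stated in this task; the other section names
# and pairings below are hypothetical placeholders.
SECTION_HOSTS = {
    "s1": ["clouddb1013", "clouddb1017"],
    "sX": ["clouddb1014", "clouddb1018"],  # hypothetical
    "sY": ["clouddb1015", "clouddb1019"],  # hypothetical
}


def plan_batches(section_hosts):
    """Split hosts into batches so every section keeps a host pooled.

    Batch i takes the i-th host of each section, so no batch holds all
    of a section's hosts as long as each section has at least two.
    """
    batches = defaultdict(list)
    for section, hosts in sorted(section_hosts.items()):
        if len(hosts) < 2:
            raise ValueError(f"{section} has no spare host to keep serving")
        for index, host in enumerate(hosts):
            batches[index].append(host)
    return [batches[i] for i in sorted(batches)]


def still_served(section_hosts, depooled):
    """Return True if every section has at least one host not depooled."""
    return all(set(hosts) - set(depooled) for hosts in section_hosts.values())


batches = plan_batches(SECTION_HOSTS)
for batch in batches:
    print(batch, still_served(SECTION_HOSTS, batch))
```

The `still_served` check is the useful part: it can be run against any proposed depool list before a depool patch is merged.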

One more thing to keep in mind is that clouddb1021 is the analytics one, so that one has all the sections. Not sure if that requires special treatment in order not to affect any service.

Change 779483 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] clouddb: depool clouddb1013-1016 for upgrades

https://gerrit.wikimedia.org/r/779483

Change 779483 merged by Razzi:

[operations/puppet@production] clouddb: depool clouddb1013-1016 for upgrades

https://gerrit.wikimedia.org/r/779483

Icinga downtime and Alertmanager silence (ID=55df59f3-2152-4064-a9d7-eecc09c55982) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade clouddb1013 to bullseye

clouddb1013.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1013.eqiad.wmnet with OS bullseye

Change 779488 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] netboot: add clouddb partitioned as database

https://gerrit.wikimedia.org/r/779488

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1013.eqiad.wmnet with OS bullseye executed with errors:

  • clouddb1013 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

I forgot to tell netboot to treat these hosts as database hosts, which I have now done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/779488

The install stopped at the disk partitions step and prompted me for what to do, with the following default configuration:

    LVM VG tank, LV data - 9.5 TB Linux device-mapper (linear)
    >     #1     9.5 TB       xfs
    SCSI1 (2,0,0) (sda) - 9.6 TB DELL PERC H730P Adp
    >            1.0 MB       FREE SPACE
    >     #1    40.0 GB       ext4
    >     #2     8.0 GB    F  swap          swap
    >     #3     9.6 TB    K  lvm
    >            1.0 MB       FREE SPACE
    >            1.0 MB       FREE SPACE

That looks right (it keeps the large LVM partition), but I think it will be easier to merge the netboot patch, redo the reimage, and let it proceed automatically.
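
As a rough illustration of why the missing netboot entry mattered, the Python sketch below models a hostname-pattern lookup for a partitioning recipe. It is not the real netboot configuration, whose format is not shown in this task: the table and the function name are hypothetical. The recipe name reuse-db.cfg is taken from a later patch title in this task.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern table; the real netboot configuration is not
# shown in this task. "reuse-db.cfg" is the recipe named in a later
# patch title here.
RECIPES = [
    ("clouddb10*", "reuse-db.cfg"),
]


def recipe_for(hostname):
    """Return the recipe for the first matching glob pattern, or None.

    None stands for "no automatic recipe", in which case the installer
    would presumably stop and prompt at the partitioning step.
    """
    short = hostname.split(".", 1)[0]
    for pattern, recipe in RECIPES:
        if fnmatchcase(short, pattern):
            return recipe
    return None


print(recipe_for("clouddb1013.eqiad.wmnet"))
print(recipe_for("unlisted1001.eqiad.wmnet"))
```

A host with no matching pattern gets `None` here, which corresponds to the interactive prompt seen on the first clouddb1013 attempt.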

Change 779488 merged by Razzi:

[operations/puppet@production] netboot: add clouddb1013 partitioned as database

https://gerrit.wikimedia.org/r/779488

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1013.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1013.eqiad.wmnet with OS bullseye completed:

  • clouddb1013 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204121642_razzi_1672456_clouddb1013.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 779557 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] netboot: set reuse-db.cfg for clouddb10xx hosts

https://gerrit.wikimedia.org/r/779557

Change 779557 merged by Razzi:

[operations/puppet@production] netboot: set reuse-db.cfg for clouddb10xx hosts

https://gerrit.wikimedia.org/r/779557

Icinga downtime and Alertmanager silence (ID=b835e643-0d45-43b4-9fdc-04e643305c67) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade to bullseye

clouddb1014.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1014.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1014.eqiad.wmnet with OS bullseye completed:

  • clouddb1014 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204122248_razzi_1720595_clouddb1014.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 779568 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] dbproxy: repool all hosts after finishing reimages for day

https://gerrit.wikimedia.org/r/779568

OK, after some help with wmf-pt-kill in https://phabricator.wikimedia.org/T305974 and a patch to update netboot for the other clouddb10xx hosts (https://gerrit.wikimedia.org/r/c/operations/puppet/+/779557), the reimage of clouddb1014 went smoothly. I'm repooling all hosts and will continue with clouddb1015-1021 tomorrow.

Change 779568 merged by Razzi:

[operations/puppet@production] dbproxy: repool all hosts after finishing reimages for day

https://gerrit.wikimedia.org/r/779568

Change 779918 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] wikireplicas: depool clouddb1015-16

https://gerrit.wikimedia.org/r/779918

Change 779918 merged by Razzi:

[operations/puppet@production] wikireplicas: depool clouddb1015-16

https://gerrit.wikimedia.org/r/779918

Change 779919 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] wikireplicas: fix depooling yaml

https://gerrit.wikimedia.org/r/779919

Change 779919 merged by Razzi:

[operations/puppet@production] wikireplicas: fix depooling yaml

https://gerrit.wikimedia.org/r/779919

Icinga downtime and Alertmanager silence (ID=f7843804-8334-4adc-b5df-6a326dc15126) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade to bullseye

clouddb1015.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1015.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1015.eqiad.wmnet with OS bullseye completed:

  • clouddb1015 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204131836_razzi_1874126_clouddb1015.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=07eed8e0-a215-4571-9fd6-3913d92afc84) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade to bullseye

clouddb1016.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1016.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1016.eqiad.wmnet with OS bullseye completed:

  • clouddb1016 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 97
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204132023_razzi_1887812_clouddb1016.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=05439375-0c10-4a70-8c23-9ac7d34861c3) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade to bullseye

clouddb1017.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1017.eqiad.wmnet with OS bullseye

Icinga downtime and Alertmanager silence (ID=2754761a-a7ea-4907-8769-732880af2c50) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade to bullseye

clouddb1018.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1018.eqiad.wmnet with OS bullseye

Icinga downtime and Alertmanager silence (ID=cfaab03a-0f35-4e64-881c-356ec9b7c6bc) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade to bullseye

clouddb1019.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1017.eqiad.wmnet with OS bullseye completed:

  • clouddb1017 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204132118_razzi_1897054_clouddb1017.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1019.eqiad.wmnet with OS bullseye

Icinga downtime and Alertmanager silence (ID=85da21b7-6ddb-4fd3-b825-4db7b312cdaf) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade to bullseye

clouddb1020.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1020.eqiad.wmnet with OS bullseye

Icinga downtime and Alertmanager silence (ID=0324aac8-1d29-463a-9c22-25aaee9c0b02) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade to bullseye

clouddb1021.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=bac419f0-2d10-4028-8ef9-41a4d175439a) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade to bullseye

clouddb1021.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1018.eqiad.wmnet with OS bullseye completed:

  • clouddb1018 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204132142_razzi_1902192_clouddb1018.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1019.eqiad.wmnet with OS bullseye completed:

  • clouddb1019 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204132148_razzi_1902918_clouddb1019.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1020.eqiad.wmnet with OS bullseye completed:

  • clouddb1020 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204132152_razzi_1903273_clouddb1020.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1021.eqiad.wmnet with OS bullseye executed with errors:

  • clouddb1021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Change 780435 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] wikireplicas: repool clouddb1017-1020 following reimaging

https://gerrit.wikimedia.org/r/780435

Change 780435 merged by Razzi:

[operations/puppet@production] wikireplicas: repool clouddb1017-1020 following reimaging

https://gerrit.wikimedia.org/r/780435

Icinga downtime and Alertmanager silence (ID=644878e4-4323-41a9-afa2-a71944aff698) set by razzi@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Upgrade to bullseye

clouddb1021.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by razzi@cumin1001 for host clouddb1021.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by razzi@cumin1001 for host clouddb1021.eqiad.wmnet with OS bullseye completed:

  • clouddb1021 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204261635_razzi_3288954_clouddb1021.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
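
Since several hosts were reimaged more than once, it can help to reduce the cookbook reports to the latest status per host. The Python sketch below does this for a few summary lines copied from this task. The regular expression assumes the "• host (STATUS)" layout seen above, and the function name is illustrative rather than part of any cookbook.

```python
import re

# A few summary lines copied from the cookbook reports in this task.
LOG = """\
  • clouddb1013 (FAIL)
  • clouddb1013 (WARN)
  • clouddb1014 (PASS)
  • clouddb1021 (FAIL)
  • clouddb1021 (PASS)
"""

STATUS_LINE = re.compile(
    r"^\s*•\s+(?P<host>\S+)\s+\((?P<status>PASS|WARN|FAIL)\)\s*$"
)


def latest_status(text):
    """Map each host to the status of its last reported reimage run."""
    result = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            # Later reports overwrite earlier ones for the same host.
            result[match["host"]] = match["status"]
    return result


statuses = latest_status(LOG)
needs_followup = sorted(h for h, s in statuses.items() if s != "PASS")
print(statuses)
print(needs_followup)
```

Hosts whose last run ended in WARN (Icinga status not optimal, downtime not removed) are the ones that still need a manual check before repooling.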