Upgrade the druid-public cluster to bullseye
Closed, Resolved (Public)

Assigned To
Authored By
BTullis
Mar 20 2023, 1:35 PM
Referenced Files
F41525854: image.png
Nov 23 2023, 2:38 PM
F41525119: image.png
Nov 23 2023, 2:38 PM
F41524831: image.png
Nov 23 2023, 9:37 AM
F41524833: image.png
Nov 23 2023, 9:37 AM

Description

This ticket tracks the upgrade of the druid-public analytics servers to Debian Bullseye.
We recently onboarded 3 new Druid instances running Bullseye and plan to decommission 3 older hosts from the cluster (T336043), so the only reimages left are for druid100[7-8].
Steps:

  • Check partman recipe configured

Begin reimage

  • druid1007
  • druid1008
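The `druid100[7-8]` notation used in the description is a host range. As a minimal sketch of how such a range maps to the individual hosts listed above, the hypothetical helper `expand_hosts` below handles a single numeric `[start-end]` range only; it is an illustration, not the parser used by the reimage tooling.

```python
import re


def expand_hosts(pattern: str) -> list:
    """Expand one numeric range such as 'druid100[7-8]' into host names.

    Hypothetical helper for illustration: supports a single [start-end]
    range and returns the pattern unchanged if no range is present.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep any zero padding used in the range bounds
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


print(expand_hosts("druid100[7-8]"))  # ['druid1007', 'druid1008']
```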

Event Timeline

Change 902092 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/debs/druid@debian]

  • Rebuild for bullseye T332584 T332589
  • Move to Java 11
  • Remove adduser dependency for anything but druid-common, the rest don't need it
  • Remove versioned druid-common dependency, we're way past 0.10 for a while
  • Move to debhelper 13 (which absorbed dh-systemd)

https://gerrit.wikimedia.org/r/902092

Change 902092 abandoned by Muehlenhoff:

[operations/debs/druid@debian]

  • Rebuild for bullseye T332584 T332589
  • Move to Java 11
  • Remove adduser dependency for anything but druid-common, the rest don't need it
  • Remove versioned druid-common dependency, we're way past 0.10 for a while
  • Move to debhelper 13 (which absorbed dh-systemd)

Reason:

Obsolete, different patch was merged

https://gerrit.wikimedia.org/r/902092

Change 902302 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/software/druid_exporter@debian] Build for Bullseye

https://gerrit.wikimedia.org/r/902302

Change 902302 merged by Muehlenhoff:

[operations/software/druid_exporter@debian] Build for Bullseye

https://gerrit.wikimedia.org/r/902302

Mentioned in SAL (#wikimedia-operations) [2023-03-23T09:47:17Z] <moritzm> uploaded prometheus-druid-exporter 0.8-2 for bullseye-wikimedia T332584 T332589

BTullis triaged this task as High priority. Nov 15 2023, 9:44 AM
Stevemunene updated the task description.
Stevemunene removed a subscriber: nfraison.

Change 976385 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] set druid hosts to use the reuse partman recipe

https://gerrit.wikimedia.org/r/976385

Change 976385 merged by Stevemunene:

[operations/puppet@production] set druid hosts to use the reuse partman recipe

https://gerrit.wikimedia.org/r/976385

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

Hit a block at the partitioning step of the reimage. Exploring the options to find the best way forward for druid1008.

image.png (1×1 px, 230 KB)

image.png (1×1 px, 371 KB)

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors:

  • druid1008 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"druid1008.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'name=druid1008\.eqiad\.wmnet,dc=eqiad,cluster=druid\-public,service=druid\-public\-broker' set/pooled=yes

    • The reimage failed, see the cookbook logs for the details
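The restore command in the log above can be derived from the logged confctl state: the host name and each tag value are regex-escaped and joined into one selector. The sketch below reproduces that mapping in Python. The function name `restore_command` is hypothetical, and using `re.escape` is an assumption that happens to match the escaping seen in the log; it is not taken from the cookbook's source.

```python
import json
import re


def restore_command(logged_state: str) -> str:
    """Build the confctl command that restores the logged pooled state.

    Hypothetical helper: expects the JSON object printed by the cookbook,
    with one host entry plus a "tags" string of comma-separated key=value pairs.
    """
    state = json.loads(logged_state)
    tags = state.pop("tags")
    # After removing "tags", the only remaining key is the host name.
    (host, attrs), = state.items()
    parts = [f"name={re.escape(host)}"]
    for tag in tags.split(","):
        key, value = tag.split("=", 1)
        parts.append(f"{key}={re.escape(value)}")
    selector = ",".join(parts)
    return f"sudo confctl select '{selector}' set/pooled={attrs['pooled']}"


logged = (
    '{"druid1008.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, '
    '"tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}'
)
print(restore_command(logged))
```

The printed command should be checked against the one emitted by the cookbook before it is run on a production host.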

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors:

  • druid1008 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Change 976943 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] update druid100[7-8] reuse partman recipe

https://gerrit.wikimedia.org/r/976943

Change 976943 merged by Stevemunene:

[operations/puppet@production] update druid100[7-8] reuse partman recipe

https://gerrit.wikimedia.org/r/976943

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors:

  • druid1008 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors:

  • druid1008 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors:

  • druid1008 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

We fixed a partman recipe issue that was causing errors. The reimage then proceeded as expected, with the expected options shown below:

image.png (315×1 px, 115 KB)

We selected Yes at the prompt shown below:
image.png (1×1 px, 273 KB)

However, the host is stuck in an unresponsive state at the "Checked BIOS boot parameters are back to normal" step, which should be followed by an automatic reboot. I have power cycled the machine, but it is still in the same state.
Starting another reimage for druid1008, hoping to learn more about the specific cause.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye completed:

  • druid1008 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311231455_stevemunene_2136367_druid1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
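Comparing the failed attempts with this successful run shows where each attempt stopped. As a rough sketch, the hypothetical helper `next_missing_step` below takes the steps a run reported and returns the first step of the successful sequence that is absent. The step names are copied from the logs in this task; treating this sequence as the reference order is an assumption made for illustration.

```python
# First steps of the successful druid1008 run, copied from the log above.
EXPECTED_STEPS = [
    "Removed from Puppet and PuppetDB if present and deleted any certificates",
    "Removed from Debmonitor if present",
    "Forced PXE for next reboot",
    "Host rebooted via IPMI",
    "Host up (Debian installer)",
    "Add puppet_version metadata to Debian installer",
    "Checked BIOS boot parameters are back to normal",
    "Host up (new fresh bullseye OS)",
]


def next_missing_step(completed):
    """Return the first expected step not reported by a run, or None."""
    for step in EXPECTED_STEPS:
        if step not in completed:
            return step
    return None


# Steps reported by the earlier failed attempts that reached the BIOS check.
failed_attempt = EXPECTED_STEPS[:7]
print(next_missing_step(failed_attempt))  # Host up (new fresh bullseye OS)
```

For those attempts the missing step is the first boot into the new OS, which is consistent with the partitioning problem described earlier in the timeline.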

Mentioned in SAL (#wikimedia-analytics) [2023-11-24T06:07:31Z] <stevemunene> pool druid1008 after reimage T332589

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1007.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-analytics) [2023-11-27T13:27:01Z] <stevemunene> reimage druid1007 to upgrade to bullseye T332589

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1007.eqiad.wmnet with OS bullseye completed:

  • druid1007 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"druid1007.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311271345_stevemunene_348828_druid1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'name=druid1007\.eqiad\.wmnet,dc=eqiad,cluster=druid\-public,service=druid\-public\-broker' set/pooled=yes

    • Updated Netbox data from PuppetDB

druid100[7-8] are now running Bullseye. As stated, druid100[4-6] are being decommissioned (T336043); once that is done, the whole druid-public cluster will be running Bullseye.

Mentioned in SAL (#wikimedia-analytics) [2023-11-27T15:05:51Z] <stevemunene> pool druid1007 after bullseye reimage T332589

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1010.eqiad.wmnet with OS bullseye completed:

  • druid1010 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"druid1010.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}

    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311300855_stevemunene_2142028_druid1010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'name=druid1010\.eqiad\.wmnet,dc=eqiad,cluster=druid\-public,service=druid\-public\-broker' set/pooled=yes

    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)