Upgrade the druid-analytics cluster to bullseye
Closed, Resolved (Public)

Description

This ticket tracks the work required to upgrade the five-node druid-analytics cluster to Debian bullseye.

Unlike the druid-public cluster, druid-analytics is not behind LVS.

No load balancing is therefore available, and many ingestion jobs are configured to use an-druid1001.eqiad.wmnet directly as their service host.
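
Since clients pin a single host, the order of the reimages matters. The snippet below is a minimal sketch of one possible sequencing, assuming the pinned host is reimaged last and clients are repointed to an already-upgraded host just before that. The names `Step` and `plan_rolling_reimage` are hypothetical and only illustrate the idea; they are not part of any cookbook or of Puppet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str  # "reimage" or "repoint" (hypothetical labels)
    host: str


def plan_rolling_reimage(hosts: list[str], pinned: str) -> list[Step]:
    """Order a rolling reimage so that the pinned service host goes last.

    Without a load balancer, clients must be repointed by hand to a host
    that has already been reimaged before the pinned host is taken down.
    """
    if pinned not in hosts:
        raise ValueError(f"{pinned} is not part of the cluster")
    others = [h for h in hosts if h != pinned]
    if not others:
        raise ValueError("need at least one other host to repoint clients to")

    # Reimage every non-pinned host first, in the order given.
    steps = [Step("reimage", h) for h in others]
    # Assumption: the most recently reimaged host becomes the new service host.
    steps.append(Step("repoint", others[-1]))
    # Only then is it safe to take the formerly pinned host down.
    steps.append(Step("reimage", pinned))
    return steps


if __name__ == "__main__":
    cluster = [f"an-druid100{i}.eqiad.wmnet" for i in range(5, 0, -1)]
    for step in plan_rolling_reimage(cluster, pinned="an-druid1001.eqiad.wmnet"):
        print(f"{step.action:8} {step.host}")
```

With the five hosts listed from an-druid1005 down to an-druid1001, this yields four reimages, one repoint to an-druid1002, and a final reimage of an-druid1001.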

Event Timeline

Change 972851 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Setup partman reuse recipe for an-druid hosts

https://gerrit.wikimedia.org/r/972851

Change 972851 merged by Brouberol:

[operations/puppet@production] Setup partman reuse recipe for an-druid hosts

https://gerrit.wikimedia.org/r/972851

Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host an-druid1005.eqiad.wmnet with OS bullseye

Change 973178 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Fix typo in the an-druid netboot partman case

https://gerrit.wikimedia.org/r/973178

Change 973178 merged by Brouberol:

[operations/puppet@production] Fix typo in the an-druid netboot partman case

https://gerrit.wikimedia.org/r/973178

Change 973204 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Format both LVM volumes of an-druid1005 at next reimage

https://gerrit.wikimedia.org/r/973204

Change 973204 merged by Brouberol:

[operations/puppet@production] Format both LVM volumes of an-druid1005 at next reimage

https://gerrit.wikimedia.org/r/973204

Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host an-druid1005.eqiad.wmnet with OS bullseye completed:

  • an-druid1005 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311092032_brouberol_1662988_an-druid1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

brouberol changed the task status from Open to In Progress. Nov 13 2023, 8:30 AM

brouberol claimed this task.

brouberol moved this task from Ready for Work to In Progress on the Data-Platform-SRE board.

Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host an-druid1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host an-druid1004.eqiad.wmnet with OS bullseye completed:

  • an-druid1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311141424_brouberol_574730_an-druid1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host an-druid1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host an-druid1003.eqiad.wmnet with OS bullseye completed:

  • an-druid1003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311151423_brouberol_1225348_an-druid1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host an-druid1002.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host an-druid1002.eqiad.wmnet with OS bullseye completed:

  • an-druid1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311161538_brouberol_1889527_an-druid1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 975206 had a related patch set uploaded (by Brouberol; author: Brouberol):

[analytics/refinery@master] Replace an-druid1001 by an-druid1002 in druid connection strings

https://gerrit.wikimedia.org/r/975206

Change 975207 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Replace an-druid1001 by an-druid1002 in druid connection strings

https://gerrit.wikimedia.org/r/975207

Change 975207 merged by Brouberol:

[operations/puppet@production] Replace an-druid1001 by an-druid1002 in druid connection strings

https://gerrit.wikimedia.org/r/975207

Change 975206 abandoned by Brouberol:

[analytics/refinery@master] Replace an-druid1001 by an-druid1002 in druid connection strings

Reason:

🎵 Dead codes don't need no patches 🎵

https://gerrit.wikimedia.org/r/975206

Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host an-druid1001.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host an-druid1001.eqiad.wmnet with OS bullseye completed:

  • an-druid1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311221235_brouberol_1388795_an-druid1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB