Upgrade the druid-public cluster to bullseye
Closed, Resolved, Public

Assigned To

	Stevemunene

Authored By

	BTullis
	Mar 20 2023, 1:35 PM

Description

This ticket will track the upgrade of the druid-public analytics servers to Debian Bullseye.
We recently onboarded three new Druid hosts running Bullseye and plan to decommission three hosts from the cluster (T336043), so the only hosts left to reimage are druid100[7-8].
Steps:

Check that the partman recipe is configured

Begin reimage

druid1007
druid1008
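The reimage step above can be sketched as a small dry-run helper that only prints one command per host. The cookbook name and the OS come from the log entries in this task; the `--os` and `-t` flags and the function name `print_reimage_commands` are assumptions for illustration and should be checked against the cookbook's `--help` output before use.

```shell
# Print (do not run) one reimage command per host.
# ASSUMPTION: the --os and -t flags are not shown in this task; verify them first.
print_reimage_commands() {
    task="$1"
    shift
    for host in "$@"; do
        printf 'sudo cookbook sre.hosts.reimage --os bullseye -t %s %s\n' "$task" "$host"
    done
}

# Hosts listed in the task description.
print_reimage_commands T332589 druid1007 druid1008
```

Printing the commands rather than running them keeps the helper safe to paste anywhere; each host would then be reimaged and repooled before moving on to the next.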

Details

Subject	Repo	Branch	Lines +/-
update druid100[7-8] reuse partman recipe	operations/puppet	production	+1 -1
set druid hosts to use the reuse partman recipe	operations/puppet	production	+2 -2
Build for Bullseye	operations/software/druid_exporter	debian	+10 -9


Related Objects

Status	Assigned	Task
Open	None	T291916 Tracking task for Bullseye migrations in production
Open	None	T288804 Upgrade the Data Engineering infrastructure to Debian Bullseye
Resolved	Stevemunene	T332589 Upgrade the druid-public cluster to bullseye

Event Timeline

BTullis created this task. Mar 20 2023, 1:35 PM

BTullis mentioned this in T288804: Upgrade the Data Engineering infrastructure to Debian Bullseye. Mar 20 2023, 1:37 PM

Change 902092 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/debs/druid@debian]
* Rebuild for bullseye T332584 T332589
* Move to Java 11
* Remove adduser dependency for anything but druid-common, the rest don't need it
* Remove versioned druid-common dependency, we're way past 0.10 for a while
* Move to debhelper 13 (which absorbed dh-systemd)

https://gerrit.wikimedia.org/r/902092

gerritbot added a project: Patch-For-Review. Mar 22 2023, 1:59 PM

Change 902092 abandoned by Muehlenhoff:

Reason:

Obsolete, different patch was merged

https://gerrit.wikimedia.org/r/902092

Mentioned in SAL (#wikimedia-operations) [2023-03-22T15:53:36Z] <moritzm> uploaded druid 0.19.wmf0-2 to bullseye-wikimedia T332584 T332589

Maintenance_bot removed a project: Patch-For-Review. Mar 22 2023, 4:10 PM

Change 902302 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/software/druid_exporter@debian] Build for Bullseye

https://gerrit.wikimedia.org/r/902302

gerritbot added a project: Patch-For-Review. Mar 23 2023, 8:37 AM

Change 902302 merged by Muehlenhoff:

[operations/software/druid_exporter@debian] Build for Bullseye

https://gerrit.wikimedia.org/r/902302

Maintenance_bot removed a project: Patch-For-Review. Mar 23 2023, 9:30 AM

Mentioned in SAL (#wikimedia-operations) [2023-03-23T09:47:17Z] <moritzm> uploaded prometheus-druid-exporter 0.8-2 for bullseye-wikimedia T332584 T332589

Muehlenhoff mentioned this in rOSDE87d4f7fb7399: Build for Bullseye. Mar 23 2023, 12:08 PM

BTullis added a project: Data-Platform-SRE. Jun 9 2023, 11:56 AM

JArguello-WMF removed a project: Shared-Data-Infrastructure. Jun 29 2023, 1:44 PM

JArguello-WMF removed a project: Data-Engineering-Planning. Jun 29 2023, 9:41 PM

BTullis moved this task from Incoming to Misc on the Data-Platform-SRE board. Aug 22 2023, 3:54 PM

BTullis moved this task from Misc to Ready for Work on the Data-Platform-SRE board. Oct 10 2023, 9:05 AM

BTullis triaged this task as High priority. Nov 15 2023, 9:44 AM

Stevemunene claimed this task. Nov 21 2023, 2:59 PM

Stevemunene updated the task description.

Stevemunene removed a subscriber: nfraison.

Stevemunene moved this task from Ready for Work to In Progress on the Data-Platform-SRE board. Nov 21 2023, 4:26 PM

Change 976385 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] set druid hosts to use the reuse partman recipe

https://gerrit.wikimedia.org/r/976385

gerritbot added a project: Patch-For-Review. Nov 22 2023, 6:16 AM

Change 976385 merged by Stevemunene:

[operations/puppet@production] set druid hosts to use the reuse partman recipe

https://gerrit.wikimedia.org/r/976385

Maintenance_bot removed a project: Patch-For-Review. Nov 23 2023, 8:30 AM

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

The reimage of druid1008 is blocked at the partitioning step. I am exploring the options to find the best way forward.

Stevemunene updated the task description. Nov 23 2023, 9:38 AM

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors:

druid1008 (FAIL)
- Downtimed on Icinga/Alertmanager
- Set pooled=inactive for the following services on confctl:

{"druid1008.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}

- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'name=druid1008\.eqiad\.wmnet,dc=eqiad,cluster=druid\-public,service=druid\-public\-broker' set/pooled=yes

- The reimage failed, see the cookbook logs for the details
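The repool command printed by the cookbook escapes dots and hyphens in the selector values. A minimal sketch of rebuilding that command for any host follows; the function names `escape_selector` and `repool_command` are hypothetical helpers, and the escaping rule is inferred only from the command shown in this log.

```shell
# Backslash-escape dots and hyphens, matching the selector format in the log above.
escape_selector() {
    printf '%s' "$1" | sed 's/[.-]/\\&/g'
}

# Print (do not run) the confctl command that restores pooled=yes.
# Arguments: host FQDN, datacenter, cluster, service.
repool_command() {
    printf "sudo confctl select 'name=%s,dc=%s,cluster=%s,service=%s' set/pooled=yes\n" \
        "$(escape_selector "$1")" "$(escape_selector "$2")" \
        "$(escape_selector "$3")" "$(escape_selector "$4")"
}

repool_command druid1008.eqiad.wmnet eqiad druid-public druid-public-broker
```

The helper only prints the command, so the operator can still compare it with the cookbook output before running it.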

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors:

druid1008 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Change 976943 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] update druid100[7-8] reuse partman recipe

https://gerrit.wikimedia.org/r/976943

gerritbot added a project: Patch-For-Review. Nov 23 2023, 10:37 AM

Change 976943 merged by Stevemunene:

[operations/puppet@production] update druid100[7-8] reuse partman recipe

https://gerrit.wikimedia.org/r/976943

Maintenance_bot removed a project: Patch-For-Review. Nov 23 2023, 11:10 AM

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors:

druid1008 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors:

druid1008 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye executed with errors:

druid1008 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- The reimage failed, see the cookbook logs for the details

We fixed a partman recipe issue that was causing some of the errors. The installer then proceeded with the expected options (see the attached screenshots), and we selected Yes at the prompt shown in the screenshot.

However, the host is now stuck in an unresponsive state at the "Checked BIOS boot parameters are back to normal" step, which should be followed by an automatic reboot. I have power cycled the machine, but it is still in the same state.
I am starting another reimage of druid1008, hoping to learn more about the specific cause.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1008.eqiad.wmnet with OS bullseye completed:

druid1008 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311231455_stevemunene_2136367_druid1008.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
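With several attempts logged for the same host, it can help to pull out just the result lines. The sketch below assumes the summary format seen in this task, a host name followed by `(PASS)`, `(FAIL)`, or `(WARN)` on its own line; the function name `reimage_status` is hypothetical.

```shell
# Read cookbook summaries on stdin and print "host STATUS" for each result line.
# ASSUMPTION: result lines look exactly like "druid1008 (FAIL)", as in this task.
reimage_status() {
    sed -n -E 's/^([a-z0-9]+) \((PASS|FAIL|WARN)\)$/\1 \2/p'
}

# Example input modelled on the log entries above.
printf '%s\n' 'druid1008 (FAIL)' '- Forced PXE for next reboot' 'druid1008 (PASS)' | reimage_status
```

Step lines are dropped because they do not match the pattern, so the output is one line per attempt.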

Stevemunene updated the task description. Nov 24 2023, 6:05 AM

Mentioned in SAL (#wikimedia-analytics) [2023-11-24T06:07:31Z] <stevemunene> pool druid1008 after reimage T332589

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1007.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-analytics) [2023-11-27T13:27:01Z] <stevemunene> reimage druid1007 to upgrade to bullseye T332589

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1007.eqiad.wmnet with OS bullseye completed:

druid1007 (WARN)
- Downtimed on Icinga/Alertmanager
- Set pooled=inactive for the following services on confctl:

{"druid1007.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}

- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311271345_stevemunene_348828_druid1007.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'name=druid1007\.eqiad\.wmnet,dc=eqiad,cluster=druid\-public,service=druid\-public\-broker' set/pooled=yes

- Updated Netbox data from PuppetDB

Stevemunene updated the task description. Nov 27 2023, 2:05 PM

druid100[7-8] are now running Bullseye. As stated in the description, druid100[4-6] are being decommissioned (T336043); once that is done, the whole druid-public cluster will be running Bullseye.
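One way to confirm the release on a reimaged host is to read the codename from `/etc/os-release`. This is a minimal sketch, not the check used in this task; the function names `os_codename` and `is_bullseye` are hypothetical.

```shell
# Print the release codename recorded in an os-release file.
# Defaults to /etc/os-release; a path argument makes the function easy to test.
os_codename() {
    file="${1:-/etc/os-release}"
    sed -n 's/^VERSION_CODENAME=//p' "$file" | tr -d '"'
}

# Succeed only if the codename is bullseye.
is_bullseye() {
    [ "$(os_codename "$1")" = "bullseye" ]
}
```

On a host, `is_bullseye && echo "bullseye confirmed"` would print the message only after a successful upgrade.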

Mentioned in SAL (#wikimedia-analytics) [2023-11-27T15:05:51Z] <stevemunene> pool druid1007 after bullseye reimage T332589

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1010.eqiad.wmnet with OS bullseye completed:

druid1010 (WARN)
- Downtimed on Icinga/Alertmanager
- Set pooled=inactive for the following services on confctl:

{"druid1010.eqiad.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=eqiad,cluster=druid-public,service=druid-public-broker"}

- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311300855_stevemunene_2142028_druid1010.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Services in confctl are not automatically pooled, to restore the previous state you have to run the following commands:

sudo confctl select 'name=druid1010\.eqiad\.wmnet,dc=eqiad,cluster=druid\-public,service=druid\-public\-broker' set/pooled=yes

- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Gehel closed this task as Resolved. Dec 1 2023, 10:11 AM

	F41525854: image.png
	Nov 23 2023, 2:38 PM

	F41525119: image.png
	Nov 23 2023, 2:38 PM

	F41524831: image.png
	Nov 23 2023, 9:37 AM

	F41524833: image.png
	Nov 23 2023, 9:37 AM
