Upgrade hadoop workers to bullseye
Closed, Resolved (Public)

Assigned To
Authored By
BTullis
Mar 20 2023, 12:36 PM
Referenced Files
F37730472: image.png
Sep 18 2023, 1:28 PM
F37727493: image.png
Sep 15 2023, 11:16 AM
F37721279: image.png
Sep 14 2023, 2:34 PM
F37667677: image.png
Sep 6 2023, 4:52 PM
F37656999: image.png
Sep 5 2023, 12:34 PM
F37652889: image.png
Sep 4 2023, 10:08 AM
F37652969: image.png
Sep 4 2023, 10:08 AM
F37626140: image.png
Aug 24 2023, 11:41 AM

Description

This ticket will track the upgrade of the analytics Hadoop workers to bullseye.

There are currently 91 hosts in this cluster, although at the time of writing, 6 are due to be decommissioned.

Event Timeline

an-worker1117 is stuck at install with an error no root filesystem is defined. Looking into this.

image.png (1×1 px, 488 KB)

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1117.eqiad.wmnet with OS bullseye completed:

  • an-worker1117 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308290542_stevemunene_2708160_an-worker1117.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Looking into this: we recently changed the partman recipe for an-worker1117 to reuse-analytics-hadoop-worker-12dev.cfg, which on #L15-L17 expects to find three partitions: root, journalnode, and swap. However, the recent reimage failed because an-worker1117 was pointing to the wrong partition, so the journalnode partition was not available, as seen below.

@an-worker1117:~# lsblk
NAME                          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                             8:0    0 446.6G  0 disk 
├─sda1                          8:1    0   953M  0 part /boot
├─sda2                          8:2    0     1K  0 part 
└─sda5                          8:5    0 445.7G  0 part 
  ├─an--worker1117--vg-swap   254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1117--vg-root   254:1    0  55.9G  0 lvm  /
  └─an--worker1117--vg-unused 254:2    0 291.4G  0 lvm

This was fixed by running the script available at #Worker_Nodes to remove the unused partition and create the journalnode partition. The result is as below:

@an-worker1117:~# lsblk
NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                  8:0    0 446.6G  0 disk 
├─sda1                               8:1    0   953M  0 part /boot
├─sda2                               8:2    0     1K  0 part 
└─sda5                               8:5    0 445.7G  0 part 
  ├─an--worker1117--vg-swap        254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1117--vg-root        254:1    0  55.9G  0 lvm  /
  └─an--worker1117--vg-journalnode 254:2    0    10G  0 lvm
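
A check along these lines could flag a missing volume before the installer trips over it. This is only a minimal sketch: it assumes the required volume names are root, journalnode and swap (as described for the recipe above) and that the input is plain `lsblk` text in the layout pasted here. The function names are hypothetical and not part of any existing tooling.

```python
import re

# Assumption: the volume names the reuse recipe expects to find.
REQUIRED_LVS = {"root", "journalnode", "swap"}


def logical_volumes(lsblk_output):
    """Return the LV names of all rows whose TYPE column is 'lvm'."""
    names = set()
    for line in lsblk_output.splitlines():
        fields = line.split()
        # Columns: NAME MAJ:MIN RM SIZE RO TYPE [MOUNTPOINT]
        if len(fields) < 6 or fields[5] != "lvm":
            continue
        # Drop the tree-drawing characters that lsblk puts before the name.
        dm_name = fields[0].lstrip("├└─│")
        # Device-mapper names are "<vg>-<lv>", with hyphens inside either
        # part doubled, so the separator is the only single hyphen.
        parts = re.split(r"(?<!-)-(?!-)", dm_name, maxsplit=1)
        if len(parts) == 2:
            names.add(parts[1].replace("--", "-"))
    return names


def missing_volumes(lsblk_output):
    """Return the required LV names that do not appear in the output."""
    return REQUIRED_LVS - logical_volumes(lsblk_output)
```

Fed the first `lsblk` listing above, `missing_volumes` would report journalnode as missing; fed the second, it would return an empty set.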

Retried the reimage with sudo cookbook sre.hosts.reimage --os bullseye -t T332570 an-worker1117 --new, since the host had disappeared from PuppetDB after being down for too long. The reimage was successful, moving on to an-worker1118+
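
For reference, the choice between the two invocations can be sketched as below. The flags (`--os`, `-t`, `--new`) are the ones used in the command above. The helper name and the `known_to_puppetdb` parameter are hypothetical; whether a host is still in PuppetDB has to be established separately.

```python
def reimage_command(host, task, os_name="bullseye", known_to_puppetdb=True):
    """Build the reimage cookbook command line as a list of arguments."""
    cmd = ["sudo", "cookbook", "sre.hosts.reimage",
           "--os", os_name, "-t", task, host]
    # A host that has dropped out of PuppetDB is reimaged with --new,
    # as was done for an-worker1117 above.
    if not known_to_puppetdb:
        cmd.append("--new")
    return cmd
```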

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1118.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1119.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1118.eqiad.wmnet with OS bullseye completed:

  • an-worker1118 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308290633_stevemunene_2717780_an-worker1118.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1119.eqiad.wmnet with OS bullseye completed:

  • an-worker1119 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308290636_stevemunene_2717859_an-worker1119.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1120.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1121.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1120.eqiad.wmnet with OS bullseye completed:

  • an-worker1120 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308290728_stevemunene_2732709_an-worker1120.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1121.eqiad.wmnet with OS bullseye completed:

  • an-worker1121 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308290731_stevemunene_2732769_an-worker1121.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1122.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1123.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1123.eqiad.wmnet with OS bullseye completed:

  • an-worker1123 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308291002_stevemunene_2766120_an-worker1123.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1122.eqiad.wmnet with OS bullseye completed:

  • an-worker1122 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308291004_stevemunene_2766112_an-worker1122.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1124.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1125.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1124.eqiad.wmnet with OS bullseye completed:

  • an-worker1124 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308291506_stevemunene_2829574_an-worker1124.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1125.eqiad.wmnet with OS bullseye completed:

  • an-worker1125 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308291508_stevemunene_2829792_an-worker1125.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1126.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1127.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1126.eqiad.wmnet with OS bullseye completed:

  • an-worker1126 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308300635_stevemunene_3011437_an-worker1126.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1127.eqiad.wmnet with OS bullseye completed:

  • an-worker1127 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308300637_stevemunene_3011616_an-worker1127.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1128.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye completed:

  • an-worker1129 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308300725_stevemunene_3025456_an-worker1129.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1128.eqiad.wmnet with OS bullseye completed:

  • an-worker1128 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202308300728_stevemunene_3025409_an-worker1128.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1129.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1129 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309040929_stevemunene_2233697_an-worker1129.out
    • The reimage failed, see the cookbook logs for the details

The sre.hosts.downtime cookbook failed in the second-to-last step of the reimage with the following error:

Error: Could not prepare for execution: The puppet agent command does not take parameters                                      
================                                                                                                               
PASS |                                                                                         |   0% (0/1) [01:11<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [01:11<00:00, 71.75s/hosts]
100.0% (1/1) of nodes failed to execute command 'run-puppet-agent...et --attempts 60': alert1001.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'run-puppet-agent...et --attempts 60'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Exception raised while executing cookbook sre.hosts.downtime:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 212, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/downtime.py", line 116, in run
    self.puppet.run(quiet=True, attempts=60, timeout=600)
  File "/usr/lib/python3/dist-packages/spicerack/puppet.py", line 200, in run
    self._remote_hosts.run_sync(Command(command, timeout=timeout), batch_size=batch_size)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 496, in run_sync
    return self._execute(
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 702, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed", worker.get_results())
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1129.eqiad.wmnet with reason: host reimage
Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99

The host appears to be back in Icinga after a while; monitoring for any abnormalities.

image.png (1×3 px, 541 KB)

The Puppet run was also successful, hadoop-hdfs-datanode.service is running, and the host is listed as in service on the HDFS datanode interface.
image.png (442×1 px, 70 KB)

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1130.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1131.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1130.eqiad.wmnet with OS bullseye completed:

  • an-worker1130 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309050626_stevemunene_3584374_an-worker1130.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1131.eqiad.wmnet with OS bullseye completed:

  • an-worker1131 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309050646_stevemunene_3605938_an-worker1131.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1132 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

an-worker1132 seems to be stuck in the Debian installer, as seen below. Power cycling the server and retrying the reimage.

image.png (1×3 px, 197 KB)

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1132 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye

The error I see is this:

image.png (433×720 px, 40 KB)

I got to this over IPMI with: ipmitool -I lanplus -H "an-worker1132.mgmt.eqiad.wmnet" -U root -E sol activate

I also logged in with sudo install_console an-worker1132.eqiad.wmnet and verified that there is no journalnode volume.

~ # lvs
  LV     VG               Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root   an-worker1132-vg -wi-a-----  <55.88g                                                    
  swap   an-worker1132-vg -wi-a-----    9.31g                                                    
  unused an-worker1132-vg -wi-a----- <291.36g                                                    
~ # vgs
  VG               #PV #LV #SN Attr   VSize    VFree 
  an-worker1132-vg   1   3   0 wz--n- <445.69g 89.14g

There is a script referenced here, which has some commands for manually creating a journalnode, if required.
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Standard_Worker_Installation

I'll run the steps manually and retry the reimage.

I have executed:

lvcreate -L 10g -n journalnode an-worker1132-vg

Now we can see that there is a 10 GB journalnode volume.

# lvs
  LV          VG               Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  journalnode an-worker1132-vg -wi-a-----  10.00g                                                    
  root        an-worker1132-vg -wi-a----- <55.88g                                                    
  swap        an-worker1132-vg -wi-a-----   9.31g

I'll try the reimage again.
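
The manual fix above could be made repeatable with a small guard that only creates the volume when it is missing. This is a sketch, not the script from the wiki page: lv_exists and ensure_journalnode_lv are hypothetical names, the 10g size is the value used on an-worker1132, and it assumes awk is available in the shell where it runs.

```shell
# Return 0 if an LV with the given name appears in `lvs` output read on stdin.
lv_exists() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Create the journalnode LV in the given VG only when it is missing.
ensure_journalnode_lv() {
    vg=$1
    if lvs "$vg" | lv_exists journalnode; then
        echo "journalnode already present in $vg"
    else
        lvcreate -L 10g -n journalnode "$vg"
    fi
}

# Example usage from the installer shell:
#   ensure_journalnode_lv an-worker1132-vg
```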

The installation looks to be proceeding as expected now. I will check the other nodes to see if any others will experience the same issue.

I ran sudo cumin A:hadoop-worker "lvs | grep journalnode" from cumin1001, and it looks like this is the only host affected by this issue. There are a few discrepancies in the VG name or the volume size on other hosts, which were probably introduced by manual configuration in the past, but nothing serious.

1.3% (1/78) of nodes failed to execute command 'lvs | grep journalnode': an-worker1132.eqiad.wmnet
98.7% (77/78) success ratio (< 100.0% threshold) for command: 'lvs | grep journalnode'. Aborting.: an-worker[1078-1095,1097-1131,1133-1148].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
98.7% (77/78) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.: an-worker[1078-1095,1097-1131,1133-1148].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
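
The check works because grep exits non-zero when nothing matches, so cumin reports any host without a journalnode volume as a failure. A small sketch for pulling that host list out of saved cumin output (cumin_failed_hosts is a hypothetical helper, and the pattern only assumes the summary line format shown above):

```shell
# Print the host list from cumin's "of nodes failed to execute command" lines
# read on stdin, i.e. the text after the final "': " on those lines.
cumin_failed_hosts() {
    sed -n "s/^.* of nodes failed to execute command '.*': //p"
}

# Example usage:
#   sudo cumin A:hadoop-worker "lvs | grep journalnode" 2>&1 | cumin_failed_hosts
```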

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1132.eqiad.wmnet with OS bullseye completed:

  • an-worker1132 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309061725_btullis_1777678_an-worker1132.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1133.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1134.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1133.eqiad.wmnet with OS bullseye completed:

  • an-worker1133 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309070915_stevemunene_3486181_an-worker1133.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1134.eqiad.wmnet with OS bullseye completed:

  • an-worker1134 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309070941_stevemunene_3497414_an-worker1134.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Grafana shows some HDFS corrupt blocks from 2023-09-07 10:03 UTC.
A quick check on both master nodes shows 0 corrupt files:

@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks
Connecting to namenode via https://an-master1001.eqiad.wmnet:50470/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The filesystem under path '/' has 0 CORRUPT files
@an-master1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks
Connecting to namenode via https://an-master1001.eqiad.wmnet:50470/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The filesystem under path '/' has 0 CORRUPT files
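
For repeated checks during the remaining reimages, the count can be extracted from the fsck summary line. A minimal sketch, assuming the summary keeps the wording shown above (corrupt_file_count is a hypothetical helper):

```shell
# Extract the corrupt-file count from `hdfs fsck / -list-corruptfileblocks`
# output read on stdin; prints nothing if the summary line is absent.
corrupt_file_count() {
    sed -n "s/^The filesystem under path '.*' has \([0-9][0-9]*\) CORRUPT files\$/\1/p"
}

# Example usage on a master node:
#   sudo -u hdfs kerberos-run-command hdfs hdfs fsck / -list-corruptfileblocks | corrupt_file_count
```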

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1135.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1136.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1135.eqiad.wmnet with OS bullseye completed:

  • an-worker1135 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309071356_stevemunene_3634094_an-worker1135.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1136.eqiad.wmnet with OS bullseye completed:

  • an-worker1136 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309071413_stevemunene_3642711_an-worker1136.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1137.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1137.eqiad.wmnet with OS bullseye completed:

  • an-worker1137 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309141043_stevemunene_13556_an-worker1137.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1139.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1138 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-analytics) [2023-09-14T14:13:33Z] <stevemunene> powercycle an-worker1138, investigating failures related to reimage T332570

an-worker1138 is currently failing with the following error:

image.png (1×2 px, 338 KB)

I power-cycled the host to get access to the terminal, but the host does not accept the root password.
My first thought was to check the partitions, based on the previous host's experience and as per Standard_Worker_Installation, but the host is still inaccessible.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1139.eqiad.wmnet with OS bullseye completed:

  • an-worker1139 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309141328_stevemunene_48952_an-worker1139.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

To track this down you can restart the reimage and follow along via the mgmt/serial console. If the error from the screenshot above happens again, just keep the error dialogue open and connect into the Debian installer from puppetmaster1001.eqiad.wmnet with:

sudo ssh -4 -i /root/.ssh/new_install  -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no an-worker1138.eqiad.wmnet

The file /var/log/syslog probably has a few hints about what exactly failed. I'm also happy to have a look.

Thanks @MoritzMuehlenhoff.

I managed to get access to the host via regular ssh and confirmed that the right volumes exist:

sda                                  8:0    0 446.6G  0 disk 
├─sda1                               8:1    0   953M  0 part /boot
├─sda2                               8:2    0     1K  0 part 
└─sda5                               8:5    0 445.7G  0 part 
  ├─an--worker1138--vg-root        254:0    0  55.9G  0 lvm  /
  ├─an--worker1138--vg-swap        254:1    0   9.3G  0 lvm  [SWAP]
  └─an--worker1138--vg-journalnode 254:2    0    10G  0 lvm  /var/lib/hadoop/journal

Restarting the reimage and following along to see where else the issue could be.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye

Following the install via IPMI with ipmitool -I lanplus -H "an-worker1138.mgmt.eqiad.wmnet" -U root -E sol activate

Reimage seems to have been successful this time round. Waiting for the first puppet run to complete.

image.png (292×1 px, 36 KB)

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1138.eqiad.wmnet with OS bullseye completed:

  • an-worker1138 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309151112_stevemunene_350432_an-worker1138.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Great :-)

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1140.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1141.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1140.eqiad.wmnet with OS bullseye executed with errors:

  • an-worker1140 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181210_stevemunene_3044066_an-worker1140.out
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1141.eqiad.wmnet with OS bullseye completed:

  • an-worker1141 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181224_stevemunene_3170186_an-worker1141.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

The reimage of an-worker1140 failed on an unsuccessful Puppet run, but subsequent runs succeeded.
The failure is shown below:

----- OUTPUT of 'run-puppet-agent --quiet' -----
================
PASS |                                                                                                                                                                 |   0% (0/1) [00:02<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:02<00:00,  2.40s/hosts]
100.0% (1/1) of nodes failed to execute command 'run-puppet-agent --quiet': cumin1001.eqiad.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'run-puppet-agent --quiet'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.

All subsequent runs were successful, including manual Puppet runs:

Notice: Applied catalog in 49.25 seconds
stevemunene@an-worker1140:~$

The host is also fully back up and running, as per Icinga and the HDFS namenode manager interface.

image.png (292×1 px, 45 KB)
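
Since the Puppet failure was transient, a retry wrapper is one way to handle it when running Puppet by hand. This is a generic sketch and not part of the reimage cookbook; retry is a hypothetical helper, and a delay between attempts could be added if needed.

```shell
# Run a command up to N times, stopping at the first success.
# Returns 0 on success, 1 if every attempt failed.
retry() {
    attempts=$1
    shift
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $n/$attempts failed: $*" >&2
        n=$((n + 1))
    done
    return 1
}

# Example usage on the reimaged host:
#   retry 3 sudo run-puppet-agent
```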

Proceeding with the rest of the upgrades.

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1142.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1143.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1142.eqiad.wmnet with OS bullseye completed:

  • an-worker1142 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181445_stevemunene_3718642_an-worker1142.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1143.eqiad.wmnet with OS bullseye completed:

  • an-worker1143 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181504_stevemunene_3720801_an-worker1143.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1144.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1145.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1144.eqiad.wmnet with OS bullseye completed:

  • an-worker1144 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181601_stevemunene_3736949_an-worker1144.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1145.eqiad.wmnet with OS bullseye completed:

  • an-worker1145 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309181617_stevemunene_3741759_an-worker1145.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1146.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1146.eqiad.wmnet with OS bullseye completed:

  • an-worker1146 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309190813_stevemunene_3919727_an-worker1146.out
    • Unable to run puppet on puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1147.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1148.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1147.eqiad.wmnet with OS bullseye completed:

  • an-worker1147 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309191001_stevemunene_3943227_an-worker1147.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1148.eqiad.wmnet with OS bullseye completed:

  • an-worker1148 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202309191015_stevemunene_3947579_an-worker1148.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (row E/F)

We have successfully completed the hadoop worker upgrades to Bullseye.

sudo cumin --no-progress a:hadoop-worker 'cat /etc/debian_version'
86 hosts will be targeted:
an-worker[1078-1095,1097-1156].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
OK to proceed on 86 hosts? Enter the number of affected hosts to confirm or "q" to quit: 86
===== NODE GROUP =====
(86) an-worker[1078-1095,1097-1156].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
11.7
================
100.0% (86/86) success ratio (>= 100.0% threshold) for command: 'cat /etc/debian_version'.
100.0% (86/86) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
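
Debian 11 is bullseye, so the single 11.7 output group confirms that all 86 hosts are upgraded. A minimal sketch of the same check for one version string (is_bullseye is a hypothetical helper):

```shell
# Return 0 if a /etc/debian_version string belongs to Debian 11 (bullseye).
is_bullseye() {
    case $1 in
        11|11.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage on a worker:
#   is_bullseye "$(cat /etc/debian_version)" && echo "bullseye"
```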

Excellent!