Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts
Closed, Resolved (Public)

Description

This task will track the racking, setup, and OS installation of cloudrabbit100[1-3], cloudnet100[5,6], and cloudservices1005.

Hostname / Racking / Installation Details

Hostnames: cloudrabbit100[1-3]
Racking Proposal: Cannot be racked in the same rack. Anywhere A-D should be fine.
Networking Setup: 1 10G connection to public vlan
Partitioning/Raid: sw raid 10 (all four drives) "raid10-4dev.cfg"
OS Distro: Bullseye

Hostnames: cloudnet100[5,6]
Racking Proposal: Cannot be racked in the same rack. Use WMCS racks C8/D5 as they should be adjacent to cloudgw hosts.
Networking Setup: 2 10G connections. 1st to cloud-hosts1-eqiad, 2nd a TRUNK with 2 vlans (cloud-gw-transport and cloud-instance-transport)
Partitioning/Raid: sw raid 10 (all four drives)
OS Distro: Bullseye

Hostnames: cloudservices1005
Racking Proposal: E/F not ok. Anywhere A-D should be fine.
Networking Setup: 10g, public1 VLAN
Partitioning/Raid: sw raid 10 (all four drives)
OS Distro: Bullseye
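
The bracket shorthand used in the hostname lines above can be expanded mechanically. A minimal sketch, assuming a single bracket group containing comma-separated single-digit values or ranges (the helper name `expand_hosts` is made up for illustration and is not part of any WMF tooling):

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracket group, e.g. 'cloudnet100[5,6]' or 'cloudrabbit100[1-3]'."""
    match = re.fullmatch(r"(.*)\[([^\]]+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no bracket group: already a single hostname
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            # Inclusive numeric range such as "1-3"
            start, end = part.split("-")
            for number in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{number}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts


all_hosts = []
for pattern in ("cloudrabbit100[1-3]", "cloudnet100[5,6]", "cloudservices1005"):
    all_hosts.extend(expand_hosts(pattern))
print(all_hosts)
```

This yields the six hosts that each get their own checklist below.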

Per host setup checklist

Each host should have its own setup checklist copied and pasted into the list below.

cloudrabbit1001:
  • - receive in system on procurement task T303415 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudrabbit1002:
  • - receive in system on procurement task T303415 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudrabbit1003:
  • - receive in system on procurement task T303415 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudnet1005:
  • - receive in system on procurement task T303415 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudnet1006:
  • - receive in system on procurement task T303415 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.
cloudservices1005:
  • - receive in system on procurement task T303415 & in coupa
  • - rack system with proposed racking plan (see above) & update netbox (include all system info plus location, state of planned)
  • - add mgmt dns (asset tag and hostname) and production dns entries in netbox, run cookbook sre.dns.netbox.
  • - network port setup via netbox, run homer from an active cumin host to commit
  • - bios/drac/serial setup/testing, see Lifecycle Steps & Automatic BIOS setup details
  • - firmware update (idrac, bios, network, raid controller)
  • - operations/puppet update - this should include updates to netboot.pp, and site.pp role(insetup) or cp systems use role(insetup::nofirm).
  • - OS installation & initial puppet run via sre.hosts.reimage cookbook.

Event Timeline

Change 811771 merged by Cmjohnson:

[operations/puppet@production] adding new wmcs hosts to netboot.cfg

https://gerrit.wikimedia.org/r/811771

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye executed with errors:

  • cloudrabbit1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudservices1005.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye executed with errors:

  • cloudrabbit1001 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

I am getting this on all but the cloudnets, those are not hitting the installer.

┌────────────────────┤ [!!] Configure the network ├─────────────────────┐
│                                                                       │
│                   Network autoconfiguration failed                    │
│ Your network is probably not using the DHCP protocol. Alternatively,  │
│ the DHCP server may be slow or some network hardware is not working   │
│ properly.                                                             │
│

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1002.wikimedia.org with OS bullseye executed with errors:

  • cloudrabbit1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1003.wikimedia.org with OS bullseye executed with errors:

  • cloudrabbit1003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1005.eqiad.wmnet with OS bullseye executed with errors:

  • cloudnet1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudnet1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudservices1005.wikimedia.org with OS bullseye executed with errors:

  • cloudservices1005 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@Cmjohnson Hey. Drop me a line on this one perhaps.

The issue is that the cloudnet assigned IPs do not seem to match the Vlans they have been assigned to. This has alerted in the Netbox report:

image.png (181×1 px, 72 KB)

The fix should be relatively straightforward. For instance, cloudnet1005 has been correctly assigned IPs matching the 'cloud-hosts1-c8-eqiad (1128)' Vlan, so changing the Vlan on cloudsw1-c8-eqiad xe-0/0/3 from 'cloud-hosts1-eqiad (1118)' to 1128 should fix it.

But what I'm more concerned with is how this discrepancy happened. We'd reworked the Netbox provisioning script so it should pick the rack-specific Vlan for new hosts. It's picked the IPs from there, but I can't understand why the switch port has then been assigned to the old Vlan. So rather than just jumping in and fixing it manually, I want to try and work out why the script didn't work as intended and fix that. Thanks!
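
For illustration, the kind of consistency check that flags this mismatch can be sketched as follows. This is only a rough sketch, not the actual Netbox report: the prefixes and the host address are made-up documentation ranges (the real ones live in Netbox and are not given in this task), and `port_matches_ip` is a hypothetical helper.

```python
import ipaddress

# Hypothetical VLAN-to-prefix table. Only the VLAN IDs come from this task;
# the prefixes are placeholder documentation ranges.
VLAN_PREFIXES = {
    1118: ipaddress.ip_network("192.0.2.0/26"),   # stands in for cloud-hosts1-eqiad
    1128: ipaddress.ip_network("192.0.2.64/26"),  # stands in for cloud-hosts1-c8-eqiad
}


def port_matches_ip(host_ip: str, port_vlan: int) -> bool:
    """Return True if the host IP belongs to the prefix of the VLAN set on its switch port."""
    return ipaddress.ip_address(host_ip) in VLAN_PREFIXES[port_vlan]


host_ip = "192.0.2.70"  # hypothetical address allocated from the rack-specific VLAN

print(port_matches_ip(host_ip, 1118))  # False: port left on the old, row-wide VLAN
print(port_matches_ip(host_ip, 1128))  # True: port moved to the rack-specific VLAN
```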

Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye

@Cmjohnson just an update here. I left cloudnet1005 alone, so we can piece back why the switch ports ended up on the wrong vlans (I unfortunately couldn't find related logs in Netbox to see what the provision script there did).

I manually changed port xe-0/0/11 on cloudsw1-d5-eqiad to Vlan 1127 / 'cloud-hosts1-d5-eqiad' and re-tried the image to see if there were any other niggles (these are the first to be reimaged in this rack since the cloud network re-design).

What I found was that DHCP worked fine when initiated by the iDRAC/PXE-boot. The system got an IP address from the install server and the Debian installer started running.

However, when the Debian installer went to do its own DHCP request, which should work the same way, it failed. Looking on both the switch and the iDRAC GUI I can see that both server NIC ports remain hard down at this point. So obviously the DHCP request fails as the connection to the switch has gone down.

I'm at a loss to explain why the port was working and then goes down during the debian-installer phase. Could it be related to the NIC firmware or something? What I can say is that the DHCP config on the switch appears to be valid and working as expected.

@cmooney the cloudnet servers were manually moved in netbox, so I don't know if the script would've picked up the vlan change. I find it interesting that you fixed the cloudnet vlan issue and the server is still experiencing the same issue as the others.

@Cmjohnson ok. Is it possible that when you moved them you selected the wrong Vlan?

If the script is assigning IPs from one Vlan, but configuring the switches for a different one, that's a big problem we need to sort out in the script. On the other hand if the inconsistency was just a manual error during the move then it's no issue.

As for why the debian-installer stage is failing DHCP, I'm not sure. Both NICs remain hard down throughout. I think the first step should probably be to look at the NIC firmware version and get it on the known best, which I think based on T304483#8032810 is 21.85.21.92, but you guys probably know better than me. Currently it's on 21.40.25.31.
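
As a small aside, comparing dotted firmware versions like these is safest when done numerically, field by field. A minimal sketch, assuming the versions are purely numeric and dot-separated (`parse_version` is a hypothetical helper, not part of any existing tooling):

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '21.85.21.92' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


TARGET = "21.85.21.92"   # version cited above as the likely known-good one
current = "21.40.25.31"  # version currently reported on the NIC

needs_update = parse_version(current) < parse_version(TARGET)
print(needs_update)  # True
```

A plain string comparison happens to give the same answer for these two values, but it would rank "21.9" above "21.85", which is why the fields are converted to integers first.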

Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudnet1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

@cmooney it is most likely a manual change error. I did not completely delete the interface after removing the cloudcephosd hosts, I only updated it with the new vlan for the cloudnets. In the future, I will delete the interface entirely and start over.

As for the nic firmware, I am updating everything but 1006, I think you may have already done that. I do not think that will fix the issue.

@cmooney I believe I found the error: in site.pp I failed to put a ^ before the hostname.
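
To show what the ^ anchor changes in general, here is a rough sketch. The actual site.pp entry is not shown in this task, so both patterns and the second hostname below are assumptions made up for the example. Puppet matches node names with Ruby regular expressions; Python's `re.search` is used here only because it behaves the same way for this simple case.

```python
import re

# Hypothetical patterns; the real site.pp entry is not quoted in this task.
unanchored = re.compile(r"cloudnet100[56]\.eqiad\.wmnet")
anchored = re.compile(r"^cloudnet100[56]\.eqiad\.wmnet$")

# The second name is a made-up host that merely contains the first as a substring.
for fqdn in ("cloudnet1005.eqiad.wmnet", "xcloudnet1005.eqiad.wmnet"):
    print(fqdn, bool(unanchored.search(fqdn)), bool(anchored.search(fqdn)))
```

The unanchored pattern matches both names, while the anchored one matches only the exact hostname.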

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudrabbit1003.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudservices1005.wikimedia.org with OS bullseye

Change 812033 had a related patch set uploaded (by Cmjohnson; author: Cmjohnson):

[operations/puppet@production] updating site.pp entry cloudnet1005-6

https://gerrit.wikimedia.org/r/812033

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1001.wikimedia.org with OS bullseye completed:

  • cloudrabbit1001 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207071559_cmjohnson_1613294_cloudrabbit1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Change 812033 merged by Cmjohnson:

[operations/puppet@production] updating site.pp entry cloudnet1005-6

https://gerrit.wikimedia.org/r/812033

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1002.wikimedia.org with OS bullseye completed:

  • cloudrabbit1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207071613_cmjohnson_1614917_cloudrabbit1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudrabbit1003.wikimedia.org with OS bullseye completed:

  • cloudrabbit1003 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207071614_cmjohnson_1615069_cloudrabbit1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudnet1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudservices1005.wikimedia.org with OS bullseye completed:

  • cloudservices1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207071625_cmjohnson_1618440_cloudservices1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

all but the cloudnets installed correctly, they're still presenting the dhcp error. I am thinking I may just blow out all the network configuration and delete the ports and start over. @cmooney

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1005.eqiad.wmnet with OS bullseye completed:

  • cloudnet1005 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207071831_cmjohnson_1648841_cloudnet1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudnet1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudnet1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudnet1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

All but cloudnet1006 have gone through the installer; cloudnet1006 is still giving the DHCP error. I did try deleting all the ports and starting over, but that did not seem to work.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudnet1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

all but the cloudnets installed correctly, they're still presenting the dhcp error. I am thinking I may just blow out all the network configuration and delete the ports and start over. @cmooney

I wouldn't be super confident that will help. When I was checking last week all the network elements were set up right, and the fact that the hosts make it to the Debian installer basically confirms that. So it's definitely something on the NIC/firmware/driver side, I suspect.

Just for the record cloudnet1005 did seem to install ok. Or at least DHCP did not fail at PXE or debian-installer stage.

It's using NIC firmware 21.85.21.92 though. Cloudnet1006 is the same exact hardware as I understand, but is still failing. It's still on firmware 21.40.25.31 so I reckon the upgrade is likely to work.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye executed with errors:

  • cloudnet1006 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • The reimage failed, see the cookbook logs for the details

@cmooney the cloudnet1006 NIC f/w was updated but the install still fails. If you get a moment, can you take a look? I am not sure what I am missing.

There is an OS on the server, but it has not gone through puppet and I am unable to ssh in.

@Papaul or @RobH I don't know what I am doing wrong with cloudnet1006, the installer fails fairly early in the process. There is a current OS on it that was not finalized with puppet. If you get a spare moment can you take a look

@Cmjohnson if there is a current OS on it and was not finalized with puppet, try to re-run the cookbook with the --no-pxe --new flags.

Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye completed:

  • cloudnet1006 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207221256_cmjohnson_2871348_cloudnet1006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> staged
Cmjohnson updated the task description.

Thanks @Papaul that worked. @Andrew all yours!

Change 826352 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Make Cloudservices1005 a designate node

https://gerrit.wikimedia.org/r/826352

Change 826352 merged by Andrew Bogott:

[operations/puppet@production] Make Cloudservices1005 a designate node

https://gerrit.wikimedia.org/r/826352

Change 826358 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudservices1005: hack in a temporary resolver fqdn

https://gerrit.wikimedia.org/r/826358

Change 826358 merged by Andrew Bogott:

[operations/puppet@production] cloudservices1005: hack in a temporary resolver fqdn

https://gerrit.wikimedia.org/r/826358

Change 826364 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Add cloudservices1005 to the list of designate hosts

https://gerrit.wikimedia.org/r/826364

Change 826364 merged by Andrew Bogott:

[operations/puppet@production] Add cloudservices1005 to the list of designate hosts

https://gerrit.wikimedia.org/r/826364

Change 826378 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudservices1005 will replace ns0 rather than ns1.

https://gerrit.wikimedia.org/r/826378

Change 826378 merged by Andrew Bogott:

[operations/puppet@production] cloudservices1005 will replace ns0 rather than ns1.

https://gerrit.wikimedia.org/r/826378

Change 826387 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Replace cloudservices1003 with cloudservices1005

https://gerrit.wikimedia.org/r/826387

Change 826388 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/dns@master] Replace cloudservices1003 with cloudservices1005 for ns0

https://gerrit.wikimedia.org/r/826388

Change 826387 merged by Andrew Bogott:

[operations/puppet@production] Replace cloudservices1003 with cloudservices1005

https://gerrit.wikimedia.org/r/826387

Change 826388 merged by Andrew Bogott:

[operations/dns@master] Replace cloudservices1003 with cloudservices1005 for ns0

https://gerrit.wikimedia.org/r/826388

Change 826393 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Remove temporary ns2 def for cloudservices1005

https://gerrit.wikimedia.org/r/826393

Change 826393 merged by Andrew Bogott:

[operations/puppet@production] Remove temporary ns2 def for cloudservices1005

https://gerrit.wikimedia.org/r/826393

Mentioned in SAL (#wikimedia-cloud) [2022-08-24T22:07:10Z] <andrewbogott> replaced cloudservices1003 with cloudservices1005 T304888