
cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot
Closed, ResolvedPublic

Description

I need to rebuild these hosts with Bullseye but PXE boot fails. So far none of the cloudvirts I've tried are able to launch the Debian installer.

Both hosts are out of service and can be rebooted at any time.

(Update: according to current theory, there are nine cloudvirts affected by this issue: cloudvirt1016 through cloudvirt1024)

Event Timeline

Andrew renamed this task from cloudvirt1016 fails to PXE boot to cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot.Mar 8 2022, 5:33 PM
Andrew updated the task description.

It's hanging on the dhcp request:

Booting from QLogic MBA Slot 0100 v7.14.2

QLogic UNDI PXE-2.1 v7.14.2
Copyright (C) 2016 QLogic Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: 18 66 DA BB 31 72  GUID: 4C4C4544-0059-5310-804A-B8C04F384D32
DHCP....\
Andrew renamed this task from cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot to cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet and cloudvirt1021.eqiad.wmnet fail to PXE boot.Mar 8 2022, 6:49 PM
Andrew updated the task description.

@MoritzMuehlenhoff says this will be fixed with firmware updates; I'd suggest that we update the firmware for all cloudvirts, at least up through 1030.

Change 769102 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] Re-add DHCP term to labs-in filter

https://gerrit.wikimedia.org/r/769102

Change 769102 merged by jenkins-bot:

[operations/homer/public@master] Re-add DHCP term to labs-in filter

https://gerrit.wikimedia.org/r/769102

@Andrew safe to do this anytime?

All three hosts are out of service and can be rebooted at any time.

I am able to update the BIOS but these servers were not initially purchased with the 10G cards so the standard way of updating firmware cannot be completed. I will have to try a different approach.

Similar problems with cloudvirt1047: T293391

Change 769508 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/homer/public@master] policies/cr-labs: Allow tftp to install servers

https://gerrit.wikimedia.org/r/769508

Andrew renamed this task from cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet and cloudvirt1021.eqiad.wmnet fail to PXE boot to cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot.Mar 10 2022, 8:56 PM
Andrew updated the task description.

I'm putting 1021 back in service with the old OS for now. 1016 and 1017 are still fair game for experimentation.

I spent some time on cloudvirt1017 yesterday and was able to confirm that:

  • On the live host, tcpdump shows that sudo dhclient sends a DHCP request and receives a reply

cumin1001:~$ sudo cookbook sre.hosts.dhcp --os bullseye cloudvirt1017

ayounsi@cloudvirt1017:~$ sudo tcpdump port 67 or port 68 -v
tcpdump: listening on enp4s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:22:02.831338 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 328)
    0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from f4:e9:d4:ba:b7:40 (oui Unknown), length 300, xid 0x48acb64b, Flags [none]
	  Client-Ethernet-Address f4:e9:d4:ba:b7:40 (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Discover
	    Hostname Option 12, length 13: "cloudvirt1017"
	    Parameter-Request Option 55, length 13: 
	      Subnet-Mask, BR, Time-Zone, Default-Gateway
	      Domain-Name, Domain-Name-Server, Option 119, Hostname
	      Netbios-Name-Server, Netbios-Scope, MTU, Classless-Static-Route
	      NTP
08:22:02.916826 IP (tos 0x0, ttl 64, id 60754, offset 0, flags [none], proto UDP (17), length 321)
    ae2-1118.cr1-eqiad.wikimedia.org.bootps > cloudvirt1017.eqiad.wmnet.bootpc: BOOTP/DHCP, Reply, length 293, hops 1, xid 0x48acb64b, Flags [none]
	  Your-IP cloudvirt1017.eqiad.wmnet
	  Server-IP install1003.wikimedia.org
	  Gateway-IP ae2-1118.cr1-eqiad.wikimedia.org
	  Client-Ethernet-Address f4:e9:d4:ba:b7:40 (oui Unknown)
	  file "lpxelinux.0"
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Offer
	    Server-ID Option 54, length 4: install1003.wikimedia.org
	    Lease-Time Option 51, length 4: 43200
	    Subnet-Mask Option 1, length 4: 255.255.255.0
	    BR Option 28, length 4: 10.64.20.255
	    Default-Gateway Option 3, length 4: vrrp-gw-1118.eqiad.wmnet
	    Domain-Name Option 15, length 11: "eqiad.wmnet"
	    Domain-Name-Server Option 6, length 4: recdns.anycast.wmnet
  • When on the console (running the reimage cookbook), the host reboots and stays in the DHCP prompt until timeout
CLIENT MAC ADDR: F4 E9 D4 BA B7 40  GUID: 4C4C4544-0031-5210-8042-B3C04F4B4832
19:14 
DHCP...../
  • During that time, DHCP requests make it to install1003, and are correctly sent back toward the host from the routers (relay):
18:13:51.399115  In IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from f4:e9:d4:ba:b7:40, length 661
18:13:51.400918 Out IP truncated-ip - 419 bytes missing! ae2-1118.cr1-eqiad.wikimedia.org.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 441
18:22:44.102636  In 
	Juniper PCAP Flags [Ext, no-L2, In], PCAP Extension(s) total length 22
	  Device Media Type Extension TLV #3, length 1, value: Flexible-Ethernet-Services (52)
	  Logical Interface Encapsulation Extension TLV #6, length 1, value: Ethernet (14)
	  Device Interface Index Extension TLV #1, length 2, value: 179
	  Logical Interface Index Extension TLV #4, length 4, value: 412
	  Logical Unit Number Extension TLV #5, length 4, value: 1118
	-----original packet-----
	PFE proto 2 (ipv4): (tos 0x0, ttl  64, id 0, offset 0, flags [none], proto: UDP (17), length: 743) 0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from f4:e9:d4:ba:b7:40, length 715, xid 0xd5bab740, secs 4, Flags [Broadcast] (0x8000)
	  Client-Ethernet-Address f4:e9:d4:ba:b7:40
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Discover
	    Parameter-Request Option 55, length 24: 
	      Subnet-Mask, Time-Zone, Default-Gateway, IEN-Name-Server
	      Domain-Name-Server, RL, Hostname, BS
	      Domain-Name, SS, RP, EP
	      Vendor-Option, Server-ID, Vendor-Class, BF
	      Option 128, Option 129, Option 130, Option 131
	      Option 132, Option 133, Option 134, Option 135
	    MSZ Option 57, length 2: 1260
	    GUID Option 97, length 17: 0.68.69.76.76.49.0.16.82.128.66.179.192.79.75.72.50
	    ARCH Option 93, length 2: 0
	    NDI Option 94, length 3: 1.2.1
	    Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
	    Agent-Information Option 82, length 56: 
	      Circuit-ID SubOption 1, length 42: asw2-b-eqiad:xe-7/0/6.0:cloud-hosts1-eqiad
	      Remote-ID SubOption 2, length 10: xe-7/0/6.0
	    Agent-Information Option 82, length 51: 
	      Circuit-ID SubOption 1, length 42: cloudsw1-c8-eqiad:ae1.0:cloud-hosts1-eqiad
	      Remote-ID SubOption 2, length 5: ae1.0
	    Agent-Information Option 82, length 51: 
	      Circuit-ID SubOption 1, length 42: cloudsw1-d5-eqiad:ae0.0:cloud-hosts1-eqiad
	      Remote-ID SubOption 2, length 5: ae0.0
18:22:44.104407 Out 
	Juniper PCAP Flags [Ext], PCAP Extension(s) total length 22
	  Device Media Type Extension TLV #3, length 1, value: Flexible-Ethernet-Services (52)
	  Logical Interface Encapsulation Extension TLV #6, length 1, value: Ethernet (14)
	  Device Interface Index Extension TLV #1, length 2, value: 179
	  Logical Interface Index Extension TLV #4, length 4, value: 412
	  Logical Unit Number Extension TLV #5, length 4, value: 1118
	-----original packet-----
	84:18:88:0d:db:e2 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 487: vlan 1118, p 0, ethertype IPv4, (tos 0x0, ttl   1, id 48129, offset 0, flags [none], proto: UDP (17), length: 469) 10.64.20.3.67 > 255.255.255.255.68: [udp sum ok] BOOTP/DHCP, Reply, length 441, hops 1, xid 0xd5bab740, secs 4, Flags [Broadcast] (0x8000)
	  Your-IP 10.64.20.33
	  Server-IP 208.80.154.32
	  Gateway-IP 10.64.20.3
	  Client-Ethernet-Address f4:e9:d4:ba:b7:40
	  file "lpxelinux.0"
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Offer
	    Server-ID Option 54, length 4: 208.80.154.32
	    Lease-Time Option 51, length 4: 43200
	    Subnet-Mask Option 1, length 4: 255.255.255.0
	    Default-Gateway Option 3, length 4: 10.64.20.1
	    Domain-Name-Server Option 6, length 4: 10.3.0.1
	    Domain-Name Option 15, length 11: "eqiad.wmnet"
	    RP Option 17, length 10: "/tftpboot/"
	    Vendor-Option Option 43, length 82: 209.25.112.120.101.108.105.110.117.120.46.99.102.103.47.116.116.121.83.49.45.49.49.53.50.48.48.210.53.104.116.116.112.58.47.47.97.112.116.46.119.105.107.105.109.101.100.105.97.46.111.114.103.47.116.102.116.112.98.111.111.116.47.98.117.108.108.115.101.121.101.45.105.110.115.116.97.108.108.101.114.47
	    Agent-Information Option 82, length 56: 
	      Circuit-ID SubOption 1, length 42: asw2-b-eqiad:xe-7/0/6.0:cloud-hosts1-eqiad
	      Remote-ID SubOption 2, length 10: xe-7/0/6.0

All IPs there are correct.
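For reference, the Vendor-Option 43 payload in the offer above can be decoded to check that the boot parameters are also sane. This is only a sketch: it assumes the payload is a plain sequence of code/length/value sub-options carrying ASCII text, and the function name decode_suboptions is made up for illustration. The two codes that appear, 209 and 210, match the PXELINUX configuration-file and path-prefix options described in RFC 5071.

```python
from __future__ import annotations

# Option 43 payload copied from the DHCP offer above (dotted-decimal bytes).
VENDOR_OPTION_43 = (
    "209.25.112.120.101.108.105.110.117.120.46.99.102.103.47.116.116.121.83.49.45.49.49.53.50.48.48."
    "210.53.104.116.116.112.58.47.47.97.112.116.46.119.105.107.105.109.101.100.105.97.46.111.114.103."
    "47.116.102.116.112.98.111.111.116.47.98.117.108.108.115.101.121.101.45.105.110.115.116.97.108.108.101.114.47"
)


def decode_suboptions(dotted: str) -> dict[int, str]:
    """Split a dotted-decimal TLV blob into {sub-option code: ASCII value}.

    Hypothetical helper: assumes code/length/value encoding and ASCII values.
    """
    data = bytes(int(part) for part in dotted.split("."))
    result: dict[int, str] = {}
    pos = 0
    while pos + 2 <= len(data):
        code, length = data[pos], data[pos + 1]
        value = data[pos + 2 : pos + 2 + length]
        if len(value) != length:
            raise ValueError(f"sub-option {code} is truncated")
        result[code] = value.decode("ascii")
        pos += 2 + length
    return result


if __name__ == "__main__":
    for code, value in decode_suboptions(VENDOR_OPTION_43).items():
        print(code, value)
```

Run against the bytes above, this yields a pxelinux.cfg entry for the serial console and an installer path prefix under apt.wikimedia.org, which is consistent with the offer being well formed.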

Running tcpdump on install1003 and listening for traffic from 10.64.20.33 (cloudvirt1017) returns nothing, meaning the host is not trying to fetch the PXE files.

So far I interpret that as the host not registering the DHCP reply. The only slightly odd point is that the DHCP offer (reply) is sent by the routers to the broadcast MAC instead of to the host's own MAC.
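One possible reading of the broadcast destination, based on RFC 2131 rather than on anything confirmed in this task: the PXE client's Discover carries Flags [Broadcast] (0x8000), while the dhclient Discover carries Flags [none]. RFC 2131 asks servers and relays to broadcast replies to a client that sets that bit and has no address yet. A minimal sketch of that last-hop rule follows; the function name last_hop_destination is hypothetical, and the sketch ignores the ciaddr case since the client here has no IP.

```python
BROADCAST_FLAG = 0x8000  # top bit of the 16-bit BOOTP "flags" field (RFC 2131)
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def last_hop_destination(flags: int, client_mac: str) -> str:
    """Return the Ethernet destination for a reply to a client without an IP.

    Hypothetical helper illustrating the RFC 2131 rule: broadcast when the
    client set the broadcast bit, otherwise unicast to its hardware address.
    """
    if flags & BROADCAST_FLAG:
        return BROADCAST_MAC
    return client_mac


if __name__ == "__main__":
    mac = "f4:e9:d4:ba:b7:40"
    print("PXE ROM (flags 0x8000):", last_hop_destination(0x8000, mac))
    print("dhclient (flags 0x0000):", last_hop_destination(0x0000, mac))
```

Under that reading, the broadcast offer is expected behaviour for this client and not by itself a sign of a fault.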

I'm not very knowledgeable on this subject, so I may not be making sense, but some things are not clear to me:

  • I see that the host has two interfaces connected to that switch (enp4s0f1 and enp4s0f0), from which one of them (enp4s0f1) is tagged.
    • Could it be that there's some multi-path issues going on?
    • Could it be that the DHCP replies are arriving at the wrong interface?
    • How does the tagged interface behave during this process? (I imagine that the switch config does not change, so the switch will send VLAN tagged packets down the link, though the host as it's still booting up might not unwrap them, is that correct?)
  • The captures are from the router, right? Can they be done on the asw2-b7-eqiad switch outgoing interface? (xe-7/0/6.0, the last step to the host)
  • I see this on the pcap:
	    Agent-Information Option 82, length 56: 
	      Circuit-ID SubOption 1, length 42: asw2-b-eqiad:xe-7/0/6.0:cloud-hosts1-eqiad
	      Remote-ID SubOption 2, length 10: xe-7/0/6.0
	    Agent-Information Option 82, length 51: 
	      Circuit-ID SubOption 1, length 42: cloudsw1-c8-eqiad:ae1.0:cloud-hosts1-eqiad
	      Remote-ID SubOption 2, length 5: ae1.0
	    Agent-Information Option 82, length 51: 
	      Circuit-ID SubOption 1, length 42: cloudsw1-d5-eqiad:ae0.0:cloud-hosts1-eqiad
	      Remote-ID SubOption 2, length 5: ae0.0

but cloudvirt1017 is not connected to the cloud switches (according to netbox), why are they there?

Some progress:

install1003 DHCP discover
09:19:22.791939 IP (tos 0x0, ttl 64, id 14579, offset 0, flags [none], proto UDP (17), length 689)
    ae2-1118.cr1-eqiad.wikimedia.org.bootps > install1003.wikimedia.org.bootps: BOOTP/DHCP, Request from f4:e9:d4:ba:b7:40 (oui Unknown), length 661, hops 1, xid 0xd9bab740, secs 64, Flags [Broadcast]
	  Gateway-IP ae2-1118.cr1-eqiad.wikimedia.org
	  Client-Ethernet-Address f4:e9:d4:ba:b7:40 (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Discover
	    Parameter-Request Option 55, length 24: 
	      Subnet-Mask, Time-Zone, Default-Gateway, IEN-Name-Server
	      Domain-Name-Server, RL, Hostname, BS
	      Domain-Name, SS, RP, EP
	      Vendor-Option, Server-ID, Vendor-Class, BF
	      Option 128, Option 129, Option 130, Option 131
	      Option 132, Option 133, Option 134, Option 135
	    MSZ Option 57, length 2: 1260
	    GUID Option 97, length 17: 0.68.69.76.76.49.0.16.82.128.66.179.192.79.75.72.50
	    ARCH Option 93, length 2: 0
	    NDI Option 94, length 3: 1.2.1
	    Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
	    Agent-Information Option 82, length 56: 
	      Circuit-ID SubOption 1, length 42: asw2-b-eqiad:xe-7/0/6.0:cloud-hosts1-eqiad
	      Remote-ID SubOption 2, length 10: xe-7/0/6.0
	    Agent-Information Option 82, length 51: 
	      Circuit-ID SubOption 1, length 42: cloudsw1-c8-eqiad:ae1.0:cloud-hosts1-eqiad
	      Remote-ID SubOption 2, length 5: ae1.0
install1003 DHCP offer
09:19:22.792802 IP (tos 0x0, ttl 64, id 34734, offset 0, flags [DF], proto UDP (17), length 469)
    install1003.wikimedia.org.bootps > ae2-1118.cr1-eqiad.wikimedia.org.bootps: BOOTP/DHCP, Reply, length 441, hops 1, xid 0xd9bab740, secs 64, Flags [Broadcast]
	  Your-IP cloudvirt1017.eqiad.wmnet
	  Server-IP install1003.wikimedia.org
	  Gateway-IP ae2-1118.cr1-eqiad.wikimedia.org
	  Client-Ethernet-Address f4:e9:d4:ba:b7:40 (oui Unknown)
	  file "lpxelinux.0"
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Offer
	    Server-ID Option 54, length 4: install1003.wikimedia.org
	    Lease-Time Option 51, length 4: 43200
	    Subnet-Mask Option 1, length 4: 255.255.255.0
	    Default-Gateway Option 3, length 4: vrrp-gw-1118.eqiad.wmnet
	    Domain-Name-Server Option 6, length 4: recdns.anycast.wmnet
	    Domain-Name Option 15, length 11: "eqiad.wmnet"
	    RP Option 17, length 10: "/tftpboot/"
	    Vendor-Option Option 43, length 82: 209.25.112.120.101.108.105.110.117.120.46.99.102.103.47.116.116.121.83.49.45.49.49.53.50.48.48.210.53.104.116.116.112.58.47.47.97.112.116.46.119.105.107.105.109.101.100.105.97.46.111.114.103.47.116.102.116.112.98.111.111.116.47.98.117.108.108.115.101.121.101.45.105.110.115.116.97.108.108.101.114.47
	    Agent-Information Option 82, length 56: 
	      Circuit-ID SubOption 1, length 42: asw2-b-eqiad:xe-7/0/6.0:cloud-hosts1-eqiad
	      Remote-ID SubOption 2, length 10: xe-7/0/6.0
install1003 DHCP Request
09:19:26.941687 IP (tos 0x0, ttl 64, id 17094, offset 0, flags [none], proto UDP (17), length 629)
    ae2-1118.cr1-eqiad.wikimedia.org.bootps > install1003.wikimedia.org.bootps: BOOTP/DHCP, Request from f4:e9:d4:ba:b7:40 (oui Unknown), length 601, hops 1, xid 0xd9bab740, secs 64, Flags [Broadcast]
	  Gateway-IP ae2-1118.cr1-eqiad.wikimedia.org
	  Client-Ethernet-Address f4:e9:d4:ba:b7:40 (oui Unknown)
	  Vendor-rfc1048 Extensions
	    Magic Cookie 0x63825363
	    DHCP-Message Option 53, length 1: Request
	    Requested-IP Option 50, length 4: cloudvirt1017.eqiad.wmnet
	    Parameter-Request Option 55, length 24: 
	      Subnet-Mask, Time-Zone, Default-Gateway, IEN-Name-Server
	      Domain-Name-Server, RL, Hostname, BS
	      Domain-Name, SS, RP, EP
	      Vendor-Option, Server-ID, Vendor-Class, BF
	      Option 128, Option 129, Option 130, Option 131
	      Option 132, Option 133, Option 134, Option 135
	    MSZ Option 57, length 2: 1260
	    Server-ID Option 54, length 4: install1003.wikimedia.org
	    GUID Option 97, length 17: 0.68.69.76.76.49.0.16.82.128.66.179.192.79.75.72.50
	    ARCH Option 93, length 2: 0
	    NDI Option 94, length 3: 1.2.1
	    Vendor-Class Option 60, length 32: "PXEClient:Arch:00000:UNDI:002001"
	    Agent-Information Option 82, length 51: 
	      Circuit-ID SubOption 1, length 42: cloudsw1-c8-eqiad:ae1.0:cloud-hosts1-eqiad
	      Remote-ID SubOption 2, length 5: ae1.0

install1003 doesn't reply to this with a DHCP ACK, most likely because option 82 is incorrect.
My guess at this point is that, because of a different code version, asw2-b-eqiad doesn't set option 82 on DHCP requests, while cloudsw1-c8-eqiad (on the path) does.
From here, there are different avenues to explore:

  • Having the DHCP server ignore option 82 on DHCP requests only (/cc @Volans)
  • Having the cloudsw not add option 82 on DHCP requests - not possible
  • Having asw2-b-eqiad add option 82 on DHCP requests - Very heavy as it requires a row downtime for upgrade
  • Move cloud-hosts servers from row B to cloud racks before re-image - heavy on DCops time, ideal on the longer term (/cc @wiki_willy )
  • Temporarily disable option 82 on cloudsw1 (when needed), which I tried and finally got to a Partition disks prompt (/cc @cmooney)
    • Easiest workaround, run cloudsw1-c8-eqiad# deactivate vlans cloud-hosts1-eqiad forwarding-options dhcp-security option-82 before the re-image
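The option 82 mismatch described above can be made explicit by comparing the Circuit-IDs seen by install1003 in the Discover and in the Request. This is a small sketch using the values from the captures; the helper names are hypothetical, and it only shows which relay is missing. Whether the DHCP server really withholds the ACK for this reason is the hypothesis stated above, not something the code proves.

```python
# Circuit-IDs copied from the install1003 captures above.
DISCOVER_CIRCUIT_IDS = [
    "asw2-b-eqiad:xe-7/0/6.0:cloud-hosts1-eqiad",
    "cloudsw1-c8-eqiad:ae1.0:cloud-hosts1-eqiad",
]
REQUEST_CIRCUIT_IDS = [
    "cloudsw1-c8-eqiad:ae1.0:cloud-hosts1-eqiad",
]


def relay_switches(circuit_ids):
    """Return the switch names (text before the first ':') in packet order."""
    return [cid.split(":", 1)[0] for cid in circuit_ids]


def missing_in_request(discover_ids, request_ids):
    """List switches that tagged the Discover but not the Request."""
    seen = set(relay_switches(request_ids))
    return [sw for sw in relay_switches(discover_ids) if sw not in seen]


if __name__ == "__main__":
    print(missing_in_request(DISCOVER_CIRCUIT_IDS, REQUEST_CIRCUIT_IDS))
```

With the captured values, the only switch missing from the Request is asw2-b-eqiad, which matches the guess that it does not add option 82 to DHCP requests.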

Is that generally harmless? If so, does it need to be a temporary change rather than a permanent change? And, can you also feed me the command to undo the deactivation?

Thank you for your work on this!

Yeah, it's totally harmless; the only downside is that DHCP won't work on hosts directly connected to cloudsw1-c8-eqiad.

The undo is activate vlans cloud-hosts1-eqiad forwarding-options dhcp-security option-82, followed by a commit in both cases. Don't hesitate to ping any of us to do the change if needed.

For this round of reimaging I'm happy to just edit the options while reimaging, but

  • I'll want to do this myself so I don't have to ping @ayounsi for each of the 30 servers that need upgrading, and
  • In the long run we really need to establish a re-image path that works 'normally' via the reimage cookbook. Either the cookbook needs to automatically adjust the cloudgw settings or we need to fix dhcp so it always works on these hosts. I don't want someone else to have to re-learn this lesson in 2024 when it comes time to reimage the fleet again.

For the record: reimaging this host worked properly on the 14th after Arzhel applied the suggested hack. Today, though, the host can no longer be re-imaged; it hangs when trying to pxe boot. So I suspect that whatever fix was put in place has reverted... that, or we have yet a new problem I guess?

IDRAC and BIOS are up to date on cloudvirt1016

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1016.eqiad.wmnet with OS bullseye

@Andrew 1016 is now able to PXE boot. I stopped the OS install because I am getting the error below. I think you can fix this. Thanks.

  │ Before the Logical Volume Manager can be configured, the current        │
  │ partitioning scheme has to be written to disk. These changes cannot     │
  │ be undone.                                                              │
  │                                                                         │
  │ After the Logical Volume Manager is configured, no additional changes   │
  │ to the partitioning scheme of disks containing physical volumes are     │
  │ allowed during the installation. Please decide if you are satisfied

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1016.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1016 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

@Papaul I fixed a few things and now 1016 is reimaging smoothly. Thank you!

FYI, I got a few emails like this one regarding cloudvirt1016 this weekend:

Date: Sat, 19 Mar 2022 04:07:39 +0000
From: root <root@cloudvirt1016.eqiad.wmnet>
To: root@cloudvirt1016.eqiad.wmnet
Subject: SMART error (CurrentPendingSector) detected on host: cloudvirt1016

This message was generated by the smartd daemon running on:

   host name:  cloudvirt1016
   DNS domain: eqiad.wmnet

The following warning/error was logged by the smartd daemon:

Device: /dev/bus/0 [megaraid_disk_08] [SAT], 2 Currently unreadable (pending) sectors

Device info:
INTEL SSDSC2BX016T4R, S/N:BTHC7112043N1P6PGN, WWN:5-5cd2e4-14dae93f8, FW:G201DL2D, 1.60 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.

Not sure if it's related, though.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1024 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Based on the list @Andrew provided me on IRC of hosts he was able to re-image (1016, 1017, 1019, 1022, 1023), and on the hosts that cannot be re-imaged (1024, 1025, 1026), my first observation is that any host connected to cloudsw[1-2]-c8, cloudsw[1-2]-d5 or asw-b2 will either PXE boot and then fail with "Failed to load ldlinux.c32" or will not PXE boot at all. On the other hand, any host connected to asw2-b7 or asw2-b4 will work fine. I'm waiting to do one last test to be 100% sure of this.

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1024 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1024 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • New OS is buster but bullseye was requested
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

I asked @Cmjohnson to connect cloudvirt1024 to asw2-b4 yesterday for testing; the result was the same:

Failed to load ldlinux.c32

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1024 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1025 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1026 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye executed with errors:

  • cloudvirt1047 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye completed:

  • cloudvirt1024 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203222323_pt1979_2760646_cloudvirt1024.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@Andrew Any other issues with 1016 and 1017? If not, can we please close this task?

Thanks.

Change 769508 abandoned by Majavah:

[operations/homer/public@master] policies/cr-labs: Allow tftp to install servers

Reason:

https://gerrit.wikimedia.org/r/769508