
cloudvirt1041: can't boot after reimage
Closed, Resolved, Public

Description

I reimaged cloudvirt1041 today, and after the Debian installer completed, the host would not boot.

Some commands I ran:

racadm>>getsel
[..]
Record:      4
Date/Time:   05/15/2024 11:02:02
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 1 device 0 function 0.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   05/15/2024 11:02:03
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 1 device 0 function 1.
-------------------------------------------------------------------------------
Record:      30
Date/Time:   05/15/2024 11:20:39
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 1 device 0 function 0.
-------------------------------------------------------------------------------
Record:      43
Date/Time:   05/15/2024 11:20:40
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 1 device 0 function 1.


racadm>>lclog view
[..]
FQDD            = System.Embedded.1
--------------------------------------------------------------------------------
SeqNumber       = 411
Message ID      = PCI1318
Category        = System
AgentID         = SEL
Severity        = Critical
Timestamp       = 2024-05-15 11:20:49
Message         = A fatal error was detected on a component at bus 1 device 0 function 1.
Message Arg   1 = 1
Message Arg   2 = 0
Message Arg   3 = 1
RawEventData    = 0x2B,0x00,0x02,0x88,0x9A,0x44,0x66,0xB1,0x00,0x04,0x13,0x38,0x6F,0xAC,0x01,0x01

Could something be wrong at the hardware level?
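The SEL entries above all name the same bus and device, with two different functions. A minimal sketch of how such records could be tallied per PCI address follows. The sample text is copied from the getsel output; the function name `fatal_pci_addresses`, the `0000` PCI domain, and the reading of the numbers as decimal are assumptions made for illustration.

```python
import re
from collections import Counter

# Two of the records from the getsel output in this task.
SEL_TEXT = """\
Record:      4
Date/Time:   05/15/2024 11:02:02
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 1 device 0 function 0.
-------------------------------------------------------------------------------
Record:      17
Date/Time:   05/15/2024 11:02:03
Source:      system
Severity:    Critical
Description: A fatal error was detected on a component at bus 1 device 0 function 1.
"""

BDF_RE = re.compile(r"bus (\d+) device (\d+) function (\d+)")


def fatal_pci_addresses(sel_text):
    """Count SEL descriptions per PCI address.

    Assumptions: PCI domain 0000, and bus/device/function are printed
    in decimal in the SEL message.
    """
    counts = Counter()
    for line in sel_text.splitlines():
        if not line.startswith("Description:"):
            continue
        match = BDF_RE.search(line)
        if match:
            bus, dev, fn = (int(g) for g in match.groups())
            # Format in the domain:bus:device.function style used by lspci.
            counts[f"0000:{bus:02x}:{dev:02x}.{fn:x}"] += 1
    return counts


counts = fatal_pci_addresses(SEL_TEXT)
for addr, n in sorted(counts.items()):
    print(addr, n)
```

On a booted host, `lspci -s 01:00.0` should show which device sits at that address, which may help decide whether the errors point at a specific card.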

Event Timeline

aborrero added a parent task: Unknown Object (Task). May 15 2024, 11:49 AM
aborrero moved this task from Inbox to Blocked on the cloud-services-team board.
aborrero added a subscriber: Jhancock.wm.

Hey @Jclark-ctr or @Jhancock.wm, could you please advise / help with this server? Thanks in advance.

Additional information: when reimaging the server, the Debian installer failed, complaining that the volume group name was already in use.

To work around the problem, I dropped into a Debian installer shell and deleted all three:

  • logical volumes, like root, swap, etc
  • volume groups, like vg0
  • physical volume, like /dev/md0

After that, the debian installer worked fine.

It was in the next reboot, after the debian installer completed, that the server failed to boot.

Maybe you can check if I messed up something with the disks or the RAID controller?
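For reference, the teardown described above has to happen in dependency order: logical volumes first, then the volume group, then the physical volume. The exact commands that were run are not recorded in this task, so the following is only a sketch that prints a plausible sequence without executing anything. The helper name `lvm_teardown_commands` is hypothetical; the names `vg0`, `root`, `swap` and `/dev/md0` are the examples given above.

```python
import shlex


def lvm_teardown_commands(vg, lvs, pvs):
    """Return LVM removal commands in dependency order: LVs, then the VG, then PVs."""
    cmds = [["lvremove", "-y", f"{vg}/{lv}"] for lv in lvs]
    cmds.append(["vgremove", "-y", vg])
    cmds.extend(["pvremove", "-y", pv] for pv in pvs)
    return cmds


# Example names taken from the list above; adjust to the actual layout.
commands = lvm_teardown_commands("vg0", ["root", "swap"], ["/dev/md0"])
for cmd in commands:
    # Dry run: print only, nothing is executed.
    print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch safe to try anywhere; on a real host each line would destroy data.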

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1041 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudvirt1041.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

@aborrero I am stuck right now; I attempted a reimage with no luck. I'm unsure which version of grub we have installed, but this looks like the same problem as this bug. @Papaul do you have any insight on this? https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=987008

I'm going to mess with this a bit today if it doesn't step on anyone's toes.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1041 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudvirt1041.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1041 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudvirt1041.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm

The main point of suspicion here is that the installer doesn't autoconfigure the network but asks me to specify a netmask. Cathal suggests that switching the NIC firmware to 21.85 might (possibly) get us closer to the expected behavior there. @Jclark-ctr is that something you can arrange?

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm executed with errors:

  • cloudvirt1041 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" cloudvirt1041.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm completed:

  • cloudvirt1041 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405291726_andrew_3548421_cloudvirt1041.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status failed -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Andrew added a subscriber: Jclark-ctr.

After a NIC firmware upgrade things seem to be working. It took a couple of tries (suspicious!), but now the host is imaged, and when I reboot it, it comes back up. I've tried both a reboot via ssh and a hard reboot from racadm.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-05-29T18:53:56Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T364984)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-05-29T18:54:02Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=99) (T364984)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-05-29T18:54:21Z] <andrew@cloudcumin1001> START - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (T364984)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-05-29T18:54:25Z] <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.openstack.cloudvirt.unset_maintenance (exit_code=99) (T364984)

This host is up and seems stable, but VMs running on it cannot reach the internet.

Since this host was being moved from a 2-nic to 1-nic setup, this doesn't shock me. Arturo, I'm guessing you have a final step to take here?

root@buildvm-75bedd50-9658-48fe-8fc4-ee61c0b4fad3:~# hostname
buildvm-75bedd50-9658-48fe-8fc4-ee61c0b4fad3
root@buildvm-75bedd50-9658-48fe-8fc4-ee61c0b4fad3:~# ping puppet
ping: puppet: Temporary failure in name resolution
root@buildvm-75bedd50-9658-48fe-8fc4-ee61c0b4fad3:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:29:22:74 brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    inet6 fe80::f816:3eff:fe29:2274/64 scope link 
       valid_lft forever preferred_lft forever
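The `ip addr` output above shows ens3 up with only a link-local IPv6 address and no IPv4 address, which would be consistent with the VM never getting a DHCP lease. A rough sketch of how that check could be automated is below; the sample is trimmed from the output above, and the function name `interfaces_without_ipv4` is made up for this example.

```python
import re

# Trimmed copy of the `ip addr` output from the VM above.
IP_ADDR_TEXT = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:29:22:74 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::f816:3eff:fe29:2274/64 scope link
"""

# Interface header lines look like "2: ens3: <FLAGS> ...".
IFACE_RE = re.compile(r"^\d+: ([^:@]+)")


def interfaces_without_ipv4(text):
    """Return interface names that have no 'inet' (IPv4) address line."""
    has_ipv4 = {}
    current = None
    for line in text.splitlines():
        header = IFACE_RE.match(line)
        if header:
            current = header.group(1)
            has_ipv4[current] = False
        elif current and line.strip().startswith("inet "):
            # "inet " with a trailing space excludes "inet6" lines.
            has_ipv4[current] = True
    return [name for name, ok in has_ipv4.items() if not ok]


missing = interfaces_without_ipv4(IP_ADDR_TEXT)
print(missing)
```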

I guess I never got to refresh the interfaces in netbox.

I just ran the cookbook:

aborrero@cumin1002:~ $ sudo cookbook sre.network.configure-switch-interfaces cloudvirt1041
Acquired lock for key /spicerack/locks/cookbooks/sre.network.configure-switch-interfaces: {'concurrency': 20, 'created': '2024-05-30 15:19:20.640556', 'owner': 'aborrero@cumin1002 [3727862]', 'ttl': 1800}
START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1041
----- OUTPUT of 'configure exclus...re;rollback;exit' -----
Entering configuration mode
[edit interfaces xe-0/0/24 unit 0 family ethernet-switching vlan]
-       members [ cloud-hosts1-eqiad cloud-private-d5-eqiad ];
+       members [ cloud-hosts1-eqiad cloud-instances2-b-eqiad cloud-private-d5-eqiad ];
load complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...re;rollback;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
==> Commit the above change?
Type "go" to proceed or "abort" to interrupt the execution
> go
User input is: "go"
----- OUTPUT of 'configure exclus...confirmed 1;exit' -----
Entering configuration mode
[edit interfaces xe-0/0/24 unit 0 family ethernet-switching vlan]
-       members [ cloud-hosts1-eqiad cloud-private-d5-eqiad ];
+       members [ cloud-hosts1-eqiad cloud-instances2-b-eqiad cloud-private-d5-eqiad ];
configuration check succeeds
commit confirmed will be automatically rolled back in 1 minutes unless confirmed
commit complete
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure exclus...confirmed 1;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Commited the above change, needs to be confirmed
----- OUTPUT of 'configure;commit check;exit' -----
Entering configuration mode
configuration check succeeds
Exiting configuration mode
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'configure;commit check;exit'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Change confirmed
No configuration change needed on the switch for eno2np1
Released lock for key /spicerack/locks/cookbooks/sre.network.configure-switch-interfaces: {'concurrency': 20, 'created': '2024-05-30 15:19:20.640556', 'owner': 'aborrero@cumin1002 [3727862]', 'ttl': 1800}
END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1041
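The relevant part of the cookbook output is the VLAN membership change on xe-0/0/24. As a small illustration, the two `members` lines can be compared as sets to see exactly what was added; the helper `vlan_members` is a hypothetical name, and the two strings are copied from the diff above.

```python
import re

# The "-" and "+" lines from the switch configuration diff above.
BEFORE = "members [ cloud-hosts1-eqiad cloud-private-d5-eqiad ];"
AFTER = "members [ cloud-hosts1-eqiad cloud-instances2-b-eqiad cloud-private-d5-eqiad ];"


def vlan_members(line):
    """Extract the set of VLAN names from a 'members [ ... ];' line."""
    match = re.search(r"members \[(.*?)\]", line)
    return set(match.group(1).split()) if match else set()


added = vlan_members(AFTER) - vlan_members(BEFORE)
removed = vlan_members(BEFORE) - vlan_members(AFTER)
print("added:", sorted(added))
print("removed:", sorted(removed))
```

The only difference is the added cloud-instances2-b-eqiad VLAN. Judging by its name it carries instance traffic, which would fit the VMs having no connectivity before the change, though that reading is an inference from this task rather than something the output states.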

@Andrew, please try again.

This host is now pooled and working properly.