Possible firmware issues reimaging mw2282
Closed, Resolved · Public

Description

When reimaging this host to bullseye, it fails to PXE boot. The IPMI command succeeds and the host can be seen attempting to PXE boot, but it eventually gives up and boots from the OS on disk. When tcpdump is run on the install server, no traffic from the host can be seen. Infrastructure Foundations indicate that this may be a firmware issue with this particular host that needs correcting.

ops-codfw please let us know if there is a firmware update for this host, so we can have another go at reimaging
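For the tcpdump check on the install server described above, it can help to filter on the host's MAC address so that DHCP/PXE requests are not lost among other traffic. A minimal sketch, assuming the link-local address reported for eno1 later in this task is EUI-64 derived and that the same NIC is used for PXE; the function names are hypothetical and not part of any existing tooling:

```python
import ipaddress


def mac_from_eui64_link_local(addr: str) -> str:
    """Recover the MAC address from an EUI-64 derived IPv6 address."""
    interface_id = ipaddress.IPv6Address(addr).packed[8:]
    # EUI-64 interface IDs carry ff:fe in the middle of the MAC.
    if interface_id[3:5] != b"\xff\xfe":
        raise ValueError("interface ID is not EUI-64 derived")
    # Flip the universal/local bit back and drop the ff:fe filler.
    mac = bytes([interface_id[0] ^ 0x02]) + interface_id[1:3] + interface_id[5:8]
    return ":".join(f"{byte:02x}" for byte in mac)


def dhcp_capture_filter(mac: str) -> str:
    """Build a pcap filter matching DHCP traffic to or from one MAC."""
    return f"ether host {mac} and (udp port 67 or udp port 68)"


mac = mac_from_eui64_link_local("fe80::d294:66ff:fe3c:b741")
print(dhcp_capture_filter(mac))
```

The printed expression can be passed to tcpdump on the install server, for example `tcpdump -ni <interface> '<filter>'`, where the interface name depends on the server.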

Event Timeline

I set the host as inactive since I noticed a bit of log spam on lvs2013

Feb 8 08:40:59 lvs2013 pybal[2489063]: [eventgate-analytics_4592 IdleConnection] WARN: mw2282.codfw.wmnet (enabled/down/not pooled): Connection to 10.192.48.104:4592 failed.

I think there may be an issue here with the cable (usually the NIC firmware issue hits us when the debian-installer does its DHCP request, rather than at the initial PXE boot stage).

The switch it's connected to is currently not detecting a link on that port:

cmooney@asw-d-codfw> show interfaces descriptions | match mw2282 
ge-4/0/4        up    down mw2282

However quite unusually the system itself does detect the link and shows as 'UP':

root@mw2282:~# ip -br addr show dev eno1
eno1             UP             10.192.48.104/22 2620:0:860:104:10:192:48:104/64 fe80::d294:66ff:fe3c:b741/64
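A disagreement like this, with the switch reporting the link down while the host reports it UP, can also be checked mechanically. A minimal sketch, assuming the two outputs are captured as text in the formats shown above; the function names are hypothetical and not part of any existing tooling:

```python
def switch_link_state(description_row: str) -> str:
    """Return the link column of a 'show interfaces descriptions' row.

    Columns are: interface, admin state, link state, description.
    """
    return description_row.split()[2].lower()


def host_link_state(ip_brief_row: str) -> str:
    """Return the state column of an 'ip -br addr show' row."""
    return ip_brief_row.split()[1].lower()


def link_states_disagree(description_row: str, ip_brief_row: str) -> bool:
    """True when exactly one end of the link reports it as up."""
    switch_up = switch_link_state(description_row) == "up"
    host_up = host_link_state(ip_brief_row) == "up"
    return switch_up != host_up


switch_row = "ge-4/0/4        up    down mw2282"
host_row = (
    "eno1             UP             10.192.48.104/22 "
    "2620:0:860:104:10:192:48:104/64 fe80::d294:66ff:fe3c:b741/64"
)
print(link_states_disagree(switch_row, host_row))
```

With the two rows pasted above this reports a disagreement, which points at the physical layer (cable or SFP) rather than at host configuration.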

I tried bouncing the port but it had no effect. @Jhancock.wm if you get a moment a first step here might be to check or replace the cable for this host and see if the switch port comes up. Thanks.

@cmooney the SFP failed. I've replaced it and it looks to be up now.

Good catch! Unfortunately I'm still seeing the same PXE failure on boot.

I can't seem to access the idrac remotely. Is it okay if I power down the server at this time?

I had some weirdness when accessing it also, I had to racreset a few times to get the console working. Go ahead, the host is not in icinga or pooled for anything.

I reseated the NIC and it connected. when I rebooted it went down again and didn't come up. swapped it out and rebooted it. stayed up this time. should have replaced the cable as well the first time D= It -should- stay up this time. lmk if it acts up again.

also tried the racreset trick and I'm getting a straight 404 error on the idrac login.

FYI the host is up and running with the old OS but the new puppet role, and puppet has been disabled for 26 days. It has disappeared from puppetdb (because puppet is disabled) and hence from monitoring and everything else, so it is effectively a ghost, apart from being flagged by a Netbox report.

I think in those cases it is better to keep the hosts powered off until they get reimaged.

Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye

Unfortunately this hasn't changed the failure to PXE boot behaviour :(

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin2002 for host mw2282.codfw.wmnet with OS bullseye executed with errors:

  • mw2282 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" mw2282.codfw.wmnet to get a root shell, but depending on the failure this may not work.

idk if this would help, but can we run the provisioning script with the --no-dhcp and --no-user flags, to catch any bios settings that might have changed?

Which script would this be? I've been using sre.hosts.reimage and those flags aren't supported.

sre.hosts.provision <hostname> --no-dhcp --no-user

Also --no-switch in this case I'd say.

This command also fails - but interestingly the host itself appears to have lost network connectivity. ethtool reports that the link is up but I can't connect in or out and the arp table is empty, I can only get in via the management console.

Hmm very odd, checking it now all of that is working, I can ssh to it from my laptop. Did something change? Switch config hasn't been adjusted.

root@mw2282:~# ping 10.192.48.1
PING 10.192.48.1 (10.192.48.1) 56(84) bytes of data.
64 bytes from 10.192.48.1: icmp_seq=1 ttl=64 time=0.462 ms
64 bytes from 10.192.48.1: icmp_seq=2 ttl=64 time=1.19 ms

The reimage problem may now be the firmware issue, which causes the link not to come up during the debian installer.

@hnowlan if you want to try the reimage again I can take a look at the console and try and see if I notice anything, ping me on irc.

port shows activity on the server, but the network side is showing as down. Reseating either cable does nothing. but reseating the SFP makes it come back up

Possible port speed issue? it is a 1G server on a 10G switch.

@cmooney it was me, I was reseating the cable

I'm still logged in to it over ssh so all seems ok right now. So you've definitely fixed an issue with the SFP replacement there.

There may well be a remaining issue with reimage/PXEboot/DHCP but physically we are now good. Thanks!

Reimaging fails still after these changes fwiw - however, a reboot has broken network connectivity again?! The host is up and rebooted in the management interface, but I can't ssh in again.

replaced the SFP this time. came up. server reboot is causing the port to go down, possibly

I can ssh just fine. Clearly the SSH gods simply do not favour you :P

Seriously though, I'm not sure what events led to the issue you are seeing @hnowlan? I did a test reboot there and the system rebooted fine, link was up and working immediately on boot and I could ssh in.

Maybe ping me on irc we can see if we can reproduce the issue.

I tried another reimage and it is currently proceeding successfully - maybe replacing the SFP did the job? This is all a bit inexplicable.

Mentioned in SAL (#wikimedia-operations) [2024-02-14T17:59:43Z] <hnowlan@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on mw2282.codfw.wmnet with reason: Testing if reimage is stable T355333

Mentioned in SAL (#wikimedia-operations) [2024-02-14T17:59:47Z] <hnowlan@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw2282.codfw.wmnet with reason: Testing if reimage is stable T355333

hnowlan claimed this task.

Reimage was successful, networking survived a reboot. All done!

I will mark the SFP I pulled as bad. See if I can test it on a new server.