
[reimage,ceph] reimaging cloudcephosd hosts gets stuck in network configuration screen
Closed, Resolved (Public)

Description

The installer fails on its own at the network configuration step. It can be worked around by manually bringing the interface up and running the DHCP client (udhcpc):

ip link set up dev enp175s0f0np0
udhcpc -i enp175s0f0np0

After that, skipping to the next step in the install process works. Re-running the network configuration step instead reports that it passed, but the host then loses the network again.
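
The manual workaround can be wrapped in a small script for the installer shell. This is only a sketch: the `run` and `workaround` helpers and the `DRY_RUN` variable are made up for illustration, and it assumes the BusyBox DHCP client `udhcpc` is what is available in the installer environment.

```shell
#!/bin/sh
# Sketch of the manual workaround described in this task.
# DRY_RUN=1 only prints the commands instead of running them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# workaround IFACE: bring the link up, then request a DHCP lease on it.
workaround() {
    run ip link set up dev "$1" &&
    run udhcpc -i "$1"
}

# Example, with the interface name from this task:
#   DRY_RUN=1 workaround enp175s0f0np0   # print only
#   workaround enp175s0f0np0             # actually run
```

After it succeeds, skip to the next installer step rather than re-running the network configuration.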

To test

Upgrade NIC firmware

One thing to try is upgrading the NIC firmware. This has to be done before starting the reimage of the host, using the cookbook:

sre.hardware.upgrade-firmware

Set NICs to legacy mode in the BIOS

https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Troubleshooting

Where it happened so far

This has happened so far with:

  • cloudcephosd1006 - manually reinstalled
  • cloudcephosd1007 - manually reinstalled
  • cloudcephosd1008 - manually reinstalled

Solution

Upgrading the NIC firmware to 21.85.21.92 allowed the reimage to get through.
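
On a host that is already installed, the firmware level can be checked before a reimage. The sketch below assumes `ethtool` is installed and that the driver reports the package version somewhere on the `firmware-version:` line of `ethtool -i`. That format varies by driver, so treat the match as a hint rather than a guarantee. `has_firmware` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -i` output on stdin and succeeds if the
# wanted version string appears on the firmware-version line.
# It is a plain substring match, not a version comparison.
has_firmware() {
    wanted="$1"
    grep '^firmware-version:' | grep -qF -- "$wanted"
}

# Example, with the interface name and version from this task:
#   ethtool -i enp175s0f0np0 | has_firmware 21.85.21.92 && echo "firmware OK"
```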

Event Timeline

dcaro changed the task status from Open to In Progress.
dcaro triaged this task as High priority.

cloudcephosd1009 was able to start installing after upgrading the NIC firmware to 21.85.21.92 \o/

Will do a couple more before declaring that the solution.

Yep, upgrading the firmware works