DHCP failing for at least 2 ms-be servers in codfw
Open, Medium, Public

Description

I tried to reimage both ms-be2074 and ms-be2076 today to move them to the new per-rack VLANs (cf. T354872). In both cases the process has failed (though the cookbooks are still running) because neither node is able to DHCP any more. The hosts start up, attempt to DHCP, and fall back to booting from hard disk - and the reimage cookbook takes this as "host not yet up" because they come back on the old IP, not the new one. I've halted both reimage cookbooks for now, so feel free to take whatever action you like on both hosts without further input from me.

The MAC addresses are (from the console) 00 62 0B 74 EA 40 for ms-be2074 and 00 62 0B 75 4A 80 for ms-be2076.
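For anyone searching DHCP server logs for these hosts: logs typically show MACs as lowercase, colon-separated octets rather than the space-separated console format. A minimal Python sketch of the conversion follows; `normalize_mac` is a hypothetical helper written for this note, not part of the cookbooks.

```python
import re


def normalize_mac(raw: str) -> str:
    """Return a MAC address as lowercase, colon-separated octets."""
    # Drop separators (spaces, colons, dashes, dots) and keep only hex digits.
    digits = re.sub(r"[^0-9A-Fa-f]", "", raw)
    if len(digits) != 12:
        raise ValueError(f"expected 12 hex digits, got {len(digits)}: {raw!r}")
    # Regroup into six two-digit octets.
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2)).lower()


print(normalize_mac("00 62 0B 74 EA 40"))  # ms-be2074 -> 00:62:0b:74:ea:40
print(normalize_mac("00 62 0B 75 4A 80"))  # ms-be2076 -> 00:62:0b:75:4a:80
```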

AFAICT the last time these nodes were imaged was at initial setup in 2023 (T349839), but I do need to reimage these (there's also a third node waiting, which I've not attempted), and I'd really like to get them back into the rings before the SRE summit next week. I don't know what's gone wrong in the meantime :-/

Event Timeline

[priority because I do need to be able to reliably reimage swift nodes]

Strangely, I re-imaged both servers from cumin2002 and ran into no issues. @MatthewVernon, perhaps the first re-images you ran, though they failed, set up the conditions for the following re-images to succeed? Was there any interesting output from the move-vlan cookbook? I ran the following commands:

$ sudo cookbook sre.hosts.reimage --new --os bullseye --move-vlan ms-be2074
$ sudo cookbook sre.hosts.reimage --new --os bullseye --move-vlan ms-be2076

That is strange - logs are in /var/log/spicerack/sre/hosts/reimage.log on cumin2002 for both reimages as you might expect, but scrolling back up through my tmux, it looks like the move-vlan cookbook completes just fine, then the reimage cookbook does some DHCP stuff and reboots the host as one would expect.

I'll try ms-be2077 a little later today and see how that goes.

MatthewVernon lowered the priority of this task from High to Medium. Thu, Jan 22, 1:07 PM

@jhathaway I did ms-be2077 today, and saw the same failure mode - it failed entirely to DHCP. I killed the cookbook, re-ran it (sudo cookbook sre.hosts.reimage --os bullseye -t T354872 --move-vlan --new ms-be2077) and it reimaged just fine. Obviously the re-run doesn't make any DNS changes because the first run does that, which does make me think the --move-vlan bit is what's going wrong.

I'll have some more codfw hosts to do for T354872 in due course (I'll start draining the next few nodes after Lisbon) - shall I let you try the initial reimage+move-vlan on those so you can see what's going wrong?

[I think the priority can come down since it seems to be specific to --move-vlan reimages. Thank you for at least finding a workaround I could use so I can put these back in the rings today!]

> @jhathaway I did ms-be2077 today, and saw the same failure mode - it failed entirely to DHCP. I killed the cookbook, re-ran it (sudo cookbook sre.hosts.reimage --os bullseye -t T354872 --move-vlan --new ms-be2077) and it reimaged just fine. Obviously the re-run doesn't make any DNS changes because the first run does that, which does make me think the --move-vlan bit is what's going wrong.

Yeah I agree. Perhaps there is a race condition with that script updating the switch port. @cmooney, any recommendations on debugging the move-vlan cookbook?

> I'll have some more codfw hosts to do for T354872 in due course (I'll start draining the next few nodes after Lisbon) - shall I let you try the initial reimage+move-vlan on those so you can see what's going wrong?

That would be great.

> Perhaps there is a race condition with that script updating the switch port. @cmooney, any recommendations on debugging the move-vlan cookbook?

Yeah I had a quick look at the code but nothing is jumping out at me. I don't think there is a verbose mode or similar we can use to get more info on exactly where it breaks down.

I think the best way forward might be if @MatthewVernon could tell us when the next node is ready for reimage, and we could do it ourselves and observe exactly what is making the DHCP fail.

I think that was due to the bug fixed in T416401: Cookbook sre.hosts.reimage: DHCP snippet created with old IP when --move-vlan is used. It should be good now.
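For the next test run, a quick sanity check on the generated snippet might look like the sketch below. It assumes the snippet uses ISC dhcpd-style syntax with a `fixed-address` statement; the cookbook's actual output may differ. The helper `snippet_ip_matches`, the snippet text, and both IPs (taken from the documentation ranges) are hypothetical.

```python
import ipaddress
import re


def snippet_ip_matches(snippet: str, expected_ip: str) -> bool:
    """Return True if the snippet's fixed-address equals expected_ip."""
    # Assumes an ISC dhcpd-style "fixed-address <ip>;" statement.
    match = re.search(r"fixed-address\s+([0-9A-Fa-f.:]+)\s*;", snippet)
    if match is None:
        return False
    # Compare as parsed addresses rather than raw strings.
    return ipaddress.ip_address(match.group(1)) == ipaddress.ip_address(expected_ip)


OLD_IP = "192.0.2.10"     # hypothetical address on the old VLAN
NEW_IP = "198.51.100.10"  # hypothetical address on the new per-rack VLAN

stale_snippet = f"host example {{ fixed-address {OLD_IP}; }}"
fresh_snippet = f"host example {{ fixed-address {NEW_IP}; }}"

print(snippet_ip_matches(stale_snippet, NEW_IP))  # False: snippet still has the old IP
print(snippet_ip_matches(fresh_snippet, NEW_IP))  # True
```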

Cool, thanks. I'll test when I next have a drained system ready to be moved :)