Installation issues on PowerEdge R440 eqiad Ganeti servers with buster / firmware update needed
Closed, Resolved (Public)

Description

We've run into this issue before in codfw (T296856):

These servers fail to install buster with the outdated firmware currently on them. Following the reimage over the serial console, I can see that the d-i image and the initrd are being served (this also shows up in the logs of the install* server), but after that the system reboots into the installed system. They can only be installed with buster once the firmware has been updated. VMs need to be drained from the servers first, so I'm listing the servers below as they are freed of instances (we can't free up too many at a time); please tick them off once the firmware has been updated. It'll be 18 servers in total.

  • ganeti1005
  • ganeti1006
  • ganeti1007
  • ganeti1008
  • ganeti1009
  • ganeti1010
  • ganeti1011
  • ganeti1012
  • ganeti1013
  • ganeti1014
  • ganeti1015
  • ganeti1016
  • ganeti1017
  • ganeti1018
  • ganeti1019
  • ganeti1020
  • ganeti1021
  • ganeti1022
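
The drain-then-update workflow described above could be tracked with a small script along these lines. This is only an illustrative sketch, not tooling used in this task: the `MAX_DRAINED` value and the `next_batch` helper are assumptions, since the description only says that not too many servers can be freed at a time.

```python
# Hypothetical tracker for the firmware rollout described above.
# MAX_DRAINED is an assumed value; the task does not state the real limit.
MAX_DRAINED = 2

# ganeti1005 through ganeti1022, the 18 servers listed in the description.
HOSTS = [f"ganeti{n}" for n in range(1005, 1023)]


def next_batch(pending, drained, limit=MAX_DRAINED):
    """Return the hosts that may be drained next without exceeding the limit."""
    free_slots = max(limit - len(drained), 0)
    return pending[:free_slots]


if __name__ == "__main__":
    done = {"ganeti1018"}      # firmware updated and reimaged
    drained = ["ganeti1013"]   # emptied, waiting for the firmware update
    pending = [h for h in HOSTS if h not in done and h not in drained]
    print(len(HOSTS), "servers in total")
    print("can drain next:", next_batch(pending, drained))
```

Keeping the limit as a parameter makes it easy to adjust once the real capacity headroom of the cluster is known.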

Event Timeline

wiki_willy added subscribers: Cmjohnson, wiki_willy.

Hi @Cmjohnson - just a heads up, this one is a bit higher priority.

Thanks,
Willy

@MoritzMuehlenhoff For clarity, I can do this now for ganeti1018? Are we doing these 1 at a time? Thanks

Yes, ganeti1018 is ready to go now. In general I would add the servers which have been emptied and are ready to go to the task description (there'll be more tomorrow, but draining them takes some time), and then you can tick them off when done. Does that work for you?

Sounds good, thank you for clarifying. I will take care of 1018 now

Mentioned in SAL (#wikimedia-operations) [2022-01-19T17:25:21Z] <cmjohnson1> updating firmware, ganeti1018 T299527

I updated the firmware on 1018 but also made the error of updating the idrac, the new idrac version needs to be rolled back. I am no longer able to access the mgmt web portal.

I updated the firmware on 1018

Thanks, with the updated firmware I was able to reimage the server.

but also made the error of updating the idrac, the new idrac version needs to be rolled back. I am no longer able to access the mgmt web portal.

I was able to log into the mgmt via SSH, does that mean the rollback of the IDRAC is complete? If not, does it require downtime for the server? If so, I'd wait with pushing VMs back to ganeti1018.

Yes, mgmt works via SSH, but the new version doesn't allow me to access the web interface. I use that interface to do most firmware updates and to get hardware log reports for Dell when there is an issue. I am going to try rolling back to the original version, then applying an older update, and then the latest update. I'm in the middle of the rollback now.

@MoritzMuehlenhoff The iDRAC is giving me a hard time; it's not worth slowing this process down. The iDRAC has no bearing on your issue. The BIOS and network firmware have been updated.

Thanks! We can still revisit the IDRAC update when the whole Ganeti/eqiad update is done. I'll send more servers for firmware updates your way on Monday.

One more server is ready and downtimed; ganeti1013

One more server is ready and downtimed; ganeti1014

Mentioned in SAL (#wikimedia-operations) [2022-01-24T17:50:42Z] <cmjohnson1> updating firmware on ganeti1013 T299527

@MoritzMuehlenhoff 1013 is finished. ganeti1014 will need a hard power cycle from me; I will be able to get to that a little later today.

Ack! ganeti1013 has been reinstalled, let me know when the power cycle for ganeti1014 is complete.

One more server is ready and downtimed; ganeti1005

Mentioned in SAL (#wikimedia-operations) [2022-01-25T16:18:47Z] <cmjohnson1> updating firmware ganeti1014 T299527

Mentioned in SAL (#wikimedia-operations) [2022-01-25T16:21:41Z] <cmjohnson1> updating firmware ganeti1005 T299527

@MoritzMuehlenhoff both 1014 and 1005 have been updated.

Thanks, an additional server is now ready and downtimed; ganeti1006.

Mentioned in SAL (#wikimedia-operations) [2022-01-25T19:35:33Z] <cmjohnson1> updating firmware ganeti1006 T299527

One more server is ready and downtimed; ganeti1015

One more server is ready and downtimed; ganeti1007

Mentioned in SAL (#wikimedia-operations) [2022-01-27T17:00:05Z] <cmjohnson1> updating firmware ganeti1007 and ganeti1015 T299527

One more server is ready and downtimed; ganeti1019

One more server is ready and downtimed; ganeti1010

1010 is updated. 1019 is locking up; I will need to power it off and unplug it.

Ack, thanks! It can be powered off any time, it's taken out of service and downtimed.

And one more server is ready and also downtimed: ganeti1008

One more server is ready and downtimed; ganeti1016

One more server is ready and downtimed; ganeti1011

@MoritzMuehlenhoff 1006, 1016 and 1019 updated

I can't connect to the serial console of ganeti1016. It asks me for the password, but then the connection just stalls (reproducibly). Can you please check whether you can see any error, or whether it can be reset?

One more server is ready and downtimed; ganeti1020

@MoritzMuehlenhoff 1016 is fixed and accessible, 1011 and 1020 updating now

1011 and 1020 have been updated

One more server is ready and downtimed; ganeti1021

One more server is ready and downtimed; ganeti1022

One more server is ready and downtimed; ganeti1017

One more server is ready and downtimed; ganeti1012

One more server is ready and downtimed; ganeti1009. This is the last one :-)

@MoritzMuehlenhoff All completed, resolving this task

Thanks for the swift turnarounds on these!