rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle
Closed, Resolved (Public)

Description

We were trying to pxeboot-image rigel but it did not come back up after a reboot. Looks like the same problem as T149006, and we've done the same steps.

Discussed with Papaul and we've agreed that Monday is ok for onsite investigation, as we can use the eqiad bastion until then.

Event Timeline

Jgreen triaged this task as Unbreak Now! priority. May 4 2018, 4:58 PM
Jgreen created this task.
Restricted Application added subscribers: Liuxinyu970226, TerraCodes, Aklapper.
Papaul subscribed.

@Jgreen I power-drained the server and updated all the firmware as well.

Let me know if you need anything else from me.

Thanks

@Papaul I tried power cycling just now and the same thing happened again, nothing on vsp and I can't reach it on the normal interface.
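
For anyone repeating the power-cycle test, the "can't reach it on the normal interface" check can be scripted. This is only a minimal sketch: it assumes the host normally answers SSH on TCP port 22, and the function name `is_reachable` and the timeout are illustrative choices, not anything used on this task.

```python
import socket


def is_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and opens the TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Assumption: SSH (port 22) is the service expected on the normal interface.
    host = "rigel.frack.codfw.wmnet"
    state = "reachable" if is_reachable(host) else "unreachable"
    print(f"{host}: {state}")
```

A failed check only shows that the OS network stack is not answering; it does not distinguish a hung boot from a console or ILO problem.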

@cwdent if the power drain and the firmware update didn't work, it might be a hardware issue (ILO interface or main board). For your information, the server warranty expired on 2017-11-04.

So do we replace parts or the whole machine?

@Jgreen if the ILO card is on the main board, the whole main board will have to be replaced; if not, only the ILO card will be replaced. Most of the time on HP servers the ILO card is on the main board.

Jgreen mentioned this in Unknown Object (Task). May 7 2018, 7:33 PM
Jgreen added a subtask: Unknown Object (Task). May 7 2018, 7:36 PM

If the mainboard dies on an out of warranty system, we typically decommission the host. We're looking to order more misc systems for eqiad on T189317, and planned to place a similar order for codfw.

However, we also have some spare systems in codfw that we could allocate immediately (with a task filed in hardware-requests and approval of the allocation).

I have a single spare with only 32GB of RAM (the rest have 64GB). It has the following specs:

warranty through 2019-11-03
dual Intel® Xeon® Processor E5-2623
quad 4TB hdd (overkill but ALL the spares in codfw have this so we may as well use it.)
1gb networking.

If this would work for you @Jgreen, please let me know and I can create the needed hw-request task.

RobH mentioned this in Unknown Object (Task). May 7 2018, 10:32 PM

@Jgreen as requested, the server is back up again.

Rigel was set by default to boot first from the NIC, so every time the server rebooted it got stuck with the error below. I changed the boot order to boot first from disk and asked Jeff to log in to the ILO and reboot the server; this time the server came up without any errors. We ran the test twice.

20180508_094017.jpg (2×3 px, 2 MB)

Please note that my earlier comment regarding allocation of a spare was discussed on IRC between myself and @Jgreen.

rigel's ilom is non-functional, but the system is able to boot into the OS. I've created T194094 to purchase a replacement.

RobH lowered the priority of this task from Unbreak Now! to Medium. May 8 2018, 3:46 PM

Lowering to normal, as the server is known bad (ilom malfunction) but out of warranty. There is already a high priority procurement task (T194094) to replace it.

Papaul and I spent some more time on this, and found that "BIOS Serial Console" was set to auto, not COM2 as it should be for ILO output. We were seeing BIOS output before the firmware update, so I think the setting may have reverted to the default. Once we fixed this, we were able to pxeboot and boot from disk normally. Closing this task because I think the hardware is fine (for now, at least).
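
As a side note on the COM2 setting: on Linux, serial ports are conventionally numbered from zero, so COM2 corresponds to `ttyS1`. The sketch below illustrates only that naming convention. The helper names and the 115200 baud rate are assumptions for illustration, not values taken from rigel's configuration.

```python
def linux_serial_device(com_port: str) -> str:
    """Map a BIOS-style port name such as 'COM2' to the Linux name 'ttyS1'."""
    name = com_port.strip().upper()
    if not name.startswith("COM") or not name[3:].isdigit():
        raise ValueError(f"not a COM port name: {com_port!r}")
    index = int(name[3:])
    if index < 1:
        raise ValueError(f"COM ports are numbered from 1: {com_port!r}")
    # COM ports count from 1, Linux ttyS devices count from 0.
    return f"ttyS{index - 1}"


def kernel_console_arg(com_port: str, baud: int = 115200) -> str:
    """Build a kernel 'console=' argument for the given COM port.

    The default baud rate is an assumed value; use whatever the BIOS
    and ILO virtual serial port are actually configured for.
    """
    return f"console={linux_serial_device(com_port)},{baud}n8"


if __name__ == "__main__":
    print(kernel_console_arg("COM2"))
```

A value such as "auto" is rejected by `linux_serial_device`, because it does not name a specific port.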

RobH closed subtask Unknown Object (Task) as Declined. May 11 2018, 7:29 PM
Vvjjkkii renamed this task from rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle to gldaaaaaaa. Jul 1 2018, 1:11 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed Jgreen as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from gldaaaaaaa to rigel.frack.codfw.wmnet (fundraising codfw bastion) will not boot after a power cycle. Jul 1 2018, 10:52 PM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to Jgreen.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.