
wmf7622 wont powercycle (cannot be allocated from spares)
Closed, Resolved · Public · 0 Story Points

Description

wmf7622 was being investigated for potential use when it was discovered it fails to accept reboot commands.

/admin1-> racadm serveraction powerup
ERROR: Timeout while waiting for server to perform requested power action.

Task created to track/troubleshoot issue.

  • update idrac firmware to latest revision 3.30.30.30
  • racreset command sent to idrac
  • test after update: system now states it accepts the power on command but doesn't actually power on
  • @Cmjohnson removes all power from the server to see if it resets things properly
  • after power removal, test whether the system accepts a racadm serveraction powerup command and actually powers up
  • continue troubleshooting (if required) with Dell support
  • once system is fixed, change netbox state from failed to inventory
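The "accepts the command but doesn't actually power on" check in the list above could be scripted roughly as follows. This is only a sketch, not what was run on this host: it assumes racadm is available on the PATH of the machine running the script, that `racadm serveraction powerstatus` prints a line such as "Server power status: ON", and the helper names (`run_racadm`, `power_is_on`, `verify_powerup`) are made up for illustration.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_racadm(args: Sequence[str]) -> str:
    """Run racadm with the given arguments and return its stdout.

    Assumption: a racadm binary is on PATH where this script runs.
    """
    result = subprocess.run(
        ["racadm", *args], capture_output=True, text=True, check=False
    )
    return result.stdout


def power_is_on(status_output: str) -> bool:
    """Interpret powerstatus output.

    Assumption: the output looks like 'Server power status: ON'.
    Anything else (including ERROR lines) is treated as not powered on.
    """
    return status_output.rsplit(":", 1)[-1].strip().upper() == "ON"


def verify_powerup(
    run: Callable[[Sequence[str]], str] = run_racadm,
    attempts: int = 5,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Send powerup, then poll powerstatus to confirm the server really came up."""
    run(["serveraction", "powerup"])
    for _ in range(attempts):
        if power_is_on(run(["serveraction", "powerstatus"])):
            return True
        sleep(delay)
    return False
```

Polling the status separately matters here because the command's own success message was not a reliable signal on this host after the firmware update.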

Event Timeline

RobH triaged this task as Normal priority. May 9 2019, 8:31 PM
RobH created this task.
Restricted Application added a project: Operations. May 9 2019, 8:31 PM
RobH reassigned this task from RobH to Cmjohnson. May 9 2019, 8:54 PM

OK, I flashed the idrac firmware to the newest version. It now says it accepts the power on command, but it doesn't actually power on.

So Chris will have to pull power and see if a full power reset fixes this server. I'll add a checklist.

RobH updated the task description. May 9 2019, 8:57 PM
RobH moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board.

Hello, process question about this. The current flowchart for states doesn't allow Spare->Failed to happen, so there are some implicit assumptions about that inside of, for example, the PuppetDB netbox report (the Failed state is expected to be in Puppet since it implicitly comes from a production state). Is it the preference that boxes like this go through a Failed state (and thus never appear in Puppet)? Thanks.

RobH added a comment. May 15 2019, 6:31 PM

Hello, process question about this. The current flowchart for states doesn't allow Spare->Failed to happen, so there are some implicit assumptions about that inside of, for example, the PuppetDB netbox report (the Failed state is expected to be in Puppet since it implicitly comes from a production state). Is it the preference that boxes like this go through a Failed state (and thus never appear in Puppet)? Thanks.

Good questions, and I am not sure of the right answer.

So this is a system that was spare, as it was ordered in and not used yet, and then I went to use it and it did not work. So it technically was going to go from 'inventory' to 'planned' and then to 'failed' as it was only tested and failed once I went to use it. Since it all happened in the course of minutes, I moved it right from inventory to failed.

This may be easier to discuss via IRC than in an async phab task, so feel free to ping me there as well!

Also adding @Volans here who designed this for his input :)

I think it's fair to add transitions from pretty much any state to the failed state.

My original thought, and the reason they are not in the current version, was that it only made sense to mark something as failed if it is no longer in production because of the failure, while a failure on a planned or staged host could be considered part of the road towards active.

Clearly this only took servers into account, where this might be true, but we actually have spares of other kinds of equipment, like switches, for which it makes sense to not have them racked, so they could go from spare to failed directly.

Let me know if there is agreement and I'll add the transitions from Spare, Planned and Staged to Failed and back to those states.
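As a rough illustration of the proposed change, the allowed transitions can be thought of as a map from each state to the states it may move to. The sketch below is not the actual Netbox report code: the baseline map is a simplified, hypothetical stand-in for the current flowchart (only the fact that Failed is reachable from a production state, and not from Spare, is taken from this discussion), and the function names are invented for the example.

```python
from typing import Dict, Iterable, Set

# Hypothetical, simplified baseline: Failed is only reachable from "active".
BASE_TRANSITIONS: Dict[str, Set[str]] = {
    "spare": {"planned"},
    "planned": {"staged"},
    "staged": {"active"},
    "active": {"failed"},
    "failed": {"active"},
}


def with_failed_transitions(
    transitions: Dict[str, Set[str]],
    states: Iterable[str] = ("spare", "planned", "staged"),
) -> Dict[str, Set[str]]:
    """Return a copy of the map with state -> failed and failed -> state added."""
    updated = {state: set(targets) for state, targets in transitions.items()}
    updated.setdefault("failed", set())
    for state in states:
        updated.setdefault(state, set()).add("failed")
        updated["failed"].add(state)
    return updated


def is_allowed(transitions: Dict[str, Set[str]], current: str, target: str) -> bool:
    """Check whether moving from current to target is a permitted transition."""
    return target in transitions.get(current, set())
```

With this shape, a report that validates state changes would only need the updated map; a host moved straight from spare to failed, as wmf7622 was, would pass `is_allowed` after the change and fail it before.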

I'm definitely in favor of allowing a failed state to basically come from any other state.

Cmjohnson closed this task as Resolved. Jun 11 2019, 4:01 PM
Cmjohnson updated the task description.
Cmjohnson updated the task description.

This server accepts all the racadm commands successfully. I verified on-site that the power actions actually happened.

/admin1-> racadm serveraction powercycle
Server power operation initiated successfully
/admin1-> console com2
Connected to Serial Device 2. To end type: ^\
KEY MAPPING FOR CONSOLE REDIRECTION:

Use the <ESC><1> key sequence for <F1>
Use the <ESC><2> key sequence for <F2>

/admin1-> racadm serveraction powerdown
Server power operation initiated successfully
/admin1-> racadm serveraction powerup
Server power operation initiated successfully
/admin1->