Inspect and diagnose labstore1001's H800 controller
Closed, Resolved · Public

Description

To be done when labstore1002 is the active server, as part of the switchover test.

Things that may be worthwhile to inspect in detail:

  • Firmware revision of controller and BIOS
  • Possible divergence in firmware-level configuration
  • Stability of wiring
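
One way to check the first two items is to dump each controller's settings to text and compare the dumps key by key. The following is only a minimal sketch under stated assumptions: it assumes the settings can be exported as plain `key: value` lines, the helper names `parse_settings` and `diff_settings` are hypothetical, and the sample dumps are placeholders rather than real output from either host.

```python
def parse_settings(text):
    """Parse 'key: value' lines into a dict, skipping lines without a colon."""
    settings = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ.

    A key missing on one side is reported with None for that side.
    """
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Placeholder dumps for illustration only, not real controller output.
dump_1001 = """
firmware: placeholder-A
write cache: enabled
"""
dump_1002 = """
firmware: placeholder-B
write cache: enabled
patrol read: auto
"""

divergence = diff_settings(parse_settings(dump_1001), parse_settings(dump_1002))
for key, (left, right) in divergence.items():
    print(f"{key}: labstore1001={left!r} labstore1002={right!r}")
```

Any key reported here would be a candidate for the "divergence in firmware-level configuration" item; an empty result would suggest the two exported configurations match.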

A good load test is also likely to be necessary.

The original problem has not recurred since the last cold start, but we should do everything we can to be confident in the hardware before labstore1001 is switched back to primary.

Event Timeline

coren created this task. Apr 7 2015, 2:37 PM
coren updated the task description.
coren raised the priority of this task from to Needs Triage.
coren added a subscriber: coren.
Restricted Application added a project: acl*sre-team. Apr 7 2015, 2:37 PM
Restricted Application added a subscriber: Aklapper.
Andrew triaged this task as High priority. Apr 11 2015, 9:28 PM
Andrew set Security to None.

did this happen?

Restricted Application added a subscriber: Matanya. Jul 21 2015, 3:49 PM
coren added a comment. Jul 27 2015, 3:08 PM

@yuvipanda: No, the switchover test never took place and other concerns overrode this, and now labstore1001 is disconnected from the shelves.

At this point, I don't think we have any reason to believe the labstore1001 hardware is flaky. The hardware with known issues is: the labstore1002 H800 controller, which is known to randomly fail POST, and one of the labstore2001 shelves, which is not working.

coren closed this task as Resolved. Oct 28 2015, 4:03 PM
coren claimed this task.

Resolved by the switchover test to end all switchover tests: labstore1001 is now back to being the primary server.