Ensure that labstore machine is 'known good' hardware
Closed, Resolved, Public

Description

labstore1002 seemed to have flaky hardware. Ensure that either all the components that need replacing have been replaced, or that we have switched back to labstore1001.

Related Objects

Status    Assigned
Resolved  yuvipanda
Resolved  coren
Resolved  coren
Declined  coren
Resolved  None
Resolved  None
Resolved  coren
Resolved  coren
Declined  yuvipanda
Resolved  coren
Resolved  coren
Resolved  coren
Resolved  Cmjohnson
Resolved  chasemp
Resolved  chasemp
Resolved  coren
Resolved  coren
Resolved  mark
Resolved  Cmjohnson
Resolved  coren
Resolved  coren
Resolved  faidon
Declined  faidon

Event Timeline

yuvipanda set the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a project: Cloud-Services.
yuvipanda added subscribers: coren, mark, Ricordisamoa and 2 others.

Switching back to labstore1001 should be a high priority, but since it requires significant downtime it needs to be planned and a good time found. What needs to happen:

  • Make certain labstore1001 can in fact serve files (easily tested without outage)
  • Coordinate with @Springle as switching requires moving cabling around
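
As a rough illustration of the "easily tested without outage" check, here is a minimal Python sketch that only tests whether a host accepts TCP connections on the standard NFS port (2049). The function name `nfs_port_reachable` and its defaults are assumptions made for this example, not something from the task. A successful connection shows the service is listening; it does not prove that exports can actually be mounted and read, so a real test would still need a trial mount from a client.

```python
import socket

# 2049 is the standard NFS port; whether labstore1001 listens on it over TCP
# is an assumption of this sketch.
NFS_PORT = 2049


def nfs_port_reachable(host: str, port: int = NFS_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and connects; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling `nfs_port_reachable("labstore1001")` from a client on the same network would then give a quick yes/no answer without touching the live service on labstore1002.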

At the selected window:

  • Stop NFS cleanly on labstore1002
  • Flush and unmount all labstore1002 filesystems (easiest done with a halt/poweroff)
  • Switch cabling of the shelves from 1002 to 1001
  • [Optional: power 1002 back on, make sure it's available to take over quickly by powering through the BIOS issue if needed]
  • Verify that the shelves are visible from 1001 properly and that all is well
  • Start NFS service on 1001
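
The ordering of the steps above matters: NFS must not start on 1001 until the shelves have been recabled and verified. A minimal Python sketch of that checklist follows, purely as an illustration. `Step`, `SWITCHOVER` and `run_plan` are hypothetical names invented here, and the sketch runs no commands on any host; it only walks the steps in order and stops at the first required step that is not confirmed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    where: str              # host or location the step applies to
    description: str        # wording taken from the checklist
    optional: bool = False  # optional steps may be skipped without aborting


# The window checklist, in the order given in the comment.
SWITCHOVER: List[Step] = [
    Step("labstore1002", "Stop NFS cleanly"),
    Step("labstore1002", "Flush and unmount all filesystems (halt/poweroff)"),
    Step("on-site", "Switch cabling of the shelves from 1002 to 1001"),
    Step("labstore1002", "Power back on so it can take over quickly", optional=True),
    Step("labstore1001", "Verify that the shelves are visible and all is well"),
    Step("labstore1001", "Start NFS service"),
]


def run_plan(steps: List[Step], confirm: Callable[[Step], bool]) -> List[Step]:
    """Walk the steps in order and return the ones that were confirmed.

    `confirm` is called once per step (for example, by prompting the operator).
    The walk stops at the first required step that is not confirmed, so later
    steps never run on top of a failed one.
    """
    done: List[Step] = []
    for step in steps:
        if confirm(step):
            done.append(step)
        elif not step.optional:
            break
    return done
```

If `run_plan` returns fewer steps than expected, the returned list shows how far the switch got, which is what a rollback to 1002 would need to undo.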

Chris can confirm how long it takes to switch the wiring around, but I expect no less than 10-15 minutes of downtime during the switch, which we need to (at least) double for safety.

We can roll back by switching the wiring back to 1002 and rebooting it - possibly with some struggle with the H800 BIOS (as we had when we switched during the crash recovery). Chris being on-site means that - in a pinch - we can even replace hardware in 1002 before kicking it back up.

All told, we should have a maintenance window of no less than two hours - with luck we'll need only 15 minutes of it.

coren moved this task from Doing to To Do on the Labs-Sprint-107 board.
coren claimed this task.

NFS service has been switched back to labstore1001, and labstore1002's controller is now being swapped out for a new instance.