labstore1002 seemed to have flaky hardware; ensure that either all the components that need replacing have been replaced or we have switched back to labstore1001.
Switching back to labstore1001 should be a high priority, but - as it requires significant downtime - it needs to be planned and a good window found. What needs to happen first:
- Make certain labstore1001 can in fact serve files (easily tested without outage)
- Coordinate with @Springle as switching requires moving cabling around
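The no-outage serving test in the first step could be sketched roughly as below. The hostname matches the task, but the export path, mountpoint, and mount options are illustrative assumptions (the real export list lives in /etc/exports on the server). DRY_RUN=1, the default here, only prints the commands so the sketch can be reviewed without touching anything:

```shell
#!/bin/bash
# Sketch: verify labstore1001 can serve files over NFS without an outage.
# Export path and mountpoint are hypothetical; adjust to the real exports.
set -e
HOST=labstore1001
EXPORT=/srv/test        # hypothetical export path

# With DRY_RUN=1 (the default) commands are echoed, not executed.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

run showmount -e "$HOST"                              # is mountd advertising exports?
run mount -t nfs -o ro,soft "$HOST:$EXPORT" /mnt/check
run ls /mnt/check                                     # can a client actually read?
run umount /mnt/check
```

A read-only, soft mount keeps the test itself from hanging a client if 1001 turns out not to be serving properly.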
At the selected window:
- Stop NFS cleanly on labstore1002
- Flush and unmount all labstore1002 filesystems (easiest done with a halt/poweroff)
- Switch cabling of the shelves from 1002 to 1001
- [Optional: power 1002 back on, make sure it's available to take over quickly by powering through the BIOS issue if needed]
- Verify that the shelves are visible from 1001 properly and that all is well
- Start NFS service on 1001
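The steps above can be sketched as a command sequence. Unit names are assumptions (Debian-style nfs-kernel-server); the cabling step is manual and only noted as a comment. As before, DRY_RUN=1 (the default) prints the commands instead of running them:

```shell
#!/bin/bash
# Sketch of the switchover: run the first half on labstore1002,
# the second half on labstore1001. Unit names are assumed.
set -e

# With DRY_RUN=1 (the default) commands are echoed, not executed.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi; }

# --- on labstore1002 ---
run systemctl stop nfs-kernel-server   # stop NFS cleanly
run sync                               # flush dirty buffers to the shelves
run poweroff                           # halting unmounts every filesystem

# (manual step: move the shelf cabling from 1002 to 1001)

# --- on labstore1001 ---
run lsblk                              # are the shelves visible as block devices?
run mount -a                           # mount the shelf filesystems
run systemctl start nfs-kernel-server  # bring NFS service back up
```

Halting 1002 rather than unmounting by hand matches the "easiest done with a halt/poweroff" note above and guarantees nothing is left dirty when the cables move.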
Chris can confirm how long it takes to switch the wiring around, but I expect no less than 10-15 minutes of downtime during the switch, which we should (at least) double for safety.
We can roll back by switching the wiring back to 1002 and rebooting it - possibly with some struggle with the H800 BIOS (as we had when we switched during the crash recovery). Chris being on-site means that - in a pinch - we can even replace hardware in 1002 before kicking it back up.
All told, we should schedule a maintenance window of no less than two hours - with luck we'll need only 15 minutes of it.
NFS service has been switched back to labstore1001, and labstore1002's controller is now being swapped out for a new one.