Switch NFS server back to labstore1001
Closed, Resolved (Public)

Description

  • Make certain labstore1001 can in fact serve files (easily tested without outage)
  • Coordinate with @Cmjohnson as switching requires moving cabling around

At the selected window:

  • Stop NFS cleanly on labstore1002
  • Flush and unmount all labstore1002 filesystems (easiest done with a halt/poweroff)
  • Switch cabling of the shelves from 1002 to 1001
  • [Optional: power 1002 back on, make sure it's available to take over quickly by powering through the BIOS issue if needed]
  • Verify that the shelves are visible from 1001 properly and that all is well
  • Start NFS service on 1001

Chris can confirm how long it takes to switch the wiring around, but I expect at least 10-15 minutes of downtime during the switch, which we should (at least) double for safety.

We can roll back by switching the wiring back to 1002 and rebooting it - possibly with some struggle with the H800 BIOS (as we had when we switched during the crash recovery). Chris being on-site means that - in a pinch - we can even replace hardware in 1002 before kicking it back up.

All told, we should have a maintenance window no less than two hours - with luck we'll need only 15 minutes of it.

coren created this task. Jul 27 2015, 3:19 PM
coren updated the task description.
coren raised the priority of this task from to Needs Triage.
coren added projects: Labs-Sprint-107, Labs.
coren added subscribers: yuvipanda, Aklapper, Ricordisamoa and 3 others.
mark added a comment. Edited Jul 30 2015, 11:31 AM

I'd like to see a migration plan be developed on this ticket.

coren moved this task from To Do to Doing on the Labs-Sprint-107 board. Jul 30 2015, 7:17 PM
coren claimed this task.
coren added a comment. Aug 3 2015, 5:26 PM

I'm not sure what extra details to add, though I'll list the exact commands shortly. I've tested that 1001, in its current state, can successfully export filesystems to labs instances, which was the first prerequisite. Reinstalling 1001 from puppet now should close the three related tickets and serve as a good last check (T107574).

Given this, I'll script the switchover in detail and - once we've agreed on the plan - send the maintenance notice email to the lists once a time has been picked.

coren moved this task from To Do to Doing on the Labs-Sprint-108 board. Aug 10 2015, 4:40 PM
coren added a comment. Aug 10 2015, 4:43 PM

A draft of the planned announcement is on etherpad at https://etherpad.wikimedia.org/p/labs-maintenance-aug-2015-draft for comments and adjustments.

This is pending a "simple" decision on how to handle a package that currently cannot be installed; once that is settled and the installation is finished, we should be ready to switch.

coren added a comment. Aug 10 2015, 5:12 PM

The (more) detailed plan:

  • Coordinate with @Cmjohnson as switching requires moving cabling around

At the selected window:

  • power down 1001
  • service nfs-kernel-server stop (on 1002)
  • wait for NFS to be gone entirely (no [nfsd] processes left - 30s or so)
  • poweroff 1002
  • Switch cabling of the shelves from 1002 to 1001, otherwise keeping the same layout
    • Alternately, we can wire in parallel for now and keep 1002 powered down as this provides for faster recovery if we have to roll back.
    • Iff 1002 is disconnected (see above): power 1002 back on, make sure it's available to take over quickly by powering through the BIOS issue if needed
  • power up 1001
  • check that all arrays are assembled (from /proc/mdstat). If necessary, force assembly (the arrays might be thought foreign)
  • regenerate /etc/mdadm/mdadm.conf
  • check that all the volumes are properly visible (lvs)
  • add the filesystems to /etc/fstab once they are confirmed available
  • Start file service with /usr/local/sbin/start-nfs

At that point, NFS service should resume within 90s (the recovery grace period) and instances should wake up between 3-5 minutes later on average.
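
The "check that all arrays are assembled" step can be made a little more mechanical. This is only a sketch, assuming bash and awk on the server and the usual /proc/mdstat layout; the function names are made up for illustration and are not part of any existing tooling on the labstore hosts.

```shell
# Print the names of md arrays that are listed but not "active".
# Reads mdstat-format text from the file given as $1 (default: /proc/mdstat).
inactive_md_arrays() {
    awk '$1 ~ /^md/ && $2 == ":" && $3 != "active" { print $1 }' \
        "${1:-/proc/mdstat}"
}

# Print the names of md arrays whose member map (e.g. [U_]) shows a
# missing device, i.e. arrays that assembled but are degraded.
degraded_md_arrays() {
    awk '$1 ~ /^md/ && $2 == ":" { name = $1 }
         /\[[U_]*_[U_]*\]/       { print name }' \
        "${1:-/proc/mdstat}"
}
```

If both functions print nothing, every array listed in /proc/mdstat is active with all members present. Arrays that were not detected at all will not show up here, so the output still needs comparing against the expected list of arrays.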

I suggest we put a hard 1h cap on the preceding so that, if NFS has not been restored at that time, we roll back by:

  • powering down 1001
  • Rewire 1002 as it was (if applicable), or power it back up (otherwise)
  • Restart NFS on 1002 with /usr/local/sbin/start-nfs

We know that 1002 can resume operations with a start-nfs since we have already (accidentally) tested that when it was powered off by mistake.
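
The "wait for NFS to be gone entirely" step in the plan amounts to polling a condition with a deadline. A minimal sketch of such a helper follows, assuming bash; wait_until_fails is a hypothetical name, and the pgrep invocation in the trailing comment assumes the kernel NFS threads appear under the process name nfsd, as the plan's "[nfsd]" suggests.

```shell
# Poll until COMMAND stops succeeding, or give up after a deadline.
# Usage: wait_until_fails TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 once COMMAND fails (the condition is gone), 1 on timeout.
wait_until_fails() {
    local timeout=$1 interval=$2 elapsed=0
    shift 2
    while "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Example (not executed here): wait up to 60s for the nfsd threads to exit.
#   wait_until_fails 60 2 pgrep -x nfsd || echo "nfsd still running"
```

The helper only reports whether the condition cleared in time; deciding what to do on timeout (keep waiting, or proceed to poweroff anyway) stays with whoever is running the switch.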

yuvipanda triaged this task as High priority. Oct 24 2015, 10:27 PM

Changing priority to 'high' since labstore1002 is having more and more hardware issues.

yuvipanda set Security to None.
coren added a comment. Oct 26 2015, 4:30 PM

@Cmjohnson: We really want to do this early this week rather than later - can you set aside a 2h period in the DC for this sometime Tuesday or Wednesday? Your intervention is likely to be very simple (see above, switch a cable and be ready to switch back in case of rollback)

Let's plan on Wednesday morning at 9:30 Eastern.

mark moved this task from Backlog to Next Sprint on the Labs-Team-Backlog board. Oct 26 2015, 5:27 PM
coren added a comment. Oct 28 2015, 1:54 PM

This will start as soon as @Cmjohnson reaches the datacenter.

Change 249408 had a related patch set uploaded (by coren):
Labs: switch active labstore

https://gerrit.wikimedia.org/r/249408

Change 249408 merged by Andrew Bogott:
Labs: switch active labstore

https://gerrit.wikimedia.org/r/249408

coren closed this task as Resolved. Oct 28 2015, 3:48 PM

The switch is now complete, and labstore1001 is back to being the primary labs NFS server.

Only a single issue occurred that may bear looking into in case an emergency switch becomes necessary: for some reason (possibly a kernel version mismatch?), snapshots made on labstore1002 (through the backup process) could not all be activated on labstore1001, which stalled the boot process. Simply doing an lvremove on them sufficed to fix the issue (since they are disposable).
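
For a future emergency switch, the problem snapshots could be listed before anything is removed. The following is a sketch only, assuming bash and awk and the standard lv_attr encoding (first character s or S for a snapshot, fifth character a when active); inactive_snapshot_lvs is a hypothetical helper name, and it deliberately prints candidates rather than calling lvremove itself.

```shell
# Print "vg/lv" for every snapshot volume that is not currently active.
# Expects lines of "vg_name lv_name lv_attr" on stdin, for example from:
#   lvs --noheadings -o vg_name,lv_name,lv_attr
inactive_snapshot_lvs() {
    awk '$3 ~ /^[sS]/ && substr($3, 5, 1) != "a" { print $1 "/" $2 }'
}
```

The output is one volume per line, which can be reviewed by hand before any of it is passed to lvremove.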