
nfs-manage failover script needs to be tested with real load and fixed
Closed, Resolved | Public

Description

Recently we have executed a few failovers between the labstore1004 and labstore1005 hosts and had issues with the nfs-manage script. It seemed so great a year ago :)

  • Add documentation to wikitech for script usage and purpose
  • Add better inline documentation on steps for up/down
  • Test with more load. The principal problem is the inability to umount portions of the bind mount tree, which in turn means not being able to umount the underlying devices. The short-term resolution is to reboot the server to release resources, a release that the other node in the pair then sees. That's not a very good place to be.
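To illustrate the umount problem in the last item: nested mounts generally have to be released children-first, because a parent stays busy while anything is mounted beneath it. The following is a minimal sketch of that ordering, not the nfs-manage script's actual logic. The function names and the sample paths are hypothetical, and it assumes input in the `/proc/mounts` text format.

```python
def parse_mount_points(mounts_text):
    """Return the mount-point column from /proc/mounts-style text.

    Assumption: mount points contain no escaped whitespace (\\040);
    such entries would need decoding before use.
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.append(fields[1])
    return points


def _depth(path):
    """Number of non-empty path components ("/" has depth 0)."""
    return len([part for part in path.split("/") if part])


def unmount_order(mount_points, root):
    """Return the mount points at or under root, deepest first.

    Unmounting in this order releases children before their parents.
    Ties in depth are broken alphabetically so the output is stable.
    """
    root = root.rstrip("/") or "/"
    prefix = root.rstrip("/") + "/"
    selected = [p for p in mount_points if p == root or p.startswith(prefix)]
    return sorted(selected, key=lambda p: (-_depth(p), p))


if __name__ == "__main__":
    # Hypothetical mount table; on a real host, read /proc/mounts instead.
    sample = (
        "/dev/drbd4 /srv/tools ext4 rw 0 0\n"
        "/dev/drbd4 /exp/project/tools ext4 rw 0 0\n"
        "/dev/drbd4 /exp/project/tools/home ext4 rw 0 0\n"
    )
    for mount_point in unmount_order(parse_mount_points(sample), "/exp"):
        print("umount", mount_point)
```

The sketch only prints the commands it would run. Ordering alone does not help when a process still holds files open on a mount, which is the high-load case this task describes.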

I did the last 2 migrations by cat'ing the contents of the file and stepping through the process line by line. The procedure is solid in broad strokes but could use improvements and further testing for high-load, high-usage scenarios.

Event Timeline

Introducing fuser -k or some such is possibly an addition? With nfs-kernel-server stopped, file integrity issues from clients shouldn't be a concern, but clearly there are edge cases here.
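A rough sketch of what the fuser -k idea might look like as an ordered command plan for a single mount point. This is an illustration under assumptions, not what nfs-manage does: the function names and the example path are hypothetical, the service is assumed to be managed through systemctl, and the plan is only rendered as text rather than executed. `fuser -k -m` kills local processes using the named mounted filesystem (SIGKILL by default), so it should be treated as a last resort.

```python
import shlex


def release_plan(mount_point, service="nfs-kernel-server"):
    """Build the commands to free one mount point, in order.

    Assumption: the NFS server is stopped first so that clients are
    no longer writing, as the comment above suggests.
    """
    return [
        ["systemctl", "stop", service],      # stop serving NFS clients
        ["fuser", "-k", "-m", mount_point],  # kill local users of the filesystem
        ["umount", mount_point],             # release the mount itself
    ]


def render(plan):
    """Render each command as a shell-safe string for review or logging."""
    return [shlex.join(command) for command in plan]


if __name__ == "__main__":
    # Hypothetical mount point, printed as a dry run.
    for line in render(release_plan("/exp/project/tools")):
        print(line)
```

Keeping the plan as data makes it easy to review or log each step before anything runs, which matches the step-by-step approach used in the earlier manual migrations.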

Change 446715 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] labstore: notes for nfs-manage

https://gerrit.wikimedia.org/r/446715

Change 446715 merged by Rush:
[operations/puppet@production] labstore: notes for nfs-manage

https://gerrit.wikimedia.org/r/446715

Just to put an assignee on this one, since I'm thinking about it a lot.

Change 571821 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821

Change 571821 merged by Bstorm:
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821

Change 604857 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix the failover process

https://gerrit.wikimedia.org/r/604857

Change 604857 merged by Bstorm:
[operations/puppet@production] labstore: fix the failover process

https://gerrit.wikimedia.org/r/604857

Bstorm claimed this task.

The version uploaded (applied manually, which slowed it down) was shown to work in this last setup. Now failovers will be faster than ever. I feel confident in closing this ticket.

I also think we should apply the DRBD replication process to cloudstore1008/9 instead of the mechanism that is there now.

Change 618804 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] share storage: remove nfs-manage-binds

https://gerrit.wikimedia.org/r/618804

Change 618804 merged by Bstorm:
[operations/puppet@production] shared-storage: remove nfs-manage-binds

https://gerrit.wikimedia.org/r/618804