nfs-manage failover script needs to be tested with real load and fixed
Closed, Resolved · Public

Assigned To

Authored By

	• chasemp
	Jul 3 2017, 6:04 PM

Description

Recently we have executed a few failovers between the labstore1004 and labstore1005 hosts and had issues with the nfs-manage script. It seemed so great a year ago :)

Add documentation to wikitech for script usage and purpose
Add better inline documentation on steps for up/down
Test with more load. The principal problem is the inability to umount portions of the bind mount tree, which means the underlying devices cannot be unmounted either. The short-term resolution is to reboot the server to release the resources, which is then seen by the other node in the pair. That's not a very good place to be.

I did the last 2 migrations by cat'ing the contents of the file and stepping through the process line by line. The procedure is solid in broad strokes, but it could use improvements and further testing for high-load / high-usage scenarios.
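For illustration only, here is a minimal sketch of one way to order a bind mount tree for unmounting, deepest paths first, so that nested mounts are released before their parents. This is not the nfs-manage script itself; the function name mounts_under, the sample paths, and the device names are hypothetical, and the input is assumed to be in /proc/mounts format.

```python
import os

# Hypothetical excerpt in /proc/mounts format (device, mount point, type, ...).
SAMPLE = """\
/dev/drbd4 /srv/tools ext4 rw 0 0
/dev/drbd4 /exp/project/tools ext4 rw 0 0
/dev/drbd3 /srv/misc ext4 rw 0 0
/dev/drbd4 /exp/project ext4 rw 0 0
"""


def mounts_under(root, mounts_text):
    """Return mount points at or below root, deepest first.

    Unmounting in this order releases nested (bind) mounts before
    the mounts that contain them.
    """
    root = os.path.normpath(root)
    prefix = root.rstrip(os.sep) + os.sep
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # /proc/mounts escapes spaces in paths as \040
        target = os.path.normpath(fields[1].replace("\\040", " "))
        if target == root or target.startswith(prefix):
            found.append(target)
    # More path separators means deeper in the tree; sort those first.
    return sorted(found, key=lambda p: (p.count(os.sep), p), reverse=True)
```

On a live host the text could come from reading /proc/mounts, and each returned path would be passed to umount in order. Ordering alone does not help when a mount is still busy, which is the failure described above.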

Details

Subject	Repo	Branch	Lines +/-
shared-storage: remove nfs-manage-binds	operations/puppet	production	+3 -197
labstore: fix the failover process	operations/puppet	production	+2 -2
cloudstore: remove dependency on bind mounts	operations/puppet	production	+113 -59
labstore: notes for nfs-manage	operations/puppet	production	+11 -0

Related Objects

Status	Assigned	Task
Resolved	• Bstorm	T169570 nfs-manage failover script needs to be tested with real load and fixed
Declined	None	T203469 Test different NFS mount, export options and methods for failing over and coping with loss of a server
Invalid	None	T207880 create jessie NFS cluster with DRBD in Cloud VPS

Event Timeline

• chasemp created this task. Jul 3 2017, 6:04 PM

Restricted Application added a subscriber: Aklapper. Jul 3 2017, 6:04 PM

Introducing fuser -k, or something similar, is possibly a worthwhile addition? With nfs-kernel-server stopped, file integrity issues from clients shouldn't be a concern, but clearly there are edge cases here.
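As a rough sketch of what the fuser -k idea might look like, the snippet below builds the commands to kill processes holding a mount point and then unmount it. It assumes the psmisc fuser tool (-k to kill, -m to match everything on the mounted filesystem); the function names and the dry-run default are hypothetical and are not taken from nfs-manage.

```python
import subprocess


def release_commands(mount_point):
    """Build the argv lists: kill holders of the mount, then unmount it."""
    return [
        ["fuser", "-k", "-m", mount_point],  # kill processes using the filesystem
        ["umount", mount_point],
    ]


def release(mount_point, dry_run=True):
    """Return the planned commands; run them only when dry_run is False."""
    fuser_cmd, umount_cmd = release_commands(mount_point)
    if not dry_run:
        # fuser exits non-zero when no process holds the mount, so that
        # exit status is not treated as an error here.
        subprocess.run(fuser_cmd, check=False)
        # A failed umount raises CalledProcessError.
        subprocess.run(umount_cmd, check=True)
    return [fuser_cmd, umount_cmd]
```

Note that fuser only sees userspace processes, so it would not release anything held inside the kernel; whether that covers the edge cases mentioned here is untested.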

• chasemp triaged this task as High priority. Jul 3 2017, 6:04 PM

Paladox subscribed. Jul 3 2017, 6:51 PM

bd808 moved this task from Triage to Storage on the Cloud-Services board. Sep 14 2017, 5:19 AM

• chasemp added a subscriber: • Bstorm. Jun 6 2018, 7:25 PM

Change 446715 had a related patch set uploaded (by Rush; owner: cpettet):
[operations/puppet@production] labstore: notes for nfs-manage

https://gerrit.wikimedia.org/r/446715

gerritbot added a project: Patch-For-Review. Jul 19 2018, 3:36 AM

Change 446715 merged by Rush:
[operations/puppet@production] labstore: notes for nfs-manage

https://gerrit.wikimedia.org/r/446715

Just to put an assignee on this one, since I'm thinking about it a lot.

• Bstorm edited projects, added cloud-services-team (Kanban); removed Patch-For-Review, SRE, Cloud-Services. Aug 31 2018, 3:32 PM

bd808 moved this task from Inbox to Doing on the cloud-services-team (Kanban) board. Jul 11 2019, 4:30 PM

• Bstorm changed the status of subtask T203469: Test different NFS mount, export options and methods for failing over and coping with loss of a server from Open to Stalled. Aug 6 2019, 1:55 PM

• Bstorm moved this task from Doing to Soon! on the cloud-services-team (Kanban) board. Sep 25 2019, 3:47 PM

• Bstorm removed • Bstorm as the assignee of this task. Oct 23 2019, 3:20 PM

Change 571821 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821

gerritbot added a project: Patch-For-Review. Feb 20 2020, 6:42 PM

Change 571821 merged by Bstorm:
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821

Change 604857 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix the failover process

https://gerrit.wikimedia.org/r/604857

Change 604857 merged by Bstorm:
[operations/puppet@production] labstore: fix the failover process

https://gerrit.wikimedia.org/r/604857

The version uploaded (manually applied--which slowed it down) was shown to work in this last setup. Now failovers will be faster than ever. I feel confident in closing this ticket.

I also think we should apply the DRBD replication process to cloudstore1008/9 instead of the mechanism that is there now.

• Bstorm closed subtask T203469: Test different NFS mount, export options and methods for failing over and coping with loss of a server as Declined. Jun 11 2020, 9:46 PM

Change 618804 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] share storage: remove nfs-manage-binds

https://gerrit.wikimedia.org/r/618804

Change 618804 merged by Bstorm:
[operations/puppet@production] shared-storage: remove nfs-manage-binds