Add cluster-awareness to nfs-exportd
Closed, Resolved (Public)

Assigned To

Authored By

	• Bstorm
	May 21 2020, 10:08 PM

Description

Since we abandoned bind mounts on the NFS clusters per https://gerrit.wikimedia.org/r/c/operations/puppet/+/571821, nfs-exportd throws non-zero statuses on stretch+ on DRBD secondary nodes (which makes perfect sense). Why it didn't on Jessie is the real mystery.

As a band-aid fix, we switched from subprocess.check_call() to subprocess.call(), which does not raise on a non-zero exit status, so it seems like a good way to mask serious errors in the future. Adding some cluster awareness like that used by maintain-dbusers should fix things in a more sensible way. While we are at it, it wouldn't hurt to make nfs-exportd run exportfs only when the exports have actually changed. Re-running exportfs every 5 minutes cannot be good for the system's performance, even though we've done it for years.
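
A rough sketch of the behaviour being asked for, not the actual patch: skip all work on a non-active node, rewrite the exports file only when its content differs, and run exportfs with check_call() only after a change. The function name `sync_exports`, the `is_active` callable, the injectable `run` parameter, and the `/usr/sbin/exportfs -ra` invocation are all assumptions made for illustration; how nfs-exportd or maintain-dbusers actually detects the active node is not shown here.

```python
import subprocess
from pathlib import Path


def sync_exports(new_content, exports_path, is_active, run=subprocess.check_call):
    """Write the exports file and re-export only when needed.

    Hypothetical helper. Returns True if exportfs was run, False otherwise.
    `is_active` is any callable that reports whether this host is the
    current active node of the cluster (assumed to exist elsewhere).
    """
    # Standby (e.g. DRBD secondary) nodes do nothing, so there is no
    # expected failure that would need to be swallowed.
    if not is_active():
        return False

    path = Path(exports_path)
    old_content = path.read_text() if path.exists() else None

    # Nothing changed since the last pass: skip the exportfs call entirely.
    if old_content == new_content:
        return False

    path.write_text(new_content)
    # check_call raises CalledProcessError on a non-zero exit status, so a
    # real failure on the active node is no longer masked.
    run(["/usr/sbin/exportfs", "-ra"])
    return True
```

With this shape, a failing exportfs on the active node surfaces as an exception instead of being ignored. One gap in the sketch: if exportfs fails after the file has been written, the next pass sees no difference and does not retry, so a real daemon would need to handle that case.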

Details

	Subject	Repo	Branch	Lines +/-
	cloud nfs: only run nfs-exportd on the current active node	operations/puppet	production	+43 -13


Related Objects

Status	Subtype	Assigned	Task
Invalid		None	T197804 Puppet: forbid new Python2 code
Open		None	T218426 Upgrade various Cloud VPS Python 2 scripts to Python 3
Resolved	BUG REPORT	• Bstorm	T218423 Add python 3 packages to openstack::clientpackages::common
Resolved		MoritzMuehlenhoff	T232677 Remove support for Debian Jessie in Cloud Services
			Restricted Task
			Restricted Task
Resolved		MoritzMuehlenhoff	T224549 Track remaining jessie systems in production
Resolved		• Bstorm	T169289 Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues
Resolved		• Bstorm	T169286 labstore1005 A PCIe link training failure error on boot
Resolved		MoritzMuehlenhoff	T169290 New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS
Resolved		• Bstorm	T203254 labstore1004 and labstore1005 high load issues following upgrades
Resolved		• Bstorm	T224582 Migrate labstore1004/labstore1005 to Stretch/Buster
Resolved		• Bstorm	T253353 Add cluster-awareness to nfs-exportd

Event Timeline

• Bstorm triaged this task as Medium priority. May 21 2020, 10:08 PM

• Bstorm created this task.

• Bstorm moved this task from Backlog to Shared Storage on the Data-Services board.

• Bstorm moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

• Bstorm claimed this task. Jun 16 2020, 5:23 PM

• Bstorm moved this task from Soon! to Doing on the cloud-services-team (Kanban) board.

Change 606543 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloud nfs: only run nfs-exportd on the current active node

https://gerrit.wikimedia.org/r/606543

gerritbot added a project: Patch-For-Review. Jun 18 2020, 11:02 PM

Mentioned in SAL (#wikimedia-operations) [2020-06-22T22:39:38Z] <bstorm_> downtimed labstore1005 to prevent an alert during puppet merge T253353