
NFS v4.1/2 as possible fix for elevated load and lock contention on our NFS servers
Open, Stalled, Medium, Public

Description

Having long since given up on fixing the load issue because we have enough processor cores to roll with it, I came across this report from Red Hat: https://access.redhat.com/solutions/2142081

It turns out that v4.1 implemented parallel opens, which can fix some of the more egregious file-locking problems in NFS v4. The sensitivity to high load that has plagued these systems might be fairly easy to fix now that the mount version is explicitly pinned to v4.0, because the same clients may mount correctly as 4.1. This is worth testing in toolsbeta to make sure the client side is ok with it.

If the client side works, a large-scale remount will unfortunately be needed to see any benefit and find out whether it is worth it.
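As a sketch of what the client-side change would look like (the server name and paths below are placeholders, not our actual mounts), the version switch is just the `vers=` option in fstab:

```
# Hypothetical fstab entries -- server and paths are illustrative only.
# Current, pinned to 4.0:
nfs-server.example.org:/srv/tools  /mnt/nfs/tools  nfs  vers=4.0,rw,hard  0  0
# Proposed, to get parallel opens from 4.1:
nfs-server.example.org:/srv/tools  /mnt/nfs/tools  nfs  vers=4.1,rw,hard  0  0
```

Note that the kernel NFS client cannot change versions on a live `mount -o remount`; a full unmount and mount is needed, hence the large-scale remount.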

Event Timeline

Bstorm triaged this task as Medium priority. Jul 14 2020, 5:06 PM
Bstorm created this task.

Change 612647 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloud-nfs: Allow changing the nfs mount version

https://gerrit.wikimedia.org/r/612647

Notes on what I've learned by patching that into toolsbeta. On a stretch-based bastion, the remount works great, but reboot doesn't. On boot, simply having any NFSv4.1 or 4.2 mount causes all NFS mounts to hang frustratingly at mount time. It will often mount one of the targets at random, but never all of them. The hang cannot be interrupted, of course, because NFS.

The fix from this state so far seems to be changing back to 4.0 and rebooting.

The good news is that this works great with NFS 4.2 on Buster. I was able to reboot a k8s worker using those mounts in Toolsbeta without issue.

To be clear, toolsbeta-test-k8s-worker-2 is mounting v4.2 and toolsbeta-sgebastion-04 is the one that doesn't work right so far.

And toolsbeta-sgeexec-0901 works fine! So it's not the fact that it is stretch. It may be the fact that it is a bastion.

The bastion has systemd version 241-5~bpo9+1, while the exec node has 232-25+deb9u11.

That may be the real difference, but there are other customizations to consider.

Mentioned in SAL (#wikimedia-cloud) [2020-07-15T21:35:55Z] <bstorm> set all of toolsbeta to mount NFS 4.2 except the bastion T257945

Change 612647 merged by Bstorm:
[operations/puppet@production] cloud-nfs: Allow changing the nfs mount version

https://gerrit.wikimedia.org/r/612647

bd808 moved this task from Backlog to Shared Storage on the Data-Services board. Jul 19 2020, 11:41 PM
Bstorm added a comment. Edited Jul 21 2020, 4:16 PM

Set the nfs version value on the bastions to '4', the default, so that if I update the entire project, as I did with toolsbeta, they aren't affected.
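In hiera terms that looks something like the following (the key name here is illustrative; the real one comes from the patch above):

```
# Hypothetical hiera override for the bastion prefix -- key name is a
# stand-in for whatever the merged puppet patch actually exposes.
profile::toolforge::nfs_version: '4'
```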

So just now, rolling this out to tools-sgegrid-shadow, I saw an error on boot that might help explain all the failures:
Jul 21 16:11:36 tools-sgegrid-shadow kernel: [ 120.872350] NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO

I don't know if that's the same issue on the bastions, but that's an actually-researchable error.

It seems like this is a problem on stretch kernels in general in some cases. I'm going to set the grid shadow master back to 4.0. It's interesting that I did not see this on grid nodes in toolsbeta, but this definitely means we cannot expect to roll this out to stretch clients easily.

Bstorm added a comment. Edited Jul 21 2020, 5:03 PM

So I finally found a workaround for this issue (because I now worry it will pop up repeatedly) from an AWS doc.

I'm going to test this notion first in toolsbeta https://docs.aws.amazon.com/efs/latest/ug/images/AmazonElasticFileSystem-UserGuide-console1.pdf

To summarize: AWS EFS must be mounted as version 4.1 and uses loads of features like pNFS. They noticed that, on systemd-based systems, what we are seeing sometimes happens when there is more than one NFS 4.1 or 4.2 mount in fstab. They recommend a somewhat odd systemd workaround, which I need to play with a bit, that uses a systemd unit to mount on demand, sequentially. The error number is exactly what I saw, so I'm inclined to believe this describes precisely our error.
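The workaround boils down to letting systemd mount each target on first access rather than mounting them all in parallel at boot. In fstab form that's the automount options, roughly (server and path are placeholders):

```
# Hypothetical fstab entry using systemd's automount support so the NFS
# mount happens lazily on first access instead of in parallel at boot.
nfs-server.example.org:/srv/tools  /mnt/nfs/tools  nfs  vers=4.1,noauto,x-systemd.automount,x-systemd.requires=network-online.target  0  0
```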

Bstorm added a comment. Edited Jul 21 2020, 6:02 PM

The workaround did not help at all on the bastion. I'm starting to think the safest way to mount NFS on a systemd-based system is via systemd mounts. That sounds like a fair bit of work :(

It may still work on the other stretch nodes if I can reproduce the issue first.

Looks like I have a solid reproduction of the issue on toolsbeta-sgegrid-shadow

[  113.825517] NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO
[  113.825661] NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO
[  113.826137] NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO
[  113.938823] Process accounting resumed
[  170.214351] systemd[1]: apt-daily.timer: Adding 7h 9min 22.160224s random time.
root@toolsbeta-sgegrid-shadow:~#

Ok, so switching to NFS 4.1 doesn't help. Neither does the AWS hack. Creating an nfsmount.conf with defaultvers=4.0 doesn't help either...
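For the record, the nfsmount.conf attempt looked like this (the file lives at /etc/nfsmount.conf; this only sets a default, which explicit vers= options in fstab override):

```
# /etc/nfsmount.conf -- tried as a workaround; did not help.
[ NFSMount_Global_Options ]
Defaultvers=4.0
```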

I bet it would work by using systemd exclusively to do the mounts instead of fstab, and I kind of hate that option.
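Doing it exclusively via systemd would mean a native .mount unit per target instead of fstab entries, something like this (server, path, and unit name are hypothetical; the unit file name must encode the mount path with '-' in place of '/'):

```
# /etc/systemd/system/mnt-nfs-tools.mount -- illustrative sketch only.
[Unit]
Description=NFS mount for tools
Wants=network-online.target
After=network-online.target

[Mount]
What=nfs-server.example.org:/srv/tools
Where=/mnt/nfs/tools
Type=nfs
Options=vers=4.1,hard

[Install]
WantedBy=multi-user.target
```

One unit per mount would let systemd order and serialize them, which is exactly the fair bit of work I'd rather avoid.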

I just got it to work on stretch and nfs4.1. I used "vers=4,minorversion=1" as my format. WHY DOES THAT MATTER!!?
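That is, instead of `vers=4.1`, the older split spelling of the same request (placeholders again for server and path):

```
# Hypothetical fstab line -- requests the same NFS 4.1, but via the
# older vers= plus minorversion= option spelling.
nfs-server.example.org:/srv/tools  /mnt/nfs/tools  nfs  vers=4,minorversion=1,hard  0  0
```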

Ok, it seems to be that I need to blacklist a module as well, which makes me feel better about the solution. One more test.

Ah, apparently, I was just lucky. Now none of them mount on there.

I am convinced. Unless we want to use systemd for NFS (and even then it might not work), we can only use NFS 4.0 on stretch. This consistently works fine on buster.

Double checked and this is definitely not an issue on a buster VM. This is purely stretch. So this should be safe to apply to puppet prefixes that are known to be buster.

Mentioned in SAL (#wikimedia-cloud) [2020-07-22T22:28:32Z] <bstorm> setting tools-k8s-control prefix to mount NFS v4.2 T257945

Mentioned in SAL (#wikimedia-cloud) [2020-07-22T23:07:28Z] <bstorm> disabling puppet on k8s workers to reduce the effect of changing the NFS mount version all at once T257945

Mentioned in SAL (#wikimedia-cloud) [2020-07-22T23:11:19Z] <bstorm> running puppet and NFS remount on tools-k8s-worker-[1-15] T257945

Mentioned in SAL (#wikimedia-cloud) [2020-07-22T23:14:39Z] <bstorm> running puppet and NFS 4.2 remount on tools-k8s-worker-[21-40] T257945

Mentioned in SAL (#wikimedia-cloud) [2020-07-22T23:17:46Z] <bstorm> running puppet and NFS 4.2 remount on tools-k8s-worker-[41-55] T257945

Mentioned in SAL (#wikimedia-cloud) [2020-07-22T23:22:49Z] <bstorm> running puppet and NFS 4.2 remount on tools-k8s-worker-[56-60] T257945

Mentioned in SAL (#wikimedia-cloud) [2020-07-22T23:32:10Z] <bstorm> setting the default NFS version to 4.2 while excepting the two stretch servers T257945

Bstorm changed the task status from Open to Stalled. Jul 23 2020, 9:27 PM

That all went quite well. I'll leave this open until we can upgrade the grid, I suppose. At this point, this is stalled on stretch deprecation unless we eventually get NFS 4.1 or 4.2 to work on reboot on Debian Stretch.