
labstore1004 and labstore1005 high load issues following upgrades
Closed, Resolved (Public)

Description

This is to track the issue and find a resolution. After upgrading to the latest kernel, labstore1004 (the current primary) has changed its load characteristics dramatically, riding at about a factor of 10 above previous levels. At this point it doesn't seem to be causing significant problems for NFS client machines, but it is a problem for effective monitoring and for heavy-use scenarios.

The kernel we are currently experiencing this on is 4.9.0-0.bpo.8-amd64 #1 SMP Debian 4.9.110-3+deb9u4~deb8u1

Downgrading seems unwise, both because of security concerns and simply because of the age of the kernels we know don't exhibit this behavior.

Related Objects

(Related subtasks: a mix of Resolved, Declined, Duplicate, and Open tasks assigned to Bstorm, MoritzMuehlenhoff, aborrero, taavi, and nskaggs; task titles not captured here.)

Event Timeline

There are a very large number of changes, so older changes are hidden.
Bstorm triaged this task as Medium priority. Aug 31 2018, 2:17 PM

One of the first things I'd like to try is removing subtree checking from the exports. I want to have test cases ready for when that happens, so I'm putting that off until Tuesday.
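For the curious, a minimal sketch of what that change looks like in an export entry (the path and client range below are hypothetical, not our actual exports):

# /etc/exports, hypothetical entry, before:
/srv/tools  10.0.0.0/8(rw,sync,subtree_check,root_squash)
# after (subtree checking disabled):
/srv/tools  10.0.0.0/8(rw,sync,no_subtree_check,root_squash)
# re-export without restarting nfsd:
sudo exportfs -ra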

For reference:

The NFS/DRBD primary before the upgrade

Screen Shot 2018-08-31 at 10.19.29 AM.png (1×2 px, 725 KB)

The NFS/DRBD primary after our upgrade

Screen Shot 2018-08-31 at 10.21.51 AM.png (1×2 px, 786 KB)

Please note that the time scale on the first graph is a month, while the second covers only two days, which is why its peaks stretch out more. The peaks and troughs match between the two. On the second graph the peaks are comparatively not as high, but the load is overall very high, with little real impact from it.

Probably way too far outside the box, but I did notice at the last OpenStack summit that lots of folks were not using the baked-in kernel nfsd, in favor of https://github.com/nfs-ganesha/nfs-ganesha/wiki. If the meltdown/spectre changes have made the existing practices too expensive, maybe explore something that doesn't run in kernel space if it gets desperate.

I've been looking at lizardfs and similar notions, so I don't consider that at all too far outside the box, lol. I'll try small changes first, but things like that might be the way forward.

Also, it is worth noting that the load here is much higher than on labstore1006/7 with a similar kernel. From labstore1006's exports: no_subtree_check ;-)

Change 457896 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nfs-exportd: remove subtree_check from project exports

https://gerrit.wikimedia.org/r/457896

Change 457896 merged by Bstorm:
[operations/puppet@production] nfs-exportd: remove subtree_check from project exports

https://gerrit.wikimedia.org/r/457896

subtree_check was not the magic fix. There is still a share doing subtree checking on there, but I'm not seeing the proportional improvement I was hoping for. However, dmesg is filled with

[477102.039297] rpc-srv/tcp: nfsd: sent only 297568 when sending 1048640 bytes - shutting down socket
[477168.516644] rpc-srv/tcp: nfsd: sent only 281184 when sending 1048640 bytes - shutting down socket
[477235.026423] rpc-srv/tcp: nfsd: sent only 401472 when sending 1048640 bytes - shutting down socket
[477301.424096] rpc-srv/tcp: nfsd: sent only 387304 when sending 1048640 bytes - shutting down socket
[477367.805830] rpc-srv/tcp: nfsd: sent only 325864 when sending 1048640 bytes - shutting down socket
[477434.331189] rpc-srv/tcp: nfsd: sent only 126664 when sending 1048640 bytes - shutting down socket

THAT could cause some load problems. Time to tune some TCP buffers.
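Roughly the sort of sysctl change I have in mind; the exact values below are illustrative, not what will land in puppet:

# Raise the maximum socket buffer sizes and the TCP autotuning limits (illustrative values)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Confirm what is currently in effect
sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem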

Change 458198 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sysctl: Allow override of tcp settings in the kernel

https://gerrit.wikimedia.org/r/458198

Change 458198 abandoned by Bstorm:
sysctl: Allow override of tcp settings in the kernel

Reason:
Better to use the priority system

https://gerrit.wikimedia.org/r/458198

Change 458291 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: Change tcp buffer settings

https://gerrit.wikimedia.org/r/458291

Change 458291 merged by Bstorm:
[operations/puppet@production] labstore: Change tcp buffer settings

https://gerrit.wikimedia.org/r/458291

Change 458504 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix priority on sysctl file

https://gerrit.wikimedia.org/r/458504

Change 458504 merged by Bstorm:
[operations/puppet@production] labstore: fix priority on sysctl file

https://gerrit.wikimedia.org/r/458504

Change 458514 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: load monitoring should be based on number of processors

https://gerrit.wikimedia.org/r/458514

Change 458514 merged by Bstorm:
[operations/puppet@production] labstore: load monitoring should be based on number of processors

https://gerrit.wikimedia.org/r/458514

Dug around very deeply. Overall, the problem is that the nfsd threads so commonly end up in uninterruptible sleep (which increases load and can't possibly help performance). Opening the firehose by turning on all NFS debugging flags.
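For the record, "the firehose" is roughly the following (rpcdebug flags; this is very noisy and gets switched back off afterwards):

# Enable all debug flags for the nfsd and sunrpc modules, then watch the kernel log
rpcdebug -m nfsd -s all
rpcdebug -m rpc -s all
# ...inspect dmesg / the journal...
# Clear the flags again when done
rpcdebug -m nfsd -c all
rpcdebug -m rpc -c all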

How fun is that? I don't see anything wrong in the debug output.

I do see that we have not tweaked the IO scheduler characteristics in line with the DRBD documentation. We are using the deadline scheduler for IO (which seems to be the default on server installs here), but the other recommendations are not in place. At this point IO is the only place I can find where improvements are very likely, since this counts as "high load" even when things are sort-of-kind-of quiet for this cluster. It is entirely possible that some characteristics of the scheduler changed after the upgrade to become more expensive than before. IOwait is very low overall, but it isn't always low per processor, which could correspond to the uninterruptible sleep state of the nfsd threads. That suggests it is worth tweaking the scheduler a bit. Sadly, Friday evening is not the time to attempt that.

I may still try to enable debugging during a particularly high load period this weekend to see if I can get something more useful.

Applied the settings from https://docs.linbit.com/docs/users-guide-8.4/#s-prepare-storage regarding the IO queue scheduler. Just checking to see what impact, if any, that has here.
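Concretely, that amounts to something like the following (the device name is a placeholder for our backing device; the tunable values are the guide's suggestions):

# Use the deadline scheduler on the DRBD backing device (sdX is a placeholder)
echo deadline > /sys/block/sdX/queue/scheduler
# Deadline tunables suggested in the linked DRBD guide
echo 0 > /sys/block/sdX/queue/iosched/front_merges
echo 150 > /sys/block/sdX/queue/iosched/read_expire
echo 1500 > /sys/block/sdX/queue/iosched/write_expire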

It's early to tell, but this may have reduced overall load a small amount. If so, some changes will be needed to make the tuning persistent. It's hard to tell because it is not a big difference. That said, it would not have paged last night.

I've created a backport of the nfs-utils package from stretch for jessie, it's not yet uploaded to apt.wikimedia.org, but available at https://people.wikimedia.org/~jmm/nfs/

Tried it on the inactive server first and found that it does something a bit odd with an inactive nfs server (as we have it set up). Without nfs started (it's generally not running on a secondary in the pair), the exportfs command fails, which is no surprise in some ways, but apparently it did work before. May need to make the script not run the exportfs command unless it is active.

Ok, simply starting and stopping the nfs server resolved that. Note: in this version the service is called nfs-server, not nfs-kernel-server.

Also note: it depends on keyutils.

Well, that went terribly! Stopping the nfs-kernel-server and installing nfs-common and nfs-kernel-server backported packages did not result in a working server. Rolled back.

I suspect that I may also need to restart some services such as rpcbind, portmap, etc, that may not have been captured by the service I did restart. The result was that the server was advertising exports, but it was not giving anybody permission to use them (despite the exports file being correct).

@GTirloni has built a VPS project for us to test this in. Clearly, we need a bit more testing before releasing this again. So far, the secondary NFS server, labstore1005, still has the backported packages installed. If we cannot make this work in testing, I'll roll back the secondary as well.

Not sure it is a new situation, but I just became aware that the RPS setting intended to balance IRQs over CPUs for network receive queues isn't working on labstore1004. All receive and tx queues for the interface are clearly going over CPU0 only... and the RPS settings look like they are not really set at all in sysfs. This also has two NUMA nodes... what a lovely rabbit hole. (copied from IRC)
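For anyone following along, this is the sort of check and (possible) fix involved; the interface name and CPU mask here are placeholders:

# Show the RPS CPU mask for each receive queue (eth0 is a placeholder)
grep . /sys/class/net/eth0/queues/rx-*/rps_cpus
# All zeros means RPS is effectively off and receive processing piles onto one CPU.
# Spreading it over CPUs 0-15 would be a mask like:
echo ffff > /sys/class/net/eth0/queues/rx-0/rps_cpus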

Mentioned in SAL (#wikimedia-cloud) [2018-09-20T20:59:02Z] <bstorm_> nfs mount is now at /mnt/nfs/test on the client. Since it now works, we should be able to break it for T203254

Mentioned in SAL (#wikimedia-cloud) [2018-09-21T15:00:15Z] <bstorm_> Installed backported packages to nfs server and demonstrated that it can be made to work by following a procedure to be documented in T203254

The almost correct procedure for installing the backports:

  • install keyutils (new pre-req)
  • systemctl stop nfs-common
  • systemctl stop nfs-kernel-server
  • dpkg -i nfs-* in the dir with the packages
  • sudo mv idmapd.conf.ucf-dist idmapd.conf

But before I get too excited, it looks to have just started having the same problems again in testing.

It did. Killing rpc-statd fixed the problem. This setup doesn't need statd for basically anything; it was only started due to the nfs-common package, which doesn't run in this version. Rebooting the server doesn't restart it, and NFS continues working. So far so good. Running more bonnie++ to stress-test it.
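The stress test is just bonnie++ against the NFS mount on the test client, along these lines (size and user are arbitrary):

# Run on the test client against the NFS mount; -s is the working set size in MB
bonnie++ -d /mnt/nfs/test -s 16384 -u nobody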

Ok, now the test environment is working correctly and not randomly collapsing. Presuming this stays true over the weekend, the procedure is as follows (a consolidated shell sketch follows the list):

  • install keyutils
  • systemctl stop nfs-common
  • systemctl stop nfs-kernel-server
  • kill the rpc-statd process
  • dpkg -i nfs-* with the packages
  • sudo mv /etc/idmapd.conf.ucf-dist /etc/idmapd.conf
  • systemctl start nfs-server
  • make sure rpc-statd is still dead
  • test everything
  • ensure that puppet will disable nfs-server instead of nfs-kernel-server since the service changed names
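The same steps consolidated into a single sketch, assuming the backported .deb files are sitting in the current directory:

#!/bin/bash
# Sketch of the upgrade steps above; assumes the backported .debs are in $PWD
set -e
apt-get install -y keyutils            # new prerequisite
systemctl stop nfs-common
systemctl stop nfs-kernel-server
pkill rpc-statd || true                # statd is not needed in our setup
dpkg -i nfs-*.deb
mv /etc/idmapd.conf.ucf-dist /etc/idmapd.conf
systemctl start nfs-server             # service name changed from nfs-kernel-server
pgrep rpc-statd && echo "WARNING: rpc-statd came back" || true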

Testing showed no significant reduction in performance (and gains in some places) when benchmarking before and after. If we are lucky, this will actually fix the load problem. Either way, I do believe we may need to revisit the odd behavior I'm seeing around network queues and IRQs.

Nice, if these are confirmed working, we should import my nfs-utils backport to apt.wikimedia.org. Should these go to a separate component (something like component/nfs13, which is then added to selected NFS servers) or be added in general? Apart from labstore100[45], we also have labstore100[67] and dumpsdata; would those get updated as well, or rather not? If yes, we can simply import the packages to apt.wikimedia.org/main.

Aaand, it went badly when I tried it on the live systems. Similar issues. I'm trying to find what was different.

Change 463506 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] openstack: add case for stretch and newton in client repos

https://gerrit.wikimedia.org/r/463506

Change 463506 merged by Bstorm:
[operations/puppet@production] openstack: add case for stretch and newton in client repos

https://gerrit.wikimedia.org/r/463506

Change 463782 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] openstack: add case for stretch and newton in client repos

https://gerrit.wikimedia.org/r/463782

Hi, here are some reviews of LizardFS: https://www.jdieter.net/posts/2016/09/30/from-nfs-to-lizardfs/ and https://www.itcentralstation.com/product_reviews/lizardfs-review-34235-by-valentin-hobel .

I have experience with LizardFS, as I deployed it over the summer at Miraheze. We are quite a large MediaWiki farm. LizardFS puts very low load / RAM usage on the chunk servers (where it stores the data).

The only thing is to make sure the master has enough RAM and CPU power, but other than that you will be able to switch storage servers (or add or remove them) without changing mounts, since the mount binds to the master host.

I really do recommend it over NFS!

Ok. Now puppet can succeed on a cloudstore/labstore running stretch, after patching around the whole issue of the use_ldap scripts (which seem to be unnecessary on stretch). There are one or two other things to change, but then we could actually try running a stretch NFS cluster.

Change 467969 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: correct the service name for stretch

https://gerrit.wikimedia.org/r/467969

Change 463782 merged by Bstorm:
[operations/puppet@production] openstack: add case for stretch and newton in client repos

https://gerrit.wikimedia.org/r/463782

Change 467969 merged by Bstorm:
[operations/puppet@production] labstore: correct the service name for stretch

https://gerrit.wikimedia.org/r/467969

One change that may alleviate disk wait times is to change the RAID volumes from WriteBack to WriteThrough since our controller is battery-backed (megacli -LDSetProp WB -LAll -aAll).

For comparison, here's the configuration cache policy of other storage servers (megacli -LDInfo -LAll -aAll):

Server        Virtual Drive   Size     Default     Current
labstore1001  0:0             10.9TB   WriteBack   WriteBack
labstore1001  1:0             7.2TB    WriteBack   WriteBack
labstore1001  1:1             7.2TB    WriteBack   WriteBack
labstore1001  1:2             7.2TB    WriteBack   WriteBack
labstore1001  1:3             7.2TB    WriteBack   WriteBack
labstore1002  0:0             10.9TB   WriteBack   WriteBack
labstore1002  1:0             7.2TB    WriteBack   WriteBack
labstore1002  1:1             7.2TB    WriteBack   WriteBack
labstore1002  1:2             7.2TB    WriteBack   WriteBack
labstore1002  1:3             7.2TB    WriteBack   WriteBack
labstore1003  0:0             1.8TB    WriteBack   WriteBack
labstore1003  0:1             9TB      WriteBack   WriteBack
labstore1003  1:0             14TB     WriteBack   WriteThrough
labstore1003  1:1             14TB     WriteBack   WriteThrough
labstore1003  1:2             14TB     WriteBack   WriteThrough
labstore1004  0:0             931GB    WriteBack   WriteBack
labstore1004  0:1             9TB      WriteBack   WriteBack
labstore1004  0:1             12TB     WriteBack   WriteBack
labstore1005  0:0             931GB    WriteBack   WriteBack
labstore1005  0:0             9TB      WriteBack   WriteBack
labstore1005  0:0             12TB     WriteBack   WriteBack
labstore1007  0:0             10.9TB   WriteBack   WriteBack

The controller should fall back to WriteThrough if the battery goes bad (No Write Cache if Bad BBU).

labstore1001/2 are the walking dead, FYI. They are blank spares to be decommissioned when 8/9 replace 1003. 1003 is only still there because we don't have replacements up yet. 1003 is scratch and misc, which is more transient data in some cases.

That said, we are having problems with write latency. Wouldn't WriteThrough make that worse?

Yes, it would :facepalm: I mixed up the two concepts and was thinking about WriteBack and writing WriteThrough. Sorry. I'm out of ideas now.

So after all this time, one thing that none of us considered is that we are using Intel scalable processors. We didn't check what the upgrade did in terms of the scaling strategy. T210723#5255625 got me thinking that maybe I should take a look, and look at that:

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 
powersave

Basically, the processors are running at the lowest possible performance point, which is the default for some drivers/processors.
I manually set them all to performance on both servers to see how that affects load. If this works, it can be made persistent across reboots with an /etc/default file.
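Specifically, something like this; the /etc/default/cpufrequtils bit is one Debian-style way to persist it, not necessarily how it will be puppetized:

# Switch every core to the performance governor right now
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
# One way to make it persistent across reboots (assumption, not yet in puppet)
echo 'GOVERNOR="performance"' > /etc/default/cpufrequtils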

Mentioned in SAL (#wikimedia-operations) [2019-06-13T17:34:48Z] <bstorm_> T203254 set cpu scaling governor to performance on labstore1004 and labstore1005

Interestingly, the graph has overall settled down a lot recently (before I tried this change). The max in the past 24 hours is an entirely acceptable 50 (for this number of cores). It still is high compared to before that fateful upgrade, but it is low compared to what it had been (often up to 300).

The last time the load went over 100 was before rOPUPabaea106903e

So that's good.

The change doesn't seem to have hurt or helped since I made it. Stopping the client-side monitoring has done far more.

Keeping an eye on trends a bit more

Bstorm moved this task from Doing to Watching on the cloud-services-team (Kanban) board.
Bstorm claimed this task.

Same notes as in T169289, this is the new normal, and we just need to get used to it.

NFS 4.2 seemed to have helped in places, but I think that easing the tc limits did more. The load numbers are now as low as they were historically, and they dropped further when I opened up the limits more. The load was likely caused by lock release taking too long.
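For context, "easing the tc limits" means raising the rates/ceilings on the traffic-shaping classes in front of NFS; purely as an illustration (interface, class IDs, and rates here are made up):

# Inspect the current shaping classes (eth0 and the class IDs are placeholders)
tc -s class show dev eth0
# Raise the rate and ceiling on a hypothetical HTB class
tc class change dev eth0 parent 1: classid 1:10 htb rate 500mbit ceil 1000mbit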

Closed this, figuring this is the new normal. Now it is actually better.