
labstore1004 and labstore1005 high load issues following upgrades
Closed, Resolved (Public)

Description

This is to track the issue and find a resolution. After upgrading to the latest kernel, labstore1004 (the current primary) has changed its load characteristics dramatically, riding at about a factor of 10 above previous levels. At this point it doesn't seem to be causing significant problems for NFS client machines, but it is a problem for effective monitoring and for heavy-use scenarios.

The kernel we are currently experiencing this on is 4.9.0-0.bpo.8-amd64 #1 SMP Debian 4.9.110-3+deb9u4~deb8u1

Downgrading seems unwise, both because of security concerns and simply because of the age of the kernels we know don't exhibit this behavior.

Related Objects

(Related subtasks: a mix of Resolved, Declined, Duplicate, and Open tasks assigned to Bstorm, MoritzMuehlenhoff, aborrero, taavi, and nskaggs; task titles not captured here.)

Event Timeline

There are a very large number of changes, so older changes are hidden.
Bstorm triaged this task as Medium priority. Aug 31 2018, 2:17 PM

One of the first things I'd like to try is removing subtree checking from the exports. I want to have test cases ready for when that happens, so I'm putting that off until Tuesday.
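For the curious, a minimal sketch of what that change looks like in an export entry (the path and client range below are hypothetical, not our actual exports):

# /etc/exports, hypothetical entry, before:
/srv/tools  10.0.0.0/8(rw,sync,subtree_check,root_squash)
# after (subtree checking disabled):
/srv/tools  10.0.0.0/8(rw,sync,no_subtree_check,root_squash)
# re-export without restarting nfsd:
sudo exportfs -ra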

For reference:

The NFS/DRBD primary before the upgrade

Screen Shot 2018-08-31 at 10.19.29 AM.png (1×2 px, 725 KB)

The NFS/DRBD primary after our upgrade

Screen Shot 2018-08-31 at 10.21.51 AM.png (1×2 px, 786 KB)

Please note that the time scale on the first graph is a month, while the second covers only two days, which is why its peaks stretch out more. The peaks and troughs match between the two. On the second graph the peaks are comparatively not as high, but the load is overall very high, with little real impact from it.

Probably way too far outside the box, but I did notice at the last OpenStack summit that lots of folks were not using the baked-in kernel nfsd, in favor of https://github.com/nfs-ganesha/nfs-ganesha/wiki. If the meltdown/spectre changes have made the existing practices too expensive, maybe explore something that doesn't run in kernel space if it gets desperate.

I've been looking at lizardfs and similar notions, so I don't consider that at all too far outside the box, lol. I'll try small changes first, but things like that might be the way forward.

Also, it is worth noting that the load here is much higher than on labstore1006/7 with a similar kernel. From labstore1006's exports: no_subtree_check ;-)

Change 457896 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nfs-exportd: remove subtree_check from project exports

https://gerrit.wikimedia.org/r/457896

Change 457896 merged by Bstorm:
[operations/puppet@production] nfs-exportd: remove subtree_check from project exports

https://gerrit.wikimedia.org/r/457896

subtree_check was not the magic fix. There is still a share doing subtree checking on there, but I'm not seeing the proportional improvement I was hoping for. However, dmesg is filled with

[477102.039297] rpc-srv/tcp: nfsd: sent only 297568 when sending 1048640 bytes - shutting down socket
[477168.516644] rpc-srv/tcp: nfsd: sent only 281184 when sending 1048640 bytes - shutting down socket
[477235.026423] rpc-srv/tcp: nfsd: sent only 401472 when sending 1048640 bytes - shutting down socket
[477301.424096] rpc-srv/tcp: nfsd: sent only 387304 when sending 1048640 bytes - shutting down socket
[477367.805830] rpc-srv/tcp: nfsd: sent only 325864 when sending 1048640 bytes - shutting down socket
[477434.331189] rpc-srv/tcp: nfsd: sent only 126664 when sending 1048640 bytes - shutting down socket

THAT could cause some load problems. Time to tune some TCP buffers.
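Roughly the sort of sysctl change I have in mind; the exact values below are illustrative, not what will land in puppet:

# Raise the maximum socket buffer sizes and the TCP autotuning limits (illustrative values)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Confirm what is currently in effect
sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.ipv4.tcp_wmem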

Change 458198 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sysctl: Allow override of tcp settings in the kernel

https://gerrit.wikimedia.org/r/458198

Change 458198 abandoned by Bstorm:
sysctl: Allow override of tcp settings in the kernel

Reason:
Better to use the priority system

https://gerrit.wikimedia.org/r/458198

Change 458291 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: Change tcp buffer settings

https://gerrit.wikimedia.org/r/458291

Change 458291 merged by Bstorm:
[operations/puppet@production] labstore: Change tcp buffer settings

https://gerrit.wikimedia.org/r/458291

Change 458504 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: fix priority on sysctl file

https://gerrit.wikimedia.org/r/458504

Change 458504 merged by Bstorm:
[operations/puppet@production] labstore: fix priority on sysctl file

https://gerrit.wikimedia.org/r/458504

Change 458514 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: load monitoring should be based on number of processors

https://gerrit.wikimedia.org/r/458514

Change 458514 merged by Bstorm:
[operations/puppet@production] labstore: load monitoring should be based on number of processors

https://gerrit.wikimedia.org/r/458514

Dug around very deeply. Overall, the problem is that the nfsd threads so commonly end up in uninterruptible sleep (which increases load and can't possibly help performance). Opening the firehose by turning on all NFS debugging flags.
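For the record, "the firehose" is roughly the following (rpcdebug flags; this is very noisy and gets switched back off afterwards):

# Enable all debug flags for the nfsd and sunrpc modules, then watch the kernel log
rpcdebug -m nfsd -s all
rpcdebug -m rpc -s all
# ...inspect dmesg / the journal...
# Clear the flags again when done
rpcdebug -m nfsd -c all
rpcdebug -m rpc -c all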

How fun is that? I don't see anything wrong in the debug output.

I do see that we have not tweaked the IO scheduler characteristics in line with the DRBD documentation. We are using the deadline scheduler for IO (which seems to be the default on server installs here), but the other recommendations are not in place. At this point IO is the only place I can find where improvements are very likely, since this counts as "high load" even when things are sort-of-kind-of quiet for this cluster. It is entirely possible that some characteristics of the scheduler changed after the upgrade to become more expensive than before. IOwait is very low overall, but it isn't always low per processor, which could correspond to the uninterruptible sleep state of the nfsd threads. That suggests it is worth tweaking the scheduler a bit. Sadly, Friday evening is not the time to attempt that.

I may still try to enable debugging during a particularly high load period this weekend to see if I can get something more useful.

Applied the settings from https://docs.linbit.com/docs/users-guide-8.4/#s-prepare-storage regarding the IO queue scheduler. Just checking to see what impact, if any, that has here.
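Concretely, that amounts to something like the following (the device name is a placeholder for our backing device; the tunable values are the guide's suggestions):

# Use the deadline scheduler on the DRBD backing device (sdX is a placeholder)
echo deadline > /sys/block/sdX/queue/scheduler
# Deadline tunables suggested in the linked DRBD guide
echo 0 > /sys/block/sdX/queue/iosched/front_merges
echo 150 > /sys/block/sdX/queue/iosched/read_expire
echo 1500 > /sys/block/sdX/queue/iosched/write_expire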

It's early to tell, but this may have reduced overall load a small amount. If so, some changes will be needed to make the tuning persistent. It's hard to tell because it is not a big difference. That said, it would not have paged last night.

I've created a backport of the nfs-utils package from stretch for jessie, it's not yet uploaded to apt.wikimedia.org, but available at https://people.wikimedia.org/~jmm/nfs/

Tried it on the inactive server first and found that it does something a bit odd with an inactive nfs server (as we have it set up). Without nfs started (it's generally not running on a secondary in the pair), the exportfs command fails, which is no surprise in some ways, but apparently it did work before. May need to make the script not run the exportfs command unless it is active.

Ok, simply starting and stopping the nfs server resolved that. Note: in this version the service is called nfs-server, not nfs-kernel-server.

Also note: it depends on keyutils.

Well, that went terribly! Stopping the nfs-kernel-server and installing nfs-common and nfs-kernel-server backported packages did not result in a working server. Rolled back.

I suspect that I may also need to restart some services such as rpcbind, portmap, etc, that may not have been captured by the service I did restart. The result was that the server was advertising exports, but it was not giving anybody permission to use them (despite the exports file being correct).

@GTirloni has built a VPS project for us to test this in. Clearly, we need a bit more testing before releasing this again. So far, the secondary NFS server, labstore1005, still has the backported packages installed. If we cannot make this work in testing, I'll roll back the secondary as well.

Not sure it is a new situation, but I just became aware that the RPS setting intended to balance IRQs over CPUs for network receive queues isn't working on labstore1004. All receive and tx queues for the interface are clearly going over CPU0 only... and the RPS settings look like they are not really set at all in sysfs. This also has two NUMA nodes... what a lovely rabbit hole. (copied from IRC)
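For anyone following along, this is the sort of check and (possible) fix involved; the interface name and CPU mask here are placeholders:

# Show the RPS CPU mask for each receive queue (eth0 is a placeholder)
grep . /sys/class/net/eth0/queues/rx-*/rps_cpus
# All zeros means RPS is effectively off and receive processing piles onto one CPU.
# Spreading it over CPUs 0-15 would be a mask like:
echo ffff > /sys/class/net/eth0/queues/rx-0/rps_cpus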

Mentioned in SAL (#wikimedia-cloud) [2018-09-20T20:59:02Z] <bstorm_> nfs mount is now at /mnt/nfs/test on the client. Since it now works, we should be able to break it for T203254

Mentioned in SAL (#wikimedia-cloud) [2018-09-21T15:00:15Z] <bstorm_> Installed backported packages to nfs server and demonstrated that it can be made to work by following a procedure to be documented in T203254

The almost correct procedure for installing the backports:

  • install keyutils (new pre-req)
  • systemctl stop nfs-common
  • systemctl stop nfs-kernel-server
  • dpkg -i nfs-* in the dir with the packages
  • sudo mv idmapd.conf.ucf-dist idmapd.conf

But before I get too excited, it looks to have just started having the same problems again in testing.

It did. Killing rpc-statd fixed the problem. This setup doesn't need statd for basically anything; it was only started due to the nfs-common package, which doesn't run in this version. Rebooting the server doesn't restart it, and NFS continues working. So far so good. Running more bonnie++ to stress-test it.
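The stress test is just bonnie++ against the NFS mount on the test client, along these lines (size and user are arbitrary):

# Run on the test client against the NFS mount; -s is the working set size in MB
bonnie++ -d /mnt/nfs/test -s 16384 -u nobody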

Ok, now the test environment is working correctly and not randomly collapsing. Presuming this stays true over the weekend, the procedure is as follows (a consolidated shell sketch follows the list):

  • install keyutils
  • systemctl stop nfs-common
  • systemctl stop nfs-kernel-server
  • kill the rpc-statd process
  • dpkg -i nfs-* with the packages
  • sudo mv /etc/idmapd.conf.ucf-dist /etc/idmapd.conf
  • systemctl start nfs-server
  • make sure rpc-statd is still dead
  • test everything
  • ensure that puppet will disable nfs-server instead of nfs-kernel-server since the service changed names
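The same steps consolidated into a single sketch, assuming the backported .deb files are sitting in the current directory:

#!/bin/bash
# Sketch of the upgrade steps above; assumes the backported .debs are in $PWD
set -e
apt-get install -y keyutils            # new prerequisite
systemctl stop nfs-common
systemctl stop nfs-kernel-server
pkill rpc-statd || true                # statd is not needed in our setup
dpkg -i nfs-*.deb
mv /etc/idmapd.conf.ucf-dist /etc/idmapd.conf
systemctl start nfs-server             # service name changed from nfs-kernel-server
pgrep rpc-statd && echo "WARNING: rpc-statd came back" || true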

Testing showed no significant reduction in performance (and gains in some places) when benchmarking before and after. If we are lucky, this will actually fix the load problem. Either way, I do believe we may need to revisit the odd behavior I'm seeing around network queues and IRQs.

Nice, if these are confirmed working, we should import my nfs-utils backport to apt.wikimedia.org. Should these go to a separate component (something like component/nfs13, which is then added to selected NFS servers) or be added in general? Apart from labstore100[45], we also have labstore100[67] and dumpsdata; would those get updated as well, or rather not? If yes, we can simply import the packages to apt.wikimedia.org/main.

Aaand, it went badly when I tried it on the live systems. Similar issues. I'm trying to find what was different.

Change 463506 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] openstack: add case for stretch and newton in client repos

https://gerrit.wikimedia.org/r/463506

Change 463506 merged by Bstorm:
[operations/puppet@production] openstack: add case for stretch and newton in client repos

https://gerrit.wikimedia.org/r/463506

Change 463782 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] openstack: add case for stretch and newton in client repos

https://gerrit.wikimedia.org/r/463782

Hi, here are some reviews of LizardFS: https://www.jdieter.net/posts/2016/09/30/from-nfs-to-lizardfs/ and https://www.itcentralstation.com/product_reviews/lizardfs-review-34235-by-valentin-hobel .

I have experience with LizardFS, as I deployed it over the summer at Miraheze. We are quite a large MediaWiki farm. LizardFS puts very low load / RAM usage on the chunk servers (where it stores the data).

The only thing is to make sure the master has enough RAM and CPU power, but other than that you will be able to switch storage servers (or add or remove them) without changing mounts, since the mount binds to the master host.

I really do recommend it over NFS!

Ok. Now puppet can succeed on a cloudstore/labstore running stretch, after patching around the whole issue of the use_ldap scripts (which seem to be unnecessary on stretch). There are one or two other things to change, but then we could actually try running a stretch NFS cluster.

Change 467969 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: correct the service name for stretch

https://gerrit.wikimedia.org/r/467969

Change 463782 merged by Bstorm:
[operations/puppet@production] openstack: add case for stretch and newton in client repos

https://gerrit.wikimedia.org/r/463782

Change 467969 merged by Bstorm:
[operations/puppet@production] labstore: correct the service name for stretch

https://gerrit.wikimedia.org/r/467969

One change that may alleviate disk wait times is to change the RAID volumes from WriteBack to WriteThrough since our controller is battery-backed (megacli -LDSetProp WB -LAll -aAll).

For comparison, here's the configuration cache policy of other storage servers (megacli -LDInfo -LAll -aAll):

Server        Virtual Drive   Size     Default     Current
labstore1001  0:0             10.9TB   WriteBack   WriteBack
labstore1001  1:0             7.2TB    WriteBack   WriteBack
labstore1001  1:1             7.2TB    WriteBack   WriteBack
labstore1001  1:2             7.2TB    WriteBack   WriteBack
labstore1001  1:3             7.2TB    WriteBack   WriteBack
labstore1002  0:0             10.9TB   WriteBack   WriteBack
labstore1002  1:0             7.2TB    WriteBack   WriteBack
labstore1002  1:1             7.2TB    WriteBack   WriteBack
labstore1002  1:2             7.2TB    WriteBack   WriteBack
labstore1002  1:3             7.2TB    WriteBack   WriteBack
labstore1003  0:0             1.8TB    WriteBack   WriteBack
labstore1003  0:1             9TB      WriteBack   WriteBack
labstore1003  1:0             14TB     WriteBack   WriteThrough
labstore1003  1:1             14TB     WriteBack   WriteThrough
labstore1003  1:2             14TB     WriteBack   WriteThrough
labstore1004  0:0             931GB    WriteBack   WriteBack
labstore1004  0:1             9TB      WriteBack   WriteBack
labstore1004  0:1             12TB     WriteBack   WriteBack
labstore1005  0:0             931GB    WriteBack   WriteBack
labstore1005  0:0             9TB      WriteBack   WriteBack
labstore1005  0:0             12TB     WriteBack   WriteBack
labstore1007  0:0             10.9TB   WriteBack   WriteBack

The controller should fall back to WriteThrough if the battery goes bad (No Write Cache if Bad BBU).

labstore1001/2 are the walking dead, FYI. They are blank spares to be decommissioned when 8/9 replace 1003. 1003 is only still there because we don't have replacements up yet. 1003 is scratch and misc, which is more transient data in some cases.

That said, we are having problems with write latency. Wouldn't WriteThrough make that worse?

Yes, it would :facepalm: I mixed up the two concepts and was thinking about WriteBack and writing WriteThrough. Sorry. I'm out of ideas now.

So after all this time, one thing that none of us considered is that we are using Intel scalable processors. We didn't check what the upgrade did in terms of the scaling strategy. T210723#5255625 got me thinking that maybe I should take a look, and look at that:

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 
powersave

Basically, the processors are running at the lowest possible performance point, which is the default for some drivers/processors.
I manually set them all to performance on both servers to see how that affects load. If this works, it can be made persistent across reboots with an /etc/default file.
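Specifically, something like this; the /etc/default/cpufrequtils bit is one Debian-style way to persist it, not necessarily how it will be puppetized:

# Switch every core to the performance governor right now
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
# One way to make it persistent across reboots (assumption, not yet in puppet)
echo 'GOVERNOR="performance"' > /etc/default/cpufrequtils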

Mentioned in SAL (#wikimedia-operations) [2019-06-13T17:34:48Z] <bstorm_> T203254 set cpu scaling governor to performance on labstore1004 and labstore1005

Interestingly, the graph has overall settled down a lot recently (before I tried this change). The max in the past 24 hours is an entirely acceptable 50 (for this number of cores). It still is high compared to before that fateful upgrade, but it is low compared to what it had been (often up to 300).

The last time the load went over 100 was before rOPUPabaea106903e

So that's good.

The change doesn't seem to have hurt or helped since I made it. Stopping the client-side monitoring has done far more.

Keeping an eye on trends a bit more

Bstorm moved this task from Doing to Watching on the cloud-services-team (Kanban) board.
Bstorm claimed this task.

Same notes as in T169289, this is the new normal, and we just need to get used to it.

NFS 4.2 seemed to have helped in places, but I think that easing the tc limits did more. The load numbers are now as low as they were historically, and they dropped further when I opened up the limits more. The load was likely caused by lock release taking too long.
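For context, "easing the tc limits" means raising the rates/ceilings on the traffic-shaping classes in front of NFS; purely as an illustration (interface, class IDs, and rates here are made up):

# Inspect the current shaping classes (eth0 and the class IDs are placeholders)
tc -s class show dev eth0
# Raise the rate and ceiling on a hypothetical HTB class
tc class change dev eth0 parent 1: classid 1:10 htb rate 500mbit ceil 1000mbit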

Closed this, figuring this is the new normal. Now it is actually better.