
New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS
Closed, Resolved (Public)

Description

We upgraded labstore1004/1005 to 4.9.25-1~bpo8+3 and things got really bad. Downgraded back to the former kernel and things got better.

Attached graph shows this in terrifying color.

nfsperformance.png (457×800 px, 101 KB)

Event Timeline

Andrew renamed this task from "New anti-stackclash (4.9.25-1~bpo8+3 ) kernal SUPER BAD for NFS" to "New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS". Jun 30 2017, 2:37 AM

"A total of 16,214 non-merge changesets were pulled into the mainline repository for the 4.9 development cycle, making this cycle the busiest in the kernel project's history." https://www.linux.com/news/linux-weather-forecast

Which NFS services/processes caused this?

Summarizing from IRC for posterity :)

Load followed the usage pattern we would expect, but the swings were heavily exaggerated (periods of high use were higher and periods of low use were lower). Over the last 10 months or so we have generally seen a load of 0.5-3 during normal operations; here it was averaging 20-50, and at times we were seeing 80-110. On the client side we saw load climb, and on the server side we observed a rotating cast of nfsd processes in D (uninterruptible sleep) wait state. When nfs-kernel-server was stopped, load dropped and stayed down until it was started again. Other than performance being way off normal, T169281 was the only real clue that things were wrong.
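For anyone wanting to repeat the "nfsd procs in D state" observation, here is a minimal sketch, assuming a Linux /proc filesystem. It is not what we ran at the time; the function names `parse_stat_line` and `uninterruptible_processes` are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    start = line.index("(")
    end = line.rindex(")")
    pid = int(line[:start].strip())
    comm = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the line was unparsable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)


if __name__ == "__main__":
    # Only meaningful on a host that actually has /proc mounted.
    if os.path.isdir("/proc"):
        for pid, comm in uninterruptible_processes():
            print(pid, comm)
```

Running it a few times in a row on the server should show whether the set of blocked nfsd threads keeps changing, which is the "rotating cast" described above.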

I talked to someone in #drbd (lge, a dev I think) who said they have no reason to think there would be an issue with the 4.4 or 4.9 kernel variants and module version 8.4.5. They suggested we grab https://github.com/LINBIT/drbd-8.4 and build at 8.4.10, their 'out of tree' bug-fix and up-to-date tag, as that's the next step to really demonstrating the problem for upstream. They also suggested double-checking that the IO scheduler doesn't change between kernels, since that could have drastic effects.
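The IO scheduler check could be done roughly like this: a sketch, assuming the usual /sys/block/<dev>/queue/scheduler layout where the active scheduler is shown in square brackets. The helper names `active_scheduler` and `schedulers` are hypothetical.

```python
import json
import os


def active_scheduler(text):
    """Return the active scheduler from a queue/scheduler file's contents."""
    # The kernel marks the active scheduler with brackets, e.g.
    # "noop deadline [cfq]".
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Some devices report a bare value such as "none" with no brackets;
    # in that case return the stripped contents unchanged.
    return text.strip()


def schedulers(sys_block="/sys/block"):
    """Map each block device name to its active IO scheduler."""
    result = {}
    if not os.path.isdir(sys_block):
        return result
    for dev in sorted(os.listdir(sys_block)):
        path = os.path.join(sys_block, dev, "queue", "scheduler")
        try:
            with open(path) as f:
                result[dev] = active_scheduler(f.read())
        except OSError:
            continue  # device has no scheduler file or it is unreadable
    return result


if __name__ == "__main__":
    # Stable, sorted JSON so the output from two boots can be diffed.
    print(json.dumps(schedulers(), indent=2, sort_keys=True))
```

Saving the output once under the old kernel and once under 4.9.25-1~bpo8+3, then diffing the two files, would show whether any device switched scheduler across the upgrade.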

This is resolved; the jessie-based labstore servers have been running 4.9 for a few weeks.