New anti-stackclash (4.9.25-1~bpo8+3) kernel super bad for NFS
Closed, Resolved (Public)

Description

We upgraded labstore1004/1005 to 4.9.25-1~bpo8+3 and things got really bad. Downgraded back to the former kernel and things got better.

Attached graph shows this in terrifying color.

nfsperformance.png (457×800 px, 101 KB)

Event Timeline

Andrew renamed this task from "New anti-stackclash (4.9.25-1~bpo8+3 ) kernal SUPER BAD for NFS" to "New anti-stackclash (4.9.25-1~bpo8+3) kernel super bad for NFS". Jun 30 2017, 2:37 AM

"A total of 16,214 non-merge changesets were pulled into the mainline repository for the 4.9 development cycle, making this cycle the busiest in the kernel project's history." https://www.linux.com/news/linux-weather-forecast

Which NFS services/processes caused this?

Summarizing from IRC for posterity :)

Load followed the usage pattern we would expect but was heavily inflated (periods of high use were higher and periods of low use were lower). We have generally seen load of 0.5-3 during normal operations over the last 10 months or so; here it was averaging 20-50, and we saw 80-110. Client side we saw load climb, and server side we observed a rotating cast of nfsd procs in D (uninterruptible) wait state. When nfs-kernel-server was stopped, load dropped until it was started again. Other than performance being way off normal, T169281 was the only real clue that things were wrong.
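For anyone wanting to repeat the "nfsd procs in D state" check without eyeballing top, here is a rough Python sketch. It is not what we ran at the time; the function names `parse_stat` and `procs_in_d_state` are made up for illustration, and it assumes a Linux host where /proc/<pid>/stat has the usual "pid (comm) state ..." layout.

```python
import os


def parse_stat(stat_text):
    """Return (comm, state) from the contents of a /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so locate the first '(' and the last ')' instead of
    # splitting the whole line on whitespace.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    comm = stat_text[open_idx + 1:close_idx]
    # The first field after the closing parenthesis is the process state.
    state = stat_text[close_idx + 1:].split()[0]
    return comm, state


def procs_in_d_state(name="nfsd", proc_root="/proc"):
    """List PIDs of processes called `name` that are currently in state D."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the file was unreadable
        if comm == name and state == "D":
            pids.append(int(entry))
    return sorted(pids)
```

Calling `procs_in_d_state()` a few times a second on the server and logging the result should show whether the same nfsd threads stay stuck or whether the set rotates, as it did here.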

I talked to someone in #drbd (lge, a dev I think) who said they have no reason to think there would be an issue with 4.4 or 4.9 kernel variants with the module version 8.4.5. They suggested we grab https://github.com/LINBIT/drbd-8.4 and build at 8.4.10, their 'out of tree' bug-fix and up-to-date tag, as that's the next step to really demonstrating the problem for upstream. They also suggested double-checking that the IO scheduler doesn't change between kernels, since that could have drastic effects.
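On the IO scheduler point, a small Python sketch for recording the active scheduler per block device, so the output can be diffed before and after a kernel change. This is illustrative only: `active_scheduler` and `schedulers` are hypothetical names, and it assumes the usual sysfs convention where /sys/block/<dev>/queue/scheduler lists the available schedulers with the active one in square brackets. The fallback for a file with no brackets (for example a bare "none") is my assumption about how some stacked devices report.

```python
import glob
import os
import re


def active_scheduler(text):
    """Extract the active scheduler from a queue/scheduler file's contents."""
    # Usual form: "noop deadline [cfq]" with the active entry bracketed.
    match = re.search(r"\[([^\]]+)\]", text)
    if match:
        return match.group(1)
    # Assumption: with no brackets, treat the stripped contents (e.g. "none")
    # as the answer, and an empty file as unknown.
    stripped = text.strip()
    return stripped or None


def schedulers(sys_block="/sys/block"):
    """Map each block device name to its active IO scheduler."""
    result = {}
    pattern = os.path.join(sys_block, "*", "queue", "scheduler")
    for path in glob.glob(pattern):
        device = path.split(os.sep)[-3]  # .../<device>/queue/scheduler
        try:
            with open(path) as fh:
                result[device] = active_scheduler(fh.read())
        except OSError:
            continue  # device disappeared or file unreadable
    return result
```

Saving the output of `schedulers()` under each kernel and comparing the two dictionaries would confirm or rule out a scheduler change as the cause.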

This is resolved; the jessie-based labstore servers have been running 4.9 for a few weeks.