
New anti-stackclash (4.9.25-1~bpo8+3) kernel super bad for NFS
Closed, Resolved · Public

Description

We upgraded labstore1004/1005 to 4.9.25-1~bpo8+3 and things got really bad. We downgraded to the previous kernel and things got better.

Attached graph shows this in terrifying color.

Event Timeline

Andrew created this task. · Jun 30 2017, 2:37 AM
Restricted Application added a subscriber: Aklapper. · Jun 30 2017, 2:37 AM
Andrew renamed this task from New anti-stackclash (4.9.25-1~bpo8+3 ) kernal SUPER BAD for NFS to New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS. · Jun 30 2017, 2:37 AM
Paladox added a subscriber: Paladox. · Jul 1 2017, 6:55 PM

"A total of 16,214 non-merge changesets were pulled into the mainline repository for the 4.9 development cycle, making this cycle the busiest in the kernel project's history." https://www.linux.com/news/linux-weather-forecast

Which NFS services/processes caused this?

chasemp added a subscriber: chasemp. · Edited · Jul 6 2017, 7:34 PM

Which NFS services/processes caused this?

Summarizing from IRC for posterity :)

Load followed the pattern we would expect but was way inflated (periods of high use were higher and periods of low use were lower). We have generally seen a load of 0.5-3 during normal operations over the last 10 months or so; here it was averaging 20-50, and at times we saw 80-110. Client side, we saw load climb, and server side we observed a rotating cast of nfsd procs in D (uninterruptible sleep) wait state. When nfs-kernel-server was stopped, load dropped and stayed down until it was started again. Other than performance being way off normal, T169281 was the only real clue that things were wrong.
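For anyone wanting to repeat the server-side observation, here is a minimal sketch of how one might list nfsd threads currently in D state by reading `/proc/<pid>/stat`. It is not what was run during the incident; the function names `parse_stat` and `d_state_pids` are hypothetical, and it assumes a Linux host where the kernel NFS threads show up with the command name `nfsd`.

```python
import glob


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first "(" and the last ")".
    """
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def d_state_pids(name="nfsd"):
    """Return sorted PIDs of processes called `name` that are in D state."""
    found = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state == "D" and comm == name:
            found.append(pid)
    return sorted(found)


if __name__ == "__main__":
    print("nfsd threads in D state:", d_state_pids())
```

Running it a few times in a row would show whether the set of blocked threads keeps rotating, as described above.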

I talked to someone in #drbd (lge, a dev I think) who said they have no reason to think there would be an issue with 4.4 or 4.9 kernel variants with the module version 8.4.5. They suggested we grab https://github.com/LINBIT/drbd-8.4 and build at 8.4.10, their 'out of tree' bug fix and up-to-date tag, as that's the next step to really demonstrating the problem for upstream. They also suggested double-checking that the IO scheduler doesn't change between kernels, since that could have drastic effects.
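The IO scheduler check can be sketched as follows, assuming the usual Linux sysfs layout where `/sys/block/<device>/queue/scheduler` lists the available schedulers with the active one in square brackets. The helper names are hypothetical. Running this under each kernel and comparing the output would show whether the scheduler changed.

```python
import glob
import os


def active_scheduler(line):
    """Return the bracketed (active) scheduler from a sysfs scheduler line."""
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Some devices report a single entry such as "none" without brackets.
    return line.strip() or None


def schedulers_by_device(pattern="/sys/block/*/queue/scheduler"):
    """Map each block device name to its active IO scheduler."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        # Path looks like .../<device>/queue/scheduler; go up two levels.
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as handle:
                result[device] = active_scheduler(handle.read())
        except OSError:
            continue  # device disappeared or file not readable
    return result


if __name__ == "__main__":
    for device, scheduler in schedulers_by_device().items():
        print(f"{device}: {scheduler}")
```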

MoritzMuehlenhoff closed this task as Resolved. · Sep 10 2018, 1:20 PM

This is resolved; the jessie-based labstore servers have been running 4.9 for a few weeks.