
jessie kernel vm subsystem issues for upload caches
Closed, Resolved (Public)

Description

On the jessie upload-cache test host (cp1064), we've had some issues in the kernel vm subsystem related to the heavy disk i/o of the varnish :3128 backend's persistent/bigobj caches. The primary symptom (the one that must be fixed) is periodic, extremely heavy, short-lived spikes of CPU utilization (sys, as opposed to user/iowait), which seem to originate in a contended lock in the kswapd threads during something related to cache page eviction. These events temporarily push the loadavg into the hundreds or thousands and stall normal network/disk i/o long enough to cause 503 spikes.

There are numerous ongoing experiments on how to resolve this (vm kernel parameter tuning, switching filesystems, changing varnish storage configuration/rules, possible kernel downgrades, or upgrades with new bugfix patches).

Event Timeline

BBlack claimed this task.
BBlack raised the priority of this task to Medium.
BBlack updated the task description.
BBlack added subscribers: BBlack, faidon.
BBlack updated the task description.
BBlack set Security to None.
BBlack updated the task description.

Just to keep a more-detailed record of things that have been tried:

At various points in this process, the spike frequency has ranged from one every few minutes to one every hour or two. Also, after wiping the disk caches and restarting varnishes and/or the whole machine, there can be a long period of smooth operation before the pattern settles in and starts happening again (likely the time it takes for one or more of the disk caches to fill up far enough that it has to start evicting varnish cache objects to make room for new ones).

In the original config (very similar to our working config on the precise kernel), in addition to the sys% spikes, iowait in general was very erratic and heavy. Initially tried tuning the vm parameters dirty_background_ratio, dirty_ratio, and dirty_expire_centisecs. Tried many different directions and values with these; with appropriate values they do smooth out the erratic iowait and reduce the frequency/severity of the critical sys% spikes, but they do not eliminate the problem. Some of the vm params were obviously helpful regardless and were committed for the jessie nodes (e.g. rOPUP5a89be89f0da8ee5dc40c7ee8bc23e9fcbc2e21c).
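For reference, a minimal sketch of the sort of sysctl tuning this involves; the values below are illustrative only, and the settings actually deployed are the ones in the puppet commit referenced above:

    # Illustrative values, not the deployed ones -- see the puppet commit above.
    # Lower the dirty-page thresholds so writeback starts earlier and in smaller
    # bursts, and expire dirty pages sooner, to smooth out the erratic iowait.
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=10
    sysctl -w vm.dirty_expire_centisecs=1000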

Switched the cache filesystems from XFS to ext4. This reduced sys% and iowait% further and made them even smoother (with or without the aforementioned vm params), but the sys% spikes still returned.
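For completeness, the filesystem switch itself is just the usual reformat-and-remount per cache partition; the device path, label, and mount options below are hypothetical, since the real ones are managed by puppet:

    # Hypothetical device/mountpoint; cache filesystems are rebuilt one at a
    # time with the host depooled.
    umount /srv/sda3
    mkfs.ext4 -L varnish-sda3 /dev/sda3
    mount -o noatime /dev/sda3 /srv/sda3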

Tried various other vm subsystem tuning with parameters like vfs_cache_pressure, swappiness, min_free_kbytes, etc. The only one of these that significantly helped was raising vm.min_free_kbytes. We were already tuning this value (for cache and LVS nodes) to a percentage that worked out to ~256MB on these cache nodes in our existing config. Tried raising it to the rather extreme values of 1GB and 2GB, and these did seem to reduce the rate/probability of the sys% spikes, but they did not eliminate them.
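For scale, the min_free_kbytes experiments as sysctls (the ~256MB figure is what our existing percentage-based tuning worked out to on these nodes; 1GB and 2GB are the extreme values described above):

    # Existing tuning worked out to roughly 256MB reserved on these nodes:
    sysctl -w vm.min_free_kbytes=262144
    # Extreme experimental values, 1GB and then 2GB:
    sysctl -w vm.min_free_kbytes=1048576
    sysctl -w vm.min_free_kbytes=2097152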

I also (somewhat inadvertently) disabled the bigobj file= storage backends (for cached objects >= 100MB) on cp1064 for a while, and this dramatically reduced the remaining problem (to a level that's probably acceptable), which confirms that the disk i/o from those varnish storage backends is a key component of the problem.
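For context, a hypothetical sketch of what the storage arguments on an upload backend look like (names, paths, and sizes invented for illustration; the real ones come from puppet): persistent stevedores hold normal objects, and separate file stevedores hold the >=100MB bigobj class, which is the part that got disabled here:

    # Hypothetical storage portion of the varnishd command line
    # (Varnish 3 "-s name=type,path,size" syntax); other arguments omitted.
    -s main1=persistent,/srv/sda3/varnish.main1,100G \
    -s bigobj1=file,/srv/sda3/varnish.bigobj1,50G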

When I've been able to catch the spikes with "perf" profiling running, the profiles hint that most of the cpu sys% is being wasted spinning on a contended osq_lock in a kswapd thread, and that this is related to cache page eviction. There are also hints in the traces that network i/o is in play here too: a plausible scenario is that a burst of kernel network buffer allocations for writev() of varnish outbound traffic triggers page eviction requests, and those evictions then stall on reclaiming disk cache to make room for the new allocations (which aligns with the positive results from the insanely-high min_free_kbytes values).
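Catching one of these in the act is ordinary system-wide profiling; the exact invocation used isn't recorded here, but something along these lines produces the osq_lock/kswapd attribution:

    # Sample all CPUs with call graphs during a spike window, then look at
    # where the sys time goes (contended osq_lock under kswapd, in this case):
    perf record -a -g -- sleep 30
    perf report --sort comm,symbol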

The kernel change from precise to jessie is 3.2 -> 3.16, and a casual grep of the enormous commit logs from that period reveals several candidate changes to related parts of the vm subsystem. There's some recent lkml thread/commit traffic related to all of these things, e.g.:
http://lkml.iu.edu/hypermail/linux/kernel/1501.3/01348.html
https://lkml.org/lkml/2015/1/26/742
http://www.kernelhub.org/?p=2&msg=672409
... these may provide some pointers on where to look for recently-introduced problematic changes (that first patch in particular could even be our fix)
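For anyone repeating the changelog archaeology, narrowing the v3.2..v3.16 history in a kernel git tree to the relevant vm files and keywords is the quickest way in; the file list below is just a starting point, not exhaustive:

    # Candidate vm-subsystem changes between the precise and jessie kernels:
    git log --oneline v3.2..v3.16 -- mm/vmscan.c mm/page_alloc.c
    # Or search commit messages for the relevant keywords (ORed together):
    git log --oneline v3.2..v3.16 --grep=kswapd --grep=osq_lock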

For the moment, cp1064 is disabled in backend pooling while I'm focused on other things, so we don't suffer 503 spikes while we're not actively investigating this.

Status update: moving to a 3.18-based kernel from debian-experimental (on top of the VM tuning, the ext4 switch, etc.) has gotten this issue under control on cp1064. The vm subsystem changes from v3.16..v3.18 are numerous and non-trivial, so we'll probably opt for stabilizing a 3.18 kernel for the upload caches (as opposed to trying to backport isolated changes). There's a fair amount of related work to be done to package and puppetize it all cleanly.
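For the record, pulling a kernel from experimental onto a test host by hand is roughly the following; the exact package name/version here is an assumption, and the eventual rollout goes through our own apt repo plus puppet rather than this manual route:

    # /etc/apt/sources.list.d/experimental.list  (hypothetical by-hand setup)
    deb http://httpredir.debian.org/debian experimental main

    # Install just the kernel from experimental; package name is an assumption.
    apt-get update
    apt-get install -t experimental linux-image-3.18.0-trunk-amd64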

This is (and many related things are) now resolved, and new automatic installs on jessie are set up with appropriate kernels, VM tuning, filesystem tuning, etc.