
cp* boxes, pagecache issues & trying newer kernels
Closed, Duplicate
Public

Description

As early as November, we identified an issue with the Varnish boxes and the behavior of Linux's mm/kswapd. It was detailed in this report at the end of November:
https://lists.wikimedia.org/mailman/private/ops/2013-November/026473.html
Since then, testing of Linux 3.11 continued but resulted in unstable behavior (boxes locking up at random), and lack of time has kept us from pursuing this avenue further.
In the meantime, per Domas' suggestion, we have deployed a cronjob that runs every minute and does echo 1 > /proc/sys/vm/compact_memory. This has mitigated some of the more immediate issues we were seeing (like the XFS "deadlock detected" issue).
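For illustration, the compaction trigger described above could be wrapped roughly as follows. This is a minimal sketch, not the cronjob that was actually deployed: the helper name trigger_compaction and the writability check are assumptions added here, and only the /proc/sys/vm/compact_memory path and the value 1 come from the description. Writing to that file requires root and a kernel built with compaction support.

```shell
#!/bin/sh
# Sketch: ask the kernel to compact memory.
# COMPACT_FILE is the path named in the task description.
COMPACT_FILE=/proc/sys/vm/compact_memory

# trigger_compaction [FILE]
# Hypothetical helper (not from the original task): writes "1" to FILE,
# defaulting to COMPACT_FILE. Returns non-zero if FILE is not writable,
# e.g. when not running as root or when the kernel lacks compaction.
trigger_compaction() {
    target="${1:-$COMPACT_FILE}"
    if [ ! -w "$target" ]; then
        echo "cannot write to $target" >&2
        return 1
    fi
    echo 1 > "$target"
}
```

Run every minute, this would correspond to an /etc/cron.d entry along the lines of `* * * * * root echo 1 > /proc/sys/vm/compact_memory`; the exact entry used on the cp* boxes is not shown in this task, so treat that line as an assumption.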
Apparently, the cronjob has not fixed all of the effects, though. The attached graphs show cp3012 doing the same "dropping large portions of pagecache" dance today, which resulted in a user-visible 503 spike.
We should explore what effect newer kernels will have, possibly 3.13 now, which is what trusty is being released with and which we will eventually need to move to anyway.

Details

Reference
rt7268

Event Timeline

rtimport raised the priority of this task to Medium. Dec 18 2014, 1:53 AM
rtimport added a project: ops-core.
rtimport set Reference to rt7268.
faidon changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".
faidon changed the edit policy from "WMF-NDA (Project)" to "All Users".
faidon set Security to None.

This is about to become more relevant (or fixed :)) with T86648.

gerritbot subscribed.

Change 187684 had a related patch set uploaded (by BBlack):
disable compact_memory on jessie T83809

https://gerrit.wikimedia.org/r/187684

Patch-For-Review

Change 187684 merged by BBlack:
disable compact_memory on jessie T83809

https://gerrit.wikimedia.org/r/187684