We are currently debugging an issue with our cache_upload varnish backends, recently converted to Varnish 4: T131502.
The bug involves the file storage backend, which we had to migrate to because the persistent storage backend has been deprecated and we found it to be very unstable.
When the problem kicks in, the backend varnishd responds with 503s to all requests that trigger saving an object into file storage. Cache hits do not seem to be affected.
The following varnishlog entries are common to all 503s returned when varnish is in this state:
-- ExpKill LRU_Fail
-- FetchError Could not get storage
We've captured lots of such errors; an example can be found here: https://phabricator.wikimedia.org/P4033
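To pick the affected transactions out of a larger capture, something along these lines might help. It is only a rough sketch: it assumes the capture is plain varnishlog text output in which each transaction starts with a `* << Type >> VXID` header line and each record line starts with one or more dashes followed by the tag. The function name `storage_failures` and the sample data are made up for illustration and are not taken from our real logs.

```python
import re

# Transaction header, e.g. "*   << BeReq    >> 32770" (assumed output layout).
HEADER = re.compile(r"^\*+\s+<<\s*(\w+)\s*>>\s+(\d+)")
# Record line, e.g. "--  FetchError     Could not get storage".
RECORD = re.compile(r"^-+\s+(\w+)\s*(.*)$")


def storage_failures(lines):
    """Return the VXIDs of transactions that logged both
    'ExpKill LRU_Fail' and 'FetchError Could not get storage'."""
    markers = {}    # vxid -> set of markers seen in that transaction
    current = None  # vxid of the transaction being read
    for line in lines:
        header = HEADER.match(line)
        if header:
            current = int(header.group(2))
            markers.setdefault(current, set())
            continue
        record = RECORD.match(line)
        if record is None or current is None:
            continue  # blank line, or a record before any header
        tag, payload = record.group(1), record.group(2).strip()
        if tag == "ExpKill" and payload.startswith("LRU_Fail"):
            markers[current].add("lru_fail")
        elif tag == "FetchError" and payload == "Could not get storage":
            markers[current].add("no_storage")
    return [vxid for vxid, seen in markers.items() if len(seen) == 2]


# Hypothetical sample, hand-written for this sketch (not a real capture).
SAMPLE = """\
*   << BeReq    >> 32770
-   Begin          bereq 32769 fetch
-   ExpKill        LRU_Fail
-   FetchError     Could not get storage
-   End
*   << BeReq    >> 32773
-   Begin          bereq 32772 fetch
-   End
"""

print(storage_failures(SAMPLE.splitlines()))  # [32770]
```

The same function can be fed an open file object for a saved capture, since it only iterates over lines.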
While trying to mitigate the issue, we've tried multiple strategies, including:
- Bumping nuke_limit from 50 to 1000 and lru_interval from 2s to 31s
- Increasing the hit rate
That seemed to help for a while, but the issue happened again today at 10:14 UTC on cp4005 and lasted until 10:25 UTC, when I depooled the host and restarted varnishd.
The ganglia view for that timeframe might be helpful to find varnish metrics correlating with fetch_failed behavior.