Due to multiple issues with Varnish 4, including T144257 and the recurring 503 plateaus, we've decided to downgrade most of cache_upload in ulsfo to Varnish 3. We are going to keep Varnish 4 on cp4005 only and observe the issues there, on a single machine that we can easily depool in case of trouble.
The current status is:
cp4005 - v4 puppet enabled
cp4006 - v3 puppet enabled
cp4007, cp4001[3-4] - v4 puppet disabled, to be downgraded soon
We suspect that the bug(s) encountered while upgrading ulsfo might have been caused by running a mix of Varnish 3 and Varnish 4 through multiple layers of caches (ulsfo -> codfw -> eqiad -> swift). To confirm whether this is the case, we want to test a v4-only stack against real users. The following plan has been devised to achieve that goal:
- Roll back ulsfo to v3-only by downgrading cp4005
- Route ulsfo straight to eqiad, without going through codfw
- Connect codfw directly to swift
- Upgrade codfw to v4
This way, we will have codfw running a v4-only stack connected straight to swift with real traffic hitting v4 directly.
The upgrade ran from 17:35 to 23:23. It went smoothly, without the 503 spikes we observed in ulsfo. Also, the CL:0 issue reported in T144257 does not seem to occur in codfw. For the record, this is the varnishlog command used to check that:
```
varnishlog -c -g request -q 'RespStatus == 200 and RespHeader ~ "Content-Length: 0" and ReqMethod ~ GET and ReqURL !~ "^/from/pybal" and ReqURL !~ wikimedia-monitoring-test and ReqURL !~ "^/check$"' -n frontend
```
That being said, cache_codfw backends haven't started nuking objects yet. It will be interesting to see if anything changes when that happens.
To recap the current state of affairs and recent investigation/experimentation:
- We finished upgrading the remaining DCs to Varnish 4 earlier today, as shown in the SAL entries above. Cache routing is also restored to normal (ulsfo->codfw->eqiad, esams->eqiad).
- We've seen some small 503 spikes (different from the earlier small plateaus) that log as connection failures for varnish<->varnish traffic, and we've deployed what seems to be a mitigating fix: raising the varnish-be->varnish-be connection limit by an order of magnitude in https://gerrit.wikimedia.org/r/#/c/309982/ . This is plausibly because Varnish 4's default streaming mode results in more backend connection parallelism, or because of the pass for all Range reqs, or other such behavioral changes from our previous V3 setup.
- There were two frontend child-process crashes in eqiad, caused by vmod_netmapper not handling NULL input, which can newly occur in V4 for one of several reasons. We've got a VCL workaround applied in https://gerrit.wikimedia.org/r/#/c/310025/ , and a new version of vmod_netmapper with a proper fix is packaged but not yet deployed ( https://gerrit.wikimedia.org/r/#/c/310019/ ). Deployment is held up because upgrading a vmod is itself painful under V4, due to the vmod-upgrade issues related to https://github.com/varnishcache/varnish-cache/issues/2041 .
- Some issues appeared to be related to the VSLP director and health checks, so we tried disabling the varnish<->varnish health-check probes, but that didn't fix anything, and the change was reverted.
- At one point we had ulsfo backends failing in a way that looked related to coalesce concurrency (high waiting-thread counts, etc.), and it seemed likely that our earlier workaround to avoid caching CL:0 + Status:200 responses could be causing it. The CL:0 + Status:200 cache prevention was reverted ( https://gerrit.wikimedia.org/r/#/c/310023/ ), and then later restored ( https://gerrit.wikimedia.org/r/#/c/310155/ ) when we started observing cached bad responses again. The bad objects were banned after the revert.
- A common theme in the causal (lowest-cache-layer) 503 plateaus is logs of LRU_Failed. We've chased that thread a bit: the file storage backend's LRU behavior is apparently nothing like malloc's, and upstream is aware that it may not work correctly at all for some combinations of workload and storage size, with LRU issues showing up once the storage fills. We've tried raising lru_interval and nuke_limit significantly (to 73 and 1000, respectively), but that doesn't seem to have any effect.
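For monitoring the connection-failure spikes mentioned above, the relevant counters can be pulled from varnishstat on a cache host. A sketch, using the Varnish 4 counter names, run against the backend varnishd instance:

```
# Snapshot backend connection-failure and connection-limit counters:
varnishstat -1 -f MAIN.backend_fail -f MAIN.backend_busy

# Or watch them update live in the curses display:
varnishstat -f MAIN.backend_fail -f MAIN.backend_busy
```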
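The exact ban expression used to evict the cached bad responses isn't recorded here, but invalidating already-stored zero-length 200 responses would look roughly like this (hypothetical pattern, issued via varnishadm on each affected cache):

```
# Hypothetical ban: invalidate cached objects stored with Content-Length: 0.
# Bans match cached objects on obj.http.*; adjust the expression as needed.
varnishadm ban 'obj.http.Content-Length == 0'
```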
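For the record, the lru_interval/nuke_limit tuning above amounts to the following on each backend instance (runtime changes via varnishadm last only until restart unless also set in the startup parameters):

```
varnishadm param.set lru_interval 73
varnishadm param.set nuke_limit 1000
```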
Since LRU-related things work completely differently in the deprecated persistent storage backend that we were trying to move away from (towards file), and the old persistent backend in V3 was known to handle our situation well, our best recourse at present seems to be converting all the backends back to persistent. I reverted the storage engine change in https://gerrit.wikimedia.org/r/#/c/310161/ , and I'm in the process of iterating through the cache clusters (starting with eqiad, then codfw) and wiping/restarting backends to put the persistent engine back into place.
Since the above update:
- Tried converting eqiad + codfw to persistent, fighting through some level of 503s in case they were just fallout from the aggressive rollout schedule. In the end, even after both were converted and "stable"-ish in many other respects, the persistent storage kept panicking/crashing the child processes too frequently to be a realistic option. So I reverted back to file storage at both.
- As a first pass at reducing fe->be request rates by increasing the cache hit rate, I progressively dropped the FE cacheable object size limit from 50MB to 8MB and then to 1MB. No negative fallout, possibly mild positive fallout, but only longer-term trends will tell. My intuition based on the varnishstat numbers is that our optimum is an average FE object size of ~128KB or less, but I'm not sure how that corresponds to an upper limit on CL. It may be that we have to drop it further before we see larger improvements.
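One way to sanity-check the ~128KB intuition is to divide bytes outstanding by allocations outstanding in the frontend's malloc (SMA) storage counters; g_alloc counts allocations rather than whole objects, so this is only an approximation. A sketch using illustrative sample values (on a live host, pipe real `varnishstat -1` output through the same awk):

```shell
# Approximate mean cached object size from SMA counters.
# The two input lines below are illustrative sample values, not real data;
# on a cache host you'd use:  varnishstat -1 | awk '...'
printf '%s\n' \
  'SMA.s0.g_alloc            820000          .   Allocations outstanding' \
  'SMA.s0.g_bytes       104857600000          .   Bytes outstanding' |
awk '/g_alloc/ {a=$2} /g_bytes/ {b=$2} END {printf "%.0f KB avg\n", b/a/1024}'
# prints "125 KB avg" for these sample numbers
```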
Other ideas for improving the LRU_Fail situation directly or indirectly: decreasing the file storage size (yuck), decreasing the TTL caps (maybe? low-probability idea), and/or deploying the 2-hit-wonder patch (again on the theory that a higher FE hit rate means less contention in the backends for LRU nuking).
With persistent apparently a bad idea, if we can't get file-storage past the LRU_Fail problem, we don't have many great ideas left for a stable Varnish 4 install here...
Also, after the various restarts above, I re-set (at runtime, with varnishadm) lru_interval to 31 seconds on all cache_upload backends, in the hope that it's still helpful even if it's not a full cure.
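That is, on each backend instance:

```
varnishadm param.set lru_interval 31
# verify the running value:
varnishadm param.show lru_interval
```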