Due to multiple issues with Varnish 4, including T144257 and the recurring 503 plateaus, we've decided to downgrade most of cache_upload in ulsfo to Varnish 3. We are going to keep Varnish 4 on cp4005 only and observe the issues there, on a single machine that we can easily depool in case of trouble.
The current status is:
cp4005 - v4 puppet enabled
cp4006 - v3 puppet enabled
cp4007, cp4001[3-4] - v4 puppet disabled, to be downgraded soon
We suspect that the bug(s) encountered while upgrading ulsfo might have been caused by running a mix of Varnish 3 and Varnish 4 through multiple layers of caches (ulsfo -> codfw -> eqiad -> swift). To confirm whether this is the case, we want to test a v4-only stack against real users. The following plan has been devised to achieve that goal:
- Roll back ulsfo to v3-only by downgrading cp4005
- Route ulsfo straight to eqiad, without going through codfw
- Connect codfw directly to swift
- Upgrade codfw to v4
This way, we will have codfw running a v4-only stack connected straight to swift with real traffic hitting v4 directly.
The upgrade ran from 17:35 to 23:23. It went smoothly, without the 503 spikes we observed in ulsfo. Also, the CL:0 issue reported in T144257 does not seem to occur in codfw. For the record, this is the varnishlog command used to check that:
```
varnishlog -c -g request -q 'RespStatus == 200 and RespHeader ~ "Content-Length: 0" and ReqMethod ~ GET and ReqURL !~ "^/from/pybal" and ReqURL !~ wikimedia-monitoring-test and ReqURL !~ "^/check$"' -n frontend
```
That being said, cache_codfw backends haven't started nuking objects yet. It will be interesting to see if anything changes when that happens.
To recap the current state of affairs and recent investigation/experimentation:
- We finished upgrading the remaining DCs to Varnish 4 earlier today, as shown in the SAL entries above. Cache routing is also restored to normal (ulsfo->codfw->eqiad, esams->eqiad).
- We've seen some small 503 spikes (different from the earlier small plateaus) that log as connection failures for varnish<->varnish traffic, and we've deployed what seems to be a mitigating fix: raising the varnish-be->varnish-be connection limit by an order of magnitude in https://gerrit.wikimedia.org/r/#/c/309982/ . This is plausibly because Varnish 4's default streaming mode results in more backend connection parallelism, or because of the pass for all Range reqs, or other such behavioral changes from our previous V3 setup.
- There were two frontend child-process crashes in eqiad, caused by vmod_netmapper not handling NULL input, which can newly occur in V4 for one of several reasons. We've got a VCL workaround applied in https://gerrit.wikimedia.org/r/#/c/310025/ , and a new version of vmod_netmapper with a proper fix is packaged but not yet deployed ( https://gerrit.wikimedia.org/r/#/c/310019/ ). Deployment is held up because upgrading a vmod is itself painful under V4, due to the vmod-upgrade issues related to https://github.com/varnishcache/varnish-cache/issues/2041 .
- Some issues appeared to be related to the VSLP director and health checks, so we tried disabling the varnish<->varnish health-check probes, but that didn't fix anything, and the change was reverted.
- At one point we had ulsfo backends failing in a way that looked related to coalesce concurrency (high waiting-thread counts, etc.), and it seemed likely that our earlier workaround to avoid caching CL:0 + Status:200 responses could be causing it. The CL:0 + Status:200 cache prevention was reverted ( https://gerrit.wikimedia.org/r/#/c/310023/ ), and then later restored ( https://gerrit.wikimedia.org/r/#/c/310155/ ) when we started observing cached bad responses again. The bad objects were banned after the revert.
- A common theme in the causal (lowest-cache-layer) 503 plateaus is logs of LRU_Failed. We've chased that thread a bit: the file storage backend's LRU behavior is apparently nothing like malloc's, and upstream is aware that it may not work correctly at all for some combinations of workload and storage size, with LRU issues showing up once the storage fills. We've tried raising lru_interval and nuke_limit significantly (to 73 and 1000, respectively), but that doesn't seem to have any effect.
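For monitoring the connection-failure spikes mentioned above, the relevant counters can be pulled from varnishstat on a cache host. A sketch, using the Varnish 4 counter names, run against the backend varnishd instance:

```
# Snapshot backend connection-failure and connection-limit counters:
varnishstat -1 -f MAIN.backend_fail -f MAIN.backend_busy

# Or watch them update live in the curses display:
varnishstat -f MAIN.backend_fail -f MAIN.backend_busy
```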
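The exact ban expression used to evict the cached bad responses isn't recorded here, but invalidating already-stored zero-length 200 responses would look roughly like this (hypothetical pattern, issued via varnishadm on each affected cache):

```
# Hypothetical ban: invalidate cached objects stored with Content-Length: 0.
# Bans match cached objects on obj.http.*; adjust the expression as needed.
varnishadm ban 'obj.http.Content-Length == 0'
```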
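For the record, the lru_interval/nuke_limit tuning above amounts to the following on each backend instance (runtime changes via varnishadm last only until restart unless also set in the startup parameters):

```
varnishadm param.set lru_interval 73
varnishadm param.set nuke_limit 1000
```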
Since LRU-related things work completely differently in the deprecated persistent storage backend that we were trying to move away from (towards file), and the old persistent backend in V3 was known to handle our situation well, our best recourse at present seems to be converting all the backends back to persistent. I reverted the storage engine change in https://gerrit.wikimedia.org/r/#/c/310161/ , and I'm in the process of iterating through the cache clusters (starting with eqiad, then codfw) and wiping/restarting backends to put the persistent engine back into place.
Since the above update:
- Tried converting eqiad + codfw to persistent, fighting through some level of 503s in case they were just fallout from the aggressive rollout schedule. In the end, even after both were converted and "stable"-ish in many other respects, the persistent storage kept panicking/crashing the child processes too frequently to be a realistic option. So I reverted back to file storage at both.
- As a first pass at reducing fe->be request rates by increasing the cache hit rate, I progressively dropped the FE cacheable object size limit from 50MB to 8MB and then to 1MB. No negative fallout, possibly mild positive fallout, but only longer-term trends will tell. My intuition based on the varnishstat numbers is that our optimum is an average FE object size of ~128KB or less, but I'm not sure how that corresponds to an upper limit on CL. It may be that we have to drop it further before we see larger improvements.
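One way to sanity-check the ~128KB intuition is to divide bytes outstanding by allocations outstanding in the frontend's malloc (SMA) storage counters; g_alloc counts allocations rather than whole objects, so this is only an approximation. A sketch using illustrative sample values (on a live host, pipe real `varnishstat -1` output through the same awk):

```shell
# Approximate mean cached object size from SMA counters.
# The two input lines below are illustrative sample values, not real data;
# on a cache host you'd use:  varnishstat -1 | awk '...'
printf '%s\n' \
  'SMA.s0.g_alloc            820000          .   Allocations outstanding' \
  'SMA.s0.g_bytes       104857600000          .   Bytes outstanding' |
awk '/g_alloc/ {a=$2} /g_bytes/ {b=$2} END {printf "%.0f KB avg\n", b/a/1024}'
# prints "125 KB avg" for these sample numbers
```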
Other ideas for improving the LRU_Fail situation directly or indirectly: decreasing the file storage size (yuck), decreasing the TTL caps (maybe? low-probability idea), and/or deploying the 2-hit-wonder patch (again on the theory that a higher FE hit rate means less contention in the backends for LRU nuking).
With persistent apparently a bad idea, if we can't get file-storage past the LRU_Fail problem, we don't have many great ideas left for a stable Varnish 4 install here...
Also, after the various restarts above, I re-set (at runtime, with varnishadm) lru_interval to 31 seconds on all cache_upload backends, in the hope that it's still helpful even if it's not a full cure.
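That is, on each backend instance:

```
varnishadm param.set lru_interval 31
# verify the running value:
varnishadm param.show lru_interval
```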