Due to multiple issues with Varnish 4, including T144257 and the recurring 503 plateaus, we've decided to downgrade most of cache_upload in ulsfo to Varnish 3. We are going to keep Varnish 4 on cp4005 only and observe the issues there, on a single machine that we can easily depool in case of trouble.
The current status is:
cp4005 - v4 puppet enabled
cp4006 - v3 puppet enabled
cp4007, cp4001[3-4] - v4 puppet disabled, to be downgraded soon
We suspect that the bug(s) encountered while upgrading ulsfo might have been caused by running a mix of Varnish 3 and Varnish 4 through multiple layers of caches (ulsfo -> codfw -> eqiad -> swift). To confirm whether this is the case, we want to test a v4-only stack against real users. The following plan has been devised to achieve that goal:
- Roll back ulsfo to v3-only by downgrading cp4005
- Route ulsfo straight to eqiad, without going through codfw
- Connect codfw directly to swift
- Upgrade codfw to v4
This way, we will have codfw running a v4-only stack connected straight to swift with real traffic hitting v4 directly.
The upgrade ran from 17:35 to 23:23. It went smoothly, without the 503 spikes we observed in ulsfo. Also, the CL:0 issue reported in T144257 does not seem to occur in codfw. For the record, this is the varnishlog command used to check that:
```
varnishlog -c -g request -q 'RespStatus == 200 and RespHeader ~ "Content-Length: 0" and ReqMethod ~ GET and ReqURL !~ "^/from/pybal" and ReqURL !~ wikimedia-monitoring-test and ReqURL !~ "^/check$"' -n frontend
```
That being said, cache_codfw backends haven't started nuking objects yet. It will be interesting to see if anything changes when that happens.
To recap the current state of affairs and recent investigation/experimentation:
- We finished upgrading the remaining DCs to Varnish 4 earlier today, as shown in the SAL entries above. Cache routing is also restored to normal (ulsfo->codfw->eqiad, esams->eqiad).
- We've seen some small 503 spikes (different from the earlier small plateaus) that log as connection failures for varnish<->varnish traffic, and we've deployed what seems to be a mitigating fix: raising the varnish-be->varnish-be connection limit by an order of magnitude in https://gerrit.wikimedia.org/r/#/c/309982/ . This is plausibly because Varnish 4's default streaming mode results in more backend connection parallelism, or because of the pass for all Range reqs, or other such behavioral changes from our previous V3 setup.
- There were two frontend child-process crashes in eqiad, caused by vmod_netmapper not handling NULL input, which can newly occur in V4 for one of several reasons. We've got a VCL workaround applied in https://gerrit.wikimedia.org/r/#/c/310025/ , and a new version of vmod_netmapper with a proper fix is packaged but not yet deployed ( https://gerrit.wikimedia.org/r/#/c/310019/ ). Deployment is held up because upgrading a vmod is itself painful under V4, due to the vmod-upgrade issues related to https://github.com/varnishcache/varnish-cache/issues/2041 .
- Some issues appeared to be related to the VSLP director and health checks, so we tried disabling the varnish<->varnish health-check probes, but that didn't fix anything, and the change was reverted.
- At one point we had ulsfo backends failing in a way that looked related to coalesce concurrency (high waiting-thread counts, etc.), and it seemed likely that our earlier workaround to avoid caching CL:0 + Status:200 responses could be causing it. The CL:0 + Status:200 cache prevention was reverted ( https://gerrit.wikimedia.org/r/#/c/310023/ ), and then later restored ( https://gerrit.wikimedia.org/r/#/c/310155/ ) when we started observing cached bad responses again. The bad objects were banned after the revert.
- A common theme in the causal (lowest-cache-layer) 503 plateaus is logs of LRU_Failed. We've chased that thread a bit: the file storage backend's LRU behavior is apparently nothing like malloc's, and upstream is aware that it may not work correctly at all for some combinations of workload and storage size, with LRU issues showing up once the storage fills. We've tried raising lru_interval and nuke_limit significantly (to 73 and 1000, respectively), but that doesn't seem to have any effect.
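For monitoring the connection-failure spikes mentioned above, the relevant counters can be pulled from varnishstat on a cache host. A sketch, using the Varnish 4 counter names, run against the backend varnishd instance:

```
# Snapshot backend connection-failure and connection-limit counters:
varnishstat -1 -f MAIN.backend_fail -f MAIN.backend_busy

# Or watch them update live in the curses display:
varnishstat -f MAIN.backend_fail -f MAIN.backend_busy
```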
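The exact ban expression used to evict the cached bad responses isn't recorded here, but invalidating already-stored zero-length 200 responses would look roughly like this (hypothetical pattern, issued via varnishadm on each affected cache):

```
# Hypothetical ban: invalidate cached objects stored with Content-Length: 0.
# Bans match cached objects on obj.http.*; adjust the expression as needed.
varnishadm ban 'obj.http.Content-Length == 0'
```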
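For the record, the lru_interval/nuke_limit tuning above amounts to the following on each backend instance (runtime changes via varnishadm last only until restart unless also set in the startup parameters):

```
varnishadm param.set lru_interval 73
varnishadm param.set nuke_limit 1000
```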
Since LRU-related things work completely differently in the deprecated persistent storage backend that we were trying to move away from (towards file), and the old persistent backend in V3 was known to handle our situation well, our best recourse at present seems to be converting all the backends back to persistent. I reverted the storage engine change in https://gerrit.wikimedia.org/r/#/c/310161/ , and I'm in the process of iterating through the cache clusters (starting with eqiad, then codfw) and wiping/restarting backends to put the persistent engine back into place.
Since the above update:
- Tried converting eqiad + codfw to persistent, fighting through some level of 503s in case they were just fallout from the aggressive rollout schedule. In the end, even after both were converted and "stable"-ish in many other respects, the persistent storage kept panicking/crashing the child processes too frequently to be a realistic option. So I reverted back to file storage at both.
- As a first pass at reducing fe->be request rates by increasing the cache hit rate, I progressively dropped the FE cacheable object size limit from 50MB to 8MB and then to 1MB. No negative fallout, possibly mild positive fallout, but only longer-term trends will tell. My intuition based on the varnishstat numbers is that our optimum is an average FE object size of ~128KB or less, but I'm not sure how that corresponds to an upper limit on CL. It may be that we have to drop it further before we see larger improvements.
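One way to sanity-check the ~128KB intuition is to divide bytes outstanding by allocations outstanding in the frontend's malloc (SMA) storage counters; g_alloc counts allocations rather than whole objects, so this is only an approximation. A sketch using illustrative sample values (on a live host, pipe real `varnishstat -1` output through the same awk):

```shell
# Approximate mean cached object size from SMA counters.
# The two input lines below are illustrative sample values, not real data;
# on a cache host you'd use:  varnishstat -1 | awk '...'
printf '%s\n' \
  'SMA.s0.g_alloc            820000          .   Allocations outstanding' \
  'SMA.s0.g_bytes       104857600000          .   Bytes outstanding' |
awk '/g_alloc/ {a=$2} /g_bytes/ {b=$2} END {printf "%.0f KB avg\n", b/a/1024}'
# prints "125 KB avg" for these sample numbers
```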
Other ideas for improving the LRU_Fail situation directly or indirectly: decreasing the file storage size (yuck), decreasing the TTL caps (maybe? low-probability idea), and/or deploying the 2-hit-wonder patch (again on the theory that a higher FE hit rate means less contention in the backends for LRU nuking).
With persistent apparently a bad idea, if we can't get file-storage past the LRU_Fail problem, we don't have many great ideas left for a stable Varnish 4 install here...
Also, after the various restarts above, I re-set (at runtime, with varnishadm) lru_interval to 31 seconds on all cache_upload backends, in the hope that it's still helpful even if it's not a full cure.
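That is, on each backend instance:

```
varnishadm param.set lru_interval 31
# verify the running value:
varnishadm param.show lru_interval
```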