Convert upload cluster to Varnish 4
Closed, Resolved, Public

Details

Related Gerrit Patches:
operations/puppet (production): cache_upload: remove varnish3 VCL compat
operations/puppet (production): cache_upload fe: do not set do_stream=true on Varnish 4
operations/puppet (production): cache_upload: do not set do_stream=true on Varnish 4
operations/puppet (production): varnish: bump nuke_limit
operations/software/varnish/libvmod-netmapper (debian): New upstream release
operations/puppet (production): cache_upload: restore normal inter-DC routing
operations/puppet (production): Upgrade upload eqiad to Varnish 4
operations/puppet (production): Upgrade upload esams to Varnish 4
operations/puppet (production): cache_upload esams: route to codfw
operations/dns (master): Revert "depool upload in ulsfo"
operations/puppet (production): Upgrade upload ulsfo to Varnish 4
operations/dns (master): depool upload in ulsfo
operations/puppet (production): Upgrade upload codfw to Varnish 4
operations/puppet (production): cache_upload: route codfw straight to applayer
operations/puppet (production): cache_upload: route around codfw in cache::route_table
operations/puppet (production): Revert "Upgrade cp4005 (ulsfo cache_upload) to Varnish 4"
operations/puppet (production): Revert "Upgrade upload ulsfo to Varnish 4"
operations/puppet (production): Upgrade upload ulsfo to Varnish 4
operations/puppet (production): Upgrade cp4007 (ulsfo cache_upload) to Varnish 4
operations/puppet (production): Upgrade cp4006 (ulsfo cache_upload) to Varnish 4
operations/puppet (production): cache_upload varnishtest: pass Range requests
operations/puppet (production): Upgrade cp4005 (ulsfo cache_upload) to Varnish 4
operations/puppet (production): cache_upload: persistent storage backend naming on v4
operations/puppet (production): upload VCL: X-Range hack for V4
operations/puppet (production): VCL: add call for cluster/layer vcl_backend_fetch for V4
operations/puppet (production): cache_upload VCL forward port to Varnish 4
operations/puppet (production): upload VCL: prep for easier V4 migration

Event Timeline

Change 307247 had a related patch set uploaded (by Ema):
Upgrade cp4007 (ulsfo cache_upload) to Varnish 4

https://gerrit.wikimedia.org/r/307247

Change 307247 merged by Ema:
Upgrade cp4007 (ulsfo cache_upload) to Varnish 4

https://gerrit.wikimedia.org/r/307247

Change 307282 had a related patch set uploaded (by Ema):
Upgrade upload ulsfo to Varnish 4

https://gerrit.wikimedia.org/r/307282

Change 307282 merged by Ema:
Upgrade upload ulsfo to Varnish 4

https://gerrit.wikimedia.org/r/307282

Mentioned in SAL [2016-08-29T12:42:27Z] <ema> upgrading cp4013 to Varnish 4 (T131502)

Mentioned in SAL [2016-08-29T13:36:37Z] <ema> upgrading cp4014 to Varnish 4 (T131502)

Mentioned in SAL [2016-08-29T14:20:26Z] <ema> upgrading cp4015 to Varnish 4 (T131502)

Change 307964 had a related patch set uploaded (by Ema):
Revert "Upgrade upload ulsfo to Varnish 4"

https://gerrit.wikimedia.org/r/307964

Change 307964 merged by Ema:
Revert "Upgrade upload ulsfo to Varnish 4"

https://gerrit.wikimedia.org/r/307964

Mentioned in SAL [2016-09-01T16:09:59Z] <ema> downgrading cp4006 to varnish 3 T131502

ema added a comment. Sep 2 2016, 9:38 AM

Due to multiple issues with Varnish 4, including T144257 and recurring 503 plateaus, we've decided to downgrade most of cache_upload in ulsfo to Varnish 3. We are going to keep Varnish 4 on cp4005 only and try to observe the issues there, on a single machine that we can easily depool in case of trouble.

The current status is:

cp4005 - v4 puppet enabled
cp4006 - v3 puppet enabled
cp4007, cp401[3-5] - v4 puppet disabled, to be downgraded soon

Mentioned in SAL [2016-09-02T14:27:56Z] <ema> downgrading cp4007 to varnish 3 T131502

Mentioned in SAL [2016-09-02T15:33:09Z] <ema> downgrading cp4013 to varnish 3 T131502

Mentioned in SAL [2016-09-02T16:10:40Z] <ema> downgrading cp4014 to varnish 3 T131502

Mentioned in SAL [2016-09-02T16:48:59Z] <ema> downgrading cp4015 to varnish 3 T131502

ema added a comment. Edited Sep 5 2016, 12:33 PM

We suspect that the bug(s) encountered while upgrading ulsfo might have been caused by running a mix of Varnish 3 and Varnish 4 through multiple layers of caches (ulsfo -> codfw -> eqiad -> swift). To confirm whether this is the case, we want to test a v4-only stack against real users. The following plan has been devised to achieve that goal:

  • Rollback ulsfo to v3-only by downgrading cp4005
  • Route ulsfo straight to eqiad, without going through codfw
  • Connect codfw directly to swift
  • Upgrade codfw to v4

This way, we will have codfw running a v4-only stack connected straight to swift with real traffic hitting v4 directly.
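The routing change in the plan above can be sketched as a small next-hop table. This is only an illustrative model: the real configuration lives in puppet (cache::route_table), whose format is not shown in this task, and the names `resolve_path`, `normal_routes`, `test_routes` and the `applayer` placeholder (standing in for swift) are assumptions made for the example.

```python
# Hypothetical model of inter-DC cache routing for cache_upload.
# "applayer" stands in for swift; the dict layout is an assumption,
# not the format of the real cache::route_table in puppet.
APPLAYER = "applayer"


def resolve_path(dc, route_table):
    """Follow next hops from `dc` until the application layer is reached."""
    path = [dc]
    seen = {dc}
    while path[-1] != APPLAYER:
        next_hop = route_table[path[-1]]
        if next_hop in seen:
            raise ValueError(f"routing loop at {next_hop}")
        seen.add(next_hop)
        path.append(next_hop)
    return path


# Normal routing: ulsfo -> codfw -> eqiad -> swift, esams -> eqiad.
normal_routes = {
    "ulsfo": "codfw",
    "codfw": "eqiad",
    "esams": "eqiad",
    "eqiad": APPLAYER,
}

# Test routing from the plan: ulsfo skips codfw, codfw talks to swift directly.
test_routes = dict(normal_routes, ulsfo="eqiad", codfw=APPLAYER)

for dc in ("ulsfo", "codfw"):
    print(dc, "normal:", " -> ".join(resolve_path(dc, normal_routes)))
    print(dc, "test:  ", " -> ".join(resolve_path(dc, test_routes)))
```

With the test table, requests entering codfw pass through Varnish 4 only, and ulsfo (still on Varnish 3) never traverses a Varnish 4 layer, which is the isolation the plan is after.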

Change 308560 had a related patch set uploaded (by Ema):
Revert "Upgrade cp4005 (ulsfo cache_upload) to Varnish 4"

https://gerrit.wikimedia.org/r/308560

Change 308560 merged by Ema:
Revert "Upgrade cp4005 (ulsfo cache_upload) to Varnish 4"

https://gerrit.wikimedia.org/r/308560

Mentioned in SAL [2016-09-05T12:41:42Z] <ema> downgrading cp4005 to varnish 3 T131502

We suspect that the bug(s) encountered while upgrading ulsfo might have been caused by running a mix of Varnish 3 and Varnish 4 through multiple layers of caches (ulsfo -> codfw -> eqiad -> swift). To confirm whether this is the case, we want to test a v4-only stack against real users. The following plan has been devised to achieve that goal:

  • Rollback ulsfo to v3-only by downgrading cp4005
  • Route ulsfo straight to eqiad, without going through codfw
  • Connect codfw directly to swift

Re: replication and swift: since https://gerrit.wikimedia.org/r/#/c/293272/ mw is writing to both codfw and swift synchronously, with swiftrepl doing the catch-up on transient errors. Therefore having codfw cache_upload talk to swift codfw should work.

Change 308582 had a related patch set uploaded (by Ema):
cache_upload: route around codfw in cache::route_table

https://gerrit.wikimedia.org/r/308582

Change 308582 merged by Ema:
cache_upload: route around codfw in cache::route_table

https://gerrit.wikimedia.org/r/308582

Change 308588 had a related patch set uploaded (by Ema):
cache_upload: route codfw straight to applayer

https://gerrit.wikimedia.org/r/308588

Change 308588 merged by Ema:
cache_upload: route codfw straight to applayer

https://gerrit.wikimedia.org/r/308588

Change 308593 had a related patch set uploaded (by Ema):
Upgrade upload codfw to Varnish 4

https://gerrit.wikimedia.org/r/308593

Change 308593 merged by Ema:
Upgrade upload codfw to Varnish 4

https://gerrit.wikimedia.org/r/308593

Mentioned in SAL [2016-09-05T17:35:05Z] <ema> upgrading cp2022 to varnish 4 T131502

Mentioned in SAL [2016-09-05T18:31:23Z] <ema> upgrading cp2026 to varnish 4 T131502

Mentioned in SAL [2016-09-05T19:22:03Z] <ema> upgrading cp2024 to varnish 4 T131502

Mentioned in SAL [2016-09-05T19:45:39Z] <ema> upgrading cp2020 to varnish 4 T131502

Mentioned in SAL [2016-09-05T20:41:21Z] <ema> upgrading cp2017 to varnish 4 T131502

Mentioned in SAL [2016-09-05T21:28:39Z] <ema> upgrading cp2014 to varnish 4 T131502

Mentioned in SAL [2016-09-05T21:59:32Z] <ema> upgrading cp2011 to varnish 4 T131502

Mentioned in SAL [2016-09-05T22:24:12Z] <ema> upgrading cp2008 to varnish 4 T131502

Mentioned in SAL [2016-09-05T22:49:57Z] <ema> upgrading cp2005 to varnish 4 T131502

Mentioned in SAL [2016-09-05T23:17:43Z] <ema> upgrading cp2002 to varnish 4 T131502

ema added a comment. Sep 6 2016, 10:44 AM

We suspect that the bug(s) encountered while upgrading ulsfo might have been caused by running a mix of Varnish 3 and Varnish 4 through multiple layers of caches (ulsfo -> codfw -> eqiad -> swift). To confirm whether this is the case, we want to test a v4-only stack against real users.

The upgrade lasted between 17:35 and 23:23. It went smoothly, without the 503 spikes we observed in ulsfo. Also, the CL:0 issue reported in T144257 does not seem to occur in codfw. This is the varnishlog command used to check that, for the record:

varnishlog -c -g request -q 'RespStatus == 200 and RespHeader ~ "Content-Length: 0" and ReqMethod ~ GET and ReqURL !~ "^/from/pybal" and ReqURL !~ wikimedia-monitoring-test and ReqURL !~ "^/check$"' -n frontend

That being said, cache_codfw backends haven't started nuking objects yet. It will be interesting to see if anything changes when that happens.
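For readers less familiar with VSL queries, the condition in the varnishlog command above can be restated as a plain predicate. The sketch below is an approximation for illustration only: the `txn` dictionary layout and the function name `matches_cl0_query` are assumptions, and the sample URL is made up.

```python
import re


def matches_cl0_query(txn):
    """Approximate the -q expression for one client transaction.

    `txn` is a hypothetical dict with keys: status (int), method (str),
    url (str), resp_headers (list of "Name: value" strings).
    """
    # RespHeader ~ "Content-Length: 0" (unanchored regex match on any header)
    has_cl0 = any(re.search(r"Content-Length: 0", h) for h in txn["resp_headers"])
    # The three ReqURL !~ clauses exclude health-check and monitoring traffic
    excluded = (
        re.search(r"^/from/pybal", txn["url"]) is not None
        or re.search(r"wikimedia-monitoring-test", txn["url"]) is not None
        or re.search(r"^/check$", txn["url"]) is not None
    )
    return (
        txn["status"] == 200
        and has_cl0
        and re.search(r"GET", txn["method"]) is not None
        and not excluded
    )


# Made-up example of the kind of response the query is hunting for:
# a 200 to a GET with an empty body.
suspect = {
    "status": 200,
    "method": "GET",
    "url": "/wikipedia/commons/a/ab/Example.jpg",
    "resp_headers": ["Content-Type: image/jpeg", "Content-Length: 0"],
}
print(matches_cl0_query(suspect))
```

Any transaction for which the predicate holds would be a candidate instance of the CL:0 issue from T144257; no matches in codfw is what the comment above reports.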

Change 308967 had a related patch set uploaded (by Ema):
depool upload in ulsfo

https://gerrit.wikimedia.org/r/308967

Change 308967 merged by Ema:
depool upload in ulsfo

https://gerrit.wikimedia.org/r/308967

Mentioned in SAL [2016-09-08T13:28:10Z] <ema> upgrading cache_upload ulsfo to varnish 4, dns depooled T131502

Change 309310 had a related patch set uploaded (by Ema):
Upgrade upload ulsfo to Varnish 4

https://gerrit.wikimedia.org/r/309310

ema added a comment. Sep 8 2016, 1:35 PM

codfw is running fine with v4 routed straight to the applayer. We're going to upgrade ulsfo back to v4 routed to codfw to test v4<->v4 behavior.

Change 309310 merged by Ema:
Upgrade upload ulsfo to Varnish 4

https://gerrit.wikimedia.org/r/309310

Change 309383 had a related patch set uploaded (by BBlack):
Revert "depool upload in ulsfo"

https://gerrit.wikimedia.org/r/309383

Change 309383 merged by BBlack:
Revert "depool upload in ulsfo"

https://gerrit.wikimedia.org/r/309383

Change 309928 had a related patch set uploaded (by Ema):
cache_upload esams: route to codfw

https://gerrit.wikimedia.org/r/309928

Change 309928 merged by Ema:
cache_upload esams: route to codfw

https://gerrit.wikimedia.org/r/309928

Change 309930 had a related patch set uploaded (by Ema):
Upgrade upload esams to Varnish 4

https://gerrit.wikimedia.org/r/309930

Change 309930 merged by Ema:
Upgrade upload esams to Varnish 4

https://gerrit.wikimedia.org/r/309930

Mentioned in SAL [2016-09-12T00:51:20Z] <ema> upgrade cp3034 to varnish 4 T131502

Mentioned in SAL [2016-09-12T01:28:07Z] <ema> upgrade cp3035 to varnish 4 T131502

Mentioned in SAL [2016-09-12T02:06:23Z] <ema> upgrade cp3036 to varnish 4 T131502

Mentioned in SAL [2016-09-12T02:34:44Z] <bblack> upgrade cp3037 to varnish 4 T131502

Mentioned in SAL [2016-09-12T02:49:19Z] <bblack> upgrade cp3038 to varnish 4 T131502

Mentioned in SAL [2016-09-12T03:05:03Z] <bblack> upgrade cp3039 to varnish 4 T131502

Mentioned in SAL [2016-09-12T03:21:33Z] <bblack> upgrade cp3044 to varnish 4 T131502

Mentioned in SAL [2016-09-12T03:36:47Z] <bblack> upgrade cp3045 to varnish 4 T131502

Mentioned in SAL [2016-09-12T03:51:17Z] <bblack> upgrade cp3046 to varnish 4 T131502

Mentioned in SAL [2016-09-12T04:06:57Z] <bblack> upgrade cp3047 to varnish 4 T131502

Mentioned in SAL [2016-09-12T04:20:12Z] <bblack> upgrade cp3048 to varnish 4 T131502

Mentioned in SAL [2016-09-12T04:34:50Z] <bblack> upgrade cp3049 to varnish 4 T131502

Change 309964 had a related patch set uploaded (by Ema):
Upgrade upload eqiad to Varnish 4

https://gerrit.wikimedia.org/r/309964

Change 309964 merged by Ema:
Upgrade upload eqiad to Varnish 4

https://gerrit.wikimedia.org/r/309964

Mentioned in SAL [2016-09-12T10:48:05Z] <ema> upgrade cp1048 to varnish 4 T131502

Mentioned in SAL [2016-09-12T11:03:26Z] <ema> upgrade cp1049 to varnish 4 T131502

Mentioned in SAL [2016-09-12T11:17:52Z] <ema> upgrade cp1050 to varnish 4 T131502

Mentioned in SAL [2016-09-12T11:32:51Z] <ema> upgrade cp1062 to varnish 4 T131502

Mentioned in SAL [2016-09-12T11:47:44Z] <ema> upgrade cp1063 to varnish 4 T131502

Mentioned in SAL [2016-09-12T12:00:15Z] <ema> upgrade cp1064 to varnish 4 T131502

Mentioned in SAL [2016-09-12T12:11:45Z] <ema> upgrade cp1071 to varnish 4 T131502

Mentioned in SAL [2016-09-12T12:30:42Z] <ema> upgrade cp1072 to varnish 4 T131502

Mentioned in SAL [2016-09-12T12:47:34Z] <ema> upgrade cp1073 to varnish 4 T131502

Mentioned in SAL [2016-09-12T13:04:56Z] <ema> upgrade cp1074 to varnish 4 T131502

Mentioned in SAL [2016-09-12T13:20:58Z] <ema> upgrade cp1099 to varnish 4 T131502

Change 309993 had a related patch set uploaded (by BBlack):
cache_upload: restore normal inter-DC routing

https://gerrit.wikimedia.org/r/309993

Change 309993 merged by BBlack:
cache_upload: restore normal inter-DC routing

https://gerrit.wikimedia.org/r/309993

BBlack added a comment. Edited Sep 12 2016, 11:33 PM

To recap the current state of affairs and recent investigation/experimentation:

  1. We finished upgrading the remaining DCs to Varnish4 earlier today, as shown in the SAL entries above. Cache routing is also restored to normal (ulsfo->codfw->eqiad, esams->eqiad).
  2. We've seen some small 503 spikes (different than the earlier small plateaus) that log as connection failures for varnish<->varnish, and we've deployed what seems to be a mitigating fix: raising the connection limit for varnish-be->varnish-be by an order of magnitude in https://gerrit.wikimedia.org/r/#/c/309982/ . This is plausibly because of Varnish 4's default streaming mode resulting in more backend connection parallelism, or the pass for all Range reqs, or other such behavioral changes from our previous V3 setup.
  3. There were 2x frontend child process crashes in eqiad, and it's because netmapper doesn't handle NULL input, which is a new thing in V4 for one of several reasons. We've got a VCL workaround applied in https://gerrit.wikimedia.org/r/#/c/310025/ , and a new version of vmod_netmapper with a proper fix is packaged but undeployed ( https://gerrit.wikimedia.org/r/#/c/310019/ ). It's not deployed yet because that itself is painful under V4 due to the vmod-upgrade issues related to https://github.com/varnishcache/varnish-cache/issues/2041
  4. Some issues appeared to be related to the VSLP director and health-checks, so we tried disabling varnish<->varnish healthcheck probes, but that didn't fix anything, so that was reverted.
  5. At one point we had ulsfo backends failing in a way that looked related to coalesce concurrency (high waiting threads, etc), and it seemed likely that our previous workarounds to avoid caching of CL:0+Status:200 responses could be causing that. The CL:0+Status:200 cache prevention was reverted ( https://gerrit.wikimedia.org/r/#/c/310023/ ), and then later unreverted ( https://gerrit.wikimedia.org/r/#/c/310155/ ) when we started observing cached bad responses again. The bad ones were banned post-revert.
  6. A common theme in the causal (lowest-cache-layer) 503 plateaus tends to be logs of LRU_Failed. We've chased that thread a bit and found that it's likely that the file storage backend works nothing like malloc, and apparently upstream is aware that it may not work right at all for some combinations of workload and storage size, and that LRU issues are related once the storage gets filled up. We've tried raising lru_interval and nuke_limit significantly (to 73 and 1000, respectively), but that doesn't seem to have any effect.
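As background for item 6, nuke_limit caps how many objects Varnish will try to evict to make room for one new object body. The toy model below only illustrates that idea under simplifying assumptions (a flat byte budget, strict LRU order, a hypothetical `allocate` function); it is not how Varnish's file storage actually works, and it does not reproduce the behaviour seen here, where raising the limit had no visible effect.

```python
from collections import OrderedDict


def allocate(store, capacity, key, size, nuke_limit):
    """Toy model: evict least-recently-used objects, at most `nuke_limit`
    of them, until `size` bytes fit within `capacity`.

    Returns False if space could not be made (loosely analogous to an
    LRU_Failed event). Objects evicted before giving up stay evicted.
    """
    used = sum(store.values())
    nuked = 0
    while capacity - used < size:
        if nuked >= nuke_limit or not store:
            return False
        _, freed = store.popitem(last=False)  # oldest entry first
        used -= freed
        nuked += 1
    store[key] = size
    return True


def full_store():
    """A store filled to its 100-byte capacity with ten 10-byte objects."""
    return OrderedDict((f"obj{i}", 10) for i in range(10))


# Making room for a 50-byte object requires five evictions.
print(allocate(full_store(), 100, "big", 50, nuke_limit=2))
print(allocate(full_store(), 100, "big", 50, nuke_limit=1000))
```

In this model a low limit fails purely because too few evictions are allowed. That raising the real parameter did not help suggests the production failures have a cause the model does not capture, which is consistent with the suspicion about the file storage backend.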

Since LRU-related things work completely differently in the deprecated persistent storage backend that we were trying to move away from towards file, and the old persistent backend in V3 was known to handle our situation well, it seems like our best recourse at present is to begin converting all the backends back to persistent. I reverted the storage engine change in https://gerrit.wikimedia.org/r/#/c/310161/ , and I'm in the process of iterating through the cache clusters (start at eqiad then codfw) and wiping/restarting backends to put the persistent engine back into place.

Since the above update:

  1. Tried converting eqiad + codfw to persistent, and fighting through a bit on levels of 503s in case they were just from the aggressive rollout schedule. In the end, even after both were converted and "stable"-ish in many other respects, the persistent storage kept panicking/crashing the child processes too quickly to be a realistic option. So I reverted back to file storage at both.
  2. As a first pass at reducing fe->be request rates by increasing cache hit%, I progressively dropped the FE cacheable object size limit from 50MB to 8MB and then to 1MB. No negative fallout, possibly mild positive fallout, but only longer-term trends will tell. My intuition based on the varnishstat numbers is that our optimum is an average FE object size of ~128KB or less, but I'm not sure how that translates into an upper limit on CL. It may be that we have to drop it further before we see larger improvements.
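The effect of the frontend size cap in item 2 can be sketched with a simple admission rule. Everything here is illustrative: `admit_to_frontend` and `admitted_sizes` are hypothetical names, the `<=` boundary is an assumption, and the sample object sizes are made up rather than taken from varnishstat.

```python
MB = 1024 * 1024
KB = 1024


def admit_to_frontend(content_length, size_limit):
    """Hypothetical rule: cache in the frontend only if the body fits the cap."""
    return content_length <= size_limit


def admitted_sizes(sizes, size_limit):
    """Return the subset of object sizes the frontend would cache."""
    return [s for s in sizes if admit_to_frontend(s, size_limit)]


# Made-up mix of object sizes: thumbnails up to large originals.
sample_sizes = [20 * KB, 300 * KB, 4 * MB, 30 * MB]

for limit in (50 * MB, 8 * MB, 1 * MB):
    kept = admitted_sizes(sample_sizes, limit)
    avg_kb = sum(kept) // len(kept) // KB
    print(f"{limit // MB}MB cap: {len(kept)} objects admitted, average {avg_kb}KB")
```

Lowering the cap shrinks the average size of what the frontend holds, so more objects fit in the same memory. Whether that raises the hit rate enough to relieve the backends is the open question the comment leaves to longer-term trends.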

Other ideas for improving the LRU_Fail situation directly or indirectly: decreasing the file storage size (yuck), decreasing the TTL caps (maybe? low-probability idea), and/or deploying the 2-hit-wonder patch (again on the theory that great FE hitrate == less contention in the backends for LRU nuking).

With persistent apparently a bad idea, if we can't get file-storage past the LRU_Fail problem, we don't have many great ideas left for a stable Varnish 4 install here...

Also, after the various restarts above, I re-set (at runtime with varnishadm) all cache_upload backends to 31 seconds for lru_interval in the hope that that's still helpful, even if it's not a full cure.

Change 310227 had a related patch set uploaded (by Ema):
New upstream release

https://gerrit.wikimedia.org/r/310227

Change 310227 merged by Ema:
New upstream release

https://gerrit.wikimedia.org/r/310227

Change 310266 had a related patch set uploaded (by Ema):
varnish: bump nuke_limit

https://gerrit.wikimedia.org/r/310266

Change 310266 merged by Ema:
varnish: bump nuke_limit

https://gerrit.wikimedia.org/r/310266

Change 310551 had a related patch set uploaded (by Ema):
cache_upload: do not set do_stream=true on Varnish 4

https://gerrit.wikimedia.org/r/310551

Change 310551 merged by Ema:
cache_upload: do not set do_stream=true on Varnish 4

https://gerrit.wikimedia.org/r/310551

Change 310767 had a related patch set uploaded (by Ema):
cache_upload fe: do not set do_stream=true on Varnish 4

https://gerrit.wikimedia.org/r/310767

Change 310767 merged by Ema:
cache_upload fe: do not set do_stream=true on Varnish 4

https://gerrit.wikimedia.org/r/310767

ema moved this task from Up Next to Done on the Traffic board. Sep 22 2016, 9:29 AM
BBlack moved this task from Done to Triage on the Traffic board. Sep 30 2016, 1:18 PM
ema moved this task from Triage to Varnish v4 on the Traffic board. Sep 30 2016, 2:43 PM

Change 314658 had a related patch set uploaded (by Ema):
cache_upload: remove varnish3 VCL compat

https://gerrit.wikimedia.org/r/314658

Change 314658 merged by BBlack:
cache_upload: remove varnish3 VCL compat

https://gerrit.wikimedia.org/r/314658

BBlack closed this task as Resolved. Oct 10 2016, 1:55 PM