Page MenuHomePhabricator

cache_misc's misc_fetch_large_objects has issues
Closed, ResolvedPublic

Description

Just taking down notes for now, will investigate deeper later:

So @Smalyshev figured out that TE:chunked isn't actually a problem for our varnish3, it's just a problem for our misc-web VCL (thanks!). We need to fix it for WDQS, since their output is always chunked.

The std.integer() wrappers on the Content-Length checks makes them succeed for chunked responses which have no Content-Length. This is the actual problem we had in the 15.wikipedia.org issue we blamed on TE:chunked as well.

One of the reasons for a size-based hit_for_pass limit in cache_misc historically was that it was a small memory-only cluster. It's now a proper cluster, but some of the frontends still have relatively small RAM, so we can't just bump the memory storage size universally either. What we could do is only apply the size limit on frontends, though, and let backend disk hits take the larger objects. Doing that in turn will of course break the cache_local_random thing for passes, so it may need disabling on the frontends like cache_upload, although really we should find a better way to deal with that problem (e.g. by marking layer-differential passes for special treatment, so that other passes can still randomize).

Event Timeline

BBlack created this task.Mar 3 2016, 11:47 PM
Restricted Application added a project: Operations. · View Herald TranscriptMar 3 2016, 11:47 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
BBlack added a comment.EditedMar 6 2016, 3:06 PM

So to re-state some of the summary, the reverse dependency chain is:

  1. misc_fetch_large_objects causes unnecessary hit-for-pass on "large" objects, where the definition of "large" is in fact quite small, and also includes "any chunked response". We want to remove that restriction, and improve the stream/hit-for-pass logic here in general, but...
  2. part of the reason for that restriction's implementation is that historically misc-web had fewer services, very small memory, no backend disk caches, and fewer total machines. As a result the frontend memory size was fixed at a modest 8GB. We should raise this memory size to a more-sensible value before we remove the size limit, but...
  3. while most of misc-web is on better machines now, the esams portion of the cluster is still lagging behind on machines which only have 16GB of total RAM. We already have am existing plan to fix this as part of T125485 .

Turning the order above back around we get:

  • 1. Get through the parts of T125485 necessary to improve the memory size of misc-web esams fronts. This is steps 1-3/6 there (I re-ordered them to prioritize this, as the other two steps are still pseudo-blocked on T109162)
  • 2. Raise misc's frontend memory size from fixed 8GB to a reasonable fraction of total RAM like the bigger clusters
  • 3. Refactor/improve misc_fetch_large_objects to remove/raise the all-layers hit-for-pass size limit and/or make it less-sensitive to chunked. A lot of this comes down to aligning it better with the upload caches' size-sensitive do_stream/hfp implementation which splits its behavior on layer boundaries and lacks the chunked problem for backends. While not directly part of resolving this ticket, this may lead to better understanding and improvement to upload's behavior as well, and we may eventually want to move this to shared wikimedia.vcl code if it makes universal sense (even on text, it may improve the situation for large article content fetches).

Change 279460 had a related patch set uploaded (by BBlack):
cache_misc: raise fe cache to half of memory

https://gerrit.wikimedia.org/r/279460

Change 279460 merged by BBlack:
cache_misc: raise fe cache to half of memory

https://gerrit.wikimedia.org/r/279460

Change 281499 had a related patch set uploaded (by BBlack):
VCL: remove all non-default between_bytes_timeout

https://gerrit.wikimedia.org/r/281499

Change 281499 merged by BBlack:
VCL: remove all non-default between_bytes_timeout

https://gerrit.wikimedia.org/r/281499

Change 282716 had a related patch set uploaded (by BBlack):
misc large objects refactor

https://gerrit.wikimedia.org/r/282716

Change 282895 had a related patch set uploaded (by Ema):
New misc VTC test: 09-chunked-response-add-cl.vtc

https://gerrit.wikimedia.org/r/282895

Change 282716 merged by BBlack:
misc large objects refactor

https://gerrit.wikimedia.org/r/282716

BBlack closed this task as Resolved.Apr 12 2016, 1:57 PM
BBlack claimed this task.

This should be resolved now, modulo perhaps some cache entries that need to fall out in the near future from caching the old behavior with hit-for-pass. It won't be performance-ideal on varnish3, but the code that's there should survive the transition and get better with varnish4.

Change 282895 merged by Ema:
New misc VTC test: 09-chunked-response-add-cl.vtc

https://gerrit.wikimedia.org/r/282895

Change 287623 had a related patch set uploaded (by BBlack):
Remove X-Pass-Stream support from cache_misc size stream VCL

https://gerrit.wikimedia.org/r/287623

Change 287623 merged by BBlack:
Remove X-Pass-Stream support from cache_misc size stream VCL

https://gerrit.wikimedia.org/r/287623

Change 287633 had a related patch set uploaded (by BBlack):
cache_misc: remove all do_stream=true

https://gerrit.wikimedia.org/r/287633

Change 287633 merged by BBlack:
cache_misc: remove all do_stream=true

https://gerrit.wikimedia.org/r/287633