cache_misc's misc_fetch_large_objects has issues
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	BBlack
	Mar 3 2016, 11:47 PM

Description

Just taking down notes for now, will investigate deeper later:

So @Smalyshev figured out that TE:chunked isn't actually a problem for our varnish3, it's just a problem for our misc-web VCL (thanks!). We need to fix it for WDQS, since their output is always chunked.

The std.integer() wrappers on the Content-Length checks makes them succeed for chunked responses which have no Content-Length. This is the actual problem we had in the 15.wikipedia.org issue we blamed on TE:chunked as well.

One of the reasons for a size-based hit_for_pass limit in cache_misc historically was that it was a small memory-only cluster. It's now a proper cluster, but some of the frontends still have relatively small RAM, so we can't just bump the memory storage size universally either. What we could do is only apply the size limit on frontends, though, and let backend disk hits take the larger objects. Doing that in turn will of course break the cache_local_random thing for passes, so it may need disabling on the frontends like cache_upload, although really we should find a better way to deal with that problem (e.g. by marking layer-differential passes for special treatment, so that other passes can still randomize).

Details

Subject	Repo	Branch	Lines +/-
cache_misc: remove all do_stream=true	operations/puppet	production	+0 -14
Remove X-Pass-Stream support from cache_misc size+stream VCL	operations/puppet	production	+2 -2
New misc VTC test: 09-chunked-response-add-cl.vtc	operations/puppet	production	+25 -0
misc large objects refactor	operations/puppet	production	+43 -18
VCL: remove all non-default between_bytes_timeout	operations/puppet	production	+5 -16
cache_misc: raise fe cache to half of memory	operations/puppet	production	+1 -1

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	Yurik	T126741 Add support for the wikidata's Sparql queries to graphs
Resolved	Smalyshev	T126730 [RFC] Caching for results of wikidata Sparql queries
Resolved	BBlack	T127014 Empty result on a tree query
Resolved	BBlack	T128813 cache_misc's misc_fetch_large_objects has issues
Resolved	• ema	T131501 Convert misc cluster to Varnish 4
Resolved	• ema	T134989 WDQS empty response - transfer clsoed with 15042 bytes remaining to read

Event Timeline

BBlack created this task.Mar 3 2016, 11:47 PM

Restricted Application added a project: SRE. · View Herald TranscriptMar 3 2016, 11:47 PM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Smalyshev mentioned this in T128656: Make WDQS responses cacheable by eliminating chunking.Mar 3 2016, 11:48 PM

Smalyshev added a parent task: T126730: [RFC] Caching for results of wikidata Sparql queries.

So to re-state some of the summary, the reverse dependency chain is:

misc_fetch_large_objects causes unnecessary hit-for-pass on "large" objects, where the definition of "large" is in fact quite small, and also includes "any chunked response". We want to remove that restriction, and improve the stream/hit-for-pass logic here in general, but...
part of the reason for that restriction's implementation is that historically misc-web had fewer services, very small memory, no backend disk caches, and fewer total machines. As a result the frontend memory size was fixed at a modest 8GB. We should raise this memory size to a more-sensible value before we remove the size limit, but...
while most of misc-web is on better machines now, the esams portion of the cluster is still lagging behind on machines which only have 16GB of total RAM. We already have am existing plan to fix this as part of T125485 .

Turning the order above back around we get:

1. Get through the parts of T125485 necessary to improve the memory size of misc-web esams fronts. This is steps 1-3/6 there (I re-ordered them to prioritize this, as the other two steps are still pseudo-blocked on T109162)
2. Raise misc's frontend memory size from fixed 8GB to a reasonable fraction of total RAM like the bigger clusters
3. Refactor/improve misc_fetch_large_objects to remove/raise the all-layers hit-for-pass size limit and/or make it less-sensitive to chunked. A lot of this comes down to aligning it better with the upload caches' size-sensitive do_stream/hfp implementation which splits its behavior on layer boundaries and lacks the chunked problem for backends. While not directly part of resolving this ticket, this may lead to better understanding and improvement to upload's behavior as well, and we may eventually want to move this to shared wikimedia.vcl code if it makes universal sense (even on text, it may improve the situation for large article content fetches).

BBlack mentioned this in T127014: Empty result on a tree query .Mar 23 2016, 3:55 PM

BBlack added a parent task: T127014: Empty result on a tree query .Mar 23 2016, 9:25 PM

Change 279460 had a related patch set uploaded (by BBlack):
cache_misc: raise fe cache to half of memory

https://gerrit.wikimedia.org/r/279460

Change 279460 merged by BBlack:
cache_misc: raise fe cache to half of memory

https://gerrit.wikimedia.org/r/279460

• ema added a project: Varnish.Apr 1 2016, 1:36 PM

JanZerebecki subscribed.Apr 3 2016, 10:25 PM

BBlack mentioned this in T131761: Solve large-object/stream/pass/chunked in upload cluster better.Apr 4 2016, 2:35 PM

BBlack created subtask T131761: Solve large-object/stream/pass/chunked in upload cluster better.

Change 281499 had a related patch set uploaded (by BBlack):
VCL: remove all non-default between_bytes_timeout

https://gerrit.wikimedia.org/r/281499

BBlack edited subtasks, added: T131501: Convert misc cluster to Varnish 4; removed: T131761: Solve large-object/stream/pass/chunked in upload cluster better.Apr 5 2016, 12:51 PM

Change 281499 merged by BBlack:
VCL: remove all non-default between_bytes_timeout

https://gerrit.wikimedia.org/r/281499

BBlack merged a task: T124195: compressed http responses without content-length not cached by varnish.Apr 6 2016, 1:36 PM

BBlack added subscribers: elukey, Dzahn, ori and 2 others.

Change 282716 had a related patch set uploaded (by BBlack):
misc large objects refactor

https://gerrit.wikimedia.org/r/282716

Change 282895 had a related patch set uploaded (by Ema):
New misc VTC test: 09-chunked-response-add-cl.vtc

https://gerrit.wikimedia.org/r/282895

Change 282716 merged by BBlack:
misc large objects refactor

https://gerrit.wikimedia.org/r/282716

This should be resolved now, modulo perhaps some cache entries that need to fall out in the near future from caching the old behavior with hit-for-pass. It won't be performance-ideal on varnish3, but the code that's there should survive the transition and get better with varnish4.

Change 282895 merged by Ema:
New misc VTC test: 09-chunked-response-add-cl.vtc

https://gerrit.wikimedia.org/r/282895

Mushroom mentioned this in T133490: Wikidata Query Service REST endpoint returns truncated results.Apr 24 2016, 3:08 PM

Change 287623 had a related patch set uploaded (by BBlack):
Remove X-Pass-Stream support from cache_misc size stream VCL

https://gerrit.wikimedia.org/r/287623

Change 287623 merged by BBlack:
Remove X-Pass-Stream support from cache_misc size stream VCL

https://gerrit.wikimedia.org/r/287623

Change 287633 had a related patch set uploaded (by BBlack):
cache_misc: remove all do_stream=true

https://gerrit.wikimedia.org/r/287633

Change 287633 merged by BBlack:
cache_misc: remove all do_stream=true

https://gerrit.wikimedia.org/r/287633

• MZMcBride subscribed.May 9 2016, 2:46 PM