Page MenuHomePhabricator

requestctl can't act on cache hits
Closed, ResolvedPublic

Description

Last night and the night before, we were paged for network saturation due to what looks like a hotlinked image.

Ideally, we'd like to be able to serve that kind of traffic, but given our current resources and as an emergency measure, requestctl should be able to throttle it, in order to keep our bandwidth within limits. Unlike the use case where requestctl protects the app servers from a cache-busting attack, here the problematic traffic is all cache hits.

Currently, requestctl's generated VCL filters are only invoked in cluster_fe_ratelimit, which is only called from cluster_fe_miss or cluster_fe_pass. There's no entry point from vcl_hit (wikimedia-frontend.vcl.erb) so requestctl can't affect traffic on the hit path.

Note that the requestctl schema does contain a field cache_miss_only, which is documented to mean that cache hits are filtered when the field is set false -- but it was added optimistically and currently doesn't do anything (and can't, until the control flows are changed as above). If we can't figure out a way to actually implement this behavior in the near term, we should remove that field or at least add a prominent warning label.

[ ] Patch requestctl to check the value of internal cache disposition header req.http.X-CDIS against the action's cache_miss_only as part of evaluation. Within our VCL, this will be hit on a hit, miss on a miss, etc.

  • Release requestctl.
  • Add an optional stanza, inclusion controlled by the value of a new hiera variable, to *-frontend.inc.vcl.erb's cluster_fe_ratelimit_hit sub, that just's a include "requestctl-filters.inc.vcl"; as already happens in cluster_fe_ratelimit.
  • Enable that hiera bit on a few upload cps in eqsin as an initial evaluation and some protection against that same image file hotlink.
  • Enable that hiera bit on a few text cps, watching for any performance impact.
  • Roll it out to the fleet.

Event Timeline

RLazarus triaged this task as Medium priority.

Going to be bold and append to the task description with what we discussed in the cachebust WG meeting today (so that anyone can update it and tick boxes as we go).

Change 832268 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines

https://gerrit.wikimedia.org/r/832268

Change 832631 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:cache::varnish::frontend: Add parameter to enable requestctl on hits

https://gerrit.wikimedia.org/r/832631

Change 832268 merged by Jbond:

[operations/puppet@production] C:varnish: Add cluster_fe_hit and cluster_fe_ratelimit_hits subroutines

https://gerrit.wikimedia.org/r/832268

While implementing the the dry-run version of the current hotlinking patch i found out that its not possible to update the resp object in the vcl_hit hook. As such its not possible to update the X-Analytics header with the current suggested approach. This impacts two aspects of the design

Patch requestctl to check the value of internal cache disposition header req.http.X-CDIS against the action's cache_miss_only as part of evaluation. Within our VCL, this will be hit on a hit, miss on a miss, etc.

With the above approach i think the idea is that we could use the exact same bit of vcl code in both the hit and miss paths but control which rule is hit based on the X-CDIS. however the current vcl code that we drop in sets req.http.X-Requestctl on every rule match which will not be possible in the vcl_hit block of code. I think we could resolve this by, instead of mapping the hit/miss parameter to the vcl X-CDIS header, we create two functions, requestcl_ratelimit_miss` and requestcl_ratelimit_hit, in requestctl-filters.inc.vcl. we can then call requestcl_ratelimit_* from the correct blocks(s) in the main vcl code. e.g.

varnish-frontend-requestctl-filters.vcl.tpl.erb
sub requestcl_ratelimit_hit {
  <% rule.filter(cache_miss_only == false).each |rule| %>
  // all the hit rules
  <%= rule %>
}
sub requestcl_ratelimit_miss {
  // all the miss rules
   <% rule.filter(cache_miss_only == true).each |rule| %>
  // all the hit rules
  <%= rule %>
}
wikimedia-frontend.vcl.erb
include "requestctl-filters.inc.vcl";
...
sub vcl_hit {
  call requestcl_ratelimit_hit;
  ...
}
sub vcl_miss {
  call requestcl_ratelimit_miss;
  ...
}

This still doesn't help with how we can record the rules match in the vcl_hit block. from what i could see its just not possible to modify the request at this point in the vcl code so im not sure what to do to resolve this. however..

Create placeholder-for-now cluster_fe_{ratelimit_,}hit subroutines in our VCL https://gerrit.wikimedia.org/r/c/operations/puppet/+/832268

this is not in implemented but i wonder if vcl_hit is the best place to block cache hits. Specifically i dont think we want to rate-limit cache misses vs cache hits, what we want to do is ratelimit cache misses vs anything else. As such i wonder if we should instead put hooks in vcl_miss for cache misses and in vcl_recv for everything else. This has the benefit that we start ratlimiting the everything else requests e.g. hotlinks much earlier in the processing and although we still cant modify resp we can modify req which may be enough to add some analytics signalling

to be explicit the proposal would be to move to something like

varnish-frontend-requestctl-filters.vcl.tpl.erb
sub requestcl_ratelimit_recv {
  <% rule.filter(cache_miss_only == false).each |rule| %>
  // all the hit rules
  <%= rule %>
}
sub requestcl_ratelimit_miss {
  // all the miss rules
   <% rule.filter(cache_miss_only == true).each |rule| %>
  // all the hit rules
  <%= rule %>
}
wikimedia-frontend.vcl.erb
include "requestctl-filters.inc.vcl";
...
sub vcl_recv {
  call requestcl_ratelimit_recv;
  ...
}
sub vcl_miss {
  call requestcl_ratelimit_miss;
  ...
}
akosiaris subscribed.

Removing SRE, has already been triaged to a more specific SRE subteam

While implementing the the dry-run version of the current hotlinking patch i found out that its not possible to update the resp object in the vcl_hit hook. As such its not possible to update the X-Analytics header with the current suggested approach. This impacts two aspects of the design

This is apparently not true, as we now have code that does exactly that. And the varnish documentation suggests the same:

req.http.*
    Type: HEADER
    Readable from: client
    Writable from: client
    Unsetable from: client

    The headers of request, things like req.http.date.

    The RFCs allow multiple headers with the same name, and both set and unset will remove all headers with the name given.

and client is defined as:

Which operations are possible on each variable is described below, often with the shorthand “backend” which covers the vcl_backend_* methods and “client” which covers the rest, except vcl_init and vcl_fini

Change #1054081 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] varnish: add requestctl filters for cache hits

https://gerrit.wikimedia.org/r/1054081

To clarify a bit, I didn't take the route described in the task. In fact, we want:

  • Rules with cache_miss_only: true to only be processed in vcl_pass/vcl_miss
  • Rules with cache_miss_only:false to be processed both in vcl_pass/vcl_miss and vcl_hit

So we've just separated rules so that they have a different patch on etcd - /requestctl-vcl/cache-text/global for miss-only things and /requestctl-vcl/cache-text/hit-global. This way we can filter which rules we include where and we don't need to mess with X-CDIS.

I also don't think we need protection against errors and merging in just a few servers, but I can be convinced otherwise.

Change #1054081 merged by Giuseppe Lavagetto:

[operations/puppet@production] varnish: add requestctl filters for cache hits

https://gerrit.wikimedia.org/r/1054081

Joe updated the task description. (Show Details)