Page MenuHomePhabricator

Unconditional return(deliver) in vcl_hit
Open, MediumPublic

Description

Currently our vcl_hit subroutines return deliver without checking the object ttl/grace both at the frontend and backend layer.

The VCL built into varnish, instead, returns miss if the object lifetime is past both its ttl and ttl+grace:

sub vcl_hit {
    if (obj.ttl >= 0s) {
        // A pure unadulterated hit, deliver it
        return (deliver);
    }
    if (obj.ttl + obj.grace > 0s) {
        // Object is in grace, deliver it
        // Automatically triggers a background fetch
        return (deliver);
    }
    // fetch & deliver once we get the result
    return (miss);
}

The difference between miss and deliver in the context of vcl_hit is as follows:

  • miss: Synchronously refresh the object from the backend despite the cache hit. Control will eventually pass to vcl_miss.
  • deliver: Deliver the object. If it is stale, a background fetch to refresh it is triggered.

See https://varnish-cache.org/docs/5.1/users-guide/vcl-built-in-subs.html#vcl-hit

To better understand the behavior of varnish considering our unconditional deliver in vcl_miss we have written the following test case:

varnishtest "return deliver in vcl_hit (using grace)"

server s1 {
    rxreq
    txresp -hdr "Cache-Control: s-maxage=2" -hdr "Last-Modified: Tue, 03 Apr 2018 21:35:51 +0200" -hdr "X-Resp-ID: 1" -body "Test 1"

    rxreq
    txresp -hdr "Cache-Control: s-maxage=2" -hdr "Last-Modified: Tue, 03 Apr 2018 21:35:51 +0200" -hdr "X-Resp-ID: 2" -body "Test 2"
} -start

varnish v1 -vcl+backend {
    sub vcl_backend_response {
        set beresp.grace = 5s;
        set beresp.keep = 0s;
    }

    sub vcl_hit {
        return(deliver);
    }
} -start

# First request, cache miss
client c1 {
    txreq
    rxresp
    expect resp.http.X-Resp-ID == 1
} -run

varnish v1 -expect cache_hit == 0
varnish v1 -expect cache_miss == 1

# Second request, cache hit
client c2 {
    txreq
    rxresp
    expect resp.http.X-Resp-ID == 1
} -run

varnish v1 -expect cache_hit == 1                                                                                                                             
varnish v1 -expect cache_miss == 1                                                                                                                            
                                                                                                                                                              
# Third request waiting 3s to make the 2s ttl expire. This will result in a                                                                                   
# bgfetch and the stale object being returned. Still considered a cache hit.                                                                                  
client c3 {                                                                                                                                                   
    delay 3                                                                                                                                                   
    txreq                                                                                                                                                     
    rxresp                                                                                                                                                    
    expect resp.http.X-Resp-ID == 1                                                                                                                           
} -run                                                                                                                                                        
                                                                                                                                                              
varnish v1 -expect cache_hit == 2                                                                                                                             
varnish v1 -expect cache_miss == 1                                                                                                                            
                                                                                                                                                              
# Fourth request. This one returns the second object and is considered                                                                                        
# a cache hit as well.                                                                                                                                        
client c4 {                                                                                                                                                   
    txreq                                                                                                                                                     
    rxresp                                                                                                                                                    
    expect resp.http.X-Resp-ID == 2                                                                                                                           
} -run                                                                                                                                                        
                                                                                                                                                              
varnish v1 -expect cache_hit == 3                                                                                                                             
varnish v1 -expect cache_miss == 1                                                                                                                            
                                                                                                                                                              
varnish v1 -expect n_expired == 0

If we use keep instead of grace in vcl_backend_response, the test behaves exactly the same (not only in that it passes, but in the sense that varnishtest -v yields the same output except for timestamps/port numbers):

sub vcl_backend_response {
    set beresp.grace = 0s;
    set beresp.keep = 5s;
}

It thus seems that the distinction between grace and keep is not somehow internal to varnish. Rather, it needs to be implemented via VCL.

Because of this, our CDN is presently making keep behave as grace.

  • We know that the expiry thread is often unable to cope with load on our varnish backends. When that happens, objects stick around in cache well past their ttl+grace+keep lifetime. Do those also behave like graced objects?
  • Only one request is considered a cache miss in the vtc test above, although there's been two full fetches from the origin server. Does that mean that our stats are currently skewed towards hit?

The test behavior changes by delaying the origin server response. In that case, request number 4 occurs while the varnish<->origin bgfetch is in flight, and it results in the first object being returned.

--- x.vtc	2018-04-18 15:42:57.704383375 +0200
+++ y.vtc	2018-04-18 15:45:11.499248078 +0200
@@ -5,6 +5,7 @@
     txresp -hdr "Cache-Control: s-maxage=2" -hdr "Last-Modified: Tue, 03 Apr 2018 21:35:51 +0200" -hdr "X-Resp-ID: 1" -body "Test 1"
 
     rxreq
+    delay 1
     txresp -hdr "Cache-Control: s-maxage=2" -hdr "Last-Modified: Tue, 03 Apr 2018 21:35:51 +0200" -hdr "X-Resp-ID: 2" -body "Test 2"
 } -start
 
@@ -51,12 +52,12 @@
 varnish v1 -expect cache_hit == 2
 varnish v1 -expect cache_miss == 1
 
-# Fourth request. This one returns the second object and is considered
-# a cache hit as well.
+# Fourth request. This one returns the first object while the bgfetch from the
+# origin server is in progress. This is considered a cache hit as well.
 client c4 {
     txreq
     rxresp
-    expect resp.http.X-Resp-ID == 2
+    expect resp.http.X-Resp-ID == 1
 } -run
 
 varnish v1 -expect cache_hit == 3

That seems to indicate that we might be responding with (very, if the expiry mailbox theory is confirmed) stale objects to quite a few requests, while the bgfetch is in progress.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ema triaged this task as Medium priority.Apr 17 2018, 3:12 PM

As of now, returning deliver instead of miss is a valid mitigation for #1799. The main drawback, as mentioned in this ticket, is that we're potentially returning stale objects.

Next steps:

  1. Deploy varnish 5.1.3-1wm7
  2. Return miss in vcl_hit for objects past grace
  3. Set a short grace for healthy backends, to be increased in case of sickness

Mentioned in SAL (#wikimedia-operations) [2018-04-25T14:32:59Z] <ema> cp3030: upgrade varnish to 5.1.3-1wm7 T192368

Change 429395 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] Add cache_hit_grace counter

https://gerrit.wikimedia.org/r/429395

Change 429395 merged by Ema:
[operations/debs/varnish4@debian-wmf] Add cache_hit_grace counter

https://gerrit.wikimedia.org/r/429395

Change 429839 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm8: add patches included in 4.1.10

https://gerrit.wikimedia.org/r/429839

Change 429839 merged by Ema:
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm8: add patches included in 4.1.10

https://gerrit.wikimedia.org/r/429839

Mentioned in SAL (#wikimedia-operations) [2018-05-02T08:22:05Z] <ema> varnish 5.1.3-1wm8 uploaded to apt.w.o T192368

Mentioned in SAL (#wikimedia-operations) [2018-05-03T08:13:11Z] <ema> cp-misc: upgrade varnish to 5.1.3-1wm8 T192368

Mentioned in SAL (#wikimedia-operations) [2018-05-03T14:00:47Z] <ema> cp3030 (text): upgrade varnish to 5.1.3-1wm8 T192368

Mentioned in SAL (#wikimedia-operations) [2018-05-07T09:30:35Z] <ema> cp-text/upload: start varnish upgrades to 5.1.3-1wm8 T192368

Change 433338 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: invoke builtin vcl_hit

https://gerrit.wikimedia.org/r/433338

Change 433338 abandoned by Ema:
vcl: invoke builtin vcl_hit

Reason:
expired

https://gerrit.wikimedia.org/r/433338

BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!