Page MenuHomePhabricator

Unconditional return(deliver) in vcl_hit
Open, MediumPublic

Description

Currently our vcl_hit subroutines return deliver without checking the object ttl/grace both at the frontend and backend layer.

The VCL built into varnish, instead, returns miss if the object lifetime is past both its ttl and ttl+grace:

sub vcl_hit {
    if (obj.ttl >= 0s) {
        // A pure unadulterated hit, deliver it
        return (deliver);
    }
    if (obj.ttl + obj.grace > 0s) {
        // Object is in grace, deliver it
        // Automatically triggers a background fetch
        return (deliver);
    }
    // fetch & deliver once we get the result
    return (miss);
}

The difference between miss and deliver in the context of vcl_hit is as follows:

  • miss: Synchronously refresh the object from the backend despite the cache hit. Control will eventually pass to vcl_miss.
  • deliver: Deliver the object. If it is stale, a background fetch to refresh it is triggered.

See https://varnish-cache.org/docs/5.1/users-guide/vcl-built-in-subs.html#vcl-hit

To better understand the behavior of varnish considering our unconditional deliver in vcl_miss we have written the following test case:

varnishtest "return deliver in vcl_hit (using grace)"

server s1 {
    rxreq
    txresp -hdr "Cache-Control: s-maxage=2" -hdr "Last-Modified: Tue, 03 Apr 2018 21:35:51 +0200" -hdr "X-Resp-ID: 1" -body "Test 1"

    rxreq
    txresp -hdr "Cache-Control: s-maxage=2" -hdr "Last-Modified: Tue, 03 Apr 2018 21:35:51 +0200" -hdr "X-Resp-ID: 2" -body "Test 2"
} -start

varnish v1 -vcl+backend {
    sub vcl_backend_response {
        set beresp.grace = 5s;
        set beresp.keep = 0s;
    }

    sub vcl_hit {
        return(deliver);
    }
} -start

# First request, cache miss
client c1 {
    txreq
    rxresp
    expect resp.http.X-Resp-ID == 1
} -run

varnish v1 -expect cache_hit == 0
varnish v1 -expect cache_miss == 1

# Second request, cache hit
client c2 {
    txreq
    rxresp
    expect resp.http.X-Resp-ID == 1
} -run

varnish v1 -expect cache_hit == 1                                                                                                                             
varnish v1 -expect cache_miss == 1                                                                                                                            
                                                                                                                                                              
# Third request waiting 3s to make the 2s ttl expire. This will result in a                                                                                   
# bgfetch and the stale object being returned. Still considered a cache hit.                                                                                  
client c3 {                                                                                                                                                   
    delay 3                                                                                                                                                   
    txreq                                                                                                                                                     
    rxresp                                                                                                                                                    
    expect resp.http.X-Resp-ID == 1                                                                                                                           
} -run                                                                                                                                                        
                                                                                                                                                              
varnish v1 -expect cache_hit == 2                                                                                                                             
varnish v1 -expect cache_miss == 1                                                                                                                            
                                                                                                                                                              
# Fourth request. This one returns the second object and is considered                                                                                        
# a cache hit as well.                                                                                                                                        
client c4 {                                                                                                                                                   
    txreq                                                                                                                                                     
    rxresp                                                                                                                                                    
    expect resp.http.X-Resp-ID == 2                                                                                                                           
} -run                                                                                                                                                        
                                                                                                                                                              
varnish v1 -expect cache_hit == 3                                                                                                                             
varnish v1 -expect cache_miss == 1                                                                                                                            
                                                                                                                                                              
varnish v1 -expect n_expired == 0

If we use keep instead of grace in vcl_backend_response, the test behaves exactly the same (not only in that it passes, but in the sense that varnishtest -v yields the same output except for timestamps/port numbers):

sub vcl_backend_response {
    set beresp.grace = 0s;
    set beresp.keep = 5s;
}

It thus seems that the distinction between grace and keep is not somehow internal to varnish. Rather, it needs to be implemented via VCL.

Because of this, our CDN is presently making keep behave as grace.

  • We know that the expiry thread is often unable to cope with load on our varnish backends. When that happens, objects stick around in cache well past their ttl+grace+keep lifetime. Do those also behave like graced objects?
  • Only one request is considered a cache miss in the vtc test above, although there's been two full fetches from the origin server. Does that mean that our stats are currently skewed towards hit?

The test behavior changes by delaying the origin server response. In that case, request number 4 occurs while the varnish<->origin bgfetch is in flight, and it results in the first object being returned.

--- x.vtc	2018-04-18 15:42:57.704383375 +0200
+++ y.vtc	2018-04-18 15:45:11.499248078 +0200
@@ -5,6 +5,7 @@
     txresp -hdr "Cache-Control: s-maxage=2" -hdr "Last-Modified: Tue, 03 Apr 2018 21:35:51 +0200" -hdr "X-Resp-ID: 1" -body "Test 1"
 
     rxreq
+    delay 1
     txresp -hdr "Cache-Control: s-maxage=2" -hdr "Last-Modified: Tue, 03 Apr 2018 21:35:51 +0200" -hdr "X-Resp-ID: 2" -body "Test 2"
 } -start
 
@@ -51,12 +52,12 @@
 varnish v1 -expect cache_hit == 2
 varnish v1 -expect cache_miss == 1
 
-# Fourth request. This one returns the second object and is considered
-# a cache hit as well.
+# Fourth request. This one returns the first object while the bgfetch from the
+# origin server is in progress. This is considered a cache hit as well.
 client c4 {
     txreq
     rxresp
-    expect resp.http.X-Resp-ID == 2
+    expect resp.http.X-Resp-ID == 1
 } -run
 
 varnish v1 -expect cache_hit == 3

That seems to indicate that we might be responding with (very, if the expiry mailbox theory is confirmed) stale objects to quite a few requests, while the bgfetch is in progress.

Details

Related Gerrit Patches:
operations/puppet : productionvcl: invoke builtin vcl_hit
operations/debs/varnish4 : debian-wmf5.1.3-1wm8: add patches included in 4.1.10
operations/debs/varnish4 : debian-wmfAdd cache_hit_grace counter

Event Timeline

ema created this task.Apr 17 2018, 3:11 PM
Restricted Application added a project: Operations. · View Herald TranscriptApr 17 2018, 3:11 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ema triaged this task as Medium priority.Apr 17 2018, 3:12 PM
ema moved this task from Triage to Caching on the Traffic board.Apr 18 2018, 1:30 PM
ema updated the task description. (Show Details)Apr 18 2018, 1:49 PM
ema added a comment.EditedApr 24 2018, 1:11 PM

As of now, returning deliver instead of miss is a valid mitigation for #1799. The main drawback, as mentioned in this ticket, is that we're potentially returning stale objects.

Next steps:

  1. Deploy varnish 5.1.3-1wm7
  2. Return miss in vcl_hit for objects past grace
  3. Set a short grace for healthy backends, to be increased in case of sickness

Mentioned in SAL (#wikimedia-operations) [2018-04-25T14:32:59Z] <ema> cp3030: upgrade varnish to 5.1.3-1wm7 T192368

Change 429395 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] Add cache_hit_grace counter

https://gerrit.wikimedia.org/r/429395

Change 429395 merged by Ema:
[operations/debs/varnish4@debian-wmf] Add cache_hit_grace counter

https://gerrit.wikimedia.org/r/429395

Change 429839 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm8: add patches included in 4.1.10

https://gerrit.wikimedia.org/r/429839

Change 429839 merged by Ema:
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm8: add patches included in 4.1.10

https://gerrit.wikimedia.org/r/429839

Mentioned in SAL (#wikimedia-operations) [2018-05-02T08:22:05Z] <ema> varnish 5.1.3-1wm8 uploaded to apt.w.o T192368

Mentioned in SAL (#wikimedia-operations) [2018-05-03T08:13:11Z] <ema> cp-misc: upgrade varnish to 5.1.3-1wm8 T192368

Mentioned in SAL (#wikimedia-operations) [2018-05-03T14:00:47Z] <ema> cp3030 (text): upgrade varnish to 5.1.3-1wm8 T192368

Mentioned in SAL (#wikimedia-operations) [2018-05-07T09:30:35Z] <ema> cp-text/upload: start varnish upgrades to 5.1.3-1wm8 T192368

ema updated the task description. (Show Details)May 16 2018, 9:10 AM

Change 433338 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: invoke builtin vcl_hit

https://gerrit.wikimedia.org/r/433338