
Discarded VCL files stuck in auto/busy state cause high number of backend probe requests
Closed, ResolvedPublic

Description

I've noticed a suspicious amount of requests on the cp5007 ats-be instance compared to ats-tls: ats-be is reporting ~8k rps while ats-tls is only reporting around ~3k rps. A quick check on the "Varnish backend Sessions accepted" graph shows that this is happening on every node of the text cluster in eqsin. A quick check with varnishtop suggests that requests against http://varnishcheck/check are the culprit.
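For reference, that kind of per-URL check can be done with varnishtop's tag filter; the output below is illustrative, not a capture from the incident:

$ varnishtop -1 -i ReqUrl
  3456.78 ReqUrl /check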

Currently varnish-frontend on cp5008-5012 is each generating around ~1.5-1.7k rps against http://varnishcheck/check:

vgutierrez@cp5008:~$ timeout 1 varnishncsa -q "ReqUrl ~ check" > check.log
vgutierrez@cp5008:~$ cut -f1 -d' ' check.log |sort |uniq -c |sort -rn
   1778 10.132.0.112
   1727 10.132.0.108
   1684 10.132.0.111
   1573 10.132.0.109
   1560 10.132.0.110
     30 10.132.0.107

This traffic is triggered by the varnish probe configured here:

probe varnish {
	.request =
		"GET /check HTTP/1.1"
		"Host: varnishcheck"
		"User-agent: Varnish backend check"
		"Connection: close";
	.timeout = <%= @vcl_config.fetch('varnish_probe_ms') %>ms;
	.interval = 100ms;
	.window = 5;
	.threshold = 3;
}
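The probe interval alone explains the volume: at .interval = 100ms, every loaded VCL (including discarded-but-stuck copies) fires 10 probe requests per second towards each backend it probes, so the rate grows linearly with the number of stuck VCLs. A back-of-envelope sketch (the VCL and backend counts are purely illustrative):

# 1 probe / 100ms = 10 probes/s per (VCL, probed backend) pair;
# e.g. 10 stuck VCLs each probing 17 backends from one frontend:
$ echo $((10 * 10 * 17))
1700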

As a quick test I restarted varnish-frontend on cp5008 at 03:35 UTC; that dropped the amount of requests on the backend layer from ~8k rps to ~6.5k rps (note how one source IP drops from ~1.7k to 20 requests in the capture below):

vgutierrez@cp5008:~$ timeout 1 varnishncsa -q "ReqUrl ~ check" > check-after-restart.log
vgutierrez@cp5008:~$ cut -f1 -d' ' check-after-restart.log |sort |uniq -c |sort -rn
   1770 10.132.0.112
   1667 10.132.0.111
   1587 10.132.0.110
   1574 10.132.0.109
     30 10.132.0.107
     20 10.132.0.108

A quick check on Grafana shows that the other DCs are also affected by this varnish-fe behaviour on the text cluster.

Event Timeline

On cache_text we have a fairly significant number of VCL files stuck in the "auto/busy" state after having been discarded by our reload script. As an example, right now we have 10 VCLs in such a state on cp3050 (text), and only 2 on cp3057 (upload). They can be seen with varnishadm -n frontend vcl.list. Each stuck VCL keeps running all of its probes, causing the requests mentioned in this ticket. The issue seems to be known upstream, but the report "timed out": https://github.com/varnishcache/varnish-cache/issues/2228
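For illustration, a stuck entry looks like this in the vcl.list output (the names and busy counts here are made up; the format matches the real entries quoted further down):

$ varnishadm -n frontend vcl.list
active      auto/warm          0 vcl-11111111-1111-1111-1111-111111111111
discarded   auto/busy       1286 vcl-22222222-2222-2222-2222-222222222222

Any "discarded" line with a non-zero busy counter is one of the stuck VCLs.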

ema renamed this task from "varnish-fe is flooding the text backend caching layer with backend probe requests" to "Discarded VCL files stuck in auto/busy state cause high number of backend probe requests". Oct 29 2019, 10:47 AM
ema triaged this task as Medium priority.
ema moved this task from Backlog to Caching on the Traffic board.

Change 596182 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: add stp script to debug vcl reference leak

https://gerrit.wikimedia.org/r/596182

Change 596182 merged by Ema:
[operations/puppet@production] varnish: add stp script to debug vcl reference leak

https://gerrit.wikimedia.org/r/596182

Change 596237 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm15: add 0037-force-discard.patch

https://gerrit.wikimedia.org/r/596237

Change 596469 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl_reference_leak: instrument VCL_Poll

https://gerrit.wikimedia.org/r/596469

Change 596469 merged by Ema:
[operations/puppet@production] vcl_reference_leak: instrument VCL_Poll

https://gerrit.wikimedia.org/r/596469

Change 596237 merged by Ema:
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm15: add 0037-force-discard.patch

https://gerrit.wikimedia.org/r/596237

Mentioned in SAL (#wikimedia-operations) [2020-05-15T08:51:18Z] <ema> cp2029: try out varnish 5.1.3-1wm15 T236754

Mentioned in SAL (#wikimedia-operations) [2020-05-15T09:21:06Z] <ema> cp2029: attempt forced discard of stuck VCL T236754

Change 596626 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm15: don't set temperature to cold

https://gerrit.wikimedia.org/r/596626

What is happening is quite clear: due to some sort of race when reloading and/or discarding VCL, vcl->busy isn't decremented as it should be. Perhaps this is due to calls to VCL_Ref not being followed by a matching VCL_Rel. In any case, when VCL_Poll goes through all loaded VCL files, it finds busy > 0 and never garbage collects them. What I am currently trying is setting busy=0 with SystemTap in guru mode for those VCLs, which does result in them being removed, and then seeing if/when/how varnishd crashes:

# Discard one stuck VCL such as:
# discarded   auto/busy       1286 vcl-2fabe8de-afbe-4a56-97dc-d4a4ed803d0e
#
# Skip those in "warm" state to avoid varnishd child crashes. For example:
# discarded   warm/busy       1208 vcl-955b696e-16fa-4c59-bba4-f3c2d752dcbf
#
probe process("/usr/sbin/varnishd").statement("VCL_Poll@cache/cache_vcl.c:794") {
    if ($vcl->discard == 1 && $vcl->busy > 0 && user_string_warn($vcl->state) != "warm") {
        printf("Setting busy=0 on %s %s\n", user_string_warn($vcl->loaded_name), user_string_warn($vcl->state))
        $vcl->busy = 0
        exit()
    }
}
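Writing to the target process's memory requires running SystemTap in guru mode (-g); assuming the script above is saved as set-busy-zero.stp (filename mine), the invocation and its output look roughly like:

$ sudo stap -g set-busy-zero.stp
Setting busy=0 on vcl-2fabe8de-afbe-4a56-97dc-d4a4ed803d0e auto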

My current theory is that setting busy to 0 for discarded VCLs in the auto/busy state should be safe (how safe it is to alter the memory of another process in production is up for discussion), while doing so for those in warm/busy probably isn't.

If things don't fail spectacularly I'll update https://gerrit.wikimedia.org/r/#/c/operations/debs/varnish4/+/596237/3/debian/patches/0037-force-discard.patch to skip VCLs in the warm state, so that at least we have a CLI tool to get rid of (some) stuck VCLs.

I have tried getting rid of vcl-e7ac6c17-ad61-4947-afdb-835f4eee6caa on cp3050 at ~09:12 by setting busy to 0. After about one hour, varnishd crashed in VCL_Rel as follows:

May 18 10:12:09 cp3050 varnishd[22025]: Child (22080) not responding to CLI, killed it.
May 18 10:12:09 cp3050 varnishd[22025]: Unexpected reply from ping: 400 CLI communication error (hdr)
May 18 10:12:09 cp3050 varnishd[22025]: Child (22080) not responding to CLI, killed it.
May 18 10:12:09 cp3050 varnishd[22025]: Unexpected reply from ping: 400 CLI communication error
May 18 10:12:11 cp3050 varnishd[22025]: Child (22080) died signal=11
May 18 10:12:11 cp3050 varnishd[22025]: Child (22080) Panic at: Mon, 18 May 2020 10:12:11 GMT
                                        Assert error in VCL_Rel(), cache/cache_vcl.c line 313:
                                          Condition((vcl)->magic == 0x214188f2) not true.
                                        version = varnish-5.1.3 revision NOGIT, vrt api = 6.0
                                        ident = Linux,4.19.0-8-amd64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
                                        now = 7644167.589495 (mono), 1589796696.800837 (real)
                                        Backtrace:
                                          0x55fade2727a1: /usr/sbin/varnishd(+0x4b7a1) [0x55fade2727a1]
                                          0x55fade281fdc: /usr/sbin/varnishd(+0x5afdc) [0x55fade281fdc]
                                          0x55fade274db5: /usr/sbin/varnishd(Req_Cleanup+0xa5) [0x55fade274db5]
                                          0x55fade2977ae: /usr/sbin/varnishd(+0x707ae) [0x55fade2977ae]
                                          0x55fade297919: /usr/sbin/varnishd(+0x70919) [0x55fade297919]
                                          0x55fade298948: /usr/sbin/varnishd(+0x71948) [0x55fade298948]
                                          0x55fade28e9ea: /usr/sbin/varnishd(+0x679ea) [0x55fade28e9ea]
                                          0x55fade28f0db: /usr/sbin/varnishd(+0x680db) [0x55fade28f0db]
                                          0x7f1890153fa3: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f1890153fa3]
                                          0x7f18900844cf: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f18900844cf]
                                        thread = (cache-worker)
                                        thr.req = 0x7efe4f727aa0 {
                                          vxid = 542771724, transport = HTTP/1 {
                                            state = HTTP1::Busy
                                          }
                                          step = R_STP_LOOKUP,
                                          req_body = R_BODY_NONE,
                                          restarts = 0, esi_level = 0,
                                          sp = 0x7f17a481c020 {
                                            fd = -1, vxid = 542771723,
                                            t_open = 1589562603.139946,
                                            t_idle = 1589562603.139946,
                                            transport = HTTP/1 {
                                              state = HTTP1::Busy
                                            }
                                            client = 10.20.0.50 19185,
                                            privs = 0x7f17a481c088 {
                                              priv {p 0x7efe4f72b050 l 24 f (nil)} vcl 0x7f188f8d5f00 id 7efe4f727aa0 vmod 7f188f9f6700
                                            },
                                          },
[...] logs continue with request details including PII

The request was stuck there for 3 days (see t_open = 1589562603.139946 above):

$ date -d@1589562603
Fri 15 May 2020 05:10:03 PM UTC
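This is consistent with the panic timestamp (now = 1589796696 real, quoted above):

$ echo "scale=1; (1589796696 - 1589562603) / 86400" | bc
2.7

i.e. roughly two days and 17 hours between the session being opened and the crash.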

Change 597091 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm15: add 0038-vcl_active-lock.patch

https://gerrit.wikimedia.org/r/597091

Change 596626 merged by Ema:
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm15: don't set temperature to cold

https://gerrit.wikimedia.org/r/596626

Change 597091 merged by Ema:
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm15: add 0038-vcl_active-lock.patch

https://gerrit.wikimedia.org/r/597091

Change 608606 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm16: add 0039-probe-cold-state-race.patch

https://gerrit.wikimedia.org/r/c/operations/debs/varnish4/+/608606

Change 608846 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: add SystemTap script to debug VCL discard issues

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608846

Change 608846 merged by Ema:
[operations/puppet@production] varnish: add SystemTap script to debug VCL discard issues

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608846

Change 608606 abandoned by Ema:
[operations/debs/varnish4@debian-wmf] 5.1.3-1wm16: add discard patches

Reason:
The added patches unfortunately do not help w/ T236754

https://gerrit.wikimedia.org/r/608606

Mentioned in SAL (#wikimedia-operations) [2020-07-27T12:23:16Z] <ema> A:cp rolling varnish-frontend restart to actually discard old VCL still pointing at varnishcheck/check T255015 T236754

The upgrade to Varnish 6 (T263557) seems to have fixed this, or at least I could not reproduce the problem by issuing various VCL reloads.

As expected, properly discarding old VCL files has a very positive impact on the number of requests received by ATS backends.

[Attachment: Screenshot from 2020-09-25 10-05-25.png (362 KB)]

ema claimed this task.

Confirmed on a busy text@esams node (cp3050): issuing various VCL reloads, all old VCLs get discarded properly. Closing!
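As a sketch, the verification boils down to something like the following (reload-vcl stands in for our actual reload wrapper; the name is illustrative):

$ for i in 1 2 3; do sudo reload-vcl; sleep 1; done
$ sudo varnishadm -n frontend vcl.list | awk '$1 == "discarded"'
$ # no output: every old VCL was garbage collected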