
ulsfo varnish-fe vcache processes overflow on FDs
Open, Medium, Public

Description

cp4029 was reported throwing errors by @bd808 on IRC; a quick look made it clear that the host was not performing well, so @CDanis depooled it.

From grafana and syslog it seems that the host started to log errors at around 20:39 UTC.

Investigation afterwards shows that the uid=vcache varnishd process is consuming approximately 499.8k file descriptors (all but a hundred or so are sockets). However, said sockets do not show up in any of the usual socket-reporting tools like ss or netstat.

Currently the host is depooled pending further investigation.
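
For reference, a quick way to confirm that breakdown on the host is to group the /proc fd symlink targets by type (a sketch; 42249 is the vcache varnishd child pid seen later in this task, substitute the current one as needed):

# Group this process's fd symlink targets by type; per the description above,
# "socket:" should account for all but a hundred or so entries.
# The pid can be found with e.g. `ps faxu | awk '/v[c]ache/ { print $2 }'`.
PID=42249
sudo ls -l "/proc/${PID}/fd" 2>/dev/null \
  | awk 'NR > 1 { print $NF }' \
  | sed 's/:\[[0-9]*\]$/:/' \
  | sort | uniq -c | sort -rn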

Event Timeline

Restricted Application added a subscriber: Aklapper.
Volans triaged this task as Medium priority. Jan 24 2020, 10:03 PM
Volans updated the task description.
Volans added a subscriber: bd808.

I've noticed that on Icinga the 3 checks matching *_fifo_* were in UNKNOWN state with a timeout message.
@CDanis found that the lsof commands in those checks were consuming a lot of resources and taking a very long time to complete. Moreover, the varnishd child process (the one running as the vcache user) had a very high number of file descriptors, very close to 500k.

A quick analysis of the number of FDs on the cache text cluster across the various PoPs shows very different numbers. I guess they might also depend on the last restart time, but right now they don't show a clear pattern to me. Here is the data:

eqiad: all hosts under 2k
codfw: all hosts under 200
esams: all hosts between 2k and 6k
eqsin: all hosts under 1.6k
ulsfo: 2 hosts under 1k, 2 hosts ~15k, 1 host 47k, 1 host 500k

Raw data for ulsfo:

$ sudo cumin 'A:cp-text and A:ulsfo' "ls -1 /proc/\$(ps faxu | awk '/v[c]ache/ { print \$2 }')/fd | wc -l"
6 hosts will be targeted:
cp[4027-4032].ulsfo.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) cp4027.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
761
===== NODE GROUP =====
(1) cp4031.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
47000
===== NODE GROUP =====
(1) cp4030.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
14616
===== NODE GROUP =====
(1) cp4032.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
318
===== NODE GROUP =====
(1) cp4029.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
499817
===== NODE GROUP =====
(1) cp4028.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
15164
================

I can't even tell what the sockets in question are -- they don't show up in ss or netstat; those tools report an order of magnitude fewer entries.

✔️ cdanis@cp4029.ulsfo.wmnet ~ 🕔🍺 sudo ls -l /proc/42249/fd | wc -l
499813
✔️ cdanis@cp4029.ulsfo.wmnet ~ 🕔🍺 sudo ls -l /proc/42249/fd | grep -v socket: |wc -l
ls: cannot access '/proc/42249/fd/495526': No such file or directory
19
✔️ cdanis@cp4029.ulsfo.wmnet ~ 🕔🍺 sudo ss -a | wc -l
39859

The "cannot access" message from the second ls invocation does suggest that said sockets are still being created and destroyed. (OK, I'm sure someone will point out it only confirms that sockets are being destroyed -- but since there have been approximately 499.8k sockets this whole time, new ones must be created too.)

Krinkle renamed this task from cp4029 varnishfe freakout to cp4029 varnish-fe freakout. Jan 24 2020, 10:18 PM

The "churn rate" on sockets isn't nearly high enough to explain how many are missing from the output of other tools:

# Take two back-to-back snapshots of `ss -a` into timestamped files, then diff
# the two most recent snapshots: report how many socket lines differ and how
# much wall-clock time elapsed between them (derived from the filenames).
for i in `seq 1 10`; do
  sudo ss -a > ss.$(date -Ins) ; sudo ss -a > ss.$(date -Ins)
  FIRST=$(ls -1 ss.* | tail -n2 | head -n1)
  SECOND=$(ls -1 ss.* | tail -n1)
  echo $(comm -3 <(sort $FIRST) <(sort $SECOND) | wc -l) new sockets in $(date -d "$(echo $SECOND | cut -c4-) $(date -d $(echo $FIRST | cut -c4-) +%s.%N) seconds ago" +%s.%N) seconds
done

75 new sockets in 0.118499193 seconds
73 new sockets in 0.118641325 seconds
66 new sockets in 0.116087835 seconds
83 new sockets in 0.116218283 seconds
91 new sockets in 0.122720340 seconds
71 new sockets in 0.122068922 seconds
226 new sockets in 0.123974580 seconds
75 new sockets in 0.120227779 seconds
64 new sockets in 0.118951529 seconds
82 new sockets in 0.121228181 seconds

It's worth noting that this saga began with me invoking an lsof -n that consumed over 24 minutes of CPU time and 19.5 GB of RSS before I gave up and killed it.

Mentioned in SAL (#wikimedia-operations) [2020-01-27T13:36:40Z] <vgutierrez> restarting varnish-fe on cp4029 - T243634

Mentioned in SAL (#wikimedia-operations) [2020-01-27T13:54:39Z] <vgutierrez> restarting varnish-fe on cp4030 - T243634

Mentioned in SAL (#wikimedia-operations) [2020-01-30T05:12:00Z] <vgutierrez> restarting varnish-frontend and repooling cp4029 - T243634

Mentioned in SAL (#wikimedia-operations) [2020-01-30T06:23:15Z] <vgutierrez> restarting varnish-frontend on cp4030 before it crashes - T243634

After depooling cp4029, the issue moved to cp4030, and since the restart of varnish-fe on cp4030 the number of FDs has been increasing on cp4031 (22k right now).

Mentioned in SAL (#wikimedia-operations) [2020-01-30T17:02:56Z] <vgutierrez> restarting varnish-frontend on cp4031 before it crashes - T243634

Mentioned in SAL (#wikimedia-operations) [2020-01-30T18:07:54Z] <vgutierrez> depool cp4032 and perform a rolling restart of varnish-fe at cp4027-cp4031 - T243634

ulsfo is the only DC where we are seeing this issue, and at the same time it's the DC where we are testing the buster upgrades of the cache nodes (T242093).

To rule out that cp4032 (the buster text node) could be related to this issue, we've depooled it, and with it depooled we're performing a rolling restart of varnish-frontend on the other hosts with the following cumin command:

sudo -i cumin -b1 -s300 'cp[4027-4031].ulsfo.wmnet' 'varnish-frontend-restart'

After finishing the rolling restart, this is the current number of FDs held by varnish-frontend per node:

===== NODE GROUP =====
(1) cp4028.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
341
===== NODE GROUP =====
(1) cp4030.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
1308
===== NODE GROUP =====
(1) cp4031.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
313
===== NODE GROUP =====
(1) cp4032.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
19446
===== NODE GROUP =====
(1) cp4029.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
1400
===== NODE GROUP =====
(1) cp4027.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
1505
================

In ~30 minutes cp4029 has gone from 1400 to ~8600 FDs, so it doesn't look like cp4032 is at fault here:

(1) cp4029.ulsfo.wmnet
----- OUTPUT of 'ls -1 /proc/$(ps...2 }')/fd | wc -l' -----
8570
================

We also tested depooling just port 80 yesterday, which didn't change anything (the FD leak kept growing), so this isn't driven by external traffic to :80. cp4029 was at ~400k FDs this morning, so I depooled its frontend services and restarted ats-tls + varnish-fe to reset the clock. I also repooled cp4032 while I was in there.

I did a little bit of digging, but I still don't have a firm grasp on the root cause. I do think the leaked fds are related to ats-tls->varnish-fe connections though, as there's a very high rate of those being opened and closed in general.
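
One rough way to gauge that churn (just a sketch, not from this task; the port number and the ~60-second TIME-WAIT window are assumptions) is to count local TIME-WAIT sockets towards the port varnish-fe listens on for ats-tls traffic:

# Closed connections linger in TIME-WAIT for ~60s, so dividing the TIME-WAIT
# count by 60 roughly approximates the close rate per second.
# VARNISH_FE_PORT is an assumption; set it to varnish-fe's local listen port.
VARNISH_FE_PORT=80
TW=$(ss -Htn state time-wait "( dport = :${VARNISH_FE_PORT} )" | wc -l)
echo "~$(( TW / 60 )) connections closed per second towards varnish-fe"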

CDanis renamed this task from cp4029 varnish-fe freakout to ulsfo varinsh-fe vcache processes overflow on FDs. Jan 31 2020, 6:56 PM

Change 569271 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] ulsfo cp-text: Prometheus export # of vcache fds

https://gerrit.wikimedia.org/r/569271

Change 569271 merged by CDanis:
[operations/puppet@production] ulsfo cp-text: Prometheus export # of vcache fds

https://gerrit.wikimedia.org/r/569271

Change 569285 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] prom-file-count: include symlinks & other special files

https://gerrit.wikimedia.org/r/569285

Change 569285 merged by CDanis:
[operations/puppet@production] prom-file-count: include symlinks & other special files

https://gerrit.wikimedia.org/r/569285
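
For context, the general shape of such an exporter is a textfile-collector script that periodically writes the vcache fd count as a gauge. Below is a minimal sketch only; it is not the actual prom-file-count implementation, and the collector directory and pid lookup are assumptions:

#!/bin/bash
# Hypothetical sketch: export the vcache varnishd child's open-fd count for
# node_exporter's textfile collector. Assumes it runs as root (e.g. via a
# systemd timer) and that the textfile directory below is the right one.
TEXTFILE_DIR=/var/lib/prometheus/node.d
PID=$(ps faxu | awk '/v[c]ache/ { print $2 }' | head -n1)
COUNT=$(ls -1 "/proc/${PID}/fd" 2>/dev/null | wc -l)
TMP=$(mktemp "${TEXTFILE_DIR}/vcache_fds.prom.XXXXXX")
cat > "$TMP" <<EOF
# HELP vcache_open_fds Open file descriptors of the vcache varnishd process.
# TYPE vcache_open_fds gauge
vcache_open_fds ${COUNT}
EOF
mv "$TMP" "${TEXTFILE_DIR}/vcache_fds.prom"  # atomic rename so scrapes never see a partial file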

Here's some lsof output from a faulty-looking vcache process, showing garbage-y sockets that aren't actually associated with any TCP stream:

cache-mai 16000 vcache *870u     sock        0,8       0t0 3187279203 protocol: TCP
cache-mai 16000 vcache *871u     sock        0,8       0t0 3187361951 protocol: TCP
cache-mai 16000 vcache *872u     sock        0,8       0t0 3187321806 protocol: TCP
cache-mai 16000 vcache *873u     sock        0,8       0t0 3187357891 protocol: TCP
cache-mai 16000 vcache *874u     sock        0,8       0t0 3187340906 protocol: TCP
cache-mai 16000 vcache *875u     sock        0,8       0t0 3187347127 protocol: TCP
cache-mai 16000 vcache *876u     sock        0,8       0t0 3187333841 protocol: TCP
cache-mai 16000 vcache *877u     sock        0,8       0t0 3187150760 protocol: TCP
cache-mai 16000 vcache *878u     sock        0,8       0t0 3187361952 protocol: TCP

There are a lot of these:

✔️ cdanis@cp4030.ulsfo.wmnet ~ 🕔🍺 grep 'protocol: TCP$' lsof.2020-01-31T22:12:01,057104937+00:00.txt | wc -l
89106
✔️ cdanis@cp4030.ulsfo.wmnet ~ 🕠🍺 grep -v 'protocol: TCP$' lsof.2020-01-31T22:12:01,057104937+00:00.txt | wc -l
899

AIUI one of the ways such sockets get created is if you create a socket, but never connect() or bind(). So I guess some varnish codepath is creating them and then leaking them?
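
That's easy to reproduce in isolation (just a sketch, unrelated to varnish's actual code path; assumes python3 and a reasonably recent lsof): a TCP socket that is socket()'d but never bound or connected shows up in lsof as a bare "sock ... protocol: TCP" entry while being invisible to ss.

# Create a TCP socket without ever bind()ing or connect()ing it, keep the
# process alive for a bit, and compare what lsof and ss report.
python3 -c 'import socket, time; s = socket.socket(); time.sleep(120)' &
PID=$!
sleep 1
lsof -p "$PID" | grep 'protocol: TCP'   # bare "sock ... protocol: TCP" entry, like the ones above
ss -atnp | grep "pid=$PID"              # no output: the socket is in no TCP hash table
kill "$PID"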

Mentioned in SAL (#wikimedia-operations) [2020-02-01T17:46:18Z] <cdanis> T243634 ✔️ cdanis@cp4030.ulsfo.wmnet ~ 🕐☕ sudo varnish-frontend-restart

Change 569513 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: collect varnish fd count everywhere

https://gerrit.wikimedia.org/r/569513

Mentioned in SAL (#wikimedia-operations) [2020-02-03T14:18:19Z] <cdanis> T243634 ✔️ cdanis@cp4031.ulsfo.wmnet ~ 🕤☕ sudo varnish-frontend-restart

Change 569513 merged by Ema:
[operations/puppet@production] cache: collect varnish fd count everywhere

https://gerrit.wikimedia.org/r/569513

Mentioned in SAL (#wikimedia-operations) [2020-02-04T00:02:46Z] <volans> depool, varnish-frontend-restart, pool on cp4029 (~242k fds) - T243634

Vgutierrez renamed this task from ulsfo varinsh-fe vcache processes overflow on FDs to ulsfo varnish-fe vcache processes overflow on FDs. Feb 4 2020, 8:53 AM

Mentioned in SAL (#wikimedia-operations) [2020-02-04T10:35:15Z] <ema> cp4032: varnish-frontend-restart T243634

We now know that this is 100% traffic-induced, and the culprit seems to be FortiGate. Compare the last 24 hours of FD growth:

Screenshot from 2020-02-04 15-48-22.png (FD growth graph)

And ulsfo requests with UA: FortiGate during the same timeframe:

Screenshot from 2020-02-04 15-48-29.png (ulsfo FortiGate UA requests graph)

The sequence of hosts affected matches exactly: cp4029 till 00:02 (when @Volans restarted it), cp4032 till 10:35 (when I restarted it), and currently cp4030.

We are now trying to understand what distinguishes the FortiGate requests hitting ulsfo from similar ones routed to other DCs.
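
One way to eyeball that traffic on a cache host (a sketch; the -n frontend instance name and the reliance on X-Forwarded-For for the real client IP are assumptions) is to tail FortiGate requests straight out of the frontend Varnish shared memory log:

# Print timestamp, X-Forwarded-For, method, Host header and URL path for
# requests whose User-Agent matches FortiGate on the frontend instance.
sudo varnishncsa -n frontend \
  -q 'ReqHeader:User-Agent ~ "FortiGate"' \
  -F '%t %{X-Forwarded-For}i %m %{Host}i %U'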

Mentioned in SAL (#wikimedia-operations) [2020-02-05T02:38:42Z] <cdanis> T243634 ✔️ cdanis@cp4030.ulsfo.wmnet ~ 🕤🍺 sudo varnish-frontend-restart

Mentioned in SAL (#wikimedia-operations) [2020-02-05T09:05:01Z] <ema> add individual FortiGate IPs hitting ulsfo (currently cp4028) to vcl blocked_nets -- trying to identify problematic traffic T243634

Change 570247 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: apply blocked_nets acl before request normalization

https://gerrit.wikimedia.org/r/570247

Change 570247 merged by Ema:
[operations/puppet@production] vcl: block requests before Host normalization and switching vcl

https://gerrit.wikimedia.org/r/570247

Change 570257 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: temporarily skip Host header normalization for FortiGate

https://gerrit.wikimedia.org/r/570257

Change 570257 merged by Ema:
[operations/puppet@production] vcl: temporarily skip Host header normalization for FortiGate

https://gerrit.wikimedia.org/r/570257

Change 570278 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "vcl: temporarily skip Host header normalization for FortiGate"

https://gerrit.wikimedia.org/r/570278

Change 570278 merged by Ema:
[operations/puppet@production] Revert "vcl: temporarily skip Host header normalization for FortiGate"

https://gerrit.wikimedia.org/r/570278

I blocked a number of IPs manually on cr3 and cr4 for ulsfo. Command was

set policy-options prefix-list blackhole4 <prefix>

for 5 IPs. The prefix list was empty before that. The commit is commented with this task.

@ayounsi ^. Let me know if it's preferable I do this some other way next time (like homer?)

Mentioned in SAL (#wikimedia-operations) [2020-02-05T10:43:23Z] <ema> cp4028: varnish-frontend-restart T243634

Today we've been tackling the "FortiGate" angle (correlation described in T243634#5848297). The host in trouble this morning was cp4028, with 140k FDs at 10:30. In total, 5 different "fortigate" IPs were hashed to the host.

  1. Suspecting an interaction between the VCL-generated 400 synth responses and the FortiGate traffic, we tried adding all 5 IPs to the blocked_nets acl so that they get 403 instead of 400. The FD pileup on cp4028 continued.
  2. We tried skipping Host header normalization for FortiGate requests altogether, also to no avail.
  3. We blocked all 5 IPs on the routers. The change was applied by @akosiaris (thanks!) at 10:35, and still the FD pileup continued.

At 10:43 I restarted varnish-fe on cp4028. The issue has now moved to cp4031, proving, I think conclusively, that the FortiGate correlation was just that: a correlation. We need another theory.

I just reverted the cr3/cr4 ulsfo change.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!