
varnish 5.1.3 frontend child restarted
Closed, Resolved · Public

Description

Our "Varnish frontend child restarted" Icinga check detected two unexpected restarts this morning: one on cp3036 and the other on cp3049. The check looks for the number of times a varnish child process has been spawned since manager process startup, expecting such number to be exactly 1, given that we never administratively restart the child without restarting the whole service. Looking at the logs, the events of today might actually be due to two different issues, although the failures happened only a few hours apart after ~11 days of uneventful uptime since the esams upload upgrade to Varnish 5.

The varnish-fe child on cp3049 just died without any prior logs:

Jan 29 09:57:41 cp3049 varnishd[41745]: Child (41757) died signal=9
Jan 29 09:57:41 cp3049 varnishd[41745]: Child cleanup complete
Jan 29 09:57:41 cp3049 varnishd[41745]: Child (26540) Started
Jan 29 09:57:41 cp3049 varnishd[41745]: Child (26540) said Child starts

On cp3036, by contrast, the varnish-fe manager process failed to communicate with its child and attempted to kill it:

Jan 29 11:32:14 cp3036 varnishd[34561]: Failed to kill child with PID 34570: Operation not permitted
Jan 29 11:32:14 cp3036 varnishd[34561]: Unexpected reply from ping: 400 CLI communication error (hdr)
Jan 29 11:32:14 cp3036 varnishd[34561]: Failed to kill child with PID 34570: Operation not permitted
Jan 29 11:32:14 cp3036 varnishd[34561]: Unexpected reply from ping: 400 CLI communication error
Jan 29 11:32:14 cp3036 varnishd[34561]: Child (34570) died signal=9
Jan 29 11:32:14 cp3036 varnishd[34561]: Child cleanup complete

Note how the management process failed to kill its child in cp3036's case, possibly because CAP_KILL is not in the capability bounding set defined in the varnish-frontend systemd unit.
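For what it's worth, the bounding set of a running process can be checked from /proc; a small sketch, assuming the manager PID is passed on the command line (CAP_KILL is capability number 5):

#!/usr/bin/env python3
# Check whether a process's capability bounding set includes CAP_KILL.
# CapBnd in /proc/<pid>/status is a hexadecimal bitmask; CAP_KILL is bit 5.
import sys

CAP_KILL = 5

def has_cap_kill(pid):
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("CapBnd:"):
                return bool(int(line.split()[1], 16) & (1 << CAP_KILL))
    raise RuntimeError("CapBnd not found for pid %d" % pid)

if __name__ == "__main__":
    pid = int(sys.argv[1])  # e.g. the varnishd manager PID
    print("CAP_KILL in bounding set: %s" % has_cap_kill(pid))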

There is a known bug, introduced by the new VSM/VSC code in Varnish 5, that looks just like the problem described here, at least in cp3036's case. The bug report mentions that both 5.1.3 and 5.2.1 are affected, although the issue shows up significantly less often with 5.1.3 (the version we're running). Varnish bug https://github.com/varnishcache/varnish-cache/issues/2518 tracks the same problem and has been closed with a patch for 5.2.1 which does not apply to 5.1.3. That patch alleviates the issue but does not fully fix it.

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as Medium priority. · Jan 29 2018, 11:56 PM
ema updated the task description.

In both cases the child was killed with signal 9 by the kernel oom-killer. It may be that our memory cache sizing is very tight in general, and that overheads have increased with v5, requiring us to reduce the cache size a bit.

ema claimed this task.

I haven't seen this happen again in the past 3 months. Closing for now, but feel free to reopen should the issue come back.

ema reopened this task as Open. · Edited · Mar 16 2020, 7:51 AM

The issue occurred again on cp4025. Reopening.

[Sat Mar 14 15:51:23 2020] Out of memory: Kill process 20592 (cache-main) score 965 or sacrifice child
[Sat Mar 14 15:51:23 2020] Killed process 20592 (cache-main) total-vm:1082065904kB, anon-rss:380490324kB, file-rss:0kB, shmem-rss:82536kB
[Sat Mar 14 15:51:59 2020] oom_reaper: reaped process 20592 (cache-main), now anon-rss:0kB, file-rss:0kB, shmem-rss:82004kB
Mar 14 15:51:49 cp4025 varnishd[20511]: Child (20592) not responding to CLI, killed it.
Mar 14 15:51:49 cp4025 varnishd[20511]: Unexpected reply from ping: 400 CLI communication error (hdr)
Mar 14 15:51:49 cp4025 varnishd[20511]: Child (20592) not responding to CLI, killed it.
Mar 14 15:51:49 cp4025 varnishd[20511]: Unexpected reply from ping: 400 CLI communication error
Mar 14 15:51:49 cp4025 varnishd[20511]: Child (20592) died signal=9
Mar 14 15:51:49 cp4025 varnishd[20511]: Child cleanup complete
Mar 14 15:51:49 cp4025 varnishd[20511]: Child (12216) Started
Mar 14 15:51:49 cp4025 varnishd[20511]: Child (12216) said Child starts

Mentioned in SAL (#wikimedia-operations) [2020-03-16T07:52:42Z] <ema> cp4025: restart varnish-fe to clear 'child restarted' alert T185968

The OOM killer intervened because "Normal" (non-DMA) free memory on NUMA node 0 went below min (1380412 kB < 1387544 kB):

[Sat Mar 14 15:51:23 2020] Node 0 Normal free:1380412kB min:1387544kB low:1734428kB high:2081312kB active_anon:258543724kB inactive_anon:195548kB active_file:20kB inactive_file:16kB unevictable:0kB writepending:0kB present:266338304kB managed:262119284kB mlocked:0kB kernel_stack:17644kB pagetables:1122824kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

I've now added a graph tracking free memory by NUMA node, and indeed on cp4025 node 0 didn't have much memory available at the time of the crash, while node 1 had more than 20G free:

Screenshot: free memory by NUMA node on cp4025 (2020-03-16 10-41-21.png)
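The same data can also be read directly from sysfs on the host; a small sketch (values are in kB, as reported by the kernel):

#!/usr/bin/env python3
# Print a meminfo field (e.g. MemFree) per NUMA node, read from sysfs.
# Lines in the per-node meminfo look like: "Node 0 MemFree:  1380412 kB".
import glob

def node_meminfo(field="MemFree"):
    result = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*/meminfo")):
        node = path.split("/")[-2]
        with open(path) as f:
            for line in f:
                parts = line.split()
                if parts[2] == field + ":":
                    result[node] = int(parts[3])  # kB
    return result

if __name__ == "__main__":
    for node, kb in sorted(node_meminfo("MemFree").items()):
        print("%s MemFree: %.1f GiB" % (node, kb / 1024.0 / 1024.0))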

Considering that the total amount of memory on node 0 is about 260G, perhaps the 209G currently assigned to varnish-fe for caching on cp4025 is a little too much.

root@cp4025:~# grep MemTotal /sys/devices/system/node/node[0-9]/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 MemTotal:       264054892 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemTotal:       132115076 kB

It is also worth mentioning that, in the specific case of cp4025, the trouble was triggered by a sudden 30G transient storage allocation.

Screenshot: 2020-03-16 11-07-16.png
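As rough back-of-the-envelope arithmetic (the 209G cache size and the 30G transient spike are the figures from this specific incident, not general values):

#!/usr/bin/env python3
# Rough headroom estimate for NUMA node 0 on cp4025, using the figures above.
node0_total_gib = 264054892 / 1024.0 / 1024.0  # ~251.8 GiB, from node0 meminfo
fe_cache_gib = 209.0                           # varnish-fe malloc storage size
transient_spike_gib = 30.0                     # transient allocation seen here

print("cache alone: %.0f%% of node 0" % (100 * fe_cache_gib / node0_total_gib))
print("cache + transient spike: %.0f%% of node 0"
      % (100 * (fe_cache_gib + transient_spike_gib) / node0_total_gib))
# Roughly 83% and 95%: if most of these allocations land on node 0, very little
# headroom remains before its min watermark is crossed and the oom-killer fires.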

Change 579906 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: decrease varnish-frontend malloc cache size

https://gerrit.wikimedia.org/r/579906

Change 579907 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: limit upload transient storage usage

https://gerrit.wikimedia.org/r/579907

Change 579906 merged by Ema:
[operations/puppet@production] cache: decrease varnish-frontend malloc cache size

https://gerrit.wikimedia.org/r/579906

Change 579907 merged by Ema:
[operations/puppet@production] cache: limit upload transient storage usage

https://gerrit.wikimedia.org/r/579907

Mentioned in SAL (#wikimedia-operations) [2020-03-17T09:21:46Z] <ema> cp: rolling varnish-frontend-restart to decrease memory usage and apply transient storage limits T185968

Change 587551 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: double upload transient storage limit

https://gerrit.wikimedia.org/r/587551

Change 587551 merged by Ema:
[operations/puppet@production] cache: double upload transient storage limit

https://gerrit.wikimedia.org/r/587551

Mentioned in SAL (#wikimedia-operations) [2020-04-08T16:16:11Z] <ema> cache_upload: rolling varnish-fe restarts to bump transient storage limit T185968

@ema: All related patches in Gerrit have been merged. Can this task be resolved (via Add Action...Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee. Thanks in advance!

We haven't seen this happen again since setting the transient storage limits. Closing.