
varnish-fe crashes due to "Error in munmap(): Cannot allocate memory"
Closed, Resolved · Public

Description

We have seen several varnish-fe crashes with the following errors:

Jan 08 15:05:40 cp3050 varnishd[3571]: Child (3611) said <jemalloc>: Error in munmap(): Cannot allocate memory
Jan 08 15:06:22 cp3050 varnishd[3571]: Failed to kill child with PID 3611: Operation not permitted
Jan 08 15:06:22 cp3050 varnishd[3571]: Unexpected reply from ping: 400 CLI communication error (hdr)
Jan 08 15:06:22 cp3050 varnishd[3571]: Failed to kill child with PID 3611: Operation not permitted
Jan 08 15:06:22 cp3050 varnishd[3571]: Unexpected reply from ping: 400 CLI communication error
Jan 08 15:06:24 cp3050 varnishd[3571]: Child (3611) died signal=6
Jan 08 15:06:24 cp3050 varnishd[3571]: Child (3611) Panic at: Wed, 08 Jan 2020 15:06:24 GMT
                                       Assert error in VGZ_NewGzip(), cache/cache_gzip.c line 142:
                                         Condition(Z_OK == i) not true.
[...]

Note that this is not the same error message as in T241593. T185968 is also different: the OOM killer is not involved in the crashes observed over the past few days.

Perhaps interestingly, the following text was removed in 2012 from the Varnish troubleshooting guide:

Specifically if you see the "Error in munmap" error on Linux you might
want to increase the amount of maps available. Linux is limited to a
maximum of 64k maps. Setting vm.max_max_count i sysctl.conf will
enable you to increase this limit. You can inspect the number of maps
your program is consuming by counting the lines in /proc/$PID/maps.

The quoted text misspells the setting; the actual sysctl name is vm.max_map_count.
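
The check described in the quoted text can be sketched in Python as below. This is only an illustration, not what was deployed: the function names count_map_lines and map_usage are hypothetical, and it assumes a Linux host where /proc/<pid>/maps and /proc/sys/vm/max_map_count are readable. As background, munmap(2) documents ENOMEM for the case where unmapping part of an existing mapping would split it in two and push the process over the maximum number of mappings, which is why an unmap can fail with "Cannot allocate memory".

```python
import os
from pathlib import Path
from typing import Tuple


def count_map_lines(maps_text: str) -> int:
    """Count mappings: each non-empty line of /proc/<pid>/maps is one map area."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def map_usage(pid: int, proc_root: Path = Path("/proc")) -> Tuple[int, int]:
    """Return (maps in use by pid, system-wide vm.max_map_count limit).

    proc_root is a parameter only so the function can be pointed at a
    fake directory tree; on a real host it stays at /proc.
    """
    used = count_map_lines((proc_root / str(pid) / "maps").read_text())
    limit = int((proc_root / "sys" / "vm" / "max_map_count").read_text())
    return used, limit


if __name__ == "__main__":
    try:
        # Inspect the current process; pass the varnishd child PID instead.
        used, limit = map_usage(os.getpid())
        print(f"{used} of {limit} map areas in use")
    except OSError as exc:
        print(f"could not read /proc: {exc}")
```

Reading another user's /proc/<pid>/maps usually requires the same user or root, so a collector built this way would need matching privileges.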

Event Timeline

ema created this task. Jan 10 2020, 11:31 AM
Restricted Application added a project: Operations. Jan 10 2020, 11:31 AM
Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority. Jan 10 2020, 11:31 AM
ema updated the task description. Jan 10 2020, 11:33 AM
ema moved this task from Triage to Caching on the Traffic board. Jan 10 2020, 2:41 PM

Change 564616 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: raise vm.max_map_count sysctl

https://gerrit.wikimedia.org/r/564616

Change 564616 merged by Ema:
[operations/puppet@production] cache: raise vm.max_map_count sysctl

https://gerrit.wikimedia.org/r/564616

Change 564696 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] prometheus: collect varnishd_mmap_count for varnish-frontend

https://gerrit.wikimedia.org/r/564696

Change 566463 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: icinga check for high varnishd_mmap_count

https://gerrit.wikimedia.org/r/566463

Change 566463 merged by Ema:
[operations/puppet@production] cache: icinga check for high varnishd_mmap_count

https://gerrit.wikimedia.org/r/566463

ema closed this task as Resolved. Jan 22 2020, 12:00 PM
ema claimed this task.

Raised vm.max_map_count and added an Icinga check that alerts when the number of memory map areas used by varnish approaches that limit. Closing.
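
For illustration, the threshold logic of such a check might look like the sketch below. It is not the check merged in change 566463: the function name check_mmap_count and the 80% / 90% ratios are assumptions made for this example. The return codes follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_mmap_count(
    used: int,
    limit: int,
    warn_ratio: float = 0.8,   # assumed threshold, not taken from the task
    crit_ratio: float = 0.9,   # assumed threshold, not taken from the task
) -> Tuple[int, str]:
    """Compare the number of map areas in use against vm.max_map_count."""
    if limit <= 0 or used < 0:
        return UNKNOWN, f"UNKNOWN: invalid input used={used} limit={limit}"
    ratio = used / limit
    detail = f"{used}/{limit} map areas in use ({ratio:.0%})"
    if ratio >= crit_ratio:
        return CRITICAL, f"CRITICAL: {detail}"
    if ratio >= warn_ratio:
        return WARNING, f"WARNING: {detail}"
    return OK, f"OK: {detail}"


if __name__ == "__main__":
    # Example values only; 65530 is the common Linux default for the limit.
    status, message = check_mmap_count(used=60000, limit=65530)
    print(message)
    raise SystemExit(status)
```

Expressing the thresholds as ratios of the limit, rather than absolute counts, means the check does not need updating if vm.max_map_count is raised again.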