cp1083: ats-tls and varnish-fe crashed due to insufficient memory
Closed, Invalid · Public

Assigned To

None

Authored By

	• ema
	Dec 30 2019, 3:58 PM

Description

On cp1083, ats-tls crashed at 2019-12-29 20:41 with the following message:

Dec 29 20:41:36 cp1083 traffic_server[200306]: Fatal: couldn't allocate 524288 bytes at alignment 4096 - insufficient memory
Dec 29 20:41:50 cp1083 traffic_manager[200292]: [Dec 29 20:41:50.248] {0x2abc5a3fda80} ERROR: wrote crash log to /srv/trafficserver/tls/var/log/crash-2019-12-29-204135.log
Dec 29 20:41:50 cp1083 traffic_manager[200292]: traffic_server: received signal 6 (Aborted)

Some hours later, varnish-fe also crashed for the same reason:

Dec 30 07:52:58 cp1083 varnishd[2809]: Child (2842) died signal=6
Dec 30 07:52:58 cp1083 varnishd[2809]: Child (2842) Panic at: Mon, 30 Dec 2019 07:52:58 GMT
                                       Assert error in mpl_alloc(), cache/cache_mempool.c line 80:
                                         Condition((mi) != 0) not true.
                                       version = varnish-5.1.3 revision NOGIT, vrt api = 6.0
                                       ident = Linux,4.9.0-11-amd64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
                                       now = 2323463.229078 (mono), 1577692337.881260 (real)
                                       Backtrace:
                                         0x559fbf3ce5c4: /usr/sbin/varnishd(+0x4b5c4) [0x559fbf3ce5c4]
                                         0x559fbf3ca6aa: /usr/sbin/varnishd(+0x476aa) [0x559fbf3ca6aa]
                                         0x559fbf3cb241: /usr/sbin/varnishd(MPL_Get+0x181) [0x559fbf3cb241]
                                         0x559fbf3d0525: /usr/sbin/varnishd(Req_New+0x75) [0x559fbf3d0525]
                                         0x559fbf3a7cdd: /usr/sbin/varnishd(+0x24cdd) [0x559fbf3a7cdd]
                                         0x559fbf3ea692: /usr/sbin/varnishd(+0x67692) [0x559fbf3ea692]
                                         0x559fbf3eadab: /usr/sbin/varnishd(+0x67dab) [0x559fbf3eadab]
                                         0x7fa924b964a4: /lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4) [0x7fa924b964a4]
                                         0x7fa9248d8d0f: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fa9248d8d0f]
                                       errno = 12 (Cannot allocate memory)
                                       thread = (cache-worker)
                                       thr.req = (nil) {
                                       },
                                       thr.busyobj = (nil) {
                                       },

This is surprising: the memory utilization graph in Grafana indicates that roughly 80G were available when ats-tls crashed, and slightly more than that when varnish-fe did.

Screenshot from 2019-12-30 16-57-01.png (1×2 px, 191 KB)
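On Linux, an allocation can fail with ENOMEM even while plenty of memory is free, for example when the process hits the per-process mapping limit (vm.max_map_count) or, under strict overcommit, the commit limit. Nothing in this task confirms that either limit was involved here. The following is only a minimal sketch of how those two could be checked next to MemAvailable. The function names are hypothetical, and the inputs are the text of /proc/meminfo and /proc/<pid>/maps plus the value of /proc/sys/vm/max_map_count.

```python
import re

# Matches lines such as "MemAvailable:   80000000 kB" or "HugePages_Total:   0".
MEMINFO_LINE = re.compile(r"^(\w+(?:\(\w+\))?):\s+(\d+)(?:\s+kB)?\s*$")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        match = MEMINFO_LINE.match(line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def count_mappings(maps_text):
    """Count mappings in /proc/<pid>/maps-style text (one mapping per line)."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def allocation_limits_report(meminfo_text, maps_text, max_map_count):
    """Summarize limits other than free memory that can cause ENOMEM.

    Hypothetical helper for illustration only. commit_headroom_kb matters
    only when vm.overcommit_memory is set to 2 (strict accounting).
    """
    meminfo = parse_meminfo(meminfo_text)
    used = count_mappings(maps_text)
    return {
        "mem_available_kb": meminfo.get("MemAvailable"),
        "commit_headroom_kb": meminfo["CommitLimit"] - meminfo["Committed_AS"],
        "mappings_used": used,
        "mappings_left": max_map_count - used,
    }
```

On a live host, the three inputs would be read from /proc/meminfo, /proc/<pid>/maps for the varnishd child or traffic_server process, and /proc/sys/vm/max_map_count. A small mappings_left alongside a large mem_available_kb would be one way to get an allocation failure that the memory graphs do not show.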

Details

	Subject	Repo	Branch	Lines +/-
	Revert "ATS: assign 8G instead of 2G to RAM caches on ats-be"	operations/puppet	production	+1 -1


Related Objects

Mentioned In: T185968: varnish 5.1.3 frontend child restarted
T242417: varnish-fe crashes due to "Error in munmap(): Cannot allocate memory"
Mentioned Here: T185968: varnish 5.1.3 frontend child restarted
T241573: Investigate all-site outage on 2019-12-30 (HTTP 504 error)

Event Timeline

• ema created this task. Dec 30 2019, 3:58 PM

Restricted Application added a project: SRE. Dec 30 2019, 3:58 PM

Restricted Application added a subscriber: Aklapper.

• ema triaged this task as High priority. Dec 30 2019, 3:58 PM

Mentioned in SAL (#wikimedia-operations) [2019-12-30T16:00:28Z] <ema> cp1083: restart ats-tls and varnish-fe after crashes - T241593

• ema moved this task from Backlog to Caching on the Traffic board. Dec 31 2019, 8:28 AM

@ema could this be related to NUMA utilization? A quick look at numastat (both -n and -m) shows a general imbalance between the two nodes (which I think is mostly on purpose, due to our custom config), and the varnish process seems to be the one mostly responsible for it. There was no spike in the graph, though.
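To put numbers on such an imbalance, a minimal sketch along these lines could compare free memory per NUMA node, since a host-wide "free" figure can hide one node that is nearly full. It assumes the per-node files at /sys/devices/system/node/node<N>/meminfo, whose lines look like "Node 0 MemFree: ... kB". The function names are hypothetical, and whether a full node can actually make allocations fail depends on the memory policy in use, which is not established in this task.

```python
import re

# Matches lines such as "Node 0 MemFree:        12345678 kB".
NODE_LINE = re.compile(r"^Node\s+(\d+)\s+(\w+(?:\(\w+\))?):\s+(\d+)")


def parse_node_meminfo(text):
    """Parse concatenated per-node meminfo text into {node: {key: value}}."""
    nodes = {}
    for line in text.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            node = int(match.group(1))
            nodes.setdefault(node, {})[match.group(2)] = int(match.group(3))
    return nodes


def free_fraction_by_node(nodes):
    """Return the fraction of memory that is free on each node."""
    return {
        node: values["MemFree"] / values["MemTotal"]
        for node, values in nodes.items()
    }
```

Feeding it the concatenated contents of every node's meminfo file gives one free fraction per node, which can be compared with the host-wide numbers reported by node-exporter.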

I checked all "memory free" metrics as reported by node-exporter for the varnish case and indeed the numbers match, i.e. the kernel was reporting multiple GBs of memory free at the time of the crashes:

2020-01-02-115118_1141x863_scrot.png (863×1 px, 93 KB)

Ditto for ats:

2020-01-02-115524_1143x890_scrot.png (890×1 px, 91 KB)

Change 562849 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: assign 8G instead of 2G to RAM caches on ats-be"

https://gerrit.wikimedia.org/r/562849

gerritbot added a project: Patch-For-Review. Jan 8 2020, 3:26 PM

Change 562849 merged by Ema:
[operations/puppet@production] Revert "ATS: assign 8G instead of 2G to RAM caches on ats-be"

https://gerrit.wikimedia.org/r/562849

Maintenance_bot removed a project: Patch-For-Review. Jan 8 2020, 4:10 PM

• ema mentioned this in T242417: varnish-fe crashes due to "Error in munmap(): Cannot allocate memory". Jan 10 2020, 11:31 AM

Looks like a follow-up from T241573.

Mentioned in SAL (#wikimedia-operations) [2020-03-16T08:02:48Z] <ema> cp4025 restart trafficserver-tls to clear 'tls process restarted' alert T241593 T185968

Stashbot mentioned this in T185968: varnish 5.1.3 frontend child restarted. Mar 16 2020, 8:02 AM

Untagging observability for now since there doesn't seem to be any action

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM

BBlack moved this task from Caching to Bug Reports on the Traffic board. Oct 2 2020, 1:59 PM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

This ticket is in a weird place: ats-tls is no longer in use, but the fact that *both* ats-tls and varnish-fe had these same failures points to a more generalized issue. However, this hasn't had any progress for 3 years... and I don't see any incident report for 2019-12-30.

I'm going to close this with the expectation that, if this is still an issue, it will be re-opened.

	F31498056: 2020-01-02-115524_1143x890_scrot.png
	Jan 2 2020, 10:58 AM

	F31498054: 2020-01-02-115118_1141x863_scrot.png
	Jan 2 2020, 10:58 AM

	F31494820: Screenshot from 2019-12-30 16-57-01.png
	Dec 30 2019, 3:58 PM
