Page MenuHomePhabricator

cp1083: ats-tls and varnish-fe crashed due to insufficient memory
Closed, InvalidPublic


On cp1083, ats-tls crashed at at 2019-12-29 20:41 with the following message:

Dec 29 20:41:36 cp1083 traffic_server[200306]: Fatal: couldn't allocate 524288 bytes at alignment 4096 - insufficient memory
Dec 29 20:41:50 cp1083 traffic_manager[200292]: [Dec 29 20:41:50.248] {0x2abc5a3fda80} ERROR: wrote crash log to /srv/trafficserver/tls/var/log/crash-2019-12-29-204135.log
Dec 29 20:41:50 cp1083 traffic_manager[200292]: traffic_server: received signal 6 (Aborted)

Some hours later, varnish-fe also crashed for the same reason:

Dec 30 07:52:58 cp1083 varnishd[2809]: Child (2842) died signal=6
Dec 30 07:52:58 cp1083 varnishd[2809]: Child (2842) Panic at: Mon, 30 Dec 2019 07:52:58 GMT
                                       Assert error in mpl_alloc(), cache/cache_mempool.c line 80:
                                         Condition((mi) != 0) not true.
                                       version = varnish-5.1.3 revision NOGIT, vrt api = 6.0
                                       ident = Linux,4.9.0-11-amd64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
                                       now = 2323463.229078 (mono), 1577692337.881260 (real)
                                         0x559fbf3ce5c4: /usr/sbin/varnishd(+0x4b5c4) [0x559fbf3ce5c4]
                                         0x559fbf3ca6aa: /usr/sbin/varnishd(+0x476aa) [0x559fbf3ca6aa]
                                         0x559fbf3cb241: /usr/sbin/varnishd(MPL_Get+0x181) [0x559fbf3cb241]
                                         0x559fbf3d0525: /usr/sbin/varnishd(Req_New+0x75) [0x559fbf3d0525]
                                         0x559fbf3a7cdd: /usr/sbin/varnishd(+0x24cdd) [0x559fbf3a7cdd]
                                         0x559fbf3ea692: /usr/sbin/varnishd(+0x67692) [0x559fbf3ea692]
                                         0x559fbf3eadab: /usr/sbin/varnishd(+0x67dab) [0x559fbf3eadab]
                                         0x7fa924b964a4: /lib/x86_64-linux-gnu/ [0x7fa924b964a4]
                                         0x7fa9248d8d0f: /lib/x86_64-linux-gnu/ [0x7fa9248d8d0f]
                                       errno = 12 (Cannot allocate memory)
                                       thread = (cache-worker)
                                       thr.req = (nil) {
                                       thr.busyobj = (nil) {

This is surprising given that the memory utilization graph on grafana seems to indicate that there were some 80G available when ats-tls crashed, and a bit more than that when varnish-fe did.

Screenshot from 2019-12-30 16-57-01.png (1×2 px, 191 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
ema triaged this task as High priority.Dec 30 2019, 3:58 PM

Mentioned in SAL (#wikimedia-operations) [2019-12-30T16:00:28Z] <ema> cp1083: restart ats-tls and varnish-fe after crashes - T241593

@ema maybe could be related to NUMA utilization? Having a quick look at numastat (both -n and -m) there is a general imbalance between the two nodes (that I think is mostly on purpose due to our custom config), and the varnish process seems the one mostly responsible for it. But there was no spike in the graph either.

I checked all "memory free" metrics as reported by node-exporter for the varnish case and indeed the numbers match, i.e. the kernel was reporting multiple GBs of memory free at the time of the crashes:

2020-01-02-115118_1141x863_scrot.png (863×1 px, 93 KB)

Ditto for ats:

2020-01-02-115524_1143x890_scrot.png (890×1 px, 91 KB)

Change 562849 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: assign 8G instead of 2G to RAM caches on ats-be"

Change 562849 merged by Ema:
[operations/puppet@production] Revert "ATS: assign 8G instead of 2G to RAM caches on ats-be"

Mentioned in SAL (#wikimedia-operations) [2020-03-16T08:02:48Z] <ema> cp4025 restart trafficserver-tls to clear 'tls process restarted' alert T241593 T185968

Untagging observability for now since there doesn't seem to be any action

BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!

BCornwall subscribed.

This ticket is in a weird place: ats-tls is no longer in use, but the fact that *both* ats-tls and varnish-fe had these same failures points to a more generalized issue. However, this hasn't had any progress for 3 years... and I don't see any incident report for 2019-12-30.

I'm going to close this with the expectation that, if this is still an issue, it will be re-opened.