
ats-be occasional system CPU usage increase
Closed, Resolved (Public)

Description

Starting today, 2020-10-15, we have noticed an increase in system CPU usage by the systemd unit trafficserver.service. The first time this happened was between 09:19 and 11:30:

[Screenshot from 2020-10-15 17-06-24.png]

The problem seems to affect cache_text nodes in multiple DCs; we have observed it in eqsin, ulsfo, and esams. While the issue is ongoing, frontend traffic is not negatively affected.

While this is happening, purged does not manage to send PURGEs quickly enough and starts queuing locally to the point of triggering the following alert, which resolves itself after a couple of minutes:

14:59 <+icinga-wm> PROBLEM - Number of messages locally queued by purged for processing on cp3060 is CRITICAL: cluster=cache_text instance=cp3060 job=purged layer=backend site=esams 
                   https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060

Event Timeline

JMeybohm triaged this task as Medium priority. Oct 16 2020, 10:28 AM

This happened again last night at 2020-11-27T00:08; we had alerts on cp1089, cp1077, cp1087, cp1083 and cp1075 in eqiad, cp2029 (codfw), cp3062 and cp3064 (esams), and cp5009 (eqsin):

00:08 <+icinga-wm> PROBLEM - Number of messages locally queued by purged for processing on cp1089 is CRITICAL: cluster=cache_text instance=cp1089 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1089
00:09 <+icinga-wm> PROBLEM - Number of messages locally queued by purged for processing on cp1077 is CRITICAL: cluster=cache_text instance=cp1077 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077
00:09 <+icinga-wm> PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087
00:09 <+icinga-wm> PROBLEM - Number of messages locally queued by purged for processing on cp5009 is CRITICAL: cluster=cache_text instance=cp5009 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009
00:10 <+icinga-wm> PROBLEM - Number of messages locally queued by purged for processing on cp2029 is CRITICAL: cluster=cache_text instance=cp2029 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2029
00:10 <+icinga-wm> PROBLEM - Number of messages locally queued by purged for processing on cp1083 is CRITICAL: cluster=cache_text instance=cp1083 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1083
00:10 <+icinga-wm> PROBLEM - Number of messages locally queued by purged for processing on cp3062 is CRITICAL: cluster=cache_text instance=cp3062 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3062
00:10 <+icinga-wm> PROBLEM - Number of messages locally queued by purged for processing on cp1075 is CRITICAL: cluster=cache_text instance=cp1075 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1075
00:10 <+icinga-wm> PROBLEM - Number of messages locally queued by purged for processing on cp3064 is CRITICAL: cluster=cache_text instance=cp3064 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3064

The issue lasted a couple of minutes, with recoveries starting to come in at 00:11.

On all text nodes we had a CPU spike for trafficserver.service, with suspiciously similar shapes for the duration of the issue, correlating directly with the purged workers' backlog.

[Screenshot from 2020-11-27 11-00-42.png]

[Screenshot from 2020-11-27 11-01-51.png]

The CPU spike is clearly visible on the seemingly unaffected nodes too, take for instance cp4027; it's just that purged's backlog did not stay high long enough for the alerts to fire.

All that CPU time is spent in the kernel, specifically calling mmap a lot. I have seen ksys_mmap_pgoff featured prominently in perf reports of affected nodes, and tracing for 10 seconds how many times the function is called by various PIDs shows this:

ema@cp3050.esams.wmnet:~$ sudo bpftrace -e 'kprobe:ksys_mmap_pgoff { @[pid]=count(); } interval:s:10 { exit(); }'
[...]
@[31316]: 759438
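
For reference, a small variant of the one-liner above that also records the process name makes it easier to confirm which PID belongs to which service (just a sketch; output omitted, and counts will obviously differ per host and run):

$ sudo bpftrace -e 'kprobe:ksys_mmap_pgoff { @[comm, pid] = count(); } interval:s:10 { exit(); }'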

PID 31316 is ats-be. So on cp3050 the backend instance is calling mmap ~75K times per second, which seems excessive. Restarting the process does not change this behavior. Stopping purged does, which indicates that the issue is related to the amount of incoming requests. Another hint in that direction is the fact that the system CPU usage on different affected nodes has very similar patterns.

[Screenshot from 2021-01-04 18-06-32.png]

I could not yet find out *why* mmap is being called, but it is called with len 65536:

$ sudo bpftrace -e 'kprobe:do_mmap /pid == 31316/ { @[arg2]=count(); } interval:s:1 { exit(); }'
Attaching 2 probes...
@[65536]: 64988

More digging needed.
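
A possible way to dig further is aggregating user-space stack traces at mmap time, to see where the calls come from (a sketch only; getting readable symbols out of ustack depends on frame pointers and/or debug symbols being available for traffic_server and the libraries involved):

$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_mmap /pid == 31316/ { @[ustack] = count(); } interval:s:10 { exit(); }'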

The plot thickens. I now have more questions than answers, but here's the story so far.

ats-be on cp3052 essentially never calls mmap (once in 10 seconds):

11:50:59 ema@cp3052.esams.wmnet:~
$ sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:mmap /pid == 17554/ { @c = count() } interval:s:10 { exit(); }'
Attaching 2 probes...
@c: 1

Compare with cp3058 where, as mentioned yesterday, mmap is called thousands of times per second:

11:50:53 ema@cp3058.esams.wmnet:~
$ sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:mmap /pid == 34216/ { @c = count() } interval:s:10 { exit(); }'
Attaching 2 probes...
@c: 189000

And indeed the difference in system CPU usage is very significant (~300 ms vs 2-5 seconds):

[Screenshot from 2021-01-05 12-48-51.png]

This was not always the case: up to 2020-10-29T15:16, ats-be on cp3052 showed the same system CPU usage pattern as all other nodes. Strange as it sounds, the fact that cp3052 is where we're running Varnish 5 as an experiment (T264398) is, I believe, only a coincidence, given that the experiment started on October 5th and at that time ats-be's system CPU usage was in line with all other nodes.

[Screenshot from 2021-01-05 12-55-53.png]

SAL for that day has:

15:06 vgutierrez: rolling restart of ATS to upgrade to trafficserver 8.0.8-1wm3 - T265911

So perhaps the issue could be due to traffic pattern changes introduced by pool/depool churn, given that, as discovered yesterday, restarting ats-be does not change anything per se.

Leaving the cp3052 mystery aside for the time being, I've noticed that there's a difference between text and upload too. On upload nodes, mmap is called, albeit less often than on text (~3.5K times per second vs ~15K at the time of this writing). Also, upload nodes spend significantly less time in the kernel. I wrote the following bpftrace script and tried it on cp3058 (text) and cp3051 (upload) to measure the time spent inside mmap (pid changed as needed):

tracepoint:syscalls:sys_enter_mmap
/pid == 40982/
{
        @totcount = count();
        @start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_mmap
/@start[tid]/
{
        @nsecs = sum(nsecs - @start[tid]);
        delete(@start[tid]);
}

interval:s:1 {
        clear(@start);
        exit();
}

The output gives 2822 calls on cp3051 (upload) for a total of 7 milliseconds, vs 19770 calls on cp3058 (text) for a total of 779 milliseconds, showing that not only is mmap called much more often on the text node, but its average latency is also a lot higher.
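
For what it's worth, the same measurement can be written a bit more compactly by letting bpftrace compute the average directly; here is a sketch that takes the PID as a positional parameter instead of hardcoding it:

tracepoint:syscalls:sys_enter_mmap
/pid == $1/
{
        @calls = count();
        @start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_mmap
/@start[tid]/
{
        @avg_ns = avg(nsecs - @start[tid]);
        delete(@start[tid]);
}

interval:s:1 {
        clear(@start);
        exit();
}

Saved as, say, mmap_avg.bt (a name picked just for illustration), it would be run as sudo bpftrace mmap_avg.bt 40982.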

Back to the original question of what's calling mmap, I couldn't find anything meaningful in the ATS code. However, given that these are all MAP_ANONYMOUS calls (confirmed by looking at the fd argument, which is -1), I suspected malloc.

Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3). Prior to Linux 4.7 allocations performed using mmap(2) were unaffected by the RLIMIT_DATA resource limit; since Linux 4.7, this limit is also enforced for allocations performed using mmap(2).

Indeed, on the text node malloc is called much more often (~70K times per second vs ~7K on upload), but the number of times MMAP_THRESHOLD is exceeded is relatively low (~1K per second) and comparable between text and upload:

root@cp3058:~# bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /pid == 34216/ { @size = hist(arg0) } interval:s:1 { exit(); }'
Attaching 2 probes...
@size: 
[1]                   17 |                                                    |
[2, 4)                83 |                                                    |
[4, 8)                51 |                                                    |
[8, 16)             1364 |@@@                                                 |
[16, 32)            9212 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[32, 64)           13601 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |
[64, 128)           1968 |@@@@@                                               |
[128, 256)         19614 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)           565 |@                                                   |
[512, 1K)            293 |                                                    |
[1K, 2K)            1231 |@@@                                                 |
[2K, 4K)              39 |                                                    |
[4K, 8K)             477 |@                                                   |
[8K, 16K)            348 |                                                    |
[16K, 32K)            76 |                                                    |
[32K, 64K)             1 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)        1187 |@@@                                                 |
root@cp3051:~# bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /pid == 40982/ { @size = hist(arg0) } interval:s:1 { exit(); }'
@size: 
[1]                    3 |                                                    |
[2, 4)                12 |                                                    |
[4, 8)                 3 |                                                    |
[8, 16)              162 |@                                                   |
[16, 32)             451 |@@@@                                                |
[32, 64)            2282 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[64, 128)            326 |@@@                                                 |
[128, 256)          5054 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)           250 |@@                                                  |
[512, 1K)             44 |                                                    |
[1K, 2K)             107 |@                                                   |
[2K, 4K)               3 |                                                    |
[4K, 8K)             191 |@                                                   |
[8K, 16K)              2 |                                                    |
[16K, 32K)            12 |                                                    |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)         876 |@@@@@@@@@                                           |
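
The [128K, 256K) bucket can also be cross-checked by directly counting the allocations at or above the default 128 kB (131072 bytes) threshold, with a sketch along these lines (same PID as the cp3058 run above, output omitted):

$ sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /pid == 34216 && arg0 >= 131072/ { @big = count(); } interval:s:1 { exit(); }'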

My take-home message would be that malloc is innocent then.

It turns out that malloc(3) does not tell the whole truth: the threshold for choosing between mmap and brk is dynamic, not hardcoded to 128K (glibc raises it at runtime, up to a maximum, when mmap-allocated blocks are freed). For instance, I found values of ~400K with the following SystemTap script:

probe process("/usr/lib/x86_64-linux-gnu/libc-2.28.so").function("malloc") {
        if (pid() == 28236) {
                printf("mmap_threshold=%d\n", $mp_->mmap_threshold)
                exit()
        }
}

At any rate: sysmalloc(), the glibc function responsible for calling mmap, is never called, showing that indeed, as mentioned yesterday, malloc is not responsible for all these mmap calls.

The culprit is libluajit:

global c

probe process("/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0").function("mcode_alloc_at") {
        if (pid() == 31316) {
                c[pp()]++
        }
}

probe process("/usr/lib/x86_64-linux-gnu/libc-2.28.so").function("mmap") {
        if (pid() == 31316) {
                c[pp()]++
        }
}

probe timer.ms(100) {
        foreach (f in c-) {
                printf("%s: %d\n", f, c[f])
        }
        exit()
}

31316 is the PID of the traffic_server process for ats-be on cp3050 right now, and this is how many times LuaJIT calls mcode_alloc_at, which in turn calls mmap, in 100 milliseconds:

process("/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0").function("mcode_alloc_at@./src/lj_mcode.c:100"): 2063
process("/usr/lib/x86_64-linux-gnu/libc-2.28.so").function("__mmap64@../sysdeps/unix/sysv/linux/mmap64.c:40"): 2048

I could obtain the following backtraces, one for tslua (global):

0x7fb44df01338 : mcode_allocarea+0xb8/0x160 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44df19d54 : lj_asm_trace+0x25b4/0x3460 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44df381d1 : trace_state+0x201/0x1340 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44def2146 : lj_vm_cpcall+0x4a/0x5e [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44def864b : lj_trace_ins+0x5b/0x70 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44df410dc : lj_dispatch_call+0x21c/0x250 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44def37ba : lj_vm_hotcall+0x1e/0x43 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44df3f22c : lua_pcall+0x5c/0x80 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb52800d9a3 : globalHookHandler+0x313/0x450 [/usr/lib/trafficserver/modules/tslua.so]

And one as a continuation handler:

0x7fb44df01338 : mcode_allocarea+0xb8/0x160 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44df19d54 : lj_asm_trace+0x25b4/0x3460 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44df381d1 : trace_state+0x201/0x1340 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44def2146 : lj_vm_cpcall+0x4a/0x5e [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44def864b : lj_trace_ins+0x5b/0x70 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44df34e17 : lj_dispatch_ins+0x177/0x1e0 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44def3741 : lj_vm_inshook+0x2d/0x44 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44df48ec1 : lj_cf_string_gsub+0x2e1/0x440 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb44def1dd7 : lj_BC_FUNCC+0x34/0x59 [/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
0x7fb52801a0fe : ts_lua_http_cont_handler+0x15e/0x410 [/usr/lib/trafficserver/modules/tslua.so]

The simple reason why cache_upload is affected to a lesser extent is that we use less Lua there. Why ats-be on cp3052 is not affected remains a mystery.
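
"We use less Lua there" could be roughly quantified by counting Lua entry points per second, for instance with something like this (a sketch; lua_pcall is just one of the ways tslua enters the interpreter, so this underestimates the total; PID as on cp3050 above):

$ sudo bpftrace -e 'uprobe:/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0:lua_pcall /pid == 31316/ { @pcalls = count(); } interval:s:1 { exit(); }'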

Change 655043 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: make number of allowed Lua states configurable

https://gerrit.wikimedia.org/r/655043

Change 655044 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: lower number of allowed Lua states on cp3050

https://gerrit.wikimedia.org/r/655044

A low-hanging fruit when it comes to Lua overhead seems to be tuning the number of allowed Lua states. Looking at the internal tslua statistics on cp3050, most of our 256 states have a fairly low number of associated threads (mostly between 0 and 3, with some outliers up to 6). Let's try lowering that value from the default (256) to 64 on cp3050 and evaluate the impact on kernel CPU usage.

Change 655043 merged by Ema:
[operations/puppet@production] ATS: make number of allowed Lua states configurable

https://gerrit.wikimedia.org/r/655043

Change 655044 merged by Ema:
[operations/puppet@production] ATS: lower number of allowed Lua states on cp3050

https://gerrit.wikimedia.org/r/655044

Mentioned in SAL (#wikimedia-operations) [2021-01-11T10:06:22Z] <ema> cp3050: restart ats-be to lower lua states from 256 to 64 T265625

ema raised the priority of this task from Medium to High. Jan 12 2021, 11:27 AM

Lowering the number of Lua states on cp3050 did reduce system CPU usage a bit, without any visible performance difference, so it seems safe to proceed with 64 as the default everywhere. That is only a minor improvement, however, and the big "mmap issue" remains.

I found that there are other hosts mysteriously not affected by it, for example cp5007. See the number of mmap calls issued by the ats-be traffic_server process (pid=13377) on cp5008 in 10 seconds:

11:09:59 ema@cp5008.eqsin.wmnet:~
$ sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:mmap /pid==13377/ { @c[pid] = count()  } interval:s:10 { exit(); }'
Attaching 2 probes...
@c[13377]: 182281

While on cp5007 (traffic_server pid=38536) mmap is called only once during that time:

11:09:52 ema@cp5007.eqsin.wmnet:~
$ sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:mmap /pid==38536/ { @c[pid] = count()  } interval:s:10 { exit(); }'
Attaching 2 probes...
@c[38536]: 1

And indeed the difference is very visible in the CPU usage comparison:

[Screenshot from 2021-01-12 12-15-29.png]

This maps 1:1 to the TTFB of backend hits served by ats-be, as reported by varnish-fe:

[Screenshot from 2021-01-12 12-17-25.png]

When the CPU spikes occur we see TTFB differences of up to 40 milliseconds between cp5007 and cp5008, indicating that fixing this would be a very significant performance win. Raising priority accordingly.

Mentioned in SAL (#wikimedia-operations) [2021-01-12T15:56:06Z] <ema> cp5008: ats-backend-restart to apply jit.off(true, true) in default.lua T265625

Mentioned in SAL (#wikimedia-operations) [2021-01-12T16:37:29Z] <ema> cp5008: ats-backend-restart to apply jit.off(true, true) to all lua scripts T265625

Disabling JIT in all Lua scripts on cp5008 resulted in ats-be no longer calling lj_vm_hotcall/mmap, and CPU usage went down significantly:

[Screenshot from 2021-01-12 17-52-45.png]

The current thinking is that LuaJIT somehow keeps identifying functions as hot and recompiles them all the time instead of reusing the compiled code. Both ats-tls and ats-be call lj_vm_hotcall hundreds of times per second:

root@cp3050:~# bpftrace -e 'uprobe:/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0:lj_vm_hotcall { @c[pid]=count();} interval:s:1 { exit() }'
Attaching 2 probes...
@c[28236]: 210
@c[7551]: 806

That's not the case on hosts not affected by this issue (and now on cp5008). I'll leave JIT disabled on the ats-be Lua scripts running on cp5008 and re-evaluate the situation tomorrow.

Mentioned in SAL (#wikimedia-operations) [2021-01-13T08:46:18Z] <ema> cp5008: re-enable puppet to undo JIT tslua experiment T265625

Disabling JIT in all Lua scripts resulted in significantly decreased CPU usage as well as TTFB:

[Screenshot from 2021-01-13 09-52-02.png]

We should do that on all ATS backend instances, and probably on the TLS ones too.

Change 655870 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disable JIT compiler on ats-be Lua scripts

https://gerrit.wikimedia.org/r/655870

Change 655870 merged by Ema:
[operations/puppet@production] ATS: disable JIT compiler on ats-be Lua scripts

https://gerrit.wikimedia.org/r/655870

Change 655876 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disable JIT compiler on ats-tls too

https://gerrit.wikimedia.org/r/655876

Change 655876 merged by Ema:
[operations/puppet@production] ATS: disable JIT compiler on ats-tls too

https://gerrit.wikimedia.org/r/655876

ema claimed this task.

Erratic CPU usage gone:

[Screenshot from 2021-01-14 11-06-11.png]

Issue reported upstream: https://github.com/apache/trafficserver/issues/7423.

Change 760889 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] Revert "ATS: lower number of allowed Lua states on cp3050"

https://gerrit.wikimedia.org/r/760889

Change 760889 merged by Ema:

[operations/puppet@production] Revert "ATS: lower number of allowed Lua states on cp3050"

https://gerrit.wikimedia.org/r/760889

Mentioned in SAL (#wikimedia-operations) [2022-02-09T09:15:51Z] <ema> cp3050: ats-backend-restart to set the number of allowed Lua states back from 64 to 256 (default) T265625