Page MenuHomePhabricator

arclamp_generate_svgs OOMs
Closed, ResolvedPublicBUG REPORT

Description

arclamp_generate_svgs is OOMing regularly, causing alert spam and (probably) truncated or missing flame graphs.

I'm not sure when the alerts started, because I haven't fully figured out how to work with alertmanager yet. But memory usage started spiking on or around 2022-06-08:

https://grafana.wikimedia.org/d/000000377/host-overview?var-server=webperf1004&var-datasource=thanos&var-cluster=webperf&viewPanel=4&from=now-90d&to=now

@dpifke had some ideas about what was causing OOMs and what to do about it in T259167. But the best use of resources and time might be to quadruple the amount of RAM available to the machine. The engineers on the performance team can save more resources by focusing on other problems.

Event Timeline

Ref T260192: More RAM needed for webperf1002 and webperf2002

Fwiw, I don't see any recent alarms from arclamp in either our IRC channel or mailing list. Perhaps it's not coming through to us?

The alerts are going to #wikimedia-operations; there were 21 alerts of this form on 2022-08-11:

[20:28:26] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
Krinkle triaged this task as Medium priority.

Ganeti req incoming, after I reproduce the issue and confirm it's normal growth and not uncontrolled.

Change 825870 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[performance/arc-lamp@master] Use in-place sort for @Data and consume %Node on the fly to use less memory

https://gerrit.wikimedia.org/r/825870

Change 825870 merged by jenkins-bot:

[performance/arc-lamp@master] Use in-place sort for @Data and consume %Node on the fly to use less memory

https://gerrit.wikimedia.org/r/825870

I downloaded one of the largest files for this week (2022-08-24.excimer.all.log) to my local. To get a sense of what might be a theoretically minimal expected memory footprint, I ran a different script instead of flamegraph.pl, namely stackcollapse-sum.pl (from FlameGraph#286) which has as its sole purpose to produce the aggregate counts, which in my understanding is a subset of what flamegraph.pl does (after that step, flamegraph.pl then plots it in SVG and shakes out peaks small than the configured minwidth).

$ ls -halF
2.2G 2022-08-24.excimer.all.log
$ cat 2022-08-24.excimer.all.log 2022-08-24.excimer.all.log > large.log
4.4G large.log
$ ./stackcollapse-sum.pl 2022-08-24.excimer.all.log > collapsed.log
916M collapsed.log 
$ ./stackcollapse-sum.pl 2022-08-24.excimer.all.log > collapsed.log
916M collapsed.log 

We now have a 4G input file, that should collapse to about 900M of data. I note that while running stackcollapse-sum.pl, memory use grew to about 1.1G which seems reasonable and suggests that did performed a chunked read of the input file. I have little familiarity with Perl, but online sources (this post, and FlameGraph#76 ) suggest that the while (<>) idiom used in that script indeed reads in a chunked way.

# macOS 10.15.7
# perl5.18
# 2.5 GHz Quad-Core Intel Core i7
# 16 GB 1600 MHz DDR3
./flamegraph.pl --minwidth=1 --title="Title" large.log > flamegraph.svg

Using flamegraph.pl from commit arc-lamp@81368e97af (upstream FlameGraph.git date: 31 Aug 2021, i.e. the version before the above patch)

Before
1m: VSZ=15G RSS=2GB CPU=100%
4m: VSZ=15G RSS=3GB CPU=99%
5m: VSZ=15G RSS=3GB CPU=100% (OS GUI: Used=20% ; Process: Memory=10GB, Real=4GB)
Runtime: 549s

With the above patch applied:

After
4m: VSZ=15G RSS=9GB CPU=99%
5m: VSZ=15G RSS=9GB CPU=100%
7m: VSZ=15G RSS=9GB CPU=100% (OS GUI: Used=55%; Process: Memory=3GB, Real=9GB)
Runtime: 476s

The first set of numbers come from ps (macOS/Darwin BSD). The parenthetical numbers come from the Activity Monitor app (macOS). I'm not entirely sure what to make of this, but apart from the (presumed, virtual) "Memory" number in the GUI app, all other numbers appear higher after the patch is applied.

The numbers:

  • With an input of 4G, it can (and is) summarised to a ~1G collapsed file in memory, which with the used Perl code takes about 1G to do.
  • The SVG output is a mere ~600K (<0.1GB)

Side-stepping the huge unknown (how flamegraph.pl actually works) it seems at least vaguely plausible that this could be done with 1 or 2 GB of memory, as opposed to 9GB or 15GB, but alas that's not how it works right now. The general inefficiency is acknowledged upstream (https://github.com/brendangregg/FlameGraph/issues/58) and is generally open to patches, but this isn't a priority right now. I believe the general approach to this is to sample it down. Given the increase in traffic to some of these endpoints, I suppose we could do that as well, e.g. reduce the Excimer frequency in production from 60s to e.g. 90s if we feel we have enough data.

I believe this would not decrease flamegraph detail per-se as we already discard a ton of precision data higher in the graph through --minwidth, and the methods lower in the graph would statistically still get represented the same more or less. Alternatively, there remains T316223: Expand RAM on arclamp hosts and move them to baremetal to just move this back to hardware and give it enough memory through a misc server dedicated for this.

Mentioned in SAL (#wikimedia-operations) [2022-08-30T00:12:18Z] <aaron@deploy1002> Started deploy [performance/arc-lamp@40cb764]: T315056

Mentioned in SAL (#wikimedia-operations) [2022-08-30T00:12:25Z] <aaron@deploy1002> Finished deploy [performance/arc-lamp@40cb764]: T315056 (duration: 00m 07s)

https://grafana.wikimedia.org/d/000000377/host-overview?var-server=webperf1004&var-datasource=thanos&var-cluster=webperf&viewPanel=4&from=now-90d&to=now&orgId=1

Screenshot.png (1×2 px, 520 KB)

No OOMs for 10 days since Aug 27.

krinkle@webperf1004:~$ sudo cat /var/log/syslog*.gz | gunzip | fgrep 'killed by the OOM killer' | sed 's/  / /' | cut -d' ' -f1-2,4,6- | sort | uniq -c
     29 Aug 7 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     32 Aug 8 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     56 Aug 9 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     58 Aug 10 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     59 Aug 11 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     58 Aug 12 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     63 Aug 13 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     60 Aug 14 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     56 Aug 15 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     57 Aug 16 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     58 Aug 17 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     56 Aug 18 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     58 Aug 19 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     58 Aug 20 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     64 Aug 21 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     55 Aug 22 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     52 Aug 23 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     57 Aug 24 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     47 Aug 25 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.
     62 Aug 26 webperf1004 arclamp_generate_svgs.service: A process of this unit has been killed by the OOM killer.

krinkle@webperf1004:~$ sudo cat /var/log/syslog /var/log/syslog.1 | fgrep 'killed by the OOM killer'
(No results.)

krinkle@webperf1004:~$ sudo head -n1 /var/log/syslog /var/log/syslog.1
==> /var/log/syslog <==
Sep  4 00:00:01 webperf1004 rsyslogd: ..

==> /var/log/syslog.1 <==
Aug 28 00:00:08 webperf1004 rsyslogd: ..

On a somewhat related note, the runtimes look like this:

$ sudo cat /var/log/syslog /var/log/syslog.1 | fgrep arclamp_generate_svgs
Sep  2 01:33:22 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 16min 54.796s CPU time.
Sep  2 01:48:13 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 01:48:13 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 15min 22.265s CPU time.
Sep  2 02:04:38 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 02:04:38 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 16min 59.589s CPU time.
Sep  2 02:22:34 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 02:22:34 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 18min 24.112s CPU time.
Sep  2 02:39:06 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 02:39:06 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 17min 4.858s CPU time.
Sep  2 02:56:42 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 02:56:42 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 18min 13.088s CPU time.
Sep  2 03:15:21 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 03:15:21 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 19min 13.664s CPU time.
Sep  2 03:34:10 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 03:34:10 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 19min 12.116s CPU time.
Sep  2 03:53:37 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 03:53:37 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 20min 2.882s CPU time.
Sep  2 04:14:09 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 04:14:09 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 21min 10.424s CPU time.
Sep  2 04:35:49 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 04:35:49 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 22min 9.483s CPU time.
Sep  2 04:57:36 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 04:57:36 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 22min 24.200s CPU time.
Sep  2 05:20:41 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 05:20:41 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 23min 40.190s CPU time.
Sep  2 05:44:27 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 05:44:27 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 24min 24.112s CPU time.
Sep  2 06:08:58 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 06:08:58 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 25min 11.945s CPU time.
Sep  2 06:35:39 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 06:35:39 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 27min 16.436s CPU time.
Sep  2 07:01:57 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 07:01:57 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 26min 59.368s CPU time.
Sep  2 07:31:15 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 07:31:15 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 29min 55.014s CPU time.
Sep  2 07:59:58 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 07:59:58 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 29min 22.768s CPU time.
Sep  2 08:30:28 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 08:30:28 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 31min 12.672s CPU time.
Sep  2 09:01:20 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 09:01:20 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 31min 33.022s CPU time.
Sep  2 09:35:58 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 09:35:59 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 35min 14.571s CPU time.
Sep  2 10:10:22 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 10:10:22 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 35min 5.198s CPU time.
Sep  2 10:45:22 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 10:45:22 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 35min 41.545s CPU time.
Sep  2 11:22:49 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 11:22:49 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 38min 13.838s CPU time.
Sep  2 12:02:14 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 12:02:14 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 40min 8.718s CPU time.
Sep  2 12:45:10 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 12:45:10 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 43min 37.389s CPU time.
Sep  2 13:27:33 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 13:27:33 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 43min 12.128s CPU time.
Sep  2 14:14:23 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 14:14:23 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 47min 33.563s CPU time.
Sep  2 15:02:13 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 15:02:13 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 48min 41.621s CPU time.
Sep  2 15:52:47 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 15:52:47 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 51min 18.065s CPU time.
Sep  2 16:43:44 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 16:43:44 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 51min 51.508s CPU time.
Sep  2 17:36:27 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 17:36:27 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 53min 38.240s CPU time.
Sep  2 18:34:28 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 18:34:28 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 58min 47.496s CPU time.
Sep  2 19:32:25 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 19:32:25 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 58min 53.703s CPU time.
Sep  2 20:31:50 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 20:31:50 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 1h 20.979s CPU time.
Sep  2 21:34:01 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 21:34:01 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 1h 3min 6.337s CPU time.
Sep  2 22:38:05 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 22:38:05 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 1h 5min 9.348s CPU time.
Sep  2 23:44:46 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  2 23:44:46 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 1h 7min 49.554s CPU time.
Sep  3 00:52:12 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 00:52:12 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 1h 8min 37.348s CPU time.
Sep  3 01:20:38 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 01:20:38 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 28min 38.936s CPU time.
Sep  3 01:36:03 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 01:36:03 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 15min 48.475s CPU time.
Sep  3 01:50:21 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 01:50:21 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 14min 43.932s CPU time.
Sep  3 02:05:38 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 02:05:38 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 15min 54.920s CPU time.
Sep  3 02:21:39 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 02:21:39 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 16min 27.223s CPU time.
Sep  3 02:37:04 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 02:37:04 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 16min 3.663s CPU time.
Sep  3 02:53:38 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 02:53:38 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 17min 10.497s CPU time.
Sep  3 03:10:57 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 03:10:57 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 17min 54.243s CPU time.
Sep  3 03:27:41 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 03:27:41 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 17min 16.021s CPU time.
Sep  3 03:45:01 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 03:45:01 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 17min 54.079s CPU time.
Sep  3 04:03:21 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.
Sep  3 04:03:21 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 18min 59.617s CPU time.
Sep  3 04:22:03 webperf1004 systemd[1]: arclamp_generate_svgs.service: Succeeded.

They are essentially cyclical, building up a bigger and bigger workload throughout the day until closely after midnight where they recover.

Summary
Sep  2 01:33:22 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 16min 54.796s CPU time.
..
Sep  2 02:22:34 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 18min 24.112s CPU time.
..
Sep  2 03:53:37 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 20min 2.882s CPU time.
..
Sep  2 08:30:28 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 31min 12.672s CPU time.
..
Sep  2 10:45:22 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 35min 41.545s CPU time.
..
Sep  2 12:02:14 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 40min 8.718s CPU time.
..
Sep  2 15:52:47 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 51min 18.065s CPU time.
..
Sep  3 00:52:12 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 1h 8min 37.348s CPU time.
..
Sep  3 01:20:38 webperf1004 systemd[1]: arclamp_generate_svgs.service: Consumed 28min 38.936s CPU time.
Krinkle reopened this task as Open.

Leaving open for Aaron to submit PR to upstream.

Mentioned in SAL (#wikimedia-operations) [2022-11-07T17:24:49Z] <krinkle@deploy1002> Started deploy [performance/arc-lamp@e1ac118]: https://gerrit.wikimedia.org/r/c/825870 - T322561, T315056

Mentioned in SAL (#wikimedia-operations) [2022-11-07T17:24:56Z] <krinkle@deploy1002> Finished deploy [performance/arc-lamp@e1ac118]: https://gerrit.wikimedia.org/r/c/825870 - T322561, T315056 (duration: 00m 07s)

Last step within this task is to submit the patch to the upstream on GitHub.

Future work on mitigating the memory issue is T316223: Expand RAM on arclamp hosts and move them to baremetal.