Page MenuHomePhabricator

Truncated ArcLamp output files
Closed, ResolvedPublic

Description

It looks like arclamp-generate-svgs occasionally fails in a way that leaves an output file consisting of just the gzip header:

-rw-r--r-- 1 xenon xenon     20 Jul 24 20:33 2020-06-09.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 24 20:33 2020-06-09.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 24 20:35 2020-06-09.excimer.api.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jun 12 00:10 2020-06-11.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jun 14 00:04 2020-06-13.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jun 15 00:04 2020-06-14.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jun 26 00:12 2020-06-25.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul  2 00:11 2020-07-01.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul  3 00:08 2020-07-02.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul  4 00:00 2020-07-03.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul  6 00:05 2020-07-05.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul  6 00:00 2020-07-05.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul  7 00:15 2020-07-06.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul  8 00:03 2020-07-07.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul  9 00:15 2020-07-08.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 10 00:03 2020-07-09.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 10 00:00 2020-07-09.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 11 00:00 2020-07-10.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 12 00:07 2020-07-11.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 12 00:03 2020-07-11.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 13 00:03 2020-07-12.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 13 00:00 2020-07-12.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 14 00:11 2020-07-13.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 14 00:05 2020-07-13.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 15 00:10 2020-07-14.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 15 00:02 2020-07-14.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 16 00:12 2020-07-15.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 16 00:03 2020-07-15.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 17 00:09 2020-07-16.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 17 00:01 2020-07-16.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 18 00:01 2020-07-17.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 19 00:11 2020-07-18.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 20 00:08 2020-07-19.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 20 00:01 2020-07-19.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 21 00:03 2020-07-20.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 22 00:07 2020-07-21.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 22 00:03 2020-07-21.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 24 20:16 2020-07-23.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 24 20:16 2020-07-23.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 24 20:18 2020-07-23.excimer.RunSingleJob.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 25 00:00 2020-07-24.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 25 00:00 2020-07-24.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 25 00:02 2020-07-24.excimer.index.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 26 00:00 2020-07-25.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 26 00:00 2020-07-25.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 26 00:03 2020-07-25.excimer.api.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 26 00:02 2020-07-25.excimer.index.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 27 00:01 2020-07-26.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 27 00:01 2020-07-26.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 27 00:02 2020-07-26.excimer.index.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 27 00:03 2020-07-26.excimer.index.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 28 00:00 2020-07-27.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 28 00:01 2020-07-27.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 28 00:02 2020-07-27.excimer.index.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 28 00:03 2020-07-27.excimer.index.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 29 00:00 2020-07-28.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 29 00:00 2020-07-28.excimer.all.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 29 00:02 2020-07-28.excimer.index.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 29 17:30 2020-07-29.excimer.all.reversed.svgz
-rw-r--r-- 1 xenon xenon     20 Jul 29 17:30 2020-07-29.excimer.all.svgz

From the above, it appears to have been happening at least as far back as June 12, but with increasing frequency of late, possibly due to changes made for T253679 and T235456. Most other files from the same date range are unaffected.

Event Timeline

I'm 90% sure this is due to flamegraph.pl running out of memory. From syslog:

Jul 29 12:01:33 webperf1002 kernel: [1115210.705754] Out of memory: Kill process 21368 (flamegraph.pl) score 224 or sacrifice child
Jul 29 12:01:33 webperf1002 kernel: [1115210.707177] Killed process 21368 (flamegraph.pl) total-vm:2071364kB, anon-rss:2052648kB, file-rss:0kB, shmem-rss:0kB

As noted in T257931, the input data set started getting larger around July 1st, which would explain the increase in frequency of occurrence.

Change 617275 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/arc-lamp@master] Prevent parallel execution and check .svgz size

https://gerrit.wikimedia.org/r/617275

Change 617275 merged by jenkins-bot:
[performance/arc-lamp@master] Prevent parallel execution and check .svgz size

https://gerrit.wikimedia.org/r/617275

Krinkle triaged this task as Medium priority.Jul 31 2020, 7:56 PM

Change 617768 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/arc-lamp@master] Fix arclamp-generate-svgs with uncompressed logs

https://gerrit.wikimedia.org/r/617768

Change 617768 merged by jenkins-bot:
[performance/arc-lamp@master] Fix arclamp-generate-svgs with uncompressed logs

https://gerrit.wikimedia.org/r/617768

Mentioned in SAL (#wikimedia-operations) [2020-08-06T16:18:48Z] <dpifke@deploy1001> Started deploy [performance/arc-lamp@7838c88]: Deploying fixes for T259167

Mentioned in SAL (#wikimedia-operations) [2020-08-06T16:18:54Z] <dpifke@deploy1001> Finished deploy [performance/arc-lamp@7838c88]: Deploying fixes for T259167 (duration: 00m 05s)

Still seeing OOM errors, another patch is forthcoming:

Aug  6 16:32:47 webperf1002 kernel: [1822677.217129] Out of memory: Kill process 17071 (flamegraph.pl) score 218 or sacrifice child
Aug  6 16:32:47 webperf1002 kernel: [1822677.219568] Killed process 17071 (flamegraph.pl) total-vm:2022980kB, anon-rss:2004252kB, file-rss:0kB, shmem-rss:0kB
Aug  6 16:32:47 webperf1002 kernel: [1822677.416367] oom_reaper: reaped process 17071 (flamegraph.pl), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Change 618794 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/arc-lamp@master] Process large files serially and improve metrics

https://gerrit.wikimedia.org/r/618794

Change 618794 merged by jenkins-bot:
[performance/arc-lamp@master] Process large files serially and improve metrics

https://gerrit.wikimedia.org/r/618794

Mentioned in SAL (#wikimedia-operations) [2020-08-11T00:31:58Z] <dpifke@deploy1001> Started deploy [performance/arc-lamp@fc5f1c6]: Deploying latest attempt to fix T259167

Mentioned in SAL (#wikimedia-operations) [2020-08-11T00:33:02Z] <dpifke@deploy1001> Finished deploy [performance/arc-lamp@fc5f1c6]: Deploying latest attempt to fix T259167 (duration: 01m 03s)

And we're still running out of memory:

Aug 11 18:48:33 webperf1002 kernel: [2262819.365604] Out of memory: Kill process 14731 (flamegraph.pl) score 487 or sacrifice child
Aug 11 18:48:33 webperf1002 kernel: [2262819.367524] Killed process 14731 (flamegraph.pl) total-vm:4485228kB, anon-rss:4465516kB, file-rss:0kB, shmem-rss:0kB

I had too many zeroes in my change to serialize processing of large files; fix forthcoming.

Diagnosing this, I realized just how pathological the failure mode is:

  1. arclamp-generate-svgs spawns gziptee → 2x flamegraph.pl
  2. One flamegraph.pl gets killed.
  3. The kernel moves the subprocess group under PID 1, so now we have gzip and the other flamegraph.pl running in parallel to the original script.
  4. Despite set -e, arclamp-generate-svgs doesn't abort. It now returns to step 1.

This repeats and we end up with a bunch of flamegraph.pl instances all competing for memory. And because the OOM killer prefers newer processes, we end up with more and more subprocesses under PID 1 instead of arclamp-generate-svgs.

When the main script finally exits, it cleans up its PID file, and the next invocation doesn't detect the orphaned children.

Fixing the input size limit will prevent step 2 from occurring, so it will definitely fix the issue; I tested by manually patching on webperf2002. That said, the additional RAM wasn't for naught: flamegraph.pl is pushing ~7GB with recent daily .all files, so total system memory of 8GB wouldn't be enough. With 16GB, we should be fine. (Note: we should investigate why these logs are getting larger.)

I'm going to experiment with trap ... SIGCHLD to see if we can detect one end of the pipe getting killed and cause the script to abort instead of continue along in parallel.

I'm also tempted to have the script exit immediately not just if another copy is running, but if there's any flamegraph.pl in the process listing. This feels a little kludgy, but would also protect against this failure mode.

Change 621348 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/arc-lamp@master] Process serially above 100MB, not 1GB

https://gerrit.wikimedia.org/r/621348

Change 621348 merged by jenkins-bot:
[performance/arc-lamp@master] Process serially above 100MB, not 1GB

https://gerrit.wikimedia.org/r/621348

Mentioned in SAL (#wikimedia-operations) [2020-08-19T20:39:14Z] <dpifke@deploy1001> Started deploy [performance/arc-lamp@2ef1af7]: Deploy fixes for notifications and OOM prevention (T259167)

Mentioned in SAL (#wikimedia-operations) [2020-08-19T20:39:20Z] <dpifke@deploy1001> Finished deploy [performance/arc-lamp@2ef1af7]: Deploy fixes for notifications and OOM prevention (T259167) (duration: 00m 06s)

Change 621385 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[performance/arc-lamp@master] Skip large logs if another flamegraph.pl is active

https://gerrit.wikimedia.org/r/621385

dpifke lowered the priority of this task from Medium to Low.Aug 20 2020, 1:22 AM

Trapping SIGCHLD is tricky, so I opted instead to check if another flamegraph.pl is running, and defer large files until the next run if so. We risk files being repeatedly skipped if a background flamegraph.pl gets into an infinite loop, but we'd presumably catch that from the cron emails and/or monitoring.

This is mostly preventative medicine for if/when we outgrow 16GB of RAM; we haven't had anything OOM since change 621348 was deployed earlier today. Dropping the priority of this task.

Change 621385 abandoned by Dave Pifke:
[performance/arc-lamp@master] Skip large logs if another flamegraph.pl is active

Reason:
Not needed at the moment, OOMs haven't reoccurred.

https://gerrit.wikimedia.org/r/621385

Problem hasn't come back, no need for additional mitigation.