
webperf*002 running out of disk space (arc lamp, xhgui)
Closed, Resolved · Public

Description

webperf1002 and webperf2002 have alerted today due to low disk space.

Event Timeline

My guesses:

  • higher sampling rate from php7-excimer (compared to hhvm/xenon) is creating significantly larger files than we projected for in T199853.
  • general increase in backend traffic over the years (unlikely).
  • the now-redundant hhvm/xenon listener of arc lamp is misbehaving and doing more than wasting CPU: creating empty files and/or preventing old files from expiring (thus storing not just N days of PHP7 data but also the last N days of hhvm data from before it stopped getting traffic, making it 2X).

While the first and last points are not unlikely, it surprises me that they would cause disk space issues even if both are true, since we also projected for storing XHGui profiles, which take up much more space, and we haven't even started doing that yet.

Krinkle renamed this task from webperf* running out of disk space to webperf*002 running out of disk space (arc lamp, xhgui).Oct 14 2019, 3:48 PM

Looking at them now, I see they are only using 14% and 8% of /. I ran "apt-get clean" and now it's down to 12% and 6%. Alerting would be at 95% by default. So it looks like somebody (or something like a cron?) already deleted stuff.

I noticed the warning this morning as well and pinged Gilles on IRC; the issue seems to be in the /srv partition:

/dev/vdb        147G  135G  4.4G  97% /srv

More specifically:

elukey@webperf1002:/srv/xenon/logs$ du -hs *
105G	daily
30G	hourly
[..]
562M	2019-09-19.excimer.all.log
598M	2019-09-21.excimer.all.log
600M	2019-09-18.excimer.all.log
601M	2019-10-09.excimer.index.log
613M	2019-10-10.excimer.index.log
640M	2019-09-25.excimer.index.log
642M	2019-10-11.excimer.index.log
643M	2019-09-22.excimer.all.log
645M	2019-10-12.excimer.index.log
684M	2019-10-13.excimer.index.log
817M	2019-09-23.excimer.all.log
1.1G	2019-10-14.excimer.all.log
1.3G	2019-09-24.excimer.all.log
1.3G	2019-09-27.excimer.all.log
1.3G	2019-09-28.excimer.all.log
1.3G	2019-09-29.excimer.all.log
1.3G	2019-09-30.excimer.all.log
1.3G	2019-10-01.excimer.all.log
1.3G	2019-10-02.excimer.all.log
1.3G	2019-10-03.excimer.all.log
1.3G	2019-10-04.excimer.all.log
1.3G	2019-10-05.excimer.all.log
1.3G	2019-10-06.excimer.all.log
1.3G	2019-10-07.excimer.all.log
1.4G	2019-09-25.excimer.all.log
1.4G	2019-09-26.excimer.all.log
1.4G	2019-10-08.excimer.all.log
1.4G	2019-10-10.excimer.all.log
1.4G	2019-10-11.excimer.all.log
1.4G	2019-10-12.excimer.all.log
1.5G	2019-10-09.excimer.all.log
1.5G	2019-10-13.excimer.all.log

Sorry, I was on 1001 and 2001 instead of 1002 and 2002, and was wondering why I didn't even see /srv mounted on a separate device. Yes, ACK. On 1002/2002 it's the xenon logs.

Mentioned in SAL (#wikimedia-operations) [2019-10-14T17:56:12Z] <mutante> webperf2002 - /srv/xenon/logs/daily# gzip 2019-09*excimer*.log (T235425)

@Dzahn These are not typical "log" files; they are active application data for arc lamp, public on perf.wm.o, and consumed or reacted to by automated processes. Changing their filename and content via gzip is therefore effectively the same as deleting them, with the same side effects.

Flame graphs (SVG artefacts) were auto-deleted, and the log files were no longer accessible to tooling through the web interface (because they no longer match the expected URL format).

Details:
We have a cronjob that expires old log files, and another cronjob that auto-deletes SVG artefacts (flame graphs) that no longer have an associated log file. As such, gzipping September's log files meant the cronjob simply deleted all of September's flame graphs from performance.wikimedia.org. It also means external tools (like Toolforge/CoverMe and CLI tools) can't analyse or compare logs from September, as the compressed files are not at the expected URLs.
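To make the failure mode concrete, here is a minimal sketch (not the actual arc lamp code; paths and file naming are assumptions) of what such an orphan-SVG cleanup job looks like. Once 2019-09-23.excimer.all.log is renamed to .log.gz, the matching SVG appears orphaned and gets pruned:

# Hypothetical sketch of the "expire orphaned SVGs" cron job (illustration only).
# An SVG is kept only if a log file with the same basename still exists, so
# renaming a log via gzip (foo.log -> foo.log.gz) makes its flame graph look orphaned.
import os

LOG_DIR = "/srv/xenon/logs/daily"   # assumed paths
SVG_DIR = "/srv/xenon/svg/daily"

def prune_orphaned_svgs():
    logs = set(os.listdir(LOG_DIR))
    for svg in os.listdir(SVG_DIR):
        if not svg.endswith(".svg"):
            continue
        expected_log = svg[:-len(".svg")] + ".log"
        if expected_log not in logs:
            # e.g. 2019-09-23.excimer.all.log was gzipped, so only the .gz exists
            os.remove(os.path.join(SVG_DIR, svg))

if __name__ == "__main__":
    prune_orphaned_svgs()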

With some manual work, we could decompress the log files and re-generate the flame graphs (SVG files) by running flamegraph.pl. No big deal. However, that would also fill up the disk again, so we can't actually do that. Instead, we'll need to gradually delete single-day files from all directories together (e.g. one day of September from logs/daily, logs/hourly, svg/daily and svg/hourly) and free up space that way; see the sketch below. That means permanent deletion, which I understand is not something you'd want to do without Perf approval, especially since we don't have any backups (other than eqiad-codfw redundancy).
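A rough sketch of that one-day-at-a-time deletion (illustration only; the directory layout is assumed from the paths above, not verified):

# Sketch: remove all artefacts for a single day across the log and SVG
# directories so a day disappears consistently everywhere. Paths are assumptions.
import glob, os

BASE = "/srv/xenon"
SUBDIRS = ["logs/daily", "logs/hourly", "svg/daily", "svg/hourly"]

def delete_day(day):  # day like "2019-09-05"
    for sub in SUBDIRS:
        for path in glob.glob(os.path.join(BASE, sub, day + "*")):
            print("removing", path)
            os.remove(path)

for d in range(1, 11):  # e.g. 2019-09-01 through 2019-09-10
    delete_day("2019-09-%02d" % d)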

As perhaps a weird compromise, for future reference, I think it would be better to let the server fill up the disk and become unable to write new data. I'd prefer to have two months of perf data with today/tomorrow missing forever than to lose last month forever. (We didn't lose anything today, so no worries; I'm comparing the two only to explain my preference.)

Change 543005 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] webperf: add backups for arclamp application data

https://gerrit.wikimedia.org/r/543005

Mentioned in SAL (#wikimedia-operations) [2019-10-14T21:12:52Z] <Krinkle> Delete misc arclamp/logs and arclamp/svgs data from between 2018 and 2019-08 on webperf1002/webperf2002, T235425

Mentioned in SAL (#wikimedia-operations) [2019-10-14T21:47:54Z] <Krinkle> Deleting 2019-09-01 to 2019-09-10 arclamp logs on webperf2002, and decompress the rest of 2019-09, T235425

Mentioned in SAL (#wikimedia-operations) [2019-10-14T23:26:59Z] <Krinkle> Delete 2019-09-01 to 2019-09-10 arclamp trace logs from webperf1002, and decompress the rest of 2019-09 (this will trigger svg re-generation), T235425

The immediate issue is resolved, but this is expected to alert again in 5-10 days unless a few more days' worth are deleted by then. To be continued at T235455: Resolve arclamp disk exhaustion problem (Oct 2019).
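As a back-of-the-envelope check on the 5-10 day estimate (all numbers below are my own assumptions, extrapolated from the du/df output above, not measurements):

# Rough estimate only: days until the 95% alert fires again on the 147G /srv volume.
# Usage after cleanup and the daily growth rate are guesses (daily "all" ~1.4 GB +
# "index" ~0.6 GB, plus hourly logs and SVGs).
disk_total_gb = 147
alert_threshold = 0.95
used_gb = 110          # assumed usage after the cleanup
daily_growth_gb = 4    # assumed combined growth per day

headroom_gb = disk_total_gb * alert_threshold - used_gb
print("days until alert: %.1f" % (headroom_gb / daily_growth_gb))  # ~7 days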

[Attached screenshots: Screenshot 2019-10-15 at 09.39.42.png (80 KB), Screenshot 2019-10-15 at 09.40.53.png (77 KB)]

Change 543005 merged by Dzahn:
[operations/puppet@production] webperf: add backups for arclamp application data

https://gerrit.wikimedia.org/r/543005

This is happening again. The webperf*2 hosts are alerting in Icinga about disk space.


This is arguably a new issue, unrelated to the last time.

I don't see anything obviously wrong with the jobs that compress and eventually expire old data, but it's not clear to me why the size would have increased so dramatically of late. We did recently add the k8s pipeline, but it's relatively tiny: hundreds of bytes most days, and at most about 200 kB/day when we were doing load testing.

I'm going to change the compression job settings to compress after 3 days instead of waiting 7, which should reclaim about 50 GB. We could also be more aggressive about expiring old data, since it's also now being archived in Swift.
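The age-based selection is essentially "compress anything older than N days". A minimal sketch of that logic (illustration only; the real job is defined in Puppet and the paths here are assumptions):

# Illustration only: gzip plain-text logs older than a threshold.
# Lowering MAX_AGE_DAYS from 7 to 3 compresses four extra days of
# ~1-2 GB daily files per directory. Paths are assumptions.
import gzip, os, shutil, time

MAX_AGE_DAYS = 3
LOG_DIRS = ["/srv/xenon/logs/daily", "/srv/xenon/logs/hourly"]

def compress_old_logs():
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    for d in LOG_DIRS:
        for name in os.listdir(d):
            path = os.path.join(d, name)
            if not name.endswith(".log") or os.path.getmtime(path) > cutoff:
                continue
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)

if __name__ == "__main__":
    compress_old_logs()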

Change 738010 had a related patch set uploaded (by Dave Pifke; author: Dave Pifke):

[operations/puppet@production] arclamp: compress logs after 3 days, not 7

https://gerrit.wikimedia.org/r/738010

Change 738010 merged by Dzahn:

[operations/puppet@production] arclamp: compress logs after 3 days, not 7

https://gerrit.wikimedia.org/r/738010

Change 738013 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] webperf::arclamp: turn number of days until compressing logs into a parameter

https://gerrit.wikimedia.org/r/738013

Change 738013 merged by Dzahn:

[operations/puppet@production] webperf::arclamp: turn log compression & expiry into parameters

https://gerrit.wikimedia.org/r/738013

Before reducing the compression age threshold:

Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        295G  268G   14G  96% /srv

After:

Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        295G  212G   70G  76% /srv