Page MenuHomePhabricator

webperf*002 running out of disk space (arc lamp, xhgui)
Closed, ResolvedPublic

Description

webperf1002 and webperf2002 have alerted today due to low disk space

Details

Related Gerrit Patches:

Event Timeline

jijiki created this task.Oct 14 2019, 11:46 AM
Restricted Application added subscribers: Gilles, Aklapper. · View Herald TranscriptOct 14 2019, 11:46 AM
Gilles triaged this task as High priority.Oct 14 2019, 12:16 PM

My guesses:

  • higher sampling rate from php7-excimer (compared to hhvm/xenon) is creating significantly larger files than we projected for in T199853.
  • general increase in backend traffic over the years (unlikely).
  • the now-redundant hhvm/xenon listener of arc lamp is misbehaving and doing more than waste cpu, and creating empty files and/or preventing old files from expiring (this storing not just N days of PHP7 but also the last N days of hhvm from before it stopped getting traffic, making it 2X).

While first and last point are not unlikely, it surprises me that even if both are true to cause disk space issues since we projected also for storing XHGui profiles which take up much more space and we haven't even started doing that yet.

Krinkle renamed this task from webperf* running out of disk space to webperf*002 running out of disk space (arc lamp, xhgui).Oct 14 2019, 3:48 PM
Dzahn added a subscriber: Dzahn.Oct 14 2019, 5:41 PM

looking at them now i see they are only using 14% and 8% of / . I ran "apt-get clean" and now it's down to 12% and 6%. Alerting would be at 95% by default. So looks like somebody (or something like a cron?) already deleted stuff.

elukey added a subscriber: elukey.Oct 14 2019, 5:44 PM

I noticed the warning this morning as well and pinged Gilles on IRC, the issue seems to be in the /srv partition:

/dev/vdb        147G  135G  4.4G  97% /srv

More specifically:

elukey@webperf1002:/srv/xenon/logs$ du -hs *
105G	daily
30G	hourly
[..]
562M	2019-09-19.excimer.all.log
598M	2019-09-21.excimer.all.log
600M	2019-09-18.excimer.all.log
601M	2019-10-09.excimer.index.log
613M	2019-10-10.excimer.index.log
640M	2019-09-25.excimer.index.log
642M	2019-10-11.excimer.index.log
643M	2019-09-22.excimer.all.log
645M	2019-10-12.excimer.index.log
684M	2019-10-13.excimer.index.log
817M	2019-09-23.excimer.all.log
1.1G	2019-10-14.excimer.all.log
1.3G	2019-09-24.excimer.all.log
1.3G	2019-09-27.excimer.all.log
1.3G	2019-09-28.excimer.all.log
1.3G	2019-09-29.excimer.all.log
1.3G	2019-09-30.excimer.all.log
1.3G	2019-10-01.excimer.all.log
1.3G	2019-10-02.excimer.all.log
1.3G	2019-10-03.excimer.all.log
1.3G	2019-10-04.excimer.all.log
1.3G	2019-10-05.excimer.all.log
1.3G	2019-10-06.excimer.all.log
1.3G	2019-10-07.excimer.all.log
1.4G	2019-09-25.excimer.all.log
1.4G	2019-09-26.excimer.all.log
1.4G	2019-10-08.excimer.all.log
1.4G	2019-10-10.excimer.all.log
1.4G	2019-10-11.excimer.all.log
1.4G	2019-10-12.excimer.all.log
1.5G	2019-10-09.excimer.all.log
1.5G	2019-10-13.excimer.all.log
Dzahn added a comment.Oct 14 2019, 5:53 PM

sorry, i was on 1001 and 2001 vs. 1002 and 2002 and was wondering why i don't even see /srv mounted on a separate device. yes, ACK. on 1002 / 2002 it's the xenon logs.

Mentioned in SAL (#wikimedia-operations) [2019-10-14T17:56:12Z] <mutante> webperf2002 - /srv/xenon/logs/daily# gzip 2019-09*excimer*.log (T235425)

@Dzahn These are not typical "log" files, they are active application data for arc lamp, public on perf.wm.o and consumed or reacted to by automated processes. Changing their filename and content via gzip, is therefore effectively the same as deleting them, with the same side-effects.

Flame graph (SVG artefacts) were auto-deleted, and the log files were no longer accessible to tooling through the web interface (due to not matching the expected URL format).

Details:
We have a cronjob that expires old log files, and another cronjob that auto-deletes SVG artefacts (flame graphs) that no longer have an associated log file. As such, the gzipping of September's log files meant the cronjob just deleted all of September's flame graphs from performance.wikimedia.org. It also means external tools (like Toolforge/CoverMe and CLI tools can't analyse or and compare log from September as the compressed URLs are not expected.

With some manual work, we can decompress the log files and re-generate the flame graph (SVG files) by running flamegraph.pl. No big deal. Although, that will also fill up the disk again, so we can't actually do that. Instead, we'll need gradually delete single day files of all directories together (e.g. 1 day of September from logs/daily, log/hourly, svg/daily and svg/hourly). And free up space that way. That means permanent deletion, which I understand is not something you'd want to do without Perf approval, especially since we don't have any backups (other than eqiad-codfw redundancy).

As perhaps a weird compromise, for future reference, perhaps better I think to let the server fill up the disk and be unable to write new data. I'd prefer to have 2 months of perf data with today/tomorrow missing forever, than to lose last month forever. (We didn't lose anything today, so no worries, I'm comparing to as to explain my preference.)

Change 543005 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] webperf: add backups for arclamp application data

https://gerrit.wikimedia.org/r/543005

Mentioned in SAL (#wikimedia-operations) [2019-10-14T21:12:52Z] <Krinkle> Delete misc arclamp/logs and arclamp/svgs data from between 2018 and and 2019-08 on webperf1002/webperf2002, T235425

Mentioned in SAL (#wikimedia-operations) [2019-10-14T21:47:54Z] <Krinkle> Deleting 2019-09-01––2019-09-10 arclamp logs on webperf2002, and decompress the rest of 2019-09, T235425

Mentioned in SAL (#wikimedia-operations) [2019-10-14T23:26:59Z] <Krinkle> Delete 2019-09-01––2019-09-10 arclamp trace logs from webperf1002, and decompress the rest of 2019-09 (this will trigger svg re-generation), T235425

Krinkle added a comment.EditedOct 15 2019, 8:40 AM

Immediate issue is resolved. But this is expected to alert again in 5-10 days unless a few days' worth are deleted again by then. To be continued at T235455: Resolve arclamp disk exhaustion problem (Oct 2019).

Krinkle claimed this task.Oct 15 2019, 8:41 AM
Gilles moved this task from Inbox to Doing on the Performance-Team board.Oct 15 2019, 7:57 PM
Krinkle closed this task as Resolved.Oct 15 2019, 9:22 PM

Change 543005 merged by Dzahn:
[operations/puppet@production] webperf: add backups for arclamp application data

https://gerrit.wikimedia.org/r/543005