
Resolve arclamp disk exhaustion problem (Oct 2019)
Closed, ResolvedPublic

Description

This is in response to: T235425: webperf*002 running out of disk space (arc lamp, xhgui).


In T199853 we found that, with the current retention rules of 90 days (daily) and 14 days (hourly), we required a fairly stable amount of disk space: about 25G, and this hadn't changed over several months.

While overall backend traffic might increase over time, we had a 150G buffer in addition to that 25G (reserved for XHGui profiles for T180761, but we haven't gotten to that yet). Besides that buffer, there is also a plan to move this storage off the local disk and into Swift (T200108) and to increase retention much further (preferably to at least a year).

But, against all expectations, we are now in a situation where the same retention rules take up about 4X as much disk (~105G instead of ~25G).

I assume this is due to the php7-excimer sampling interval being much lower than it was with hhvm-xenon. I now realise this was mentioned by Tim beforehand at T205059, and I also observed it anecdotally during the HHVM-PHP7 migration at T187154#5471414.
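
To make the 4X figure concrete, here is a back-of-the-envelope model. It assumes log size scales with the number of samples collected, and the sampling periods used are hypothetical placeholders, not the actual hhvm-xenon or php7-excimer production settings:

# Rough model, not a measurement: if trace volume scales with the number of
# samples collected, log size is inversely proportional to the sampling period.
# The periods below are hypothetical placeholders.

def daily_log_size_gib(baseline_gib, baseline_period_s, new_period_s):
    """Estimate log footprint after a change in sampling period."""
    return baseline_gib * (baseline_period_s / new_period_s)

# A 4x shorter sampling period would explain a 4x larger footprint at the same
# retention (~25G -> ~100G), roughly what we observe.
print(daily_log_size_gib(25, 600, 150))  # -> 100.0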

Some ideas of what we could do (we probably need one or more of the following):

  • Reduce the php7-excimer sampling rate (i.e. sample less often)?
  • Shorten the Arc Lamp retention span? – This would go against our plan to increase the retention span. Its short length is already limiting its usefulness for investigating problems – T200108.
  • Increase disk space on the webperf*002 Ganeti VMs? – Was previously denied at T199853.
  • Implement support in Arc Lamp for compressed trace files ("logs"). – Even with compression we'd still be storing about 2X as much as before, but we'd be well within the available disk space, so no problem; that is, until we migrate XHGui to this server – T180761. A rough sketch of this approach follows the list.
  • Expedite the migration to let Arc Lamp store (older) logs in Swift, and/or migrate them by other means transparent to Arc Lamp (e.g. for files unchanged for more than a day, upload them to Swift and somehow overlay or rewrite the static file server).
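
A minimal sketch of the compression idea above, assuming a simple mtime-based sweep over the trace log directory. The directory path and the one-day threshold are assumptions, not the actual Arc Lamp configuration; the real implementation would live in Arc Lamp itself or in Puppet.

import gzip
import shutil
import time
from pathlib import Path

LOG_DIR = Path("/srv/xenon/logs")   # hypothetical path
MAX_AGE_S = 24 * 3600               # compress files not written to for a day

def compress_old_logs(log_dir=LOG_DIR, max_age_s=MAX_AGE_S):
    """Gzip trace logs that are no longer being appended to."""
    cutoff = time.time() - max_age_s
    for log in log_dir.rglob("*.log"):
        if log.stat().st_mtime < cutoff:
            gz = log.with_name(log.name + ".gz")
            with open(log, "rb") as src, gzip.open(gz, "wb") as dst:
                shutil.copyfileobj(src, dst)
            log.unlink()  # keep only the compressed copy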

Event Timeline

Short-term decision based on today's meeting: Temporarily reduce retention from 90 days to 45 days. Hopefully only for a few weeks.

Once that's done, this task will remain open until we decide how to get the retention back, e.g. with compression, more disk space, or Swift.
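
For reference, the retention reduction amounts to pruning daily logs past a cutoff age. A minimal sketch of that logic follows, assuming a flat log directory; the actual cleanup is configured through the arclamp Puppet module and the path here is an assumption.

import time
from pathlib import Path

RETAIN_DAYS = 45  # temporarily reduced from 90

def prune_old_logs(log_dir="/srv/xenon/logs"):  # hypothetical path
    """Delete trace logs whose mtime falls outside the retention window."""
    cutoff = time.time() - RETAIN_DAYS * 86400
    for log in Path(log_dir).glob("*.log*"):
        if log.stat().st_mtime < cutoff:
            log.unlink()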

Change 543931 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/puppet@production] Reduce arclamp .log file retention from 90 to 45 days

https://gerrit.wikimedia.org/r/543931

Change 543931 merged by CDanis:
[operations/puppet@production] Reduce arclamp .log file retention from 90 to 45 days

https://gerrit.wikimedia.org/r/543931

Emergency measure is now live. Keeping this task open until we've done what's necessary to restore our retention period.

Krinkle moved this task from Doing (old) to Inbox, needs triage on the Performance-Team board.

Next step will be compressing files. Hopefully that will give us enough space to restore our retention from 45 days to the 90 days we used to have until last week.

Krinkle changed the task status from Open to Stalled. Oct 23 2019, 7:33 PM
Krinkle moved this task from Inbox, needs triage to Blocked (old) on the Performance-Team board.

Increase disk space on the webperf*002 Ganeti VMs? – Was previously denied, at T199853.

I don't think that's true; from the looks of it, not only was it approved, it was implemented by yours truly.

We can do that again of course, but I have a hard time coming up with a number. Care to provide an estimate?

The request was for 500G per the task description; of this, 355G would be for Arc Lamp and 150G for XHGui.

We settled on 150G, planning to use it mainly for XHGui with ~25G for Arc Lamp, thereby cancelling the plan to increase perf data retention for Arc Lamp – to be accommodated by other means in "the future" via Swift (maybe).

What actually happened is that we migrated Arc Lamp and the ~25G of data it needed; we still haven't migrated XHGui; and the same retention length we had two years ago for Arc Lamp now seems to require ~250G instead of ~25G, due to various growth factors and improvements to the data pipeline. We still haven't increased retention.

Last week, following the alert (T235425), as a (hopefully very) short-term measure we have cut down our retention period even further (from 90d to 45d, whereas we want 2 years) so that we fit within the allotted 150G.

This task is about getting our retention back from 45d to 90d.
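
Rough arithmetic behind these numbers, using the ~250G estimate quoted above and assuming disk usage scales roughly linearly with the daily-log retention window (the 14 days of hourly logs are ignored for simplicity):

# Back-of-the-envelope, not a measurement.
estimated_90d_gib = 250           # estimate for 90-day retention, from above
per_day_gib = estimated_90d_gib / 90

print(per_day_gib * 45)           # ~125G: 45-day retention fits in the allotted 150G
print(per_day_gib * 90)           # ~250G: 90-day retention needs more space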

Ok, let's add another 150GB to achieve that.

Thu Oct 24 14:50:11 2019 Growing disk 1 of instance 'webperf2002.codfw.wmnet' by 150.0G to 300.0G

and

Thu Oct 24 14:49:03 2019 Growing disk 1 of instance 'webperf1002.eqiad.wmnet' by 150.0G to 300.0G

Both will take a couple of hours, plus a quick reboot of both VMs.

Mentioned in SAL (#wikimedia-operations) [2019-10-25T07:29:29Z] <akosiaris> reboot webperf2002 for disk resize T235455

Mentioned in SAL (#wikimedia-operations) [2019-10-25T07:35:36Z] <akosiaris> reboot webperf1002 for disk resize T235455

After the 2 reboots and a

webperf1002:~$ sudo resize2fs /dev/vdb
resize2fs 1.43.4 (31-Jan-2017)
Filesystem at /dev/vdb is mounted on /srv; on-line resizing required
old_desc_blocks = 19, new_desc_blocks = 38

on each host

we now have

webperf1002:~$ df -h /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        295G  128G  154G  46% /srv

and

webperf2002:~$ df -h /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        295G  128G  154G  46% /srv
Krinkle changed the task status from Stalled to Open. Oct 25 2019, 2:29 PM
Krinkle assigned this task to aaron.
Krinkle added a subscriber: aaron.

Signing back over to @aaron.

I compressed a sample log file from today to see what kind of compression ratios we could get:

aaron@SPECTRE-GRE3FQT:~$ time gzip -k 2020-01-27.excimer.api.log 

real	0m7,576s
user	0m7,250s
sys	0m0,266s
aaron@SPECTRE-GRE3FQT:~$ time lzma -k 2020-01-27.excimer.api.log

real	0m52,780s
user	0m51,594s
sys	0m0,922s
aaron@SPECTRE-GRE3FQT:~$ ls -lh
total 646M
-rwxrwxrwx 1 aaron aaron 592M janv. 28 15:44 2020-01-27.excimer.api.log
-rwxrwxrwx 1 aaron aaron  44M janv. 28 15:44 2020-01-27.excimer.api.log.gz
-rwxrwxrwx 1 aaron aaron  11M janv. 28 15:44 2020-01-27.excimer.api.log.lzma
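
Extrapolating those ratios to the whole dataset (a rough projection; the ratio from a single day's api log may not hold for every log, and the dataset size is the earlier estimate from this task):

# Ratios from the one-day sample above.
raw_mib, gzip_mib, lzma_mib = 592, 44, 11
gzip_ratio = raw_mib / gzip_mib      # ~13x
lzma_ratio = raw_mib / lzma_mib      # ~54x

# Applied naively to roughly 90 days of uncompressed logs.
dataset_gib = 250                    # estimate from earlier in this task
print(dataset_gib / gzip_ratio)      # ~19G with gzip
print(dataset_gib / lzma_ratio)      # ~4.6G with lzma (at ~7x the CPU time)
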
Krinkle lowered the priority of this task from High to Medium. Apr 21 2020, 5:52 PM
Krinkle changed the task status from Stalled to Open (edited). Jul 29 2020, 2:20 AM
Krinkle assigned this task to dpifke.

Unstalled per T235456. Should be safe to revert the above temporary measure to get us back up to 90 days.

Change 617201 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] arclamp: restore 90 day retention

https://gerrit.wikimedia.org/r/617201

Change 617201 merged by Dzahn:
[operations/puppet@production] arclamp: restore 90 day retention

https://gerrit.wikimedia.org/r/617201