
Resolve arclamp disk exhaustion problem (Oct 2019)
Closed, ResolvedPublic

Description

This is in response to: T235425: webperf*002 running out of disk space (arc lamp, xhgui).


In T199853 we found that, with the current retention rules of 90 days (daily) and 14 days (hourly), we required a fairly stable amount of disk space: about 25G, and this hadn't changed over several months.

While overall backend traffic might increase over time, we had a 150G buffer in addition to that 25G (reserved for XHGui profiles for T180761, but we haven't gotten to that yet). Besides that buffer, there is also a plan to move this storage off the local disk and into Swift (T200108) and to increase retention much further (preferably to at least a year).

But, against all expectations, we are now in a situation where the same retention rules take up about 4X as much disk (~105G instead of ~25G).

I assume this is due to the php7-excimer sampling interval being much lower than it was with hhvm-xenon. I now realise this was mentioned by Tim beforehand at T205059, and I also observed it anecdotally during the HHVM-PHP7 migration at T187154#5471414.
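
To make the 4X figure concrete, here is a back-of-the-envelope model. It assumes log size scales with the number of samples collected, and the sampling periods used are hypothetical placeholders, not the actual hhvm-xenon or php7-excimer production settings:

# Rough model, not a measurement: if trace volume scales with the number of
# samples collected, log size is inversely proportional to the sampling period.
# The periods below are hypothetical placeholders.

def daily_log_size_gib(baseline_gib, baseline_period_s, new_period_s):
    """Estimate log footprint after a change in sampling period."""
    return baseline_gib * (baseline_period_s / new_period_s)

# A 4x shorter sampling period would explain a 4x larger footprint at the same
# retention (~25G -> ~100G), roughly what we observe.
print(daily_log_size_gib(25, 600, 150))  # -> 100.0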

Some ideas of what we could do (we probably need one or more of the following):

  • Reduce the php7-excimer sampling rate (i.e. sample less often)?
  • Shorten the Arc Lamp retention span? – This would go against our plan to increase the retention span. Its short length is already limiting its usefulness for investigating problems – T200108.
  • Increase disk space on the webperf*002 Ganeti VMs? – Was previously denied at T199853.
  • Implement support in Arc Lamp for compressed trace files ("logs"). – Even with compression we'd still be storing about 2X as much as before, but we'd be well within the available disk space, so no problem; that is, until we migrate XHGui to this server – T180761. A rough sketch of this approach follows the list.
  • Expedite the migration to let Arc Lamp store (older) logs in Swift, and/or migrate them by other means transparent to Arc Lamp (e.g. for files unchanged for more than a day, upload them to Swift and somehow overlay or rewrite the static file server).
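
A minimal sketch of the compression idea above, assuming a simple mtime-based sweep over the trace log directory. The directory path and the one-day threshold are assumptions, not the actual Arc Lamp configuration; the real implementation would live in Arc Lamp itself or in Puppet.

import gzip
import shutil
import time
from pathlib import Path

LOG_DIR = Path("/srv/xenon/logs")   # hypothetical path
MAX_AGE_S = 24 * 3600               # compress files not written to for a day

def compress_old_logs(log_dir=LOG_DIR, max_age_s=MAX_AGE_S):
    """Gzip trace logs that are no longer being appended to."""
    cutoff = time.time() - max_age_s
    for log in log_dir.rglob("*.log"):
        if log.stat().st_mtime < cutoff:
            gz = log.with_name(log.name + ".gz")
            with open(log, "rb") as src, gzip.open(gz, "wb") as dst:
                shutil.copyfileobj(src, dst)
            log.unlink()  # keep only the compressed copy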

Event Timeline

Short-term decision based on today's meeting: Temporarily reduce retention from 90 days to 45 days. Hopefully only for a few weeks.

Once that's done, this task will remain open until we decide how to get the retention back, e.g. with compression, more disk space, or Swift.
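
For reference, the retention reduction amounts to pruning daily logs past a cutoff age. A minimal sketch of that logic follows, assuming a flat log directory; the actual cleanup is configured through the arclamp Puppet module and the path here is an assumption.

import time
from pathlib import Path

RETAIN_DAYS = 45  # temporarily reduced from 90

def prune_old_logs(log_dir="/srv/xenon/logs"):  # hypothetical path
    """Delete trace logs whose mtime falls outside the retention window."""
    cutoff = time.time() - RETAIN_DAYS * 86400
    for log in Path(log_dir).glob("*.log*"):
        if log.stat().st_mtime < cutoff:
            log.unlink()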

Change 543931 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/puppet@production] Reduce arclamp .log file retention from 90 to 45 days

https://gerrit.wikimedia.org/r/543931

Change 543931 merged by CDanis:
[operations/puppet@production] Reduce arclamp .log file retention from 90 to 45 days

https://gerrit.wikimedia.org/r/543931

Emergency measure is now live. Keeping this task open until we've done what's necessary to restore our retention period.

Krinkle moved this task from Doing (old) to Inbox, needs triage on the Performance-Team board.

Next step will be compressing files. Hopefully that will give us enough space to restore our retention from 45 days to the 90 days we used to have until last week.

Krinkle changed the task status from Open to Stalled. Oct 23 2019, 7:33 PM
Krinkle moved this task from Inbox, needs triage to Blocked (old) on the Performance-Team board.

Increase disk space on the webperf*002 Ganeti VMs? – Was previously denied, at T199853.

I don't think that's true; from the looks of it, not only was it approved, it was implemented by yours truly.

We can do that again of course, but I have a hard time coming up with a number. Care to provide an estimate?

The request was for 500G per the task description; of this, 355G would be for Arc Lamp and 150G for XHGui.

We settled on 150G, planning to use it mainly for XHGui with ~25G for Arc Lamp, thereby cancelling the plan to increase perf data retention for Arc Lamp – to be accommodated by other means in "the future" via Swift (maybe).

What actually happened is that we migrated Arc Lamp and the ~25G of data it needed; we still haven't migrated XHGui; and the same retention length we had two years ago for Arc Lamp now seems to require ~250G instead of ~25G, due to various growth factors and improvements to the data pipeline. We still haven't increased retention.

Last week, following the alert (T235425), as a (hopefully very) short-term measure we have cut down our retention period even further (from 90d to 45d, whereas we want 2 years) so that we fit within the allotted 150G.

This task is about getting our retention back from 45d to 90d.
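
Rough arithmetic behind these numbers, using the ~250G estimate quoted above and assuming disk usage scales roughly linearly with the daily-log retention window (the 14 days of hourly logs are ignored for simplicity):

# Back-of-the-envelope, not a measurement.
estimated_90d_gib = 250           # estimate for 90-day retention, from above
per_day_gib = estimated_90d_gib / 90

print(per_day_gib * 45)           # ~125G: 45-day retention fits in the allotted 150G
print(per_day_gib * 90)           # ~250G: 90-day retention needs more space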

Ok, let's add another 150GB to achieve that.

Thu Oct 24 14:50:11 2019 Growing disk 1 of instance 'webperf2002.codfw.wmnet' by 150.0G to 300.0G

and

Thu Oct 24 14:49:03 2019 Growing disk 1 of instance 'webperf1002.eqiad.wmnet' by 150.0G to 300.0G

Both will take a couple of hours, plus a quick reboot of both VMs.

Mentioned in SAL (#wikimedia-operations) [2019-10-25T07:29:29Z] <akosiaris> reboot webperf2002 for disk resize T235455

Mentioned in SAL (#wikimedia-operations) [2019-10-25T07:35:36Z] <akosiaris> reboot webperf1002 for disk resize T235455

After the 2 reboots and a

webperf1002:~$ sudo resize2fs /dev/vdb
resize2fs 1.43.4 (31-Jan-2017)
Filesystem at /dev/vdb is mounted on /srv; on-line resizing required
old_desc_blocks = 19, new_desc_blocks = 38

on each host

we now have

webperf1002:~$ df -h /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        295G  128G  154G  46% /srv

and

webperf2002:~$ df -h /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        295G  128G  154G  46% /srv
Krinkle changed the task status from Stalled to Open. Oct 25 2019, 2:29 PM
Krinkle assigned this task to aaron.
Krinkle added a subscriber: aaron.

Signing back over to @aaron.

I compressed a sample log file from today to see what kind of compression ratios we could get:

aaron@SPECTRE-GRE3FQT:~$ time gzip -k 2020-01-27.excimer.api.log 

real	0m7,576s
user	0m7,250s
sys	0m0,266s
aaron@SPECTRE-GRE3FQT:~$ time lzma -k 2020-01-27.excimer.api.log

real	0m52,780s
user	0m51,594s
sys	0m0,922s
aaron@SPECTRE-GRE3FQT:~$ ls -lh
total 646M
-rwxrwxrwx 1 aaron aaron 592M janv. 28 15:44 2020-01-27.excimer.api.log
-rwxrwxrwx 1 aaron aaron  44M janv. 28 15:44 2020-01-27.excimer.api.log.gz
-rwxrwxrwx 1 aaron aaron  11M janv. 28 15:44 2020-01-27.excimer.api.log.lzma
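
Extrapolating those ratios to the whole dataset (a rough projection; the ratio from a single day's api log may not hold for every log, and the dataset size is the earlier estimate from this task):

# Ratios from the one-day sample above.
raw_mib, gzip_mib, lzma_mib = 592, 44, 11
gzip_ratio = raw_mib / gzip_mib      # ~13x
lzma_ratio = raw_mib / lzma_mib      # ~54x

# Applied naively to roughly 90 days of uncompressed logs.
dataset_gib = 250                    # estimate from earlier in this task
print(dataset_gib / gzip_ratio)      # ~19G with gzip
print(dataset_gib / lzma_ratio)      # ~4.6G with lzma (at ~7x the CPU time)
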
Krinkle lowered the priority of this task from High to Medium. Apr 21 2020, 5:52 PM
Krinkle changed the task status from Stalled to Open (edited). Jul 29 2020, 2:20 AM
Krinkle assigned this task to dpifke.

Unstalled per T235456. Should be safe to revert the above temporary measure to get us back up to 90 days.

Change 617201 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] arclamp: restore 90 day retention

https://gerrit.wikimedia.org/r/617201

Change 617201 merged by Dzahn:
[operations/puppet@production] arclamp: restore 90 day retention

https://gerrit.wikimedia.org/r/617201