Resolve arclamp disk exhaustion problem (Oct 2019)
Open, High · Public

Description

This is in response to: T235425: webperf*002 running out of disk space (arc lamp, xhgui).


In T199853 we found that, with the current retention rules of 90 days (daily) and 14 days (hourly), we required a fairly stable amount of disk space: about 25G, and this hadn't changed over several months.

While overall backend traffic might increase over time, we had a buffer of 150G in addition to that 25G (reserved for XHGui profiles for T180761, but we haven't gotten to that yet). Beyond that buffer, there is also the plan to move this storage off the local disk and into Swift (T200108), and to increase retention much further (preferably a year at least).

But, against all expectations, we are now in a situation where the same retention rules take up about 4X as much disk space (~105G instead of ~25G).

I assume this is due to the php7-excimer sampling interval being much lower than it was with hhvm-xenon, which means more samples are recorded per unit of time. I now realise this was mentioned by Tim beforehand at T205059, and I also observed this anecdotally during the HHVM-PHP7 migration at T187154#5471414.

Some ideas of what we could do (we probably need one or more of the following):

  • Increase the php7-excimer sampling interval (i.e. sample less often)?
  • Shorten the Arc Lamp retention span? – This would go against our plan to increase the retention span. Its short length already limits its usefulness for investigating problems – T200108.
  • Increase disk space on the webperf*002 Ganeti VMs? – Was previously denied, at T199853.
  • Implement support in Arc Lamp for compressed trace files ("logs"). – Even with compression, we'd still store 2X as much as before, but we'd be well within the available disk space, so no problem. That is, until we migrate XHGui to this server – T180761.
  • Expedite migration to let Arc Lamp store (older) logs in Swift and/or migrate them by other means transparently to Arc Lamp (e.g. for files unchanged for more than 1 day, upload to Swift and somehow overlay or rewrite the static file server).
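
To make the compressed trace files idea more concrete, here is a minimal sketch of compressing log files that have not changed for more than a day. It is not how Arc Lamp does or will do this; the function name `compress_old_logs`, the `*.log` glob, the choice of gzip, and the one-day threshold are all assumptions made for illustration.

```python
import gzip
import os
import shutil
import time
from pathlib import Path


def compress_old_logs(log_dir, min_age_seconds=86400, now=None):
    """Gzip every *.log file under log_dir not modified for min_age_seconds.

    Hypothetical helper, not part of Arc Lamp.
    Returns the list of .gz paths that were created.
    """
    now = time.time() if now is None else now
    created = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        st = path.stat()
        if now - st.st_mtime < min_age_seconds:
            continue  # recently written, possibly still being appended to
        target = path.with_name(path.name + ".gz")
        with open(path, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Carry over the original mtime so age-based retention still works.
        os.utime(target, (st.st_atime, st.st_mtime))
        path.unlink()
        created.append(target)
    return created
```

Anything that reads the logs afterwards (the flame graph generation, the static file server) would need to understand the `.gz` files, which is the part that actually requires support in Arc Lamp.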

Event Timeline

Krinkle created this task. Oct 14 2019, 10:30 PM
Restricted Application added a subscriber: Aklapper. Oct 14 2019, 10:30 PM
Gilles added a subscriber: Gilles. Oct 15 2019, 9:05 PM

Additional possibility: handling the compression at the OS level.
E.g. https://btrfs.wiki.kernel.org/index.php/Compression

Krinkle claimed this task. Oct 15 2019, 9:24 PM

Short-term decision based on today's meeting: Temporarily reduce retention from 90 days to 45 days. Hopefully only for a few weeks.

Once done, this task will remain open until we decide how to get the retention back. E.g. with compression, or more disk space, or Swift, etc.
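
For illustration only, a retention cut like this boils down to deleting files older than a cutoff. The sketch below is not the production configuration; `prune_logs`, the `*.log` glob, and using file mtime as the age are assumptions.

```python
import time
from pathlib import Path


def prune_logs(log_dir, retention_days, now=None, dry_run=False):
    """Delete *.log files whose mtime is older than retention_days.

    Hypothetical helper. Returns the paths that were removed
    (or that would be removed when dry_run is true).
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    removed = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        if path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Calling it with `dry_run=True` first shows what a change from 90 to 45 days would delete before anything is actually removed.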

Krinkle moved this task from Inbox to Doing on the Performance-Team board. Oct 15 2019, 9:24 PM
Krinkle triaged this task as High priority. Oct 16 2019, 11:02 PM

Change 543931 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/puppet@production] Reduce arclamp .log file retention from 90 to 45 days

https://gerrit.wikimedia.org/r/543931

Krinkle updated the task description. Oct 17 2019, 8:59 PM

Change 543931 merged by CDanis:
[operations/puppet@production] Reduce arclamp .log file retention from 90 to 45 days

https://gerrit.wikimedia.org/r/543931

Emergency measure is now live. Keeping task open until we've done what's necessary to restore our retention period.

Krinkle removed Krinkle as the assignee of this task. Oct 20 2019, 1:18 AM
Krinkle moved this task from Doing to Inbox on the Performance-Team board.

Next step will be compressing files. Hopefully that will give us enough space to restore our retention from 45 days to the 90 days we used to have until last week.

Krinkle changed the task status from Open to Stalled. Wed, Oct 23, 7:33 PM
Krinkle moved this task from Inbox to Blocked or Needs-CR on the Performance-Team board.
akosiaris added a subscriber: akosiaris. Edited · Thu, Oct 24, 11:01 AM

Increase disk space on the webperf*002 Ganeti VMs? – Was previously denied, at T199853.

I don't think that's true, from the looks of it, not only it was approved, but it was implemented by yours truly.

We can do that again ofc, but I have a hard time coming up with a number. Care to provide an estimation?

Krinkle added a comment. Edited · Thu, Oct 24, 2:22 PM

Increase disk space on the webperf*002 Ganeti VMs? – Was previously denied, at T199853.

I don't think that's true, from the looks of it, not only it was approved, but it was implemented by yours truly.
We can do that again ofc, but I have a hard time coming up with a number. Care to provide an estimation?

The request was for 500G per the task description. Of this, 350G would be for Arc Lamp and 150G for XHGui.

We settled for 150G, planning to use it mainly for XHGui, with ~25G for Arc Lamp. That meant cancelling the plan to increase perf data retention for Arc Lamp, which would instead be accommodated by other means in "the future" via Swift (maybe).

What actually happened is that we migrated Arc Lamp and the 25G of data it needed, and we still haven't migrated XHGui. Meanwhile, the same retention length we had 2 years ago for Arc Lamp now seems to require ~250G instead of ~25G, due to various growth factors and improvements to the data pipeline. We still haven't increased retention.

Last week, following the alert (T235425), as a (hopefully very) short-term measure we have cut down our retention period even further (from 90d to 45d, whereas we want 2 years) so that we fit within the allotted 150G.

This task is about getting our retention back from 45d to 90d.

This task is about getting our retention back from 45d to 90d.

Ok, let's add another 150GB to achieve that.

Thu Oct 24 14:50:11 2019 Growing disk 1 of instance 'webperf2002.codfw.wmnet' by 150.0G to 300.0G

and

Thu Oct 24 14:49:03 2019 Growing disk 1 of instance 'webperf1002.eqiad.wmnet' by 150.0G to 300.0G

Both will require a couple of hours, plus a quick reboot of each VM.

Mentioned in SAL (#wikimedia-operations) [2019-10-25T07:29:29Z] <akosiaris> reboot webperf2002 for disk resize T235455

Mentioned in SAL (#wikimedia-operations) [2019-10-25T07:35:36Z] <akosiaris> reboot webperf1002 for disk resize T235455

After the 2 reboots and a

webperf1002:~$ sudo resize2fs /dev/vdb
resize2fs 1.43.4 (31-Jan-2017)
Filesystem at /dev/vdb is mounted on /srv; on-line resizing required
old_desc_blocks = 19, new_desc_blocks = 38

on each host

we now have

webperf1002:~$ df -h /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        295G  128G  154G  46% /srv

and

webperf2002:~$ df -h /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb        295G  128G  154G  46% /srv
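
As a rough sanity check on whether the new size is enough to go back to 90 days, the figures above can be extrapolated linearly. This is only a sketch: it assumes the 128G in use reflects the 45-day retention, that nothing else significant lives on /srv, and that usage scales in proportion to the retention length (the hourly logs have their own 14-day retention, so this likely overestimates). The helper name `projected_usage_gb` is made up for the example.

```python
def projected_usage_gb(used_gb, current_days, target_days):
    """Linearly scale current disk usage to a different retention length."""
    return used_gb * target_days / current_days


# Figures from the df output above; 45 and 90 are retention lengths in days.
size_gb = 295
used_gb = 128

projected = projected_usage_gb(used_gb, current_days=45, target_days=90)
headroom = size_gb - projected

print(f"projected: {projected:.0f}G, headroom: {headroom:.0f}G")
# prints "projected: 256G, headroom: 39G"
```

Under those assumptions 90 days would fit, but without much room left for XHGui or further growth.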
Krinkle changed the task status from Stalled to Open. Fri, Oct 25, 2:29 PM
Krinkle assigned this task to aaron.
Krinkle added a subscriber: aaron.

Signing back over to @aaron.