
webperf1002 server close to have /srv partition full
Closed, Resolved. Public.

Event Timeline

Krinkle triaged this task as High priority. Jul 14 2020, 4:05 PM

Further zoomed out:

https://grafana.wikimedia.org/d/000000377/host-overview?panelId=12&fullscreen&orgId=1&from=now-6M&to=now&var-server=webperf1002&var-datasource=eqiad%20prometheus%2Fops&var-cluster=webperf

Screenshot 2020-07-14 at 17.04.39.png (1×2 px, 278 KB)

Maybe the flame graph size isn't as constant as it once was. Or maybe it's not the flame graphs taking up the space?

Krinkle renamed this task from webpref1002 server close to have /srv partition full to webperf1002 server close to have /srv partition full. Jul 14 2020, 4:06 PM
Krinkle assigned this task to dpifke.
Krinkle added a project: Arc-Lamp.

The growth is all in the raw log files (~270GB of logs, ~360MB of SVGs, ~310MB for XHGui), which seem to be increasing in size as of the start of this month:

Size of daily_ .excimer.all.log in bytes.png (371×600 px, 12 KB)

The additional usage is around 500MB/day compared to a month ago, and we currently have about 11GB free, so we have some time before this becomes critical.
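As a rough check on how much time that leaves, the headroom can be estimated by dividing free space by daily growth. This is only a sketch: it assumes the ~500MB/day figure is the net growth rate (new logs minus expired logs) and that it stays constant, neither of which is confirmed here. The function name is made up for illustration.

```python
def days_until_full(free_bytes: float, net_growth_bytes_per_day: float) -> float:
    """Return how many days of headroom remain at a constant net growth rate."""
    if net_growth_bytes_per_day <= 0:
        # No net growth means the disk never fills under this simple model.
        return float("inf")
    return free_bytes / net_growth_bytes_per_day


GB = 10**9
MB = 10**6

# Figures quoted above: about 11 GB free, about 500 MB/day of extra usage.
print(days_until_full(11 * GB, 500 * MB))  # prints 22.0
```

Under those assumptions the partition would last roughly three weeks; any further increase in daily log size shortens that.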

Splitting out T235456 from the move to Swift for this data is probably the best short-term solution for this.

free space: /srv 7134 MB / Usage: 98.2% It will run out of space in 3 days.

Is there an easy way to grow this partition? Another 50-100GB would see us through until this data starts living in Swift later this quarter.

(Patches to support compressed log files are also coming shortly.)

Adding a second virtual disk and mounting it is relatively easy. Resizing the existing partition is not so much.

ArcLamp is a batch process, so brief downtime while we move the data over onto a new partition isn't a big deal. We could even shut down the instance if needed.

I am adding a second virtual disk right now... in progress....

I got disconnected while the command was still running, and I had not started it in screen. And now "gnt-instance info" just hangs... ugh...

We were down to about 5 GB free before the daily expiration took place, and about 10 GB immediately after.

To ensure this won't be a problem over the weekend, I expired a couple of extra days of logs. ~27 GB is currently free, which is more than enough to last us until either my patch to add compressed log support is merged or we can get things copied over to a larger /srv partition.
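For illustration, picking out files for early expiry could look something like the sketch below. It is not the Arc Lamp expiry code: the function name is hypothetical, and selecting by modification time is an assumption (the real job may go by the date in the file name). It only lists candidates and deletes nothing.

```python
import time
from pathlib import Path


def logs_older_than(directory, max_age_days, now=None):
    """List regular files in `directory` last modified before the cutoff.

    Selecting by mtime is an assumption made for this sketch.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )
```

Calling it with a slightly smaller `max_age_days` than the 45-day retention gives the list of files that expiring "a couple of extra days" would remove, which can be reviewed before anything is deleted.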

An additional disk with just 20G has been created but it took forever and it won't help with this situation.

Given the situation, it would be smart to start compressing older logs.

I will not act today but I think adding a simple systemd timer that fires daily and compresses all logs older than 10 days might be a good interim solution here.

@Joe: if you gzip the logs before https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/613740/ lands, they won't be expired, the related output SVG files will be immediately deleted, and the logs will be inaccessible to arclamp-grep and other tools. Please don't do that; better to expire older files early.

@Dzahn: Any progress on figuring out why Ganeti is hanging when trying to add a virtual disk?

> Given the situation, it would be smart to start compressing older logs.

For the avoidance of doubt, these are not "log" files. This directory holds trace logs, which are the primary data store of the service that this VM exists for. Compressing these ad hoc is not supported and is not functionally different from, e.g., hand-compressing internal Swift or MariaDB files. There is no expectation of a functioning system afterwards.

Support for compression was filed as T235456 last year. After T235425, this was pushed back in favour of reducing retention from 90 days to 45 days as a "temporary" measure while we move these files off the local disk into Swift, thus making Arc Lamp more stateless.

However, due to XHGui and other work taking priority, this isn't finished yet and the disk is now full with only 45 days of trace logs. So T235456 is now being picked up again (as a subtask of this task).

I will just acknowledge the alert given you are tracking the situation.

Btw, if I remember correctly, those log files are used to regenerate SVGs asynchronously via a script, which is not exactly the same as a database. I assumed the script would be modified as well if we added a systemd timer.

BTW:

webperf2002 Disk space WARNING 2020-07-23 09:12:15 0d 0h 38m 11s 3/3 DISK WARNING - free space: /srv 20077 MB (6% inode=99%):

Back to DISK CRITICAL - free space: /srv 2946 MB (1% inode=99%):
ACKing the alert.
Edit: the above is for webperf1002

Dave Pifke zipped some files and is working on a patch to make arclamp support gzipped files.
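To illustrate what reading gzipped trace logs can involve on the reading side, here is a minimal sketch of opening a file transparently whether or not it is compressed. It is not the actual Arc Lamp patch: the function names are hypothetical, and detecting compression by a `.gz` suffix is an assumption.

```python
import gzip


def open_trace_log(path):
    """Open a trace log for reading as text, whether or not it is gzipped."""
    path = str(path)
    if path.endswith(".gz"):
        # "rt" makes gzip.open yield decoded text lines, like open() does.
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def count_lines(path):
    """Count lines in a plain or gzipped trace log."""
    with open_trace_log(path) as handle:
        return sum(1 for _ in handle)
```

Tools that go through a helper like this keep working after files are compressed, which is why compressing by hand before such support lands breaks expiry and grep-style tooling.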

Mentioned in SAL (#wikimedia-operations) [2020-07-25T00:46:30Z] <mutante> ganeti - removing disk 3 (20G) from webperf1002. the disks are 0-indexed, so the ones actually mounted are 0 (50G) and 1 (300G) (T257931)

Mentioned in SAL (#wikimedia-operations) [2020-07-25T01:52:47Z] <mutante> ganeti - also removing (unmounted) disk 2 (100G) from webperf1002. T257931

Change 616179 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] arclamp: run arclamp-compress-logs from cron

https://gerrit.wikimedia.org/r/616179

Change 616179 merged by Dzahn:
[operations/puppet@production] arclamp: run arclamp-compress-logs from cron

https://gerrit.wikimedia.org/r/616179

Change 616634 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] arclamp: send stderr of new arclamp compress cron job to a logfile

https://gerrit.wikimedia.org/r/616634

Change 616635 had a related patch set uploaded (by Dave Pifke; owner: Dave Pifke):
[operations/puppet@production] arclamp: fix cron specification

https://gerrit.wikimedia.org/r/616635

Change 616635 merged by Dzahn:
[operations/puppet@production] arclamp: fix cron specification

https://gerrit.wikimedia.org/r/616635

Change 616634 abandoned by Dzahn:
[operations/puppet@production] arclamp: send stderr of new arclamp compress cron job to a logfile

Reason:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/616635

https://gerrit.wikimedia.org/r/616634