
mwlog1001 is running out of free space on /srv/mw-log
Closed, Resolved (Public)

Description

As of writing:

/dev/mapper/mwlog1001--vg-data  7.9T  7.4T  117G  99% /srv

Compared to {T264639}, in October 2020:

/dev/mapper/mwlog1001--vg-data 7.2T 5.1T 1.8T 74% /srv

And also the Icinga alert as of 00:19, April 25, 2021:

[00:19:50] <icinga-wm> PROBLEM - Disk space on mwlog1001 is CRITICAL: DISK CRITICAL - free space: /srv 274062 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops

The last week or so has seen quite an increase in disk space usage on the host:

Screenshot 2021-04-25 at 04.32.53.png (1×2 px, 220 KB)

Screenshot 2021-04-25 at 04.33.05.png (1×2 px, 225 KB)

As per T264639, mwlog1002 is procured and in service. What's the plan for decommissioning mwlog1001? I don't see a ticket, so I'm not sure if it's just going to mostly sit around until there's a reason to take it out of service.

It does seem mwlog1002 has seen a similar increase in disk space usage, but has quite a bit more total (and therefore free) space (for now).

Screenshot 2021-04-25 at 04.35.54.png (1×2 px, 227 KB)

I'm guessing something has changed with the amount (frequency, data size) of stuff being logged to disk.

Event Timeline

Reedy updated the task description.

If people move stuff off of /srv/security we could get .5T back which would be helpful. Some of those files are from a few years ago.

The big spender in /srv/mw-log is ExternalStore.log, which uses 864G. It logs every cache miss from fetchBlob. Perhaps we could forgo this until space issues are worked out?
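For anyone wanting to reproduce that kind of finding, here is a minimal Python sketch of listing the largest files under a directory. The helper names `largest_files` and `human` are made up for illustration and are not existing tooling on the mwlog hosts. It reports apparent file size (`st_size`), which can differ from the on-disk usage that `du` shows, for example for sparse files.

```python
import heapq
import os


def largest_files(root, top_n=10):
    """Return up to top_n (size_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks, so a link to a big
                # file is not counted as a second big file.
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                # The file was rotated away or is unreadable; skip it.
                continue
    return heapq.nlargest(top_n, sizes)


def human(size_bytes):
    """Format a byte count with binary units, e.g. 1024 -> '1.0K'."""
    size = float(size_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


# Example use (path taken from this task):
# for size, path in largest_files("/srv/mw-log"):
#     print(human(size), path)
```

Walking a multi-terabyte log tree this way is I/O heavy, so it is probably best run off-peak.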

mwlog1002 is happier because its /srv is rather larger, but the volumes on mwlog1001 look like they cannot be expanded.

> If people move stuff off of /srv/security we could get .5T back which would be helpful. Some of those files are from a few years ago.

To where, though? I imagine a lot of that can probably be deleted (or at least compressed). But unless we moved it to, say, mwlog1002 (for now), we don't really have anywhere to keep it (T246954 stalled out).
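If compressing in place turns out to be the preferred option, a rough Python sketch of gzipping files older than a given age could look like the following. The function name `compress_old_files` and the age threshold are assumptions for illustration, not an agreed retention policy. It would need to run as a user with write access to the files, and each file temporarily needs room for its compressed copy before the original is removed.

```python
import gzip
import os
import shutil
import time


def compress_old_files(root, min_age_days, now=None):
    """Gzip regular files under root whose mtime is older than min_age_days.

    Returns the list of .gz paths that were created.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    compressed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".gz"):
                continue  # already compressed
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or os.path.getmtime(path) > cutoff:
                continue  # skip symlinks and recently modified files
            gz_path = path + ".gz"
            if os.path.exists(gz_path):
                continue  # never overwrite an existing archive
            with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            # Keep the original timestamps and mode on the archive,
            # then remove the uncompressed original.
            shutil.copystat(path, gz_path)
            os.remove(path)
            compressed.append(gz_path)
    return compressed
```

For example, `compress_old_files("/srv/security", 365)` would target files untouched for about a year. Whether that is the right cutoff is a question for the file owners.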

And all files are owned by root:root, so I mostly can't do anything (only Chase had root).

I can bring it up at the team meeting next week to see if we need/want to keep this.

> The big spender in /srv/mw-log is ExternalStore.log, which uses 864G. It logs every cache miss from fetchBlob. Perhaps we could forgo this until space issues are worked out?

It doesn't seem like that config has changed any time recently, but at the same time, I honestly can't see why we're logging (to disk) every cache hit and miss at that rate, unless it's used for some analysis elsewhere.

We can certainly just change the log level to "error" rather than "debug". Tagging Perf/Platform (and will stick up a changeset).
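To illustrate why raising the minimum level shrinks the log, here is a small sketch using Python's standard `logging` module as an analogy. The real change lives in the PHP mediawiki-config repository and is not reproduced here. The logger name and the three messages below are hypothetical stand-ins, not the actual ExternalStore log lines.

```python
import io
import logging


def emitted_lines(min_level):
    """Log two debug messages and one warning; return what gets written."""
    buf = io.StringIO()
    handler = logging.StreamHandler(buf)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    # A separate logger per level keeps repeated calls independent.
    logger = logging.getLogger(f"ExternalStore.demo.{min_level}")
    logger.setLevel(min_level)
    logger.propagate = False
    logger.addHandler(handler)

    # Hypothetical messages standing in for per-request cache chatter.
    logger.debug("cache miss for blob (hypothetical message)")
    logger.debug("cache hit for blob (hypothetical message)")
    logger.warning("blob fetch failed (hypothetical message)")

    logger.removeHandler(handler)
    return buf.getvalue().splitlines()


# With a DEBUG threshold all three messages reach the log; with a
# WARNING threshold only the failure does.
# print(emitted_lines(logging.DEBUG))
# print(emitted_lines(logging.WARNING))
```

The same principle applies to any level-based logger: messages below the configured threshold are dropped before they are written, so a high-volume debug channel stops consuming disk.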

Change 682322 had a related patch set uploaded (by Reedy; author: Reedy):

[operations/mediawiki-config@master] Move ExternalStore log group from debug to error

https://gerrit.wikimedia.org/r/682322

FWIW, +1 on lowering the debug level. AFAIK mwlog1001 is indeed quite close to being replaced by mwlog1002 in T224565: Migrate mwlog/udp2log servers to Buster.

akosiaris triaged this task as Medium priority. (Apr 26 2021, 9:58 AM)
lmata added a subscriber: lmata.

Change 682727 had a related patch set uploaded (by Krinkle; author: Aaron Schulz):

[mediawiki/core@master] externalstore: convert some log messages to WARNING

https://gerrit.wikimedia.org/r/682727

Change 682720 had a related patch set uploaded (by Krinkle; author: Aaron Schulz):

[mediawiki/core@wmf/1.37.0-wmf.1] externalstore: convert some log messages to WARNING

https://gerrit.wikimedia.org/r/682720

Change 682727 merged by jenkins-bot:

[mediawiki/core@master] externalstore: convert some log messages to WARNING

https://gerrit.wikimedia.org/r/682727

Change 682322 merged by jenkins-bot:

[operations/mediawiki-config@master] logging: Raise ExternalStore min level from debug to warning

https://gerrit.wikimedia.org/r/682322