
No space left on device on VRTS host
Closed, Resolved · Public

Description

VRTS logs the following in its system log:

Can't write '/opt/otrs/var/tmp/CacheFileStorable/SchedulerDB/3/d/3da3803eec1b8a9dc6ef14920593b475': No space left on device

and appears to be unable to send outgoing e-mails.

Example ticket:

https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom&TicketNumber=2025103110642361

Event Timeline

Krd triaged this task as Unbreak Now! priority. Tue, Dec 2, 2:20 AM
Krd added projects: vrts, SRE.

@Dzahn has freed up some inodes. We were not out of disk space, we were out of inodes. We are trying to free up some more but for now, we should be running again. As @AntiCompositeNumber mentioned, there has been a steady rise for a while now so we should look into that.
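The distinction matters operationally: `df -h` can show plenty of free space while `df -i` shows the filesystem at 100%, because every file consumes one inode regardless of its size. A minimal demonstration in a temp directory (a hypothetical illustration, not the actual commands run on vrts1003):

```shell
# Every file costs one inode even if it holds no data, so a large tree of
# tiny cache files can exhaust inodes long before it exhausts blocks.
tmp=$(mktemp -d)
for i in $(seq 1 1000); do : > "$tmp/f$i"; done
count=$(find "$tmp" -type f | wc -l)
echo "created $count files (one inode each, ~0 bytes of data)"
df -i "$tmp" >/dev/null   # on the real host, df -i / is what revealed the exhaustion
rm -rf "$tmp"
```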

Mentioned in SAL (#wikimedia-operations) [2025-12-02T03:14:28Z] <mutante> vrts1003 - sudo -u otrs ./bin/otrs.Console.pl Maint::Cache::Delete (T411452)

Mentioned in SAL (#wikimedia-operations) [2025-12-02T03:15:12Z] <mutante> vrts1003 - compressed /opt/znuny-6.5.16 and .17 to .tar.gz files - then deleted uncompressed versions - freeing about 700k inodes (T411452)
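The archive-then-delete step works because a tarball occupies a single inode no matter how many files it contains. A sketch of the approach with illustrative paths (not the real /opt/znuny-6.5.* directories):

```shell
# Replace a many-file tree with one .tar.gz, trading N inodes for 1.
src=$(mktemp -d)
for i in $(seq 1 500); do echo "data" > "$src/file$i"; done
before=$(find "$src" -type f | wc -l)
tar -czf "$src.tar.gz" -C "$(dirname "$src")" "$(basename "$src")"
rm -rf "$src"
echo "replaced $before file inodes with 1 archive inode"
rm -f "$src.tar.gz"
```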

in this context I found:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1009303/3/modules/vrts/manifests/init.pp

which disabled a timer that used to delete cache hourly.
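For context, a systemd timer of the kind the puppet change disabled would have roughly this shape (unit name, description, and schedule here are assumptions for illustration, not the actual vrts puppet module contents):

```ini
# Hypothetical sketch of an hourly cache-cleanup timer; the real unit
# managed by modules/vrts/manifests/init.pp may differ.
[Unit]
Description=VRTS cache cleanup

[Timer]
OnCalendar=hourly   ; a daily schedule was later floated as an alternative
Persistent=true

[Install]
WantedBy=timers.target
```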

related ticket: T354422

@Arnoldokoth

Dzahn lowered the priority of this task from Unbreak Now! to High. Tue, Dec 2, 3:25 AM

Inode usage on / is back to 2%; exim logs show emails are going out.

therefore lowering from UBN to High

There are now some tickets where responses have not been sent out. How do we find these tickets?

Disregard my last entry. Tickets have been identified.

Aklapper renamed this task from "disk full at VRTS host?" to "No space left on device on VRTS host". Tue, Dec 2, 7:59 AM

@Dzahn Yes, I disabled that timer because it conflicted with another one run by the built-in VRTS daemon. Running both of them resulted in a bunch of noise on the emails from the daemon. I can enable it to run maybe daily and see how that goes.

As a follow-up, I'll set up an alert to give us a heads up when inode usage is high.
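An inode-usage alert in a Prometheus-based setup could be built on the node_exporter filesystem metrics. A hedged sketch of what such a rule might look like; the actual expression, labels, and threshold in the operations/alerts change may differ:

```yaml
# Illustrative alerting rule, not the merged change #1214034.
- alert: VrtsHighInodeUsage
  expr: >
    (1 - node_filesystem_files_free{mountpoint="/"}
       / node_filesystem_files{mountpoint="/"}) * 100 > 80
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "High inode usage on {{ $labels.instance }}"
```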

Change #1214034 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/alerts@master] vrts: add high inode usage alert

https://gerrit.wikimedia.org/r/1214034

Change #1214129 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] vrts: re-enable cache cleanup timer

https://gerrit.wikimedia.org/r/1214129

Inode usage has already grown from 2% to 10% in the past day: https://grafana.wikimedia.org/d/000000371/vrts?orgId=1&from=now-2d&to=now&timezone=utc&var-node=vrts1003&viewPanel=panel-30. So I like the idea of enabling the cleanup job again.

+1 - it seems the cleanup job is needed.

Change #1214129 merged by AOkoth:

[operations/puppet@production] vrts: re-enable cache cleanup timer

https://gerrit.wikimedia.org/r/1214129

Change #1214034 merged by jenkins-bot:

[operations/alerts@master] vrts: add high inode usage alert

https://gerrit.wikimedia.org/r/1214034

Jelto assigned this task to Arnoldokoth.
Jelto updated Other Assignee, added: Dzahn.

Thanks @Arnoldokoth for enabling the cleanup job again. The inode metrics look much better now and have stabilized at around 2.5% usage. Also thanks for the new alert, which should warn us before we see the "No space left on device" errors again.

Thanks to @Dzahn for cleaning up the host manually during the last incident.

I'll optimistically resolve this task, as the issue was fixed and countermeasures were implemented.