
mwmaint1002 is out of disk space
Closed, Resolved · Public

Description

16:55 <+icinga-wm> PROBLEM - Disk space on mwmaint1002 is CRITICAL: DISK CRITICAL - free space: / 3753 MB (3% inode=92%): /tmp 3753 MB (3% inode=92%): /var/tmp 3753 MB (3% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwmaint1002&var-datasource=eqiad+prometheus/ops

That alert, plus my (failing) attempts to store data on that machine. Can someone help clean the machine up, please?

Event Timeline

Feels like something is actively filling the disk. I removed some files I no longer need from my home directory, which brought it to 99% used with 1.9G available. Now less than 900M is available, so about a GB of space has been eaten up already.

Urbanecm_WMF triaged this task as Unbreak Now! priority · Apr 28 2025, 3:59 PM

Provisionally triaging as Unbreak Now!; a server completely out of disk space doesn't seem like a good idea. Feel free to lower the priority if you think that's appropriate.

And we're at zero availability:

[urbanecm@mwmaint1002 ~]$ df -h
Filesystem                        Size  Used Avail Use% Mounted on
[...]
/dev/mapper/mwmaint1002--vg-root  121G  116G     0 100% /
[...]
[urbanecm@mwmaint1002 ~]$
elukey@mwmaint1002:/home$ sudo du -hs /home/* | sort -h | tail
553M	/home/ebernhardson
842M	/home/catrope
1.2G	/home/brion
1.3G	/home/tstarling
1.7G	/home/oblivian
1.7G	/home/samtar
2.1G	/home/cparle
11G	/home/legoktm
15G	/home/ladsgroup
23G	/home/zabe

The home dirs may be a quick win for cleanup: 62G used there in total, with a few big ones.

@Zabe @Ladsgroup @Legoktm

Big directories are:

/var/log: 42GB

and

22.8 GiB [##########] /home/zabe
14.9 GiB [######    ] /home/ladsgroup
10.9 GiB [####      ] /home/legoktm

And also:

elukey@mwmaint1002:/var/log/mediawiki$ sudo du -hs * | sort -h | tail
505M	mediawiki_job_mediamoderation-hourlyScan
519M	mediawiki_job_purge_checkuser
546M	mediawiki_job_cirrus_build_completion_indices_eqiad
626M	mediawiki_job_growthexperiments-refreshLinkRecommendations-s6
732M	mediawiki_job_growthexperiments-refreshLinkRecommendations-s1
1.2G	mediawiki_job_growthexperiments-refreshLinkRecommendations-s7
1.3G	mediawiki_job_growthexperiments-refreshLinkRecommendations-s5
3.2G	mediawiki_job_growthexperiments-listTaskCounts
13G	mediawiki_job_growthexperiments-refreshLinkRecommendations-s2
15G	mediawiki_job_growthexperiments-refreshLinkRecommendations-s3

I think something is broken with log rotation. When I was checking the systemd timer logs, I found entries from years ago.

I deleted my old backup logs. That frees up 14GB, but the logs still need to be cleaned up.

The GrowthExperiments logs seem properly rotated: there are daily logfiles going back two weeks, and the log entry dates match the file dates. The job just seems to produce a huge amount of log output.

Change #1139515 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[operations/puppet@production] mediawiki: Make refreshLinkRecommendations job less verbose

https://gerrit.wikimedia.org/r/1139515

Hmm... I just discovered mwmaint2002's disk is significantly larger than mwmaint1002's (430G vs. 120G). Should servers with the same role even have very different disk space?

listTaskCounts uses --output none already, that 3G is entirely job runner boilerplate (a ton of rows like Apr 18 15:11:00 mwmaint1002 mediawiki_job_growthexperiments-listTaskCounts[9828]: abwiki).

There are 454,813 "PHP Notice: fwrite(): write of 63 bytes failed with errno=28 No space left on device" errors in logspam-watch right now. It looks like the extensions/WikimediaMaintenance/migrateESRefToContentTable.php job being run by @Zabe is sending most of them.

It's almost entirely logs from the GrowthExperiments jobs, under /var/log/:

24G	mediawiki_job_growthexperiments-refreshLinkRecommendations-s2
25G	mediawiki_job_growthexperiments-refreshLinkRecommendations-s3

Mentioned in SAL (#wikimedia-operations) [2025-04-28T20:30:05Z] <mutante> mwmaint1002 - manually gzipped some syslog.1 file from growthexperiment jobs that used up all disk space - systemctl start logrotate T392834

The mediawiki_job_growthexperiments-refreshLinkRecommendations-* logs should be fine to delete, if you are looking for some emergency space savings. It's the output of a job creating secondary data.

Logrotate should probably enforce some default storage quota for jobs.
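A minimal sketch of what such a default quota could look like as a logrotate stanza (the glob, size, and rotation count are illustrative assumptions, not the actual puppet-managed config):

```
# Hypothetical /etc/logrotate.d/mediawiki-jobs -- values are illustrative.
/var/log/mediawiki/mediawiki_job_*/syslog.log {
    daily
    rotate 7          # keep at most one week of rotated logs
    maxsize 500M      # rotate early if a single file grows past 500M
    compress
    delaycompress
    missingok
    notifempty
    copytruncate      # truncate in place so the writer's open fd stays valid
}
```

copytruncate also avoids leaving deleted-but-open files behind, at the cost of possibly losing a few log lines during the copy.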

Thanks for confirming that. I deleted the two largest syslog files, from mediawiki_job_growthexperiments-refreshLinkRecommendations-s2 and mediawiki_job_growthexperiments-refreshLinkRecommendations-s3.

Then I started the logrotate service again, which succeeded this time.

Then I started all other failed units.

Stopping the services mediawiki_job_growthexperiments-refreshLinkRecommendations-s2 and mediawiki_job_growthexperiments-refreshLinkRecommendations-s3 also does not properly shut down their child processes.

There are still processes like /bin/sh -c /usr/local/bin/foreachwikiindblist 'growthexperiments & s2' extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php --verbose.

Killed the processes for growthexperiments-refreshLinkRecommendations-s2 and growthexperiments-refreshLinkRecommendations-s3.
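One possible explanation for `systemctl stop` leaving the foreachwikiindblist children behind, offered purely as an assumption (I have not checked the actual unit files): if the units set KillMode=process, only the main /bin/sh wrapper gets signalled on stop. A sketch of a drop-in override that would make stop signal the whole cgroup (the file path and unit name are illustrative):

```
# Hypothetical drop-in, e.g.
# /etc/systemd/system/mediawiki_job_growthexperiments-refreshLinkRecommendations-s2.service.d/kill.conf
[Service]
# control-group (the systemd default) signals every process in the unit's
# cgroup on stop, including children spawned via /bin/sh -c;
# KillMode=process would leave the maintenance script running.
KillMode=control-group
```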

gzipped more syslog files.

Still not a lot of space. rsyslogd was keeping the deleted files open.

restarted rsyslogd, and we are finally back to 60% usage on /.
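The deleted-but-still-open situation is easy to reproduce. This sketch (temporary paths, with `tail -f` standing in for rsyslogd) shows how to spot such files via /proc; `lsof +L1` reports the same thing across all processes:

```shell
#!/bin/sh
# On Linux, a deleted file's blocks are only released once every process
# holding it open closes its file descriptor -- which is why df showed no
# freed space until rsyslogd was restarted.
tmp=$(mktemp -d)
: > "$tmp/held.log"
tail -f "$tmp/held.log" >/dev/null 2>&1 &   # stand-in for rsyslogd holding an fd
pid=$!
sleep 1
rm "$tmp/held.log"                          # unlinked, but space not yet freed
# /proc/<pid>/fd symlinks to unlinked files are suffixed with "(deleted)".
ls -l "/proc/$pid/fd" | grep -c deleted     # prints 1: one fd still holds the file
kill "$pid"
rmdir "$tmp"
```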

kill 24015
kill 24047
systemctl start logrotate
..
systemctl start prometheus-dpkg-success-textfile.service
..
start prometheus_intel_microcode.service
..
systemctl start prometheus-puppet-agent-stats.service
..
systemctl start export_smart_data_dump.service
..
root@mwmaint1002:/var/log/mediawiki# systemctl list-units --state=failed
0 loaded units listed. Pass --all to see loaded but inactive units, too.

...
/dev/mapper/mwmaint1002--vg-root  121G   73G   43G  64% /

...
21:05 <+icinga-wm> RECOVERY - Disk space on mwmaint1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
bd808 lowered the priority of this task from Unbreak Now! to High · Apr 28 2025, 11:30 PM

Dropping priority to High as it seems @Dzahn's cleanup work has taken care of the immediate problem. I'll leave it to him and others to decide if things are {{done}} here now or if there is followup to do before closing.


The home dirs may be a quick win for cleanup: 62G used there in total, with a few big ones.

@Zabe @Ladsgroup @Legoktm

I cleaned up my home directory a bit, reducing its size to 5.3G, although it will continue growing again due to T183490.

elukey@mwmaint1002:/home$ sudo du -hs /home/* | sort -h | tail
...
11G	/home/legoktm

The home dirs may be a quick win for cleanup: 62G used there in total, with a few big ones.

@Legoktm

Sorry about that, cleaned up. My home directory is 27M now.


I just wanted to help out to fix an UBN. I'll leave it to the role owners to decide about the ticket and possible follow-ups.

Change #1139515 merged by Ladsgroup:

[operations/puppet@production] mediawiki: Make refreshLinkRecommendations job less verbose

https://gerrit.wikimedia.org/r/1139515

IMHO we should have something that automatically notifies a team (an automatic task, an email) so that next time we don't need to rely on manually created tickets from users.

Yeah. Can you file a ticket for better monitoring?
