
WikiWho API disk fills up too quickly
Open, Needs Triage · Public · BUG REPORT

Description

As in the case of https://phabricator.wikimedia.org/T402594, the WikiWho service went down today, serving 500 errors for every request. This appears to be because the main disk was full again. I truncated some large logs, then did a soft reboot of the server, and it started working again.

I'm not sure what is taking up so much of the 20G root disk. The files I truncated were the biggest logs I spotted, the celery default worker logs. But there is still only 2.3G free on /, so I must be missing some big source of disk usage.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Recommendation from Discord: SSH into the server, change to the root directory, and run du -l | sort -n

The vast majority of the storage is on the two mounted volumes, so I think a command like that would be hard to get useful info from unless it's scoped to /dev/sda1.
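One way to scope the search to the root filesystem is du's -x flag, which stays on the filesystem of the starting directory and so skips the mounted volumes. A minimal sketch, assuming GNU du; biggest_dirs is a hypothetical helper name, and the depth of 3 and the default count of 20 are arbitrary choices:

```shell
# Hypothetical helper: list the largest directories under a starting
# point without crossing into other mounted filesystems.
biggest_dirs() {
  local root="${1:-/}"     # where to start; defaults to the root directory
  local count="${2:-20}"   # how many of the largest entries to print
  # -x: stay on one filesystem, -k: sizes in KiB, -d 3: report 3 levels deep
  du -x -k -d 3 "$root" 2>/dev/null | sort -n | tail -n "$count"
}
```

Run from a root shell (for example after sudo -i) so that du can read every directory, biggest_dirs / 20 would print the 20 largest entries on the root filesystem, largest last.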

I've found ncdu (side note: this article really doesn't seem like it should exist) to be more useful when trying to find out what's taking up so much disk space. It also has an --exclude <pattern> option if you want to exclude the mounted volumes.

I've seen cases where a file was deleted but a program still held an open handle to the deleted file (a programming error). The file would not show up in directory listings or even in du counts, but the OS could not free the disk space until the handle was closed.
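A rough way to check for this on Linux is to look through /proc for file descriptors whose target has been unlinked. This is only a sketch, assuming GNU find and a mounted /proc; deleted_open_files is a hypothetical helper name:

```shell
# Hypothetical helper: list files that have been deleted but are still
# held open by a running process (Linux only, GNU find assumed).
deleted_open_files() {
  # Every /proc/<pid>/fd/<n> entry is a symlink to the open file; the
  # kernel appends " (deleted)" to the target once the file is unlinked.
  find /proc/[0-9]*/fd -type l -lname '* (deleted)' \
    -printf '%p -> %l\n' 2>/dev/null
  return 0   # find exits non-zero when processes vanish mid-scan; ignore that
}
```

Without root it only sees the current user's processes. If lsof is installed, sudo lsof +L1 gives a similar listing together with file sizes.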

From the last incident (2026-03-14):

Sysadmin notes: We had 2.6G of old journal logs from August 2025 that apparently didn't get auto-deleted. sudo journalctl --disk-usage shows only 120M, so evidently it has somehow lost track of these older log files. Anyway, I deleted them and, just for good measure, adjusted /etc/systemd/journald.conf to set SystemMaxUse=100M and SystemMaxFileSize=10M.
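To spot leftovers like these directly on disk, something along the following lines might help. It is a sketch rather than what was actually run: stale_journal_files is a hypothetical helper, /var/log/journal is assumed to be the persistent journal directory, and the 90-day default is arbitrary.

```shell
# Hypothetical helper: list journal files not modified for N days,
# with their size in KiB, regardless of what journalctl reports.
stale_journal_files() {
  local dir="${1:-/var/log/journal}"  # assumed persistent journal location
  local days="${2:-90}"               # age threshold in days (arbitrary default)
  # '*.journal*' also matches files that journald renamed with a trailing ~
  find "$dir" -type f -name '*.journal*' -mtime +"$days" \
    -exec du -k {} + 2>/dev/null
}
```

A direct listing like this can be compared against sudo journalctl --disk-usage when the two disagree, as they did here. For files journald still tracks, sudo journalctl --vacuum-size=100M or --vacuum-time=90d trims archived journals without deleting anything by hand.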

Checking now, the journal logs appear to be rotating as configured.

I've set up an UptimeRobot monitor for wikiwho.wmcloud.org. It won't catch every imaginable service disruption, but it should specifically let us know when we run out of disk space, given that all traffic goes through wikiwho01. I don't think we can monitor a real URI without potentially causing unnecessary resource strain.

Maintainers also have https://wikiwho-flower.wmcloud.org/ but Flower does not feature an alert system, unfortunately.

I saw we were getting close to being full again. Since we're adding so many more languages right now (T372340), we're logging a lot more things, too! The main WikiWho API handler was printing the API params to stdout, which got logged. I don't think we need to log this information, and even if we do, it should just be the page ID or something, and not all the other API parameters that are always the same.

Anyway, I've simply removed the print, and I ran sudo truncate --size 0 /var/log/celery/*.log. I now see the worker logs aren't being written to at all, but I think they will be when any errors occur. Meanwhile https://wikiwho-flower.wmcloud.org/ still works great, so we already have a log of every revision/page being processed.

This keeps happening. This time around, I couldn't find any culprit in the logs, but I did find a 4.6 GB Nginx access log at /usr/share/nginx/on. After some digging, it looks like our /etc/nginx/sites-enabled/wikiwho was misconfigured with access_log on instead of an actual path. I changed it to /var/log/nginx/access.log and I can now see it's writing there instead of the on file. I then deleted the on file and the disk is now only 77% full.
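For context, nginx treats only off as a special value for access_log; anything else, including on, is taken as a file path relative to the nginx prefix, which is how the on file ended up under /usr/share/nginx. A quick way to look for the same mistake elsewhere in the config could look like this, where find_access_log_on is a hypothetical helper and /etc/nginx is assumed as the config root:

```shell
# Hypothetical helper: find access_log directives whose first argument is
# the literal word "on", which nginx interprets as a file name.
find_access_log_on() {
  local conf_dir="${1:-/etc/nginx}"   # assumed nginx config root
  # -R: recurse and follow symlinks (sites-enabled usually holds symlinks),
  # -n: show line numbers, -E: extended regular expressions
  grep -RnE '^[[:space:]]*access_log[[:space:]]+on[[:space:];]' "$conf_dir"
}
```

After correcting a directive, sudo nginx -t validates the configuration and sudo systemctl reload nginx applies it.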

Between this and T415603#11836766, I hope we're finally good, but I'm not going to close this task just yet.

I also noticed that /var/lib/postgresql/ is taking up about 5.5 GB, so if we start to run out of disk space again, maybe we should prioritize T335322.