FIRING: DiskSpace: Disk space doc1003:9100:/ 5.661% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=doc1003 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| doc: change active host to doc2002 | operations/puppet | production | +1 -1 |
| wmnet: failover doc host | operations/dns | master | +2 -2 |
| Status | Assigned | Task |
|---|---|---|
| Resolved | Arnoldokoth | T382610 Low disk space: doc1003 / doc2002 |
| Resolved | None | T382964 ProbeDown - doc1003/doc2002 |
| Resolved | None | T383027 ProbeDown - doc1003 |
| Resolved | None | T383017 ProbeDown - doc1003 |
| Resolved | None | T382978 ProbeDown - doc1003 http |
Event Timeline
doc2002 is a bit lower but follows a similar growth pattern: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=doc2002&refresh=5m&from=now-6M&to=now&var-datasource=thanos&var-cluster=ci
I removed some log files and the apt cache. Both hosts now have 7.5G of free space, which should be enough for the holiday break.
Thanks @Jelto. I had tried clearing the apt cache as well (on Friday), but that didn't bring usage below the alerting threshold. To buy us more time as we figure out the next steps, I'll increase the disk space after the break.
Since the data is generated and uploaded to the doc hosts automatically, doc1003 and doc2002 should be pretty much identical in disk usage at all times.
And indeed, doc2002 is also at 94%.
Any action we take for doc1003 should also be done for doc2002. I renamed the ticket to reflect that.
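For reference, the "% free" figure in the alert can be approximated locally. The following is a minimal sketch, not the actual alerting rule: the function names and the 6% threshold are assumptions made for illustration, and the real alert is evaluated by the monitoring stack rather than by a script like this.

```python
import shutil


def percent_free(free_bytes: int, total_bytes: int) -> float:
    """Return the free space as a percentage of the total size."""
    return 100.0 * free_bytes / total_bytes


def is_low_on_space(path: str = "/", threshold: float = 6.0) -> bool:
    """Return True if the filesystem holding `path` is below `threshold` % free.

    The 6.0 default is a hypothetical value, not the production threshold.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_free(usage.free, usage.total) < threshold
```

Calling `is_low_on_space("/")` on each doc host would give a quick yes/no answer. The result may differ slightly from `df`, which accounts for reserved blocks differently.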
I think the options are pretty much:
- create a new virtual disk, mount it, reboot hosts
- ask releng if this is really the space they need or if there is any potential for cleaning up old things
(Since yes, the data is all under /srv/, so from the actual service.)
Since it takes a while, I'm adding a 200g disk to both hosts concurrently. Will reboot and mount doc2002 today probably. Then based on how that goes will do doc1003 on Monday.
Sounds great! Keep in mind that it might not boot because the device names change. If that happens we have to fix that via console. (re: the warning section under https://wikitech.wikimedia.org/wiki/Ganeti#Adding_a_disk)
Icinga downtime and Alertmanager silence (ID=631e2e68-64d0-484a-bcdd-4e7124acc93b) set by aokoth@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: Disk Change
doc2002.codfw.wmnet
Icinga downtime and Alertmanager silence (ID=0425add6-9c3e-404b-a6c4-25e4b3017187) set by aokoth@cumin1002 for 5 days, 0:00:00 on 1 host(s) and their services with reason: Disk Change
doc2002.codfw.wmnet
Mentioned in SAL (#wikimedia-operations) [2025-01-06T11:47:54Z] <moritzm> fix /etc/network/interfaces on doc2002 T382610
doc2002 looks alright now.
aokoth@doc2002:~$ sudo df -ht ext4
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 117G 7.3G 104G 7% /
/dev/vdb1 196G 97G 90G 52% /srv
The doc1003 instance was erroneously created with a single partition. The generated documentation is stored under /srv/doc, and /srv should be a standalone partition, the same as on doc1002 (/srv is 200G there). We have a similar issue on releases1003 with Docker material being on / (T368239#10197658), but that is a side track.
Can you look at attaching a new 200G partition to doc1003 and migrating the data to it? Do note that doc1003 is the primary, so you might want to switch over to doc2002 ( https://wikitech.wikimedia.org/wiki/Doc.wikimedia.org#Runbook ).
Side track: 85G of the 98G used on /srv/doc is due to mediawiki/core. I have filed T383128 to clean it up :)
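A per-directory breakdown like the one above can be produced with `du`, or with a short script. This is only a sketch of one way to do it: the helper names are hypothetical, and it sums apparent file sizes rather than allocated disk blocks, so the numbers can differ somewhat from `du`.

```python
import os


def dir_size(path: str) -> int:
    """Sum the apparent size in bytes of all regular files below `path`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):  # does not follow symlinked dirs
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
    return total


def largest_subdirs(root: str, limit: int = 5) -> list:
    """Return up to `limit` (name, size_in_bytes) pairs, largest first."""
    sizes = [
        (entry.name, dir_size(entry.path))
        for entry in os.scandir(root)
        if entry.is_dir(follow_symlinks=False)
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:limit]
```

For example, `largest_subdirs("/srv/doc")` would list the top-level directories that take the most space.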
I have cleaned up a bunch of old MediaWiki documentation on doc1003. Eventually the rsync replicated the deletions to doc2002, and both hosts now have plenty more disk space.
We still need /srv/doc to be on its own partition, which is what @Arnoldokoth is doing :)
Change #1108812 had a related patch set uploaded (by AOkoth; author: AOkoth):
[operations/puppet@production] doc: change active host to doc2002
Change #1108814 had a related patch set uploaded (by AOkoth; author: AOkoth):
[operations/dns@master] wmnet: failover doc host
Change #1108812 merged by AOkoth:
[operations/puppet@production] doc: change active host to doc2002
doc1003 is also good now.
aokoth@doc1003:~$ sudo df -ht ext4
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 117G 7.7G 104G 7% /
/dev/vdb1 196G 58G 129G 31% /srv
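If the same check needs to be repeated across hosts, the `df` output can be parsed rather than read by eye. A rough sketch, assuming the six-column layout shown above and mount points without spaces; `parse_df` is a hypothetical helper, not an existing tool:

```python
SAMPLE = """\
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 117G 7.7G 104G 7% /
/dev/vdb1 196G 58G 129G 31% /srv
"""


def parse_df(output: str) -> dict:
    """Map each mount point to its Use% (as an int) from `df -h` style output."""
    usage = {}
    for line in output.strip().splitlines():
        fields = line.split()
        # Data rows have exactly six fields; the header ("Mounted on") has seven.
        if len(fields) != 6:
            continue
        percent = fields[4]
        if not (percent.endswith("%") and percent[:-1].isdigit()):
            continue
        usage[fields[5]] = int(percent[:-1])
    return usage
```

With the doc1003 output above, `parse_df(SAMPLE)` yields one entry for `/` and one for `/srv`, which also confirms that `/srv` is now its own mount.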
@hashar For the next time we create a new doc machine, would you still request 2 separate disks or just a single disk that is large enough?
See above T382610#10437066: The doc1003 instance was erroneously created with a single partition
In my experience, having a single partition has always led to problems, given that an application can fill up the disk, which then ends up causing issues for the OS. We had the exact same issue with releases1003, with a Jenkins job writing 53G of temporary data to /srv/jenkins-agent/workspace, which filled the / partition. It was a one-off issue, but that is still a problem ( T368239#10197658 ). I previously had the same issue with contint / CI agents etc.
So there should be:
- / partition solely for the OS
- /srv for the ever-expanding data
And I think splitting the OS and application concerns onto different partitions should be the default rule :)
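Whether a host follows this rule can be checked by comparing device IDs. A minimal sketch, assuming a plain partition layout (bind mounts or btrfs subvolumes can make this comparison misleading); the function name is made up for illustration:

```python
import os


def on_separate_filesystem(path: str, root: str = "/") -> bool:
    """Return True if `path` lives on a different device than `root`."""
    # st_dev identifies the device that holds the file or directory.
    return os.stat(path).st_dev != os.stat(root).st_dev
```

On a host laid out as described, `on_separate_filesystem("/srv")` should return True. On a host with a single partition, as doc1003 was originally created, it would return False.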
