Low disk space: doc1003 / doc2002
Closed, ResolvedPublic

Event Timeline

I removed some log files and cleared the apt cache. Both hosts now have 7.5G of free space, which should be enough for the holiday break.

Thanks @Jelto. I had also tried clearing the apt cache (on Friday), but usage didn't drop below the alerting threshold. To buy us more time while we figure out the next steps, I'll increase the disk space after the break.

Dzahn renamed this task from Low disk space: doc1003 to Low disk space: doc1003 / doc2002.Jan 2 2025, 4:40 PM
Dzahn subscribed.

Since the data is generated and uploaded to the doc hosts automatically, doc1003 and doc2002 should have nearly identical disk usage at all times.

And indeed, doc2002 is also at 94%.

Any action we take for doc1003 should also be done for doc2002. I renamed the ticket to reflect that.

I think the options are pretty much:

  • create a new virtual disk, mount it, reboot hosts
  • ask releng if this is really the space they need or if there is any potential for cleaning up old things

(Since yes, the data is all under /srv/, so it comes from the actual service.)
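
As a rough way to reproduce a figure like the 94% mentioned above, here is a minimal Python sketch that computes a df-style usage percentage for the filesystem holding a given path. The function names and the 90% default threshold are assumptions made for illustration; the actual alerting threshold is not stated in this task.

```python
import shutil


def used_percent(path: str) -> float:
    """Approximate df's Use% for the filesystem that holds `path`.

    df reports used / (used + available to unprivileged users), so blocks
    reserved for root are excluded from the denominator.
    """
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free
    if denominator == 0:
        # Pseudo filesystems can report zero blocks.
        return 0.0
    return 100.0 * usage.used / denominator


def is_low_on_space(path: str, threshold_percent: float = 90.0) -> bool:
    """True if usage is at or above the (hypothetical) threshold."""
    return used_percent(path) >= threshold_percent


if __name__ == "__main__":
    # "/" is where the data lived on doc1003 while it had a single partition.
    print(f"/: {used_percent('/'):.1f}% used, low={is_low_on_space('/')}")
```

Running the same script on both hosts should give close to the same number, given that the data is replicated between them.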

Arnoldokoth changed the task status from Open to In Progress.Jan 3 2025, 4:59 PM

Since it takes a while, I'm adding a 200g disk to both hosts concurrently. Will reboot and mount doc2002 today probably. Then based on how that goes will do doc1003 on Monday.

Sounds great! Keep in mind that it might not boot because the device names change. If that happens we have to fix that via console. (re: the warning section under https://wikitech.wikimedia.org/wiki/Ganeti#Adding_a_disk)

Icinga downtime and Alertmanager silence (ID=631e2e68-64d0-484a-bcdd-4e7124acc93b) set by aokoth@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: Disk Change

doc2002.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=0425add6-9c3e-404b-a6c4-25e4b3017187) set by aokoth@cumin1002 for 5 days, 0:00:00 on 1 host(s) and their services with reason: Disk Change

doc2002.codfw.wmnet
LSobanski moved this task from Incoming to Work in Progress on the collaboration-services board.

Mentioned in SAL (#wikimedia-operations) [2025-01-06T11:47:54Z] <moritzm> fix /etc/network/interfaces on doc2002 T382610

doc2002 looks alright now.

aokoth@doc2002:~$ sudo df -ht ext4
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       117G  7.3G  104G   7% /
/dev/vdb1       196G   97G   90G  52% /srv

The doc1003 instance was erroneously created with a single partition. The generated documentation is stored under /srv/doc, and /srv should be a standalone partition, the same as on doc1002 (/srv is 200G there). We have a similar issue on releases1003, with Docker material being on / (T368239#10197658), but that is a side track.

Can you look at attaching a new 200G partition to doc1003 and migrating the data to it? Do note that doc1003 is the primary, so you might want to switch over to doc2002 ( https://wikitech.wikimedia.org/wiki/Doc.wikimedia.org#Runbook ).

Side track: 85G of the 98G used on /srv/doc is due to mediawiki/core. I have filed T383128 to clean it up :)
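
A breakdown like the one above can be approximated with a short Python sketch that sums file sizes per top-level entry of a directory. This is only an illustration: the function names are hypothetical, and it reports apparent file sizes rather than allocated disk blocks, so the numbers can differ from what du shows.

```python
import os
from typing import List, Tuple


def tree_size(root: str) -> int:
    """Sum the apparent sizes (in bytes) of all files below `root`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # lstat so symlinks are counted as links, not as their targets.
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def largest_children(root: str, top: int = 5) -> List[Tuple[str, int]]:
    """Return the `top` largest direct children of `root` as (name, bytes)."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.name, tree_size(entry.path)))
            else:
                sizes.append((entry.name, entry.stat(follow_symlinks=False).st_size))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]


if __name__ == "__main__":
    # /srv/doc is the path named in this task; adjust as needed.
    for name, size in largest_children("/srv/doc"):
        print(f"{size / 1024**3:8.1f}G  {name}")
```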

I have cleaned up a bunch of old MediaWiki documentation on doc1003. Eventually the rsync replicated the deletions to doc2002, and both hosts now have plenty more disk space:

doc_disk_free.png (514×768 px, 26 KB)

We still need /srv/doc to be on its own partition, which is what @Arnoldokoth is doing :)

Change #1108812 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] doc: change active host to doc2002

https://gerrit.wikimedia.org/r/1108812

Change #1108814 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/dns@master] wmnet: failover doc host

https://gerrit.wikimedia.org/r/1108814

Change #1108814 merged by AOkoth:

[operations/dns@master] wmnet: failover doc host

https://gerrit.wikimedia.org/r/1108814

Change #1108812 merged by AOkoth:

[operations/puppet@production] doc: change active host to doc2002

https://gerrit.wikimedia.org/r/1108812

doc1003 is also good now.

aokoth@doc1003:~$ sudo df -ht ext4
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       117G  7.7G  104G   7% /
/dev/vdb1       196G   58G  129G  31% /srv
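
For comparing the two hosts programmatically, the df output above can be parsed with a few lines of Python. This is a sketch under the assumption that the output has the six-column `df -h` layout shown here; the `DfRow` type and `parse_df` function are hypothetical names, and the sample text is the doc1003 output copied from this comment.

```python
from typing import List, NamedTuple


class DfRow(NamedTuple):
    filesystem: str
    size: str
    used: str
    avail: str
    use_percent: int
    mountpoint: str


def parse_df(output: str) -> List[DfRow]:
    """Parse `df -h` style output, skipping the header line."""
    rows = []
    for line in output.strip().splitlines()[1:]:
        # Split into at most 6 fields so mount points with spaces survive.
        fields = line.split(None, 5)
        if len(fields) != 6 or not fields[4].endswith("%"):
            continue
        fs, size, used, avail, pct, mount = fields
        rows.append(DfRow(fs, size, used, avail, int(pct.rstrip("%")), mount))
    return rows


SAMPLE = """\
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       117G  7.7G  104G   7% /
/dev/vdb1       196G   58G  129G  31% /srv
"""

if __name__ == "__main__":
    for row in parse_df(SAMPLE):
        print(f"{row.mountpoint}: {row.use_percent}% used, {row.avail} available")
```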

@hashar For the next time we create a new doc machine, would you still request 2 separate disks or just a single disk that is large enough?

See above T382610#10437066: The doc1003 instance was erroneously created with a single partition

In my experience, having a single partition has always led to problems, since an application can fill up the disk, which then causes issues for the OS. We had the exact same issue on releases1003, with a Jenkins job writing 53G of temporary data to /srv/jenkins-agent/workspace, which filled the / partition. It was a one-off issue, but that is still a problem ( T368239#10197658 ). I previously had the same issue with contint / CI agents etc.

So there should be:

  • / partition solely for the OS
  • /srv for the ever-expanding data

And I think splitting the OS and application concerns onto different partitions should be the default rule :)
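
One way to verify that rule on a host is to check that the data path and / sit on different devices. The following Python sketch does that by comparing device IDs; the function name is hypothetical, and a bind mount from the same device would still count as "not separate" here.

```python
import os


def on_separate_filesystem(data_path: str, root_path: str = "/") -> bool:
    """True if `data_path` lives on a different device than `root_path`."""
    return os.stat(data_path).st_dev != os.stat(root_path).st_dev


if __name__ == "__main__":
    # /srv is the data mount point discussed in this task.
    if on_separate_filesystem("/srv"):
        print("/srv is on its own filesystem")
    else:
        print("/srv shares a filesystem with /")
```

A check like this would have flagged doc1003 while it still had a single partition, before the data grew enough to trigger the disk space alert.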