Monitor the free space in the dumps v1 Ceph volume
Closed, ResolvedPublic

Description

We have to declare a volume size when creating a Kubernetes volume. As a result, the volume can fill up if we don't monitor it. We need to be alerted as the volume fills up, so that we can resize it.

Event Timeline

After a bit of investigation, it appears that the kubelet_volume_stats_available_bytes and kubelet_volume_stats_capacity_bytes Prometheus metrics are only collected when a Pod mounting the associated volume is running.

That argues in favor of keeping a toolbox pod running permanently with the volume mounted. Such a pod could also help us clear the locks.

Going in the same direction, we could also have this pod run some kind of monitor, as suggested on Slack by @BTullis:

However, we may also want a deployment that supports the monitor process. This is like a daemon, currently running on snapshot1010, which does some stuff with the index.html files. Maybe we could also look at deploying this and use this as a toolbox pod as well.

Change #1130642 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace

https://gerrit.wikimedia.org/r/1130642

I have manually deployed the Deployment in the mediawiki-dumps-legacy namespace, and we can see the volume capacity metrics showing up in Thanos.
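With kubelet_volume_stats_available_bytes and kubelet_volume_stats_capacity_bytes both reported, the remaining space can be expressed as a percentage of the declared capacity. The following is a minimal sketch, assuming the two byte values have already been fetched from the metrics; the `pvc_free_percent` helper is hypothetical and is not the rule defined in operations/alerts.

```shell
# Hypothetical helper: print the percentage of free space on a volume.
#   $1: value of kubelet_volume_stats_available_bytes
#   $2: value of kubelet_volume_stats_capacity_bytes
# Returns a non-zero status if the capacity is not a positive number.
pvc_free_percent() {
  awk -v available="$1" -v capacity="$2" 'BEGIN {
    if (capacity <= 0) { exit 1 }
    printf "%.1f\n", (available / capacity) * 100
  }'
}

# Example: 10 TiB available out of a 40 TiB capacity.
pvc_free_percent 10995116277760 43980465111040   # prints 25.0
```

An alert would then compare this percentage against a chosen threshold; the threshold itself is a policy decision and is not stated in this task.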

Change #1130952 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/alerts@master] Add monitoring over the mediawiki dumps legacy CephFS PVC available space

https://gerrit.wikimedia.org/r/1130952

As per @BTullis's comment on Slack, I'm also going to increase the CephFS volume max size from 10Gi to 40Ti:

I think I'd be just as happy to go straight to 40TB, or similar. It's still thinly-provisioned, so it's not like we're zeroing out 120 TB of space in advance. Just setting upper limits.

brouberol@deploy1003:~$ k get pvc
NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
mediawiki-dumps-legacy-fs   Bound    pvc-646c60c5-d272-40fc-9cb4-a0772513e1b9   10Gi       RWX            ceph-cephfs-dumps   5d17h
brouberol@deploy1003:~$ k edit pvc mediawiki-dumps-legacy-fs
persistentvolumeclaim/mediawiki-dumps-legacy-fs edited
brouberol@deploy1003:~$ k get pvc
NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
mediawiki-dumps-legacy-fs   Bound    pvc-646c60c5-d272-40fc-9cb4-a0772513e1b9   40Ti       RWX            ceph-cephfs-dumps   5d17h
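The resize above was done interactively with `k edit`. A non-interactive equivalent could look like the sketch below, assuming that `k` is an alias for `kubectl`, that the current context already points at the mediawiki-dumps-legacy namespace, and that the storage class allows volume expansion. The `pvc_resize_patch` helper is hypothetical and only builds the patch document.

```shell
# Hypothetical helper: build a JSON merge patch that sets the requested
# storage size of a PersistentVolumeClaim.
#   $1: new size as a Kubernetes quantity, e.g. 40Ti
pvc_resize_patch() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}' "$1"
}

# Usage (commented out: needs cluster access):
# kubectl patch pvc mediawiki-dumps-legacy-fs --type merge -p "$(pvc_resize_patch 40Ti)"
```

Patching `spec.resources.requests.storage` changes the same field that was modified through the editor, but leaves a command that can be reviewed and repeated.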

Change #1130952 merged by Brouberol:

[operations/alerts@master] Add monitoring over the mediawiki dumps legacy CephFS PVC available space

https://gerrit.wikimedia.org/r/1130952

brouberol changed the task status from Unknown Status to Resolved. Mar 25 2025, 2:41 PM

Change #1130642 merged by Brouberol:

[operations/deployment-charts@master] Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace

https://gerrit.wikimedia.org/r/1130642