We declared a volume size when creating the Kubernetes volume. As such, the volume can fill up if we don't monitor it. We need to be alerted when the volume fills up, so that we can resize it.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | None | | T88728 Improve Wikimedia dumping infrastructure |
| Resolved | | BTullis | T352650 WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes |
| Resolved | | brouberol | T388378 Orchestrate dumps v1 from an airflow instance |
| Resolved | | brouberol | T389762 Monitor the free space in the dumps v1 Ceph volume |
Event Timeline
After a bit of investigation, it appears that the `kubelet_volume_stats_available_bytes` and `kubelet_volume_stats_capacity_bytes` Prometheus metrics are only collected when a Pod mounting the associated volume is running.
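For reference, once these metrics are being scraped, the fraction of space still available on the volume can be computed with a query along these lines (the label selector is an assumption; the exact labels depend on how the PVC is exposed):

```
kubelet_volume_stats_available_bytes{namespace="mediawiki-dumps-legacy"}
/
kubelet_volume_stats_capacity_bytes{namespace="mediawiki-dumps-legacy"}
```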
That argues in favor of having an ever-running toolbox pod that mounts the volume. This could also help us clear the locks.
Going in the same direction, we could also have this pod run some kind of monitor, as suggested on Slack by @BTullis:
However, we may also want a deployment that supports the monitor process. This is like a daemon, currently running on snapshot1010, which does some stuff with the index.html files. Maybe we could also look at deploying this and use this as a toolbox pod as well.
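A minimal sketch of what such an ever-running toolbox Deployment could look like (the image name and mount path are placeholders, and the real definition lives in the deployment chart, not here):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: toolbox
  namespace: mediawiki-dumps-legacy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: toolbox
  template:
    metadata:
      labels:
        app: toolbox
    spec:
      containers:
        - name: toolbox
          image: registry.example.org/toolbox:latest  # hypothetical image
          command: ["sleep", "infinity"]              # keep the pod running
          volumeMounts:
            - name: dumps
              mountPath: /mnt/dumps                   # hypothetical mount path
      volumes:
        - name: dumps
          persistentVolumeClaim:
            claimName: mediawiki-dumps-legacy-fs
```

Mounting the PVC from a long-lived pod is what makes the kubelet report the `kubelet_volume_stats_*` metrics for it.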
Change #1130642 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace
I have manually deployed the Deployment in the mediawiki-dumps-legacy namespace, and we can see the volume capacity metrics showing up in Thanos.
Change #1130952 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/alerts@master] Add monitoring over the mediawiki dumps legacy CephFS PVC available space
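For illustration, such an alerting rule could take a shape like the following (the alert name, threshold, and durations are assumptions, not the content of the merged patch):

```yaml
groups:
  - name: mediawiki-dumps-legacy
    rules:
      - alert: MediawikiDumpsLegacyPVCSpaceLow  # hypothetical alert name
        expr: |
          kubelet_volume_stats_available_bytes{namespace="mediawiki-dumps-legacy"}
          / kubelet_volume_stats_capacity_bytes{namespace="mediawiki-dumps-legacy"} < 0.10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% free space left on the mediawiki dumps legacy CephFS PVC"
```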
As per @BTullis's comment on Slack, I'm also going to increase the CephFS volume max size from 10GiB to 40TiB:
I think I'd be just as happy to go straight to 40TB, or similar. It's still thinly-provisioned, so it's not like we're zeroing out 120 TB of space in advance. Just setting upper limits.
```
brouberol@deploy1003:~$ k get pvc
NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
mediawiki-dumps-legacy-fs   Bound    pvc-646c60c5-d272-40fc-9cb4-a0772513e1b9   10Gi       RWX            ceph-cephfs-dumps   5d17h
brouberol@deploy1003:~$ k edit pvc mediawiki-dumps-legacy-fs
persistentvolumeclaim/mediawiki-dumps-legacy-fs edited
brouberol@deploy1003:~$ k get pvc
NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS        AGE
mediawiki-dumps-legacy-fs   Bound    pvc-646c60c5-d272-40fc-9cb4-a0772513e1b9   40Ti       RWX            ceph-cephfs-dumps   5d17h
```
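For the record, the same resize can be done non-interactively instead of through `k edit`, along these lines (a sketch; it relies on the storage class allowing volume expansion):

```shell
kubectl -n mediawiki-dumps-legacy patch pvc mediawiki-dumps-legacy-fs \
  --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"40Ti"}}}}'
```

Since CephFS volumes are thinly provisioned, raising the request only moves the upper limit; no space is actually reserved up front.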
Change #1130952 merged by Brouberol:
[operations/alerts@master] Add monitoring over the mediawiki dumps legacy CephFS PVC available space
Change #1130642 merged by Brouberol:
[operations/deployment-charts@master] Define a maintenance toolbox to run in the mediawiki-dumps-legacy namespace