- Ensure that all PVs that exist are bound and in-use (a quick check is sketched after this list)
- Ensure that PV(C)s get cleaned and reaped
- Check the volume size vs. physical volume space to ensure that we're using all the space we have allocated in WMCS
  - 80G for legacy patchdemo
  - 40G for catalyst patchdemo
  - 40G for logs
- Maybe consider a rebalance here
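The first two items can be checked from kubectl output alone. A minimal sketch, assuming a working kubeconfig on the host; anything not in the Bound phase is a candidate for cleanup:

```
# PVs not in the Bound phase (Available/Released/Failed are candidates for reaping)
kubectl get pv --no-headers | grep -vw Bound

# PVCs in any namespace that are not Bound
kubectl get pvc -A --no-headers | grep -vw Bound
```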
On k3s.catalyst.eqiad1.wikimedia.cloud:

```
kindrobot@k3s:~$ df -hT | grep -E 'ext4'
/dev/sda1      ext4   20G  7.2G   12G  39% /
/dev/sdb       ext4   40G   34G  3.7G  91% /mnt/k3s-data
/dev/sdc       ext4   40G   42M   38G   1% /mnt/k3s-logs
```
/mnt/k3s-data is a 40G volume with 34G in use (91% used). Moving into /mnt/k3s-data, we see:
```
root@k3s:/mnt/k3s-data/k3s# du -hd1 .
33M     ./server
196M    ./data
8.6G    ./agent
25G     ./storage
34G     .
```
agent takes up 8.6G; it contains the bits for containers, including filesystem layers. storage contains persistent volumes, in which we see:
```
root@k3s:/mnt/k3s-data/k3s/storage# du -hd1 .
209M    ./pvc-69118c3f-0190-4de3-8a04-20d0d5172f4f_cat-env_wiki-8990c278e4-8-mysql-claim
1.3G    ./pvc-7e833d7f-069c-420e-99c2-8b963dab546e_patchdemo_data-patchdemo-mariadb-0
143M    ./pvc-f0d8abdc-6550-4715-894c-93671d927f59_control-plane_data-catalyst-api-mariadb-0
8.0K    ./pvc-0bb40fca-ad3b-440f-bd14-c242e110120f_control-plane-staging_catalyst-api-staging-catalyst-claim
155M    ./pvc-01e83c79-25c0-48f8-8879-3b1ec59430a5_control-plane_data-catalyst-api-mariadb-0
1.1G    ./pvc-9df26588-07bb-4c9f-b03b-f3cd4a152b06_cat-env_wiki-8990c278e4-8-mw-claim
22G     ./pvc-397f0c96-d81c-45da-8b4c-6afe54817007_patchdemo_patchdemo
156M    ./pvc-32211de2-6742-422a-9d54-19a47944a942_patchdemo-staging_data-patchdemo-staging-mariadb-0
2.2M    ./pvc-bfb7315f-3e9a-4940-b68c-f5a9334da97d_patchdemo-staging_patchdemo-staging
8.0K    ./pvc-b9edb051-399d-4960-a2be-1eb26e99c521_control-plane_catalyst-api-catalyst-claim
155M    ./pvc-a99c6262-9fac-4eaf-ae4d-8898b753fbb5_control-plane-staging_data-catalyst-api-staging-mariadb-0
25G     .
```
The lion's share of the space goes to the 22G pvc-397f0c96-d81c-45da-8b4c-6afe54817007_patchdemo_patchdemo. This PVC is where legacy wikis are stored. Going in there, we see lots of folders, many of them empty. Looking only at the ones with content:
```
632M    ./e9aed027c3
606M    ./b3cbd24019
637M    ./ceecd96f6d
636M    ./dbf8c3830c
631M    ./2831896d67
631M    ./c755858135
636M    ./a6f159b1e0
594M    ./12ab10c33c
631M    ./da1ee0259a
631M    ./eb49b041ed
633M    ./2692dab28b
637M    ./5be156dd1d
1.4G    ./fbec298855
593M    ./8e7f6f14f9
631M    ./fb2748c1cf
636M    ./d7609cc794
632M    ./0a8ead2152
594M    ./2d544aad97
1.4G    ./6bc5837bc4
637M    ./d55ccdd694
631M    ./3ab1a2a854
229M    ./c8ad5ef594
221M    ./f725dee674
220M    ./fa2297c389
231M    ./fe564146d6
229M    ./e1d6327d54
631M    ./4fd2a6cca5
632M    ./aa5be3754c
636M    ./35df23a7dc
631M    ./29bbc6fb2b
631M    ./37a2ff76ff
636M    ./149997aa69
636M    ./0a309f9ee8
637M    ./366555ae53
631M    ./4226970e0f
637M    ./0d6e534011
```
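For reference, a listing like the one above can be produced with something along these lines (a sketch; using find's -empty test to skip the empty folders is an assumption about what "empty" means here):

```
# per-wiki disk usage inside the patchdemo PVC, skipping empty directories
find . -mindepth 1 -maxdepth 1 -type d ! -empty -exec du -sh {} +
```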
There are 36 wikis, with an average of roughly 615 MB per wiki.
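The count and average can be sanity-checked straight from the same listing (a sketch, assuming GNU find/du/awk):

```
# count the non-empty wiki directories and average their size in MiB
find . -mindepth 1 -maxdepth 1 -type d ! -empty -exec du -sBM {} + \
  | awk '{sub(/M$/, "", $1); total += $1; n++} END {printf "%d wikis, ~%.0f MiB each on average\n", n, total / n}'
```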
We also checked that all PVs are attached to pods, and confirmed that they were:

```
kindrobot@k3s:~$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                       STORAGECLASS   REASON   AGE
pvc-01e83c79-25c0-48f8-8879-3b1ec59430a5   8Gi        RWO            Delete           Bound    control-plane/data-catalyst-api-mariadb-0                   local-path              138d
pvc-7e833d7f-069c-420e-99c2-8b963dab546e   8Gi        RWO            Delete           Bound    patchdemo/data-patchdemo-mariadb-0                          local-path              138d
pvc-b9edb051-399d-4960-a2be-1eb26e99c521   100Mi      RWO            Delete           Bound    control-plane/catalyst-api-catalyst-claim                   local-path              40d
pvc-397f0c96-d81c-45da-8b4c-6afe54817007   1Gi        RWO            Delete           Bound    patchdemo/patchdemo                                         local-path              38d
pvc-9df26588-07bb-4c9f-b03b-f3cd4a152b06   2560Mi     RWO            Delete           Bound    cat-env/wiki-8990c278e4-8-mw-claim                          local-path              8d
pvc-69118c3f-0190-4de3-8a04-20d0d5172f4f   2560Mi     RWO            Delete           Bound    cat-env/wiki-8990c278e4-8-mysql-claim                       local-path              8d
pvc-0bb40fca-ad3b-440f-bd14-c242e110120f   100Mi      RWO            Delete           Bound    control-plane-staging/catalyst-api-staging-catalyst-claim   local-path              6d2h
pvc-a99c6262-9fac-4eaf-ae4d-8898b753fbb5   8Gi        RWO            Delete           Bound    control-plane-staging/data-catalyst-api-staging-mariadb-0   local-path              6d2h
pvc-bfb7315f-3e9a-4940-b68c-f5a9334da97d   1Gi        RWO            Delete           Bound    patchdemo-staging/patchdemo-staging                         local-path              6d2h
pvc-32211de2-6742-422a-9d54-19a47944a942   8Gi        RWO            Delete           Bound    patchdemo-staging/data-patchdemo-staging-mariadb-0          local-path              6d2h
```
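One way to confirm the "attached to pods" part (rather than just Bound) is to cross-reference PV claims against pod volumes; a rough sketch, assuming jq is available on the host:

```
# PVCs referenced by pod volumes, across all namespaces
kubectl get pods -A -o json \
  | jq -r '.items[] | .metadata.namespace as $ns | .spec.volumes[]?
           | select(.persistentVolumeClaim != null)
           | "\($ns)/\(.persistentVolumeClaim.claimName)"' \
  | sort -u

# claims listed on the PVs, for comparison
kubectl get pv -o jsonpath='{range .items[*]}{.spec.claimRef.namespace}/{.spec.claimRef.name}{"\n"}{end}' | sort -u
```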
There's more research to do, but this does not look like an issue of Kubernetes failing to reap PVCs; rather, there is simply inadequate storage for the largest PVC, the one holding vhost-envs. vm-patchdemo has 80 GB for its wikis; we functionally have less than 40 GB. Recommended course of action:
*resize k3s-data volume*
- delete the k3s-logs volume
- resize k3s-data to be 75 GB
- create a new k3s-logs volume of 5 GB
- reattach / fix mounts for the k3s-logs volume on the k3s VM (on-host filesystem steps are sketched after this list)
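A rough sketch of the on-host part only (the Cinder volume operations in Horizon/WMCS are not shown; the /dev/sdb and /dev/sdc device names come from the df output above and should be re-checked after reattaching):

```
# after the k3s-data volume has been grown to 75 GB, expand the ext4 filesystem in place
sudo resize2fs /dev/sdb

# after attaching the new 5 GB logs volume, create a filesystem and mount it
sudo mkfs.ext4 /dev/sdc
sudo mount /dev/sdc /mnt/k3s-logs

# make the mount persistent, ideally by UUID
echo "UUID=$(sudo blkid -s UUID -o value /dev/sdc) /mnt/k3s-logs ext4 defaults 0 2" | sudo tee -a /etc/fstab
```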