
Investigate disk space usage of k3s node
Closed, Resolved · Public

Description

  • Ensure that all PVs that exist are bound and in-use (a quick check is sketched after this list)
  • Ensure that PV(C)s get cleaned up and reaped
  • Check the volume size vs. physical volume space to ensure that we're using all the space we have allocated in WMCS
    • 80G for legacy patchdemo
    • 40G for catalyst patchdemo
    • 40G for logs
    • Maybe consider a rebalance here
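
A quick way to sanity-check the first two items (unbound PVs and leftover claims) would be something along these lines; this is a sketch, not output captured from the cluster:

kubectl get pv                                      # cluster-wide list of PVs and their STATUS
kubectl get pvc -A                                  # all claims, across namespaces
kubectl get pv --no-headers | awk '$5 != "Bound"'   # any PV whose STATUS (column 5) is not Bound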

Event Timeline

On k3s.catalyst.eqiad1.wikimedia.cloud

kindrobot@k3s:~$ df -hT | grep -E 'ext4'
/dev/sda1      ext4       20G  7.2G   12G  39% /
/dev/sdb       ext4       40G   34G  3.7G  91% /mnt/k3s-data
/dev/sdc       ext4       40G   42M   38G   1% /mnt/k3s-logs

/mnt/k3s-data is a 40G volume at 34G usage (91% used). Moving into /mnt/k3s-data we see:

root@k3s:/mnt/k3s-data/k3s# du -hd1 .
33M	./server
196M	./data
8.6G	./agent
25G	./storage
34G	.

agent is taking up 8.6G. It contains the bits for containers, including filesystem layers for images. storage contains the persistent volumes, in which we see:

root@k3s:/mnt/k3s-data/k3s/storage# du -hd1 .
209M	./pvc-69118c3f-0190-4de3-8a04-20d0d5172f4f_cat-env_wiki-8990c278e4-8-mysql-claim
1.3G	./pvc-7e833d7f-069c-420e-99c2-8b963dab546e_patchdemo_data-patchdemo-mariadb-0
143M	./pvc-f0d8abdc-6550-4715-894c-93671d927f59_control-plane_data-catalyst-api-mariadb-0
8.0K	./pvc-0bb40fca-ad3b-440f-bd14-c242e110120f_control-plane-staging_catalyst-api-staging-catalyst-claim
155M	./pvc-01e83c79-25c0-48f8-8879-3b1ec59430a5_control-plane_data-catalyst-api-mariadb-0
1.1G	./pvc-9df26588-07bb-4c9f-b03b-f3cd4a152b06_cat-env_wiki-8990c278e4-8-mw-claim
22G	./pvc-397f0c96-d81c-45da-8b4c-6afe54817007_patchdemo_patchdemo
156M	./pvc-32211de2-6742-422a-9d54-19a47944a942_patchdemo-staging_data-patchdemo-staging-mariadb-0
2.2M	./pvc-bfb7315f-3e9a-4940-b68c-f5a9334da97d_patchdemo-staging_patchdemo-staging
8.0K	./pvc-b9edb051-399d-4960-a2be-1eb26e99c521_control-plane_catalyst-api-catalyst-claim
155M	./pvc-a99c6262-9fac-4eaf-ae4d-8898b753fbb5_control-plane-staging_data-catalyst-api-staging-mariadb-0
25G	.
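
(As an aside on the 8.6G agent directory: since it is mostly image/filesystem layers, space there could in principle be reclaimed with the crictl that k3s bundles, along the lines of the commands below. This is a sketch of a possible cleanup, not something that was run here, and the --prune flag needs a reasonably recent crictl.)

k3s crictl imagefsinfo    # report how much space the image filesystem is using
k3s crictl rmi --prune    # remove images not referenced by any running container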

The lion's share of the bits goes to the 22G ./pvc-397f0c96-d81c-45da-8b4c-6afe54817007_patchdemo_patchdemo. This PVC is where legacy wikis are stored. Going in there we see lots of folders, many of them empty. If we look at just the ones with content:
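
(This listing was presumably generated with something along these lines; the exact command and filter are an assumption, since empty directories show up as 4.0K in du output.)

du -hd1 . | grep -v '^4\.0K'    # per-wiki usage, skipping effectively-empty directories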

632M ./e9aed027c3
606M ./b3cbd24019
637M ./ceecd96f6d
636M ./dbf8c3830c
631M ./2831896d67
631M ./c755858135
636M ./a6f159b1e0
594M ./12ab10c33c
631M ./da1ee0259a
631M ./eb49b041ed
633M ./2692dab28b
637M ./5be156dd1d
1.4G ./fbec298855
593M ./8e7f6f14f9
631M ./fb2748c1cf
636M ./d7609cc794
632M ./0a8ead2152
594M ./2d544aad97
1.4G ./6bc5837bc4
637M ./d55ccdd694
631M ./3ab1a2a854
229M ./c8ad5ef594
221M ./f725dee674
220M ./fa2297c389
231M ./fe564146d6
229M ./e1d6327d54
631M ./4fd2a6cca5
632M ./aa5be3754c
636M ./35df23a7dc
631M ./29bbc6fb2b
631M ./37a2ff76ff
636M ./149997aa69
636M ./0a309f9ee8
637M ./366555ae53
631M ./4226970e0f
637M ./0d6e534011

There are 36 wikis, averaging roughly 615 MB per wiki (36 × ~615 MB ≈ 22 GB, which lines up with the total for this PVC).

We also checked that all PVs are attached to pods, and confirmed that they were

kindrobot@k3s:~$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                       STORAGECLASS   REASON   AGE
pvc-01e83c79-25c0-48f8-8879-3b1ec59430a5   8Gi        RWO            Delete           Bound    control-plane/data-catalyst-api-mariadb-0                   local-path              138d
pvc-7e833d7f-069c-420e-99c2-8b963dab546e   8Gi        RWO            Delete           Bound    patchdemo/data-patchdemo-mariadb-0                          local-path              138d
pvc-b9edb051-399d-4960-a2be-1eb26e99c521   100Mi      RWO            Delete           Bound    control-plane/catalyst-api-catalyst-claim                   local-path              40d
pvc-397f0c96-d81c-45da-8b4c-6afe54817007   1Gi        RWO            Delete           Bound    patchdemo/patchdemo                                         local-path              38d
pvc-9df26588-07bb-4c9f-b03b-f3cd4a152b06   2560Mi     RWO            Delete           Bound    cat-env/wiki-8990c278e4-8-mw-claim                          local-path              8d
pvc-69118c3f-0190-4de3-8a04-20d0d5172f4f   2560Mi     RWO            Delete           Bound    cat-env/wiki-8990c278e4-8-mysql-claim                       local-path              8d
pvc-0bb40fca-ad3b-440f-bd14-c242e110120f   100Mi      RWO            Delete           Bound    control-plane-staging/catalyst-api-staging-catalyst-claim   local-path              6d2h
pvc-a99c6262-9fac-4eaf-ae4d-8898b753fbb5   8Gi        RWO            Delete           Bound    control-plane-staging/data-catalyst-api-staging-mariadb-0   local-path              6d2h
pvc-bfb7315f-3e9a-4940-b68c-f5a9334da97d   1Gi        RWO            Delete           Bound    patchdemo-staging/patchdemo-staging                         local-path              6d2h
pvc-32211de2-6742-422a-9d54-19a47944a942   8Gi        RWO            Delete           Bound    patchdemo-staging/data-patchdemo-staging-mariadb-0          local-path              6d2h
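
Since a PV can be Bound to a claim without any pod actually mounting it, the pod → PVC references can also be cross-checked with a jsonpath query along these lines (a sketch, not the exact check that was run):

# print "namespace/pod: <PVCs it mounts>" for every pod in the cluster
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{": "}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}'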

There's more research to do, but it seems like this is not an issue of Kubernetes failing to reap PVCs, but simply inadequate storage for the largest PVC, the one holding the vhost-envs. vm-patchdemo has 80 GB for its wikis; we functionally have less than 40 GB. Recommended course of action:

*resize k3s-data volume*

  • delete the k3s-logs volume
  • resize k3s-data to be 75 GB
  • create a new k3s-logs volume of 5 GB
  • reattach / fix mounts for the k3s-logs volume on the k3s VM (see the sketch after this list)
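
Once the volumes have been resized/recreated in WMCS, the on-VM side would be roughly the following. The device names (/dev/sdb for k3s-data, /dev/sdc for k3s-logs) are taken from the df output above and may differ after the reattach, so treat this as a sketch:

sudo resize2fs /dev/sdb             # grow the ext4 filesystem on k3s-data to fill the resized volume
sudo mkfs.ext4 /dev/sdc             # format the new 5 GB k3s-logs volume
sudo mount /dev/sdc /mnt/k3s-logs   # remount it at the existing mount point
# then update /etc/fstab (ideally by UUID from blkid) so the mount survives reboots
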
SDunlap edited projects, added Catalyst (AGL); removed Catalyst (Camp Muir).