After the last round of `lvextend` on prometheus100[56], I noticed that vg0 doesn't have a whole lot of space left:
```
root@prometheus1005:~# vgs
  VG  #PV #LV #SN Attr   VSize  VFree
  vg0   1  13   0 wz--n- <5.24t <342.55g
```
And this is the filesystem situation:
```
root@prometheus1005:~# df -h | grep vg0
/dev/mapper/vg0-root                      73G   17G   53G  24% /
/dev/mapper/vg0-srv                       84G   11M   80G   1% /srv
/dev/mapper/vg0-prometheus--analytics     89G   40G   49G  46% /srv/prometheus/analytics
/dev/mapper/vg0-prometheus--ext           59G  266M   59G   1% /srv/prometheus/ext
/dev/mapper/vg0-prometheus--k8s          767G  690G   77G  90% /srv/prometheus/k8s
/dev/mapper/vg0-prometheus--k8s--aux      49G   18G   32G  36% /srv/prometheus/k8s-aux
/dev/mapper/vg0-prometheus--k8s--dse      49G   32G   18G  65% /srv/prometheus/k8s-dse
/dev/mapper/vg0-prometheus--k8s--mlserve 276G  182G   94G  66% /srv/prometheus/k8s-mlserve
/dev/mapper/vg0-prometheus--k8s--staging  99G   53G   46G  54% /srv/prometheus/k8s-staging
/dev/mapper/vg0-prometheus--services     196G  182G   15G  93% /srv/prometheus/services
/dev/mapper/vg0-prometheus--ops          3.1T  2.8T  286G  91% /srv/prometheus/ops
/dev/mapper/vg0-prometheus--cloud        98G  6.4G   92G    7% /srv/prometheus/cloud
```
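For reference, a quick way to see which logical volumes account for most of vg0's allocated space (this is a generic LVM command, not something from the output above; it needs root on the host):

```shell
# List vg0's LVs largest-first to see where the ~5.24t is allocated.
lvs --units g -o lv_name,lv_size --sort -lv_size vg0
```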
I'm leaving this open-ended in terms of solutions, since there are a few we can explore at this point.
Discussed at the o11y meeting, solutions include:
- Start capping the biggest instances by disk space
- Start alerting on the actual number of days kept in storage
- Audit the biggest metrics
- Add more SSDs to the Prometheus hosts (if applicable; to be researched)
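Rough sketches of what the first three options could look like; the sizes, thresholds, and the localhost:9090 endpoint below are example assumptions, not agreed values:

```shell
# Option 1: cap an instance's on-disk TSDB size with Prometheus's
# size-based retention flag (the 700GB figure is just an example):
#   --storage.tsdb.retention.size=700GB

# Option 2: alert on the actual days retained; PromQL for days currently kept:
#   (time() - prometheus_tsdb_lowest_timestamp_seconds) / 86400

# Option 3: audit the biggest metrics by series count via the TSDB stats API
# (assumes Prometheus listens on localhost:9090 and jq is installed):
curl -s http://localhost:9090/api/v1/status/tsdb \
  | jq '.data.seriesCountByMetricName'
```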