
LVM vg0 close to getting full on prometheus eqiad
Closed, Resolved · Public

Description

After the last round of lvextend on prometheus100[56], I noticed vg0 doesn't have much space left:

root@prometheus1005:~# vgs
  VG  #PV #LV #SN Attr   VSize  VFree   
  vg0   1  13   0 wz--n- <5.24t <342.55g

And this is the filesystem situation:

root@prometheus1005:~# df -h | grep vg0
/dev/mapper/vg0-root                       73G   17G   53G  24% /
/dev/mapper/vg0-srv                        84G   11M   80G   1% /srv
/dev/mapper/vg0-prometheus--analytics      89G   40G   49G  46% /srv/prometheus/analytics
/dev/mapper/vg0-prometheus--ext            59G  266M   59G   1% /srv/prometheus/ext
/dev/mapper/vg0-prometheus--k8s           767G  690G   77G  90% /srv/prometheus/k8s
/dev/mapper/vg0-prometheus--k8s--aux       49G   18G   32G  36% /srv/prometheus/k8s-aux
/dev/mapper/vg0-prometheus--k8s--dse       49G   32G   18G  65% /srv/prometheus/k8s-dse
/dev/mapper/vg0-prometheus--k8s--mlserve  276G  182G   94G  66% /srv/prometheus/k8s-mlserve
/dev/mapper/vg0-prometheus--k8s--staging   99G   53G   46G  54% /srv/prometheus/k8s-staging
/dev/mapper/vg0-prometheus--services      196G  182G   15G  93% /srv/prometheus/services
/dev/mapper/vg0-prometheus--ops           3.1T  2.8T  286G  91% /srv/prometheus/ops
/dev/mapper/vg0-prometheus--cloud          98G  6.4G   92G   7% /srv/prometheus/cloud
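
For context, a minimal sketch of what one of those lvextend rounds looks like; the LV name and the +100G increment are illustrative only, and -r assumes the filesystem on the LV can be grown online:

# Check how many free extents are left in the volume group first
vgs vg0

# Grow one instance's LV and its filesystem in one step; -r/--resizefs runs
# the appropriate filesystem resize tool after extending the LV.
# LV name and size are illustrative, not a recommendation for this task.
lvextend -r -L +100G /dev/vg0/prometheus-ops

# Confirm the new size
df -h /srv/prometheus/ops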

I'm leaving this open-ended in terms of solutions, since there are a few we can explore at this point.

Discussed at the o11y meeting, solutions include:

  • Start capping the biggest instances by space
    • Start alerting on the actual number of days kept in storage
  • Audit the biggest metrics (see the sketch after this list)
  • Add more SSDs to the prometheus hosts (if applicable, to be researched)
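
For the "audit the biggest metrics" item, Prometheus's TSDB stats endpoint already reports the top metric names by series count; a rough sketch of how to pull that, where the port and path prefix are placeholders rather than what these instances actually listen on:

# Top metric names by series count in the TSDB head.
# Port and path prefix are placeholders, not the instances' real listen address.
curl -s http://localhost:9900/ops/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'

# Alternatively via PromQL: top 20 metric names by number of series.
curl -s 'http://localhost:9900/ops/api/v1/query' \
  --data-urlencode 'query=topk(20, count by (__name__) ({__name__=~".+"}))'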

Event Timeline

Prometheus1005 is an R440, which should have 10 total 2.5" bays; today there are (6) 2T SSDs installed. I think it'd be worth getting the ball rolling on adding another (4) SSDs.

Another hardware option that comes to mind is switching to an alternate RAID level. For instance, RAID 50 (two 3x2T RAID5 sets, with a RAID0 combining them) would take us from 6T usable to 8T, trading IOPS for capacity. With a fully populated chassis the difference is more significant: 10T usable in RAID10 vs 16T usable in RAID50. We use this approach on the kafka-logging cluster with success, although there we do it with hardware RAID, which I don't think the prometheus hosts are currently outfitted for.
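
For illustration only, with software RAID (mdadm) the RAID 50 layout described above would look roughly like this on six drives; device names are placeholders and this is a sketch, not a migration plan:

# Two 3-disk RAID5 sets, ~4T usable each with 2T drives (device names are placeholders).
mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc
mdadm --create /dev/md2 --level=5 --raid-devices=3 /dev/sdd /dev/sde /dev/sdf

# RAID0 across the two RAID5 sets: ~8T usable out of 12T raw.
mdadm --create /dev/md3 --level=0 --raid-devices=2 /dev/md1 /dev/md2

# The combined device would then back the LVM volume group.
pvcreate /dev/md3
vgcreate vg0 /dev/md3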

fgiunchedi mentioned this in Unknown Object (Task). Nov 20 2023, 2:36 PM

> Prometheus1005 is an R440, which should have 10 total 2.5" bays; today there are (6) 2T SSDs installed. I think it'd be worth getting the ball rolling on adding another (4) SSDs.

Did so in {T351645}, though I think we can add 2x SSDs per host, not 4x.

Change 975832 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] team-o11y: alert on Prometheus storing a few days of data

https://gerrit.wikimedia.org/r/975832

Change 975832 merged by Filippo Giunchedi:

[operations/alerts@master] team-o11y: alert on Prometheus storing a few days of data

https://gerrit.wikimedia.org/r/975832
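
The idea behind the alert can be expressed with Prometheus's own TSDB metrics; a rough sketch of the kind of query involved, where the 15-day threshold and the port/path are assumptions, not necessarily what the merged change uses:

# Days of data currently held, derived from the oldest timestamp Prometheus
# reports about its own TSDB; fire when it drops below a threshold.
curl -s 'http://localhost:9900/ops/api/v1/query' \
  --data-urlencode 'query=(time() - prometheus_tsdb_lowest_timestamp_seconds) / 86400 < 15'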

Change 977592 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove unused parameter

https://gerrit.wikimedia.org/r/977592

Change 977593 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use per-instance retention hiera variables

https://gerrit.wikimedia.org/r/977593

Change 977594 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: cap prometheus size for k8s and ops instances

https://gerrit.wikimedia.org/r/977594
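
For reference, a size cap like this ultimately maps to Prometheus's --storage.tsdb.retention.size flag; a sketch of how to check what a running instance picked up, with placeholder port/path and an illustrative value:

# Size-based retention is a Prometheus startup flag, e.g.:
#   prometheus ... --storage.tsdb.retention.size=2800GB
# To confirm what a running instance is using (port and path are placeholders):
curl -s http://localhost:9900/ops/api/v1/status/flags | jq '.data["storage.tsdb.retention.size"]'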

Change 977592 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove unused parameter

https://gerrit.wikimedia.org/r/977592

Change 977593 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use per-instance retention hiera variables

https://gerrit.wikimedia.org/r/977593

Change 977667 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: fix storage_retention_size for k8s

https://gerrit.wikimedia.org/r/977667

Change 977667 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: fix storage_retention_size for k8s

https://gerrit.wikimedia.org/r/977667

Change 977594 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: cap prometheus size for ops instance

https://gerrit.wikimedia.org/r/977594

Change 977670 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: restore default retention for k8s prometheus

https://gerrit.wikimedia.org/r/977670

Change 977670 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: restore default retention for k8s prometheus

https://gerrit.wikimedia.org/r/977670

Mentioned in SAL (#wikimedia-operations) [2023-11-27T13:37:46Z] <godog> roll-restart prometheus/ops in eqiad/codfw to apply space-based retention - T351179

Change 977686 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] pontoon: set new prometheus defaults

https://gerrit.wikimedia.org/r/977686

Change 977687 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] k8s: allow setting prometheus retention in cluster definition

https://gerrit.wikimedia.org/r/977687

Change 977688 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: set 850GB retention for prometheus@k8s

https://gerrit.wikimedia.org/r/977688

Change 977686 merged by Filippo Giunchedi:

[operations/puppet@production] pontoon: set new prometheus defaults

https://gerrit.wikimedia.org/r/977686

@fgiunchedi how long do you estimate it will take before we can observe the impact of the new defaults?

Change 979110 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: lower space-based retention to 2800GB

https://gerrit.wikimedia.org/r/979110

Change 979110 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: lower space-based retention to 2800GB

https://gerrit.wikimedia.org/r/979110

Mentioned in SAL (#wikimedia-operations) [2023-11-30T14:48:22Z] <godog> roll-restart prometheus/ops in eqiad/codfw to apply new size-based retention - T351179

> @fgiunchedi how long do you estimate it will take before we can observe the impact of the new defaults?

Impact will be immediate (i.e. we're already seeing it)
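
One way to watch the effect, assuming the mount points from the description: once the cap kicks in, Prometheus deletes its oldest blocks, so usage on the capped filesystems should level off around the configured size.

# Filesystem usage should stop growing past the configured retention size.
df -h /srv/prometheus/ops /srv/prometheus/k8s

# Or ask Prometheus how many bytes of block data it currently holds
# (port and path prefix are placeholders).
curl -s 'http://localhost:9900/ops/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_storage_blocks_bytes'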

Change 977687 merged by Filippo Giunchedi:

[operations/puppet@production] k8s: allow setting prometheus retention in cluster definition

https://gerrit.wikimedia.org/r/977687

Change 977688 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: set 850GB retention for prometheus@k8s

https://gerrit.wikimedia.org/r/977688

Mentioned in SAL (#wikimedia-operations) [2023-12-04T09:57:39Z] <godog> roll-restart prometheus/k8s to apply size-based retention - T351179

Change 979898 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: adjust prometheus k8s retention to current utilization

https://gerrit.wikimedia.org/r/979898

Change 979898 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: adjust prometheus k8s retention to current utilization

https://gerrit.wikimedia.org/r/979898

Change 979943 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: update kubernetes::clusters in CI

https://gerrit.wikimedia.org/r/979943

Change 979943 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: update kubernetes::clusters in CI

https://gerrit.wikimedia.org/r/979943

The two biggest prometheus instances (k8s, ops) have been size-capped. Leaving the task open since we have related/follow-up tasks.

fgiunchedi claimed this task.

This is done; we have more space for prometheus.