
LVM vg0 close to getting full on prometheus eqiad
Closed, Resolved · Public

Description

After the last round of lvextend on prometheus100[56], I noticed vg0 doesn't have much space left:

root@prometheus1005:~# vgs
  VG  #PV #LV #SN Attr   VSize  VFree   
  vg0   1  13   0 wz--n- <5.24t <342.55g

And this is the filesystem situation:

root@prometheus1005:~# df -h | grep vg0
/dev/mapper/vg0-root                       73G   17G   53G  24% /
/dev/mapper/vg0-srv                        84G   11M   80G   1% /srv
/dev/mapper/vg0-prometheus--analytics      89G   40G   49G  46% /srv/prometheus/analytics
/dev/mapper/vg0-prometheus--ext            59G  266M   59G   1% /srv/prometheus/ext
/dev/mapper/vg0-prometheus--k8s           767G  690G   77G  90% /srv/prometheus/k8s
/dev/mapper/vg0-prometheus--k8s--aux       49G   18G   32G  36% /srv/prometheus/k8s-aux
/dev/mapper/vg0-prometheus--k8s--dse       49G   32G   18G  65% /srv/prometheus/k8s-dse
/dev/mapper/vg0-prometheus--k8s--mlserve  276G  182G   94G  66% /srv/prometheus/k8s-mlserve
/dev/mapper/vg0-prometheus--k8s--staging   99G   53G   46G  54% /srv/prometheus/k8s-staging
/dev/mapper/vg0-prometheus--services      196G  182G   15G  93% /srv/prometheus/services
/dev/mapper/vg0-prometheus--ops           3.1T  2.8T  286G  91% /srv/prometheus/ops
/dev/mapper/vg0-prometheus--cloud          98G  6.4G   92G   7% /srv/prometheus/cloud
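
For context, a minimal sketch of what one of those lvextend rounds looks like; the LV name and the +100G increment are illustrative only, and -r assumes the filesystem on the LV can be grown online:

# Check how many free extents are left in the volume group first
vgs vg0

# Grow one instance's LV and its filesystem in one step; -r/--resizefs runs
# the appropriate filesystem resize tool after extending the LV.
# LV name and size are illustrative, not a recommendation for this task.
lvextend -r -L +100G /dev/vg0/prometheus-ops

# Confirm the new size
df -h /srv/prometheus/ops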

I'm leaving this open-ended in terms of solutions, since there are a few we can explore at this point.

Discussed at the o11y meeting, solutions include:

  • Start capping the biggest instances by space
    • Start alerting on the actual number of days kept in storage
  • Audit the biggest metrics (see the sketch after this list)
  • Add more SSDs to the prometheus hosts (if applicable, to be researched)
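
For the "audit the biggest metrics" item, Prometheus's TSDB stats endpoint already reports the top metric names by series count; a rough sketch of how to pull that, where the port and path prefix are placeholders rather than what these instances actually listen on:

# Top metric names by series count in the TSDB head.
# Port and path prefix are placeholders, not the instances' real listen address.
curl -s http://localhost:9900/ops/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'

# Alternatively via PromQL: top 20 metric names by number of series.
curl -s 'http://localhost:9900/ops/api/v1/query' \
  --data-urlencode 'query=topk(20, count by (__name__) ({__name__=~".+"}))'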

Event Timeline

Prometheus1005 is an R440, which should have 10 total 2.5" bays; today there are (6) 2T SSDs installed. I think it'd be worth getting the ball rolling on adding another (4) SSDs.

Another hardware option that comes to mind is switching to an alternate RAID level. For instance, RAID 50 (two 3x2T RAID5 sets, with a RAID0 combining them) would take us from 6T usable to 8T, trading IOPS for capacity. With a fully populated chassis the difference is more significant: 10T usable in RAID10 vs 16T usable in RAID50. We use this approach on the kafka-logging cluster with success, although there we do it with hardware RAID, which I don't think the prometheus hosts are currently outfitted for.
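
For illustration only, with software RAID (mdadm) the RAID 50 layout described above would look roughly like this on six drives; device names are placeholders and this is a sketch, not a migration plan:

# Two 3-disk RAID5 sets, ~4T usable each with 2T drives (device names are placeholders).
mdadm --create /dev/md1 --level=5 --raid-devices=3 /dev/sda /dev/sdb /dev/sdc
mdadm --create /dev/md2 --level=5 --raid-devices=3 /dev/sdd /dev/sde /dev/sdf

# RAID0 across the two RAID5 sets: ~8T usable out of 12T raw.
mdadm --create /dev/md3 --level=0 --raid-devices=2 /dev/md1 /dev/md2

# The combined device would then back the LVM volume group.
pvcreate /dev/md3
vgcreate vg0 /dev/md3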

fgiunchedi mentioned this in Unknown Object (Task). Nov 20 2023, 2:36 PM

> Prometheus1005 is an R440, which should have 10 total 2.5" bays; today there are (6) 2T SSDs installed. I think it'd be worth getting the ball rolling on adding another (4) SSDs.

Did so in {T351645}, though I think we can add 2x SSDs per host, not 4x.

Change 975832 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] team-o11y: alert on Prometheus storing a few days of data

https://gerrit.wikimedia.org/r/975832

Change 975832 merged by Filippo Giunchedi:

[operations/alerts@master] team-o11y: alert on Prometheus storing a few days of data

https://gerrit.wikimedia.org/r/975832
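
The idea behind the alert can be expressed with Prometheus's own TSDB metrics; a rough sketch of the kind of query involved, where the 15-day threshold and the port/path are assumptions, not necessarily what the merged change uses:

# Days of data currently held, derived from the oldest timestamp Prometheus
# reports about its own TSDB; fire when it drops below a threshold.
curl -s 'http://localhost:9900/ops/api/v1/query' \
  --data-urlencode 'query=(time() - prometheus_tsdb_lowest_timestamp_seconds) / 86400 < 15'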

Change 977592 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: remove unused parameter

https://gerrit.wikimedia.org/r/977592

Change 977593 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: use per-instance retention hiera variables

https://gerrit.wikimedia.org/r/977593

Change 977594 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: cap prometheus size for k8s and ops instances

https://gerrit.wikimedia.org/r/977594
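
For reference, a size cap like this ultimately maps to Prometheus's --storage.tsdb.retention.size flag; a sketch of how to check what a running instance picked up, with placeholder port/path and an illustrative value:

# Size-based retention is a Prometheus startup flag, e.g.:
#   prometheus ... --storage.tsdb.retention.size=2800GB
# To confirm what a running instance is using (port and path are placeholders):
curl -s http://localhost:9900/ops/api/v1/status/flags | jq '.data["storage.tsdb.retention.size"]'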

Change 977592 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: remove unused parameter

https://gerrit.wikimedia.org/r/977592

Change 977593 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: use per-instance retention hiera variables

https://gerrit.wikimedia.org/r/977593

Change 977667 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: fix storage_retention_size for k8s

https://gerrit.wikimedia.org/r/977667

Change 977667 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: fix storage_retention_size for k8s

https://gerrit.wikimedia.org/r/977667

Change 977594 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: cap prometheus size for ops instance

https://gerrit.wikimedia.org/r/977594

Change 977670 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: restore default retention for k8s prometheus

https://gerrit.wikimedia.org/r/977670

Change 977670 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: restore default retention for k8s prometheus

https://gerrit.wikimedia.org/r/977670

Mentioned in SAL (#wikimedia-operations) [2023-11-27T13:37:46Z] <godog> roll-restart prometheus/ops in eqiad/codfw to apply space-based retention - T351179

Change 977686 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] pontoon: set new prometheus defaults

https://gerrit.wikimedia.org/r/977686

Change 977687 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] k8s: allow setting prometheus retention in cluster definition

https://gerrit.wikimedia.org/r/977687

Change 977688 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: set 850GB retention for prometheus@k8s

https://gerrit.wikimedia.org/r/977688

Change 977686 merged by Filippo Giunchedi:

[operations/puppet@production] pontoon: set new prometheus defaults

https://gerrit.wikimedia.org/r/977686

@fgiunchedi how long do you estimate it will take before we can observe the impact of the new defaults?

Change 979110 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: lower space-based retention to 2800GB

https://gerrit.wikimedia.org/r/979110

Change 979110 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: lower space-based retention to 2800GB

https://gerrit.wikimedia.org/r/979110

Mentioned in SAL (#wikimedia-operations) [2023-11-30T14:48:22Z] <godog> roll-restart prometheus/ops in eqiad/codfw to apply new size-based retention - T351179

> @fgiunchedi how long do you estimate it will take before we can observe the impact of the new defaults?

Impact will be immediate (i.e. we're already seeing it)
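
One way to watch the effect, assuming the mount points from the description: once the cap kicks in, Prometheus deletes its oldest blocks, so usage on the capped filesystems should level off around the configured size.

# Filesystem usage should stop growing past the configured retention size.
df -h /srv/prometheus/ops /srv/prometheus/k8s

# Or ask Prometheus how many bytes of block data it currently holds
# (port and path prefix are placeholders).
curl -s 'http://localhost:9900/ops/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_storage_blocks_bytes'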

Change 977687 merged by Filippo Giunchedi:

[operations/puppet@production] k8s: allow setting prometheus retention in cluster definition

https://gerrit.wikimedia.org/r/977687

Change 977688 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: set 850GB retention for prometheus@k8s

https://gerrit.wikimedia.org/r/977688

Mentioned in SAL (#wikimedia-operations) [2023-12-04T09:57:39Z] <godog> roll-restart prometheus/k8s to apply size-based retention - T351179

Change 979898 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: adjust prometheus k8s retention to current utilization

https://gerrit.wikimedia.org/r/979898

Change 979898 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: adjust prometheus k8s retention to current utilization

https://gerrit.wikimedia.org/r/979898

Change 979943 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: update kubernetes::clusters in CI

https://gerrit.wikimedia.org/r/979943

Change 979943 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: update kubernetes::clusters in CI

https://gerrit.wikimedia.org/r/979943

The two biggest prometheus instances (k8s, ops) have been size-capped. Leaving the task open since we have related/follow-up tasks.

fgiunchedi claimed this task.

This is done; we have more space for prometheus.