Prometheus PoPs disk space utilization
Closed, Resolved (Public)

Description

Looks like Prometheus in esams/ulsfo/eqsin is getting "tight" on space (not alarmingly so yet, but enough to trigger an Icinga warning in ulsfo):

$ for i in prometheus3001.esams prometheus4001.ulsfo prometheus5001.eqsin ; do ssh ${i}.wmnet df -h /srv; done
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/vg0-srv  146G  110G   37G  76% /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       118G  104G  7.7G  94% /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       118G  101G   11G  91% /

Event Timeline

lmata triaged this task as Medium priority. Mar 15 2021, 4:20 PM
lmata moved this task from Inbox to Backlog on the observability board.
lmata subscribed.

Moving to the short-term backlog

Change 681684 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: trim prometheus retention on PoPs

https://gerrit.wikimedia.org/r/681684

Change 681684 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: trim prometheus retention on PoPs

https://gerrit.wikimedia.org/r/681684
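
(For illustration, a retention trim in hieradata might look like the sketch below. The actual key name and value used in operations/puppet aren't shown in this task, so profile::prometheus::ops::storage_retention and '270d' are assumptions; the underlying mechanism is Prometheus's --storage.tsdb.retention.time flag.)

# Hypothetical hieradata override for PoP prometheus hosts; the real
# key and value may differ. Shorter retention means less TSDB data
# kept on /srv.
profile::prometheus::ops::storage_retention: '270d'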

fgiunchedi claimed this task.

Back to 90-ish percent max fs utilization

===== NODE GROUP =====
(1) prometheus3001.esams.wmnet
----- OUTPUT of 'df -h /srv' -----
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/vg0-srv  146G  106G   40G  73% /srv
===== NODE GROUP =====
(1) prometheus5001.eqsin.wmnet
----- OUTPUT of 'df -h /srv' -----
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       118G   99G   13G  89% /
===== NODE GROUP =====
(1) prometheus4001.ulsfo.wmnet
----- OUTPUT of 'df -h /srv' -----
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       118G  102G   11G  92% /

Today we tried to create 2 new VMs in the ganeti esams cluster and actually ran out of resources.

Upon checking what uses resources here, the situation is roughly as follows:

bastion: 40G (it used to host prometheus, but no longer does; left a comment at T243057#7130650)

install, ncredir, netflow: 20G each
ping: 5G

prometheus: 2 disks with a total of 278G

With about 390G DTotal on each ganeti node, prometheus's 278G alone is roughly 70% of all ganeti disk resources in the PoP (278G / 390G ≈ 71%).
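
For reference, per-node disk capacity can be checked on the Ganeti cluster master with gnt-node; the command below is a sketch (dtotal/dfree are standard gnt-node output fields, output omitted here):

# Illustrative: list total/free disk per Ganeti node, run on the
# cluster master.
$ sudo gnt-node list -o name,dtotal,dfree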

But does this mean our resources are too low in general, or that prometheus uses too much? I have no idea.

cc: @BBlack

The latter: AFAICS prometheus in esams is significantly larger than its counterparts in e.g. eqsin or ulsfo. IIRC this is due to the migration work in T243057; at steady state I don't think we'd need more than 120-150G in PoPs. (We can also ask Prometheus to be bounded by available disk space rather than by time, if disk space on Ganeti gets really tight.)
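
For context, being "bounded by disk space" is a stock Prometheus feature: --storage.tsdb.retention.size (available since Prometheus 2.7) caps the on-disk TSDB by size, alongside or instead of time-based retention. A minimal sketch of the flags involved; the data path and the 270d/100GB values are illustrative, not what our puppetization actually sets:

# Illustrative Prometheus invocation capping the TSDB by size as well
# as time; whichever limit is hit first triggers block deletion.
/usr/bin/prometheus \
  --storage.tsdb.path=/srv/prometheus/ops/metrics \
  --storage.tsdb.retention.time=270d \
  --storage.tsdb.retention.size=100GB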

ACK, thanks @fgiunchedi. As commented on the linked ticket, multiple people have mentioned we have metal there that prometheus could move to. That seems to me to be what we should probably do (instead of using that hardware as another ganeti server).