
prometheus1003/prometheus1004 /srv/prometheus/ops disk space warning
Closed, Resolved (Public)

Description

Icinga is showing a warning for prometheus1003 and prometheus1004 because /srv/prometheus/ops is at 95% disk usage:

root@prometheus1003:~# df -hT
Filesystem                                   Type      Size  Used Avail Use% Mounted on
udev                                         devtmpfs   48G     0   48G   0% /dev
tmpfs                                        tmpfs     9.5G  1.1G  8.4G  11% /run
/dev/mapper/vg--ssd-root                     ext4       37G  3.1G   32G   9% /
tmpfs                                        tmpfs      48G     0   48G   0% /dev/shm
tmpfs                                        tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs                                        tmpfs      48G     0   48G   0% /sys/fs/cgroup
/dev/mapper/vg--hdd-prometheus--services     ext4      275G  109G  153G  42% /srv/prometheus/services
/dev/mapper/vg--hdd-prometheus--global       ext4      295G  218G   62G  79% /srv/prometheus/global
/dev/mapper/vg--ssd-prometheus--ops          ext4      787G  705G   42G  95% /srv/prometheus/ops
/dev/mapper/vg--ssd-prometheus--k8s          ext4       98G   49G   45G  53% /srv/prometheus/k8s
/dev/mapper/vg--hdd-prometheus--k8s--staging ext4       99G  6.7G   87G   8% /srv/prometheus/k8s-staging
/dev/mapper/vg--hdd-prometheus--analytics    ext4       99G   24G   71G  25% /srv/prometheus/analytics
/dev/sda1                                    ext4      268M   95M  156M  38% /boot

Event Timeline

Volans triaged this task as High priority. Feb 16 2020, 11:08 AM
Volans added a subscriber: Volans.

It seems directly related to the bump in retention https://gerrit.wikimedia.org/r/c/operations/puppet/+/564680 as can be clearly seen in the graph below.

Screenshot 2020-02-16 at 12.03.25.png (1×2 px, 326 KB)

On the bright side we have some additional free space in the volume group.

Setting priority to High as usage is increasing by ~0.8% per day.

We should consider lowering the alert thresholds so that it goes CRITICAL at something like 90%, to leave a bit more room to act on it.
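To put rough numbers on the threshold suggestion, here is a minimal sketch that estimates the lead time before the filesystem fills up. It assumes growth stays linear at the ~0.8 percentage points per day quoted above, which is a simplification; the function name `days_until_full` is hypothetical and not part of any existing tooling.

```python
def days_until_full(used_pct: float, growth_pct_per_day: float,
                    limit_pct: float = 100.0) -> float:
    """Days until usage reaches limit_pct, assuming linear growth."""
    if growth_pct_per_day <= 0:
        raise ValueError("growth rate must be positive")
    # Clamp at zero in case usage is already at or past the limit.
    return max(0.0, (limit_pct - used_pct) / growth_pct_per_day)


# Assumption: ~0.8 percentage points per day, as estimated in this task.
GROWTH_PCT_PER_DAY = 0.8

# Lead time from the 95% level reported by df.
lead_at_95 = days_until_full(95.0, GROWTH_PCT_PER_DAY)
# Lead time if the alert had gone CRITICAL at 90% instead.
lead_at_90 = days_until_full(90.0, GROWTH_PCT_PER_DAY)

print(f"from 95%: {lead_at_95:.2f} days, from 90%: {lead_at_90:.2f} days")
```

Under these assumptions a 90% threshold would roughly double the time available to react, from about 6 days to about 12.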

For the record:

root@prometheus1003:~# pvs
  PV         VG     Fmt  Attr PSize PFree
  /dev/sda3  vg-ssd lvm2 a--  1.42t 514.16g
  /dev/sdb1  vg-hdd lvm2 a--  3.64t   2.88t

I am not increasing the LV now as we still have room for @fgiunchedi to take a look tomorrow.

Mentioned in SAL (#wikimedia-operations) [2020-02-17T09:06:09Z] <godog> +50G to prometheus/ops fs on prometheus eqiad - T245361

Mentioned in SAL (#wikimedia-operations) [2020-02-17T09:09:58Z] <godog> +10G to prometheus/ops fs on prometheus eqiad - T245361

Thanks @Marostegui @Volans! Indeed, the space used grew because of the longer retention. I added 150G in total to the LVs (the last log entry is wrong: it was +100G, not +10G), which should be enough for usage to stabilize in ~15d and leave some headroom too.
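As a back-of-the-envelope check of the "enough to stabilize in ~15d" estimate, the sketch below recomputes utilization from the `df` figures in the description. It is only an approximation under stated assumptions: all 150G of added space becomes available (ext4 reserved blocks are ignored), usage starts from the 705G reported on Feb 16, and "0.8% per day" means 0.8% of the original 787G filesystem size. The helper `df_use_pct` is hypothetical; it rounds up, which reproduces the 95% shown by `df` above.

```python
import math


def df_use_pct(used_gb: float, avail_gb: float) -> int:
    """Use% computed as used / (used + avail), rounded up."""
    return math.ceil(100 * used_gb / (used_gb + avail_gb))


# Figures taken from the df output for /srv/prometheus/ops.
SIZE_GB, USED_GB, AVAIL_GB = 787, 705, 42
ADDED_GB = 150               # total extension mentioned in the comment
GROWTH_PCT_PER_DAY = 0.8     # assumption: percent of SIZE_GB per day
DAYS_TO_STABILIZE = 15       # assumption: the ~15d estimate from the comment

before = df_use_pct(USED_GB, AVAIL_GB)
after = df_use_pct(USED_GB, AVAIL_GB + ADDED_GB)

# Space the longer retention is expected to consume before levelling off.
expected_growth_gb = SIZE_GB * GROWTH_PCT_PER_DAY / 100 * DAYS_TO_STABILIZE
headroom_gb = AVAIL_GB + ADDED_GB - expected_growth_gb

print(f"Use% before: {before}%, right after the extension: {after}%")
print(f"expected growth: {expected_growth_gb:.0f}G, headroom left: {headroom_gb:.0f}G")
```

With these inputs the expected growth over 15 days stays below the space added, which is consistent with the extension leaving headroom once retention fills up.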

I'll take this task and resolve it once space usage has stabilized again.

fgiunchedi lowered the priority of this task from High to Medium. Feb 17 2020, 9:21 AM

Utilization growth stabilized around Feb 20th and is now back to organic growth; resolving.