
prometheus1003/prometheus1004 /srv/prometheus/ops disk space warning
Closed, Resolved · Public

Description

Icinga is showing a warning for prometheus1003 and prometheus1004 because /srv/prometheus/ops is at 95% disk usage:

root@prometheus1003:~# df -hT
Filesystem                                   Type      Size  Used Avail Use% Mounted on
udev                                         devtmpfs   48G     0   48G   0% /dev
tmpfs                                        tmpfs     9.5G  1.1G  8.4G  11% /run
/dev/mapper/vg--ssd-root                     ext4       37G  3.1G   32G   9% /
tmpfs                                        tmpfs      48G     0   48G   0% /dev/shm
tmpfs                                        tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs                                        tmpfs      48G     0   48G   0% /sys/fs/cgroup
/dev/mapper/vg--hdd-prometheus--services     ext4      275G  109G  153G  42% /srv/prometheus/services
/dev/mapper/vg--hdd-prometheus--global       ext4      295G  218G   62G  79% /srv/prometheus/global
/dev/mapper/vg--ssd-prometheus--ops          ext4      787G  705G   42G  95% /srv/prometheus/ops
/dev/mapper/vg--ssd-prometheus--k8s          ext4       98G   49G   45G  53% /srv/prometheus/k8s
/dev/mapper/vg--hdd-prometheus--k8s--staging ext4       99G  6.7G   87G   8% /srv/prometheus/k8s-staging
/dev/mapper/vg--hdd-prometheus--analytics    ext4       99G   24G   71G  25% /srv/prometheus/analytics
/dev/sda1                                    ext4      268M   95M  156M  38% /boot

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 16 2020, 10:46 AM
Volans triaged this task as High priority. · Feb 16 2020, 11:08 AM
Volans added a subscriber: Volans.

It seems directly related to the bump in retention https://gerrit.wikimedia.org/r/c/operations/puppet/+/564680 as can be clearly seen in the graph below.

On the bright side we have some additional free space in the volume group.

Setting priority to High as usage is increasing by ~0.8% per day.
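As a rough sketch of the arithmetic behind that priority, assuming the ~0.8% figure means percentage points of filesystem capacity per day and that growth stays linear (both are assumptions, and the function name is hypothetical):

```python
def days_until_full(used_pct: float, growth_pct_per_day: float,
                    limit_pct: float = 100.0) -> float:
    """Estimate days until usage reaches limit_pct, assuming linear growth.

    growth_pct_per_day is read as percentage points of total capacity
    per day (an assumption about how the ~0.8% figure was measured).
    """
    if growth_pct_per_day <= 0:
        raise ValueError("growth must be positive to estimate a fill date")
    remaining = limit_pct - used_pct
    # Already at or past the limit: no time left.
    return max(0.0, remaining / growth_pct_per_day)


if __name__ == "__main__":
    # Figures from this task: 95% used, ~0.8% per day.
    print(f"{days_until_full(95.0, 0.8):.2f} days until 100%")
```

With the figures from this task that leaves about six days, which is consistent with treating it as High but not needing a resize on the spot.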

We should consider lowering the alarm threshold so that it goes CRITICAL at something like 90%, which would give us a bit more room to act on it.
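To illustrate the proposal, here is a minimal sketch of a threshold check with CRITICAL at 90%. The 85% WARNING level and the function name are hypothetical, and this is not the actual Icinga check configuration:

```python
def classify_disk_usage(used_pct: float, warn_pct: float = 85.0,
                        crit_pct: float = 90.0) -> str:
    """Map a disk usage percentage to a Nagios/Icinga-style state.

    crit_pct=90 follows the suggestion in this task; warn_pct=85 is a
    made-up value for illustration only.
    """
    if not 0.0 <= used_pct <= 100.0:
        raise ValueError("used_pct must be between 0 and 100")
    if warn_pct >= crit_pct:
        raise ValueError("warn_pct must be lower than crit_pct")
    if used_pct >= crit_pct:
        return "CRITICAL"
    if used_pct >= warn_pct:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Use% values taken from the df output in the description.
    for mount, pct in [("/srv/prometheus/ops", 95.0),
                       ("/srv/prometheus/global", 79.0)]:
        print(mount, classify_disk_usage(pct))
```

At ~0.8% per day, alerting at 90% instead of 95% would roughly double the time left before the filesystem fills.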

For the record:

root@prometheus1003:~# pvs
  PV         VG     Fmt  Attr PSize PFree
  /dev/sda3  vg-ssd lvm2 a--  1.42t 514.16g
  /dev/sdb1  vg-hdd lvm2 a--  3.64t   2.88t

I am not increasing the LV now as we still have room for @fgiunchedi to take a look tomorrow.

Mentioned in SAL (#wikimedia-operations) [2020-02-17T09:06:09Z] <godog> +50G to prometheus/ops fs on prometheus eqiad - T245361

Mentioned in SAL (#wikimedia-operations) [2020-02-17T09:09:58Z] <godog> +10G to prometheus/ops fs on prometheus eqiad - T245361

Thanks @Marostegui @Volans! Indeed the space used grew because of the longer retention. I added 150G to the LVs (the last SAL entry is wrong: it should read +100G, not +10G), which should be enough for usage to stabilize in ~15d and leave headroom too.

I'll take this and resolve once space has stabilized again
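For a back-of-the-envelope view of what the +150G resize does, the sketch below reuses the df and pvs figures quoted earlier in this task. It assumes the filesystem grows by exactly the added LV size and ignores data written since the df snapshot, so the results are approximate; the function name is hypothetical.

```python
def after_extend(size_g: float, used_g: float, vg_free_g: float,
                 added_g: float) -> tuple[float, float, float]:
    """Return (new size in G, new usage in %, VG free space left in G)."""
    if added_g > vg_free_g:
        raise ValueError("not enough free space in the volume group")
    new_size = size_g + added_g
    new_used_pct = 100.0 * used_g / new_size
    return new_size, new_used_pct, vg_free_g - added_g


if __name__ == "__main__":
    # 787G size and 705G used from df; 514.16g PFree on vg-ssd from pvs.
    size, pct, free = after_extend(787.0, 705.0, 514.16, 150.0)
    print(f"size={size:.0f}G used={pct:.1f}% vg_free={free:.2f}G")
```

Under those assumptions usage drops to roughly 75% of the enlarged filesystem, with about 364G still unallocated in vg-ssd.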

fgiunchedi lowered the priority of this task from High to Medium. · Feb 17 2020, 9:21 AM
fgiunchedi closed this task as Resolved. · Mar 2 2020, 8:32 AM

Utilization growth stabilized around Feb 20th and is now back to organic growth; resolving.