prometheus1004 /srv/prometheus/ops almost full
Closed, Resolved · Public

Description

01:26 <+icinga-wm> PROBLEM - Disk space on prometheus1004 is CRITICAL: DISK CRITICAL - free space: /srv/prometheus/ops 19053 MB (3% inode=67%)
root@prometheus1004:/srv/prometheus/ops# df -hT
Filesystem                                   Type      Size  Used Avail Use% Mounted on
udev                                         devtmpfs   10M     0   10M   0% /dev
tmpfs                                        tmpfs      19G  1.9G   17G  11% /run
/dev/dm-0                                    ext4       37G  3.7G   31G  11% /
tmpfs                                        tmpfs      48G     0   48G   0% /dev/shm
tmpfs                                        tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs                                        tmpfs      48G     0   48G   0% /sys/fs/cgroup
/dev/sda1                                    ext4       88M   38M   43M  47% /boot
/dev/mapper/vg--hdd-prometheus--global       ext4      296G  166G  115G  60% /srv/prometheus/global
/dev/mapper/vg--hdd-prometheus--services     ext4      197G   56G  142G  29% /srv/prometheus/services
/dev/mapper/vg--hdd-prometheus--k8s--staging ext4       50G  2.7G   44G   6% /srv/prometheus/k8s-staging
/dev/mapper/vg--ssd-prometheus--ops          ext4      493G  448G   22G  96% /srv/prometheus/ops
/dev/mapper/vg--ssd-prometheus--k8s          ext4       99G  8.2G   86G   9% /srv/prometheus/k8s
/dev/mapper/vg--hdd-prometheus--analytics    ext4       99G  9.8G   84G  11% /srv/prometheus/analytics

I checked whether the LVM volume could be extended, and it looks like there is some space available, but the tooling is reporting Input/output errors, so I am not confident enough to expand it myself (see the note after the lvs output below).

root@prometheus1004:/srv/prometheus/ops# vgs
  /dev/vg-hdd/prom-snap-global: read failed after 0 of 4096 at 322122481664: Input/output error
  /dev/vg-hdd/prom-snap-global: read failed after 0 of 4096 at 322122539008: Input/output error
  /dev/vg-hdd/prom-snap-global: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg-hdd/prom-snap-global: read failed after 0 of 4096 at 4096: Input/output error
  /dev/vg-ssd/prom-snap-ops: read failed after 0 of 4096 at 536870846464: Input/output error
  /dev/vg-ssd/prom-snap-ops: read failed after 0 of 4096 at 536870903808: Input/output error
  /dev/vg-ssd/prom-snap-ops: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg-ssd/prom-snap-ops: read failed after 0 of 4096 at 4096: Input/output error
  VG     #PV #LV #SN Attr   VSize VFree
  vg-hdd   1   8   4 wz--n- 3.64t   2.51t
  vg-ssd   1   6   2 wz--n- 1.42t 313.95g
root@prometheus1004:/srv/prometheus/ops# lvs
  /dev/vg-hdd/prom-snap-global: read failed after 0 of 4096 at 322122481664: Input/output error
  /dev/vg-hdd/prom-snap-global: read failed after 0 of 4096 at 322122539008: Input/output error
  /dev/vg-hdd/prom-snap-global: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg-hdd/prom-snap-global: read failed after 0 of 4096 at 4096: Input/output error
  /dev/vg-ssd/prom-snap-ops: read failed after 0 of 4096 at 536870846464: Input/output error
  /dev/vg-ssd/prom-snap-ops: read failed after 0 of 4096 at 536870903808: Input/output error
  /dev/vg-ssd/prom-snap-ops: read failed after 0 of 4096 at 0: Input/output error
  /dev/vg-ssd/prom-snap-ops: read failed after 0 of 4096 at 4096: Input/output error
  LV                     VG     Attr       LSize   Pool Origin                 Data%  Meta%  Move Log Cpy%Sync Convert
  prom-snap-analytics    vg-hdd swi-aos--- 100.39g      prometheus-analytics   21.25
  prom-snap-global       vg-hdd swi-Ios--- 150.00g      prometheus-global      100.00
  prom-snap-k8s-staging  vg-hdd swi-aos---  50.20g      prometheus-k8s-staging 16.32
  prom-snap-services     vg-hdd swi-aos--- 200.79g      prometheus-services    43.35
  prometheus-analytics   vg-hdd owi-aos--- 100.00g
  prometheus-global      vg-hdd owi-aos--- 300.00g
  prometheus-k8s-staging vg-hdd owi-aos---  50.00g
  prometheus-services    vg-hdd owi-aos--- 200.00g
  prom-snap-k8s          vg-ssd swi-aos--- 100.39g      prometheus-k8s         16.75
  prom-snap-ops          vg-ssd swi-Ios--- 400.00g      prometheus-ops         100.00
  prometheus-k8s         vg-ssd owi-aos--- 100.00g
  prometheus-ops         vg-ssd owi-aos--- 500.00g
  root                   vg-ssd -wi-ao----  37.25g
  swap                   vg-ssd -wi-ao---- 952.00m
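
Note on the Input/output errors above: they come from the two snapshot volumes themselves, not from the underlying disks. prom-snap-global and prom-snap-ops both show Data% at 100.00 and a capital "I" in their attribute string (swi-Ios---), which marks a snapshot whose copy-on-write space is exhausted; LVM invalidates such snapshots, and any read of them fails with an I/O error. The origin volumes are unaffected, so the VG's free space can still be used once the dead snapshots are removed. A read-only check, as a sketch:

# A capital "I" in lv_attr together with Data% at 100.00 means the snapshot
# overflowed and was invalidated; reads against it fail with I/O errors.
lvs -o lv_name,vg_name,lv_attr,origin,data_percent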

Event Timeline

Joe triaged this task as Unbreak Now! priority. Apr 8 2019, 5:42 AM

Triaged to UBN!, as by my estimation the partition serving /srv/prometheus/ops will fill up in the next 2-3 days.
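
As a back-of-the-envelope check of that estimate (the growth rate below is a placeholder assumption; measure it from two df samples taken a day apart):

# Sketch: divide the remaining space by the daily growth to estimate the
# time until the filesystem fills up. 8 GB/day is assumed, not measured.
avail_gb=$(df -BG --output=avail /srv/prometheus/ops | tail -n1 | tr -dc '0-9')
growth_gb_per_day=8
echo "roughly $(( avail_gb / growth_gb_per_day )) days until full"

With the ~19-22 GB currently free, anything in the high single digits of GB/day lands in the same 2-3 day window.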

fgiunchedi lowered the priority of this task from Unbreak Now! to Medium. Apr 8 2019, 7:31 AM

I've cleaned up the snapshots used for the migration and added 300G to the ops instance filesystem (matching prometheus1003). Lowering to normal, as this host is going to get reimaged anyway to migrate to Prometheus v2.
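
The snapshot cleanup itself isn't captured in the transcript below; assuming it was a plain lvremove of the migration snapshots, it would have looked roughly like:

# Drop the invalidated migration snapshots (names from the lvs output above);
# this also silences the read-failure noise printed by vgs/lvs. The remaining
# prom-snap-* volumes can be removed the same way.
lvremove vg-ssd/prom-snap-ops
lvremove vg-hdd/prom-snap-global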

root@prometheus1004:~# df -h
Filesystem                                    Size  Used Avail Use% Mounted on
udev                                           10M     0   10M   0% /dev
tmpfs                                          19G  2.0G   17G  11% /run
/dev/dm-0                                      37G  3.7G   32G  11% /
tmpfs                                          48G     0   48G   0% /dev/shm
tmpfs                                         5.0M     0  5.0M   0% /run/lock
tmpfs                                          48G     0   48G   0% /sys/fs/cgroup
/dev/sda1                                      88M   38M   43M  47% /boot
/dev/mapper/vg--hdd-prometheus--global        296G  167G  115G  60% /srv/prometheus/global
/dev/mapper/vg--hdd-prometheus--services      197G   56G  142G  29% /srv/prometheus/services
/dev/mapper/vg--hdd-prometheus--k8s--staging   50G  2.7G   44G   6% /srv/prometheus/k8s-staging
/dev/mapper/vg--ssd-prometheus--ops           493G  451G   19G  97% /srv/prometheus/ops
/dev/mapper/vg--ssd-prometheus--k8s            99G  8.2G   86G   9% /srv/prometheus/k8s
/dev/mapper/vg--hdd-prometheus--analytics      99G  9.8G   84G  11% /srv/prometheus/analytics
root@prometheus1004:~# lvextend -r --size +300G /dev/mapper/vg--ssd-prometheus--ops
  Size of logical volume vg-ssd/prometheus-ops changed from 500.00 GiB (128000 extents) to 800.00 GiB (204800 extents).
  Logical volume prometheus-ops successfully resized
resize2fs 1.42.12 (29-Aug-2014)
Filesystem at /dev/mapper/vg--ssd-prometheus--ops is mounted on /srv/prometheus/ops; on-line resizing required
old_desc_blocks = 32, new_desc_blocks = 50
The filesystem on /dev/mapper/vg--ssd-prometheus--ops is now 209715200 (4k) blocks long.

root@prometheus1004:~# df -h
Filesystem                                    Size  Used Avail Use% Mounted on
udev                                           10M     0   10M   0% /dev
tmpfs                                          19G  2.0G   17G  11% /run
/dev/dm-0                                      37G  3.7G   32G  11% /
tmpfs                                          48G     0   48G   0% /dev/shm
tmpfs                                         5.0M     0  5.0M   0% /run/lock
tmpfs                                          48G     0   48G   0% /sys/fs/cgroup
/dev/sda1                                      88M   38M   43M  47% /boot
/dev/mapper/vg--hdd-prometheus--global        296G  166G  115G  60% /srv/prometheus/global
/dev/mapper/vg--hdd-prometheus--services      197G   56G  142G  29% /srv/prometheus/services
/dev/mapper/vg--hdd-prometheus--k8s--staging   50G  2.7G   44G   6% /srv/prometheus/k8s-staging
/dev/mapper/vg--ssd-prometheus--ops           788G  451G  302G  60% /srv/prometheus/ops
/dev/mapper/vg--ssd-prometheus--k8s            99G  8.2G   86G   9% /srv/prometheus/k8s
/dev/mapper/vg--hdd-prometheus--analytics      99G  9.8G   84G  11% /srv/prometheus/analytics
root@prometheus1004:~#
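
For reference, the -r (--resizefs) flag above is what makes lvextend run the filesystem resize immediately after growing the logical volume; ext4 supports online growth, so no unmount was needed. The equivalent two-step form, as a sketch:

# Grow the LV by 300G, then grow the mounted ext4 filesystem to match;
# resize2fs without an explicit size expands the filesystem to fill the device.
lvextend --size +300G /dev/vg-ssd/prometheus-ops
resize2fs /dev/mapper/vg--ssd-prometheus--ops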

Can we attach a notes_url parameter to that alert so we know how to proceed in case this happens again? :)

We definitely should! I've documented the procedure at https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_host_running_out_of_space. However, I'm not sure we can set notes_url specifically for Prometheus hosts; in any case, the URL can point to a generic wikitech page about "out of disk space" troubleshooting.

Change 504658 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/base/prometheus: add notes_url to DISK space checks

https://gerrit.wikimedia.org/r/504658

Change 504658 merged by Dzahn:
[operations/puppet@production] icinga/base/prometheus: add notes_url to DISK space checks

https://gerrit.wikimedia.org/r/504658
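
For reference, notes_url is a standard directive on Icinga service objects, so attaching it per check is supported by Icinga itself. One way to verify it made it into the generated configuration on the Icinga host (the object file path below is an assumption, not the actual layout):

# Look for DISK service definitions that now carry a notes_url directive;
# /etc/icinga/objects is an assumed location for the generated config files.
grep -B5 'notes_url' /etc/icinga/objects/*.cfg | grep -i 'disk'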

The specific check this was about, disk space on prometheus1004, now has the Icinga link:

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=prometheus1004

And all other DISK checks on all hosts will also link to

https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space

Does this resolve the ticket or was there something else left to be done?

Good to resolve, thanks @Dzahn!

Dzahn claimed this task.