
fix disk space check on dataset1001
Closed, Resolved, Public

Description

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=dataset1001&service=Disk+space
Ariel says this is not really critical and the host has enough disk space..
..but we should fix the check instead, so it does not become CRIT for this host.
Checking for the percentage of disk space left is what gets you this alert..
Also.. making a ticket just so we can ACK it in Icinga.

Details

Reference
rt7922

Event Timeline

rtimport raised the priority of this task from to Medium. Dec 18 2014, 1:57 AM
rtimport added a project: ops-core.
rtimport set Reference to rt7922.

On Fri Jul 18 16:29:57 2014, dzahn wrote:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=dataset1001&service=Disk+space

Ariel says this is not really critical and has enough disk space..

..but we should fix the check instead to not become CRIT for this host

checking for percentage of disk left gets you this..

also.. making a ticket just so we can ACK it in Icinga

what mount point ran out of disk space btw? I can't find any recent alerts for
that
dataset1001:~$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/sda1                        111G  7.7G   97G   8% /
udev                             7.9G  4.0K  7.9G   1% /dev
tmpfs                            3.2G  2.2M  3.2G   1% /run
none                             5.0M     0  5.0M   0% /run/lock
none                             7.9G     0  7.9G   0% /run/shm
/dev/mapper/vg0-lv0               37T   34T  3.1T  92% /data
labstore1003.eqiad.wmnet:/dumps   44T  8.8T   35T  20% /mnt/dumps

Status changed from 'new' to 'open' by RT_System

On Thu Sep 18 10:41:10 2014, fgiunchedi wrote:

what mount point ran out of disk space btw? I can't find any recent

Yea, true, it's a bit unfortunate that Icinga always forgets the history or we
could look it up now. I _think_ it was this:

/dev/mapper/vg0-lv0 37T 34T 3.1T 92% /data

It always triggered because we check for the percentage of space left, and this
volume is so large: 5% of 37T, for example, is still about 1.85T of free space.
The ticket was for adjusting that somehow, but I think you can close it for now.
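
To put numbers on the point above, here is a small sketch of how much free space a percentage threshold leaves on volumes of different sizes. The volume sizes come from the df output earlier in this ticket; the function name and the threshold percentages are made up for illustration and are not the values configured in Icinga.

```python
def free_at_threshold(size_tb: float, percent_free: float) -> float:
    """Free space in TB left on a volume when exactly percent_free of it is free."""
    return size_tb * percent_free / 100.0


# Volume sizes taken from the df output in this ticket (111G is about 0.11T).
volumes = {"/": 0.11, "/data": 37.0}

# Assumed, illustrative thresholds; not the values configured in Icinga.
thresholds = (6.0, 3.0)

for mount, size_tb in volumes.items():
    for pct in thresholds:
        free_tb = free_at_threshold(size_tb, pct)
        print(f"{mount}: {pct:.0f}% free = {free_tb:.3f}T still available")
```

The same percentage that means "almost out of space" on the root filesystem still leaves more than a terabyte on /data, which is why a percentage-only check keeps firing on this host.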

agreed, resolving, thanks Daniel!

Status changed from 'open' to 'resolved' by fgiunchedi

I keep bumping into dataset1001 alerts every time I look at icinga, so it looks like this isn't resolved; reopening.
Ariel has previously said that it's a known problem, probably referring to this ticket :) Ariel, can you clarify and/or have a look?

Status changed from 'resolved' to 'open' by RT_System

faidon raised the priority of this task from Medium to High. Jan 20 2015, 1:36 AM
faidon set Security to None.
faidon subscribed.

Ping?

It's because our default disk check tests the percentage of space left, and in the case of the dataset hosts, even a few percent is quite a bit of space:

for example, with 1.5T free the volume is still 97% full, which causes the check to trigger
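
One possible way to make a check size-aware, sketched here only as an illustration, is to require both a percentage threshold and an absolute free-space threshold to be undercut before alerting. The function name, the threshold values and the "both must be crossed" rule are all assumptions for this example; this is not necessarily how the production check or any patch for this task works.

```python
def disk_status(size_tb, free_tb, warn=(6.0, 1.0), crit=(3.0, 0.5)):
    """Classify free space on a volume.

    Each threshold is a (percent free, TB free) pair; a level fires only
    when BOTH values are undercut. All thresholds here are hypothetical.
    """
    free_pct = 100.0 * free_tb / size_tb
    # Test the more severe level first so CRITICAL wins over WARNING.
    for label, (pct, tb) in (("CRITICAL", crit), ("WARNING", warn)):
        if free_pct < pct and free_tb < tb:
            return label
    return "OK"


# A 37T volume with 1.5T free is under the percentage limit but still
# above the absolute limit, so this sketch does not alert on it.
print(disk_status(37.0, 1.5))
# The same volume with only 0.4T free undercuts both limits.
print(disk_status(37.0, 0.4))
```

With a rule like this, small filesystems still alert on percentage as before, because their absolute free space is always below the TB limits.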

https://gerrit.wikimedia.org/r/#/c/193834/ for this bug (would allow other custom checks for e.g. mariadb as well)

https://gerrit.wikimedia.org/r/#/c/193834/ got merged 18 months ago and this task has not seen any updates for 30 months. Still valid? Or resolved?

Dzahn claimed this task.
Dzahn lowered the priority of this task from High to Low.
Dzahn removed a project: Patch-For-Review.
Dzahn changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".
Dzahn changed the edit policy from "WMF-NDA (Project)" to "All Users".
Dzahn added a subscriber: ArielGlenn.

@Aklapper thanks, resolved :) (and made public, NDA was just because this was an RT import, yea, that old)