
Investigate why Cloud Services was not paged when the tools filesystem on NFS went critical
Closed, ResolvedPublic

Description

In the service history for the disk on labstore1004, this alert is mentioned:

[2019-02-25 01:31:06] SERVICE ALERT: labstore1004;Disk space;CRITICAL;HARD;3;DISK CRITICAL - free space: /srv/tools 473469 MB (5% inode=79%):

The filesystem recovered after @GTirloni learned of the problem from other teams and IRC and fixed things.

[2019-02-25 10:28:40] SERVICE ALERT: labstore1004;Disk space;OK;HARD;3;DISK OK

These things should page us. This task is to track down why this did not happen.
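For reference, the Icinga log lines quoted above are semicolon-delimited: host, service, state, state type, attempt count, then the plugin output. A small sketch of parsing that format (the field names are mine; the layout is taken from the lines above):

```python
import re

# Parse an Icinga service-alert log line of the form:
# [timestamp] SERVICE ALERT: host;service;state;state_type;attempt;plugin output
ALERT_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)"
)

def parse_alert(line: str) -> dict:
    m = ALERT_RE.match(line)
    if not m:
        raise ValueError("not a service alert line")
    return m.groupdict()

alert = parse_alert(
    "[2019-02-25 01:31:06] SERVICE ALERT: labstore1004;Disk space;"
    "CRITICAL;HARD;3;DISK CRITICAL - free space: /srv/tools 473469 MB (5% inode=79%):"
)
print(alert["host"], alert["state"])  # → labstore1004 CRITICAL
```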

Event Timeline

Bstorm triaged this task as High priority.Feb 25 2019, 6:05 PM
Bstorm created this task.
Restricted Application added a subscriber: Aklapper. Feb 25 2019, 6:05 PM

Change 492761 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] labstores: set check_disk_critical: true and profile::base::notifications: critical

https://gerrit.wikimedia.org/r/492761
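The commit message names two Hiera keys, so the patch presumably sets something like the following for the labstore hosts (the keys come from the commit message; the file placement and surrounding structure are assumptions):

```yaml
# Hiera data for the labstore hosts (exact hieradata path is an assumption)
check_disk_critical: true
profile::base::notifications: critical
```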

Change 492761 merged by Andrew Bogott:
[operations/puppet@production] labstores: make failures on these hosts page more

https://gerrit.wikimedia.org/r/492761

Change 493083 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumpsdistribution: Make pages go off for disk space

https://gerrit.wikimedia.org/r/493083

Change 493083 merged by Bstorm:
[operations/puppet@production] dumpsdistribution: Make pages go off for disk space

https://gerrit.wikimedia.org/r/493083

Change 493271 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumpsdistribution: Make pages go off for disk space

https://gerrit.wikimedia.org/r/493271

Change 493271 merged by Bstorm:
[operations/puppet@production] dumpsdistribution: Make pages go off for disk space

https://gerrit.wikimedia.org/r/493271

While we're talking about monitoring/alerting when dump file usage gets too large: should there be some sort of email notification if /srv/dumps on labstore1006,7 gets 'too large'? I don't mean waiting until space is almost gone, but a notification/Phab task when it's time to adjust the number of dumps kept. Or is there already something set up?

Bstorm added a comment.EditedFeb 27 2019, 9:17 PM

We had an NFS server's filesystem fill, and it didn't page. The effort here is to ensure that is not true :) We want a full NFS filesystem to page.

At this point, we've had good evidence that pages are working on the "secondary" cluster. I also checked the resulting Icinga configs, and they confirm that the dumps servers have notifications enabled as well (see below).

I'm going to subtask @ArielGlenn's idea: check whether anything like that already exists and do kind of a deep dive there, since that's going to involve checking what all of this thinks "critical" means, and whether there's a Good Way to implement a cleanup notification/message or we just have to do a timer/cron to email us. That would be useful on tools NFS as well as dumps. My intuitive memory of the alert in Icinga is that "critical" is quite far from nearly full, but I'd rather dig in and be sure.
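The "timer/cron to email us" idea could be as simple as a script run from a systemd timer that checks disk usage against a soft limit well below the paging threshold. A hedged sketch (the mount point, the 75% limit, and the delivery mechanism are all assumptions, not anything decided in this task):

```python
# Soft-limit disk usage check: warn while there is still time to prune
# old dumps, long before Icinga's critical threshold would page anyone.
import shutil
from typing import Optional

MOUNT = "/srv/dumps"   # assumed mount point on labstore1006/7
SOFT_LIMIT = 0.75      # assumed soft limit: warn at 75% used

def usage_fraction(path: str) -> float:
    """Fraction of the filesystem at `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return used / total

def check(path: str = MOUNT, limit: float = SOFT_LIMIT) -> Optional[str]:
    """Return a warning message if usage exceeds the soft limit, else None."""
    frac = usage_fraction(path)
    if frac >= limit:
        return f"{path} is {frac:.0%} full; time to adjust the number of dumps kept"
    return None
```

A systemd timer (or cron with MAILTO) could run this and mail any non-empty output to the team.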

define service {
# --PUPPET_NAME-- labstore1006 disk_space
        active_checks_enabled          1
        check_command                  nrpe_check!check_disk_space!10
        check_freshness                0
        check_interval                 1
        check_period                   24x7
        contact_groups                 admins
        host_name                      labstore1006
        is_volatile                    0
        max_check_attempts             3
        notification_interval          0
        notification_options           c,r,f
        notification_period            24x7
        notifications_enabled          1
        passive_checks_enabled         1
        retry_interval                 1
        service_description            Disk space
        servicegroups                  wmcs_eqiad

}
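For context, `check_command nrpe_check!check_disk_space!10` means Icinga invokes a host-local `check_disk_space` command via NRPE, so the actual warning/critical thresholds live in the NRPE config on labstore1006 rather than in the service definition above. A sketch of what that entry plausibly looks like (the path and exact percentages are assumptions; the "5% free" in the alert above hints at the critical threshold):

```
# /etc/nagios/nrpe.d/check_disk_space.cfg (path and thresholds are assumptions)
command[check_disk_space]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -l
```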
Bstorm closed this task as Resolved.Mar 21 2019, 6:26 PM
Bstorm claimed this task.