
when servers are about to run out of disk, monitoring should notify the owners
Open, Needs Triage, Public

Description

We often have monitoring alerts when servers are about to run out of disk.

One random example is T392834 for mwmaint. This task was created in response to T392834#10780146 but is not meant to be specific to that machine; it is about all disk space checks.

These alerts only show up on IRC and are often overlooked nowadays. The appropriate owners/subteams are not automatically informed about them.

Then either other people jump in to fix it because it's considered a UBN, or nobody does and it's only noticed once services actually go down with machines having 0 bytes free.

Once an immediate UBN has been fixed, there is usually a need for a follow-up to prevent it from happening again, which raises questions for the actual service owners.

A better approach would be for this alert to also create automatic tickets (as RAID checks already do, for example) or to send emails.

Ideally the owner team would be tagged automatically and/or the emails would go to the correct subteam, though this might be trickier as it would require the monitoring system to know which machine is using which role and to look up the role_contact in the puppet repo.
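To illustrate the lookup described above, here is a minimal sketch in Python. It is not how the monitoring system or the puppet repo actually exposes this data. The two mappings, the function name `owner_for_host`, and all host, role, and team names are hypothetical placeholders; in practice the role and its role_contact would have to be derived from puppet.

```python
from typing import Optional

# Hypothetical data for illustration only. In practice the host -> role
# assignment and the role_contact would come from the puppet repo.
HOST_ROLES = {
    "mwmaint-example": "example::maintenance",
    "db-example": "example::database",
}

# Hypothetical role -> role_contact mapping.
ROLE_CONTACTS = {
    "example::maintenance": "Example Team A",
    "example::database": "Example Team B",
}


def owner_for_host(hostname: str) -> Optional[str]:
    """Return the contact for the role applied to a host, or None if unknown."""
    role = HOST_ROLES.get(hostname)
    if role is None:
        # Host has no known role, so no owner can be determined.
        return None
    return ROLE_CONTACTS.get(role)
```

Returning `None` for an unknown host or role leaves the decision about a default owner to the caller.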

The lower-hanging fruit would be to just switch this disk space check to automatic tickets, since we already have that for other checks. Clinic duty or whoever can still tag them with the right team manually.

Event Timeline

Reedy renamed this task from "when servers are about to run out of disk monitoring should notify the owners" to "when servers are about to run out of disk, monitoring should notify the owners". May 21 2025, 11:01 PM

I am not trying to say the disk checks need to be migrated to a different system or anything.

The check_disk Nagios/Icinga plugin is fine. The fact that it alerts at 95% full is fine. I am not saying we need trends or that it should be based on graphs.
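For reference, the behaviour described here amounts to a plain threshold comparison. The following Python sketch is not the check_disk plugin and does not reproduce its options. Treating 95% used as the critical threshold is an assumption, and the function names are hypothetical. The exit codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
import shutil

# Conventional Nagios plugin exit codes.
OK = 0
CRITICAL = 2


def classify_usage(used_bytes: int, total_bytes: int,
                   critical_pct: float = 95.0) -> int:
    """Return CRITICAL when the used percentage reaches the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * used_bytes / total_bytes
    return CRITICAL if used_pct >= critical_pct else OK


def check_path(path: str = "/") -> int:
    """Classify the filesystem that contains the given path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify_usage(usage.used, usage.total)
```

Nothing in this part needs to change; only what happens after a CRITICAL result is in scope for this task.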

The only part I think needs fixing, in the context of this ticket at least, is the _way it tells us when the alert is triggered_.

"IRC-only" should be replaced with "ticket created in phab" (and/or emails).

And a bonus would be if the system could look up who the owner of the alerting server is, and then tag that ticket accordingly or use the most relevant team email alias.

But if that is too complex, then this can also be considered resolved if it simply creates SRE tickets and humans add more specific tags.
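The fallback described above could look roughly like this in Python. This is only a sketch of the tagging logic and does not call any real ticketing API. The function name `build_ticket`, the field names of the returned dictionary, and the choice to always keep the SRE tag alongside a more specific owner tag are all assumptions.

```python
from typing import Optional

# Default tag used when no more specific owner is known.
DEFAULT_TAG = "SRE"


def build_ticket(hostname: str, mountpoint: str, used_pct: float,
                 owner_tag: Optional[str] = None) -> dict:
    """Build a ticket description for a disk space alert.

    The ticket always carries the default tag. A more specific owner tag
    is added when one is known, otherwise humans can add it later.
    """
    tags = [DEFAULT_TAG]
    if owner_tag and owner_tag != DEFAULT_TAG:
        tags.append(owner_tag)
    return {
        "title": (f"Disk space alert: {hostname} {mountpoint} "
                  f"at {used_pct:.0f}% used"),
        "description": (f"The disk space check for {mountpoint} on "
                        f"{hostname} is alerting."),
        "tags": tags,
    }
```

With no owner known, `build_ticket("example-host", "/srv", 96.2)` yields a ticket tagged only with `SRE`, which matches the simpler outcome described above.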