
Monitor failing ferm restarts / availability of ferm service
Closed, Resolved (Public)

Description

We should create an Icinga check to detect failing ferm restarts after a puppet run:

The "notrack" ferm rules in puppet were broken (they were added to the wrong table). When the change was applied to helium, the breakage was noticed because the pool counter stopped working (caused by an overflowing connection tracking table). On the poolcounters running in codfw, however, the broken change had already caused a failed ferm restart after an earlier puppet run (e.g. as logged in syslog on Aug 4 10:24:57).

While the notrack errors were introduced during the initial ferm setup of a host, such errors can also occur in day-to-day operation, e.g. if resolve() fails (we had that with the list of snapshot dump mirrors before).
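A rough sketch of what such a check could look like, written as a Nagios/Icinga-style plugin that maps the state of the ferm unit to a plugin exit code. This is only an illustration under assumptions: it assumes ferm runs as a systemd unit named `ferm` and that a failed restart leaves the unit in the `failed` or `inactive` state. The names `classify` and `run_check` are made up for this example and do not describe the check that was eventually deployed.

```python
import subprocess

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(unit_state):
    """Map a systemd unit state string to (exit_code, message)."""
    state = unit_state.strip()
    if state == "active":
        return OK, "OK: ferm is active"
    if state in ("failed", "inactive"):
        return CRITICAL, "CRITICAL: ferm is %s" % state
    return UNKNOWN, "UNKNOWN: unexpected ferm state %r" % state


def run_check(unit="ferm"):
    """Ask systemd for the unit state (unit name is an assumption) and classify it."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
        )
    except OSError as exc:
        # systemctl missing or not executable: report UNKNOWN, not CRITICAL.
        return UNKNOWN, "UNKNOWN: cannot run systemctl: %s" % exc
    return classify(result.stdout)
```

A wrapper script would print the message returned by `run_check()` and exit with the returned code, which is how Icinga reads the result. Checking only the unit state would not catch a ruleset that loaded but differs from what puppet intended, so a production check would probably need to compare the active rules as well.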

Event Timeline

MoritzMuehlenhoff raised the priority of this task from to Needs Triage.
MoritzMuehlenhoff updated the task description.
MoritzMuehlenhoff triaged this task as Medium priority.
MoritzMuehlenhoff set Security to None.
MoritzMuehlenhoff renamed this task from Monitor failing ferm restarts to Monitor failing ferm restarts / availability of ferm service. Oct 5 2016, 3:00 PM

We now have an Icinga check for ferm.