Monitor failing ferm restarts / availability of ferm service
Closed, Resolved, Public

Assigned To

Authored By

	• MoritzMuehlenhoff
	Aug 7 2015, 10:16 AM

Description

We should create an Icinga check to detect failing ferm restarts after a puppet run:

The "notrack" ferm rules in puppet were broken (they were added to the wrong table). When the change was applied to helium, the breakage was noticed because the pool counter stopped working (caused by an overflowing connection table). But on the poolcounters running in codfw, the broken change had already caused a failed ferm restart after an earlier puppet run (e.g. logged in syslog on Aug 4 10:24:57).

While the notrack errors were introduced during the initial ferm setup of a host, such errors can also occur in day-to-day operation, e.g. if resolve() fails (we had that with the list of snapshot dump mirrors before).
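As a rough illustration of the kind of check proposed above, here is a minimal sketch of an Icinga/Nagios-style plugin in Python. It is not the check that was eventually deployed. It assumes a systemd host where a failed ferm restart leaves the `ferm` unit in a non-active state, so that `systemctl is-active ferm` exits non-zero; whether that holds depends on how ferm is packaged on the host. The function names `evaluate` and `check_ferm` are hypothetical.

```python
import subprocess

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def evaluate(returncode, output):
    """Map a status command's exit code and output to (exit_code, message)."""
    state = output.strip() or "unknown"
    if returncode == 0:
        return OK, "OK: ferm is " + state
    return CRITICAL, "CRITICAL: ferm is " + state


def check_ferm(command=("systemctl", "is-active", "ferm")):
    """Run the status command (assumed, see lead-in) and evaluate the result."""
    try:
        proc = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        # The status command itself could not be executed.
        return UNKNOWN, "UNKNOWN: could not run {}: {}".format(command[0], exc)
    return evaluate(proc.returncode, proc.stdout)
```

A wrapper script would print the returned message and exit with the returned code, which is how Icinga picks up the plugin result.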

Event Timeline

• MoritzMuehlenhoff created this task. Aug 7 2015, 10:16 AM

• MoritzMuehlenhoff raised the priority of this task to Needs Triage.

• MoritzMuehlenhoff updated the task description.

• MoritzMuehlenhoff added a project: acl*sre-team.

• MoritzMuehlenhoff subscribed.

Restricted Application added subscribers: Matanya, Aklapper. Aug 7 2015, 10:16 AM

• MoritzMuehlenhoff claimed this task. Aug 18 2015, 9:31 AM

• MoritzMuehlenhoff triaged this task as Medium priority.

• MoritzMuehlenhoff set Security to None.

• MoritzMuehlenhoff renamed this task from "Monitor failing ferm restarts" to "Monitor failing ferm restarts / availability of ferm service". Oct 5 2016, 3:00 PM

We now have an Icinga check for ferm.
