Increased icinga check latency since 05/12
Closed, Resolved (Public)

Description

Looks like icinga check latency (max/avg) increased a few days ago and didn't go back down:

[Screenshot: 2020-12-07-102632_1189x350_scrot.png]

dashboard: https://grafana.wikimedia.org/d/rsCfQfuZz/icinga?orgId=1&from=1607100450020&to=1607332978365
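For reference, the `from` and `to` parameters in the dashboard link appear to be absolute timestamps in milliseconds since the Unix epoch. A minimal sketch for reading the window in UTC, assuming that encoding (the helper name `grafana_window` is made up for illustration, and relative ranges such as `now-6h` are not handled):

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

DASHBOARD_URL = (
    "https://grafana.wikimedia.org/d/rsCfQfuZz/icinga"
    "?orgId=1&from=1607100450020&to=1607332978365"
)


def grafana_window(url):
    """Return (start, end) of the URL's time range as UTC datetimes.

    Assumes `from` and `to` are absolute epoch-millisecond values.
    """
    params = parse_qs(urlparse(url).query)

    def to_utc(ms):
        # Integer division drops the millisecond part and avoids float rounding.
        return datetime.fromtimestamp(int(ms) // 1000, tz=timezone.utc)

    return to_utc(params["from"][0]), to_utc(params["to"][0])


start, end = grafana_window(DASHBOARD_URL)
print(start.isoformat())  # 2020-12-04T16:47:30+00:00
print(end.isoformat())    # 2020-12-07T09:22:58+00:00
```

So the linked view covers roughly 2020-12-04 16:47 UTC through 2020-12-07 09:22 UTC, which includes the 05/12 date in the task title.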

Event Timeline

lmata moved this task from Inbox to In progress on the observability board.

Mentioned in SAL (#wikimedia-operations) [2020-12-07T18:52:54Z] <herron> systemctl restart icinga on alert1001 T269560

Starting around 1400 UTC today, average check latency has been dropping steadily.

  1. Puppet changes prior to and around that time do not correlate with the symptoms (increasing check attempts on the maps cluster after 1400 UTC).
  2. The host overview of alert1001 shows a drop in system load and network utilization for the duration of the event, which appears to end around the restart at 1852 UTC today.

colewhite triaged this task as Medium priority.

Checking back in some time later, this does not appear to have occurred again.