
Host page did not auto-resolve in VO
Closed, ResolvedPublic

Description

A recent example of this behaviour is below. The problem seems to be that incidents created from host alerts (as opposed to service alerts) don't self-resolve when the recovery email comes in from Icinga.

Incident 1227 involved cr3-eqsin being unpingable from alert1001 (link to Splunk On-Call).

The triggering email from the link above:

Critical: Host cr3-eqsin - PING - Packet loss = 100%
From: nagios@alert1001.wikimedia.org
Notification Type: PROBLEM
Host: cr3-eqsin
State: DOWN
Address: 103.102.166.131
Info: PING CRITICAL - Packet loss = 100%

Date/Time: Fri Jun 18 08:30:25 UTC 2021

Acknowledged by :

Alert Payload
Alert Data
Alert Fields
Splunk On-Call Fields
agent	m
alert_received_time_utc	2021-06-18T08:30:36Z
alert_received_week_time_utc	2021-W24-5T08:30:36Z
alert_type	CRITICAL
api_key	redacted
entity_display_name	Host cr3-eqsin - PING  - Packet loss = 100%
entity_id	Host cr3-eqsin - PING  - Packet loss = 100%
entity_is_host	false
entity_state	CRITICAL
message_type	CRITICAL
monitor_name	nagios@alert1001.wikimedia.org
monitoring_tool	Email
NOTIFICATIONTYPE	CRITICAL
routing_key	icinga
sender	nagios@alert1001.wikimedia.org
SERVICESTATE	CRITICAL
state_message	Notification Type: PROBLEM
Host: cr3-eqsin
State: DOWN
Address: 103.102.166.131
Info: PING CRITICAL - Packet loss = 100%

Date/Time: Fri Jun 18 08:30:25 UTC 2021

Acknowledged by :
state_start_time	1624005036995
subject	PROBLEM Host cr3-eqsin - PING CRITICAL - Packet loss = 100%
timestamp	1624005036995
VO_ALERT_RCV_TIME	1624005036995
VO_ALERT_TYPE	SERVICE
VO_MONITOR_TYPE	8
VO_ORGANIZATION_ID	wikimedia
VO_UUID	079ca093-a3ce-42bc-8a5a-6d090aaf88a0

The alert was then acknowledged by Riccardo and finally resolved by me.

Icinga sending the alert and recovery emails:

Jun 18 08:30:25 alert1001 icinga: HOST NOTIFICATION: victorops;cr3-eqsin;DOWN;vo-host-notify-by-email;PING CRITICAL - Packet loss = 100%
Jun 18 08:31:03 alert1001 icinga: HOST NOTIFICATION: victorops;cr3-eqsin;UP;vo-host-notify-by-email;PING OK - Packet loss = 0%, RTA = 238.16 ms
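The two log lines above appear to follow a `HOST NOTIFICATION: contact;host;state;command;output` layout. A minimal Python sketch, assuming that semicolon-separated layout and a syslog prefix without a year (the helper name `parse_host_notification` and the `year` parameter are hypothetical), shows how to pair the two notifications and measure the gap between them:

```python
import re
from datetime import datetime

# The two log lines quoted in this task.
LOG_LINES = [
    "Jun 18 08:30:25 alert1001 icinga: HOST NOTIFICATION: victorops;cr3-eqsin;DOWN;vo-host-notify-by-email;PING CRITICAL - Packet loss = 100%",
    "Jun 18 08:31:03 alert1001 icinga: HOST NOTIFICATION: victorops;cr3-eqsin;UP;vo-host-notify-by-email;PING OK - Packet loss = 0%, RTA = 238.16 ms",
]

# Assumed layout: "<syslog time> <host> icinga: HOST NOTIFICATION: contact;host;state;command;output"
PATTERN = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ icinga: HOST NOTIFICATION: "
    r"(?P<contact>[^;]*);(?P<host>[^;]*);(?P<state>[^;]*);(?P<command>[^;]*);(?P<output>.*)$"
)


def parse_host_notification(line, year=2021):
    """Split one log line into its fields; returns None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Syslog timestamps carry no year, so it has to be supplied by the caller.
    fields["time"] = datetime.strptime(f"{year} {fields.pop('ts')}", "%Y %b %d %H:%M:%S")
    return fields


down, up = (parse_host_notification(line) for line in LOG_LINES)
delay = (up["time"] - down["time"]).total_seconds()
print(down["host"], down["state"], "->", up["state"], f"after {delay:.0f}s")
```

Both notifications went to the same `victorops` contact through the same `vo-host-notify-by-email` command, so on the Icinga side the recovery was sent the same way as the problem.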

The recovery email looks something like the one below; I couldn't find it in the incident's timeline:

Delivered-To: fgiunchedi@wikimedia.org
Received: by 2002:a92:a812:0:0:0:0:0 with SMTP id o18csp1063253ilh;
        Fri, 18 Jun 2021 01:31:04 -0700 (PDT)
X-Google-Smtp-Source: ABdhPJyic1JRYuI+9yMpiaWzn7qltxK7leWWMdA9SZWdxCJ4wXhClSNlY/bx4Uhku7FZZniXiVJT
X-Received: by 2002:ac8:4b42:: with SMTP id e2mr9376616qts.210.1624005064605;
        Fri, 18 Jun 2021 01:31:04 -0700 (PDT)
Return-Path: <root@wikimedia.org>
Received: from mx1001.wikimedia.org (mx1001.wikimedia.org. [208.80.154.76])
        by mx.google.com with ESMTPS id j6si4555812qko.196.2021.06.18.01.31.04
        (version=TLS1_2 cipher=ECDHE-ECDSA-CHACHA20-POLY1305 bits=256/256);
        Fri, 18 Jun 2021 01:31:04 -0700 (PDT)
Received: from alert1001.wikimedia.org ([2620:0:861:3:208:80:154:88]:38128) by mx1001.wikimedia.org with esmtp (Exim 4.89) (envelope-from <root@wikimedia.org>) id 1lu9u3-0002vz-VC for alerts@wikimedia.org; Fri, 18 Jun 2021 08:31:03 +0000
Received: from nagios by alert1001.wikimedia.org with local (Exim 4.92) (envelope-from <root@wikimedia.org>) id 1lu9u3-0001iI-Ug for alerts@wikimedia.org; Fri, 18 Jun 2021 08:31:03 +0000
To: alerts@wikimedia.org
Subject: RECOVERY Host cr3-eqsin - PING OK - Packet loss = 0%, RTA = 238.16 ms
Reply-To: alerts@wikimedia.org
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Message-Id: <E1lu9u3-0001iI-Ug@alert1001.wikimedia.org>
From: nagios@alert1001.wikimedia.org
Date: Fri, 18 Jun 2021 08:31:03 +0000

Notification Type: RECOVERY
Host: cr3-eqsin
State: UP
Address: 103.102.166.131
Info: PING OK - Packet loss = 0%, RTA = 238.16 ms

Date/Time: Fri Jun 18 08:31:03 UTC 2021

Acknowledged by :

Older example

We got a host alert (labweb1002) that paged in VO, but the recovery email didn't auto-resolve the related incident (possibly similar to T263423: librenms page didn't auto-resolve in VO).

Alert details

Alert Data

is_vo_ack	1
TIMET	1601307976737
VictorOps Fields

ack_author	xionox
ack_msg	
agent	m
alert_received_time_utc	2020-09-28T15:41:00Z
alert_received_week_time_utc	2020-W40-1T15:41:00Z
alert_type	ACKNOWLEDGEMENT
api_key	redacted
entity_display_name	Host labweb1002 - PING  - Packet loss = 100%
entity_id	Host labweb1002 - PING  - Packet loss = 100%
entity_is_host	false
entity_state	CRITICAL
host_name	
INCIDENT_ID	511
message_type	ACKNOWLEDGEMENT
monitor_name	nagios@alert1001.wikimedia.org
monitoring_tool	Email
NOTIFICATIONTYPE	ACKNOWLEDGEMENT
routing_key	icinga
sender	nagios@alert1001.wikimedia.org
SERVICESTATE	CRITICAL
state_message	Notification Type: PROBLEM
Host: labweb1002
State: DOWN
Address: 208.80.155.109
Info: PING CRITICAL - Packet loss = 100%

Date/Time: Mon Sept 28 15:41:00 UTC 2020

Acknowledged by :
state_start_time	1601307976737
subject	PROBLEM Host labweb1002 - PING CRITICAL - Packet loss = 100%
timestamp	1601307976737
VO_ALERT_RCV_TIME	1601307976737
VO_ALERT_TYPE	SERVICE
VO_MONITOR_TYPE	8
VO_ORGANIZATION_ID	wikimedia
VO_UUID	9aa38dfc-8a82-4b8c-8530-bceba8d68ef3

Recovery details

Alert Data

Payload is empty

VictorOps Fields

ack_author	
ack_msg	
agent	m
alert_received_time_utc	2020-09-28T15:47:23Z
alert_received_week_time_utc	2020-W40-1T15:47:23Z
alert_type	RECOVERY
api_key	redacted
entity_display_name	Host labweb1002 - PING  - Packet loss = 0%, RTA = 0.22 ms
entity_id	Host labweb1002 - PING  - Packet loss = 0%, RTA = 0.22 ms
entity_is_host	false
entity_state	OK
host_name	
message_type	RECOVERY
monitor_name	nagios@alert1001.wikimedia.org
monitoring_tool	Email
NOTIFICATIONTYPE	RECOVERY
routing_key	icinga
sender	nagios@alert1001.wikimedia.org
SERVICESTATE	OK
state_message	Notification Type: RECOVERY
Host: labweb1002
State: UP
Address: 208.80.155.109
Info: PING OK - Packet loss = 0%, RTA = 0.22 ms

Date/Time: Mon Sept 28 15:47:22 UTC 2020

Acknowledged by :
state_start_time	1601308043663
subject	RECOVERY Host labweb1002 - PING OK - Packet loss = 0%, RTA = 0.22 ms
timestamp	1601308043663
VO_ALERT_RCV_TIME	1601308043663
VO_ALERT_TYPE	SERVICE
VO_MONITOR_TYPE	8
VO_ORGANIZATION_ID	wikimedia
VO_UUID	7e71bad7-d171-4f38-9edb-5adb5157c7a7
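
Comparing the alert and recovery payloads above, `entity_id` differs between them because it embeds the check output (`Packet loss = 100%` versus `Packet loss = 0%, RTA = 0.22 ms`). If VO correlates alerts into incidents by `entity_id` (an assumption here, not something confirmed in this task), that mismatch alone could keep the recovery from resolving the incident. A rough Python sketch of the comparison, using a hypothetical `stable_entity_id` helper that keeps only the host and check name:

```python
import re

# entity_id values copied from the two labweb1002 payloads in this task.
problem_entity_id = "Host labweb1002 - PING  - Packet loss = 100%"
recovery_entity_id = "Host labweb1002 - PING  - Packet loss = 0%, RTA = 0.22 ms"


def stable_entity_id(entity_id):
    """Hypothetical normalisation: keep 'Host <name> - <check>' and drop the check output."""
    match = re.match(r"^(Host \S+ - \S+)", entity_id)
    return match.group(1) if match else entity_id


# As received, the two identifiers do not match.
print(problem_entity_id == recovery_entity_id)
# With the variable check output stripped, they do.
print(stable_entity_id(problem_entity_id) == stable_entity_id(recovery_entity_id))
print(stable_entity_id(problem_entity_id))
```

This only illustrates the mismatch visible in the payloads; whether a fix belongs in the email subject template or in VO-side rules is not settled here.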

Event Timeline

fgiunchedi claimed this task.

I'm tentatively resolving this since I believe we haven't seen any new occurrences.