
Two close pages for idle workers api + appserver didn't auto-resolve on recovery
Open, Medium, Public

Description

During the last switchover (Oct 27th), two pages were sent out close together, and each triggered a distinct incident in VO as expected. However, on recovery only one of the two incidents was automatically resolved by VO.

The emails and incidents are the following:

appserver (auto resolved)

Message to alerts@:

Date: Tue, 27 Oct 2020 14:06:37 +0000
Subject: ** PROBLEM alert - alert1001/Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page is CRITICAL **
Received: from nagios by alert1001.wikimedia.org with local (Exim 4.92) (envelope-from <root@wikimedia.org>) id 1kXPcT-00016D-Uz for alerts@wikimedia.org; Tue, 27 Oct 2020 14:06:37 +0000

Triggered incident 573: https://portal.victorops.com/ui/wikimedia/incident/573/details

That incident was automatically resolved by this message:

Date: Tue, 27 Oct 2020 14:08:25 +0000
Subject: ** RECOVERY alert - alert1001/Not enough idle PHP-FPM workers for Mediawiki appserver at eqiad #page is OK **
Received: from nagios by alert1001.wikimedia.org with local (Exim 4.92) (envelope-from <root@wikimedia.org>) id 1kXPeD-0005iQ-RA for alerts@wikimedia.org; Tue, 27 Oct 2020 14:08:25 +0000

api-appserver (did not auto resolve)

Similar message to alerts@:

Date: Tue, 27 Oct 2020 14:06:39 +0000
Subject: ** PROBLEM alert - alert1001/Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page is CRITICAL **
Received: from alert1001.wikimedia.org ([2620:0:861:3:208:80:154:88]:39792) by mx1001.wikimedia.org with esmtp (Exim 4.89) (envelope-from <root@wikimedia.org>) id 1kXPcV-0000se-JL for alerts@wikimedia.org; Tue, 27 Oct 2020 14:06:39 +0000

Triggered incident 574 instead: https://portal.victorops.com/ui/wikimedia/incident/574/details

However, the following message didn't auto-resolve the incident (nor could I find the email in 574's timeline):

Date: Tue, 27 Oct 2020 14:08:27 +0000
Subject: ** RECOVERY alert - alert1001/Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page is OK **
Received: from alert1001.wikimedia.org ([2620:0:861:3:208:80:154:88]:51878) by mx1001.wikimedia.org with esmtp (Exim 4.89) (envelope-from <root@wikimedia.org>) id 1kXPeF-0001IO-T1 for fgiunchedi@wikimedia.org; Tue, 27 Oct 2020 14:08:27 +0000
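
To make the four messages above easier to compare side by side, here is a minimal Python sketch that tabulates, per check, which address each quoted copy was delivered to according to its Received header. The Subject and Received strings are copied from this task. The names `MESSAGES`, `SUBJECT_RE`, `RECIPIENT_RE` and `recipients_by_check` are hypothetical helpers made up for illustration and are not part of any existing tooling. It only summarizes the quoted headers and makes no claim about what VO actually received.

```python
import re

# (Subject, Received) pairs copied from the four messages quoted in this task.
MESSAGES = [
    ("** PROBLEM alert - alert1001/Not enough idle PHP-FPM workers for "
     "Mediawiki appserver at eqiad #page is CRITICAL **",
     "from nagios by alert1001.wikimedia.org with local (Exim 4.92) "
     "(envelope-from <root@wikimedia.org>) id 1kXPcT-00016D-Uz "
     "for alerts@wikimedia.org; Tue, 27 Oct 2020 14:06:37 +0000"),
    ("** RECOVERY alert - alert1001/Not enough idle PHP-FPM workers for "
     "Mediawiki appserver at eqiad #page is OK **",
     "from nagios by alert1001.wikimedia.org with local (Exim 4.92) "
     "(envelope-from <root@wikimedia.org>) id 1kXPeD-0005iQ-RA "
     "for alerts@wikimedia.org; Tue, 27 Oct 2020 14:08:25 +0000"),
    ("** PROBLEM alert - alert1001/Not enough idle PHP-FPM workers for "
     "Mediawiki api_appserver at eqiad #page is CRITICAL **",
     "from alert1001.wikimedia.org ([2620:0:861:3:208:80:154:88]:39792) "
     "by mx1001.wikimedia.org with esmtp (Exim 4.89) "
     "(envelope-from <root@wikimedia.org>) id 1kXPcV-0000se-JL "
     "for alerts@wikimedia.org; Tue, 27 Oct 2020 14:06:39 +0000"),
    ("** RECOVERY alert - alert1001/Not enough idle PHP-FPM workers for "
     "Mediawiki api_appserver at eqiad #page is OK **",
     "from alert1001.wikimedia.org ([2620:0:861:3:208:80:154:88]:51878) "
     "by mx1001.wikimedia.org with esmtp (Exim 4.89) "
     "(envelope-from <root@wikimedia.org>) id 1kXPeF-0001IO-T1 "
     "for fgiunchedi@wikimedia.org; Tue, 27 Oct 2020 14:08:27 +0000"),
]

# Captures the alert kind and the check name from the Subject line.
SUBJECT_RE = re.compile(r"\*\* (PROBLEM|RECOVERY) alert - (.+) is \w+ \*\*")
# Captures the address in the "for <address>;" clause of a Received header.
RECIPIENT_RE = re.compile(r"\bfor (\S+);")


def recipients_by_check(messages):
    """Map each check name to {"PROBLEM": recipient, "RECOVERY": recipient}."""
    table = {}
    for subject, received in messages:
        subject_match = SUBJECT_RE.search(subject)
        recipient_match = RECIPIENT_RE.search(received)
        if subject_match is None or recipient_match is None:
            continue  # skip anything that does not look like an alert mail
        kind, check = subject_match.groups()
        table.setdefault(check, {})[kind] = recipient_match.group(1)
    return table


if __name__ == "__main__":
    for check, kinds in recipients_by_check(MESSAGES).items():
        print(check)
        for kind in ("PROBLEM", "RECOVERY"):
            print(f"  {kind:8} -> {kinds.get(kind)}")
```

For the headers quoted here, the api_appserver RECOVERY copy is the only one whose Received line names a personal address rather than alerts@. That only reflects which copy was pasted into this task and does not by itself show whether a copy also reached VO.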