Page MenuHomePhabricator

CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2
Open, HighPublic

Description

CA App Synthetic Monitor is sending occasional email alerts like the one below, with recoveries happening just a few minutes after.

According to monitor 'Mail (SMTP)', the smtp service on 'mx1001.wikimedia.org'
has not been working as specified since 2019-12-16 11:27:16.

Message: Connection timed out; connect(): -2

Details

Related Gerrit Patches:
operations/puppet : productionmx: log smtp_connection details

Event Timeline

herron triaged this task as Medium priority.Dec 16 2019, 9:52 PM
herron created this task.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 16 2019, 9:52 PM

Looked into these alerts a bit, and pulled the source IP addresses for these checks from watchmouse, but I don't see these IPs appearing in the mx logs. I think it is because the exim mx logs are not currently detailed enough. So I'll make the logs a bit more verbose and review again after more log information has been gathered.

Change 558228 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] mx: log smtp_connection details

https://gerrit.wikimedia.org/r/558228

Change 558228 merged by Herron:
[operations/puppet@production] mx: log smtp_connection details

https://gerrit.wikimedia.org/r/558228

Mentioned in SAL (#wikimedia-operations) [2019-12-16T22:07:01Z] <herron> increasing mx exim log verbosity by adding smtp_connection to log_selector list T240906

jcrespo added a subscriber: jcrespo.EditedTue, Dec 24, 10:00 AM

There has been multiple of mx1001 issues lately (even if that is unreliable, it is worth noting). My suggestion would be, at least initially, to detect the same issue, if real, on icinga.

Availability in the last month, according to the emails, has been of 78%. List of reports:

Nov 03
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-11-03 11:57:27.
Nov 12
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-11-12 10:17:17.
Nov 13
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-11-13 17:17:16.
Nov 18
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-11-18 09:42:15.
Nov 25
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-11-25 09:37:17.
Dec 4
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-04 18:27:15.
Dec 5
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-05 10:17:16.
Dec 14
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-14 09:47:16.
Dec 16
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-16 11:27:16.
Dec 21
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-21 20:12:16.
Dec 22
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-22 17:07:16.
Dec 23
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-23 02:37:16.
Dec 24
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-24 05:52:16.

Recoveries:

Nov 3
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-11-03 12:26:28.
Nov 12
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-11-12 10:41:20.
Nov 13
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-11-13 17:46:24.
Nov 18
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-11-18 10:16:20.
Nov 25
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-11-25 10:06:15.
Dec 4
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-04 18:56:16.
Dec 5
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-05 10:41:16.
Dec 14
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-14 10:11:16.
Dec 16
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-16 11:51:16.
Dec 21
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-21 20:36:47.
Dec 22
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-22 17:36:16.
Dec 23
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-23 03:01:24.
Dec 24
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-24 06:16:16.
jcrespo raised the priority of this task from Medium to High.Tue, Dec 24, 10:05 AM

I am going to mark this as high, as we have now daily alerts, assuming those are real.

There has been multiple of mx1001 issues lately (even if that is unreliable, it is worth noting). My suggestion would be, at least initially, to detect the same issue, if real, on icinga.

From icinga's perspective mx1001 SMTP service has been in OK state for 288 days (Last State Change: 2019-03-21 09:16:57)

Looking at the exim logs again, with the additional verbosity from T240906#5746152 this seems strange. Here are the details of a recent alert:

Description:	110: Connection timed out; connect(): -2
Checkpoint:	Brazil - Sao Paulo
Time:	January 02 2020 20:56:15 GMT

According to https://asm.ca.com/en/products/monitoring-stations the "Brazil - Sao Paulo" monitor has IP 35.198.36.207. However this yields nothing when searching for this IP in all of the exim logs still on the host:

zgrep 35.198.36.207 /var/log/exim4/mainlog*
(no results)

Meanwhile, there are various other watchmouse IP addresses in the logs. e.g.

grep watchmouse.com /var/log/exim4/mainlog
2020-01-03 06:26:19 SMTP connection from (watchmouse.com) [45.117.79.197]:46774 I=[208.80.154.76]:25 closed by QUIT
2020-01-03 06:36:45 SMTP connection from (watchmouse.com) [13.88.225.57]:60944 I=[208.80.154.76]:25 closed by QUIT
2020-01-03 06:51:47 SMTP connection from (watchmouse.com) [52.187.230.102]:37836 I=[208.80.154.76]:25 closed by QUIT
2020-01-03 06:56:46 SMTP connection from rumow02.watchmouse.net (watchmouse.com) [31.131.253.230]:46708 I=[208.80.154.76]:25 closed by QUIT
2020-01-03 07:01:16 SMTP connection from 1e.46.33a9.ip4.static.sl-reverse.com (watchmouse.com) [169.51.70.30]:52714 I=[208.80.154.76]:25 closed by QUIT
2020-01-03 07:16:45 SMTP connection from (watchmouse.com) [13.88.225.57]:46692 I=[208.80.154.76]:25 closed by QUIT

This is only one example, but I'm seeing the same pattern for each of the alerts I've checked out. Seems that the monitoring traffic is not making it to mx1001.

Fwiw the ganeti host mx1001 resides on is throwing constant temperature and cpu throttle down warnings, and mx1001 is the only VM running on this ganeti host using the public vlan. I'd like to rule out possible hardware issues by migrating this VM to another Ganeti host, and seeing if that makes any improvement. I'll hold off on doing that since it's getting late on a Friday and this is the primary production MX. Will give that a try next week.

Mentioned in SAL (#wikimedia-operations) [2020-01-06T15:47:53Z] <herron> migrating mx1001 to seconday ganeti node T240906

herron added a comment.Wed, Jan 8, 4:17 PM

I'd like to rule out possible hardware issues by migrating this VM to another Ganeti host, and seeing if that makes any improvement.

Nope, an alert happened again since migrating to a different Ganeti host, so we can rule this possibility out.

ayounsi added a subscriber: ayounsi.Thu, Jan 9, 5:38 AM
  • Is it always the same source Watchmouse probe failing or "random" ones?
  • What does the check do exactly? (TCP, more L7 checks?)
  • Is the check configured to retry or email on the first issue?
  • As discussed during the meeting, reaching out to Watchmouse' support is probably the best next step if we ruled out all issues on our side
  • A packet capture of a working vs. non working probing would be useful but tricky to get