Page MenuHomePhabricator

CA App Synthetic Monitor Mail (SMTP): Connection timed out; connect(): -2
Closed, ResolvedPublic

Description

CA App Synthetic Monitor is sending occasional email alerts like the one below, with recoveries happening just a few minutes after.

According to monitor 'Mail (SMTP)', the smtp service on 'mx1001.wikimedia.org'
has not been working as specified since 2019-12-16 11:27:16.

Message: Connection timed out; connect(): -2

Related Objects

Event Timeline

herron triaged this task as Medium priority.Dec 16 2019, 9:52 PM
herron created this task.

Looked into these alerts a bit, and pulled the source IP addresses for these checks from watchmouse, but I don't see these IPs appearing in the mx logs. I think it is because the exim mx logs are not currently detailed enough. So I'll make the logs a bit more verbose and review again after more log information has been gathered.

Change 558228 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] mx: log smtp_connection details

https://gerrit.wikimedia.org/r/558228

Change 558228 merged by Herron:
[operations/puppet@production] mx: log smtp_connection details

https://gerrit.wikimedia.org/r/558228

Mentioned in SAL (#wikimedia-operations) [2019-12-16T22:07:01Z] <herron> increasing mx exim log verbosity by adding smtp_connection to log_selector list T240906

There has been multiple of mx1001 issues lately (even if that is unreliable, it is worth noting). My suggestion would be, at least initially, to detect the same issue, if real, on icinga.

Availability in the last month, according to the emails, has been of 78%. List of reports:

Nov 03
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-11-03 11:57:27.
Nov 12
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-11-12 10:17:17.
Nov 13
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-11-13 17:17:16.
Nov 18
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-11-18 09:42:15.
Nov 25
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-11-25 09:37:17.
Dec 4
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-04 18:27:15.
Dec 5
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-05 10:17:16.
Dec 14
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-14 09:47:16.
Dec 16
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-16 11:27:16.
Dec 21
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-21 20:12:16.
Dec 22
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-22 17:07:16.
Dec 23
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-23 02:37:16.
Dec 24
CA App Synthetic Monitor - No Reply has not been working as specified since 2019-12-24 05:52:16.

Recoveries:

Nov 3
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-11-03 12:26:28.
Nov 12
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-11-12 10:41:20.
Nov 13
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-11-13 17:46:24.
Nov 18
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-11-18 10:16:20.
Nov 25
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-11-25 10:06:15.
Dec 4
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-04 18:56:16.
Dec 5
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-05 10:41:16.
Dec 14
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-14 10:11:16.
Dec 16
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-16 11:51:16.
Dec 21
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-21 20:36:47.
Dec 22
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-22 17:36:16.
Dec 23
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-23 03:01:24.
Dec 24
CA App Synthetic Monitor - No Reply has been working again as specified since 2019-12-24 06:16:16.
jcrespo raised the priority of this task from Medium to High.Dec 24 2019, 10:05 AM

I am going to mark this as high, as we have now daily alerts, assuming those are real.

There has been multiple of mx1001 issues lately (even if that is unreliable, it is worth noting). My suggestion would be, at least initially, to detect the same issue, if real, on icinga.

From icinga's perspective mx1001 SMTP service has been in OK state for 288 days (Last State Change: 2019-03-21 09:16:57)

Looking at the exim logs again, with the additional verbosity from T240906#5746152 this seems strange. Here are the details of a recent alert:

Description:	110: Connection timed out; connect(): -2
Checkpoint:	Brazil - Sao Paulo
Time:	January 02 2020 20:56:15 GMT

According to https://asm.ca.com/en/products/monitoring-stations the "Brazil - Sao Paulo" monitor has IP 35.198.36.207. However this yields nothing when searching for this IP in all of the exim logs still on the host:

zgrep 35.198.36.207 /var/log/exim4/mainlog*
(no results)

Meanwhile, there are various other watchmouse IP addresses in the logs. e.g.

grep watchmouse.com /var/log/exim4/mainlog
2020-01-03 06:26:19 SMTP connection from (watchmouse.com) [45.117.79.197]:46774 I=[208.80.154.76]:25 closed by QUIT
2020-01-03 06:36:45 SMTP connection from (watchmouse.com) [13.88.225.57]:60944 I=[208.80.154.76]:25 closed by QUIT
2020-01-03 06:51:47 SMTP connection from (watchmouse.com) [52.187.230.102]:37836 I=[208.80.154.76]:25 closed by QUIT
2020-01-03 06:56:46 SMTP connection from rumow02.watchmouse.net (watchmouse.com) [31.131.253.230]:46708 I=[208.80.154.76]:25 closed by QUIT
2020-01-03 07:01:16 SMTP connection from 1e.46.33a9.ip4.static.sl-reverse.com (watchmouse.com) [169.51.70.30]:52714 I=[208.80.154.76]:25 closed by QUIT
2020-01-03 07:16:45 SMTP connection from (watchmouse.com) [13.88.225.57]:46692 I=[208.80.154.76]:25 closed by QUIT

This is only one example, but I'm seeing the same pattern for each of the alerts I've checked out. Seems that the monitoring traffic is not making it to mx1001.

Fwiw the ganeti host mx1001 resides on is throwing constant temperature and cpu throttle down warnings, and mx1001 is the only VM running on this ganeti host using the public vlan. I'd like to rule out possible hardware issues by migrating this VM to another Ganeti host, and seeing if that makes any improvement. I'll hold off on doing that since it's getting late on a Friday and this is the primary production MX. Will give that a try next week.

Mentioned in SAL (#wikimedia-operations) [2020-01-06T15:47:53Z] <herron> migrating mx1001 to seconday ganeti node T240906

I'd like to rule out possible hardware issues by migrating this VM to another Ganeti host, and seeing if that makes any improvement.

Nope, an alert happened again since migrating to a different Ganeti host, so we can rule this possibility out.

  • Is it always the same source Watchmouse probe failing or "random" ones?
  • What does the check do exactly? (TCP, more L7 checks?)
  • Is the check configured to retry or email on the first issue?
  • As discussed during the meeting, reaching out to Watchmouse' support is probably the best next step if we ruled out all issues on our side
  • A packet capture of a working vs. non working probing would be useful but tricky to get

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.

@lmata is it still valid issue?

it shouldn't be, watchmouse has been sunset, see T299147: Retire WatchMouse (CA DX APP), resolving