Phabricator outbound email seems to have a SPOF of mx1001
Closed, Resolved · Public

Description

Last week, when Exim on mx1001 was stopped in preparation for an OS upgrade, Phabricator stopped sending outbound email.

Mail was queued and delivered after Exim on mx1001 was restarted, which is good, but failover did not work. So it seems we have a SPOF to address.

Event Timeline

herron created this task. · Jun 11 2018, 3:50 PM
Restricted Application added a subscriber: Aklapper. · Jun 11 2018, 3:50 PM

What does the phabricator outbound mail config look like today?

Do we already have both mx1001 and mx2001 configured as outbound mail servers?

https://phabricator.wikimedia.org/config/all lists the value mx1001.wikimedia.org;mx2001.wikimedia.org for phpmailer.smtp-host (but I don't know since when).
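
To make the expected behaviour concrete: with a semicolon-separated host list like the one above, a client would normally try each host in order and use the first one that accepts a connection. The following is a minimal Python sketch of that idea only. It is not PHPMailer's or Phabricator's actual code, and `parse_smtp_hosts`, `first_reachable` and the injectable `connect` callable are hypothetical names introduced for illustration.

```python
import socket
from typing import Callable, List, Optional, Tuple

HostPort = Tuple[str, int]


def parse_smtp_hosts(value: str, default_port: int = 25) -> List[HostPort]:
    """Split a semicolon-separated host list into (host, port) pairs."""
    hosts: List[HostPort] = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if sep and port.isdigit():
            # Entry carried an explicit port, e.g. "mail.example.org:587".
            hosts.append((host, int(port)))
        else:
            hosts.append((entry, default_port))
    return hosts


def tcp_connect(host: str, port: int, timeout: float = 5.0) -> None:
    """Open and immediately close a TCP connection; raises OSError on failure."""
    with socket.create_connection((host, port), timeout=timeout):
        pass


def first_reachable(
    hosts: List[HostPort],
    connect: Callable[[str, int], None] = tcp_connect,
) -> Optional[HostPort]:
    """Return the first host that accepts a connection, or None if all fail."""
    for host, port in hosts:
        try:
            connect(host, port)
        except OSError:
            # This host is down or unreachable; fall through to the next one.
            continue
        return (host, port)
    return None
```

If failover worked this way, stopping Exim on mx1001 would make `first_reachable` return the mx2001 entry. The behaviour reported in this task suggests the real client did not fall through like this, for reasons not established here.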

Reedy added a subscriber: Reedy. · Jun 11 2018, 6:56 PM

What does the phabricator outbound mail config look like today?

Do we already have both mx1001 and mx2001 configured as outbound mail servers?

They were both listed/configured when I looked during and after the email problems last week.

Joe triaged this task as High priority. · Jun 18 2018, 10:48 AM

Network connectivity looks good from phab1001 to both MX servers.

phab1001:~# nc -vz mx1001.wikimedia.org 25
mx1001.wikimedia.org [208.80.154.76] 25 (smtp) open
phab1001:~# nc -vz mx2001.wikimedia.org 25
mx2001.wikimedia.org [208.80.153.45] 25 (smtp) open
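
The same reachability check can be scripted, for example to run it repeatedly while reproducing the problem. This is a rough Python equivalent of the `nc -vz` calls above; `smtp_port_open` is a hypothetical helper name, and like `nc -z` it only confirms that the TCP port accepts connections, not that the SMTP service behind it is healthy.

```python
import socket


def smtp_port_open(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False
```

From phab1001 this would be called as `smtp_port_open("mx1001.wikimedia.org")` and `smtp_port_open("mx2001.wikimedia.org")`.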

Since mail was queued successfully by Phabricator, we could choose a time to set log verbosity to debug and attempt to reproduce this. A temporary firewall rule rejecting connections from phab1001 to mx1001 should have the same effect as stopping Exim on mx1001. Even better if there is a test environment to use.

In the meantime, I recommend we update the config to phpmailer.smtp-host = localhost and let the local Exim MTA handle failover.

Change 440910 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] phabricator: set smtp-host to localhost

https://gerrit.wikimedia.org/r/440910

Vvjjkkii renamed this task from Phabricator outbound email seems to have a SPOF of mx1001 to f9aaaaaaaa. · Jul 1 2018, 1:04 AM
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
ArielGlenn renamed this task from f9aaaaaaaa to Phabricator outbound email seems to have a SPOF of mx1001. · Jul 1 2018, 6:24 AM
ArielGlenn updated the task description.
ArielGlenn added subscribers: gerritbot, Aklapper.

In the meantime, I recommend we update the config to phpmailer.smtp-host = localhost and let the local Exim MTA handle failover.

FWIW, the localhost MTA approach has been used by Gerrit for a couple of days now and is working well there.

Does this need more review/commentary before moving forward?

herron added a comment. · Sep 4 2018, 3:01 PM

Does this need more review/commentary before moving forward?

We should be in good shape to move forward with this. The localhost smtp-host patch has two +1s, so I'll plan to merge it tomorrow unless someone objects.

Mentioned in SAL (#wikimedia-operations) [2018-09-05T13:38:59Z] <herron> updating phabricator mail smtp-host to localhost T196916

Change 440910 merged by Herron:
[operations/puppet@production] phabricator: set smtp-host to localhost

https://gerrit.wikimedia.org/r/440910

herron closed this task as Resolved. · Sep 5 2018, 1:50 PM
herron claimed this task.

This is looking good. Here are the Received headers from a recently delivered message that was relayed through localhost:

Received: from phab1001.eqiad.wmnet ([2620:0:861:102:10:64:16:8]:55908)
	by mx1001.wikimedia.org with esmtp (Exim 4.84_2)
	(envelope-from <no-reply@phabricator.wikimedia.org>)
	id 1fxY6e-0005SD-3z
	for kherron@wikimedia.org; Wed, 05 Sep 2018 13:44:28 +0000
Received: from localhost ([::1]:36512 helo=localhost.localdomain)
	by phab1001.eqiad.wmnet with esmtp (Exim 4.84_2)
	(envelope-from <no-reply@phabricator.wikimedia.org>)
	id 1fxY6d-0000iF-V4
	for kherron@wikimedia.org; Wed, 05 Sep 2018 13:44:28 +0000
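
Read bottom to top, the headers above show the message going from Phabricator to the local Exim on phab1001, then from phab1001 to mx1001. For checking this on more than one message, a small Python sketch like the following could extract the hop chain. It uses the standard library `email` parser; `relay_hops` and the regular expression are illustrative assumptions that cover the header shape shown here, not every Received format allowed in the wild.

```python
import re
from email import message_from_string
from typing import List, Tuple

# Matches "from <host> ... by <host>" across folded (multi-line) header values.
RECEIVED_RE = re.compile(r"^\s*from\s+(?P<src>\S+).*?\bby\s+(?P<dst>\S+)", re.DOTALL)

SAMPLE = (
    "Received: from phab1001.eqiad.wmnet ([2620:0:861:102:10:64:16:8]:55908)\n"
    "\tby mx1001.wikimedia.org with esmtp (Exim 4.84_2)\n"
    "\t(envelope-from <no-reply@phabricator.wikimedia.org>)\n"
    "\tid 1fxY6e-0005SD-3z\n"
    "\tfor kherron@wikimedia.org; Wed, 05 Sep 2018 13:44:28 +0000\n"
    "Received: from localhost ([::1]:36512 helo=localhost.localdomain)\n"
    "\tby phab1001.eqiad.wmnet with esmtp (Exim 4.84_2)\n"
    "\t(envelope-from <no-reply@phabricator.wikimedia.org>)\n"
    "\tid 1fxY6d-0000iF-V4\n"
    "\tfor kherron@wikimedia.org; Wed, 05 Sep 2018 13:44:28 +0000\n"
)


def relay_hops(raw_headers: str) -> List[Tuple[str, str]]:
    """Return (from_host, by_host) pairs in the order the message travelled."""
    msg = message_from_string(raw_headers)
    hops = []
    for value in msg.get_all("Received", []):
        match = RECEIVED_RE.search(value)
        if match:
            hops.append((match.group("src"), match.group("dst")))
    # Each relay prepends its header, so reverse to get chronological order.
    hops.reverse()
    return hops
```

Calling `relay_hops(SAMPLE)` yields the localhost to phab1001 hop first, followed by the phab1001 to mx1001 hop.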