Phab)
Open, HighPublic

Assigned To

Authored By

	MoritzMuehlenhoff
	Sep 9 2019, 2:33 PM

Description

Some like CVE-2019-13917 only affect exotic configurations, but the recent one was actually triggerable via TLS negotiation.

Postfix OTOH only had one security issue resulting in code execution in the last decade (https://www.debian.org/security/2011/dsa-2233) and by nature of its design the impact is reduced to code injection as user postfix instead of root.

Exim has great upstream support and security issues are dealt with in an exceptionally professional manner, but there's always the risk of zero-days. The recent string handling issue exploitable via TLS was present in the code base as far back as the VCS history goes; it could just as well have been discovered earlier, before it was eventually responsibly disclosed.

In fact, there has been at least one security issue in the past which was exploited in the wild, where the change in question wasn't identified as a security issue:
https://lists.exim.org/lurker/message/20101207.215955.bb32d4f2.en.html (which is CVE-2010-4344)

So let's have a discussion/evaluation of whether Postfix meets our feature needs and whether moving to Postfix is an option. If so, when we migrate our MXes to Buster we could consider moving to Postfix instead.

Details

Subject	Repo	Branch	Lines +/-
profile::mail: add mta hiera option profile::mail::mta	operations/puppet	production	+11 -5
exim: make exim class ensurable	operations/puppet	production	+13 -16
mail: move default mail relay config out of standard module	operations/puppet	production	+14 -8


Related Objects

Status	Assigned	Task
Open	jhathaway	T232343 Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab)
Open	jhathaway	T325394 Replace Exim with Postfix on mail servers
Resolved	jhathaway	T325395 postfix mx puppetry
Resolved	jhathaway	T325396 Postfix Module
Resolved	jhathaway	T325397 Rspamd module
Resolved	jhathaway	T325398 Postfix MTA Profile
Resolved	jhathaway	T325403 MTA provisioning
Invalid	jhathaway	T325401 Provision mta-inbound-infra
Invalid	jhathaway	T325402 Provision mta-outbound-infra
Resolved	jhathaway	T325406 Provision mx-in
Resolved	aborrero	T374278 Update wmcloud.org MX records
Resolved	jhathaway	T325407 Provision mx-out
Resolved	jhathaway	T361750 Site: eqiad, codfw 2 VM request for postfix mx-out
Open	jhathaway	T325408 Replace Exim null client config with a Postfix null client config
Resolved	jhathaway	T325409 Decom Exim based mx{1001,2001}.wikimedia.org
Resolved	jhathaway	T358355 Integration tests
Resolved	jhathaway	T365395 Postfix outbound rollout sequence, mx-out
Resolved	jhathaway	T366113 Update SPF records as needed
Resolved	Dwisehaupt	T366740 Update fundraising mail settings to use new production mx hosts
Resolved	jhathaway	T367517 Postfix inbound rollout sequence, mx-in
Resolved	Dwisehaupt	T367573 Update fundraising mail / firewall settings to use new production mx-in hosts
Open	jhathaway	T378021 Replace Exim on lists.wikimedia.org with Postfix
Open	jhathaway	T325404 Provision mx-in-lists
Open	jhathaway	T325405 Provision mx-out-lists
Open	jhathaway	T378028 Replace Exim on VRTS servers with Postfix
Open	jhathaway	T378029 Replace Exim on phabricator servers with Postfix

Event Timeline

MoritzMuehlenhoff created this task. Sep 9 2019, 2:33 PM

Restricted Application added a subscriber: Aklapper. Sep 9 2019, 2:33 PM

Peachey88 subscribed. Sep 9 2019, 8:27 PM

jbond triaged this task as Medium priority. Sep 10 2019, 10:07 AM

jbond added a subscriber: herron.

jbond subscribed.

crusnov subscribed. Sep 11 2019, 4:31 PM

MoritzMuehlenhoff updated the task description. Sep 30 2019, 8:05 AM

Paladox subscribed. Sep 30 2019, 8:12 AM

MoritzMuehlenhoff added a project: User-MoritzMuehlenhoff. May 28 2020, 11:53 AM

MoritzMuehlenhoff raised the priority of this task from Medium to High. May 4 2021, 1:46 PM

bd808 subscribed. May 4 2021, 4:13 PM

Ladsgroup subscribed. May 4 2021, 5:11 PM

This has a good comparison: https://mailtrap.io/blog/postfix-sendmail-exim/

It seems postfix is better in security/performance but has lower market share. My biggest worry in general is the market share (and IIRC, I need to double check, exim4 seems to be gaining market share), but besides that I have no objection.

Some high level thoughts about how we might approach migrating:

Inbound mail: As a first step in migrating to postfix we could front the existing exim mx100[12] hosts with postfix mx-in100[12] hosts running a simple queue+forward configuration. Listing only the postfix mx-in hosts in our MX records would remove our need for internet facing exim listeners on the prod MXes, and we could firewall the exim MX hosts to receive local traffic only. Longer-term we could migrate the routing and content filtering configs from exim mx to postfix mx-in, eventually turning down the exim MXes entirely.
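To make the queue+forward idea a bit more concrete, here is a minimal sketch (not a tested production config) that renders Postfix transport(5) table lines sending each relayed domain to a single backend host. The function name, the example domain list and the choice of backend are assumptions for illustration only; the fronting hosts would also need `relay_domains` and `transport_maps` set in main.cf to use such a table.

```python
def render_transport_map(relay_domains, backend_host, port=25):
    """Return Postfix transport(5) table lines forwarding each domain to one backend.

    All arguments are illustrative placeholders, not values from our puppet repo.
    """
    # Square brackets tell Postfix to use the host directly instead of doing an MX lookup.
    nexthop = f"relay:[{backend_host}]:{port}"
    # Sort for stable output so generated files do not churn between runs.
    return "\n".join(f"{domain}\t{nexthop}" for domain in sorted(relay_domains))


if __name__ == "__main__":
    # Hypothetical example: forward one domain to one of the existing exim MX hosts.
    print(render_transport_map({"wikimedia.org"}, "mx1001.wikimedia.org"))
```

With a table like this the fronting hosts would only queue and hand off mail, while routing and filtering stay on the exim side until they are migrated.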

Outbound mail: Today the same MXes handling inbound mail also handle outbound bulk mail and transactional "wiki mail". There is an additional IP address bound to the MX servers which is used for wiki mail. We could/should split outbound mail handling off to a new set of postfix mx-out[12]001 smarthosts, optionally splitting mx-out once more between bulk mail (root mail, crons, etc.) and wiki mail with something to the effect of mx-out-wiki[12]001 and mx-out-bulk[12]001.

Host MTAs: There is an exim instance on each host in the fleet configured to queue and relay messages out via mx[12]001. Several services make use of this (gerrit, phabricator, etc.). This will be straightforward to implement in postfix, but for awareness we will need to ensure that the postfix host MTA binds localhost:25 to avoid interrupting outbound application mail flow.
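As a rough illustration of the host MTA case, the sketch below renders a main.cf fragment along the lines of the null client example in Postfix's standard configuration documentation: listen on loopback only, deliver nothing locally, and relay everything to a smarthost. The function name and the host names passed in are hypothetical, and the real puppetry may look quite different.

```python
def render_null_client_main_cf(myhostname, relayhost):
    """Return a main.cf fragment for a loopback-only, relay-everything Postfix host.

    Both arguments are placeholders; nothing here is taken from the puppet repo.
    """
    settings = {
        "myhostname": myhostname,
        "myorigin": "$myhostname",
        # Brackets skip the MX lookup and relay to the named smarthost directly.
        "relayhost": f"[{relayhost}]",
        # Keep the listener on loopback so local applications can still reach port 25.
        "inet_interfaces": "loopback-only",
        # Empty value: this host accepts no mail for local delivery.
        "mydestination": "",
    }
    # rstrip() avoids a trailing space after "=" on the empty setting.
    return "\n".join(f"{key} = {value}".rstrip() for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    # Hypothetical host and smarthost names.
    print(render_null_client_main_cf("host1001.example.org", "mx-out.example.org"), end="")
```

The `inet_interfaces = loopback-only` line is the part that addresses the localhost:25 concern, since applications like gerrit and phabricator would keep submitting to the local listener.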

Lists: Lists/mailman has an internet facing exim instance, separate from the mx cluster. We could front this with postfix near-term, using much the same approach as described for inbound mail, possibly the same hosts even. Longer-term we could migrate the inbound/outbound lists routing configuration into postfix, and optionally integrate lists into future production mx-(in|out) clusters described above.

Cloud smarthosts: There is a set of outbound MX smarthosts in the cloud environment. We should be able to repurpose much of the work done to set up a postfix based mx-out to deploy postfix based cloud smarthosts.

I think that in terms of effort vs reward, removing internet facing exim instances, and migrating host MTAs will be a good place to start. That would shrink our exim surface area significantly.

Also, we could think about settling into a mixed environment where we service something like 95% of our mail needs using postfix, and reserve exim only for more complex routing/lookup where it may be the better tool for the job.

In T232343#7058654, @herron wrote:

Lists: Lists/mailman has an internet facing exim instance, separate from the mx cluster. We could front this with postfix near-term, using much the same approach as described for inbound mail, possibly the same hosts even. Longer-term we could migrate the inbound/outbound lists routing configuration into postfix, and optionally integrate lists into future production mx-(in|out) clusters described above.

If I'm understanding this proposal correctly, it would also help toward T278495: Figure out plan for mailman IP situation since lists would no longer have its own internet-facing MTA.

Legoktm updated the task description. May 4 2021, 11:09 PM

Change 686633 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] mail: move default mail relay config out of standard module

https://gerrit.wikimedia.org/r/686633

gerritbot added a project: Patch-For-Review. May 7 2021, 4:01 PM

Change 688333 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] exim: make exim class ensurable

https://gerrit.wikimedia.org/r/688333

Change 688391 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] profile::mail: add mta hiera option profile::mail::mta

https://gerrit.wikimedia.org/r/688391

Change 686633 merged by Herron:

[operations/puppet@production] mail: move default mail relay config out of standard module

https://gerrit.wikimedia.org/r/686633

Aklapper added a project: Infrastructure-Foundations. Jun 21 2021, 8:59 PM

Legoktm mentioned this in T286066: Put lists.wikimedia.org web interface behind LVS. Jul 2 2021, 4:36 PM

In T232343#7058654, @herron wrote:

Lists: Lists/mailman has an internet facing exim instance, separate from the mx cluster. We could front this with postfix near-term, using much the same approach as described for inbound mail, possibly the same hosts even. Longer-term we could migrate the inbound/outbound lists routing configuration into postfix, and optionally integrate lists into future production mx-(in|out) clusters described above.

Some historical context on the "same hosts" part: inbound MXes and lists were originally all in the same MX, alongside a few other functions (OTRS etc.). It was a very tightly coupled, very complicated and thus very brittle exim config, and also very hard to monitor or debug. With any change (e.g. exim config, Debian upgrade etc.) one needed to test against a number of different things, while also risking bringing everything down (and we've had very high stress situations with all corporate email being broken). Mailing list spam or backscatter would result in a full exim queue, delaying other emails. There were some complicated issues that I don't fully recall (i.e. erased from my memory ;)) with regards to DKIM signing, and alias expansions where emails went through multiple routers (alias to a mailing list with corp email members).

While some of them could be mitigated (e.g. separate exim queues per router), ultimately the complexity issue would not be addressed or could even become worse. I split the functions out into separate boxes and exim configs in 2014, and it was an immediate improvement, with very rare outages on either system ever since. There were some security wins as well, that we benefitted from soon after that. I ran out of time, but at the time I was also planning the split of inbound/outbound, as well as outbound smarthost vs. wiki-mail, for the same reasons.

That was 2014, and I'm not qualified to have an opinion on the technical design in 2021 ;) However, directionally I'd like to add two more non-technical concerns to take into account:

We should not create additional impediments to mail-related projects. For example, we should be thinking if a proposed solution would have prevented the same velocity or access to be granted for the Mailman 2->3 project.
As SRE and the org grow, ownership of different components may lie with different teams, with the served audiences being the differentiator, rather than the technology used. Tighter coupling could create interdependencies between teams or contribute to lack of ownership. (This may be the case already today even at our existing size and existing functions of our MXes :)

In T232343#7194626, @faidon wrote:

While some of them could be mitigated (e.g. separate exim queues per router), ultimately the complexity issue would not be addressed or could even become worse. I split the functions out into separate boxes and exim configs in 2014, and it was an immediate improvement, with very rare outages on either system ever since. There were some security wins as well, that we benefitted from soon after that. I ran out of time, but at the time I was also planning the split of inbound/outbound, as well as outbound smarthost vs. wiki-mail, for the same reasons.

If the main MXes were just relays for lists mail, do you think your concerns would still apply? E.g. mx#### (internet facing, postfix) -> lists#### (not internet facing, exim) -> mailman3.

We should not create additional impediments to mail-related projects. For example, we should be thinking if a proposed solution would have prevented the same velocity or access to be granted for the Mailman 2->3 project.

Hard to get worse velocity than how overdue Mailman3 was :P but I agree, I'm generally more hesitant to make changes to the prod MXes (partly why https://gerrit.wikimedia.org/r/c/operations/puppet/+/681242 hasn't moved forward faster yet).

BTullis subscribed. Dec 7 2022, 11:12 AM

Change 688333 abandoned by Jbond:

[operations/puppet@production] exim: make exim class ensurable

Reason:

work is being done to move to postfix

https://gerrit.wikimedia.org/r/688333

jhathaway claimed this task. Feb 26 2024, 3:46 PM

Ladsgroup mentioned this in T356984: Stop sending change notification email if edit is done by a bot. Mar 18 2024, 10:56 AM

Change #688391 abandoned by Herron:

[operations/puppet@production] profile::mail: add mta hiera option profile::mail::mta

Reason:

spring cleaning -- stale patch