Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab)
Open, High, Public

Description

Exim has had a number of security issues in recent years which allowed local and remote privilege escalation, often to root. Looking back at the last ten years:
https://www.qualys.com/2021/05/04/21nails/21nails.txt
https://www.debian.org/security/2019/dsa-4456
https://www.debian.org/security/2019/dsa-4488
https://www.debian.org/security/2019/dsa-4517
https://www.debian.org/security/2018/dsa-4110
https://www.debian.org/security/2017/dsa-4053
https://www.debian.org/security/2016/dsa-3517
https://www.debian.org/security/2012/dsa-2566
https://www.debian.org/security/2011/dsa-2232
https://www.debian.org/security/2011/dsa-2236
https://www.debian.org/security/2011/dsa-2154

Some, like CVE-2019-13917, only affect exotic configurations, but the recent one was actually triggerable via TLS negotiation.

Postfix, OTOH, has had only one security issue resulting in code execution in the last decade (https://www.debian.org/security/2011/dsa-2233), and by the nature of its design the impact is reduced to code injection as user postfix instead of root.

Exim has great upstream support and security issues are dealt with in an exceptionally professional manner, but there's always the risk of zero-days. The recent string handling issue exploitable via TLS had been present in the code base for as far back as the VCS history goes; it could just as well have been discovered by someone else long before it was eventually responsibly disclosed.

In fact, there has been at least one security issue in the past which was exploited in the wild through a change that wasn't identified as a security fix:
https://lists.exim.org/lurker/message/20101207.215955.bb32d4f2.en.html (which is CVE-2010-4344)

So let's have a discussion/evaluation of whether Postfix meets our feature needs and whether moving to Postfix is an option. If so, when we migrate our MXes to Buster we could consider moving to Postfix instead.

Event Timeline

jbond triaged this task as Medium priority. Sep 10 2019, 10:07 AM
jbond added a subscriber: herron.
jbond added a subscriber: jbond.
MoritzMuehlenhoff raised the priority of this task from Medium to High. May 4 2021, 1:46 PM

This has a good comparison: https://mailtrap.io/blog/postfix-sendmail-exim/

It seems postfix is better in security/performance but has lower market share. My biggest worry in general is the market share (IIRC, and I need to double check, exim4 seems to be gaining market share), but besides that I have no objection.

Some high level thoughts about how we might approach migrating:

Inbound mail: As a first step in migrating to postfix we could front the existing exim mx100[12] hosts with postfix mx-in100[12] hosts running a simple queue+forward configuration. Listing only the postfix mx-in hosts in our MX records would remove our need for internet facing exim listeners on the prod MXes, and we could firewall the exim MX hosts to receive local traffic only. Longer-term we could migrate the routing and content filtering configs from exim mx to postfix mx-in, eventually turning down the exim MXes entirely.
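
As a rough illustration of the "only mx-in hosts in our MX records" condition, here is a minimal Python sketch of a check that could run against whatever MX targets a DNS lookup returns. The function name, the regex, and the example.org domain are assumptions made for illustration; only the mx-in100[12] naming comes from the proposal above, and the DNS lookup itself is deliberately left out.

```python
import re

# Hypothetical pattern derived from the proposed mx-in100[12] host naming.
MX_IN_LABEL = re.compile(r"^mx-in100[12]$")


def non_postfix_mx_targets(mx_targets):
    """Return the MX targets whose first DNS label is not an mx-in host.

    mx_targets is a list of hostnames, as they would appear on the
    right-hand side of MX records (a trailing dot is tolerated).
    """
    offenders = []
    for target in mx_targets:
        # Compare only the first label so the check is independent of the domain.
        first_label = target.rstrip(".").split(".", 1)[0].lower()
        if not MX_IN_LABEL.match(first_label):
            offenders.append(target)
    return offenders
```

An empty result would mean every published MX target is one of the postfix front ends; any hostname returned still points internet mail at something else, such as an exim mx100[12] host.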

Outbound mail: Today the same MXes handling inbound mail also handle outbound bulk mail and transitional "wiki mail". There is an additional IP address bound to the MX servers which is used for wiki mail. We could/should split outbound mail handling off to a new set of postfix mx-out[12]001 smarthosts, optionally splitting mx-out once more between bulk mail (root mail, crons, etc.) and wiki mail with something to the effect of mx-out-wiki[12]001 and mx-out-bulk[12]001.

Host MTAs: There is an exim instance on each host in the fleet configured to queue and relay messages out via mx[12]001. Several services make use of this (gerrit, phabricator, etc.). This will be straightforward to implement in postfix, but for awareness we will need to ensure that the postfix host MTA binds localhost:25 to avoid interrupting outbound application mail flow.
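
For the localhost:25 point, a post-migration smoke test could be as small as the following Python sketch. It only checks that something accepts TCP connections on the given address; it does not verify that the listener is postfix or that it speaks SMTP. The function name and default timeout are assumptions for illustration.

```python
import socket


def port_is_listening(host="127.0.0.1", port=25, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is accepting connections on the target address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Called with the defaults on a fleet host, a False result would suggest that applications submitting mail to the local MTA are about to start failing.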

Lists: Lists/mailman has an internet facing exim instance, separate from the mx cluster. We could front this with postfix near-term, using much the same approach as described for inbound mail, possibly the same hosts even. Longer-term we could migrate the inbound/outbound lists routing configuration into postfix, and optionally integrate lists into future production mx-(in|out) clusters described above.

Cloud smarthosts: There are a set of outbound MX smarthosts in the cloud environment. We should be able to repurpose much of the work done to set up a postfix based mx-out to deploy postfix based cloud smarthosts.

I think that in terms of effort vs reward, removing internet facing exim instances, and migrating host MTAs will be a good place to start. That would shrink our exim surface area significantly.

Also, we could think about settling into a mixed environment where we service something like 95% of our mail needs using postfix, and reserve exim only for more complex routing/lookup where it may be the better tool for the job.

> Lists: Lists/mailman has an internet facing exim instance, separate from the mx cluster. We could front this with postfix near-term, using much the same approach as described for inbound mail, possibly the same hosts even. Longer-term we could migrate the inbound/outbound lists routing configuration into postfix, and optionally integrate lists into future production mx-(in|out) clusters described above.

If I'm understanding this proposal correctly, it would also help with T278495 (Figure out plan for mailman IP situation), since lists would no longer have its own internet-facing MTA.

Change 686633 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] mail: move default mail relay config out of standard module

https://gerrit.wikimedia.org/r/686633

Change 688333 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/puppet@production] exim: make exim class ensurable

https://gerrit.wikimedia.org/r/688333

Change 688391 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] profile::mail: add mta hiera option profile::mail::mta

https://gerrit.wikimedia.org/r/688391

Change 686633 merged by Herron:

[operations/puppet@production] mail: move default mail relay config out of standard module

https://gerrit.wikimedia.org/r/686633

> Lists: Lists/mailman has an internet facing exim instance, separate from the mx cluster. We could front this with postfix near-term, using much the same approach as described for inbound mail, possibly the same hosts even. Longer-term we could migrate the inbound/outbound lists routing configuration into postfix, and optionally integrate lists into future production mx-(in|out) clusters described above.

Some historical context on the "same hosts" part: inbound MXes and lists were originally all in the same MX, alongside a few other functions (OTRS etc.). It was a very tightly coupled, very complicated and thus very brittle exim config, and also very hard to monitor or debug. With any change (e.g. exim config, Debian upgrade etc.) one needed to test against a number of different things, while also risking bringing everything down (and we've had very high stress situations with all corporate email being broken). Mailing list spam or backscatter would result in a full exim queue, delaying other emails. There were some complicated issues that I don't fully recall (i.e. erased from my memory ;)) with regard to DKIM signing and alias expansions where emails went through multiple routers (alias to a mailing list with corp email members).

While some of them could be mitigated (e.g. separate exim queues per router), ultimately the complexity issue would not be addressed or could even become worse. I split the functions out into separate boxes and exim configs in 2014, and it was an immediate improvement, with very rare outages on either system ever since. There were some security wins as well, which we benefited from soon after. I ran out of time, but back then I was also planning the split of inbound/outbound, as well as outbound smarthost vs. wiki-mail, for the same reasons.

That was 2014, and I'm not qualified to have an opinion on the technical design in 2021 ;) However, directionally I'd like to add two more non-technical concerns to take into account:

  • We should not create additional impediments to mail-related projects. For example, we should consider whether a proposed solution would have prevented the same velocity or access from being granted for the Mailman 2->3 project.
  • As SRE and the org grow, ownership of different components may lie with different teams, with the served audiences being the differentiator, rather than the technology used. Tighter coupling could create interdependencies between teams or contribute to lack of ownership. (This may be the case already today even at our existing size and existing functions of our MXes :)

> While some of them could be mitigated (e.g. separate exim queues per router), ultimately the complexity issue would not be addressed or could even become worse. I split the functions out into separate boxes and exim configs in 2014, and it was an immediate improvement, with very rare outages on either system ever since. There were some security wins as well, which we benefited from soon after. I ran out of time, but back then I was also planning the split of inbound/outbound, as well as outbound smarthost vs. wiki-mail, for the same reasons.

If the main MXes were just relays for lists mail, do you think your concerns would still apply? E.g. mx#### (internet facing, postfix) -> lists#### (not internet facing, exim) -> mailman3.

> • We should not create additional impediments to mail-related projects. For example, we should consider whether a proposed solution would have prevented the same velocity or access from being granted for the Mailman 2->3 project.

Hard to get worse velocity than how overdue Mailman3 was :P but I agree, I'm generally more hesitant to make changes to the prod MXes (partly why https://gerrit.wikimedia.org/r/c/operations/puppet/+/681242 hasn't moved forward faster yet).