Upgrade MXes to Bullseye
Open, In Progress, Medium, Public

Description

mx1001 and mx2001 are currently running Stretch and we should upgrade them to Bullseye. In principle we have two ways to do that:

  1. Install new mx1002 and mx2002 systems with Bullseye and switch over the MX records. This involves an IP change which may or may not impact mail sending (further research needed)
  2. Test the role's compatibility with Bullseye beforehand. Then temporarily filter access to each MX and reimage in place to the new OS (starting with mx2001 and then keeping that around for 1-2 weeks, then move on with mx1001)

Event Timeline

+1 for option 2, I think that will be a more straightforward approach overall.

In either case let's include a step to route and flush queued mail to the MX in the other DC before erasing/retiring each stretch host.

The exim version in Bullseye (4.94) had some breaking changes - see https://www.debian.org/releases/bullseye/amd64/release-notes/ch-information.en.html#idm1404. So I agree that option 2 is a better approach.

Ok, so let's proceed with option 2. There's a test instance mx2002.wikimedia.org which I'll set up with the mx role and bullseye next week.

To prevent this test server from accidentally messing with our existing production mail infrastructure, I'd like to also filter port 25 for mx2002.wikimedia.org on the router level. @cmooney or @ayounsi, is that something you could set up?

Change 710943 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] discard traffic to mx2002 tcp/25

https://gerrit.wikimedia.org/r/710943

Change 710943 merged by jenkins-bot:

[operations/homer/public@master] discard traffic to mx2002 tcp/25

https://gerrit.wikimedia.org/r/710943

Done, let us know when to roll back.

Thanks, will do.

Change 711123 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Apply MX role to mx2002

https://gerrit.wikimedia.org/r/711123

Change 711123 merged by Muehlenhoff:

[operations/puppet@production] Apply MX role to mx2002

https://gerrit.wikimedia.org/r/711123

Change 712277 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] acmechief: acmechief: allow mx2002

https://gerrit.wikimedia.org/r/712277

Change 712277 merged by Herron:

[operations/puppet@production] acmechief: acmechief: allow mx2002

https://gerrit.wikimedia.org/r/712277

Seeing errors like this in the paniclog, unfortunately:

mx2002:~# zcat /var/log/exim4/paniclog*.gz
2021-08-29 00:00:01 1mK8Ez-0095xJ-IE Tainted filename for search '/etc/exim4/aliases/wikimedia.org'
2021-08-22 00:00:01 1mHau9-005TiL-Hz Tainted filename for search '/etc/exim4/aliases/wikimedia.org'
2021-08-19 00:40:18 1mF3ZJ-001qIs-21 Tainted filename for search '/etc/exim4/aliases/wikimedia.org'

Which AIUI is due to our current use of $domain in the alias lookup, and will cause aliases to be rejected in the upgraded version.

aliases:
        driver = redirect
        domains = +local_domains
        require_files = CONFDIR/aliases/$domain
        data = ${lookup{$local_part}lsearch*{CONFDIR/aliases/$domain}}

From the docs at http://exim.org/exim-html-current/doc/html/spec_html/ch-string_expansions.html

$domain 
<snip>
If the origin of the data is an incoming message, the result of expanding this variable is tainted. When an untainted version is needed, one should be obtained from looking up the value in a local (therefore trusted) database. Often $domain_data is usable in this role.

And found some additional related information in https://github.com/Exim/exim/blob/master/src/README.UPDATING

Some Transports now refuse to use tainted data in constructing their delivery
location; this WILL BREAK configurations which are not updated accordingly.
In particular: any Transport use of $local_part which has been relying upon
check_local_user far away in the Router to make it safe, should be updated to
replace $local_part with $local_part_data.
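
The de-tainting idea from the quoted docs can be illustrated outside of Exim. The following is only a Python sketch of the principle, not Exim configuration and not the fix that was applied here: the function name and directory layout are hypothetical. The value taken from the incoming message is only ever compared against local data, and the path that gets used comes from the trusted directory listing.

```python
from pathlib import Path
from typing import Optional


def untainted_alias_file(domain: str, conf_dir: Path) -> Optional[Path]:
    """Return the alias file for `domain`, taken from a trusted directory listing.

    `domain` is assumed to come from an incoming message (tainted), so it is
    only compared against local data and never joined into a path itself.
    """
    wanted = domain.lower()
    for entry in sorted(conf_dir.iterdir()):
        # `entry` originates from the local filesystem, so it is trusted.
        if entry.is_file() and entry.name.lower() == wanted:
            return entry
    # No local alias file matches the tainted value.
    return None
```

A value such as "../../etc/passwd" simply fails to match any directory entry, so it can never influence which file is opened.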

Change 719975 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Disable new config validation on Bullseye

https://gerrit.wikimedia.org/r/719975

Change 719975 merged by Muehlenhoff:

[operations/puppet@production] Disable new config validation on Bullseye

https://gerrit.wikimedia.org/r/719975

Change 720277 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Temporarily filter port 25 on mx2001 for reimage

https://gerrit.wikimedia.org/r/720277

Mentioned in SAL (#wikimedia-operations) [2021-09-13T14:44:28Z] <herron> drained mx2001 mail queue to mx1001 T286911

Change 720277 merged by Muehlenhoff:

[operations/homer/public@master] Temporarily filter port 25 on mx2001 for reimage

https://gerrit.wikimedia.org/r/720277

Mentioned in SAL (#wikimedia-operations) [2021-09-13T15:54:45Z] <moritzm> filtered mx2001 on the routers for reimage T286911

Change 720783 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Revert "Temporarily filter port 25 on mx2001 for reimage"

https://gerrit.wikimedia.org/r/720783

mx2001 is now filtered on the routers. In case there are any issues, this can be reverted by merging https://gerrit.wikimedia.org/r/720783 and running 'homer "cr*" merge' on cumin2002.

Next, mx2001 will be reimaged to bullseye tomorrow after the mw switchover.
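
One way to confirm that a tcp/25 filter is in effect (or has been lifted again) is a plain TCP connection attempt from a host outside the routers. This is a minimal sketch, assuming a simple connect test is sufficient; the function name and the default timeout are illustrative and not part of any existing tooling.

```python
import socket


def smtp_port_reachable(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (packets discarded by a filter), refusals
        # (nothing listening) and name resolution failures.
        return False
```

While the filter is active, smtp_port_reachable("mx2001.wikimedia.org") would be expected to return False after the timeout. The result depends on where the check is run from, since the filter sits on the routers.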

Mentioned in SAL (#wikimedia-operations) [2021-09-14T17:45:41Z] <moritzm> reimaging mx2001 to bullseye T286911

Change 720783 merged by Muehlenhoff:

[operations/homer/public@master] Revert "Temporarily filter port 25 on mx2001 for reimage"

https://gerrit.wikimedia.org/r/720783

Mentioned in SAL (#wikimedia-operations) [2021-09-14T18:48:57Z] <moritzm> removed filter for tcp/25 on mx2001, reimage is complete T286911

Change 721289 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Prefer mx2001 over mx1001 for internal smarthosts

https://gerrit.wikimedia.org/r/721289

Change 721289 merged by Muehlenhoff:

[operations/puppet@production] Prefer mx2001 over mx1001 for internal smarthosts

https://gerrit.wikimedia.org/r/721289

Change 721554 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Switch a few domains over to mx2001

https://gerrit.wikimedia.org/r/721554

Change 721555 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Switch remaining MX records to mx2001

https://gerrit.wikimedia.org/r/721555

Change 721554 merged by Muehlenhoff:

[operations/dns@master] Switch a few domains over to mx2001

https://gerrit.wikimedia.org/r/721554

Status update: mx2001 is reimaged to Bullseye and working fine so far. The smart hosts config on our servers has been switched to prefer mx2001 over mx1001, and the MX records of a handful of lesser-used domains now point to mx2001.
If there are no further issues, the remaining DNS records will be updated on Monday, and following that mx1001 will be reimaged some time mid next week.
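
To spot-check which MX a domain currently prefers after a DNS change, the output of `dig +short MX <domain>` can be parsed as below. This is a sketch only: the function names are made up, and the preference values in the usage note are invented for illustration rather than taken from the actual zone data.

```python
from typing import List, Optional, Tuple


def parse_mx_answers(dig_short_output: str) -> List[Tuple[int, str]]:
    """Parse lines such as '10 mx2001.wikimedia.org.' into (preference, host)."""
    records = []
    for line in dig_short_output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        records.append((int(parts[0]), parts[1].rstrip(".").lower()))
    # Lowest preference value first, i.e. most preferred MX first.
    return sorted(records)


def preferred_mx(dig_short_output: str) -> Optional[str]:
    """Return the host with the lowest preference value, or None if no MX."""
    records = parse_mx_answers(dig_short_output)
    return records[0][1] if records else None
```

For example, given the two lines "50 mx1001.wikimedia.org." and "10 mx2001.wikimedia.org.", preferred_mx returns "mx2001.wikimedia.org".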

Marostegui triaged this task as Medium priority. Mon, Sep 20, 5:08 AM

Change 721555 merged by Muehlenhoff:

[operations/dns@master] Switch remaining MX records to mx2001

https://gerrit.wikimedia.org/r/721555

Change 722551 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Temporarily filter port 25 on mx1001 for reimage

https://gerrit.wikimedia.org/r/722551

joanna_borun changed the task status from Open to In Progress. Tue, Sep 21, 1:33 PM

Change 722630 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] profile::mail::mx: add +dkim_verbose to log_selector on bullseye

https://gerrit.wikimedia.org/r/722630

Change 722630 merged by Herron:

[operations/puppet@production] profile::mail::mx: add +dkim_verbose to log_selector on bullseye

https://gerrit.wikimedia.org/r/722630

Re: the above patch -- DKIM metrics dropped off on mx2001 beginning yesterday. Our Exim metrics are generated via mtail parsing of the Exim log, and Exim 4.90 introduced a new log_selector option for DKIM logging which is switched off by default. Since the DKIM log lines were disabled by default, the DKIM metrics dropped off as well.

After merging https://gerrit.wikimedia.org/r/722630 DKIM metrics are flowing again from the bullseye hosts, looking better!
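
To make the dependency between log lines and metrics concrete, here is a rough Python sketch of counting DKIM results from an Exim main log. It is not the mtail program in use (mtail has its own language), and the regular expression encodes an assumed shape for the DKIM log line; it would need checking against the real log output.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed shape of a DKIM verification log line, e.g.
#   ... DKIM: d=example.org s=sel ... [verification succeeded]
# Adjust the pattern to whatever the real log contains.
DKIM_LINE = re.compile(r" DKIM: d=(?P<domain>\S+) .*\[(?P<result>[^\]]+)\]\s*$")


def count_dkim_results(lines: Iterable[str]) -> Counter:
    """Count DKIM log lines by the bracketed result text at the end of the line."""
    counts: Counter = Counter()
    for line in lines:
        match = DKIM_LINE.search(line)
        if match:
            counts[match.group("result")] += 1
    return counts
```

If the DKIM lines are not logged at all, a counter like this stays empty, which matches the drop seen in the metrics before +dkim_verbose was added to log_selector.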

Screen Shot 2021-09-21 at 3.48.09 PM.png