Upgrade MXes to Bullseye
Closed, Resolved, Public

Description

mx1001 and mx2001 are currently running Stretch and we should upgrade them to Bullseye. In principle we have two ways to do that:

  1. Install new mx1002 and mx2002 systems with Bullseye and switch over the MX records. This involves an IP change which may or may not impact mail sending (further research needed)
  2. Test the role's compatibility with Bullseye beforehand. Then temporarily filter access to each MX and reimage in place to the new OS (starting with mx2001 and then keeping that around for 1-2 weeks, then move on with mx1001)

Details

Project                  Branch      Lines +/-
operations/puppet        production  +0 -7
operations/puppet        production  +1 -2
operations/dns           master      +14 -14
operations/puppet        production  +1 -1
operations/dns           master      +5 -5
operations/puppet        production  +1 -1
operations/puppet        production  +3 -7
operations/puppet        production  +0 -20
operations/homer/public  master      +13 -0
operations/puppet        production  +1 -1
operations/homer/public  master      +13 -0
operations/puppet        production  +1 -1
operations/puppet        production  +6 -1
operations/dns           master      +28 -28
operations/dns           master      +10 -10
operations/puppet        production  +1 -1
operations/homer/public  master      +0 -13
operations/homer/public  master      +13 -0
operations/puppet        production  +9 -0
operations/puppet        production  +2 -1
operations/puppet        production  +1 -1
operations/homer/public  master      +26 -0

Event Timeline

+1 for option 2, I think that will be a more straightforward approach overall.

In either case let's include a step to route and flush queued mail to the MX in the other DC before erasing/retiring each stretch host.

The exim version in Bullseye (4.94) had some breaking changes - see https://www.debian.org/releases/bullseye/amd64/release-notes/ch-information.en.html#idm1404. So I agree that option 2 is a better approach.

Ok, so let's proceed with option 2. There's a test instance, mx2002.wikimedia.org, which I'll set up with the mx role and Bullseye next week.

To prevent this test server from accidentally messing with our existing production mail infrastructure, I'd like to also filter port 25 for mx2002.wikimedia.org on the router level. @cmooney or @ayounsi, is that something you could set up?

Change 710943 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] discard traffic to mx2002 tcp/25

https://gerrit.wikimedia.org/r/710943

Change 710943 merged by jenkins-bot:

[operations/homer/public@master] discard traffic to mx2002 tcp/25

https://gerrit.wikimedia.org/r/710943

To prevent this test server from accidentally messing with our existing production mail infrastructure, I'd like to also filter port 25 for mx2002.wikimedia.org on the router level. @cmooney or @ayounsi, is that something you could set up?

Done, let us know when to roll back.

Thanks, will do.

Change 711123 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Apply MX role to mx2002

https://gerrit.wikimedia.org/r/711123

Change 711123 merged by Muehlenhoff:

[operations/puppet@production] Apply MX role to mx2002

https://gerrit.wikimedia.org/r/711123

Change 712277 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] acmechief: acmechief: allow mx2002

https://gerrit.wikimedia.org/r/712277

Change 712277 merged by Herron:

[operations/puppet@production] acmechief: acmechief: allow mx2002

https://gerrit.wikimedia.org/r/712277

Seeing errors like this in the paniclog unfortunately

mx2002:~# zcat /var/log/exim4/paniclog*.gz
2021-08-29 00:00:01 1mK8Ez-0095xJ-IE Tainted filename for search '/etc/exim4/aliases/wikimedia.org'
2021-08-22 00:00:01 1mHau9-005TiL-Hz Tainted filename for search '/etc/exim4/aliases/wikimedia.org'
2021-08-19 00:40:18 1mF3ZJ-001qIs-21 Tainted filename for search '/etc/exim4/aliases/wikimedia.org'

Which AIUI is due to our current use of $domain in the alias lookup, and will cause aliases to be rejected in the upgraded version.

aliases:
        driver = redirect
        domains = +local_domains
        require_files = CONFDIR/aliases/$domain
        data = ${lookup{$local_part}lsearch*{CONFDIR/aliases/$domain}}

From the docs at http://exim.org/exim-html-current/doc/html/spec_html/ch-string_expansions.html

$domain 
<snip>
If the origin of the data is an incoming message, the result of expanding this variable is tainted. When an untainted version is needed, one should be obtained from looking up the value in a local (therefore trusted) database. Often $domain_data is usable in this role.

And found some additional related information in https://github.com/Exim/exim/blob/master/src/README.UPDATING

Some Transports now refuse to use tainted data in constructing their delivery
location; this WILL BREAK configurations which are not updated accordingly.
In particular: any Transport use of $local_part which has been relying upon
check_local_user far away in the Router to make it safe, should be updated to
replace $local_part with $local_part_data.
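The idea behind the taint check can be sketched outside of Exim. This is not Exim's implementation and not the fix applied to our config; it only illustrates the principle from the docs quoted above: a value taken from an incoming message is never used directly to build a file path. Instead it is looked up in a locally maintained list, and the matching entry from that list (the analogue of $domain_data) is used. LOCAL_DOMAINS and alias_file_for are hypothetical names, and the second domain in the list is made up for the example.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical values for illustration only. CONFDIR matches the path seen
# in the paniclog; the domain list is an assumption.
CONFDIR = PurePosixPath("/etc/exim4")
LOCAL_DOMAINS = ("wikimedia.org", "example.org")  # trusted, locally maintained


def alias_file_for(untrusted_domain: str) -> Optional[PurePosixPath]:
    """Return the alias file for a domain, or None if it is not a local domain.

    The path is built from the matching entry of the trusted list, never from
    the untrusted input itself.
    """
    wanted = untrusted_domain.lower()
    for trusted in LOCAL_DOMAINS:
        if trusted == wanted:
            return CONFDIR / "aliases" / trusted
    return None
```

A domain that is not in the trusted list (including anything crafted to look like a path) yields None instead of a file name, which is the property the new Exim check enforces.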

Change 719975 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Disable new config validation on Bullseye

https://gerrit.wikimedia.org/r/719975

Change 719975 merged by Muehlenhoff:

[operations/puppet@production] Disable new config validation on Bullseye

https://gerrit.wikimedia.org/r/719975

Change 720277 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Temporarily filter port 25 on mx2001 for reimage

https://gerrit.wikimedia.org/r/720277

Mentioned in SAL (#wikimedia-operations) [2021-09-13T14:44:28Z] <herron> drained mx2001 mail queue to mx1001 T286911

Change 720277 merged by Muehlenhoff:

[operations/homer/public@master] Temporarily filter port 25 on mx2001 for reimage

https://gerrit.wikimedia.org/r/720277

Mentioned in SAL (#wikimedia-operations) [2021-09-13T15:54:45Z] <moritzm> filtered mx2001 on the routers for reimage T286911

Change 720783 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Revert "Temporarily filter port 25 on mx2001 for reimage"

https://gerrit.wikimedia.org/r/720783

mx2001 is now filtered on the routers. In case there are any issues, this can be reverted by merging https://gerrit.wikimedia.org/r/720783 and running 'homer "cr*" merge' on cumin2002.

Next, mx2001 will be reimaged to bullseye tomorrow after the mw switchover.

Mentioned in SAL (#wikimedia-operations) [2021-09-14T17:45:41Z] <moritzm> reimaging mx2001 to bullseye T286911

Change 720783 merged by Muehlenhoff:

[operations/homer/public@master] Revert "Temporarily filter port 25 on mx2001 for reimage"

https://gerrit.wikimedia.org/r/720783

Mentioned in SAL (#wikimedia-operations) [2021-09-14T18:48:57Z] <moritzm> removed filter for tcp/25 on mx2001, reimage is complete T286911

Change 721289 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Prefer mx2001 over mx1001 for internal smarthosts

https://gerrit.wikimedia.org/r/721289

Change 721289 merged by Muehlenhoff:

[operations/puppet@production] Prefer mx2001 over mx1001 for internal smarthosts

https://gerrit.wikimedia.org/r/721289

Change 721554 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Switch a few domains over to mx2001

https://gerrit.wikimedia.org/r/721554

Change 721555 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Switch remaining MX records to mx2001

https://gerrit.wikimedia.org/r/721555

Change 721554 merged by Muehlenhoff:

[operations/dns@master] Switch a few domains over to mx2001

https://gerrit.wikimedia.org/r/721554

Status update: mx2001 is reimaged to Bullseye and working fine so far. The smarthost config on our servers has been switched to prefer mx2001 over mx1001, and the MX records of a handful of lesser-used domains now point to mx2001.
If there are no further issues, the remaining DNS records will be updated on Monday, and following that mx1001 will be reimaged some time in the middle of next week.

Change 721555 merged by Muehlenhoff:

[operations/dns@master] Switch remaining MX records to mx2001

https://gerrit.wikimedia.org/r/721555

Change 722551 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Temporarily filter port 25 on mx1001 for reimage

https://gerrit.wikimedia.org/r/722551

joanna_borun changed the task status from Open to In Progress. Sep 21 2021, 1:33 PM

Change 722630 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] profile::mail::mx: add +dkim_verbose to log_selector on bullseye

https://gerrit.wikimedia.org/r/722630

Change 722630 merged by Herron:

[operations/puppet@production] profile::mail::mx: add +dkim_verbose to log_selector on bullseye

https://gerrit.wikimedia.org/r/722630

Re: the above patch -- DKIM metrics dropped off on mx2001 beginning yesterday. Our Exim metrics are generated via mtail parsing of the Exim log, and Exim 4.90 introduced a new log_selector option for DKIM logging which is switched off by default. Since the DKIM log lines were disabled by default, the DKIM metrics dropped off as well.

After merging https://gerrit.wikimedia.org/r/722630 DKIM metrics are flowing again from the bullseye hosts, looking better!
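To show why the metrics disappeared together with the log lines, here is a rough Python sketch of log-derived counting. It is not our mtail program; the regular expression and the sample lines are assumptions about what a DKIM verification log line looks like, and the exact fields depend on the Exim version and log_selector settings.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed shape of a DKIM verification log line (illustrative only).
DKIM_RE = re.compile(r"DKIM: d=(?P<domain>\S+) .*\[(?P<result>[^\]]+)\]")


def count_dkim_results(lines: Iterable[str]) -> Counter:
    """Count DKIM results per outcome; lines without DKIM data are ignored."""
    counts: Counter = Counter()
    for line in lines:
        match = DKIM_RE.search(line)
        if match:
            counts[match.group("result")] += 1
    return counts


# Made-up sample lines: one DKIM line and one ordinary arrival line.
sample = [
    "2021-09-21 12:00:00 1mSabc-000001-AB DKIM: d=example.org s=sel "
    "c=relaxed/relaxed a=rsa-sha256 b=2048 [verification succeeded]",
    "2021-09-21 12:00:01 1mSabd-000002-CD <= user@example.org "
    "H=mail.example.org [192.0.2.1]",
]
dkim_counts = count_dkim_results(sample)
```

If the DKIM lines are never written, a counter like this simply stays empty, which matches what we saw on the dashboards until the log_selector change was merged.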

Change 722816 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] dhcp: Switch mx1001 to bullseye

https://gerrit.wikimedia.org/r/722816

Change 722816 merged by Muehlenhoff:

[operations/puppet@production] dhcp: Switch mx1001 to bullseye

https://gerrit.wikimedia.org/r/722816

Change 722551 merged by Muehlenhoff:

[operations/homer/public@master] Temporarily filter port 25 on mx1001 for reimage

https://gerrit.wikimedia.org/r/722551

Mentioned in SAL (#wikimedia-operations) [2021-09-22T13:26:48Z] <moritzm> mx1001 filtered on the routers for forthcoming reimage to bullseye T286911

Mentioned in SAL (#wikimedia-operations) [2021-09-22T13:39:22Z] <herron> flushed mx1001 mail queue to mx2001 T286911

Change 722870 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Prefer codfw wiki smarthost over eqiad one for mx1001 reimage

https://gerrit.wikimedia.org/r/722870

Change 722870 merged by Muehlenhoff:

[operations/puppet@production] Prefer codfw wiki smarthost over eqiad one for mx1001 reimage

https://gerrit.wikimedia.org/r/722870

Mentioned in SAL (#wikimedia-operations) [2021-09-22T15:02:44Z] <moritzm> re-installing mx1001 with bullseye T286911

Mentioned in SAL (#wikimedia-operations) [2021-09-22T15:52:09Z] <moritzm> removed the router filters on mx1001 due to an issue with the mx1001 reinstall T286911

Change 723072 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Temporarily filter port 25 on mx1001 for reimage

https://gerrit.wikimedia.org/r/723072

Change 723072 merged by Muehlenhoff:

[operations/homer/public@master] Temporarily filter port 25 on mx1001 for reimage

https://gerrit.wikimedia.org/r/723072

Mentioned in SAL (#wikimedia-operations) [2021-09-23T10:53:03Z] <moritzm> mx1001 filtered on the routers for forthcoming reimage to bullseye T286911

Mentioned in SAL (#wikimedia-operations) [2021-09-23T13:27:40Z] <moritzm> reimaging mx1001 to bullseye T286911

Mentioned in SAL (#wikimedia-operations) [2021-09-23T14:19:45Z] <moritzm> removed routers filter for mx1001, reimage to bullseye complete T286911

Both mx1001 and mx2001 are now running Bullseye. There's a little cleanup/followup work, but the core of the work is completed.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: mx1002.wikimedia.org

  • mx1002.wikimedia.org (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: mx2002.wikimedia.org

  • mx2002.wikimedia.org (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox

Change 723421 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove mx1002/mx2002

https://gerrit.wikimedia.org/r/723421

Change 723422 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] acmechief: Remove mx2002

https://gerrit.wikimedia.org/r/723422

Change 723421 merged by Muehlenhoff:

[operations/puppet@production] Remove mx1002/mx2002

https://gerrit.wikimedia.org/r/723421

Change 723433 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Revert "Prefer codfw wiki smarthost over eqiad one for mx1001 reimage"

https://gerrit.wikimedia.org/r/723433

Change 723434 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Revert "Prefer mx2001 over mx1001 for internal smarthosts"

https://gerrit.wikimedia.org/r/723434

Change 723473 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Configure a few domains with equal weights for mx1001/mx2001

https://gerrit.wikimedia.org/r/723473

Change 723482 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Configure remaining domains with equal weights for mx1001/mx2001

https://gerrit.wikimedia.org/r/723482

Change 723487 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] profile::mail::mx: Remove OS checks

https://gerrit.wikimedia.org/r/723487

The two VMs (mx1002/mx2002) which were used to test the Bullseye setup have been taken down.

Change 723487 merged by Muehlenhoff:

[operations/puppet@production] profile::mail::mx: Remove OS checks

https://gerrit.wikimedia.org/r/723487

Change 723434 merged by Muehlenhoff:

[operations/puppet@production] Revert "Prefer mx2001 over mx1001 for internal smarthosts"

https://gerrit.wikimedia.org/r/723434

Change 723473 merged by Muehlenhoff:

[operations/dns@master] Configure a few domains with equal weights for mx1001/mx2001

https://gerrit.wikimedia.org/r/723473

Change 723433 merged by Muehlenhoff:

[operations/puppet@production] Revert "Prefer codfw wiki smarthost over eqiad one for mx1001 reimage"

https://gerrit.wikimedia.org/r/723433

Change 723482 merged by Muehlenhoff:

[operations/dns@master] Configure remaining domains with equal weights for mx1001/mx2001

https://gerrit.wikimedia.org/r/723482

Mentioned in SAL (#wikimedia-operations) [2021-09-30T12:17:20Z] <moritzm> adapted MX records to point to both mx1001.wikimedia.org and mx2001.wikimedia.org with equal weights T286911

MoritzMuehlenhoff claimed this task.

mx1001/mx2001 have been reimaged to Bullseye (reusing the existing VMs/IPs to avoid potential IP reputation issues).

Along with the update the setup has also been adapted to better spread the load across both servers:

  • For outbound mail (wiki-mail records and smarthosts of our servers) servers in eqiad/esams use mx1001 and codfw/eqsin/ulsfo use mx2001
  • For inbound mail the weights of the MX records are now equal, which should make sending mail servers spread their deliveries across both hosts.
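For reference, the inbound behaviour relies on how senders treat MX records: RFC 5321 asks a sending server to try lower preference values first and to randomize among hosts that share the same preference. A minimal Python sketch of that selection follows; the preference value 10 and the function name delivery_order are assumptions for illustration, not taken from our DNS zone.

```python
import random
from typing import Dict, Iterable, List, Optional, Tuple

# (preference, host) pairs; the preference value 10 is an assumed example.
MX_RECORDS: List[Tuple[int, str]] = [
    (10, "mx1001.wikimedia.org"),
    (10, "mx2001.wikimedia.org"),
]


def delivery_order(
    records: Iterable[Tuple[int, str]],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Order MX hosts by ascending preference, shuffling hosts that tie."""
    if rng is None:
        rng = random.Random()
    by_pref: Dict[int, List[str]] = {}
    for pref, host in records:
        by_pref.setdefault(pref, []).append(host)
    ordered: List[str] = []
    for pref in sorted(by_pref):
        group = list(by_pref[pref])
        rng.shuffle(group)  # equal preference: pick a random order
        ordered.extend(group)
    return ordered
```

With equal preferences each host ends up first roughly half of the time across many senders, whereas with unequal preferences the host with the lower value is always tried first and the other only serves as a fallback.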

Change 723422 merged by Muehlenhoff:

[operations/puppet@production] acmechief: Remove mx2002

https://gerrit.wikimedia.org/r/723422

Change 801799 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] mx: enable tainted data checking

https://gerrit.wikimedia.org/r/801799

Change 801799 merged by JHathaway:

[operations/puppet@production] mx: enable tainted data checking

https://gerrit.wikimedia.org/r/801799