
Upgrade mx1001/mx2001 to stretch
Closed, Resolved · Public

Description

We need to do this at some point anyway, but in this case it may make a difference w.r.t. antispam and crypto, e.g. the ciphers supported by GnuTLS and the OCSP stapling bug 90dbb023366cc761073f1b15edb37ccc33fd49f9.
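For reference, OCSP stapling and the negotiated cipher can be spot-checked from the outside with something like the following (illustrative only; any host with openssl will do):

openssl s_client -connect mx1001.wikimedia.org:25 -starttls smtp -status </dev/null 2>/dev/null | grep -iE 'ocsp|cipher'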

Event Timeline

Is the established upgrade process a rebuild or dist-upgrade?

In either case I'm thinking we should pull each server from the MX record in DNS (one at a time of course) while we upgrade and validate to avoid unexpected issues with production mail handling.
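As a sanity check, the MX hosts currently published in DNS can be confirmed before and after each change with a quick query, e.g.:

dig +short MX wikimedia.org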

Is the established upgrade process a rebuild or dist-upgrade?

A dist-upgrade is generally fine, but a reimage has some benefits: there are, for example, a few settings which only get enabled on new installations but are kept as-is during upgrades, so a reimage is usually cleaner and ensures all the changes are picked up. It's best decided on a case-by-case basis.

Ok, reimage sounds good to me. It would also be a good opportunity for some hands-on time with server builds.

Change 378936 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] Remove mx2001 MX record from dns for OS upgrade

https://gerrit.wikimedia.org/r/378936

Change 378936 merged by Herron:
[operations/dns@master] Remove mx2001 MX record from dns for OS upgrade

https://gerrit.wikimedia.org/r/378936

Looking more closely at how to pull mx2001 out of service for an OS reload, it is more complicated than I originally thought. We have ~100 DNS zones referencing the FQDN of mx2001 in MX and SPF records, making it difficult to confidently pull an MX from service via DNS. It seems we have a few different ways to proceed with upgrading:

  1. Provision an mx2001 replacement, say mx2002, test it, and then cut the public IPs of mx2001 over to mx2002. Potentially rename it back to mx2001 as well.
  2. At the network layer, temporarily reject inbound mail traffic to mx2001 for the duration of the OS reload and testing. Reload in place.
  3. Defer OS upgrades to the mail infrastructure refresh T175362 and roll in enhancements like:
    1. New stretch dedicated inbound mail exchangers
    2. New stretch dedicated outbound transactional mail exchangers
    3. Introduction of an LVS mail frontend to centrally (de)pool servers and disconnect hostnames from service names

IMHO, with any approach it's good to keep our existing MX IPs for outbound mail service, to retain IP reputation.

The desired process would apply to mx1001 as well later on; I'm considering mx2001 here as the first node for rolling maintenance.
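For a rough sense of the scope, assuming a checkout of the operations/dns repo with the zone files under templates/, something like this counts the zones that mention mx2001:

grep -rl 'mx2001.wikimedia.org' templates/ | wc -l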

@Volans and I connected about this yesterday. @faidon @akosiaris what do you think?

  1. Provision an mx2001 replacement, say mx2002, test it, and then cut the public IPs of mx2001 over to mx2002. Potentially rename it back to mx2001 as well.

Given that mx* are Ganeti virtual machines and not physical hardware, that sounds like a good idea. That also allows a simple switchback in case of any problems with stretch.

Looking more closely at how to pull mx2001 out of service for an OS reload, it is more complicated than I originally thought. We have ~100 DNS zones referencing the FQDN of mx2001 in MX and SPF records, making it difficult to confidently pull an MX from service via DNS. It seems we have a few different ways to proceed with upgrading:

  1. Provision an mx2001 replacement, say mx2002, test it, and then cut the public IPs of mx2001 over to mx2002. Potentially rename it back to mx2001 as well.

That. Add a new mx2002 host. No need to rename it back to mx2001; we don't do that for other components of the fleet anyway. We can renumber it to keep the IP if we want, though.

  1. At the network layer, temporarily reject inbound mail traffic to mx2001 for the duration of the OS reload and testing. Reload in place.
  2. Defer OS upgrades to the mail infrastructure refresh T175362 and roll in enhancements like:
    1. New stretch dedicated inbound mail exchangers
    2. New stretch dedicated outbound transactional mail exchangers
    3. Introduction of an LVS mail frontend to centrally (de)pool servers and disconnect hostnames from service names

I don't see any reason to couple these with the upgrade. I'd rather we did not; that allows us to pursue the time schedule we want for any of the above separately.

IMHO, with any approach it's good to keep our existing MX IPs for outbound mail service, to retain IP reputation.

Do we indeed? Do we know if we have a Good reputation? In my (admittedly old) experience, reputation is usually either Neutral or Bad. Neutral is easy to obtain: you are just a new, previously unused IP (we have those) sending out email. Bad needs no explanation, but what counts as Good, and how many databases out there report that host as Good?

Do we indeed? Do we know if we have a Good reputation? In my (admittedly old) experience, reputation is usually either Neutral or Bad. Neutral is easy to obtain: you are just a new, previously unused IP (we have those) sending out email. Bad needs no explanation, but what counts as Good, and how many databases out there report that host as Good?

I'd expect to see ISPs throttling our mail more aggressively if we switched sender IPs without a warm-up period. And even little things like re-training remote greylists can have noticeable effects for users (delays). It's nothing insurmountable, but I would prefer to have a strong reason to switch the MX IPs before doing it. And keeping the same hostname/IP simplifies our cutover process.

After more thought I'm leaning towards option 2 (reload in place). Building out a new VM and later assuming the IP of, and/or even renaming to, mx2001 after the fact is going to be hairy because of puppet, reverse DNS, Let's Encrypt certs, etc. @ayounsi would it be difficult to temporarily reject packets to, for example, mx2001.wikimedia.org:25/tcp with a network firewall (as a poor man's depool) while the system is rebuilt?

In any event, I'll provision a stretch MX in openstack and see how that works out.

@ayounsi would it be difficult to temporarily reject packets to, for example, mx2001.wikimedia.org:25/tcp with a network firewall (as a poor man's depool) while the system is rebuilt?

Not difficult.

role::mail::mx applies on Stretch for the most part, but not without some issues.

First off, AFAICT role::mail::mx cannot be applied using Horizon, because node default contains require ::role::labs::instance when not in the production realm, which causes Duplicate declaration: Class[Exim4] is already declared in file /etc/puppet/modules/standard/manifests/mail/sender.pp:2; cannot redeclare at /etc/puppet/modules/role/manifests/mail/mx.pp:59. I worked around this by modifying site.pp on a self-hosted puppetmaster.

Then, there are a few issues relating to missing connectivity from this labs test instance to prod dependencies like LDAP, MySQL, etc. Also, Let's Encrypt fails to obtain a certificate because the requested certificate has a nonstandard TLD (wmflabs). But these are to be expected and shouldn't occur when running in production.

mtail, however, fails to start with the current config under Stretch, printing the error:

Apr 17 16:45:48 mx-keith-stretch1 mtail[3676]: F0417 16:45:48.512115    3676 main.go:68] couldn't start: Compile encountered errors:
Apr 17 16:45:48 mx-keith-stretch1 mtail[3676]: compile failed for exim.mtail:
Apr 17 16:45:48 mx-keith-stretch1 mtail[3676]: exim.mtail:35:5-18: Type mismatch between lhs (%!s(vm.Type=3)) and rhs (%!s(vm.Type=2)) for op %!s(int=57379)
Apr 17 16:45:48 mx-keith-stretch1 mtail[3676]: exim.mtail:43:5-18: Type mismatch between lhs (%!s(vm.Type=3)) and rhs (%!s(vm.Type=2)) for op %!s(int=57379)
Apr 17 16:45:48 mx-keith-stretch1 mtail[3676]: exim.mtail:51:5-18: Type mismatch between lhs (%!s(vm.Type=3)) and rhs (%!s(vm.Type=2)) for op %!s(int=57379)
Apr 17 16:45:48 mx-keith-stretch1 mtail[3676]: exim.mtail:55:5-18: Type mismatch between lhs (%!s(vm.Type=3)) and rhs (%!s(vm.Type=2)) for op %!s(int=57379)
Apr 17 16:45:48 mx-keith-stretch1 mtail[3676]: exim.mtail:59:5-18: Type mismatch between lhs (%!s(vm.Type=3)) and rhs (%!s(vm.Type=2)) for op %!s(int=57379)
Apr 17 16:45:48 mx-keith-stretch1 mtail[3676]: exim.mtail:67:5-44: Type mismatch between lhs (%!s(vm.Type=3)) and rhs (%!s(vm.Type=2)) for op %!s(int=57379)

I see the mtail versions between Jessie and Stretch are different, with 3.0.0~rc5-1 available on Jessie (3.0.0~rc4-1~bpo8+1 is currently installed on the prod MXes) and 0.0+git20161231.ae129e9-1+b2 currently on Stretch.

mtail 3.0.0~rc5-1~bpo9+1 is available in stretch-backports, so I'll give that a try to see if it solves the issue.
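Assuming stretch-backports is already configured in sources.list on the test instance, trying it is just:

apt-get update
apt-get install -t stretch-backports mtail
systemctl restart mtail && systemctl status mtail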

Indeed, after upgrading to 3.0.0~rc5-1~bpo9+1 mtail starts up happily.

@fgiunchedi do you think it would be safe to pin the mtail package to stretch-backports for all stretch hosts, or should a special case be made for mx?

Indeed, after upgrading to 3.0.0~rc5-1~bpo9+1 mtail starts up happily.

@fgiunchedi do you think it would be safe to pin the mtail package to stretch-backports for all stretch hosts, or should a special case be made for mx?

I think we're ok to pin mtail to backports; AFAIK it is all backwards compatible and only bugfixes are introduced.
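For illustration, a plain apt pin for this would look roughly like the following (the real change will go through puppet; the file path and priority here are just examples):

cat <<'EOF' > /etc/apt/preferences.d/mtail
Explanation: prefer mtail from stretch-backports (illustrative pin)
Package: mtail
Pin: release a=stretch-backports
Pin-Priority: 1001
EOF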

Change 427681 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] mtail: pin package to stretch-backports on stretch hosts

https://gerrit.wikimedia.org/r/427681

Change 427681 merged by Herron:
[operations/puppet@production] mtail: pin package to stretch-backports on stretch hosts

https://gerrit.wikimedia.org/r/427681

Change 427710 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] install_server: reinstall mx2001 with stretch

https://gerrit.wikimedia.org/r/427710

Ok, test mx instance is looking good. Will plan to depool and reinstall mx2001 with Stretch next week.

@ayounsi could we coordinate a time to reject connections to mx2001.wikimedia.org on 25/tcp (ipv4 and ipv6) early next week? No strong preference on exactly when, whatever would work best for you.

@herron Monday 10:30am PDT? (5:30pm UTC)
How long will the block be installed for?

[edit firewall family inet filter border-in4]
       term nrpe { ... }
+      /* T175361 */
+      term tmp-mx2001-block {
+          from {
+              destination-address {
+                  208.80.153.45/32;
+              }
+              protocol tcp;
+              destination-port 25;
+          }
+          then {
+              reject;
+          }
+      }
       term default { ... }


[edit firewall family inet6 filter border-in6]
       term nrpe { ... }
+      /* T175361 */
+      term tmp-mx2001-block {
+          from {
+              destination-address {
+                  2620:0:860:2:208:80:153:45/128;
+              }
+              next-header tcp;
+              destination-port 25;
+          }
+          then {
+              reject;
+          }
+      }
       term default { ... }

@herron Monday 10:30am PDT? (5:30pm UTC)
How long will the block be installed for?

Sounds good! Barring any unexpected issues, ~24h.

Mentioned in SAL (#wikimedia-operations) [2018-04-23T17:24:57Z] <XioNoX> pushing firewall block on cr1/2-codfw - T175361

Mentioned in SAL (#wikimedia-operations) [2018-04-23T17:35:44Z] <XioNoX> pushing firewall block on cr1-eqdfw - T175361

Mail log activity has stopped on mx2001 and I'm unable to connect to mx2001:25 from a VPS outside Wikimedia (with confirmed working outbound tcp/25 connectivity). Looks good!

# nc -w 5 -vz mx2001.wikimedia.org 25
mx2001.wikimedia.org [208.80.153.45] 25 (smtp) : Connection timed out

The messages that were in the deferred queue on mx2001 have been manually relayed over to mx1001. The queue is now empty on mx2001 and I'll get started on the rebuild to Stretch shortly.
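A sketch of the kind of commands involved on mx2001 (the exact relay mechanism, e.g. a temporary manualroute pointing at mx1001, isn't captured here):

mailq                  # list what is still queued
exiqgrep -i | wc -l    # count queued message IDs
exim4 -qff             # force a delivery attempt for everything, including frozen messages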

Change 427710 merged by Herron:
[operations/puppet@production] install_server: reinstall mx2001 with stretch

https://gerrit.wikimedia.org/r/427710

mx2001 has been reinstalled with Stretch, services are configured and test mail messages flow successfully. So far so good.
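The smoke test boils down to pushing a message through the rebuilt host; something like swaks makes that easy (addresses below are placeholders; the recipient should be in a domain these MXes actually handle):

swaks --server mx2001.wikimedia.org --port 25 --tls --from noreply@wikimedia.org --to someone@wikimedia.org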

Will plan to re-pool in the morning (PDT) tomorrow and monitor closely during the day.

Mentioned in SAL (#wikimedia-operations) [2018-04-24T17:35:08Z] <XioNoX> removing firewall block on cr1-eqdfw - T175361

Mentioned in SAL (#wikimedia-operations) [2018-04-24T17:35:59Z] <XioNoX> removing firewall block on cr1/2-codfw - T175361

mx2001 has been repooled (thanks @ayounsi!). Will monitor closely for the rest of the day.

Change 429241 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] install_server: reinstall mx1001 with stretch

https://gerrit.wikimedia.org/r/429241

Change 429344 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] WIP: icinga-sms: use localhost as smtp server

https://gerrit.wikimedia.org/r/429344

mx2001 has been running Stretch for a few days and has been stable. I think we're in good shape to move on to mx1001. However, there are a few configs with mx1001 hardcoded as the smtp server in the puppet repo. I'll work on removing those to simplify the depool process before rebuilding mx1001.
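Finding those is mostly a matter of grepping the puppet repo, e.g. from a checkout of operations/puppet:

git grep -n 'mx1001.wikimedia.org'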

Change 429456 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] standard::mail::sender: run a smtp daemon on localhost:25

https://gerrit.wikimedia.org/r/429456

Change 429457 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] profile::kafka::burrow: use localhost as smtp server

https://gerrit.wikimedia.org/r/429457

Would love some feedback on the patches above. In particular, are there any reservations about using standard::mail::sender to configure an exim listener on localhost:25?
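Once applied, the listener is easy to verify on any host, e.g. (the test recipient is illustrative):

ss -ltnp | grep ':25'
swaks --server 127.0.0.1 --port 25 --to root@localhost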

mx2001 has been running Stretch for a few days and has been stable. I think we're in good shape to move on to mx1001. However, there are a few configs with mx1001 hardcoded as the smtp server in the puppet repo. I'll work on removing those to simplify the depool process before rebuilding mx1001.

Has there been any progress on this?

A +1 (or other feedback) on https://gerrit.wikimedia.org/r/#/c/429456/ would be a huge help to keep this moving. I'm hesitant to self merge that fleet-wide change.

Crickets! Ok, I'll plan on merging the localhost smtp listener part tomorrow and get to work depooling mx1001 for reimage next week.

Mentioned in SAL (#wikimedia-operations) [2018-05-31T15:46:40Z] <herron> enabling localhost:25 exim smtp listeners in production realm T175361

Change 429456 merged by Herron:
[operations/puppet@production] standard::mail::sender: run a smtp daemon on localhost:25

https://gerrit.wikimedia.org/r/429456

Change 429344 merged by Herron:
[operations/puppet@production] icinga-sms: use localhost as smtp server

https://gerrit.wikimedia.org/r/429344

Change 436626 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] exim minimal: allow from local host interface addresses in rcpt acl

https://gerrit.wikimedia.org/r/436626

Change 436626 merged by Herron:
[operations/puppet@production] exim minimal: allow from local host interface addresses in rcpt acl

https://gerrit.wikimedia.org/r/436626

Change 429457 merged by Herron:
[operations/puppet@production] profile::kafka::burrow: use localhost as smtp server

https://gerrit.wikimedia.org/r/429457

Ok, we should be in good shape to depool mx1001, relay any deferred messages to mx2001, and rebuild mx1001 with stretch.

@ayounsi is there a morning (PDT) that works for you this week or next to depool mx1001 by firewall reject?

I'm currently in Europe, so if you're on the east coast, ping me anytime (east coast hours) this week and I can do it.

Planning to proceed with the firewall update and reinstall to Stretch starting at 10a Eastern tomorrow (coordinated over IRC)

In preparation for that, Exim on mx1001 has been stopped to allow for quick revert if needed. Mail is currently flowing through mx2001 and looking good so far.

So this backfired, but thankfully the fix was as simple as starting exim :) Good thinking @herron!

We've heard of and noticed at least two breakages:

  1. Phabricator. It seems to handle its own email via SMTP, using a library called PHPMailer. It's configured for both mailservers, so that needs further investigation.
  2. Gerrit -- seems configured with smtpServer = <%= @mail_smarthost[0] %> :( so that's explained right there.

So the above two need to be fixed (probably separately), plus… a git grep mail_smarthost\[ reveals a few more of these unfortunately, so more fixes like those are needed across the board.

Related change to your standard::mail::sender changes above: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439774/

Also, Let's Encrypt fails to obtain a certificate because the requested certificate has a nonstandard TLD (wmflabs).

We partially took care of that in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/435814/
Also https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439451/ is open for review now because I missed a rather important bit.

The issues that were blocking this have been resolved (and added here as subtasks for reference)

Moving forward once again with the mx1001 stretch upgrade

To be more specific: first depooling mx1001 by stopping Exim, then manually relaying any queued/deferred messages on mx1001 to mx2001, then reinstalling with Stretch.
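In shell terms the depool step is roughly the following (the reinstall itself then goes through the usual reimage tooling):

systemctl stop exim4    # stop accepting new mail on mx1001
mailq                   # confirm nothing is left queued before the reimage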

Mentioned in SAL (#wikimedia-operations) [2018-09-13T13:59:23Z] <herron> depool mx1001 and relay queued messages to mx2001 for upgrade to stretch T175361

Change 429241 merged by Herron:
[operations/puppet@production] install_server: reinstall mx1001 with stretch

https://gerrit.wikimedia.org/r/429241

Mentioned in SAL (#wikimedia-operations) [2018-09-13T15:15:30Z] <herron> repool mx1001 — upgrade to stretch complete T175361

mx1001 has been stable for 24 hours.

In Grafana, deferrals on mx1001 do appear to be trending upwards (https://grafana.wikimedia.org/dashboard/db/mail?refresh=5m&orgId=1&from=1536855645814&to=1536942045814&var-datasource=eqiad%20prometheus%2Fops&panelId=27&fullscreen), but AFAICT this is an unrelated case where an external domain with a user receiving mail from Gerrit happens to be down by coincidence (double-checked: this mail system is unreachable from a non-WMF system as well).
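For reference, a quick way to see the deferral lines and re-probe the remote MX (the domain below is a placeholder):

grep ' == ' /var/log/exim4/mainlog | tail    # '==' marks deferred deliveries in the exim main log
nc -w 5 -vz mail.example.org 25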

I think we're in good shape to resolve!