tools-mail: Migrate to Stretch
Closed, Resolved | Public | Apr 30 2019

Description

As part of the Toolforge rebuild/refactor related to the Trusty deprecation, we must understand/ensure we can rebuild the mail server.

  • the mail server is tools-mail.tools.eqiad.wmflabs
  • look for documentation on how that works
  • check the server to see if the docs match reality
  • check what is in puppet and what is not, improve if required
  • try rebuilding the host in stretch

Some docs in wikitech:

Puppet code:

  • modules/role/manifests/toollabs/mailrelay.pp role::toollabs::mailrelay
  • modules/toollabs/manifests/mailrelay.pp toollabs::mailrelay

Details

Due Date
Apr 30 2019, 7:00 AM
aborrero created this task. Nov 2 2018, 11:35 AM
Restricted Application added a subscriber: Aklapper. Nov 2 2018, 11:35 AM

aborrero updated the task description. Nov 2 2018, 11:54 AM

For the record, I was able to apply the current role::toollabs::mailrelay class to a test server, arturo-test.testlabs.eqiad.wmflabs, with this horizon/puppet config:

roles/profiles:
true role::toollabs::mailrelay

hiera config:
active_proxy_host: proxy.example.com
gridengine::gridmaster: grid.example.com
standard::has_default_mail_relay: false
toollabs::external_hostname: mail.example.com
toollabs::is_mail_relay: true

Running puppet agent shows several issues:

  • missing jobutils package in stretch (expected?)
  • missing NFS (ensure-grid-is-on-NFS)
  • motd banner: File[/etc/update-motd.d/50-infrastructure-banner]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/toollabs/40-testlabs-infrastructure-banner.sh

(this last one is probably because of the different CloudVPS project name)

So the work to migrate this puppet code to stretch should be relatively simple:

  • refactor code to the new naming scheme
  • use sonofgridengine instead of gridengine (this is a grid submit host, not sure why yet)
  • evaluate how/if we should use new cloudinfra MX servers for relay? no idea

Thanks to Chase, I discovered one can submit jobs to the grid via email: https://wikitech.wikimedia.org/wiki/Help:Toolforge#Processing_email_programatically, so that's why the mail server is a grid submit host.

I went looking to see how used this feature is:

$ ssh nfs-tools-project.svc.eqiad.wmnet
$ cd /srv/tools/shared/tools/project
$ grep jmail $(find . -maxdepth 2 -name '.forward*')
./tsreports-dev/.forward:|jmail /data/project/tsreports-dev/test.py
./scfc-test-can-be-deleted-anytime/.forward:|jmail tee -a /data/project/scfc-test-can-be-deleted-anytime/jmail.txt
./csbot/.forward.test:|jmail /data/project/csbot/mailtest
./csbot/.forward.test~:|jmail mailtest
grep: ./wikibugs/.forward.l: No such file or directory
./drtrigonbot/.forward.subster:|jmail cat >> ~/data/subster/mail_inbox
./bd808-test/.forward.jmail:|jmail tee -a /data/project/bd808-test/jmail.spool

That looks to me like an underused feature for the potential complexity of long term maintenance. Let's dig a bit further...

  • tsreports-dev: appends the message to a file. That file does not currently exist.
  • scfc-test-can-be-deleted-anytime: obvious name is obvious :)
  • csbot: appends a timestamp and the message to a file. That file contains 1 message timestamped Wed Apr 2 18:38:40 UTC 2014 that was an obvious test.
  • wikibugs: dangling symlink
  • drtrigonbot: appends message to a file. That file is 8.8M, contains 194 messages, and they mostly look like spam.
  • bd808-test: a test I made at some point to figure out if the docs on wikitech were really accurate.

Based on this, I think that we could kill off the grid submission feature with very little if any disruption to the Toolforge community.
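
For anyone who wants to repeat the survey above, here is a minimal Python sketch of the same find/grep pass. The function name find_jmail_forwards is made up for illustration, and the sketch assumes that tool home directories sit directly under the root it is given (as under /srv/tools/shared/tools/project); it is not part of any Toolforge package.

```python
from pathlib import Path


def find_jmail_forwards(project_root):
    """Return (relative path, line) pairs for .forward* files mentioning jmail."""
    hits = []
    root = Path(project_root)
    # Roughly `find . -maxdepth 2 -name '.forward*'`: only look directly
    # inside each tool's home directory (assumed layout, see above).
    for forward in sorted(root.glob("*/.forward*")):
        try:
            text = forward.read_text(errors="replace")
        except OSError:
            # Dangling symlinks (like the wikibugs one) and unreadable
            # entries are skipped instead of aborting the scan.
            continue
        for line in text.splitlines():
            if "jmail" in line:
                hits.append((str(forward.relative_to(root)), line.strip()))
    return hits
```

Calling find_jmail_forwards("/srv/tools/shared/tools/project") on the NFS server should list the same kind of entries as the grep output, minus the "No such file or directory" noise from broken symlinks.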

tsreports-dev is most likely a test I did -- I can't think of a reason why tsreports would need to do something with incoming emails. wikibugs is from a previous era where wikibugs reported events based on incoming emails... but that was in the bugzilla era.

I have BOLDly removed the documentation of the email to grid submission feature from wikitech: https://wikitech.wikimedia.org/w/index.php?title=Help:Toolforge&diff=1807729&oldid=1805357

Cool, this means we can probably just rebuild these hosts in stretch right away, since there is no longer a dependency on gridengine vs sonofgridengine.

I will start with the puppet refactoring.

Change 471730 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: refactor mail server

https://gerrit.wikimedia.org/r/471730

Change 471941 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[labs/toollabs@master] Remove jmail

https://gerrit.wikimedia.org/r/471941

Change 471942 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] tools - Remove unused jmail exim4 config

https://gerrit.wikimedia.org/r/471942

Change 471941 abandoned by GTirloni:
Remove jmail

Reason:
Favor T207968 instead

https://gerrit.wikimedia.org/r/471941

Change 471942 merged by GTirloni:
[operations/puppet@production] tools - Remove unused jmail exim4 config

https://gerrit.wikimedia.org/r/471942

This is what the incoming mail method using jmail looks like in the logs:

2018-11-08 12:29:43 1gKjRP-0007RI-Ih <= user@example.com H=out5-smtp.messagingengine.com [66.111.4.29] P=esmtp S=2745 id=1541680177.789142.1569976984.168DA801@webmail.messagingengine.com
2018-11-08 12:29:45 1gKjRP-0007RI-Ih ** |jmail test <gtirloni-sandbox.test@tools.wmflabs.org> R=tool_forward_general T=gridqueue: Child process of gridqueue transport returned 1 from command: /usr/bin/jmail

I looked for gridqueue in the logs and couldn't find anything, which I think confirms this isn't being used.
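
A rough way to make that check repeatable is to count delivery lines per transport. The sketch below is only an illustration: count_transports is a hypothetical helper, and it assumes the log lines carry the T=<transport> field in the same form as the two lines quoted above.

```python
import re
from collections import Counter

# Matches the "T=<transport>" field on exim delivery lines, e.g. T=gridqueue.
TRANSPORT_RE = re.compile(r"\bT=([A-Za-z0-9_]+)")


def count_transports(log_lines):
    """Count how many log lines mention each exim transport."""
    counts = Counter()
    for line in log_lines:
        match = TRANSPORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of an exim4 mainlog and looking at the "gridqueue" entry gives the number of jmail delivery attempts; a count of zero outside of deliberate tests supports the conclusion that the feature is unused.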

Bstorm added a subscriber: Bstorm. Nov 8 2018, 3:06 PM

Do we know if email submissions are tagged with the queue "mailq" by any of the scripts or settings that you are stripping out?

The jmail helper from our jobutils package does that part:

# Execute the given program synchronously on the grid.
rv = subprocess.call(['/usr/bin/qsub',
                      '-N', 'mail.' + pwd.getpwuid(os.getuid()).pw_name,
                      '-sync', 'y',
                      '-b', 'y',
                      '-m', 'n',
                      '-o', output.name,
                      '-j', 'y',
                      '-i', input.name,
                      '-q', 'mailq',
                      '-l', 'h_vmem=500M',
                      '-r', 'n', program] + sys.argv[2:],
                     stdout=subprocess.DEVNULL)

Change 471730 merged by GTirloni:
[operations/puppet@production] toolforge: refactor mail server

https://gerrit.wikimedia.org/r/471730

Created tools-mail-02.tools.eqiad.wmflabs and applied role::wmcs::toolforge::mailrelay successfully.

I tested it on tools-bastion-03 successfully. The only error I'm seeing in the exim4 logs is:

2018-11-12 21:09:44 Warning: No server certificate defined; will use a selfsigned one.
 Suggested action: either install a certificate or change tls_advertise_hosts option

I would like to solve this before pointing all of Toolforge to tools-mail-02, since otherwise it is going to flood the logs.

TODO: Move Puppet configuration in Horizon from per-host to a prefix. This can't be done while tools-mail and tools-mail-02 use different Puppet settings (toollabs vs toolforge role/profile).

Change 473175 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] Revert "Revert "toolforge: refactor mail server""

https://gerrit.wikimedia.org/r/473175

Change 473175 merged by GTirloni:
[operations/puppet@production] Revert "Revert "toolforge: refactor mail server""

https://gerrit.wikimedia.org/r/473175

Change 473208 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolforge: Add strict rules against spam

https://gerrit.wikimedia.org/r/473208

Change 473208 merged by GTirloni:
[operations/puppet@production] toolforge: Add strict rules against spam

https://gerrit.wikimedia.org/r/473208

GTirloni renamed this task from Toolforge: understand/ensure we can rebuild the mail server to tools-mail: Migrate to Stretch. Nov 13 2018, 3:28 PM

tools-mail-02 has a new floating IP (208.80.155.158) and the old one was disassociated from tools-mail (208.80.155.162).

mail.tools.wmflabs.org is pointing to 208.80.155.158

There was an issue automatically creating the reverse DNS entry for that floating IP and it was added manually through Horizon (wmflabsdotorg project). Probably related to T209375.

Hiera:Tools was updated with the new smarthost (tools-mail-02.tools.eqiad.wmflabs). I considered changing that to mail.tools.wmflabs.org but didn't want to introduce a routing change at this time (10.68.23.71 vs 208.80.155.158). Both worked in my tests though.

tools-mail-02 now has TLS enabled as well (T209347).

Test evidence:

mail to user:

2018-11-13 14:16:42 1gMZUg-0004BX-C0 DKIM: d=example.com s=fm3 c=relaxed/relaxed a=rsa-sha256 b=2048 [verification succeeded]
2018-11-13 14:16:42 1gMZUg-0004BX-C0 DKIM: d=messagingengine.com s=fm1 c=relaxed/relaxed a=rsa-sha256 b=2048 [verification succeeded]
2018-11-13 14:16:42 1gMZUg-0004BX-C0 <= gtirloni@example.com H=wout2-smtp.messagingengine.com [64.147.123.25] P=esmtps X=TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256 CV=no S=2838 id=1542118596.3440791.1575341984.7853A1B2@webmail.messagingengine.com
2018-11-13 14:16:43 1gMZUg-0004BX-C0 => gtirloni@wikimedia.org <gtirloni@tools.wmflabs.org> R=dnslookup T=remote_smtp H=mx1001.wikimedia.org [208.80.154.76] X=TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256 CV=yes C="250 OK id=1gMZUg-0007Du-Su"
2018-11-13 14:16:43 1gMZUg-0004BX-C0 Completed


mail to tool.anything:

2018-11-13 14:18:12 1gMZW8-0004CX-TI DKIM: d=example.com s=fm3 c=relaxed/relaxed a=rsa-sha256 b=2048 [verification succeeded]
2018-11-13 14:18:12 1gMZW8-0004CX-TI DKIM: d=messagingengine.com s=fm1 c=relaxed/relaxed a=rsa-sha256 b=2048 [verification succeeded]
2018-11-13 14:18:12 1gMZW8-0004CX-TI <= gtirloni@example.com H=wout2-smtp.messagingengine.com [64.147.123.25] P=esmtps X=TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256 CV=no S=2888 id=1542118690.3441089.1575343840.2217613E@webmail.messagingengine.com
2018-11-13 14:18:13 1gMZW8-0004CX-TI => gtirloni@wikimedia.org <gtirloni-sandbox.anything@tools.wmflabs.org> R=dnslookup T=remote_smtp H=mx1001.wikimedia.org [208.80.154.76] X=TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256 CV=yes C="250 OK id=1gMZW9-0007RU-4c"
2018-11-13 14:18:13 1gMZW8-0004CX-TI Completed


mail to tool (without suffix):

2018-11-13 14:19:36 H=wout2-smtp.messagingengine.com [64.147.123.25] X=TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256 CV=no F=<gtirloni@example.com> rejected RCPT <gtirloni-sandbox@tools.wmflabs.org>: Unrouteable address


mail from tools-bastion-03:

2018-11-13 14:44:36 1gMZvg-0004do-C8 <= root@tools.wmflabs.org H=tools-bastion-03.tools.eqiad.wmflabs [10.68.23.58] P=esmtps X=TLS1.2:RSA_AES_256_CBC_SHA1:256 CV=no S=809 id=E1gMZvg-0002qp-9m@tools-bastion-03.tools.eqiad.wmflabs
2018-11-13 14:44:38 1gMZvg-0004do-C8 => gtirloni@example.com R=dnslookup T=remote_smtp H=in1-smtp.messagingengine.com [66.111.4.71] X=TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128 CV=yes C="250 2.0.0 Queued as 86AAB0D2792"
2018-11-13 14:44:38 1gMZvg-0004do-C8 Completed
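
The evidence above can also be condensed per message. The following is a small sketch, not an official tool: summarize and the regular expression are assumptions based on the log lines pasted here (timestamp, then a message id shaped like 1gMZUg-0004BX-C0, then a flag such as <= or =>).

```python
import re
from collections import defaultdict

# Timestamp, message id, then the rest of the line. The id pattern is an
# assumption taken from the ids seen in this log (6-6-2 characters).
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"(?P<msgid>[0-9A-Za-z]{6}-[0-9A-Za-z]{6}-[0-9A-Za-z]{2}) "
    r"(?P<rest>.*)$"
)

# Exim main-log flags: arrival, delivery, failed delivery, deferral.
FLAGS = {"<=": "received", "=>": "delivered", "**": "failed", "==": "deferred"}


def summarize(log_lines):
    """Group main-log lines by message id and list what happened to each."""
    events = defaultdict(list)
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # e.g. SMTP-time rejections, which carry no message id
        rest = match.group("rest")
        flag = rest.split(" ", 1)[0]
        if flag in FLAGS:
            events[match.group("msgid")].append(FLAGS[flag])
        elif rest == "Completed":
            events[match.group("msgid")].append("completed")
    return dict(events)
```

A message that went through cleanly shows up as received, delivered, completed; the "Unrouteable address" rejection for the bare tool address never gets a message id, so it does not appear in the summary at all.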

So, shall we try shutting down the old tools-mail (trusty) and see what happens?

@aborrero I've just shut it down. Logs show it hasn't received any emails since 2018-11-13 18:48:46

Change 474305 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[labs/toollabs@master] Remove jmail helper script

https://gerrit.wikimedia.org/r/474305

Change 474311 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: purge jmail script

https://gerrit.wikimedia.org/r/474311

Change 474311 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: purge jmail script

https://gerrit.wikimedia.org/r/474311

Change 474305 merged by Arturo Borrero Gonzalez:
[labs/toollabs@master] Remove jmail helper script

https://gerrit.wikimedia.org/r/474305

Mentioned in SAL (#wikimedia-cloud) [2018-11-20T10:52:23Z] <arturo> T208579 distributing now misctools and jobutils 1.33 in all aptly repos

aborrero closed this task as Resolved. Wed, Nov 21, 5:19 PM