Beta Cluster mailer not sending emails
Open, Medium, Public

Description

I performed a password reset earlier today on Beta Cluster. The account has an active, verified email address set. However, the password reset email still hasn't arrived after a long wait. We should check whether the mailer is working as expected.

Event Timeline

@herron - not sure if related to T41785; also @aborrero due to being in wmflabs realm.

What outbound smtp server is currently being used by the beta cluster?

I think I set beta's wikimail stuff to go via deployment-mx02, will look into it

From deployment-mx02:/var/log/exim4/mainlog:
2018-12-19 19:45:50 H=deployment-mediawiki-07.deployment-prep.eqiad.wmflabs [172.16.4.119]:45122 I=[172.16.4.120]:25 F=<wiki-enwiki-1t-pk01kd-IfxAC4l3++aDeOz1@beta.wmflabs.org> rejected RCPT <krenair@gmail.com>: Relay not permitted

Let's fix root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+16-110) first

Well that's weird, dunno why puppet's autoupdater was stuck while rebasing seemed to work fine:

root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+16-110)# git pull --rebase origin production
From https://gerrit.wikimedia.org/r/p/operations/puppet
 * branch                  production -> FETCH_HEAD
First, rewinding head to replay your work on top of it...
Applying: [WIP] logstash: send errors to sentry
Applying: swift: lower replication interval for beta
Applying: prometheus: make ferm DNS record type configurable
Applying: Hack profile::base::firewall to prevent dupe definition
Applying: Add account for phabricator_files to swift::params::accounts
Applying: Scap: scap_source correct gid
Applying: swift: use implicit /dev/swift prefix for swift devices
Applying: Puppetise simple no-CA class for deployment-dumps-puppetmaster02
Applying: Attempt to secure Puppet DB better
Applying: [LOCAL HACK] tls certs for deployment-elastic*
Applying: Move declaration of diamond package out of diamond class
Applying: cumin: Allow Puppet DB backend to be used within Labs projects that use it
Applying: Beta: maintenance: no openldap management
Applying: Re-combine labs and production exim minimal config
Applying: varnish: move $all_networks to $trusted_networks
Applying: logstash: add new logging kafka consumer
root@deployment-puppetmaster03:/var/lib/git/operations/puppet(production u+16)#

(None of that seemed to have any effect on -mx02.)

So I think this broke during the eqiad1-r migration.
/etc/exim4/exim4.conf:
hostlist wikimedia_nets = <; 91.198.174.0/24 ; 208.80.152.0/22 ; 2620:0:860::/46 ; 198.35.26.0/23 ; 185.15.56.0/22 ; 2a02:ec80::/32 ; 2001:df2:e500::/48 ; 103.102.166.0/24 ; 10.0.0.0/8
hostlist relay_from_hosts = <; @[] ; 127.0.0.1 ; ::1 ; 91.198.174.0/24 ; 208.80.152.0/22 ; 2620:0:860::/46 ; 198.35.26.0/23 ; 185.15.56.0/22 ; 2a02:ec80::/32 ; 2001:df2:e500::/48 ; 103.102.166.0/24 ; 10.0.0.0/8
Contains 10/8 but not the new range.
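
The mismatch can be checked directly: the rejecting log line shows the client at 172.16.4.119, and that address falls outside every range in relay_from_hosts. Below is a minimal Python sketch using the standard ipaddress module. The list is copied from the config above; the helper name is_relay_allowed is hypothetical, and the "@[]" entry (the host's own interface addresses) is left out because it cannot be evaluated off-host.

```python
import ipaddress

# Ranges copied from relay_from_hosts in /etc/exim4/exim4.conf above.
# "@[]" (the host's own interface addresses) is omitted in this sketch.
RELAY_FROM_HOSTS = [
    "127.0.0.1", "::1",
    "91.198.174.0/24", "208.80.152.0/22", "2620:0:860::/46",
    "198.35.26.0/23", "185.15.56.0/22", "2a02:ec80::/32",
    "2001:df2:e500::/48", "103.102.166.0/24", "10.0.0.0/8",
]


def is_relay_allowed(client_ip, hostlist=RELAY_FROM_HOSTS):
    """Return True if client_ip falls inside any entry of the host list."""
    addr = ipaddress.ip_address(client_ip)
    for entry in hostlist:
        # Single addresses are treated as /32 (or /128) networks.
        net = ipaddress.ip_network(entry, strict=False)
        if addr.version == net.version and addr in net:
            return True
    return False


if __name__ == "__main__":
    # The client address taken from the "Relay not permitted" log line.
    print(is_relay_allowed("172.16.4.119"))
```

An address inside 10.0.0.0/8 passes the same check, which fits the theory that relaying broke once instances moved out of that range.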

hostlist wikimedia_nets = <; <%= scope.lookupvar('network::constants::all_networks').join(" ; ") %>
hostlist relay_from_hosts = <; @[] ; 127.0.0.1 ; ::1 ; <%= scope.lookupvar('network::constants::all_networks').join(" ; ") %>

$external_networks = $network_data['network::external']
$all_networks = flatten([$external_networks, '10.0.0.0/8'])

I think all_networks should add the new range if $realm == 'labs'.
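
The suggestion could look roughly like this. It is a Python illustration of the logic only, not the Puppet change itself. The function names and the labs_range default of 172.16.0.0/21 are assumptions made for the example: the thread does not name the new range, only that the client 172.16.4.119 sits outside the listed ones.

```python
def build_all_networks(external_networks, realm, labs_range="172.16.0.0/21"):
    """Sketch of all_networks with a realm-specific addition.

    labs_range is a placeholder value for this example, not a value
    taken from the thread.
    """
    networks = list(external_networks) + ["10.0.0.0/8"]
    if realm == "labs":
        networks.append(labs_range)
    return networks


def render_relay_from_hosts(networks):
    """Mirror the template: fixed local entries, then networks joined by ' ; '."""
    prefix = "hostlist relay_from_hosts = <; @[] ; 127.0.0.1 ; ::1 ; "
    return prefix + " ; ".join(networks)


if __name__ == "__main__":
    print(render_relay_from_hosts(build_all_networks(["91.198.174.0/24"], "labs")))
```

Keeping the extra range behind the realm check means the production hostlist stays as it is today.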

MarcoAurelio renamed this task from "Beta Cluster mailer not sending emails apparently" to "Beta Cluster mailer not sending emails". Dec 20 2018, 9:15 PM

At https://horizon.wikimedia.org/project/instances/41fe8dce-d0bb-424d-9c9a-9dec6dc68362/ (deployment-mx02) I've found:

Applied: True
Name: role::mail::mx
Params:
  verp_bounce_post_url: 'api-rw.discovery.wmnet/w/api.php'
  verp_domains: [ 'wikimedia.org' ]
  prometheus_nodes: hiera('prometheus_nodes', [])
  verp_post_connect_server: 'meta.wikimedia.org'

Not sure if that's related to this, but verp_post_connect_server: 'meta.wikimedia.org' doesn't look right?

I don't see what that has to do with it

Change 481215 had a related patch set uploaded (by Alex Monk; owner: Bstorm):
[operations/puppet@production] toolforge: add the new cloud region to all_networks

https://gerrit.wikimedia.org/r/481215

Just wanted to chime in and say that the Growth team (@Catrope) would benefit from this fix, since we're trying to test our Help Panel feature in beta cluster (which involves sending and confirming emails).

How about configuring the beta cluster to relay via the cloud/labs smarthosts mx-out0[12].wmflabs.org ?

Change 475714 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Introduce $aggregate_networks, deprecate $all_networks

https://gerrit.wikimedia.org/r/475714

Change 481215 abandoned by Bstorm:
network: Add the new cloud region to all_networks

Reason:
This one is a non-starter in the middle of other refactors

https://gerrit.wikimedia.org/r/481215

Change 475714 merged by Alexandros Kosiaris:
[operations/puppet@production] Introduce $aggregate_networks, deprecate $all_networks

https://gerrit.wikimedia.org/r/475714

Is the patch above solving this incident or do we need further changes so that Beta can send mail again? Thanks.

Re-checked: Beta Cluster (betalabs) is still not sending emails.

The purpose of Beta (testing software) is being affected.

Krenair lowered the priority of this task from High to Medium. Mar 1 2019, 10:09 AM

Email from beta works, it just mishandles @wikimedia.org addresses. Dropping priority

2019-03-01 10:08:37 1gzf5p-0004SQ-Bm <= wiki-enwiki-1t-pnomud-dWAVN5KAegIR6rAh@beta.wmflabs.org H=deployment-mediawiki-07.deployment-prep.eqiad.wmflabs [172.16.4.119]:39624 I=[172.16.4.120]:25 P=esmtp S=1575 id=enwiki.5c7904a500d375.37089512@en.wikipedia.beta.wmflabs.org
2019-03-01 10:08:37 1gzf5p-0004SQ-Bm => krenair@gmail.com R=dnslookup T=remote_smtp_signed S=2279 H=gmail-smtp-in.l.google.com [173.194.68.27] I=[172.16.4.120] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK  1551434917 y28si461207qvf.34 - gsmtp" DT=0s
2019-03-01 10:08:37 1gzf5p-0004SQ-Bm Completed

vs.:

2019-03-01 10:05:46 H=deployment-mediawiki-07.deployment-prep.eqiad.wmflabs [172.16.4.119]:35632 I=[172.16.4.120]:25 F=<wiki-enwiki-50n-pnnvxh-ujky0FUwA22N6N9J@beta.wmflabs.org> temporarily rejected RCPT <etonkovidova@wikimedia.org>: failed to bind the LDAP connection to server ldap-corp.codfw.wikimedia.org:389 - ldap_bind() returned -1

Basically our MX is trying to do the special routing for @wikimedia.org addresses (e.g., looking up against the mirror of the foundation's corp LDAP system to see if the user has a google inbox) that should only be done by a prod MX. Our one should just be sending it on to prod.
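
To tell the two failure modes apart in mainlog (the earlier permanent "Relay not permitted" versus this temporary LDAP failure), the rejection lines can be parsed. This is a rough Python sketch, assuming the line layout seen in the excerpts quoted in this task; the regex and the parse_rejection helper are illustrative and not part of exim.

```python
import re

# Matches the rejection lines quoted in this task:
#   ... F=<sender> rejected RCPT <rcpt>: reason
#   ... F=<sender> temporarily rejected RCPT <rcpt>: reason
REJECT_RE = re.compile(
    r"F=<(?P<sender>[^>]*)> "
    r"(?P<kind>temporarily rejected|rejected) RCPT <(?P<rcpt>[^>]+)>: "
    r"(?P<reason>.*)$"
)


def parse_rejection(line):
    """Return recipient, temporary flag, and reason, or None if not a rejection."""
    match = REJECT_RE.search(line)
    if match is None:
        return None
    return {
        "recipient": match.group("rcpt"),
        "temporary": match.group("kind") == "temporarily rejected",
        "reason": match.group("reason"),
    }
```

Grouping the results by recipient domain would show whether only @wikimedia.org recipients are affected.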

Makes sense, and this describes in a nutshell the behavior of the labs smarthosts (just sending mail on to prod)

How about configuring the beta cluster to relay via the cloud/labs smarthosts mx-out0[12].wmflabs.org ?

Bumping this question

I feel like there was some reason there is a separate MX host in beta but don't remember it right now.

Email from beta works, it just mishandles @wikimedia.org addresses. Dropping priority

Thanks! It does work for non-@wikimedia.org addresses. It's really great to have it working just in time to check GrowthExperiments on improving user emailability.