
Toolforge outbound root email in eqiad1
Closed, Resolved (Public)

Description

Before any migration to the eqiad1 deployment, the outbound email path of Toolforge seems to be (for @wikimedia.org recipients):

  1. a toolforge server (10.0.0.x) sends the email, using relay tools-mail-02.eqiad.wmflabs
  2. the tools relay server (10.0.0.x) receives the email and forwards it to mx1001.wikimedia.org, i.e., the prod relays
  3. the prod (public addr) relay handles the email correctly (follow up steps are not of interest right now)

The problem after we move Toolforge to eqiad1 is the addressing change (from 10.0.x.x to 172.16.x.x):

  1. a toolforge server (172.16.x.x) sends the email, using relay tools-mail-02.eqiad.wmflabs
  2. the tools relay server (172.16.x.x) receives the email and tries to forward it to mx1001.wikimedia.org, i.e., the prod relays
  3. the prod relays don't allow this new addressing.

note: this description may be only valid for root email. I'm not sure what the policies are for non-root outbound emails.
note: non-wikimedia recipients will get emails delivered directly by tools-mail-02 without going through mx*.wikimedia.org

There are 2 main approaches to handle this situation:

  • allow 172.16.x.x in prod relays
  • use the intermediate cloud smarthosts mx-out0[12].wmflabs.org, to do something like:

toolforge server -> tools-mail-02 -> mx-out01.wmflabs.org -> mx1001.wikimedia.org

Any solution taken may solve T212327: Beta Cluster mailer not sending emails as well.

Event Timeline

Restricted Application added a subscriber: Aklapper. Jan 10 2019, 12:29 PM
aborrero triaged this task as High priority. Jan 10 2019, 12:30 PM
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
aborrero added subscribers: bd808, Bstorm, GTirloni and 4 others.
GTirloni updated the task description. Jan 10 2019, 12:35 PM

That does not sound correct. There is no way a 10.0.0.x IP is allowed to reach out from labs to production, due to network ACLs. If anything, tools-mail-02.eqiad.wmflabs is using some NAT outbound IP (from the 208.80.155.128/25 IP space).

And hence "allowing" 172.16.x.x in prod relays would not work either.

No, I stand corrected: this is the edge case where WMCS VMs reaching out to production hosts with public IPs do not get NATed.

GTirloni added a comment. Edited Jan 10 2019, 1:02 PM

It seems we may not have to whitelist a future Toolforge MX sitting in the 172.16.0.0/21 network if we comply with email best practices (so as not to trigger SpamAssassin rage).
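As a quick sanity check on the addressing discussed here, the following sketch classifies a few of the addresses that appear in this task against the networks named in it. The labels in `NETWORKS` and the `classify` helper are made up for illustration; only the prefixes and addresses come from the task.

```python
import ipaddress

# Prefixes mentioned in this task. The labels are illustrative, not official names.
NETWORKS = {
    "eqiad1 instances": ipaddress.ip_network("172.16.0.0/21"),
    "NAT outbound pool": ipaddress.ip_network("208.80.155.128/25"),
}


def classify(addr: str) -> str:
    """Return the label of the first listed network containing addr, else 'other'."""
    ip = ipaddress.ip_address(addr)
    for label, net in NETWORKS.items():
        if ip in net:
            return label
    return "other"


if __name__ == "__main__":
    for addr in ("172.16.4.197", "172.16.1.239", "10.68.23.71", "208.80.154.76"):
        print(addr, "->", classify(addr))
```

Note that mx1001's own address (208.80.154.76) falls outside the NAT pool, so it is reported as "other" here.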

Here is a sample email that got classified as spam by SpamAssassin when sent from tools-sgegrid-master (172.16.4.197):

Delivered-To: gtirloni@wikimedia.org
Received: by 2002:a05:6000:104b:0:0:0:0 with SMTP id c11csp1766158wrx;
        Thu, 10 Jan 2019 04:50:30 -0800 (PST)
X-Google-Smtp-Source: ALg8bN75cA+NAbB/kSeu3DOWh7lXCv03EyWCK9hdtHUCgXhf16yRs961u5hamcTn9rG90neOWmMC
X-Received: by 2002:ac8:320a:: with SMTP id x10mr9412559qta.275.1547124630643;
        Thu, 10 Jan 2019 04:50:30 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1547124630; cv=none;
        d=google.com; s=arc-20160816;
        b=E6f67ujvpU4KzZhmvVjG9ZI978KZBQVG2KcfTmGwf9/DSCPJ2IJ2puBdQ/IpYIUfLs
         UEhYnTTBW0J3l055NKPPAbTbrdvhWusMg58DUpX4YOPi9IU1XDxNAS8CwWSTjbgO/S6C
         cEVEO9PQAaJcZxjzK+kqsla9FqIP7QShGfBVx8A7xfr0DLN/EHZuEvP07AyBwoRg3f6c
         V9PEWTBYTjkVlxq67xErGxa8zytDkmP0IwtHNg+U8yD9+CzRb2ZNWVG1R+AmimFay47/
         HLL6b8SOrhFNL7Gx3MWcdtpdI3p79Gwe/ahROMfJdBn8xQvfi0AqZF3S117WI3HK++oN
         wDKw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
        h=message-id:date;
        bh=c8KDIzV2wXOYSGDL7dT12WRpBrhR19CM+S1oHm5jRoQ=;
        b=f1KtYhgEye05D4EijFiwmandHxNDgEaS4Aeby7KuYRiSBtDFPRIOYeDWZmEPFw6+8M
         NvMD5Vgnid1mMM9+uoQp27iP24DsO8kH6VAMxdfWB31OXjDEKB9yEg+zHDEzE57YfRTE
         uJIeNuXlsEKWlUvxRSth1NbJP0196TFQ+dZwp/xT2x4AitshXRzQm+AJUEFcbxr/iT5C
         kFkkuuLB/10IClCplJvc/gHoyGq4VvO/pFvJ14s2bgqL4L6a0AtoTb0jKehqT/o/3dr0
         IJyx/zFn3UBozb2hpBGCJz+bvygoDMDlrxJyyPoRo1i/f8UjIGCOrnHsVZBlOtyU3cO5
         5cKw==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address) smtp.mailfrom=gtirloni@tools.wmflabs.org
Return-Path: <gtirloni@tools.wmflabs.org>
Received: from mx1001.wikimedia.org (mx1001.wikimedia.org. [208.80.154.76])
        by mx.google.com with ESMTPS id d16si2406878qvn.7.2019.01.10.04.50.30
        for <gtirloni@wikimedia.org>
        (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256);
        Thu, 10 Jan 2019 04:50:30 -0800 (PST)
Received-SPF: pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address)
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address) smtp.mailfrom=gtirloni@tools.wmflabs.org
Date: Thu, 10 Jan 2019 04:50:30 -0800 (PST)
Message-Id: <5c373f96.1c69fb81.fa773.54a1SMTPIN_ADDED_MISSING@mx.google.com>
Received: from [172.16.4.197] (port=34300 helo=tools-sgegrid-master)
	by mx1001.wikimedia.org with esmtp (Exim 4.89)
	(envelope-from <gtirloni@tools.wmflabs.org>)
	id 1ghZmx-0006ia-Oy
	for gtirloni@wikimedia.org; Thu, 10 Jan 2019 12:50:30 +0000
X-Spam-Score: 7.3 (+++++++)
X-Spam-Report: Spam detection software, running on the system "mx1001.wikimedia.org",
 has identified this incoming email as possible spam.  The original
 message has been attached to this so you can view it or label
 similar future email.  If you have any questions, see
 the administrator of that system for details.
 
 Content preview:  Hey hey 
 
 Content analysis details:   (7.3 points, 4.0 required)
 
  pts rule name              description
 ---- ---------------------- --------------------------------------------------
  0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                             [score: 0.5018]
  0.0 FSL_HELO_NON_FQDN_1    No description available.
  1.0 MISSING_HEADERS        Missing To: header
  0.0 SPF_FAIL               SPF: sender does not match SPF record (fail)
 [SPF failed: Please see http://www.openspf.org/Why?s=mfrom;id=gtirloni%40tools.wmflabs.org;ip=172.16.4.197;r=mx1001.wikimedia.org]
  1.4 MISSING_DATE           Missing Date: header
  1.8 MISSING_SUBJECT        Missing Subject: header
  0.8 RDNS_NONE              Delivered to internal network by a host with no rDNS
  0.5 MISSING_MID            Missing Message-Id: header
  1.0 MISSING_FROM           Missing From: header
From: gtirloni@tools.wmflabs.org

Hey hey

This was sent over telnet without following any best practices.
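To see how much of that score comes from the missing headers alone, here is a small sketch that sums the rule lines from the report above. The regular expression is an assumption about the report layout as pasted here, not a general SpamAssassin parser.

```python
import re
from decimal import Decimal

# Rule lines copied from the X-Spam-Report header quoted above.
REPORT = """\
 0.8 BAYES_50               BODY: Bayes spam probability is 40 to 60%
                            [score: 0.5018]
 0.0 FSL_HELO_NON_FQDN_1    No description available.
 1.0 MISSING_HEADERS        Missing To: header
 0.0 SPF_FAIL               SPF: sender does not match SPF record (fail)
 1.4 MISSING_DATE           Missing Date: header
 1.8 MISSING_SUBJECT        Missing Subject: header
 0.8 RDNS_NONE              Delivered to internal network by a host with no rDNS
 0.5 MISSING_MID            Missing Message-Id: header
 1.0 MISSING_FROM           Missing From: header
"""

# A rule line starts with a score, then an upper-case rule name.
RULE_RE = re.compile(r"^[ \t]*(-?\d+\.\d+)[ \t]+([A-Z][A-Z0-9_]*)[ \t]", re.MULTILINE)


def rule_scores(report: str) -> dict:
    """Map rule name -> score (as Decimal, to avoid float rounding noise)."""
    return {name: Decimal(pts) for pts, name in RULE_RE.findall(report)}


scores = rule_scores(REPORT)
total = sum(scores.values())
missing = sum(v for k, v in scores.items() if k.startswith("MISSING_"))

if __name__ == "__main__":
    print("total:", total)
    print("from MISSING_* rules:", missing)
    print("remaining:", total - missing)
```

With the MISSING_* rules excluded, the remaining score is below the 4.0 threshold shown in the report.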

When I pointed tools-sgegrid-master to mx-out01.cloudinfra.eqiad.wmflabs, to use an MX that is already in the 172.16.0.0/21 network, my email wasn't forwarded to the spam folder (and it doesn't contain a report from SpamAssassin either):

Delivered-To: gtirloni@wikimedia.org
Received: by 2002:a05:6000:104b:0:0:0:0 with SMTP id c11csp1772652wrx;
        Thu, 10 Jan 2019 04:59:04 -0800 (PST)
X-Google-Smtp-Source: ALg8bN6NFix6pBModdzEZDKnOM7g6RFCNvsqEncqa0+5CZLVL0BOb5kG82J2wb1ufPCJBunFUbMZ
X-Received: by 2002:a37:85c7:: with SMTP id h190mr8884174qkd.225.1547125143999;
        Thu, 10 Jan 2019 04:59:03 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1547125143; cv=none;
        d=google.com; s=arc-20160816;
        b=DwzucEZkAeulkJiBEfLhc6erojT0xLiUwvKyR3qZFsxj3Sxc1w3Xty8aQUSahhbAOb
         +vX8y74L8nRsASwIndgOM1edjUVII0WMieB9A2SxhGJsxfRCjLFZd7/iOziRN58rnktY
         oylrPo8zx9x9E7XV64aO4QvvX7mlYfF8X3rl+9PZfhKEnWm+lDeenJs9YMtGus65Grqe
         d5PRjk4RlmOKBXm57aUYaoeb0aYnw9G2zOT+X9udh1ikaj3dZlJod/Ev77HTyR+B3px7
         rTJA6g/pZ6LBysZSXhwYDxo68da/07gDWQEd/Zw55aOPx8RmJn25Df2gJGa4NoX5xcFg
         lUWg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
        h=date:from:message-id:content-transfer-encoding:mime-version:subject
         :to;
        bh=g3zLYH4xKxcPrHOD18z9YfpQcnk/GaJedfustWU5uGs=;
        b=O2bzsPdkTFY/G1Uwzp6asUQR1pziOlBITw2zmkVOcIpsm0vI+jP15vErqgOuHr7lTK
         hDvSvMmXhgY4vOlZAxPB3ETo3M3spgybwpHoOq3pMYBDZjdULOFfY4aK0XbihJunXPx3
         Dwp3aLk29Cxlqf3Ua3Rlsp1ugnbAgeFy44VQDlHGdf1vDOwDprVhgAC5JJAjwX09HVV0
         b+c5+DwJcW9mgSJi1PBqWnm4jSOajdfFDWjidLUvV+5N3Jfx1A7rz1LODv8V4j9PSyFf
         PABLXPOYNXiyP4lNr3GRW9lQseSGSJ/P0znviKIiEX9CPDz4Q6qzgeu2Q14N72emDeUC
         OZCw==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address) smtp.mailfrom=root@tools.wmflabs.org
Return-Path: <root@tools.wmflabs.org>
Received: from mx1001.wikimedia.org (mx1001.wikimedia.org. [208.80.154.76])
        by mx.google.com with ESMTPS id w11si935168qvf.141.2019.01.10.04.59.03
        for <gtirloni@wikimedia.org>
        (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256);
        Thu, 10 Jan 2019 04:59:03 -0800 (PST)
Received-SPF: pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address)
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address) smtp.mailfrom=root@tools.wmflabs.org
Received: from [172.16.1.239] (port=46050 helo=mx-out01.wmflabs.org)
	by mx1001.wikimedia.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.89)
	(envelope-from <root@tools.wmflabs.org>)
	id 1ghZvL-0008CN-Jb
	for gtirloni@wikimedia.org; Thu, 10 Jan 2019 12:59:03 +0000
Received: from tools-sgegrid-master.tools.eqiad.wmflabs ([172.16.4.197]:41864)
	by mx-out01.wmflabs.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.89)
	(envelope-from <root@tools.wmflabs.org>)
	id 1ghZvL-0001fk-Db
	for gtirloni@wikimedia.org; Thu, 10 Jan 2019 12:59:03 +0000
Received: from root by tools-sgegrid-master.tools.eqiad.wmflabs with local (Exim 4.89)
	(envelope-from <root@tools.wmflabs.org>)
	id 1ghZvL-0004nD-BQ
	for gtirloni@wikimedia.org; Thu, 10 Jan 2019 12:59:03 +0000
To: gtirloni@wikimedia.org
Subject: test
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Message-Id: <E1ghZvL-0004nD-BQ@tools-sgegrid-master.tools.eqiad.wmflabs>
From: root <root@tools.wmflabs.org>
Date: Thu, 10 Jan 2019 12:59:03 +0000

test

I guess everything will be fine?

The only thing that makes me think whitelisting 172.16.0.0/21 is still desired is that, as usual, we can't guarantee tools will be good citizens so there might be an impact even if our MX is following best practices. Tools authors may have to fix their code in that case.

we can't guarantee tools will be good citizens so there might be an impact even if our MX is following best practices. Tools authors may have to fix their code in that case.

This was a primary concern with whitelisting 172.16.0.0/21 at the production MX, and the reasoning for creation of the cloud smarthosts (further detail available in T41785). The thinking there is to keep outbound mail service separate so an incident involving mail in WMCS does not impact production mail (and vice versa).

We can safely disregard the spam score outlined in T213416#4869367 for the moment. It results from bare-minimum headers and a terse body, which is totally normal for test mail sent via telnet and should not be cause for concern, as "real" mail will contain To, From, Date, Subject, etc. headers. Fwiw, the X-Spam-Score header will only be present if the score is >1.
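For comparison, a minimal sketch of what "real" mail looks like when built with Python's standard `email` package, which fills in the headers the telnet test lacked. The `build_message` helper and the example.org recipient are placeholders for illustration; nothing is sent.

```python
from email.message import EmailMessage
from email.utils import formatdate, make_msgid


def build_message(sender: str, recipient: str, subject: str, body: str, domain: str) -> EmailMessage:
    """Build a message carrying the headers the spam report flagged as missing."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg["Date"] = formatdate(localtime=False)       # RFC 5322 date in UTC
    msg["Message-ID"] = make_msgid(domain=domain)   # unique id under the given domain
    msg.set_content(body)
    return msg


# Placeholder recipient; the sender matches the envelope sender seen in this task.
msg = build_message(
    sender="root@tools.wmflabs.org",
    recipient="someone@example.org",
    subject="test",
    body="test\n",
    domain="tools.wmflabs.org",
)

if __name__ == "__main__":
    print(msg.as_string())
```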

I can think of a few additional approaches to the situation, but I am unsure if there is processing happening at tools-mail that requires a hop through this system, or if that hop could be eliminated.

  1. toolforge server -> tools-mail-02 -> mx-out01.wmflabs.org -> internet (MX records used from here) (as outlined in the task description)
  2. toolforge server -> mx-out01.wmflabs.org -> internet (MX records used from here) (is it a viable option to eliminate the tools-mail-02 hop?)
  3. toolforge server -> tools-mail-02 -> internet (MX records used from here)

Option 1 looks to be the most immediate solution.

In any event the prod MXes should not be used as the smarthosts/relayhosts for WMCS going forward, so the whitelisting at prod MX option is out.

Change 481215 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] network: Add the new cloud region to all_networks

https://gerrit.wikimedia.org/r/481215

Just connecting that change to this task, since this is really what it was hoping to resolve.

Bstorm added a comment. Edited Jan 10 2019, 4:02 PM

I'll also point out that toolforge is currently whitelisted at the MX via the networks from what I can tell from puppet. We are discussing the new cluster of toolforge only.

I can think of a few additional approaches to the situation, but I am unsure if there is processing happening at tools-mail that requires a hop through this system, or if that hop could be eliminated.

Yes, there is some processing at tools-mail which may be non-trivial to move elsewhere. I suggest we keep using tools-mail and figure out what is the next step in the email travel from there.

  1. toolforge server -> tools-mail-02 -> mx-out01.wmflabs.org -> internet (MX records used from here) (as outlined in the task description)

[...]
Option 1 looks to be the most immediate solution.

Ok, could you please give advice on how to proceed with this option? Specifically, what changes will we need from the tools-mail point of view?

Actually, upon closer inspection, I'm not understanding the issue with the current config. The production MX hosts accept mail for valid @wikimedia.org addresses regardless of the originating IP address (unless it is in a dnsbl). The relay whitelist only comes into play when relaying to other remote domains; I had misunderstood the description, thinking that tools-mail was attempting to use the prod MX as a smarthost relay for other domains. I also understand now that the current configuration is what I had described as option 3 in T213416#4869863.

Here are a couple of examples of mail for an @wikimedia.org address being accepted by the prod MX from a 172.16.0.0/12 address.

MX server side of T213416#4869367:

2019-01-10 12:50:30 1ghZmx-0006ia-Oy <= gtirloni@tools.wmflabs.org H=(tools-sgegrid-master) [172.16.4.197]:34300 I=[208.80.154.76]:25 P=esmtp S=254
2019-01-10 12:50:30 1ghZmx-0006ia-Oy => gtirloni@wikimedia.org R=ldap_account T=remote_smtp S=1562 H=aspmx.l.google.com [172.217.197.27] I=[208.80.154.76] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK 1547124630 d16si2406878qvn.7 - gsmtp" DT=0s
2019-01-10 12:50:30 1ghZmx-0006ia-Oy Completed

Example shinken mail:

2019-01-10 16:35:55 1ghdJD-0005HK-II <= root@wmflabs.org H=(mx-out01.wmflabs.org) [172.16.1.239]:47656 I=[208.80.154.76]:25 P=esmtps X=TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256 CV=no S=1211 id=E1ghdJD-0005q6-9c@shinken-02.shinken.eqiad.wmflabs
2019-01-10 16:35:56 1ghdJD-0005HK-II => redacted@wikimedia.org R=ldap_account T=remote_smtp S=1249 H=aspmx.l.google.com [173.194.175.26] I=[208.80.154.76] X=TLS1.2:ECDHE_RSA_CHACHA20_POLY1305:256 CV=yes DN="C=US,ST=California,L=Mountain View,O=Google LLC,CN=mx.google.com" C="250 2.0.0 OK 1547138156 z64si2130017qkd.17 - gsmtp" DT=1s
2019-01-10 16:35:56 1ghdJD-0005HK-II Completed
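When checking more of these log entries, a small script can pull out the sender and source address of each arrival (`<=`) line. This is a sketch that assumes the line layout shown in the two excerpts above; Exim's log format has more variants than the `ARRIVAL_RE` pattern here covers, and the trailing fields are truncated in the sample strings.

```python
import ipaddress
import re

# Matches arrival lines of the form seen above: "<date> <time> <id> <= <sender> H=... [ip]:port".
ARRIVAL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<msgid>\S+) <= (?P<sender>\S+) "
    r"H=.*? \[(?P<ip>[^\]]+)\]"
)
CLOUD_NET = ipaddress.ip_network("172.16.0.0/12")


def parse_arrival(line: str):
    """Return a dict for an arrival line, or None for any other log line."""
    m = ARRIVAL_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["from_cloud"] = ipaddress.ip_address(rec["ip"]) in CLOUD_NET
    return rec


LOG = [
    "2019-01-10 12:50:30 1ghZmx-0006ia-Oy <= gtirloni@tools.wmflabs.org "
    "H=(tools-sgegrid-master) [172.16.4.197]:34300 I=[208.80.154.76]:25 P=esmtp S=254",
    "2019-01-10 12:50:30 1ghZmx-0006ia-Oy Completed",
    "2019-01-10 16:35:55 1ghdJD-0005HK-II <= root@wmflabs.org "
    "H=(mx-out01.wmflabs.org) [172.16.1.239]:47656 I=[208.80.154.76]:25 P=esmtps",
]

arrivals = [rec for rec in map(parse_arrival, LOG) if rec is not None]

if __name__ == "__main__":
    for rec in arrivals:
        print(rec["ts"], rec["sender"], rec["ip"], "cloud" if rec["from_cloud"] else "other")
```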

@aborrero how were you able to produce a problem sending mail to an @wikimedia.org address from a host residing in 172.16.0.0/12?

Bstorm added a comment. Edited Jan 10 2019, 5:40 PM

The issue is an email that isn't well formatted. An email from root for cron, for instance. In fact, an email from root rather specifically. Grid Engine sends them as well. They are often needed for troubleshooting on the grid.

Specifically for Grid Engine root emails on Toolforge, I may have found a small part of the problem: an encoding issue that might be getting things flagged. If I can fail some jobs badly enough, I should be able to produce some related emails. This won't reflect on other VPSs and other types of Toolforge emails from root that might be affected by issues.

If other root emails come through at this point, we'll be good with things as they are, and that'll be nice (some parts will be baffling, but it'll be nice).

Change 481215 abandoned by Bstorm:
network: Add the new cloud region to all_networks

Reason:
This one is a non-starter in the middle of other refactors

https://gerrit.wikimedia.org/r/481215

What's the status of this? Hey @akosiaris, did you finish with the changes you were doing?

herron closed this task as Resolved. Feb 6 2019, 4:23 PM

After checking in with @aborrero and @Bstorm via IRC there isn't clear enough evidence of an issue to take action now.

Closing this task with the expectation that we may re-open in the future if/when further action is needed.

bd808 reopened this task as Open. Mar 4 2019, 1:26 AM

Reopening because I have been seeing quite a few emails that look like this:

Delivered-To: bdavis@wikimedia.org
Received: by 2002:a19:6f4c:0:0:0:0:0 with SMTP id n12csp1248738lfk;
        Fri, 1 Mar 2019 15:30:59 -0800 (PST)
X-Google-Smtp-Source: APXvYqxtq0aNtH1Ihi+eaCZaTtFzHR3jx/FmEMPYX23FEY/tUMkjbMaMX6HlivzP/4SIJSb88cKg
X-Received: by 2002:a0c:bd97:: with SMTP id n23mr5949373qvg.58.1551483059083;
        Fri, 01 Mar 2019 15:30:59 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1551483059; cv=none;
        d=google.com; s=arc-20160816;
        b=VaQj5TCWPiA4qLi7/8XUxyndRq6Ro8JVb1Jx/7k/URlwksT1f8lAL4DrBF2o3N88Yp
         a/wInhQXxTpb+xvGBndVXrLmSy9iz4pmhsFrTdHOkEFqiU5778a9jBbfG87aWVO+xU7G
         MoPoKULlA+FPZNHOlybDHH0egqGBFQza5TTmPLG2mbq0FY8lgryjmIP4woZmBGSRqnDx
         olF6qSigCh7XHoEcuyZ8WDo65VrK8SnsdmpLKbKhlpFIk0UECJVVCFgM6DJ/ioM9z71D
         UdfT1z39RgL40U/yf8gDuCLAinbj4RC7s4JXiU7JZm1GQbfnCDcd7ZfcHgm0EPYYNSAl
         BLzA==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
        h=date:message-id:subject:mime-version:to:from:auto-submitted;
        bh=cvOjhTWq3w0L95Xb1wVoDBWcDscZEw71T/K3rdF+cVA=;
        b=bW+axNGZeO0uJxahHUMHh6iTONnlbBkzXeL7TVKB8z12IOYyWg4cYIFDrajba90Cse
         vAwrjd05eLUOneHB0zgoKk+Yd/M+l0yTKrgZd29o8QmMkGPUAGKQZK38T52IoHf7olqL
         CZTRnlbWM6Cvwrlpr6tMKj8BgTkwvdg7dZow8CUTu/4R/1ov/C6BZ2GKcgNZ8KTBnW0v
         iwgJi3MT7z/J6wZUodcHIJ/7jsl+xlXW1x+bQDzgSTCCTPMD32THUK61eYtONU8Z2Wf3
         AtB+rLnPqpwYprhCXXGdUyFFzL9W9H/GO9F6wDMfalwGD7mRXKQU5OQ8koqzcYji0hoC
         A+Bw==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address)
Return-Path: <>
Received: from mx1001.wikimedia.org (mx1001.wikimedia.org. [208.80.154.76])
        by mx.google.com with ESMTPS id c14si1304633qkm.197.2019.03.01.15.30.58
        (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256);
        Fri, 01 Mar 2019 15:30:59 -0800 (PST)
Received-SPF: pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address)
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address)
Received: from [10.68.23.71] (port=37576 helo=mail.tools.wmflabs.org) by mx1001.wikimedia.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) id 1gzrcI-0001AT-Hb; Fri, 01 Mar 2019 23:30:58 +0000
Received: from Debian-exim by mail.tools.wmflabs.org with local (Exim 4.89) id 1gzrcH-0000Z2-Un for root@tools-sgeexec-0927.tools.eqiad.wmflabs; Fri, 01 Mar 2019 23:30:57 +0000
X-Failed-Recipients: rush@wikimedia.org,
  reedy@wikimedia.org,
  nwilson@wikimedia.org,
  glavagetto@wikimedia.org,
  mmuhlenhoff@wikimedia.org,
  gtirloni@wikimedia.org,
  bstorm@wikimedia.org,
  bdavis@wikimedia.org,
  ariel@wikimedia.org,
  akosiaris@wikimedia.org,
  aborrero@wikimedia.org
Auto-Submitted: auto-replied
From: Mail Delivery System <Mailer-Daemon@tools.wmflabs.org>
To: root@tools-sgeexec-0927.tools.eqiad.wmflabs
Content-Type: multipart/report; report-type=delivery-status; boundary=1551483057-eximdsn-1604336407
MIME-Version: 1.0
Subject: Mail delivery failed: returning message to sender
Message-Id: <E1gzrcH-0000Z2-Un@mail.tools.wmflabs.org>
Date: Fri, 01 Mar 2019 23:30:57 +0000

--1551483057-eximdsn-1604336407
Content-type: text/plain; charset=us-ascii

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its
recipients. This is a permanent error. The following address(es) failed:

  rush@wikimedia.org
    (ultimately generated from root@tools-sgeexec-0927.tools.eqiad.wmflabs)
    host mx1001.wikimedia.org [208.80.154.76]
    SMTP error from remote mail server after RCPT TO:<rush@wikimedia.org>:
    550 Sender verify failed
  reedy@wikimedia.org
    (ultimately generated from root@tools-sgeexec-0927.tools.eqiad.wmflabs)
    host mx1001.wikimedia.org [208.80.154.76]
    SMTP error from remote mail server after RCPT TO:<reedy@wikimedia.org>:
    550 Sender verify failed
  nwilson@wikimedia.org
    (ultimately generated from root@tools-sgeexec-0927.tools.eqiad.wmflabs)
    host mx1001.wikimedia.org [208.80.154.76]
    SMTP error from remote mail server after RCPT TO:<nwilson@wikimedia.org>:
    550 Sender verify failed
  glavagetto@wikimedia.org
    (ultimately generated from root@tools-sgeexec-0927.tools.eqiad.wmflabs)
    host mx1001.wikimedia.org [208.80.154.76]
    SMTP error from remote mail server after RCPT TO:<glavagetto@wikimedia.org>:
    550 Sender verify failed
  mmuhlenhoff@wikimedia.org
    (ultimately generated from root@tools-sgeexec-0927.tools.eqiad.wmflabs)
    host mx1001.wikimedia.org [208.80.154.76]
    SMTP error from remote mail server after RCPT TO:<mmuhlenhoff@wikimedia.org>:
    550 Sender verify failed
  gtirloni@wikimedia.org
    (ultimately generated from root@tools-sgeexec-0927.tools.eqiad.wmflabs)
    host mx1001.wikimedia.org [208.80.154.76]
    SMTP error from remote mail server after RCPT TO:<gtirloni@wikimedia.org>:
    550 Sender verify failed
  bstorm@wikimedia.org
    (ultimately generated from root@tools-sgeexec-0927.tools.eqiad.wmflabs)
    host mx1001.wikimedia.org [208.80.154.76]
    SMTP error from remote mail server after RCPT TO:<bstorm@wikimedia.org>:
    550 Sender verify failed
  bdavis@wikimedia.org
    (ultimately generated from root@tools-sgeexec-0927.tools.eqiad.wmflabs)
    host mx1001.wikimedia.org [208.80.154.76]
    SMTP error from remote mail server after RCPT TO:<bdavis@wikimedia.org>:
    550 Sender verify failed
  ariel@wikimedia.org
    (ultimately generated from root@tools-sgeexec-0927.tools.eqiad.wmflabs)
    host mx1001.wikimedia.org [208.80.154.76]
    SMTP error from remote mail server after RCPT TO:<ariel@wikimedia.org>:
    550 Sender verify failed
  akosiaris@wikimedia.org
    (ultimately generated from root@tools-sgeexec-0927.tools.eqiad.wmflabs)
    host mx1001.wikimedia.org [208.80.154.76]
    SMTP error from remote mail server after RCPT TO:<akosiaris@wikimedia.org>:
    550 Sender verify failed
  aborrero@wikimedia.org
    (ultimately generated from root@tools-sgeexec-0927.tools.eqiad.wmflabs)
    host mx1001.wikimedia.org [208.80.154.76]
    SMTP error from remote mail server after RCPT TO:<aborrero@wikimedia.org>:
    550-Verification failed for <root@tools-sgeexec-0927.tools.eqiad.wmflabs>
    550-Cannot route to remote domain tools-sgeexec-0927.tools.eqiad.wmflabs
    550 Sender verify failed

--1551483057-eximdsn-1604336407
Content-type: message/delivery-status


--1551483057-eximdsn-1604336407
Content-type: message/rfc822

Return-path: <root@tools-sgeexec-0927.tools.eqiad.wmflabs>
Received: from tools-sgeexec-0927.tools.eqiad.wmflabs ([172.16.1.244]) by mail.tools.wmflabs.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from <root@tools-sgeexec-0927.tools.eqiad.wmflabs>) id 1gzrcB-0000YO-07 for root@tools-sgeexec-0927.tools.eqiad.wmflabs; Fri, 01 Mar 2019 23:30:51 +0000
Received: from root by tools-sgeexec-0927.tools.eqiad.wmflabs with local (Exim 4.89) (envelope-from <root@tools-sgeexec-0927.tools.eqiad.wmflabs>) id 1gzrcA-0008L8-Tq for root@tools-sgeexec-0927.tools.eqiad.wmflabs; Fri, 01 Mar 2019 23:30:50 +0000
Subject: SGE 8.1.9: Job 459236 failed
To: <root@tools-sgeexec-0927.tools.eqiad.wmflabs>
X-Mailer: mail (GNU Mailutils 3.1.1)
Message-Id: <E1gzrcA-0008L8-Tq@tools-sgeexec-0927.tools.eqiad.wmflabs>
From: root@tools-sgeexec-0927.tools.eqiad.wmflabs
Date: Fri, 01 Mar 2019 23:30:50 +0000

Job 459236 caused action: none
 User        = tools.catrename
 Queue       = task@tools-sgeexec-0927.tools.eqiad.wmflabs
 Start Time  = <unknown>
 End Time    = <unknown>
failed before job: 03/01/2019 23:30:49 [600:32057]: can't get passwd entry for user "tools.catrename"
Shepherd trace:
03/01/2019 23:30:30 [600:31939]: shepherd called with uid = 0, euid = 600
03/01/2019 23:30:49 [600:31939]: starting up 8.1.9
03/01/2019 23:30:49 [600:31939]: setpgid(31939, 31939) returned 0
03/01/2019 23:30:49 [600:31939]: do_core_binding: "binding" parameter not found in config file
03/01/2019 23:30:49 [600:31939]: no prolog script to start
03/01/2019 23:30:49 [600:31939]: parent: forked "job" with pid 32057
03/01/2019 23:30:49 [600:31939]: parent: job-pid: 32057
03/01/2019 23:30:49 [600:32057]: child: starting son(job, /usr/bin/python3.5, 0, 4096);
03/01/2019 23:30:49 [600:32057]: pid=32057 pgrp=32057 sid=32057 old pgrp=31939 getlogin()=<no login set>
03/01/2019 23:30:49 [600:32057]: reading passwd information for user 'tools.catrename'
03/01/2019 23:30:49 [600:32057]: can't get passwd entry for user "tools.catrename"
03/01/2019 23:30:49 [600:31939]: wait3 returned 32057 (status: 2816; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
03/01/2019 23:30:49 [600:31939]: job exited with exit status 11
03/01/2019 23:30:49 [600:31939]: reaped "job" with pid 32057
03/01/2019 23:30:49 [600:31939]: job exited not due to signal
03/01/2019 23:30:49 [600:31939]: job exited with status 11
03/01/2019 23:30:49 [600:31939]: now sending signal KILL to pid -32057
03/01/2019 23:30:49 [600:31939]: pdc_kill_addgrpid: 65427 9
03/01/2019 23:30:49 [600:31939]: failed starting job
03/01/2019 23:30:49 [600:31939]: no epilog script to start

Shepherd error:
03/01/2019 23:30:49 [600:32057]: can't get passwd entry for user "tools.catrename"

Shepherd pe_hostfile:
tools-sgeexec-0927.tools.eqiad.wmflabs 1 task@tools-sgeexec-0927.tools.eqiad.wmflabs UNDEFINED



--1551483057-eximdsn-1604336407--

Scenario:

  1. Grid engine sees a catastrophic failure running a job on a grid host
  2. Grid engine sends an email to root@[GRID ENGINE EXEC NODE].tools.eqiad.wmflabs notifying of failure
  3. Email reaches mail.tools.wmflabs.org (tools-mail-02.tools.eqiad.wmflabs) where /etc/aliases config translates root@* to tools.admin@tools.wmflabs.org
  4. tools.admin@tools.wmflabs.org is expanded to admin.maintainers@tools.wmflabs.org via Exim config
  5. /usr/local/sbin/maintainers is used to compute the list of email addresses for all registered maintainers of the admin tool after checking for an explicit $HOME/.forward.maintainers and $HOME/.forward for the admin tool
    1. The generated list of emails includes a mix of *@wikimedia.org and other email addresses
  6. tools-mail-02.tools.eqiad.wmflabs attempts to relay the email to the *@wikimedia.org recipients
  7. mx1001.wikimedia.org rejects root@tools-sgeexec-0927.tools.eqiad.wmflabs as a valid email sender and refuses to deliver the message(s)
  8. A bounce message is created on tools-mail-02.tools.eqiad.wmflabs with a From of Mailer-Daemon@tools.wmflabs.org to the same root@tools-sgeexec-0927.tools.eqiad.wmflabs recipient from step 2 above
  9. This bounce message is delivered as expected while running through the same steps as the failure because ultimately Mailer-Daemon@tools.wmflabs.org is seen by mx1001.wikimedia.org as a valid sender.

This is the same scenario that @Bstorm was pointing out as problematic in T213416#4870463.
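The difference between steps 7 and 9 can be illustrated with a toy model of sender verification. This is not how Exim implements it (Exim tries to route the sender address); the suffix check below merely stands in for "this domain cannot be resolved outside Cloud VPS", and `INTERNAL_ONLY_SUFFIXES` and `can_verify_sender` are hypothetical names.

```python
# Assumption for illustration: domains under the bare ".wmflabs" root only
# resolve inside Cloud VPS, so an outside MX cannot route mail back to them.
INTERNAL_ONLY_SUFFIXES = (".wmflabs",)


def sender_domain(address: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[1].lower()


def can_verify_sender(address: str) -> bool:
    """Toy stand-in for a sender verify check done by an MX outside Cloud VPS."""
    return not sender_domain(address).endswith(INTERNAL_ONLY_SUFFIXES)


if __name__ == "__main__":
    for sender in (
        "root@tools-sgeexec-0927.tools.eqiad.wmflabs",  # step 7: rejected
        "Mailer-Daemon@tools.wmflabs.org",              # step 9: accepted
    ):
        verdict = "accepted" if can_verify_sender(sender) else "550 Sender verify failed"
        print(sender, "->", verdict)
```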

The reply is 550 Sender verify failed. I guess this can be solved by configuring (I am assuming this is configurable) grid engine to send failures to a specific email address that exists?

bd808 added a comment. Mar 4 2019, 7:23 PM

The reply is 550 Sender verify failed. I guess this can be solved by configuring (I am assuming this is configurable) grid engine to send failures to a specific email address that exists?

The envelope From is "root@tools-sgeexec-0927.tools.eqiad.wmflabs", which is a specific email address that does exist inside the Cloud VPS DNS/user space. It is, however, an address that would not seem to be valid from any host outside Cloud VPS, where DNS is not available for the wmflabs local root domain.

On the mx-out01.cloudinfra.eqiad.wmflabs host this envelope-from line would be rewritten to root@wmflabs.org:

$ sudo exim4 -brw root@tools-sgeexec-0927.tools.eqiad.wmflabs
  sender: root@tools-sgeexec-0927.tools.eqiad.wmflabs
    from: root@tools-sgeexec-0927.tools.eqiad.wmflabs
      to: root@tools-sgeexec-0927.tools.eqiad.wmflabs
      cc: root@tools-sgeexec-0927.tools.eqiad.wmflabs
     bcc: root@tools-sgeexec-0927.tools.eqiad.wmflabs
reply-to: root@tools-sgeexec-0927.tools.eqiad.wmflabs
2019-03-04 19:06:09 "root@tools-sgeexec-0927.tools.eqiad.wmflabs" from env-from rewritten as "root@wmflabs.org" by rule 1
env-from: root@wmflabs.org
  env-to: root@tools-sgeexec-0927.tools.eqiad.wmflabs

I think adding similar envelope rewrite behavior to the Toolforge smarthost would fix this issue.
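The effect of such a rewrite can be sketched in a few lines. The glob pattern below is an assumption chosen for illustration; the task only shows that "rule 1" on mx-out01 maps this sender to root@wmflabs.org, not what the rule's pattern is, and the real change would be an Exim rewrite rule rather than Python.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern: any local part at any host under eqiad.wmflabs.
INTERNAL_PATTERN = "*@*.eqiad.wmflabs"


def rewrite_env_from(address: str, replacement: str, pattern: str = INTERNAL_PATTERN) -> str:
    """Return replacement if the envelope sender matches pattern, else the address unchanged."""
    if fnmatchcase(address.lower(), pattern):
        return replacement
    return address


if __name__ == "__main__":
    sender = "root@tools-sgeexec-0927.tools.eqiad.wmflabs"
    # Behaviour reported for mx-out01 in the exim4 -brw output above.
    print(rewrite_env_from(sender, "root@wmflabs.org"))
    # An address that is already externally valid is left alone.
    print(rewrite_env_from("root@tools.wmflabs.org", "root@wmflabs.org"))
```

Only the envelope sender is rewritten in this sketch; the From header of the message is left untouched, matching the `-brw` output where only `env-from` changes.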

Change 494291 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Rewrite envelope From headers when relaying

https://gerrit.wikimedia.org/r/494291

Change 494291 merged by Alexandros Kosiaris:
[operations/puppet@production] toolforge: Rewrite envelope From headers when relaying

https://gerrit.wikimedia.org/r/494291

Change 494515 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Rewrite envelope From headers when relaying

https://gerrit.wikimedia.org/r/494515

Change 494515 merged by Bstorm:
[operations/puppet@production] toolforge: Rewrite envelope From headers when relaying

https://gerrit.wikimedia.org/r/494515

bd808 claimed this task. Mar 5 2019, 10:31 PM

With the patch (to the correct file!) applied, Exim on tools-mail-02.tools.eqiad.wmflabs now reports that it will rewrite the envelope From header:

$ sudo exim4 -brw root@tools-sgeexec-0927.tools.eqiad.wmflabs
  sender: root@tools-sgeexec-0927.tools.eqiad.wmflabs
    from: root@tools-sgeexec-0927.tools.eqiad.wmflabs
      to: root@tools-sgeexec-0927.tools.eqiad.wmflabs
      cc: root@tools-sgeexec-0927.tools.eqiad.wmflabs
     bcc: root@tools-sgeexec-0927.tools.eqiad.wmflabs
reply-to: root@tools-sgeexec-0927.tools.eqiad.wmflabs
env-from: root@tools.wmflabs.org
  env-to: root@tools-sgeexec-0927.tools.eqiad.wmflabs

bd808 closed this task as Resolved. Mar 5 2019, 11:38 PM

Declaring victory having seen this error email in my inbox.

Delivered-To: bdavis@wikimedia.org
Received: by 2002:a19:6f4b:0:0:0:0:0 with SMTP id n11csp3775957lfk;
        Tue, 5 Mar 2019 15:20:45 -0800 (PST)
X-Google-Smtp-Source: APXvYqxuPtvwmw8ekj1loqlmEkeXdHx++0yEDQmfIsedLErBdfy06BZKmhJSCc3VIvPOD45r0IKi
X-Received: by 2002:a37:3087:: with SMTP id w129mr3429075qkw.255.1551828045604;
        Tue, 05 Mar 2019 15:20:45 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1551828045; cv=none;
        d=google.com; s=arc-20160816;
        b=nwQFG9cCCfgXRrEj9YL5HfyhOBbHiO8w4coiydNu+UiuVmIkFaL0i+ckdCXpVuZ9tL
         jILFCt+EB5Yia8MiEX40LVac1V/YAudAy0Nt4O4ZZ3EKZtiFNX2Uq8j6x2RKpEAXn1Td
         xXVj851+nK6u+wrFlB/nQdXSRF329wZRQW06vMTzBwy+X7fEI0XnvKKHCDXruQwSIIjX
         sZwSjMQH3QdWL3cRAsIhbMSGUeY5ES9wh2+ArlRbUAEIkOapjtTXQfmGaSJLmQC2jv9o
         scOjV5noOUAKAmy/BnQdBwkrvpgNq0D4yJ+sSe5/CG6NNKBNBcwXuU89P+u5n7sBMHPB
         9j9w==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
        h=date:from:message-id:to:subject;
        bh=Ya68jhSbN7X4pMynIU/nBxNuvg2CmeF6Fv78ZQhMtdM=;
        b=t+RarVoulxaz4qG1zOKgUROTbYqBmDAk1CCKmYyOMeAJmPbOBCse9YdxfJI1YfUaUM
         ZQdY5zi/unO23LHH8K36+6iYXYB/LgQ9mCLMLCYsxkqJTOO2pXYH3Y6xnNGryUyN/xLc
         pzzmcEV45az49/H+OrOblNE0ax6/cKa2jabctzPuIPMCY5f+FQWys9/J7NMtL5+XmGmC
         R6q6/XRHPn4uYk0fYYCw/xpGnDsPhkEQlkD+d0p/zDgXYa45EX+gF9BIRqGcamcx7ju0
         X/tTlY7krSTCNM7dspvSS31ghK7wLAKAo9vrWf+Q3DTb2s+hhaYAhv8nYqFT2tIu/3Jj
         ni8Q==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address) smtp.mailfrom=root@tools.wmflabs.org
Return-Path: <root@tools.wmflabs.org>
Received: from mx1001.wikimedia.org (mx1001.wikimedia.org. [208.80.154.76])
        by mx.google.com with ESMTPS id l12si5529000qtq.216.2019.03.05.15.20.45
        (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256);
        Tue, 05 Mar 2019 15:20:45 -0800 (PST)
Received-SPF: pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address)
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain wikimedia.org configured 208.80.154.76 as internal address) smtp.mailfrom=root@tools.wmflabs.org
Received: from [10.68.23.71] (port=53732 helo=mail.tools.wmflabs.org) by mx1001.wikimedia.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from <root@tools.wmflabs.org>) id 1h1JMa-0004nb-Tm; Tue, 05 Mar 2019 23:20:44 +0000
Received: from tools-sgeexec-0907.tools.eqiad.wmflabs ([172.16.1.209]) by mail.tools.wmflabs.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from <root@tools.wmflabs.org>) id 1h1JMZ-0006US-Ph for root@tools-sgeexec-0907.tools.eqiad.wmflabs; Tue, 05 Mar 2019 23:20:43 +0000
Received: from root by tools-sgeexec-0907.tools.eqiad.wmflabs with local (Exim 4.89) (envelope-from <root@tools-sgeexec-0907.tools.eqiad.wmflabs>) id 1h1JMZ-0007XI-M2 for root@tools-sgeexec-0907.tools.eqiad.wmflabs; Tue, 05 Mar 2019 23:20:43 +0000
Subject: SGE 8.1.9: Job 564416 failed
To: <root@tools-sgeexec-0907.tools.eqiad.wmflabs>
X-Mailer: mail (GNU Mailutils 3.1.1)
Message-Id: <E1h1JMZ-0007XI-M2@tools-sgeexec-0907.tools.eqiad.wmflabs>
From: root@tools-sgeexec-0907.tools.eqiad.wmflabs
Date: Tue, 05 Mar 2019 23:20:43 +0000

Job 564416 caused action: none
 User        = tools.jarbot-iii
 Queue       = task@tools-sgeexec-0907.tools.eqiad.wmflabs
 Start Time  = <unknown>
 End Time    = <unknown>
failed before job: 03/05/2019 23:20:42 [600:28964]: can't get passwd entry for user "tools.jarbot-iii"
Shepherd trace:
03/05/2019 23:20:24 [600:28796]: shepherd called with uid = 0, euid = 600
03/05/2019 23:20:42 [600:28796]: starting up 8.1.9
03/05/2019 23:20:42 [600:28796]: setpgid(28796, 28796) returned 0
03/05/2019 23:20:42 [600:28796]: do_core_binding: "binding" parameter not found in config file
03/05/2019 23:20:42 [600:28796]: no prolog script to start
03/05/2019 23:20:42 [600:28796]: parent: forked "job" with pid 28964
03/05/2019 23:20:42 [600:28796]: parent: job-pid: 28964
03/05/2019 23:20:42 [600:28964]: child: starting son(job, /usr/bin/python3.5, 0, 4096);
03/05/2019 23:20:42 [600:28964]: pid=28964 pgrp=28964 sid=28964 old pgrp=28796 getlogin()=<no login set>
03/05/2019 23:20:42 [600:28964]: reading passwd information for user 'tools.jarbot-iii'
03/05/2019 23:20:42 [600:28964]: can't get passwd entry for user "tools.jarbot-iii"
03/05/2019 23:20:42 [600:28796]: wait3 returned 28964 (status: 2816; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
03/05/2019 23:20:42 [600:28796]: job exited with exit status 11
03/05/2019 23:20:42 [600:28796]: reaped "job" with pid 28964
03/05/2019 23:20:42 [600:28796]: job exited not due to signal
03/05/2019 23:20:42 [600:28796]: job exited with status 11
03/05/2019 23:20:42 [600:28796]: now sending signal KILL to pid -28964
03/05/2019 23:20:42 [600:28796]: pdc_kill_addgrpid: 65424 9
03/05/2019 23:20:42 [600:28796]: failed starting job
03/05/2019 23:20:42 [600:28796]: no epilog script to start

Shepherd error:
03/05/2019 23:20:42 [600:28964]: can't get passwd entry for user "tools.jarbot-iii"

Shepherd pe_hostfile:
tools-sgeexec-0907.tools.eqiad.wmflabs 1 task@tools-sgeexec-0907.tools.eqiad.wmflabs UNDEFINED
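The `Received:` headers in the message above, read bottom-up, show the path the mail took: local delivery on tools-sgeexec-0907, then mail.tools.wmflabs.org, then mx1001.wikimedia.org, then mx.google.com. A small Python sketch for pulling that hop list out of a raw message, in case anyone wants to check other mails the same way. The function name and the regex are my own, and the regex is a loose heuristic rather than a full parser of the `Received:` syntax.

```python
import re
from email import message_from_string

# Loose heuristic: capture the token after "from" and the token after the
# first following "by". Headers without a "from" clause are skipped.
HOP = re.compile(r"from\s+(?P<src>\S+).*?\bby\s+(?P<dst>\S+)", re.S)


def relay_path(raw_message: str) -> list[tuple[str, str]]:
    """Return (source, receiver) pairs in the order the mail travelled."""
    msg = message_from_string(raw_message)
    hops = []
    # Each relay prepends its own Received header, so the oldest hop is last.
    for value in reversed(msg.get_all("Received", [])):
        match = HOP.search(value)
        if match:
            hops.append((match.group("src"), match.group("dst")))
    return hops


if __name__ == "__main__":
    import sys

    # Usage: python relay_path.py < message.eml
    for src, dst in relay_path(sys.stdin.read()):
        print(f"{src} -> {dst}")
```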