VRTS e-mail address unreachable / e-mail routing issue
Closed, Resolved | Public | BUG REPORT

Description

The e-mail address info-ko@wikimedia.org has been used on VRTS for years. It appears that for an unknown length of time it has not been working: e-mail sent to it bounces as undeliverable. Additionally, it appears to be routed via Google, which perhaps has never been correct.

Example e-mail headers can be found at https://vrt-wiki.wikimedia.org/wiki/Administrator_requests#info-ko or can be provided in a private space. (See P71053 and P71054, otrs-admins or WMF-NDA membership required)

The issue may or may not be related to: https://phabricator.wikimedia.org/T374090

Event Timeline

(Adding SRE and infra-foundations based on tasks at Mail.)

For those who lack vrt-wiki access but have WMF-NDA, there is P71053 (which is acl*otrs-admins + WMF-NDA).

Additionally, it appears to be routed via Google, which perhaps has never been correct.

If I recall correctly, mx{1001|2001} looks for a known VRT address, and if there is none, it just routes everything else to Google (for WMF employee mailboxes or WMF internal Google Groups). (The docs are a bit outdated.)

For some reason the alias generator script thinks the alias is handled by Google and does not route it to VRTS:

Nov 15 09:03:27 mx-in1001 vrts_aliases[4167622]: ERROR:/usr/local/bin/vrts_aliases:Skipping, email is handled by gsuite: info-ko@wikimedia.org

I can definitely say it did not work that way yesterday: there was an incoming info-ko@wikimedia.org ticket at 2024-11-14T02:04:08Z.

So something happened during the last 30-ish hours that confused the poor mail server.

It seems the entire @wikimedia.org domain is refusing to route to VRTS. My test email to oversight-ko-wp@wikimedia.org also bounced with P71054 (again, acl-otrs-admin + WMF-NDA).

EDIT: For the record, wikipedia.org routes just fine, tested via info-ko@wikipedia.org.

That actually makes VRTS fubar. Unbreak now, please.

Ladsgroup triaged this task as Unbreak Now! priority. Nov 15 2024, 10:16 AM
Ladsgroup subscribed.

Nothing in the recently merged puppet patches stands out, nor does anything in the private repo. I'll dig a bit more.

Asking ITS if anything changed on their side recently.

As a fast fix, we can put the VRT transport rule before the gmail rule to make sure it gets checked first.

Can't we just check what was changed yesterday, and undo that?

As a fast fix, we can put the VRT transport rule before the gmail rule to make sure it gets checked first.

No, you can't, since the VRT alias generator is the one that's doing the filtering here. The gsuite transport is there as a fallback for addresses that don't match any other rule.

My Postfix knowledge is not really good, but what I mean is this order:

transport_maps = regexp:/etc/postfix/transport-wiki-verp-bounce-handler, hash:/etc/postfix/transport-donate, hash:/etc/postfix/transport-phabricator, hash:/etc/postfix/transport-gmail, hash:/var/lib/postfix/transport-vrt, hash:/etc/postfix/transport-recipient-discards

(in main.cf of mx-in)
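To illustrate what "order" means in this exchange, here is a minimal sketch, not the real Postfix lookup. It assumes the tables listed in transport_maps are consulted in order and the first one with a matching entry wins, and it only models exact-address keys (Postfix also tries domain-level keys, which are ignored here). The table contents and transport names are hypothetical; the real files are not shown in this task.

```python
from typing import Optional

# Hypothetical table contents, listed in the same order as in transport_maps.
TABLES = [
    ("transport-phabricator", {"task@example.org": "phab-pipe:"}),
    ("transport-gmail", {"info-ko@wikimedia.org": "smtp:gmail-relay"}),
    ("transport-vrt", {"info-ko@wikimedia.org": "smtp:vrts-host"}),
]


def lookup_transport(address: str, tables) -> Optional[tuple]:
    """Return (table name, transport) from the first table that has the address."""
    for name, table in tables:
        if address in table:
            return name, table[address]
    return None


# With gmail listed first, the gmail entry shadows the VRT one.
print(lookup_transport("info-ko@wikimedia.org", TABLES))

# Moving the VRT table ahead of gmail changes the winner...
vrt_first = [TABLES[0], TABLES[2], TABLES[1]]
print(lookup_transport("info-ko@wikimedia.org", vrt_first))

# ...but only if the VRT table contains the address at all. If the alias
# generator skipped it, reordering changes nothing.
vrt_empty_first = [TABLES[0], ("transport-vrt", {}), TABLES[1]]
print(lookup_transport("info-ko@wikimedia.org", vrt_empty_first))
```

The last case matches the objection above: reordering cannot help when the generated VRT table is missing the address in the first place.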

Can't we just check what was changed yesterday, and undo that?

There is nothing we can find in the past day or so. That's part of the problem :D

So the issue is coming from the vrts_aliases.py cron job. Something has changed in how gmail is responding to emails here and claiming they're valid. I'm going to try to see if we can change that to ignore the gmail check.
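The failure mode described here can be sketched roughly as follows. This is not the actual vrts_aliases script, whose source is not quoted in this task; the function and parameter names are hypothetical. The only behaviour taken from the log line above is that an address is skipped when the Google side appears to handle it.

```python
from typing import Callable, Iterable, List, Tuple


def build_vrts_aliases(
    vrts_addresses: Iterable[str],
    handled_by_gmail: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split VRTS addresses into (aliases to generate, addresses skipped)."""
    aliases, skipped = [], []
    for address in vrts_addresses:
        if handled_by_gmail(address):
            # Mirrors "Skipping, email is handled by gsuite" in the log.
            skipped.append(address)
        else:
            aliases.append(address)
    return aliases, skipped


queues = ["info-ko@wikimedia.org", "oversight-ko-wp@wikimedia.org"]

# Normal case: Google only accepts mailboxes it really knows about.
known_mailboxes = {"someone@wikimedia.org"}  # hypothetical mailbox
normal = build_vrts_aliases(queues, lambda a: a in known_mailboxes)

# Broken case: Google accepts every address, so every queue is skipped.
accept_all = build_vrts_aliases(queues, lambda a: True)

print(normal)
print(accept_all)
```

If the check starts answering "yes" for everything, the generated alias list becomes empty for that domain, which is consistent with every @wikimedia.org queue bouncing at once.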

Change #1091628 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] mx: Update vrts_aliases script not to check gmail for wm.o addresses

https://gerrit.wikimedia.org/r/1091628

Change #1091628 merged by EoghanGaffney:

[operations/puppet@production] mx: Update vrts_aliases script not to check gmail for wm.o addresses

https://gerrit.wikimedia.org/r/1091628

eoghan lowered the priority of this task from Unbreak Now! to High. Nov 15 2024, 11:52 AM

We've made a change to the aliases routing script which we believe has fixed the problem. I've verified that mail is delivering to vrts now, and we've seen two of our test tickets arrive.

I'm going to keep this ticket open (and downgrade from UBN) until we talk about whether the correct thing to do is drop the gmail check entirely, or if we need to come up with an alternative. In the meantime, I think we can consider the mail routing issues solved for now. If anyone still sees issues, please let us know and we'll dig further.

So the issue is coming from the vrts_aliases.py cron job. Something has changed in how gmail is responding to emails here and claiming they're valid. I'm going to try to see if we can change that to ignore the gmail check.

OIT?

Other domains served by google still return

550-5.1.1 The email account that you tried to reach does not exist.

This looks like a wildcard mailbox being added at Google, or an equivalent configuration to that.

So the issue is coming from the vrts_aliases.py cron job. Something has changed in how gmail is responding to emails here and claiming they're valid. I'm going to try to see if we can change that to ignore the gmail check.

OIT?

Office IT, which has been renamed to IT Services, is the WMF department that handles employee accounts and their inboxes (which are Google mailboxes). (EDIT: I know this from my past employment at WMF, and it is probably a group that almost all volunteers never see.)

(I've created meta redirect for Office IT and OIT.)

We're in touch with ITS and will be doing follow up testing with them next week.

Heh, I was actually suggesting that the "Something" that "has changed in how gmail is responding to emails here and claiming they're valid" could be OIT, not asking what OIT meant. But the description may be useful to others. :)
Too bad I used the old OIT name instead of ITS.

We had a quick chat with ITS today where they disabled the change that caused the routing to change, and it did cause gmail to start returning 550 for unknown addresses again, so we have confirmed their change was what caused this to start behaving differently.

They're going to check whether they can adjust their change to stop this undesired behaviour from happening, and if not we'll work with them to try to find a better alternative that suits both sides. In the meantime, we're leaving the gmail check disabled for the wikimedia.org domain, so we shouldn't see another failure of this kind in the short term.

We had a quick chat with ITS today where they disabled the change that caused the routing to change, and it did cause gmail to start returning 550 for unknown addresses again, so we have confirmed their change was what caused this to start behaving differently.

What was the change?

We had a quick chat with ITS today where they disabled the change that caused the routing to change, and it did cause gmail to start returning 550 for unknown addresses again, so we have confirmed their change was what caused this to start behaving differently.

That this took a few days to confirm (weekend excluded) sounds like a result of ticketing fragmentation, season 2. Phab was supposed to solve that fragmentation once and for all, and yet they don't seem to be on Phab… and had they been here, it would have resulted in a faster resolution.

@jhathaway It was a rule set up to change the envelope-to of a mail from a given source. When we disabled the rule, gmail started returning 550s for any address unknown in the wm.o domain, but when the rule was re-enabled, it was back to 250/ok for anything unknown. When we set the "Account types to affect" not to include the catch-all, it started returning 550s again. It's not clear whether leaving the catch-all unchecked is desirable behaviour on the ITS side, waiting to hear back on that. I'm sure there'll be a way to work around it if we have to.
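For anyone wanting to reproduce the 550 versus 250 observation, a recipient probe along these lines is one way to do it. This is a minimal sketch using Python's standard smtplib, not the check that vrts_aliases actually performs; the host and sender arguments are placeholders you would have to supply, and the probe stops before DATA so no mail is sent.

```python
import smtplib


def recipient_accepted(smtp, sender: str, recipient: str) -> bool:
    """Ask an already-connected SMTP server whether it accepts RCPT TO."""
    smtp.helo()
    code, reply = smtp.mail(sender)
    if code != 250:
        raise RuntimeError(f"MAIL FROM rejected: {code} {reply!r}")
    code, _ = smtp.rcpt(recipient)
    smtp.rset()  # abandon the transaction; nothing is delivered
    # 250 means accepted; 550 means the mailbox is unknown.
    return 200 <= code < 300


def probe(host: str, sender: str, recipient: str) -> bool:
    """Connect to `host` on port 25 and probe a single recipient."""
    with smtplib.SMTP(host, 25, timeout=10) as smtp:
        return recipient_accepted(smtp, sender, recipient)
```

With the catch-all behaviour enabled, a probe like this would be expected to return True for any made-up address in the domain, which is exactly what makes it useless as an "is this a real Google mailbox" test.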

@revi While I won't comment about ticket fragmentation, any delay here was more an issue of timezones -- we got an answer from ITS fairly early on Friday afternoon with an indication that a change had been made at a suspiciously close time, but we didn't manage to get a time when we could test things out to confirm in real time on a call until this evening. I'm not convinced we would have been able to confirm it any faster if it was purely in phabricator.

I'd say this or any such problem should not occur again, as we definitely lost tickets, and the actual impact may never be determined. And I think it was pure luck that the issue was detected and identified so quickly. (Thank you very much revi for raising it!)

It was already discussed around T374090 that safeguards are needed, and it appears the existing ones are not sufficient. Can this be improved before any changes go live?

Hi,
I've been informed that someone wrote to permissions-it@wikimedia.org on 2024-11-15 at 12:20 and got an "unreachable address" message.

They were hit by this, and they should just re-send the email (which will route correctly this time).

Change #1097556 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] vrts: Update mail alias generation script to bail on too many changes

https://gerrit.wikimedia.org/r/1097556

Change #1097556 merged by JHathaway:

[operations/puppet@production] vrts: Update mail alias generation script to bail on too many changes

https://gerrit.wikimedia.org/r/1097556
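
The patch title suggests a guard that refuses to apply a new alias list when it differs too much from the current one. The merged change itself is not quoted here, so the following is only a sketch of that general idea: the names, the exception type, and the 10% threshold are all assumptions for illustration.

```python
class TooManyChanges(Exception):
    """Raised when a regenerated alias list drops too many addresses."""


def check_alias_changes(current, proposed, max_removed_fraction=0.1):
    """Return the proposed alias set, or raise if too many would vanish.

    The threshold is a hypothetical default, not the value used in the
    real script.
    """
    current, proposed = set(current), set(proposed)
    removed = current - proposed
    if current and len(removed) / len(current) > max_removed_fraction:
        raise TooManyChanges(
            f"refusing to drop {len(removed)} of {len(current)} aliases"
        )
    return proposed


current_aliases = {f"queue{i}@wikimedia.org" for i in range(20)}

# A small change (one queue retired) goes through.
small_change = current_aliases - {"queue0@wikimedia.org"}
check_alias_changes(current_aliases, small_change)

# Losing every alias at once, as in this incident, is rejected.
try:
    check_alias_changes(current_aliases, set())
except TooManyChanges as exc:
    print(f"bailing out: {exc}")
```

A guard of this kind keeps the previous alias file in place when the upstream check misbehaves, so mail keeps flowing while someone investigates.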