VRTS is spammed with bounce e-mails and is going to break
Open, MediumPublic

Assigned To
Authored By
Krd
Oct 28 2025, 8:09 PM
Referenced Files
F70139752: info pt.png
Nov 12 2025, 1:24 PM
F70139751: infi nl.png
Nov 12 2025, 1:24 PM
F70139750: info pl.png
Nov 12 2025, 1:24 PM
F70070832: Junk.png
Nov 10 2025, 8:44 AM
F69754786: image.png
Nov 3 2025, 10:07 AM
F68047951: Junk.png
Oct 28 2025, 8:13 PM

Description

There are 318k tickets in the Junk queue currently, while a normal number is around 20k.

All of them are some kind of bounces, perhaps a loop. Unable to determine this without assistance.

Ticket numbers appear to have become 17 digits instead of 16 because there have been more than 100k tickets within a day, which causes major disruption.
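The length change can be illustrated with a small sketch. It assumes the ticket number is laid out as an 8-digit date, a 2-digit system ID, a daily counter of at least 5 digits, and a 1-digit checksum. That layout is an assumption inferred from the numbers quoted in this task, not something confirmed here, and the function name `split_ticket_number` is hypothetical.

```shell
#!/usr/bin/env bash
# Split a ticket number into its ASSUMED parts (layout not confirmed in this task):
#   YYYYMMDD (8 digits) + system ID (2) + daily counter (5 or more) + checksum (1)
split_ticket_number() {
    local tn="$1"
    local len=${#tn}
    # Accept only all-digit strings that are long enough for the assumed layout.
    if [[ ! "$tn" =~ ^[0-9]+$ ]] || (( len < 16 )); then
        echo "not a valid ticket number: $tn" >&2
        return 1
    fi
    local date="${tn:0:8}"
    local system_id="${tn:8:2}"
    local counter="${tn:10:len-11}"   # everything between system ID and checksum
    local checksum="${tn:len-1:1}"
    echo "date=$date system_id=$system_id counter=$counter checksum=$checksum digits=$len"
}

# Ticket number reported later in this task.
split_ticket_number 20251028101571634
```

Under this assumed layout, a daily counter above 99,999 needs a sixth digit, which would push the whole number from 16 to 17 digits.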

Observed production impacts:

  • Delay in ability for agents to log on to the system
  • Unable to contact customers / Agents not getting queue notifications (see T408967)

Event Timeline

@Krd thanks, I'm investigating, not sure of the cause either.

This comment was removed by Krd.

It appears to me that we are accepting bounces from phishing e-mails sent with the fake sender info@wikipedia.org.

The 219.240.37.89 looks like a common factor. Can we block this source IP for SMTP as a first measure?

done, though a proper patch still needs to be cut

Change #1199507 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] postfix: add rspamd network discard map

https://gerrit.wikimedia.org/r/1199507

Change #1199507 merged by JHathaway:

[operations/puppet@production] postfix: add rspamd network discard map

https://gerrit.wikimedia.org/r/1199507

We need an analysis of what exactly happened, and perhaps a strategy for not accepting such fake bounces at all.

We also need monitoring that detects unusual e-mail rates and alerts SRE.
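As a rough illustration of the kind of check being asked for, here is a minimal threshold sketch. It is not the alert rule that was later uploaded for this task. The function name `check_queue_size` and the default factor of 5 are hypothetical, and the numbers in the example call are the Junk queue figures from the task description (318k observed, about 20k normal).

```shell
#!/usr/bin/env bash
# Hypothetical threshold check, NOT the alert rule used in production.
# Prints a status line and returns non-zero when the current queue size
# is above factor * baseline.
check_queue_size() {
    local current="$1" baseline="$2" factor="${3:-5}"
    local limit=$(( baseline * factor ))
    if (( current > limit )); then
        echo "ALERT: queue size $current exceeds limit $limit"
        return 1
    fi
    echo "OK: queue size $current is within limit $limit"
}

# Figures from the task description: 318k observed, about 20k normal.
check_queue_size 318000 20000 || true
```

A real alert would more likely watch the growth rate over time rather than a single reading, but a fixed multiple of the normal size already separates the two figures above.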

Also one ticket in he-queue - 20251028101571634

I just checked and the junk queue is close to 500k at this time.

Here's the increase in disk space and inode usage since October 27th:

image.png (224×428 px, 17 KB)

Following up on progress at #wikimedia-sre ; expected resource is not yet on shift. Let's give them some time before next escalation.

Change #1201083 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/puppet@production] spamassassin: add multi.uribl.com to deny list

https://gerrit.wikimedia.org/r/1201083

Change #1201087 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/alerts@master] vrts: alert on vrts junk queue size

https://gerrit.wikimedia.org/r/1201087

@Krd I see the junk mail queue is now at 600k. How can I help clear it out? I saw that some of the scheduled jobs were run, but that does not seem to be enough. Also, feel free to contact me on IRC for some real-time triaging.

I think the focus should be on determining whether the queue size is the cause of the impact, or whether there is another issue. The queue size itself is not a problem as long as it stops growing.

Change #1201083 merged by AOkoth:

[operations/puppet@production] spamassassin: add multi.uribl.com to deny list

https://gerrit.wikimedia.org/r/1201083

After some analysis today, I think the cause of the bounces was as follows:

  1. Spammers set their Return-Path to info@wikimedia.org, which caused their bounce traffic to hit the info queue
  2. VRTS bounce traffic was redirected to the Junk queue
  3. VRTS users who monitored the Junk queue received so many emails that their email service providers rate limited their inboxes.
  4. The rate limiting of VRTS users caused more bounces to be created, which were also sent to Junk, amplifying the problem.

Steps taken:

  1. Created a new bounces queue and modified the filters to direct bounces to that queue
  2. Created two Generic Agent jobs to clean out existing bounces in Junk, "Delete postmaster@ in Junk" and "Delete MAILER-DAEMON backscatter in Junk" running every 5 minutes
  3. Removed bounces from the outbound queue

Spammers set their Return-Path to info@wikimedia.org, which caused their bounce traffic to hit the info queue

If that’s the case, shouldn’t we just refuse bouncing to/bounces from suspicious addresses?

Thanks, I think the delay of mails from VRT is solved. I got my notifications at once.

Perhaps the junk queue should not be allowed to send agent notifications?

If I checked correctly, nobody is subscribed to the Junk queue, so no notifications for that should have been created.
If there were any, I think that should be investigated further.

The one account that has caused quite a bit of the backscatter, because it is being rate limited, appears to be subscribed to the Junk queue as well as 78 other queues. Given that their mail is not reaching them, is there a procedure for disabling their account?

Please provide in private who that is and how you found the information.

Shared via a private paste.

It appears my previous query must have been wrong. Please stand by an hour or two.

I have unsubscribed the mentioned user. This appears to be the only one, and I will monitor this from now on. It makes no sense to get copies of thousands of spam messages each day.

Great, thanks, it would be great if there was an admin toggle to disable subscribing, but I didn't see one.

@Krd we are still receiving bounces for that user as their email rate is still too high. Do they need to subscribe to the 77 remaining queues? Could we perhaps unsubscribe them from all, and pop them a note to resubscribe?

My personal opinion is that we should disable notifications completely, but this perhaps isn't consensus.

I will do as requested for the mentioned user.

The bounces queue is at 292k now, and increasing. Please have a look.

We are working on it; the alert notified us.

I am really really ignorant about postfix so please bear with me :)

I ran:

elukey@mx-in1001:~$ for queue in $(sudo postqueue -j | jq -r 'select(.recipients[0].address == "info@wikipedia.org") | select(.recipients[1].address == null) | .queue_id'); do sudo postcat -q "$queue" | grep log_client_address | cut -d "=" -f 2; done > T408632_info_wikipedia_org.log

After some sort/uniq in bash, two IPs seem to be causing most of the spam to info@wikipedia.org, so my next step would be to block them via puppet private.
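The sort/uniq step could look roughly like the sketch below. The function name `top_client_ips` is hypothetical, and the demo addresses are documentation-range placeholders, not the IPs found in this incident. It expects one client address per line on stdin, which is the shape the log file written by the command above should have.

```shell
#!/usr/bin/env bash
# Count how often each client IP appears and list the most frequent first.
# Reads one IP address per line on stdin and prints "count ip" lines.
top_client_ips() {
    local limit="${1:-10}"
    # uniq -c only merges adjacent duplicates, so the input is sorted first.
    # awk strips the padding that uniq -c puts in front of the count.
    sort | uniq -c | sort -rn | head -n "$limit" | awk '{print $1, $2}'
}

# Demo with placeholder addresses from the documentation ranges.
printf '%s\n' 192.0.2.1 198.51.100.7 192.0.2.1 192.0.2.1 198.51.100.7 203.0.113.9 \
    | top_client_ips 2
```

Against the real data this would be run as `top_client_ips 5 < T408632_info_wikipedia_org.log`.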

Mentioned in SAL (#wikimedia-operations) [2025-11-10T13:11:44Z] <elukey> restart postfix on mx-in2001 to apply an IP ban - T408632

Mentioned in SAL (#wikimedia-operations) [2025-11-10T13:12:24Z] <elukey> restart postfix on mx-in1001 to apply an IP ban - T408632

Judging from the metrics it seems to me that the queues stopped growing, and they are slowly getting processed. Let's wait a bit more to see if the mitigation worked as expected.

I have no idea how to create cleanup jobs like Jesse indicated in T408632#11338645.

Looks like we are back in acceptable ranges again! Please let me know if anything is missing.

VRT ticket numbers have 17 digits again:
20251110103208628
20251110103208173

I just checked and the junk queue is at a reasonable size.

We still need to look into a long term solution for problems like this one.

how
pl 5/4 --> 2/1
pt 109/108 --> 11/10
nl - my mistake

There is a subqueue which counts for the headline but not for the queue view.

My impression was that it's not OK. But if it's OK, then we are done.

jhathaway lowered the priority of this task from High to Medium. Nov 17 2025, 3:37 PM

Change #1201087 merged by jenkins-bot:

[operations/alerts@master] vrts: alert on vrts junk queue size

https://gerrit.wikimedia.org/r/1201087

Change #1217548 had a related patch set uploaded (by AOkoth; author: AOkoth):

[operations/alerts@master] collab: add vrts junk queue alert

https://gerrit.wikimedia.org/r/1217548

Change #1217548 merged by jenkins-bot:

[operations/alerts@master] collab: add vrts junk queue alert

https://gerrit.wikimedia.org/r/1217548