
Backlog in mailing lists is increasing
Closed, Resolved, Public

Description

Hi,

Over the last few hours a backlog has built up in the mailing lists. After a quick glance, it looks like mailman3 has slowed down. I also just noticed that there are a lot of bounces, and the number is increasing. Could you please take a look?

Thanks

Event Timeline

It looks like bouncing started today at 01:00 https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=now-24h&to=now&viewPanel=2

I'll check the latest Puppet changes and the service status on the host.

mailman3.service prints a lot of Python stacktraces starting Apr 07 09:06 UTC

Apr 07 09:06:41 lists1004 mailman3[2696297]: Apr 07 09:06:41 2025 (2696297) (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
Apr 07 09:06:41 lists1004 mailman3[2696297]: (pymysql.err.OperationalError) (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement')

This is a bit earlier than the bouncing increase so it could be unrelated.
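To compare the start of the stack traces against the bounce graph, the timestamp of the first read-only error can be pulled out of the journal output. This is only a minimal sketch: the function name `first_read_only_error` and the line pattern are assumptions based on the two log lines quoted above, and the year has to be supplied because the syslog-style prefix does not carry one.

```python
import re
from datetime import datetime

# Pattern for the prefix seen in the quoted lines, e.g.
# "Apr 07 09:06:41 lists1004 mailman3[2696297]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ [\w.-]+\[\d+\]: (?P<msg>.*)$"
)


def first_read_only_error(lines, year=2025):
    """Return the timestamp of the first MariaDB read-only error, or None.

    `lines` is any iterable of log lines (hypothetical input, e.g. saved
    journal output). An error is recognised by the MariaDB error code 1290
    together with the "--read-only" text in the message.
    """
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        msg = match.group("msg")
        if "1290" in msg and "--read-only" in msg:
            # The prefix has no year, so it is prepended before parsing.
            return datetime.strptime(f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S")
    return None
```

Feeding it the two lines quoted above would give the time of the second line, since only that one carries the 1290 error.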

Yes, I also noticed this, but tbh it doesn't seem to match the fact that earlier e-mails were not delivered either (at least that's what I was told), which sounds odd to me. It appears that some e-mails sent before the increase in bounces are showing up in the mailing list archives but not in the members' mailboxes. This occurred about 20 hours ago (~16 UTC)!

Mentioned in SAL (#wikimedia-operations) [2025-04-08T10:33:28Z] <jelto> restart mailman3.service on lists1004 - T391330

I restarted mailman3.service on lists1004 because the service stopped logging any activity right before bouncing increased (Apr 07 23:56:28).

It looks like the metrics are recovering: https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=now-1h&to=now&viewPanel=2

The metrics are back to baseline. So from the system level this issue looks resolved.

I'm lacking a bit of mailman knowledge to verify it processes fresh mails. @Superpes15 you mentioned missing mails in the mailboxes. Can you check for the missing mails again?

Also a followup could be to create some alerting for a lot of bounced mail. I can create a followup task for that once this is resolved.
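As a rough illustration of what such an alert could check, here is a minimal sketch of a "queue above threshold for a sustained period" condition. Everything in it is an assumption: the function name `bounce_alert_firing`, the sample format, and the example threshold and hold time are made up for illustration and are not taken from any existing alert rule.

```python
def bounce_alert_firing(samples, threshold, hold_seconds):
    """Decide whether a bounce-queue alert should be firing.

    `samples` is an iterable of (unix_timestamp, queue_size) pairs in
    chronological order (hypothetical metric data). The alert fires only
    if the queue size has stayed above `threshold` for at least
    `hold_seconds` up to the most recent sample, so short spikes are ignored.
    """
    breach_start = None
    firing = False
    for timestamp, queue_size in samples:
        if queue_size > threshold:
            if breach_start is None:
                breach_start = timestamp  # first sample of this breach
            firing = timestamp - breach_start >= hold_seconds
        else:
            # Queue dropped back below the threshold: reset the breach window.
            breach_start = None
            firing = False
    return firing
```

For example, with a threshold of 100 and a hold time of 120 seconds, samples of 500, 600 and 700 taken one minute apart would fire, while a single spike followed by a drop would not.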

I'm lacking a bit of mailman knowledge to verify it processes fresh mails. @Superpes15 you mentioned missing mails in the mailboxes. Can you check for the missing mails again?

At the moment, the e-mails sent in the last few hours (since the bouncing started) have not yet been delivered to the mailman archives, and consequently they are not present in the mailboxes!

EDIT: New e-mails are delivered correctly, but the backlog has not yet been delivered to the archives (nor to the mailboxes).

Here's a representative example of 2 emails that I noticed are missing from my inbox, but included in the online archives:

missing-emails.png (499×1 px, 125 KB)

Apr 07 09:06:41 lists1004 mailman3[2696297]: (pymysql.err.OperationalError) (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement')

That sounds like T391237: m5 master db1228 rebooted itself

@Quiddity have you seen any improvement since the last time we checked?

@LSobanski I thought this bug was just a brief problem on Monday(?), but the missing emails still haven't appeared, if we were expecting/hoping for that to happen. So I'm not sure what you mean by 'improvement'.
I.e. I only noticed a problem with emails that were sent on Monday: specifically those 2 from wikitech-l@ in my screenshot. Yesterday I noticed that the wikidata Weekly Summary (from Monday) also isn't in my inbox, but I assumed it was just "all emails sent around a certain time", so I didn't add another comment here about it. Hope that helps.

I deleted all frozen messages older than 14 days, which was 1284 messages. I then ran an exim command to make it retry delivery of the messages still stuck in the queue.
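The exact commands are not recorded in this comment. As a hedged sketch, one common way to do this kind of cleanup uses `exiqgrep` (`-z` frozen only, `-o` older than N seconds, `-i` message IDs only), `exim -Mrm` to remove messages, and `exim -qff` to force a queue run that also retries frozen messages. The helper below only builds the command strings and runs nothing; its name `frozen_cleanup_plan` is hypothetical, and on Debian hosts the binary may be called `exim4` rather than `exim`.

```python
import shlex

SECONDS_PER_DAY = 86400


def frozen_cleanup_plan(max_age_days=14, exim_binary="exim"):
    """Build (but do not run) shell commands for a frozen-message cleanup.

    Returns a list of two command strings:
      1. remove frozen messages older than `max_age_days`
      2. force a queue run so remaining messages are retried
    """
    max_age_seconds = max_age_days * SECONDS_PER_DAY
    # List IDs of frozen messages older than the cutoff.
    list_frozen = ["exiqgrep", "-z", "-o", str(max_age_seconds), "-i"]
    # Remove each listed message; GNU xargs -r skips the run if nothing is listed.
    remove = ["xargs", "-r", exim_binary, "-Mrm"]
    # Forced queue run that includes frozen messages.
    retry = [exim_binary, "-qff"]
    return [
        f"{shlex.join(list_frozen)} | {shlex.join(remove)}",
        shlex.join(retry),
    ]


if __name__ == "__main__":
    for command in frozen_cleanup_plan():
        print(command)
```

Printing the plan first makes it possible to review the cutoff (14 days is 1209600 seconds) before anything is removed from the queue.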

Checking now, the mail queue is much smaller than before (hundreds vs. thousands). So the missing mail might have been delivered by now, and if it hasn't, there probably isn't much more to do about it.

Dzahn claimed this task.

Looking at the graph for the last 7 days, there is nothing out of the ordinary anymore.

https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=now-7d&to=now&viewPanel=2

Please reopen if you disagree and still see an issue.

Dzahn removed Dzahn as the assignee of this task. (Apr 15 2025, 8:33 PM)

Change #1137212 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/alerts@master] mailman: add MailmanBounceQueueHigh alert

https://gerrit.wikimedia.org/r/1137212

Change #1137212 merged by jenkins-bot:

[operations/alerts@master] mailman: add MailmanBounceQueueHigh alert

https://gerrit.wikimedia.org/r/1137212