Hi,
Over the last few hours a backlog has built up on the mailing lists; at a quick glance it looks like mailman3 has slowed down. I also just noticed a lot of bounces, and the number is increasing. Could you please take a look?
Thanks
| Superpes15 | |
| Apr 8 2025, 9:23 AM |
| F59018390: missing-emails.png | |
| Apr 8 2025, 7:56 PM |
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| mailman: add MailmanBounceQueueHigh alert | operations/alerts | master | +17 -0 |
It looks like bouncing started today at 01:00 https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=now-24h&to=now&viewPanel=2
I'll check the latest Puppet changes and the service status on the host.
Fixed time dashboard for reference: https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&viewPanel=2&from=1744089600000&to=1744107600000
mailman3.service prints a lot of Python stacktraces starting Apr 07 09:06 UTC
Apr 07 09:06:41 lists1004 mailman3[2696297]: Apr 07 09:06:41 2025 (2696297) (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
Apr 07 09:06:41 lists1004 mailman3[2696297]: (pymysql.err.OperationalError) (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement')
This is a bit earlier than the bouncing increase so it could be unrelated.
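To compare the timing of these read-only errors with the bounce increase, the journal lines can be counted per hour. This is a minimal sketch, assuming the journald line format shown in the excerpt above; the function name `read_only_errors_per_hour` and the sample list are illustrative and not part of any existing tooling.

```python
import re
from collections import Counter

# Journald prefix as seen in the excerpt, e.g.
# "Apr 07 09:06:41 lists1004 mailman3[2696297]:"
PREFIX = re.compile(
    r"^(?P<month>\w{3}) (?P<day>\d{2}) (?P<hour>\d{2}):\d{2}:\d{2} \S+ mailman3\[\d+\]:"
)
# Error code that appears in the excerpt for the --read-only failure.
READ_ONLY_MARKER = "(1290,"


def read_only_errors_per_hour(lines):
    """Count log lines carrying the read-only error, bucketed by hour."""
    counts = Counter()
    for line in lines:
        match = PREFIX.match(line)
        if match and READ_ONLY_MARKER in line:
            bucket = f"{match['month']} {match['day']} {match['hour']}:00"
            counts[bucket] += 1
    return counts


# Sample input copied from the excerpt above.
sample = [
    "Apr 07 09:06:41 lists1004 mailman3[2696297]: Apr 07 09:06:41 2025 (2696297) "
    "(raised as a result of Query-invoked autoflush; consider using a "
    "session.no_autoflush block if this flush is occurring prematurely)",
    "Apr 07 09:06:41 lists1004 mailman3[2696297]: (pymysql.err.OperationalError) "
    "(1290, 'The MariaDB server is running with the --read-only option so it "
    "cannot execute this statement')",
]
print(read_only_errors_per_hour(sample))
```

The input could come from `journalctl -u mailman3.service` on the host; the hourly buckets can then be compared by eye against the Grafana panel linked above.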
Yes, I also noticed this, but tbh it doesn't seem to explain why earlier e-mails were not delivered either (at least that's what I was told), which sounds odd to me. It appears that some e-mails sent before the increase in bounces are showing up in the mailing list archives but not in the members' mailboxes. This occurred about 20 hours ago (~16 UTC)!
Mentioned in SAL (#wikimedia-operations) [2025-04-08T10:33:28Z] <jelto> restart mailman3.service on lists1004 - T391330
I restarted mailman3.service on lists1004 because the service stopped logging any activity right before bouncing increased (Apr 07 23:56:28).
It looks like the metrics are recovering: https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=now-1h&to=now&viewPanel=2
The metrics are back to baseline. So from the system level this issue looks resolved.
I'm lacking a bit of mailman knowledge to verify it processes fresh mails. @Superpes15 you mentioned missing mails in the mailboxes. Can you check for the missing mails again?
A follow-up could also be to add alerting for a high volume of bounced mail. I can create a follow-up task for that once this is resolved.
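The idea behind such an alert can be sketched as a check that only fires when the bounce queue stays above a threshold for several consecutive samples, so that a short spike does not page anyone. This is an illustration only: the function name, threshold, and sample values are hypothetical and are not taken from the actual alert definition.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if at least `min_consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical bounce-queue sizes, one per scrape interval.
spike = [0, 150, 3, 200, 250, 1]
sustained = [0, 2, 150, 200, 300, 5]

print(sustained_breach(spike, threshold=100, min_consecutive=3))
print(sustained_breach(sustained, threshold=100, min_consecutive=3))
```

Requiring a run of consecutive breaches trades a few intervals of detection delay for fewer false alarms.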
ATM the e-mails sent in the last few hours (since the bouncing started) have not yet been delivered to the mailman archives, and consequently they are not present in the mailboxes!
EDIT: New e-mails are delivered correctly, but the backlog has not yet been delivered to the archives (nor to the mailboxes).
Here's a representative example of 2 emails that I noticed are missing from my inbox, but included in the online archives:
@LSobanski I thought this bug was just a brief problem on Monday(?), but the missing emails still haven't appeared, if we were expecting/hoping for that to happen? -- So I'm not sure what you mean by 'improvement'.
I.e. I only noticed a problem with emails that were sent on Monday: specifically those 2 from wikitech-l@ in my screenshot; and yesterday I noticed the wikidata Weekly Summary (from Monday) also isn't in my inbox (but I assumed it was just "all emails sent around a certain time", so I didn't add another comment here about it). Hope that helps.
I deleted all frozen messages older than 14 days, which was 1284 messages, and ran an exim command to make it retry delivery of the messages still stuck in the queue.
Checking now, the mail queue is much smaller than before (hundreds vs. thousands). So the missing mail might have been delivered by now, and if it hasn't, there probably isn't much to do about it.
Looking at the graph for the last 7 days there is nothing out of the ordinary anymore.
https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=now-7d&to=now&viewPanel=2
Please reopen if you disagree and still see an issue.
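For reference, selecting frozen messages older than 14 days, as in the queue cleanup described above, can be sketched by parsing a queue listing. This assumes the usual `exim -bp` layout (age, size, message ID, sender, and a `*** frozen ***` marker on the header line); the sample listing and message IDs below are made up, and the thread does not say which commands were actually run.

```python
import re

# Header line of one queue entry: age, size, message id, sender, frozen marker.
HEADER = re.compile(
    r"^\s*(?P<age>\d+)(?P<unit>[mhd])\s+(?P<size>\S+)\s+(?P<id>\S+)\s+"
    r"<(?P<sender>[^>]*)>(?P<frozen>\s+\*\*\* frozen \*\*\*)?\s*$"
)
MINUTES_PER_UNIT = {"m": 1, "h": 60, "d": 1440}


def old_frozen_ids(bp_output, min_age_days=14):
    """Return IDs of frozen messages whose listed age is at least `min_age_days`."""
    threshold = min_age_days * 1440
    ids = []
    for line in bp_output.splitlines():
        match = HEADER.match(line)
        if not match or not match["frozen"]:
            continue  # recipient line, blank line, or message that is not frozen
        age_minutes = int(match["age"]) * MINUTES_PER_UNIT[match["unit"]]
        if age_minutes >= threshold:
            ids.append(match["id"])
    return ids


# Made-up listing in the assumed `exim -bp` format.
sample = """\
20d  4.1K 1aAAAA-000001-AA <> *** frozen ***
          someone@example.org

 3d  2.0K 1bBBBB-000002-BB <> *** frozen ***
          someone@example.org

30d  1.2K 1cCCCC-000003-CC <list-bounces@example.org>
          other@example.org
"""
print(old_frozen_ids(sample))
```

The resulting IDs could then be passed to `exim -Mrm` for removal, and `exim -qff` forces a delivery attempt for everything left in the queue, including frozen messages.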
Change #1137212 had a related patch set uploaded (by Jelto; author: Jelto):
[operations/alerts@master] mailman: add MailmanBounceQueueHigh alert
Change #1137212 merged by jenkins-bot:
[operations/alerts@master] mailman: add MailmanBounceQueueHigh alert