
Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working)
Closed, Resolved · Public

Description

Follow up from T222983#8679437; it seems that the bot is not working. Help please? :)

Event Timeline

Not entirely sure what's going on. The bot is running but there do not seem to be any emails coming in:

https://gerrit-reviewer-bot.toolforge.org/

--> is the mediawiki-commits mailing list broken or did Gerrit stop posting there?
Unfortunately there are no archives at https://lists.wikimedia.org/postorius/lists/mediawiki-commits.lists.wikimedia.org/

@hashar maybe you know about Gerrit + mediawiki-commits mailing list?

I know nothing about mediawiki-commits; in mailman3 it doesn't show up at all, and the archive shows 0 discussions at https://lists.wikimedia.org/hyperkitty/list/mediawiki-commits@lists.wikimedia.org/

In Gerrit I have no idea how emails are sent, but there is a commitwatcher account with that email address, which has:

$ cd All-Users
$ git fetch origin refs/users/38/338
$ git show FETCH_HEAD:watch.config
[project "All-Projects"]
	notify = * [NEW_CHANGES, SUBMITTED_CHANGES]

I don't think we have any log about Gerrit email processing; at least I can confirm it has no pending task for sending email, which implies emails are properly sent. The SMTP host is localhost, which is served by exim4. I can see emails being sent there and apparently properly delivered to mx1001.wikimedia.org.

Maybe it is an issue with the Wikimedia mail server (mx1001), or it is a problem with the mailing list / Mailman.
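
Regarding the exim4 point above: a quick way to check whether anything is stuck in the local queue is to ask exim for a count. A minimal sketch (run on the Gerrit host, needs the usual exim privileges; Python used purely for illustration):

import subprocess

# "exim -bpc" prints the number of messages currently sitting in the local queue;
# a growing count would point at a local delivery problem rather than something
# downstream (mx1001 or Mailman).
count = subprocess.run(["exim", "-bpc"], capture_output=True, text=True, check=True)
print(count.stdout.strip(), "messages in the local exim queue")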

I am guessing it is an issue with Mailman. https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 shows a large queue since March 7th 14:12:

mailman3_queue_depth.png (386×856 px, 33 KB)

There was a switch upgrade at that time (T329073) which might have affected lists1001.wikimedia.org.

Mentioned in SAL (#wikimedia-operations) [2023-03-09T16:02:59Z] <marostegui> Restart mailman service T331626

Icinga says OK: mailman3 queues are below the limits, but there is an alert about the runners:

PROCS CRITICAL: 13 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner'
Last State Change: 2023-03-07 14:37:21
Last Notification: 2023-03-07 14:43:18 (notification 1)

I guess something is stalled. That doesn't explain why the queue alert does not trigger though :)
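
For context, that check is just counting runner processes owned by the list user; a rough equivalent of the count (the expected value of 14 is an assumption taken from the later PROCS OK output):

import subprocess

EXPECTED = 14  # healthy complement, per the PROCS OK output further down

# Same idea as the Icinga check_procs probe: count processes owned by "list"
# whose command line matches the mailman3 runner path.
result = subprocess.run(
    ["pgrep", "-u", "list", "-f", "/usr/lib/mailman3/bin/runner"],
    capture_output=True, text=True,
)
running = len(result.stdout.split())
print(f"{running} runner processes (expected {EXPECTED})")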

hashar claimed this task.

Mail should be emitted again; it will take a bit of time to clear the queue though. That can be monitored via https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
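
If you'd rather watch it from the host than from Grafana, counting the queue files works too; a minimal sketch, assuming the Debian mailman3 var directory layout:

from pathlib import Path

# Mailman core keeps one pickle file per queued message; the path is an
# assumption based on the Debian packaging defaults.
out_queue = Path("/var/lib/mailman3/queue/out")
backlog = list(out_queue.glob("*.pck"))
print(f"{len(backlog)} messages waiting in the out queue")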

I am marking this resolved

PROCS CRITICAL: 13 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner'
Last State Change: 2023-03-07 14:37:21
Last Notification: 2023-03-07 14:43:18 (notification 1)

That one got resolved:

PROCS OK: 14 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner'
Last State Change: 2023-03-09 16:28:55
Last Notification: N/A (notification 0)

Legoktm renamed this task from reviewer-bot is not working to Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working).Mar 9 2023, 6:22 PM
Legoktm reopened this task as Open.
Legoktm reassigned this task from hashar to Marostegui.
Legoktm triaged this task as Medium priority.
Legoktm removed a project: Gerrit.
Legoktm added subscribers: JJMC89, Vermont, MatthewVernon and 3 others.

Re-opening just for tracking while we wait for the queue to go down. Also tagging as incident-worthy; I can work on the writeup tomorrow.

There was a switch upgrade at that time (T329073) which might have affected lists1001.wikimedia.org.

This seems to be it:

Mar  7 14:35:31 lists1001 mailman3[430]: Traceback (most recent call last):
Mar  7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 691, in _read_bytes
Mar  7 14:35:31 lists1001 mailman3[430]:     data = self._rfile.read(num_bytes)
Mar  7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3.7/socket.py", line 589, in readinto
Mar  7 14:35:31 lists1001 mailman3[430]:     return self._sock.recv_into(b)
Mar  7 14:35:31 lists1001 mailman3[430]: ConnectionResetError: [Errno 104] Connection reset by peer
Mar  7 14:35:31 lists1001 mailman3[430]: During handling of the above exception, another exception occurred:
Mar  7 14:35:31 lists1001 mailman3[430]: Traceback (most recent call last):
Mar  7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 753, in _commit_impl
Mar  7 14:35:31 lists1001 mailman3[430]:     self.engine.dialect.do_commit(self.connection)
Mar  7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/sqlalchemy/dialects/mysql/base.py", line 2136, in do_commit
Mar  7 14:35:31 lists1001 mailman3[430]:     dbapi_connection.commit()
Mar  7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 420, in commit
Mar  7 14:35:31 lists1001 mailman3[430]:     self._read_ok_packet()
Mar  7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 394, in _read_ok_packet
Mar  7 14:35:31 lists1001 mailman3[430]:     pkt = self._read_packet()
Mar  7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 657, in _read_packet
Mar  7 14:35:31 lists1001 mailman3[430]:     packet_header = self._read_bytes(4)
Mar  7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 699, in _read_bytes
Mar  7 14:35:31 lists1001 mailman3[430]:     "Lost connection to MySQL server during query (%s)" % (e,))
Mar  7 14:35:31 lists1001 mailman3[430]: pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query ([Errno 104] Connection reset by peer)')
(it goes on)
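
The traceback shows pymysql losing its TCP connection mid-commit during the network blip and the runner never recovering. Purely as an illustration (not necessarily how Mailman configures its engine), SQLAlchemy, the layer visible in the traceback, can be told to test pooled connections before use so a dead connection is quietly replaced instead of raising:

from sqlalchemy import create_engine, text

# Hypothetical DSN for illustration; the real database URL lives in Mailman's
# configuration, not here.
engine = create_engine(
    "mysql+pymysql://mailman:secret@localhost/mailman3",
    pool_pre_ping=True,   # ping pooled connections and replace dead ones
    pool_recycle=3600,    # recycle connections older than an hour
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # would survive a connection dropped while idle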

Icinga says OK: mailman3 queues are below the limits, but there is an alert about the runners:

PROCS CRITICAL: 13 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner'
Last State Change: 2023-03-07 14:37:21
Last Notification: 2023-03-07 14:43:18 (notification 1)

I guess something is stalled. That doesn't explain why the queue alert does not trigger though :)

Yeah, the out runner crashed, triggering the Icinga alert, but it didn't get investigated/escalated until people noticed the lack of mail.

The queue alert thresholds are pretty high just because the queue had historically been unreliable; we can probably lower them now and see how it goes.

There are 2,936 emails in the out queue and it takes ~5.1 seconds to send each, so we're looking at recovery in ~250 minutes, or ~4.2 hours. Because new mail will keep coming in, it'll probably be closer to 5 hours, I'd guess.
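
For reference, that estimate is just the queue depth multiplied by the per-message send time:

queued = 2936        # messages in the out queue
per_message = 5.1    # seconds to send each one, per the observation above

minutes = queued * per_message / 60
print(f"~{minutes:.0f} minutes, or ~{minutes / 60:.1f} hours")  # ~250 minutes, ~4.2 hours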

Sent a notification to listadmins@; of course, most people won't see it until everything has caught up.

Not sure if there's anything actionable left to do here. Looking at the Grafana dashboard, though, it looks like queue backlogs happen from time to time:

In the last 30 days, it looks like the mm3 queue got stuck / stopped working at least 4 times. See https://grafana.wikimedia.org/goto/pmtWIcLVk?orgId=1 and the attached screenshot:

imagen.png (434×1 px, 82 KB)

That one was solved after I found out that the Mailman queue was full and not being processed.