Follow up from T222983#8679437; it seems that the bot is not working. Help please? :)
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T253824 planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) |
| Resolved | | ayounsi | T254013 all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 |
| Resolved | | ayounsi | T317175 Junos: resolve DNS through mgmt_junos |
| Resolved | | ayounsi | T327862 Use mgmt_junos on all network devices |
| Restricted Task | | | |
| Open | | None | T316539 Upgrade network devices to Junos 20+ |
| Resolved | | ayounsi | T327248 eqiad/codfw virtual-chassis upgrades |
| Resolved | | Clement_Goubert | T327920 March 2023 Datacenter Switchover |
| Resolved | | cmooney | T329073 eqiad row A switches upgrade |
| Resolved | | Marostegui | T331626 Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) |
Event Timeline
Not entirely sure what's going on. The bot is running but there do not seem to be any emails coming in:
https://gerrit-reviewer-bot.toolforge.org/
--> Is the mediawiki-commits mailing list broken, or did Gerrit stop posting there?
Unfortunately there are no archives at https://lists.wikimedia.org/postorius/lists/mediawiki-commits.lists.wikimedia.org/
I know nothing about mediawiki-commits; it doesn't show up in Mailman 3 at all, and the archive shows 0 discussions at https://lists.wikimedia.org/hyperkitty/list/mediawiki-commits@lists.wikimedia.org/
On the Gerrit side I have no idea how emails get sent, but there is a commitwatcher account with that email address, whose watch config is:
$ cd All-Users
$ git fetch origin refs/users/38/338
$ git show FETCH_HEAD:watch.config
[project "All-Projects"]
    notify = * [NEW_CHANGES, SUBMITTED_CHANGES]
I don't think we have any logs of Gerrit's email processing, but I can at least confirm it has no pending tasks for sending email, which implies emails are being sent properly. The SMTP host is localhost, which is served by exim4. I can see emails being sent there and apparently delivered properly to mx1001.wikimedia.org.
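For reference, a quick way to cross-check exim's view of outgoing mail from the Gerrit host is to ask it for its queue. A minimal sketch, assuming the stock Debian exim4 binary name (adjust for other installs):

```shell
# Check whether exim is sitting on undelivered mail.
# A stuck/growing queue would suggest the problem is local rather than
# downstream at the MX or the mailing list.
if command -v exim4 >/dev/null 2>&1; then
  queued=$(exim4 -bpc)   # -bpc prints the number of messages in the queue
else
  queued="unknown (exim4 not installed on this host)"
fi
echo "exim queue: $queued"
```

An empty (or near-empty) queue here is consistent with the observation above that exim is handing mail off to mx1001 correctly.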
Maybe it is an issue with the Wikimedia mail server (mx1001), or a problem with the mailing list / Mailman.
I am guessing it is an issue with Mailman. https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 shows a large queue since March 7th 14:12:
There was a switch upgrade at that time (T329073) which might have affected lists1001.wikimedia.org.
Mentioned in SAL (#wikimedia-operations) [2023-03-09T16:02:59Z] <marostegui> Restart mailman service T331626
Icinga says OK: mailman3 queues are below the limits, but there is an alert about the runners:
PROCS CRITICAL: 13 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner'
Last State Change: 2023-03-07 14:37:21
Last Notification: 2023-03-07 14:43:18 (notification 1)
I guess something is stalled. That doesn't explain why the queue alert did not trigger though :)
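When the Grafana/Icinga signals disagree like this, the queue can also be inspected directly on disk. A sketch, assuming the Debian-default Mailman 3 queue layout under /var/lib/mailman3/queue (the path is an assumption; adjust for other installs):

```shell
# count_queue: print the number of queued message pickles (*.pck files)
# in a Mailman 3 queue directory, e.g. the "out" queue the runner drains.
count_queue() {
  dir="${1:-/var/lib/mailman3/queue/out}"  # assumed default location
  if [ -d "$dir" ]; then
    find "$dir" -maxdepth 1 -type f -name '*.pck' | wc -l
  else
    echo "queue dir not found: $dir" >&2
    return 1
  fi
}

count_queue /var/lib/mailman3/queue/out 2>/dev/null || true
```

A steadily growing count with runner processes present would match what happened here: the processes existed (so the PROCS check half-worked) but the out runner was no longer draining the directory.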
It looks like the restart I made fixed it, or at least the queue is slowly going down:
https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&viewPanel=2&from=1678364055780&to=1678378446831
Mail should be emitted again, it will take a bit of time to clear the queue though. That can be monitored via https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
I am marking this resolved.
That one got resolved:
PROCS OK: 14 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner'
Last State Change: 2023-03-09 16:28:55
Last Notification: N/A (notification 0)
Re-opening just for tracking while we wait for the queue to go down. Also tagging as incident-worthy; I can work on the writeup tomorrow.
This seems to be it:
Mar 7 14:35:31 lists1001 mailman3[430]: Traceback (most recent call last):
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 691, in _read_bytes
Mar 7 14:35:31 lists1001 mailman3[430]:     data = self._rfile.read(num_bytes)
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3.7/socket.py", line 589, in readinto
Mar 7 14:35:31 lists1001 mailman3[430]:     return self._sock.recv_into(b)
Mar 7 14:35:31 lists1001 mailman3[430]: ConnectionResetError: [Errno 104] Connection reset by peer
Mar 7 14:35:31 lists1001 mailman3[430]: During handling of the above exception, another exception occurred:
Mar 7 14:35:31 lists1001 mailman3[430]: Traceback (most recent call last):
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 753, in _commit_impl
Mar 7 14:35:31 lists1001 mailman3[430]:     self.engine.dialect.do_commit(self.connection)
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/sqlalchemy/dialects/mysql/base.py", line 2136, in do_commit
Mar 7 14:35:31 lists1001 mailman3[430]:     dbapi_connection.commit()
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 420, in commit
Mar 7 14:35:31 lists1001 mailman3[430]:     self._read_ok_packet()
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 394, in _read_ok_packet
Mar 7 14:35:31 lists1001 mailman3[430]:     pkt = self._read_packet()
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 657, in _read_packet
Mar 7 14:35:31 lists1001 mailman3[430]:     packet_header = self._read_bytes(4)
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 699, in _read_bytes
Mar 7 14:35:31 lists1001 mailman3[430]:     "Lost connection to MySQL server during query (%s)" % (e,))
Mar 7 14:35:31 lists1001 mailman3[430]: pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query ([Errno 104] Connection reset by peer)')
(it goes on)
Yeah, the out runner crashed, triggering the Icinga alert, but it didn't get investigated or escalated until people noticed the lack of mail.
The queue alert thresholds are pretty high simply because the service had historically been unreliable; we can probably lower them now and see how it goes.
There are 2,936 emails in the out queue, it takes ~5.1 seconds to send each, so we're looking at recovery in ~250 minutes or 4.2 hours. Because new mail will keep coming in, it'll probably be closer to 5 hours I'd guess.
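The back-of-the-envelope estimate above can be reproduced directly (the queue size and per-message time are the figures quoted in this comment):

```shell
# Recovery ETA: 2,936 queued messages at ~5.1 s each.
awk -v msgs=2936 -v secs_per_msg=5.1 'BEGIN {
  total = msgs * secs_per_msg            # total drain time in seconds
  printf "ETA: %.0f minutes (%.1f hours)\n", total / 60, total / 3600
}'
# prints "ETA: 250 minutes (4.2 hours)"
```

This is a lower bound, since mail that keeps arriving while the queue drains stretches it out (hence the ~5 hour guess).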
Sent notification to listadmins@, of course most people won't see it until everything has caught up.
https://wikitech.wikimedia.org/wiki/Incidents/2023-03-09_mailman still a draft but mostly complete, please edit, etc.
Not sure if there's anything actionable left to do here. Looking at the Grafana dashboard, though, queue backlogs seem to happen from time to time:
In the last 30 days, the mm3 queue got stuck / stopped working at least 4 times. See https://grafana.wikimedia.org/goto/pmtWIcLVk?orgId=1 and the attached screenshot:
That one was solved after I found out the Mailman queue was full and not being processed.