Follow up from T222983#8679437; it seems that the bot is not working. Help please? :)
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T253824 planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1) |
| Resolved | | ayounsi | T254013 all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 |
| Resolved | | ayounsi | T317175 Junos: resolve DNS through mgmt_junos |
| Resolved | | ayounsi | T327862 Use mgmt_junos on all network devices |
| Restricted Task | | | |
| Open | | None | T316539 Upgrade network devices to Junos 20+ |
| Resolved | | ayounsi | T327248 eqiad/codfw virtual-chassis upgrades |
| Resolved | | Clement_Goubert | T327920 March 2023 Datacenter Switchover |
| Resolved | | cmooney | T329073 eqiad row A switches upgrade |
| Resolved | | Marostegui | T331626 Mailman hasn't delivered emails since 2023-03-07 14 UTC (was: reviewer-bot is not working) |
Event Timeline
Not entirely sure what's going on. The bot is running but there do not seem to be any emails coming in:
https://gerrit-reviewer-bot.toolforge.org/
--> Is the mediawiki-commits mailing list broken, or did Gerrit stop posting there?
Unfortunately there are no archives at https://lists.wikimedia.org/postorius/lists/mediawiki-commits.lists.wikimedia.org/
I know nothing about mediawiki-commits; it doesn't show up in Mailman 3 at all, and the archive shows 0 discussions at https://lists.wikimedia.org/hyperkitty/list/mediawiki-commits@lists.wikimedia.org/
On the Gerrit side I have no idea how emails get sent, but there is a commitwatcher account with that email address, whose watch config is:
$ cd All-Users
$ git fetch origin refs/users/38/338
$ git show FETCH_HEAD:watch.config
[project "All-Projects"]
    notify = * [NEW_CHANGES, SUBMITTED_CHANGES]
I don't think we have any logs of Gerrit's email processing, but I can at least confirm it has no pending tasks for sending email, which implies emails are being sent properly. The SMTP host is localhost, which is served by exim4. I can see emails being sent there and apparently delivered properly to mx1001.wikimedia.org.
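For reference, a quick way to cross-check exim's view of outgoing mail from the Gerrit host is to ask it for its queue. A minimal sketch, assuming the stock Debian exim4 binary name (adjust for other installs):

```shell
# Check whether exim is sitting on undelivered mail.
# A stuck/growing queue would suggest the problem is local rather than
# downstream at the MX or the mailing list.
if command -v exim4 >/dev/null 2>&1; then
  queued=$(exim4 -bpc)   # -bpc prints the number of messages in the queue
else
  queued="unknown (exim4 not installed on this host)"
fi
echo "exim queue: $queued"
```

An empty (or near-empty) queue here is consistent with the observation above that exim is handing mail off to mx1001 correctly.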
Maybe it is an issue with the Wikimedia mail server (mx1001), or a problem with the mailing list / Mailman.
I am guessing it is an issue with Mailman. https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 shows a large queue since March 7th 14:12:
There was a switch upgrade at that time (T329073) which might have affected lists1001.wikimedia.org.
Mentioned in SAL (#wikimedia-operations) [2023-03-09T16:02:59Z] <marostegui> Restart mailman service T331626
Icinga says OK: mailman3 queues are below the limits, but there is an alert about the runners:
PROCS CRITICAL: 13 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner'
Last State Change: 2023-03-07 14:37:21
Last Notification: 2023-03-07 14:43:18 (notification 1)
I guess something is stalled. That doesn't explain why the queue alert did not trigger though :)
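When the Grafana/Icinga signals disagree like this, the queue can also be inspected directly on disk. A sketch, assuming the Debian-default Mailman 3 queue layout under /var/lib/mailman3/queue (the path is an assumption; adjust for other installs):

```shell
# count_queue: print the number of queued message pickles (*.pck files)
# in a Mailman 3 queue directory, e.g. the "out" queue the runner drains.
count_queue() {
  dir="${1:-/var/lib/mailman3/queue/out}"  # assumed default location
  if [ -d "$dir" ]; then
    find "$dir" -maxdepth 1 -type f -name '*.pck' | wc -l
  else
    echo "queue dir not found: $dir" >&2
    return 1
  fi
}

count_queue /var/lib/mailman3/queue/out 2>/dev/null || true
```

A steadily growing count with runner processes present would match what happened here: the processes existed (so the PROCS check half-worked) but the out runner was no longer draining the directory.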
It looks like the restart I made fixed it, or at least the queue is slowly going down:
https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&viewPanel=2&from=1678364055780&to=1678378446831
Mail should be emitted again, it will take a bit of time to clear the queue though. That can be monitored via https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3
I am marking this resolved.
That one got resolved:
PROCS OK: 14 processes with UID = 38 (list), regex args '/usr/lib/mailman3/bin/runner'
Last State Change: 2023-03-09 16:28:55
Last Notification: N/A (notification 0)
Re-opening just for tracking while we wait for the queue to go down. Also tagging as incident-worthy; I can work on the writeup tomorrow.
This seems to be it:
Mar 7 14:35:31 lists1001 mailman3[430]: Traceback (most recent call last):
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 691, in _read_bytes
Mar 7 14:35:31 lists1001 mailman3[430]:     data = self._rfile.read(num_bytes)
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3.7/socket.py", line 589, in readinto
Mar 7 14:35:31 lists1001 mailman3[430]:     return self._sock.recv_into(b)
Mar 7 14:35:31 lists1001 mailman3[430]: ConnectionResetError: [Errno 104] Connection reset by peer
Mar 7 14:35:31 lists1001 mailman3[430]: During handling of the above exception, another exception occurred:
Mar 7 14:35:31 lists1001 mailman3[430]: Traceback (most recent call last):
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 753, in _commit_impl
Mar 7 14:35:31 lists1001 mailman3[430]:     self.engine.dialect.do_commit(self.connection)
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/sqlalchemy/dialects/mysql/base.py", line 2136, in do_commit
Mar 7 14:35:31 lists1001 mailman3[430]:     dbapi_connection.commit()
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 420, in commit
Mar 7 14:35:31 lists1001 mailman3[430]:     self._read_ok_packet()
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 394, in _read_ok_packet
Mar 7 14:35:31 lists1001 mailman3[430]:     pkt = self._read_packet()
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 657, in _read_packet
Mar 7 14:35:31 lists1001 mailman3[430]:     packet_header = self._read_bytes(4)
Mar 7 14:35:31 lists1001 mailman3[430]:   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 699, in _read_bytes
Mar 7 14:35:31 lists1001 mailman3[430]:     "Lost connection to MySQL server during query (%s)" % (e,))
Mar 7 14:35:31 lists1001 mailman3[430]: pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query ([Errno 104] Connection reset by peer)')
(it goes on)
Yeah, the out runner crashed, triggering the Icinga alert, but it didn't get investigated or escalated until people noticed the lack of mail.
The queue alert thresholds are pretty high simply because the service had historically been unreliable; we can probably lower them now and see how it goes.
There are 2,936 emails in the out queue, it takes ~5.1 seconds to send each, so we're looking at recovery in ~250 minutes or 4.2 hours. Because new mail will keep coming in, it'll probably be closer to 5 hours I'd guess.
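The back-of-the-envelope estimate above can be reproduced directly (the queue size and per-message time are the figures quoted in this comment):

```shell
# Recovery ETA: 2,936 queued messages at ~5.1 s each.
awk -v msgs=2936 -v secs_per_msg=5.1 'BEGIN {
  total = msgs * secs_per_msg            # total drain time in seconds
  printf "ETA: %.0f minutes (%.1f hours)\n", total / 60, total / 3600
}'
# prints "ETA: 250 minutes (4.2 hours)"
```

This is a lower bound, since mail that keeps arriving while the queue drains stretches it out (hence the ~5 hour guess).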
Sent notification to listadmins@, of course most people won't see it until everything has caught up.
https://wikitech.wikimedia.org/wiki/Incidents/2023-03-09_mailman still a draft but mostly complete, please edit, etc.
Not sure if there's anything actionable left to do here. Looking at the Grafana dashboard, though, queue backlogs seem to happen from time to time:
In the last 30 days, the mm3 queue got stuck / stopped working at least 4 times. See https://grafana.wikimedia.org/goto/pmtWIcLVk?orgId=1 and the attached screenshot:
That one was solved after I found out the Mailman queue was full and not being processed.