Page MenuHomePhabricator

Mailman3 bounce runner is running very slowly
Closed, ResolvedPublic

Description

Screenshot_2021-05-08 Mailman3 - Grafana.png (668×3 px, 83 KB)

When I looked at first, the bounce runner process had died and not been restarted. Then I restarted mailman3 and now the bounce runner is using a ton of CPU but the queue is still going up.

root@lists1001:/var/lib/mailman3/queue/bounces# ps -axjf | grep runner
14967 14969 14967 14967 ?           -1 S       38   0:01  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=archive:0:1
14967 14970 14967 14967 ?           -1 R       38  18:11  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=bounces:0:1
14967 14971 14967 14967 ?           -1 S       38   0:01  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=command:0:1
14967 14972 14967 14967 ?           -1 S       38   0:03  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=in:0:1
14967 14973 14967 14967 ?           -1 Sl      38   0:09  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=lmtp:0:1
14967 14974 14967 14967 ?           -1 S       38   0:01  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=nntp:0:1
14967 14975 14967 14967 ?           -1 S       38   0:07  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=out:0:1
14967 14976 14967 14967 ?           -1 S       38   0:02  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=pipeline:0:1
14967 14977 14967 14967 ?           -1 S       38   0:01  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=rest:0:1
14977 14997 14967 14967 ?           -1 R       38   0:22  |   \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=rest:0:1
14977 14998 14967 14967 ?           -1 S       38   0:21  |   \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=rest:0:1
14967 14978 14967 14967 ?           -1 S       38   0:00  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=retry:0:1
14967 14979 14967 14967 ?           -1 S       38   0:01  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=virgin:0:1
14967 14980 14967 14967 ?           -1 S       38   0:02  \_ /usr/bin/python3 /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=digest:0:1

I tried starting the runner manually with a bit more output:

1legoktm@lists1001:~$ sudo -u list /usr/lib/mailman3/bin/runner -C /etc/mailman3/mailman.cfg --runner=bounces:0:1 --once -v
2May 09 06:49:18 2021 (30730) bounces runner started.
3May 09 06:49:18 2021 (30730) bounces runner started.
4May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.postfix.Postfix'>
5May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.postfix.Postfix'>
6May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.simplewarning.SimpleWarning'>
7May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.simplewarning.SimpleWarning'>
8May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.simplematch.SimpleMatch'>
9May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.simplematch.SimpleMatch'>
10May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.llnl.LLNL'>
11May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.llnl.LLNL'>
12May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.microsoft.Microsoft'>
13May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.microsoft.Microsoft'>
14May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.dsn.DSN'>
15May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.dsn.DSN'>
16May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.yahoo.Yahoo'>
17May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.yahoo.Yahoo'>
18May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.sina.Sina'>
19May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.sina.Sina'>
20May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.exim.Exim'>
21May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.exim.Exim'>
22May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.netscape.Netscape'>
23May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.netscape.Netscape'>
24May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.aol.AOL'>
25May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.aol.AOL'>
26May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.qmail.Qmail'>
27May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.qmail.Qmail'>
28May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.smtp32.SMTP32'>
29May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.smtp32.SMTP32'>
30May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.groupwise.GroupWise'>
31May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.groupwise.GroupWise'>
32May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.caiwireless.Caiwireless'>
33May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.caiwireless.Caiwireless'>
34May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.exchange.Exchange'>
35May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.exchange.Exchange'>
36May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.yale.Yale'>
37May 09 06:49:18 2021 (30730) Running detector: <class 'flufl.bounce._detectors.yale.Yale'>
38May 09 06:49:18 2021 (30730) bounces runner exiting.
39May 09 06:49:18 2021 (30730) bounces runner exiting.

Event Timeline

Legoktm triaged this task as Unbreak Now! priority.May 9 2021, 6:49 AM
Legoktm created this task.

https://gitlab.com/mailman/mailman/-/issues/755 suggests upgrading flufl.bounce to 3.0.1 might help, we're on 3.0.0

https://gitlab.com/mailman/mailman/-/issues/887 is about the bounce runner dying and someone posted a ps with similar CPU usage.

Change 687511 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] mailman3: Enable debug logging

https://gerrit.wikimedia.org/r/687511

Change 687511 merged by Legoktm:

[operations/puppet@production] mailman3: Enable debug logging

https://gerrit.wikimedia.org/r/687511

So the bounce runner is running! It's just super slow

May 09 07:31:21 2021 (20029) [BounceRunner] checking short circuit
May 09 07:31:21 2021 (20029) [BounceRunner] processing filebase: 1620343841.1223717+d218e29be4dc05f76b193c3caaeacedb8056e20a
May 09 07:31:21 2021 (20029) [BounceRunner] processing onefile
May 09 07:31:21 2021 (20029) [BounceRunner] finishing filebase: 1620343841.1223717+d218e29be4dc05f76b193c3caaeacedb8056e20a
May 09 07:31:21 2021 (20029) [BounceRunner] doing periodic
May 09 07:31:41 2021 (20029) [BounceRunner] committing transaction
May 09 07:31:41 2021 (20029) [BounceRunner] checking short circuit
May 09 07:31:41 2021 (20029) [BounceRunner] processing filebase: 1620343842.0400379+a83b5ef1cbc5e3b5aaf6eeacf59be155a11f9b4e
May 09 07:31:41 2021 (20029) [BounceRunner] processing onefile
May 09 07:31:41 2021 (20029) [BounceRunner] finishing filebase: 1620343842.0400379+a83b5ef1cbc5e3b5aaf6eeacf59be155a11f9b4e
May 09 07:31:41 2021 (20029) [BounceRunner] doing periodic
May 09 07:32:00 2021 (20029) [BounceRunner] committing transaction
May 09 07:32:00 2021 (20029) [BounceRunner] checking short circuit
May 09 07:32:00 2021 (20029) [BounceRunner] processing filebase: 1620343846.132764+673a3a0d8e7cb47f4e1324f70482f4f039e50020
May 09 07:32:00 2021 (20029) [BounceRunner] processing onefile
May 09 07:32:00 2021 (20029) [BounceRunner] finishing filebase: 1620343846.132764+673a3a0d8e7cb47f4e1324f70482f4f039e50020
May 09 07:32:00 2021 (20029) [BounceRunner] doing periodic

It's taking 20s on the "doing periodic" step which is:

"""Send warnings to disabled users and remove them if needed."""
log.debug('Sending warnings to members with disabled delivery.')
self._processor.send_warnings_and_remove()

I tried to apply the patch from https://gitlab.com/mailman/mailman/-/merge_requests/811 but it made no difference.

My guess is that 20s is a timeout somewhere.

>>> from mailman.interfaces.member import DeliveryStatus, IMembershipManager
>>> from zope.component import getUtility
>>> manager = getUtility(IMembershipManager)
>>> p=manager.memberships_pending_warning()
>>> import time
>>> a=time.time();x=list(p);b=time.time()
>>> b-a
26.814743757247925

The offending code is:

query = store.query(Member).join(
    MailingList, Member.list_id == MailingList._list_id).join(
    Member.preferences).filter(and_(
        MailingList.process_bounces == True,       # noqa: E712
        Member.total_warnings_sent < MailingList.bounce_you_are_disabled_warnings,  # noqa: E501
        Preferences.delivery_status == DeliveryStatus.by_bounces))

# XXX(maxking): This is IMO a query that *should* work, but I haven't
# been able to get it to work in my tests. It could be due to lack of
# native datetime type in SQLite, which we use for tests. Hence, we are
# going to do some filtering in Python, which is inefficient, but
# works. We do filter as much as possible in SQL, so hopefully the
# output of the above query won't be *too* many.
# TODO: Port the Python loop below to SQLAlchemy.

# (func.DATETIME(Member.last_warning_sent) +
# func.DATETIME(MailingList.bounce_you_are_disabled_warnings_interval))
# < func.DATETIME(now())))

for member in query.all():
    if (member.last_warning_sent +
            member.mailing_list.bounce_you_are_disabled_warnings_interval) <= now():   # noqa: E501
        yield member

The first query returns 2,452 rows! Which it then has to filter in Python...

mysql:mailman3@m5-master.eqiad.wmnet [mailman3]> select count(*) from member join mailinglist on member.list_id = mailinglist.list_id join preferences on member.preferences_id = preferences.id where process_bounces=1 and member.total_warnings_sent < mailinglist.bounce_you_are_disabled_warnings and preferences.delivery_status = 3;
+----------+
| count(*) |
+----------+
|     2452 |
+----------+
1 row in set (0.248 sec)
Legoktm renamed this task from Mailman3 bounce runner not working to Mailman3 bounce runner is running very slowly.May 9 2021, 8:49 AM
Legoktm lowered the priority of this task from Unbreak Now! to High.

I live hacked in the following patch:

diff --git a/src/mailman/model/member.py b/src/mailman/model/member.py
index 1adeb6e41..6f70f990f 100644
--- a/src/mailman/model/member.py
+++ b/src/mailman/model/member.py
@@ -240,7 +240,7 @@ class MembershipManager:
         # could have been reset due to bounce info getting stale. We will send
         # warnings to people who have been disabled already, regardless of
         # their bounce score. Same is true below for removal.
-        query = store.query(Member).join(
+        query = store.query(Member, MailingList.bounce_you_are_disabled_warnings_interval).join(
             MailingList, Member.list_id == MailingList._list_id).join(
             Member.preferences).filter(and_(
                 MailingList.process_bounces == True,       # noqa: E712
@@ -259,9 +259,8 @@ class MembershipManager:
         # func.DATETIME(MailingList.bounce_you_are_disabled_warnings_interval))
         # < func.DATETIME(now())))
 
-        for member in query.all():
-            if (member.last_warning_sent +
-                    member.mailing_list.bounce_you_are_disabled_warnings_interval) <= now():   # noqa: E501
+        for member, interval in query.all():
+            if (member.last_warning_sent + interval) <= now():
                 yield member
 
     @dbconnection

Queue is processing about 1 bounce/second now.

The bounce runner died again because it tried to insert a too-big email into the bounceevent table (VARCHAR(255)):

May  9 09:38:37 lists1001 mailman3[19017]: sqlalchemy.exc.DataError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (pymysql.err.DataError) (1406, "Data too long for column 'email' at row 1") [SQL: 'INSERT INTO bounceevent (list_id, email, timestamp, message_id, context, processed) VALUES (%(list_id)s, %(email)s, %(timestamp)s, %(message_id)s, %(context)s, %(processed)s)'] [parameters: {'list_id': 'wikimediaannounce-l.lists.wikimedia.org', 'email': 'p style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: -webkit-standard; font-style: normal; font-variant-caps: normal; font-weight: n ... (67 characters truncated) ... ne; white-space: normal; word-spacing: 0px; -moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; text-decoration: none; text-align: center; "', 'timestamp': datetime.datetime(2021, 5, 9, 9, 38, 37, 601049), 'message_id': '<E1lf5Ua-0004qv-M2@lists1001.wikimedia.org>', 'context': 1, 'processed': 0}] (Background on this error at: http://sqlalche.me/e/9h9h)

that's another bug to file...

Mentioned in SAL (#wikimedia-operations) [2021-05-09T18:28:09Z] <legoktm> systemctl restart mailman3, bounce runner died again (T282348)

Mentioned in SAL (#wikimedia-operations) [2021-05-09T21:44:34Z] <legoktm> restarted mailman3 again (T282348) pymysql.err.InternalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')

Change 688354 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] mailman3: Enable debug runner logs

https://gerrit.wikimedia.org/r/688354

Change 688354 merged by Legoktm:

[operations/puppet@production] mailman3: Enable debug runner logs

https://gerrit.wikimedia.org/r/688354

Mentioned in SAL (#wikimedia-operations) [2021-05-10T21:26:38Z] <legoktm> upgraded flufl.bounce on lists1001 and restarted mailman3 T282348

Current status:

  • I upgraded flufl.bounce which I thought might be better at parsing bounce messages, hopefully causing less crashes and then it unsubscribed me and Amir from listadmins@ for bouncing too much, which I'm pretty sure isn't the case. So I downgraded it back.
    • Also listadmins@ bounce threshold was 1(!) which I bumped to 5.
  • The bounce runner is also crashing because of lock wait timeouts.
  • And while writing this it crashed again.
  • I upgraded flufl.bounce which I thought might be better at parsing bounce messages, hopefully causing less crashes and then it unsubscribed me and Amir from listadmins@ for bouncing too much, which I'm pretty sure isn't the case. So I downgraded it back.
    • Also listadmins@ bounce threshold was 1(!) which I bumped to 5.

I looked at the actual bounce messages and I'm pretty sure they're legit, unfortunately. People are sending spam/vrisus to lists, which gets forwarded to list owners and then the owner's MTA (Gmail) hard rejects it and mailman registers it as a bounce. The best solution to this is to auto-discard posts by non-members rather than holding them. And make sure bounce threshold isn't too low.

So I'm going to re-upgrade flufl.bounce and restart the bounce runner. Upstream seems to be making progress on identifying why it doesn't auto-restart.

Probably more a Feature Request for upstream, but I think mailman3 should parse that rejection message, find out the error is actually due to the specific message it was trying to deliver, and not increment the bounce counter. Giving semantics to the error messages isn't ideal, but I'm not sure that's possible with their enhanced status codes alone. At least, those error messages are very clear on why it is rejecting them. It also means hardcoding the messages used for certain vendors (to which more can be added in the future), but given the prevalence of gmail/gsuite is so large, and this issue will appear often enough, to make this worthwhile.

@Platonides that's being discussed upstream at https://gitlab.com/warsaw/flufl.bounce/-/issues/7 if you want to chime in there :)

I have no idea what's wrong this time, but the bounce handler is back to taking ~20s per message...

I applied https://gitlab.com/mailman/mailman/-/merge_requests/869 on lists1001 - my previous patch just didn't cover all the code paths.

Now crashing with:

May 31 06:26:05 lists1001 mailman3[31349]:   File "/usr/lib/python3/dist-packages/mailman/core/runner.py", line 134, in run
May 31 06:26:05 lists1001 mailman3[31349]:     filecnt = self._one_iteration()
May 31 06:26:05 lists1001 mailman3[31349]:   File "/usr/lib/python3/dist-packages/mailman/core/runner.py", line 206, in _one_iteration
May 31 06:26:05 lists1001 mailman3[31349]:     self._do_periodic()
May 31 06:26:05 lists1001 mailman3[31349]:   File "/usr/lib/python3/dist-packages/mailman/runners/bounce.py", line 100, in _do_periodic
May 31 06:26:05 lists1001 mailman3[31349]:     self._send_warnings()
May 31 06:26:05 lists1001 mailman3[31349]:   File "/usr/lib/python3/dist-packages/mailman/runners/bounce.py", line 116, in _send_warnings
May 31 06:26:05 lists1001 mailman3[31349]:     self._processor.send_warnings_and_remove()
May 31 06:26:05 lists1001 mailman3[31349]:   File "/usr/lib/python3/dist-packages/mailman/database/transaction.py", line 50, in wrapper
May 31 06:26:05 lists1001 mailman3[31349]:     rtn = function(*args, **kws)
May 31 06:26:05 lists1001 mailman3[31349]:   File "/usr/lib/python3/dist-packages/mailman/database/transaction.py", line 85, in wrapper
May 31 06:26:05 lists1001 mailman3[31349]:     return function(args[0], config.db.store, *args[1:], **kws)
May 31 06:26:05 lists1001 mailman3[31349]:   File "/usr/lib/python3/dist-packages/mailman/model/bounce.py", line 224, in send_warnings_and_remove
May 31 06:26:05 lists1001 mailman3[31349]:     self._remove_memberships()
May 31 06:26:05 lists1001 mailman3[31349]:   File "/usr/lib/python3/dist-packages/mailman/database/transaction.py", line 85, in wrapper
May 31 06:26:05 lists1001 mailman3[31349]:     return function(args[0], config.db.store, *args[1:], **kws)
May 31 06:26:05 lists1001 mailman3[31349]:   File "/usr/lib/python3/dist-packages/mailman/model/bounce.py", line 247, in _remove_memberships
May 31 06:26:05 lists1001 mailman3[31349]:     admin_notif=admin_notif, userack=True)
May 31 06:26:05 lists1001 mailman3[31349]:   File "/usr/lib/python3/dist-packages/mailman/app/membership.py", line 130, in delete_member
May 31 06:26:05 lists1001 mailman3[31349]:     raise NotAMemberError(mlist, email)
May 31 06:26:05 lists1001 mailman3[31349]: mailman.interfaces.member.NotAMemberError: ahalfaker@wikimedia.org is not a member of scoring-internal@lists.wikimedia.org

Just going to remove him manually.

Mentioned in SAL (#wikimedia-operations) [2021-05-31T06:33:31Z] <legoktm> manually unsubscribed ahalfaker [at] wikimedia.org from scoring-internal list, triggering mailman bounce loop T282348#7124014

May 31 06:26:05 lists1001 mailman3[31349]: mailman.interfaces.member.NotAMemberError: ahalfaker@wikimedia.org is not a member of scoring-internal@lists.wikimedia.org

With Joe's help, tracked this down to the query selecting members of all roles, including non-member, and then when it operates on those rows, only wanting members. Hot-patched:

diff --git a/src/mailman/model/member.py b/src/mailman/model/member.py
index 44684068e..c363610bc 100644
--- a/src/mailman/model/member.py
+++ b/src/mailman/model/member.py
@@ -245,6 +245,7 @@ class MembershipManager:
             MailingList.bounce_you_are_disabled_warnings_interval).join(
             MailingList, Member.list_id == MailingList._list_id).join(
             Member.preferences).filter(and_(
+                Member.role == MemberRole.member,
                 MailingList.process_bounces == True,       # noqa: E712
                 Member.total_warnings_sent < MailingList.bounce_you_are_disabled_warnings,  # noqa: E501
                 Preferences.delivery_status == DeliveryStatus.by_bounces))
@@ -277,6 +278,7 @@ class MembershipManager:
             MailingList.bounce_you_are_disabled_warnings).join(
             MailingList, Member.list_id == MailingList._list_id).join(
             Member.preferences).filter(and_(
+                Member.role == MemberRole.member,
                 MailingList.process_bounces == True,    # noqa: E712
                 Member.total_warnings_sent >= MailingList.bounce_you_are_disabled_warnings,     # noqa: E501
                 Preferences.delivery_status == DeliveryStatus.by_bounces))

Will upstream that tomorrow.

Filed upstream at https://gitlab.com/mailman/mailman/-/issues/910, waiting for my other MR to be merged before submitting this new one.

Legoktm claimed this task.

All of the issues we've run into so far have been patched upstream and deployed to lists.wm.o, so I'm going to close this. In general the bounce runner has been pretty stable over the past two weeks.