mailman emails taking long time for delivery, getting stuck in sodium
Closed, Declined (Public)

Description

For the last month or so, the mailing lists have taken quite some time to forward emails. An email sent to a list takes several hours before being sent to its subscribers. It seems to hang, although the headers say otherwise. The sent date and time on each email match when it was supposed to be sent, but the received time is usually hours, sometimes days, later. I have checked my email client, and it doesn't seem to be the issue, as all other emails are received almost instantaneously.

My email client logs the date and time each email was supposedly sent and the date and time it was received (EST).

Here are some examples of emails from mailing lists:
[Accounts-enwiki-l] TS ACC statistics, 2013-12-30 Sent: Mon 12/30/2013 7:00 PM Received: 1/4/2014 9:09 PM (5 days)

Re: [Accounts-enwiki-l] Fwd: Undelivered Mail Returned to Sender Sent: Sat 1/4/2014 11:02 AM Received: 1/4/2014 11:04 PM (12 hours)

Re: [Accounts-enwiki-l] [ACC #------] English Wikipedia Account Request Sent: Wed 1/1/2014 5:45 PM Received: 1/5/2014 12:09 AM (Almost 4 days)

Re: [Accounts-enwiki-l] [ACC #114635] English Wikipedia Account Request Sent: Mon 1/6/2014 8:26 AM Received: Mon 1/6/2014 12:59 PM (4.5 hours)

Here are some examples of non-list emails from various people:
GitHub:
Re: [waca] Splash page or screen (#22) Sent: Sat 1/4/2014 7:48 PM Received: Sat 1/4/2014 7:48 PM (no delay)

Re: [waca] Redirection script (#49) Sent: Sun 1/5/2014 6:17 PM Received: Sun 1/5/2014 6:17 PM (no delay)

Wikipedia:
Σ left you a message on Wikipedia Sent: Mon 1/6/2014 5:14 AM Received: Mon 1/6/2014 5:14 AM (no delay)

Hahc21 left you a message on Wikipedia Sent: Tue 12/24/2013 4:02 PM Received: Tue 12/24/2013 4:02 PM (no delay)

Random companies and users:
Fwd: Accounts-enwiki-l post from [redacted email] requires approval Sent from a wikipedia user's email at: Mon 1/6/2014 12:06 PM Received: Mon 1/6/2014 12:06 PM (no delay)

You are now an accounts-enwiki-l listmod! Sent: Mon 1/6/2014 3:28 AM Received: Mon 1/6/2014 3:28 AM From: a wikipedia user's email (no delay)

The list seems to hang up at some point during sending, making it really hard to follow discussions properly. Note that I have redacted a few things from the subject headers for privacy.


Version: wmf-deployment
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=56414
https://bugzilla.wikimedia.org/show_bug.cgi?id=62838
https://bugzilla.wikimedia.org/show_bug.cgi?id=64795
https://bugzilla.wikimedia.org/show_bug.cgi?id=64818

Details

Reference
bz59731

Related Objects

Status    Assigned      Task
Declined  chasemp
Open      None
Resolved  01tonythomas
Resolved  01tonythomas
Resolved  01tonythomas
Resolved  01tonythomas
Resolved  01tonythomas
Resolved  csteipp
Invalid   None
Resolved  None
Resolved  01tonythomas
Resolved  aaron
Resolved  01tonythomas
Resolved  None
Declined  01tonythomas
Resolved  01tonythomas
Open      None
Resolved  None
Open      None
Open      herron
bzimport raised the priority of this task from to Normal.
bzimport set Reference to bz59731.
bzimport added a subscriber: Unknown Object (MLST).

mlpearc wrote:

I don't have the email header to post, but I can confirm that Cyberpower678 and I received an email from the list at virtually the same time, and that it was time-stamped 7 hours earlier.

I am providing a header from one of the mailing lists. It might be helpful to the devs.

X-Apparently-To: cybernet678@yahoo.com via 98.139.244.173; Mon, 06 Jan 2014 17:45:11 -0800
Return-Path: <mailman-bounces@lists.wikimedia.org>
Received-SPF: pass (domain of lists.wikimedia.org designates 208.80.154.4 as permitted sender)
X-Originating-IP: [208.80.154.4]
Authentication-Results: mta1373.mail.bf1.yahoo.com from=lists.wikimedia.org; domainkeys=neutral (no sig); from=lists.wikimedia.org; dkim=pass (ok)
Received: from 127.0.0.1 (EHLO lists.wikimedia.org) (208.80.154.4)
by mta1373.mail.bf1.yahoo.com with SMTP; Mon, 06 Jan 2014 17:45:11 -0800
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.wikimedia.org; s=wikimedia;
h=Sender:List-Id:Date:Message-ID:Content-Type:MIME-Version:To:From:Subject; bh=Yraw8uzvHY+hP3XjwBELx/BzAqD+VumhVWyTCjyjGLQ=;
b=fCDsw3hNGIfOkpP8uWTMr3tSyhY6j8asniszjA1JcJdRYhnAZqt4HREK4sKWWmuS9GnGG0Yk1IpqEj5sGJstgQARps2Zw1JctWLwnqpmosBz7JY7EadLKYFDWAID+EboAgUInb+1Eh8ELWb0+swYZx8R/TBwor3ZQy5rkirUMdc=;
Received: from localhost ([::1]:57513 helo=sodium.wikimedia.org)
by sodium.wikimedia.org with esmtp (Exim 4.71)
(envelope-from <mailman-bounces@lists.wikimedia.org>)
id 1W0ICE-0008Rl-Hd; Mon, 06 Jan 2014 21:58:54 +0000
Received: from localhost ([::1]:57503 helo=sodium.wikimedia.org)
by sodium.wikimedia.org with esmtp (Exim 4.71)
(envelope-from <accounts-enwiki-l-bounces@lists.wikimedia.org>)
id 1W0ICB-0008R4-S0 for accounts-enwiki-l-owner@lists.wikimedia.org;
Mon, 06 Jan 2014 21:58:52 +0000
Subject: =?utf-8?q?Accounts-enwiki-l_post_from_originalsidesocket=40maste?=
=?utf-8?q?rfulpiece=2Ecom_requires_approval?=
From: accounts-enwiki-l-owner@lists.wikimedia.org
To: accounts-enwiki-l-owner@lists.wikimedia.org
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="===============7352548824120526592=="
Message-ID: <mailman.404061.1389045530.9349.accounts-enwiki-l@lists.wikimedia.org>
Date: Mon, 06 Jan 2014 21:58:50 +0000
Precedence: bulk
X-BeenThere: accounts-enwiki-l@lists.wikimedia.org
X-Mailman-Version: 2.1.13
List-Id: Internal discussion between the English Wikipedia's account creation
team <accounts-enwiki-l.lists.wikimedia.org>
X-List-Administrivia: yes
Sender: mailman-bounces@lists.wikimedia.org
Errors-To: mailman-bounces@lists.wikimedia.org
Content-Length: 8537

Has somebody from the default CC list been observing this thread?

I have, but this is out of my depth. Email slowness can have so many causes that I think it's going to need someone from ops to dig in. I'm adding the keyword for them and will try to see who on our side may be good to ping.

The headers seem to agree that there was an almost 4-hour (3:45 or so) gap between our side receiving it and your provider receiving it, but I don't know enough to start debugging why from there.
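For reference, the gap can be checked directly from two timestamps in the posted headers. A minimal sketch in Python, assuming the last sodium `Received:` stamp and the Yahoo `Received:` stamp are the relevant pair; the variable names are illustrative only:

```python
from email.utils import parsedate_to_datetime

# Timestamps copied from the headers posted above.
left_sodium = parsedate_to_datetime("Mon, 06 Jan 2014 21:58:54 +0000")
reached_yahoo = parsedate_to_datetime("Mon, 06 Jan 2014 17:45:11 -0800")

# Both values are timezone-aware, so the subtraction accounts for the offsets.
delay = reached_yahoo - left_sodium
print(delay)  # 3:46:17
```

That is roughly 3 hours 46 minutes, consistent with the estimate above.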

Thehelpfulonewiki wrote:

Is this issue isolated to solely the Accounts-enwiki-l list? If so is it possible all member posts are held for approval? That could be causing the delay as I might have noticed something similar on occasion for emails that were held in moderation queues.

mlpearc wrote:

I have always been under the impression that mail from known (registered) users is not sent to the queue but goes straight through; I could be mistaken.

(In reply to comment #5)

Is this issue isolated to solely the Accounts-enwiki-l list? If so is it
possible all member posts are held for approval? That could be causing the
delay as I might have noticed something similar on occasion for emails that
were held in moderation queues.

It is moderated but people on the mailing list are automatically approved, and are never moderated. But the issue still seems to happen. It's actually gotten less severe today, but that may change quickly.

[Accounts-enwiki-l] TS ACC statistics, 2013-12-30 Sent: Mon 12/30/2013 7:00 PM
Received: 1/4/2014 9:09 PM (5 days)

This was sent from a bot approved on the mailing list, yet it took 5 days to go through.

The header above is email coming from -owner notifying about an email pending moderation in the queue. That took almost 4 hours to arrive as well.

(In reply to comment #5)

Is this issue isolated to solely the Accounts-enwiki-l list? If so is it
possible all member posts are held for approval? That could be causing the
delay as I might have noticed something similar on occasion for emails that
were held in moderation queues.

I didn't answer your first part. No, it is not isolated to the accounts mailing list.

[Labs-l] Fwd: Proposal for biweekly Labs showcase Sent: Tue 12/31/2013 7:19 AM Received: Tue 12/31/2013 5:12 PM

The above comes from the labs mailing list.

And we are back to long delays.

lcarr wrote:

Member from ops here ---

I have noticed that Yahoo is temporarily deferring our mail as spam: "Messages from 208.80.154.4 temporarily deferred due to user complaints - 4.16.55.1; see http://postmaster.yahoo.com/421-ts01.html"

Sadly the only solution to that is to leave yahoo mail and get a provider like gmail who runs their mailservers responsibly.

I tried clearing out the queue of old messages (no large numbers stuck in it, FYI), and forcing a run of older messages as well -- the majority of "stuck" messages were old bounces and messages to Yahoo. I even tried forcing a run of everything older than about 3 hours; the queue did not shrink, indicating that it's all bounces and deferrals on the receiving end.

Yahoo... Probably directly related: bug 56414 / RT #6151

You should write to Yahoo! customer support (especially if mail doesn't get delivered at all), but this is about us not respecting their guidelines; unless there's more evidence, this should be closed as duplicate of bug 56414 AFAICS.

mlpearc wrote:

It seems blaming Yahoo might be a little hasty, since (expanding on my earlier comment) I use Google exclusive for wikimedia and I received that email at the same time as Cyberpower678, and it was also 7 hours late.

(In reply to comment #13)

It seems blaming Yahoo might be a little hasty, since (expanding on my earlier
comment) I use Google exclusive for wikimedia and I received that email at the
same time as Cyberpower678, and it was also 7 hours late.

Ah. You hadn't mentioned this. What's "google exclusive"? If you could post full headers, that would be helpful.

mlpearc wrote:

(In reply to comment #14)

(In reply to comment #13)
> It seems blaming Yahoo might be a little hasty, since (expanding on my earlier
> comment) I use Google exclusive for wikimedia and I received that email at the
> same time as Cyberpower678, and it was also 7 hours late.

Ah. You hadn't mentioned this. What's "google exclusive"? If you could post
full headers, that would be helpful.

I'm sorry, I'm not up on things as far as bug reports go. By "google exclusive" I just meant that I use Gmail for all Wikimedia projects and, as I said in my first comment, I do not have the header for the email in question.

Switching this back to a more generic title and providing some more headers, since this just happened to me on advocacy_advisors, where the email took multiple days to arrive at my work (Google Apps) account. I know at least Yana got it (within the same system), so it oddly seems to be happening for some people and some emails, but not for everyone.

Headers below

Delivered-To: jalexander@wikimedia.org
Received: by 10.76.177.167 with SMTP id cr7csp59628oac;
Sun, 12 Jan 2014 02:02:01 -0800 (PST)
X-Received: by 10.224.74.74 with SMTP id t10mr2986272qaj.82.1389520920897;
Sun, 12 Jan 2014 02:02:00 -0800 (PST)
Return-Path: <advocacy_advisors-bounces@lists.wikimedia.org>
Received: from mchenry.wikimedia.org (mchenry.wikimedia.org. [208.80.152.186])
by mx.google.com with ESMTP id r5si18530962qat.32.2014.01.12.02.02.00
for <multiple recipients>;
Sun, 12 Jan 2014 02:02:00 -0800 (PST)
Received-SPF: pass (google.com: domain of advocacy_advisors-bounces@lists.wikimedia.org designates 208.80.152.186 as permitted sender)
Authentication-Results: mx.google.com;
spf=pass (google.com: domain of advocacy_advisors-bounces@lists.wikimedia.org designates 208.80.152.186 as permitted sender) smtp.mail=advocacy_advisors-bounces@lists.wikimedia.org;
dkim=pass header.i=@lists.wikimedia.org
Received: from lists.wikimedia.org ([2620:0:861:1::2]:1046)
by mchenry.wikimedia.org with esmtp (Exim 4.69)
(envelope-from <advocacy_advisors-bounces@lists.wikimedia.org>)
id 1W2Hrj-0008Ji-Rm; Sun, 12 Jan 2014 10:01:59 +0000
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.wikimedia.org; s=wikimedia;
h=Sender:Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post:List-Archive:List-Unsubscribe:List-Id:Reply-To:Subject:Cc:To:Message-ID:Date:From:References:In-Reply-To:MIME-Version; bh=SOFP+AsK8AUgIoaYZDX3rgTdy6b/ifmRG4oMqKqXwyE=;
b=VpfkT74roE5R+G2mfmoWMOBWO8J0grMaUd33cfplqT4YuhW7HS6UIM9JsWVo24NY7YbPKoBBizMAQ5VnAfdzsIxSgU5ZAQCYLw/dQ9ojLJqcCYituURdEcC/veb9s4QAfBLOqCCH5II+4F+F+L96xsEXcDwy7jCGVRdWEoO3j7E=;
Received: from localhost ([::1]:60612 helo=sodium.wikimedia.org)
by sodium.wikimedia.org with esmtp (Exim 4.71)
(envelope-from <advocacy_advisors-bounces@lists.wikimedia.org>)
id 1W2Hrj-0006Up-ED; Sun, 12 Jan 2014 10:01:59 +0000
Received: from mail-bk0-x233.google.com ([2a00:1450:4008:c01::233]:43674)
by sodium.wikimedia.org with esmtp (Exim 4.71)
(envelope-from <daniel.mietchen@googlemail.com>)
id 1W1QXJ-0005zG-KO; Fri, 10 Jan 2014 01:05:22 +0000
Received: by mail-bk0-f51.google.com with SMTP id 6so1349320bkj.24
for <multiple recipients>; Thu, 09 Jan 2014 17:05:20 -0800 (PST)
X-Received: by 10.204.170.129 with SMTP id d1mr1582684bkz.124.1389315920800;
Thu, 09 Jan 2014 17:05:20 -0800 (PST)
MIME-Version: 1.0
Received: by 10.204.118.1 with HTTP; Thu, 9 Jan 2014 17:04:59 -0800 (PST)
In-Reply-To: <CAOXkX7rCGNkdZDjuVZWBG_GtVmSLFc8suXLH_pOMkVJHOhLTCg@mail.gmail.com>
References: <CAOXkX7rCGNkdZDjuVZWBG_GtVmSLFc8suXLH_pOMkVJHOhLTCg@mail.gmail.com>
From: Daniel Mietchen <daniel.mietchen@googlemail.com>
Date: Fri, 10 Jan 2014 02:04:59 +0100
Message-ID: <CAN6n2b0e0tecuXqJXi8O+2JxcrN8z9aeYAJTVVauku31CjEojg@mail.gmail.com>
To: Yana Welinder <ywelinder@wikimedia.org>
X-Mailman-Approved-At: Sun, 12 Jan 2014 10:01:52 +0000
Cc: "legalteam@lists.wikimedia.org" <legalteam@lists.wikimedia.org>,
Corynne Mcsherry <corynne@eff.org>, Jake Orlowitz <jorlowitz@gmail.com>,
Parker Higgins <parker@eff.org>, Andrea Zanni <zanni.andrea84@gmail.com>,
advocacy_advisors@lists.wikimedia.org, Mitch Stoltz <mitch@eff.org>
Subject: Re: [Advocacy Advisors] Blog post about Wikipedia and the public
domain
X-BeenThere: advocacy_advisors@lists.wikimedia.org
X-Mailman-Version: 2.1.13
Precedence: list
Reply-To: Advocacy Advisory Group for Wikimedia
<advocacy_advisors@lists.wikimedia.org>
List-Id: Advocacy Advisory Group for Wikimedia
<advocacy_advisors.lists.wikimedia.org>
List-Unsubscribe: https://lists.wikimedia.org/mailman/options/advocacy_advisors,
<mailto:advocacy_advisors-request@lists.wikimedia.org?subject=unsubscribe>
List-Archive: http://lists.wikimedia.org/pipermail/advocacy_advisors
List-Post: <mailto:advocacy_advisors@lists.wikimedia.org>
List-Help: <mailto:advocacy_advisors-request@lists.wikimedia.org?subject=help>
List-Subscribe: https://lists.wikimedia.org/mailman/listinfo/advocacy_advisors,
<mailto:advocacy_advisors-request@lists.wikimedia.org?subject=subscribe>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Sender: advocacy_advisors-bounces@lists.wikimedia.org
Errors-To: advocacy_advisors-bounces@lists.wikimedia.org

Amusingly, the culprit there was *in* sodium, which received the mail but failed to deliver it to *itself* for no less than 57 hours?! Judging from the line:

Received: from localhost ([::1]:60612 helo=sodium.wikimedia.org)
by sodium.wikimedia.org with esmtp
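
The 57-hour figure can be reproduced from the `Received:` chain in the headers above. A minimal sketch in Python, assuming each `Received:` value ends with `; <date>` and names the receiving host after `by`; `hop_delays` is a name made up for this illustration, and the envelope-from lines are left out for brevity:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Received headers copied from the message above (newest first), with
# continuation lines indented so they parse as folded headers.
RAW = """\
Received: from lists.wikimedia.org ([2620:0:861:1::2]:1046)
 by mchenry.wikimedia.org with esmtp (Exim 4.69)
 id 1W2Hrj-0008Ji-Rm; Sun, 12 Jan 2014 10:01:59 +0000
Received: from localhost ([::1]:60612 helo=sodium.wikimedia.org)
 by sodium.wikimedia.org with esmtp (Exim 4.71)
 id 1W2Hrj-0006Up-ED; Sun, 12 Jan 2014 10:01:59 +0000
Received: from mail-bk0-x233.google.com ([2a00:1450:4008:c01::233]:43674)
 by sodium.wikimedia.org with esmtp (Exim 4.71)
 id 1W1QXJ-0005zG-KO; Fri, 10 Jan 2014 01:05:22 +0000

"""


def hop_delays(raw_headers):
    """Return (receiving host, delay since the previous hop), oldest hop first."""
    msg = message_from_string(raw_headers)
    hops = []
    for value in msg.get_all("Received", []):
        # The timestamp follows the last semicolon of the header value.
        info, _, stamp = value.rpartition(";")
        words = info.split()
        host = words[words.index("by") + 1]  # token after "by" is the receiver
        hops.append((host, parsedate_to_datetime(stamp.strip())))
    hops.reverse()  # headers are prepended, so the oldest hop is listed last
    return [(host, when - prev_when)
            for (_, prev_when), (host, when) in zip(hops, hops[1:])]


for host, delay in hop_delays(RAW):
    print(host, delay)
```

The sodium-to-sodium hop comes out at 2 days, 8:56:37, which is just under 57 hours. The same message also carries `X-Mailman-Approved-At: Sun, 12 Jan 2014 10:01:52 +0000`, seven seconds before that hop completes, so part or all of the gap may be time spent awaiting list approval rather than in the Exim queue; the headers alone do not settle this.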

Nemo is correct as far as sodium goes; however, the logs have been rotated and, unless I am missing something, we cannot debug this further. Next time this shows up, please have an ops team member like me look into it sooner than 10 days later.

By the way, ~4 hours (which is how long the first of the two messages with full headers in this bug was delayed) is not considered a huge delay. Indeed, it is the typical amount of time after which various MTAs retry once the first set of attempts has failed.

(In reply to comment #18)

Nemo is correct as far as sodium goes; however, the logs have been rotated and,
unless I am missing something, we cannot debug this further. Next time this
shows up, please have an ops team member like me look into it sooner than 10
days later.

By the way, ~4 hours (which is how long the first of the two messages with full
headers in this bug was delayed) is not considered a huge delay. Indeed, it is
the typical amount of time after which various MTAs retry once the first set of
attempts has failed.

It is huge. I get my queue moderation messages 4 hours at least after they've been sent. Some of them are legitimate emails waiting for replies.

It is also the standard for many MTAs out there.

(In reply to comment #20)

It is also the standard for many MTAs out there.

I'm a bit confused by this, to be honest. It's certainly possible that this is the standard for some, but it is still very rare and irregular, and not what people expect. Even a 10-15 minute delay causes large warnings and apologies from Google, for example (and, in my opinion, should); a 4-hour delay would be an enormous news story (and I've seen that happen when the delay was significantly less).

Regarding the delay and the inability to debug this now because the logs have rotated: This bug was created as soon as the issue was discovered (and I added my headers as soon as I saw the same thing). While I am certainly capable of pinging ops through other means if that is desired, those filing this bug seem to have done everything correctly to get something addressed. Is there a process that we can put in place to make sure bugs like this are addressed quickly enough to be debugged?

James, I don't think Alexandros was suggesting or implying something was done wrong. Simply, we now know when the logs expire and are now in a better position to (get someone to) debug the issue when it happens again, specifically when delays over 4 hours happen.

Seriously. Please get this fixed. I just got a flood of 47 emails from various lists and only 10 of them were from the last 4 hours. It's driving me nuts.

(In reply to comment #23)

Seriously. Please get this fixed. I just got a flood of 47 emails from
various lists and only 10 of them were from the last 4 hours. It's driving me
nuts.

If you just got these could you post some of the headers so that the ops team could use them to debug? (if necessary I'm happy to post them as a private bug comment or attachment as well so that not everyone can see them).

(In reply to comment #23)

Seriously. Please get this fixed. I just got a flood of 47 emails from
various lists and only 10 of them were from the last 4 hours. It's driving me
nuts.

Our freshly added stats show that there were 15k messages in queue on the 29th, about 6k of which got delivered all together about the time cyberpower says.
https://ganglia.wikimedia.org/latest/?r=week&cs=1%2F27%2F2014+7%3A32&ce=1%2F30%2F2014+23%3A31&c=Miscellaneous+eqiad&h=sodium.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS
sodium seems to defer delivery 4 times more often than it delivers something, hence I'm raising the priority/severity.
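
A ratio like this can be estimated from the Exim main log by counting delivery and deferral lines. A rough sketch, assuming the default log layout (date, time, message ID, then a flag such as `=>` for a delivery or `==` for a deferral); the sample lines, addresses, and message IDs below are invented for illustration and are not taken from sodium's logs:

```python
from collections import Counter

# Exim main-log flags and what they indicate.
FLAGS = {
    "<=": "arrived",
    "=>": "delivered",
    "->": "delivered",  # additional address in the same delivery
    "==": "deferred",
    "**": "failed",
}

# Hypothetical sample lines in the usual main-log layout.
SAMPLE = [
    "2014-01-29 10:00:01 1AAAAA-000001-AA <= sender@example.org H=mail.example.org [192.0.2.1] P=esmtp S=2048",
    "2014-01-29 10:00:02 1AAAAA-000001-AA => alice@example.net R=dnslookup T=remote_smtp H=mx.example.net [192.0.2.2]",
    "2014-01-29 10:00:03 1AAAAA-000001-AA == bob@example.com R=dnslookup T=remote_smtp defer (-44): SMTP error from remote mail server",
    "2014-01-29 10:00:04 1AAAAA-000001-AA == carol@example.com R=dnslookup T=remote_smtp defer (-44): SMTP error from remote mail server",
    "2014-01-29 10:00:05 1AAAAA-000001-AA == dave@example.com R=dnslookup T=remote_smtp defer (-44): SMTP error from remote mail server",
    "2014-01-29 10:00:06 1AAAAA-000001-AA == erin@example.com R=dnslookup T=remote_smtp defer (-44): SMTP error from remote mail server",
    "2014-01-29 10:00:09 Start queue run: pid=1234",
]


def tally(lines):
    """Count log lines per outcome, ignoring lines without a known flag."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Expected layout: date, time, message ID, flag, ...
        if len(fields) > 3 and fields[3] in FLAGS:
            counts[FLAGS[fields[3]]] += 1
    return counts


counts = tally(SAMPLE)
print(counts["deferred"] / counts["delivered"])  # 4.0 for this invented sample
```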

I just had a look at the logs. There is not yet enough data to be sure this is a recurring pattern (or when it started), but it seems that one big mail provider (namely Yahoo) has been replying with 421s, effectively stalling mail delivery. This is on purpose, per http://postmaster.yahoo.com/421-ts01.html (specifically the 4.16.55.1, related to user complaints). This may or may not be connected to https://bugzilla.wikimedia.org/show_bug.cgi?id=52915.

Cyberpower678, this would explain the problems you report. Unfortunately, I am not sure what exactly we can do about this apart from following the guidelines Yahoo lists at that link.
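
For context, the first digit of an SMTP reply code tells the sending MTA what to do next: 2xx means the message was accepted, 4xx is a temporary failure (keep the message queued and retry later), and 5xx is a permanent failure (bounce). A minimal sketch of that rule; `classify_smtp_reply` is a hypothetical helper, and the sample reply assumes Yahoo's quoted text was prefixed with the 421 code mentioned above:

```python
def classify_smtp_reply(reply):
    """Map an SMTP reply line to the action a sending MTA would normally take."""
    code = int(reply[:3])
    if 200 <= code < 300:
        return "accepted"
    if 400 <= code < 500:
        return "defer"   # temporary failure: leave in the queue and retry
    if 500 <= code < 600:
        return "bounce"  # permanent failure: return to sender
    return "other"       # e.g. 3xx intermediate replies


# Assumed form of the Yahoo response: the 421 code plus the text quoted earlier.
yahoo_reply = ("421 Messages from 208.80.154.4 temporarily deferred due to "
               "user complaints - 4.16.55.1; see "
               "http://postmaster.yahoo.com/421-ts01.html")
print(classify_smtp_reply(yahoo_reply))  # defer
```

Because a 421 is only a deferral, the affected messages stay in the queue until a later retry succeeds, which matches the pattern of mail arriving late and in batches.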

(In reply to comment #26)

Cyberpower678, this would explain the problems you report. Unfortunately, I am
not sure what exactly we can do about this apart from following the guidelines
Yahoo lists at that link.

See bug 56414.

Per comment 26 setting priority to normal - more data needed; plus the one big mail provider is handled in bug 56414 already (high prio).

(In reply to Andre Klapper from comment #28)

Per comment 26 setting priority to normal - more data needed; plus the one
big mail provider is handled in bug 56414 already (high prio).

Has there been any progress regarding this? My workaround for this was to have Yahoo divert the emails to a different folder so they don't keep setting off the new email alert when it floods in.

stg_andy wrote:

I am on the English Wikipedia's Arbitration Committee Clerk mailing list. On the 15th I sent an email to the list using my Hotmail account. I never got a copy of the email like I used to. I posted on the clerk noticeboard about this. (https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:Arbitration_Committee/Clerks&oldid=599904629) I was informed that my email did get through and it could be verified by checking the archives. I then noticed that I did not appear to be getting any emails from the list. I checked my list settings on the archive site and they appeared to be fine. I changed my email address I used for the list to a Gmail account of mine and sent out a test message. The test message got to the list and is viewable in the archives, but I still didn't get a copy of it in my inbox. Prior to switching email addresses, there was a message on the archive site stating I had a bounce score of 2.0 out of a max of 5.0. Because I'm still not getting emails from the list and I've tried two different email services I believe there is a partial problem with the mailing list.

I'm bumping this to highest priority due to bug 62838 being related.

Alexandros: Bug 62838 was fixed by you, and comment 26 from February said that we need more data in the logs. Is enough/sufficient data available now?

Wondering how to get some progress here, assuming this still happens.

Andre: There is sufficient data to know that Yahoo (and Yahoo alone) enforces greylisting on emails sent to Yahoo accounts. What happens is that every few hours their system decides to temporarily fail messages from our systems. After a couple of hours, per the de facto standard, this is lifted and the messages get delivered en masse. This explains the delays described in this email. As I have already said earlier, there are some (very generic) guidelines Yahoo provides for sending email to their users. We do respect a lot of these guidelines (in fact, the very first one is to retry on temporary failure, which we do).

As for how to make progress on this: at this point the entire mail system is being reworked and upgraded (and puppetized, which is a huge plus), and this is expected to solve some of the problems the old system has and to add some better safeguards.

Has this been happening lately?

Also wondering: Has this been happening lately?

I have observed delays two times in the past few days. One was four hours and the other was four minutes.

chasemp closed this task as Declined. May 29 2015, 6:45 PM
chasemp claimed this task.
chasemp added a subscriber: chasemp.

Since this is nonspecific and it has been a few months since anyone was bitten enough to update, I am closing this for now. This can always be reopened if it surfaces.