Only receiving few emails from Gerrit
Closed, Resolved. Public

Description

While I wish this were true... I have received only a couple of emails from Gerrit today. For example, mediawiki-commits seems to have received only three mails since 07:00 UTC today.

Workaround

Close the stalled TCP connection to the SMTP server using gdb; see T131189#2161055.

Fix

Upgrade to Gerrit 2.8.4 or later

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 29 2016, 7:28 PM

This breaks my workflow, but I haven't got explicit confirmation yet from anyone else observing the same issue.

Paladox added a subscriber: Paladox. Edited Mar 29 2016, 7:40 PM

Yes, I have had 30+ patches merged today, yet I haven't had an email saying they have been merged. This has been happening over the last few days.

The problem happens to me too.

Nemo_bis updated the task description. Mar 29 2016, 8:10 PM
demon added a subscriber: demon. Mar 29 2016, 8:39 PM

I'm not seeing any errors on the Gerrit side, nor have there been any configuration changes.

I haven't gotten any Gerrit email recently either to my gmail address :(

I use Yahoo, so it seems to be happening with all email providers.

Nikerabbit triaged this task as Unbreak Now! priority. Mar 30 2016, 5:34 AM

Setting priority because a) I will miss comments, e.g. if a patch fails tests or people ask questions without voting; b) I need to know what gets merged in core and some extensions so that I know what to look for if I see issues after deploying code.

This has already caused one bugfix to miss the weekly branching, for example.

Maybe it is a bug in Gerrit.

The release notes at https://gerrit-documentation.storage.googleapis.com/ReleaseNotes/ReleaseNotes-2.8.4.html say that release fixed a mail thread getting stuck.

Peachey88 removed a subscriber: Operations.
demon added a comment. Mar 30 2016, 2:57 PM

Maybe it is a bug in Gerrit.
The release notes at https://gerrit-documentation.storage.googleapis.com/ReleaseNotes/ReleaseNotes-2.8.4.html say that release fixed a mail thread getting stuck.

In which case it's stuck because it cannot reach the SMTP server, not because Gerrit can't send e-mails. Again, I see no evidence on Gerrit's side of errors.

OK, then it is a bug. Hopefully we will be on Gerrit 2.12 soon, which should fix the problem.

hashar added a subscriber: hashar. Mar 30 2016, 3:15 PM

Seems to be an issue with the Gerrit queue:

ssh -p 29418 gerrit.wikimedia.org 'gerrit show-queue -w'
Task     State        StartTime         Command
------------------------------------------------------------------------------
ffcfd47f              Mar-29 08:28      send-email comments
27141179              15:05:24.127      git-upload-pack p/mediawiki/core.git
ff01149f waiting .... Mar-29 08:28      send-email comments
5f2dc03c waiting .... Mar-29 08:28      send-email merged
bf695cd1 waiting .... Mar-29 08:33      send-email comments
df7d30c5 waiting .... Mar-29 08:35      send-email comments
...
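
A quick way to gauge how many notifications are held up is to count the send-email rows in that output. The sketch below is only an illustration: the function name count_send_email_tasks is made up here, and it assumes the show-queue column layout shown above, which may differ between Gerrit versions.

```shell
# Hypothetical helper (not part of Gerrit): count queue rows whose command
# is a send-email task. Reads `gerrit show-queue -w` output on stdin.
count_send_email_tasks() {
    # Match rows containing "send-email"; n + 0 prints 0 when nothing matched.
    awk '/send-email/ { n++ } END { print n + 0 }'
}

# Usage (needs SSH access to the Gerrit host, as in the command above):
#   ssh -p 29418 gerrit.wikimedia.org 'gerrit show-queue -w' | count_send_email_tasks
```

A count that keeps growing between runs while the oldest StartTime stays the same would match the stuck-queue symptom described here.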

From the Gerrit 2.8.4 release notes that @Paladox pointed to:

Fix mail thread getting stuck when waiting for response from SMTP server.

So it looks like whenever the SMTP relay flaps for some reason, Gerrit ends up stuck sending mail. Good news: the events are still queued; not sure how we can unblock them though.

@hashar would restarting Gerrit fix that?

demon added a comment. Mar 30 2016, 3:18 PM

@hashar would restarting Gerrit fix that?

No, it would flush the queue.

@hashar would restarting Gerrit fix that?

No, it would flush the queue.

Flush or drop? Because flush does not sound so bad.

@demon do you know when we can upgrade to Gerrit 2.12?

Also, if we do restart, the problem will come back after a few days, per the link above. So we can't keep restarting Gerrit every few days, since that would cause a lot of emails not to be sent.

demon added a comment. Mar 30 2016, 3:36 PM

@hashar would restarting Gerrit fix that?

No, it would flush the queue.

Flush or drop? Because flush does not sound so bad.

Drop, sorry.

Actually https://code.google.com/p/gerrit/issues/detail?id=1528#c8 is the most interesting bit. I'm testing the socket theory now actually...

@demon do you know when we can upgrade to Gerrit 2.12?

I've been working on testing the schema upgrades this week, so soon.

Also, if we do restart, the problem will come back after a few days, per the link above. So we can't keep restarting Gerrit every few days, since that would cause a lot of emails not to be sent.

Restarting Gerrit (especially continually) is a non-option here.

@demon do you know when we can upgrade to Gerrit 2.12?
Also, if we do restart, the problem will come back after a few days, per the link above. So we can't keep restarting Gerrit every few days, since that would cause a lot of emails not to be sent.

Gerrit upgrading is unrelated to this task. Please see T70271 :)

Mentioned in SAL [2016-03-30T15:37:36Z] <hashar> Gerrit has trouble sending emails T131189

@hashar upgrading Gerrit will fix the problem. But yes, that question should have been asked on the other task.

Would that task block this one, then? The only workaround is to restart Gerrit, but that is a horrible thing to do.

demon added a comment. Mar 30 2016, 3:39 PM

Would that task block this one, then? The only workaround is to restart Gerrit, but that is a horrible thing to do.

That's not the only workaround.

Would that task block this one, then? The only workaround is to restart Gerrit, but that is a horrible thing to do.

That's not the only workaround.

Oh, then a temporary fix until we upgrade, unless there is another way.

@akosiaris fixed the problem. I've now got Gerrit mail.

demon lowered the priority of this task from Unbreak Now! to High. Mar 30 2016, 4:10 PM
demon added a project: Upstream.

So this ultimately was the upstream SMTP issue. @akosiaris was able to attach to the process to drop the open socket and e-mails should now be going out (there's a ~2300 email backlog as of writing).

Ultimately we need to upgrade to fix this problem which is being done via T70271. I've been working this week on testing the upgrade to make sure the schema changes go ok.

Lowering priority but not closing as we can't "fix" this yet and it could happen again.

So for posterity's sake here's a transcript of what I did.

Inspired by https://www.box.com/blog/remember-unix-runs-under-your-jvm/ (written by an acquaintance of mine), I've done the following:

sudo lsof -p 7834 | grep smtp

I got 3 file descriptors as a result. However, calling close directly on them did not work, because the Java package had been upgraded in the meantime and the debugging symbols were missing from the system. After figuring out the correct package thanks to @demon and fetching it from /var/cache/apt, I unpacked the package and its debugging symbols in my home directory. Then

sudo gdb -exec=a/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -p 7834 --symbols=a/usr/lib/debug/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
...
call close(251)
$1 = 0

That worked! I've closed all 3 file descriptors indiscriminately, like the one above, and Gerrit reopened the SMTP connection and started pushing email.
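
The lsof and gdb steps above could be tied together roughly as follows. This is a sketch only: the helper name smtp_fds_to_gdb_calls is invented for illustration, it assumes lsof's default column layout (file descriptor in the fourth column, e.g. 251u), and it only prints the gdb commands rather than running them. Closing descriptors inside a live JVM is risky, so the printed lines are meant to be reviewed by hand first.

```shell
# Hypothetical helper: turn `lsof -p PID` output (on stdin) into the gdb
# `call close(N)` lines used above, one per descriptor whose row mentions smtp.
smtp_fds_to_gdb_calls() {
    awk '/smtp/ {
        fd = $4                     # FD column, for example "251u"
        sub(/[^0-9]+$/, "", fd)     # strip the access-mode suffix
        if (fd ~ /^[0-9]+$/) print "call close(" fd ")"
    }'
}

# Usage (PID 7834 is the one from the transcript above):
#   sudo lsof -p 7834 | smtp_fds_to_gdb_calls
```

Printing the commands instead of piping them straight into gdb keeps a human in the loop, which seems prudent given that the descriptors here were closed indiscriminately.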

At some point, I would restart gerrit anyway ;-)

there's a ~2300 email backlog as of writing

A disproportionately high percentage of which has now reached my inbox... show-queue no longer lists any email waiting.

Might be fixed in Gerrit 2.8.4 by commit a7e343131777750d11c12d06354e52aaae9badc5.

Can we please close this contingent bug and split the unspecified expected long-term improvements to a separate task?

hashar changed the task status from Open to Stalled. Apr 4 2016, 9:58 AM
hashar lowered the priority of this task from High to Normal.
hashar updated the task description.

I have updated the task description to:

  • point to @akosiaris's comment explaining how to resume the queue processing.
  • mention that 2.8.4 will fix it.

This task is now pending the Gerrit upgrade.

Nemo_bis removed a subscriber: Nemo_bis. Apr 7 2016, 5:00 PM

@Jdforrester-WMF reports that he is experiencing the same symptoms today (lack of receiving email from Gerrit, that is :) ).

greg updated the task description. Apr 11 2016, 3:05 PM
demon added a comment. Apr 11 2016, 3:20 PM

I've been getting gerrit e-mails this morning and I see nothing stuck in the queue.

The queue however is being processed normally from what I see.

ssh -p 29418 akosiaris@gerrit.wikimedia.org gerrit show-queue  |wc -l
8

so it must be something different.

greg added a comment. Apr 11 2016, 3:28 PM

Sorry for the false-alarm, it was all in his Spam folder :) (which is weird, but, definitely not because of this task...)

Paladox changed the task status from Stalled to Open. Edited Jul 12 2016, 5:30 PM

Changing status from Stalled to Open since the Gerrit 2.12 upgrade is now moving forward.

Please set it back to Stalled if I am wrong or should not have changed the status.

demon added a comment. Jul 12 2016, 5:33 PM

What does 2.12 have to do with it?

@demon because it is fixed upstream, meaning it is fixed in an updated version of Gerrit.

As we are updating to Gerrit 2.12, we need to set sendemail.connectTimeout in Gerrit to resolve this task.

@demon, I'm wondering, do you have a number in mind?

https://gerrit.googlesource.com/gerrit/+/a7e343131777750d11c12d06354e52aaae9badc5%5E%21/#F4

+Values can be specified using standard time unit abbreviations
+('ms', 'sec', 'min', etc.).

It defaults to 0, meaning wait indefinitely.

please.

Change 300583 had a related patch set uploaded (by Chad):
Gerrit: Set sendemail.connectTimeout to 1 minute

https://gerrit.wikimedia.org/r/300583

Change 300583 merged by Dzahn:
Gerrit: Set sendemail.connectTimeout to 1 minute

https://gerrit.wikimedia.org/r/300583
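
For reference, gerrit.config uses Git's config file format, so one way to set this option is with git config. The sketch below writes to a scratch file rather than a real site, and the value "1 min" simply follows the unit abbreviations quoted above; the exact literal used in change 300583 is not shown in this thread, so treat it as an assumption.

```shell
# Scratch file standing in for $site_path/etc/gerrit.config (assumption:
# point this at the real file only after reviewing the change).
conf_file=$(mktemp)

# Set the SMTP connect timeout; per the docs quoted above, the default of 0
# means wait indefinitely.
git config --file "$conf_file" sendemail.connectTimeout "1 min"

# Read the value back to confirm it was written.
git config --file "$conf_file" --get sendemail.connectTimeout
```

With a finite timeout, a connection attempt to an unresponsive SMTP relay should eventually fail instead of blocking the mail thread forever, which is the failure mode seen earlier in this task.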

Dzahn added a subscriber: Dzahn. Jul 22 2016, 7:01 PM

As we are updating to Gerrit 2.12, we need to set sendemail.connectTimeout in Gerrit to resolve this task.

This should be the case now. Merged and ran on lead. But can we close already?

demon added a comment. Jul 22 2016, 7:06 PM

I'd rather not until we swap over to the new host.

demon closed this task as Resolved. Jul 25 2016, 3:34 PM
demon claimed this task.

Should be resolved. If we encounter this again please reopen.