
Gerrit http threads stuck behind sendemail thread
Open, NormalPublic

Description

Gerrit threads have been getting stuck behind a single SendEmail thread.

(see the threaddump: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvMjcvLS1qc3RhY2stMTktMDUtMjctMjItNTItNTIuZHVtcC0tMjItNTMtMjA=)

The only way to resolve this issue is to restart Gerrit.

This issue is superficially similar to T131189; however:

  1. send-email doesn't show up in gerrit show-queue -w --by-queue (tried using various flags)
  2. lsof for the gerrit process at the time of these problems didn't show any smtp connections

I've tried killing the offending thread via the JavaMelody interface to no avail.

Ongoing upstream discussion: https://groups.google.com/forum/#!msg/repo-discuss/pBMh09-XJsw/vuhDiuTWAAAJ

Event Timeline

Restricted Application added a subscriber: Aklapper. May 27 2019, 11:18 PM
thcipriani triaged this task as Normal priority. May 27 2019, 11:18 PM

Mentioned in SAL (#wikimedia-operations) [2019-05-27T23:19:28Z] <thcipriani> gerrit back after restarting due to T224448

Mentioned in SAL (#wikimedia-operations) [2019-05-28T08:29:34Z] <volans> restarting gerrit due to stack threads - T224448

Volans added a subscriber: dcausse. May 28 2019, 8:39 AM

As suggested by @dcausse let's try to capture a jstack next time it happens:

sudo -u gerrit2 jstack $(pidof java)

Mentioned in SAL (#wikimedia-operations) [2019-05-28T08:40:25Z] <volans> T224448 sudo cumin -b 15 -p 95 'R:git::clone' 'run-puppet-agent -q --failed-only'

hashar added a subscriber: hashar. May 28 2019, 9:46 AM

We have plenty of stacktraces already and the one in this task description matches :-]

The sendEmail thread is on hold, waiting for a lock (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject), and we have not been able to identify who/what sets that lock or why it is never released.

sendEmail does access the account cache, and that sets a reentrant lock on it. Any thread that uses the account cache then ends up blocked as well, waiting for sendEmail to release the lock, which it never does because it is itself blocked on another, unidentified lock :-\
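To make that failure mode concrete, here is a minimal, self-contained sketch (hypothetical lock and thread names, not Gerrit code) of a thread that parks on an unrelated condition while still owning the lock every account-cache user needs; any other thread touching the cache then queues up behind the ReentrantLock, exactly as in the dumps above.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class LockBehindLock {
    // Hypothetical stand-ins for the account-cache lock and the sendEmail
    // work queue; purely illustrative, not Gerrit code.
    static final ReentrantLock accountCacheLock = new ReentrantLock();
    static final ReentrantLock queueLock = new ReentrantLock();
    static final Condition workAvailable = queueLock.newCondition();

    public static void main(String[] args) throws InterruptedException {
        Thread sendEmail = new Thread(() -> {
            accountCacheLock.lock();            // takes the cache lock...
            try {
                queueLock.lock();
                try {
                    // ...then parks on an unrelated condition that never fires.
                    // jstack reports this as waiting on an
                    // AbstractQueuedSynchronizer$ConditionObject while the thread
                    // still owns the ReentrantLock acquired above.
                    workAvailable.await();
                } catch (InterruptedException ignored) {
                } finally {
                    queueLock.unlock();
                }
            } finally {
                accountCacheLock.unlock();      // never reached while parked above
            }
        }, "SendEmail-1");
        sendEmail.start();
        Thread.sleep(100);

        Thread http = new Thread(() -> {
            accountCacheLock.lock();            // every cache user piles up here
            try {
                System.out.println("got the account cache lock");
            } finally {
                accountCacheLock.unlock();
            }
        }, "HTTP-1");
        http.start();

        http.join(1000);
        System.out.println("HTTP-1 after 1s: " + http.getState());  // WAITING
    }
}
```

Run standalone, the program wedges with HTTP-1 reported as WAITING on the ReentrantLock, which is what jstack shows for the blocked HTTP threads.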

This looks to be the same issue someone else reported: https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

dcausse added a comment (edited). May 28 2019, 1:29 PM

Here SendEmail-1 is not "blocked"; it is just waiting for jobs. However, the dump says:
Locked ownable synchronizers: - <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
meaning that the SendEmail-1 thread holds the ReentrantLock that is blocking the HTTP threads.
So something must have happened earlier in that thread for the lock to still be held.
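The "Locked ownable synchronizers" section that jstack prints can also be read from a live JVM through the standard ThreadMXBean. A small sketch (nothing Gerrit-specific assumed) that lists, for each thread, any held java.util.concurrent locks; this is one way to confirm which thread owns 0x00000001c13617f8 without taking a full dump:

```java
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class LockedSynchronizers {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Second argument asks for ownable synchronizers (ReentrantLock etc.),
        // the same data jstack prints under "Locked ownable synchronizers".
        for (ThreadInfo info : threads.dumpAllThreads(false, true)) {
            LockInfo[] held = info.getLockedSynchronizers();
            if (held.length > 0) {
                System.out.printf("%s holds:%n", info.getThreadName());
                for (LockInfo lock : held) {
                    // e.g. java.util.concurrent.locks.ReentrantLock$NonfairSync@1c13617f8
                    System.out.println("  " + lock);
                }
            }
        }
    }
}
```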

> I've tried killing the offending thread via the JavaMelody interface to no avail.

If JavaMelody uses Thread.stop() to kill threads, then it might cause deadlocks. Was the thread dump here generated before or after trying to kill the SendEmail-1 thread?
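For context on why that matters, a minimal sketch, assuming the kill really does use Thread.stop() (I don't know what JavaMelody actually does), of how an asynchronously stopped thread can leave a ReentrantLock held forever when the unlock is not inside a finally block:

```java
import java.util.concurrent.locks.ReentrantLock;

public class StopLeavesLockHeld {
    static final ReentrantLock lock = new ReentrantLock();

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            lock.lock();              // lock taken...
            sleepQuietly(60_000);     // ...ThreadDeath arrives while "working" here
            lock.unlock();            // never reached, so the lock stays held forever
        }, "SendEmail-like");
        worker.start();
        Thread.sleep(200);

        worker.stop();                // deprecated for exactly this reason
        worker.join();
        // The dead thread still owns the lock; anyone who needs it now blocks.
        System.out.println("lock still held: " + lock.isLocked());  // true
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
    }
}
```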

> Here SendEmail-1 is not "blocked"; it is just waiting for jobs. However, the dump says:
> Locked ownable synchronizers: - <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> meaning that the SendEmail-1 thread holds the ReentrantLock that is blocking the HTTP threads.
> So something must have happened earlier in that thread for the lock to still be held.

Is it also possible that it's some kind of blocking read causing a deadlock that doesn't show up in dumps? i.e., https://dzone.com/articles/java-concurrency-hidden-thread
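For illustration only (not taken from Gerrit code), this is the kind of case that article describes: a thread blocked in a socket read while holding a lock stays RUNNABLE in a thread dump, so the JVM's deadlock detection never flags it even though everything queued behind the lock is stuck:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.locks.ReentrantLock;

public class BlockingReadUnderLock {
    static final ReentrantLock lock = new ReentrantLock();

    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0);  // accepts but never writes
        new Thread(() -> acceptQuietly(server), "silent-peer").start();

        Thread reader = new Thread(() -> {
            lock.lock();
            try (Socket s = new Socket("localhost", server.getLocalPort())) {
                InputStream in = s.getInputStream();
                in.read();            // blocks in native I/O; thread stays RUNNABLE
            } catch (IOException ignored) {
            } finally {
                lock.unlock();
            }
        }, "reader-holding-lock");
        reader.start();

        Thread.sleep(500);
        System.out.println(reader.getName() + " state: " + reader.getState());  // RUNNABLE
        System.out.println("lock held: " + lock.isLocked());                    // true
        // The JVM stays wedged here until killed, much like Gerrit did.
    }

    static void acceptQuietly(ServerSocket server) {
        try { server.accept(); } catch (IOException ignored) {}
    }
}
```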

> If JavaMelody uses Thread.stop() to kill threads, then it might cause deadlocks. Was the thread dump here generated before or after trying to kill the SendEmail-1 thread?

Before.

I have other thread dumps from previous times this has happened as well (all posted in the upstream discussion, which has been mostly fruitless):

2019-04-17: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMTcvLS1qc3RhY2stMTktMDQtMTctMjAtNTgtMDIuZHVtcC0tMjEtNDctNA==
2019-04-23: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMjMvLS1qc3RhY2stMTktMDQtMjMtMjEtMTItMDQuZHVtcC0tMjEtMTItNDU=
2019-05-08: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvOC8tLWpzdGFjay0xOS0wNS0wOC0yMC0xMC0yMC5kdW1wLS0yMC0xMC01Ng==

Also, I take a thread dump every 10 minutes on cobalt. Today, the first time I see the problem is in the dump at 07:10 UTC:
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvMjgvLS1qc3RhY2stMTktMDUtMjgtMDctMTAtMDMuZHVtcC0tMTMtNDYtNDk=

> This looks to be the same issue someone else reported: https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

This does look like the same symptom. I also suspect something to do with Guava. This problem has been present, AFAICT, on versions 2.15.8-2.15.13 (possibly before, but whatever the trigger is wasn't present then).

> Is it also possible that it's some kind of blocking read causing a deadlock that doesn't show up in dumps? i.e., https://dzone.com/articles/java-concurrency-hidden-thread

Here it's clear that SendEmail-1 has taken the lock but failed to release it. It is also the responsibility of Guava's LocalCache not to let this happen. Since SendEmail-1 is clearly not in a place where the lock can legitimately be held, I suspect a bug in Guava or in the loading method for the account info (if LocalCache does not tolerate hard failures).
If it's due to an Error being thrown from the SendEmail-1 thread, I'd check the logs for errors just before the thread count started to rise.
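For reference, a small sketch of what a hard failure in a cache loader looks like, assuming Guava on the classpath (the cache and loader here are made up, not Gerrit's AccountCacheImpl): the Error surfaces wrapped in an ExecutionError, so if this were the trigger there should be a matching entry in error_log just before the blocked-thread count starts to climb, and a healthy LocalCache still releases its segment lock in a finally block:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.util.concurrent.ExecutionError;

public class LoaderErrorSketch {
    public static void main(String[] args) {
        // Made-up cache standing in for the account cache; not Gerrit code.
        LoadingCache<String, String> accounts = CacheBuilder.newBuilder()
            .build(new CacheLoader<String, String>() {
                @Override
                public String load(String key) {
                    throw new OutOfMemoryError("simulated failure while loading " + key);
                }
            });

        try {
            accounts.getUnchecked("some-account");
        } catch (ExecutionError e) {
            // This is the kind of entry one would expect in error_log if an
            // Error in the loader were the trigger.
            System.out.println("loader failed: " + e.getCause());
        }

        // If the segment lock had been leaked above, this second get would hang;
        // with a healthy Guava it just runs (and fails) the loader again.
        try {
            accounts.getUnchecked("some-account");
        } catch (ExecutionError e) {
            System.out.println("second load also failed, but nothing is blocked");
        }
    }
}
```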

> Here it's clear that SendEmail-1 has taken the lock but failed to release it. It is also the responsibility of Guava's LocalCache not to let this happen. Since SendEmail-1 is clearly not in a place where the lock can legitimately be held, I suspect a bug in Guava or in the loading method for the account info (if LocalCache does not tolerate hard failures).

Ah, then it may well be the bug linked by @Paladox (although I haven't dug into that thread deeply -- I am nominally on vacation).

> If it's due to an Error being thrown from the SendEmail-1 thread, I'd check the logs for errors just before the thread count started to rise.

FWIW, I didn't see an issue in the thread dump from 07:00 UTC, but I did see the issue in the thread dump from 07:10 UTC. There are only two items in the error log from those 10 minutes, and they both seem pretty innocuous:

thcipriani@cobalt:~$ grep -A1 '2019-05-28 07:0' /var/log/gerrit/error_log
[2019-05-28 07:01:10,154] [HTTP-3707] WARN  com.google.gerrit.httpd.ProjectBasicAuthFilter : Authentication failed for [username]
com.google.gerrit.server.auth.NoSuchUserException: No such user: [username]
--
[2019-05-28 07:09:37,737] [HTTP-2244] WARN  com.google.gerrit.httpd.ProjectBasicAuthFilter : Authentication failed for [username]
com.google.gerrit.server.auth.NoSuchUserException: No such user: [username]

I think https://groups.google.com/forum/#!topic/repo-discuss/0yej8sQDcPo may be related. Also, based on the user reply in that thread, I'm leaning towards this being the cause:

[2019-05-28 07:01:10,154] [HTTP-3707] WARN  com.google.gerrit.httpd.ProjectBasicAuthFilter : Authentication failed for [username]
com.google.gerrit.server.auth.NoSuchUserException: No such user: [username]

This comment was removed by Paladox.

Mentioned in SAL (#wikimedia-operations) [2019-06-25T20:33:11Z] <thcipriani> restarting gerrit due to T224448

This happened twice in the past 24 hours.

Make that 3 times in 24 hours.

I've now rolled Gerrit back from 2.15.14 to 2.15.13. It seems that the increase in emails triggered by account changes has exacerbated this issue quite a bit.

Investigated this a bit today. I was hoping with 3 incidents in one day that the trigger for this event might be obvious.

EU AM

  • 06:50: SendEmail-2 Parked as normal
  • 06:5x: SOMETHING TRIGGERS THE ISSUE
  • 07:00: First jstack thread dump that shows SendEmail-2 owning a lock
  • 07:29: Gerrit shows 10 threads; it had been < 1 for most of the time prior to this, and it never drops below 10 threads until restart
  • 07:30: First jstack thread dump that shows an HTTP thread blocked on SendEmail-2's lock
  • 08:40: ThreadDump shows SendEmail-2 blocking ~35 HTTP threads (everything except anon http, presumably)
  • 09:24: _joe_ restarts Gerrit https://tools.wmflabs.org/sal/log/AWuN84VUEHTBTPG-tGWP

US PM

  • 19:10: jstack thread dump shows both SendEmail threads parked as normal
  • 19:1x: SOMETHING TRIGGERS THE ISSUE
  • 19:20: jstack thread dump shows SendEmail-1 owning a lock
  • 19:25: grafana shows 4 threads, previously < 1; never goes back down until restart
  • 19:30: jstack thread dump shows HTTP thread waiting on SendEmail-1 lock
  • 20:33: thcipriani restarts gerrit https://tools.wmflabs.org/sal/log/AWuQV6WnOwpQ-3Pkr4V5

US PM2

  • 20:40: jstack thread dump shows both SendEmail threads parked as normal
  • 20:4x: SOMETHING TRIGGERS THE ISSUE
  • 20:50: jstack thread dump shows SendEmail-2 owning a lock
  • 20:56: grafana shows 9 threads, never comes back down
  • 21:00: jstack thread dump shows an HTTP thread and, oddly, SendEmail-1 waiting on the lock
  • 21:23: thcipriani restarts gerrit on 2.15.13 https://tools.wmflabs.org/sal/log/AWuQhXtqEHTBTPG-tqUL

Digging

I decided to dig through a bunch of logs, looking for commonality across the periods where the issue was triggered.

  • exim logs
    • no errors
    • all events that started also completed
  • All-Users repo
    • nothing in common between the 3 time periods
    • nothing happened at all in this database during the first time period
  • gerrit error_log
    • nothing noteworthy/nothing common to all 3 time periods
    • timeout during push
    • failed login attempts
  • ssh logs
    • the only thing in common is that people are pushing up patches (surprise!)
    • not even a noteworthy number of patches in any period (max 4, min 1)
  • http logs
    • I'll keep digging, so far nothing common among non-GET requests AFAICT

Observations

Dzahn added a subscriber: Dzahn. Jun 27 2019, 11:30 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-22T18:15:10Z] <thcipriani> restarting gerrit due to T224448

Happened again today: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDcvMjIvLS1qc3RhY2stMTktMDctMjItMTgtMDctNTguZHVtcC0tMTgtOC0zMQ==

Seems like fewer threads were blocked for whatever reason, but still happening :(

- parking to wait for <0x00000006d75ed7e0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2089)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2046)
at com.google.common.cache.LocalCache.get(LocalCache.java:3943)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3967)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4952)
at com.google.gerrit.server.account.AccountCacheImpl.get(AccountCacheImpl.java:85)
at com.google.gerrit.server.account.InternalAccountDirectory.fillAccountInfo(InternalAccountDirectory.java:69)
at com.google.gerrit.server.account.AccountLoader.fill(AccountLoader.java:91)
at com.google.gerrit.server.change.ChangeJson.formatQueryResults(ChangeJson.java:

That is still the local account cache; it might well just be a nasty deadlock bug in com.google.common.cache / Guava. Or the SendEmail thread has some bug and sometimes fails to release the local account cache lock :-\ Which brings us back to the upstream bug https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

It had a patch, https://gerrit-review.googlesource.com/c/gerrit/+/154130/, but that got reverted in https://gerrit-review.googlesource.com/c/gerrit/+/162870/ :-\

Gerrit went unresponsive again today, and I had to restart it.
@Paladox later confirmed it was the thread problem.

Happened once again this morning: