
Gerrit account cache has a faulty reentrant lock causing http/sendemail threads to stall completely
Closed, Resolved · Public

Description

Symptoms:

  • Gerrit is not responsive
  • Thread count skyrocketed
  • Monitoring probe Gerrit Health Check on gerrit.wikimedia.org is CRITICAL

Most probably these indicate that HTTP threads are all deadlocked, waiting for a lock that is never released. An obvious symptom is a SendEmail task being stuck (from ssh -p 29418 gerrit.wikimedia.org gerrit show-queue -w -q).

Upstream tasks:

Workaround:

Restart Gerrit :-\

ssh cobalt.wikimedia.org sudo systemctl restart gerrit

And log:

!log Restarting Gerrit T224448

Gerrit threads have been getting stuck behind a single SendEmail thread.

activeThreadsStuck.png (499×1 px, 22 KB)

(see the thread dump: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvMjcvLS1qc3RhY2stMTktMDUtMjctMjItNTItNTIuZHVtcC0tMjItNTMtMjA=)

The only way to resolve this issue is to restart Gerrit.

This issue is superficially similar to T131189; however:

  1. send-email doesn't show up in gerrit show-queue -w --by-queue (tried using various flags)
  2. lsof for the gerrit process at the time of these problems didn't show any smtp connections

I've tried killing the offending thread via the JavaMelody interface to no avail.

Ongoing upstream discussion: https://groups.google.com/forum/#!msg/repo-discuss/pBMh09-XJsw/vuhDiuTWAAAJ


Summary by @hashar

- parking to wait for <0x00000006d75ed7e0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2089)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2046)
at com.google.common.cache.LocalCache.get(LocalCache.java:3943)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3967)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4952)
at com.google.gerrit.server.account.AccountCacheImpl.get(AccountCacheImpl.java:85)
at com.google.gerrit.server.account.InternalAccountDirectory.fillAccountInfo(InternalAccountDirectory.java:69)
at com.google.gerrit.server.account.AccountLoader.fill(AccountLoader.java:91)
at com.google.gerrit.server.change.ChangeJson.formatQueryResults(ChangeJson.java:

That is still in the local account cache. It might well just be a nasty deadlock bug in com.google.common.cache / Guava, or the SendEmail thread has some bug and sometimes fails to release the local account cache lock :-\ Which brings us back to the upstream bug https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

It had a patch, https://gerrit-review.googlesource.com/c/gerrit/+/154130/, but that got reverted by https://gerrit-review.googlesource.com/c/gerrit/+/162870/ :-\

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-05-27T23:19:28Z] <thcipriani> gerrit back after restarting due to T224448

Mentioned in SAL (#wikimedia-operations) [2019-05-28T08:29:34Z] <volans> restarting gerrit due to stack threads - T224448

As suggested by @dcausse let's try to capture a jstack next time it happens:

sudo -u gerrit2 jstack $(pidof java)

Mentioned in SAL (#wikimedia-operations) [2019-05-28T08:40:25Z] <volans> T224448 sudo cumin -b 15 -p 95 'R:git::clone' 'run-puppet-agent -q --failed-only'

We have plenty of stacktraces already and the one in this task description matches :-]

The sendEmail thread is on hold waiting for some lock, which is a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject, and we have not been able to identify who/what sets that lock nor why it is not released.

sendEmail does access the account cache and that sets a reentrant lock on it. Any thread that uses the account cache ends up blocked as well, waiting for sendEmail to release the lock, which it never does because it is itself blocked on another, unidentified lock :-\
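
To illustrate the cascade being described, here is a minimal, self-contained Java sketch. None of this is Gerrit or Guava code and the thread names only mimic the dumps: one thread takes a ReentrantLock and then parks without ever unlocking, and every other thread needing that lock queues up inside lock(), which is where the jstack traces show the HTTP threads stuck (LocalCache$Segment.lockedGetOrLoad).

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative demo only (not Gerrit/Guava code): one thread never releases a
// ReentrantLock and every other thread needing the same lock stalls forever.
public class StuckLockDemo {
    private static final ReentrantLock accountCacheSegmentLock = new ReentrantLock();

    public static void main(String[] args) throws InterruptedException {
        // "SendEmail": grabs the lock, then blocks on something else without unlocking.
        Thread sendEmail = new Thread(() -> {
            accountCacheSegmentLock.lock();
            LockSupport.park(); // waits indefinitely, lock still held
        }, "SendEmail-1");
        sendEmail.start();

        // "HTTP" threads: each parks inside lock(), queued behind SendEmail-1.
        for (int i = 0; i < 5; i++) {
            new Thread(() -> {
                accountCacheSegmentLock.lock(); // never returns in this demo
                try {
                    // would read the account cache here
                } finally {
                    accountCacheSegmentLock.unlock();
                }
            }, "HTTP-" + i).start();
        }

        // Stand-in for the rising active-thread count seen on the Grafana graphs.
        TimeUnit.SECONDS.sleep(1);
        System.out.println("Threads queued on the lock: "
                + accountCacheSegmentLock.getQueueLength());
        // The process now hangs, mirroring the production stall.
    }
}

Running jstack against this toy process shows the HTTP-* threads parked inside ReentrantLock.lock(), the same shape as the production dumps.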

Here SendEmail-1 is not "blocked", it's just waiting for jobs; however, the dump says:
Locked ownable synchronizers:
- <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
Meaning that this SendEmail-1 thread holds the ReentrantLock that is blocking the HTTP threads.
So something happened previously in that thread that caused the lock to still be held.

I've tried killing the offending thread via the JavaMelody interface to no avail.

If JavaMelody uses Thread.stop() to kill threads then it might cause deadlocks. Was the thread dump here generated before or after trying to kill the SendEmail-1 thread?

Here SendEmail-1 is not "blocked", it's just waiting for jobs; however, the dump says:
Locked ownable synchronizers:
- <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
Meaning that this SendEmail-1 thread holds the ReentrantLock that is blocking the HTTP threads.
So something happened previously in that thread that caused the lock to still be held.

Is it also possible that it's some kind of blocking read causing a deadlock that doesn't show up in dumps? i.e., https://dzone.com/articles/java-concurrency-hidden-thread

I've tried killing the offending thread via the JavaMelody interface to no avail.

If JavaMelody uses Thread.stop() to kill threads then it might cause deadlocks. Was the thread dump here generated before or after trying to kill the SendEmail-1 thread?

Before.

I have other threaddumps from previous times this has happened as well (all posted on the upstream discussion that has been mostly fruitless):

2019-04-17: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMTcvLS1qc3RhY2stMTktMDQtMTctMjAtNTgtMDIuZHVtcC0tMjEtNDctNA==
2019-04-23: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMjMvLS1qc3RhY2stMTktMDQtMjMtMjEtMTItMDQuZHVtcC0tMjEtMTItNDU=
2019-05-08: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvOC8tLWpzdGFjay0xOS0wNS0wOC0yMC0xMC0yMC5kdW1wLS0yMC0xMC01Ng==

Also, I take a thread dump every 10 minutes on cobalt. Today, the first time I see the problem is in the dump from 07:10 UTC:
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvMjgvLS1qc3RhY2stMTktMDUtMjgtMDctMTAtMDMuZHVtcC0tMTMtNDYtNDk=

This looks to be the same as what someone else had https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

This does look like the same symptom. Also I do suspect something to do with guava. This problem has been present AFAICT on versions 2.15.8-2.15.13 (possibly before but whatever the trigger of this is wasn't present then).

Here SendEmail-1 is not "blocked", it's just waiting for jobs; however, the dump says:
Locked ownable synchronizers:
- <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
Meaning that this SendEmail-1 thread holds the ReentrantLock that is blocking the HTTP threads.
So something happened previously in that thread that caused the lock to still be held.

Is it also possible that it's some kind of blocking read causing a deadlock that doesn't show up in dumps? i.e., https://dzone.com/articles/java-concurrency-hidden-thread

Here it's clear that SendEmail-1 has held the lock but failed to release it. Also, it is the responsibility of the Guava LocalCache not to let this happen. Since SendEmail-1 is clearly not in a section of code where the lock can legitimately be held, I suspect a bug in Guava or in the loading method of the account info (if LocalCache does not allow hard failures).
If it's due to an Error being thrown from the SendEmail-1 thread, I'd check the logs for errors just before the thread count started to rise.
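
To make that suspected failure class concrete, here is a hedged sketch (a made-up loader, not Guava's actual LocalCache internals): if an Error escapes after lock() but before the try/finally that is supposed to unlock, the ReentrantLock stays held by that thread forever, even after it goes back to idling.

import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: the *class* of bug being suspected, i.e. an Error escaping
// between lock() and the finally-protected unlock. Guava's real LocalCache is
// structured differently; this is not its code.
public class LeakedLockSketch {
    private final ReentrantLock segmentLock = new ReentrantLock();

    Object getOrLoad(String key) {
        segmentLock.lock();
        // BUG: anything thrown *here*, before the try block is entered,
        // leaves segmentLock permanently held by this thread.
        preLoadBookkeeping(key); // imagine this throws an OutOfMemoryError
        try {
            return expensiveLoad(key);
        } finally {
            segmentLock.unlock();
        }
    }

    // The safe idiom: nothing between lock() and try, so every exit path unlocks.
    Object getOrLoadSafely(String key) {
        segmentLock.lock();
        try {
            preLoadBookkeeping(key);
            return expensiveLoad(key);
        } finally {
            segmentLock.unlock();
        }
    }

    private void preLoadBookkeeping(String key) { /* hypothetical step */ }
    private Object expensiveLoad(String key)    { return key; /* hypothetical loader */ }
}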

Here SendEmail-1 is not "blocked", it's just waiting for jobs; however, the dump says:
Locked ownable synchronizers:
- <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
Meaning that this SendEmail-1 thread holds the ReentrantLock that is blocking the HTTP threads.
So something happened previously in that thread that caused the lock to still be held.

Is it also possible that it's some kind of blocking read causing a deadlock that doesn't show up in dumps? i.e., https://dzone.com/articles/java-concurrency-hidden-thread

Here it's clear that SendEmail-1 has held the lock but failed to release it. Also, it is the responsibility of the Guava LocalCache not to let this happen. Since SendEmail-1 is clearly not in a section of code where the lock can legitimately be held, I suspect a bug in Guava or in the loading method of the account info (if LocalCache does not allow hard failures).

Ah, then it may well be the bug linked by @Paladox (although I haven't dug into that thread deeply -- I am nominally on vacation)

If it's due to an Error being thrown from the SendEmail-1 thread, I'd check the logs for errors just before the thread count started to rise.

FWIW, I didn't see an issue in the thread dump from 07:00 UTC, but I did see the issue in the thread dump from 07:10 UTC. There are only two items in the error log from those 10 minutes and they both seem pretty innocuous:

thcipriani@cobalt:~$ grep -A1 '2019-05-28 07:0' /var/log/gerrit/error_log
[2019-05-28 07:01:10,154] [HTTP-3707] WARN  com.google.gerrit.httpd.ProjectBasicAuthFilter : Authentication failed for [username]
com.google.gerrit.server.auth.NoSuchUserException: No such user: [username]
--
[2019-05-28 07:09:37,737] [HTTP-2244] WARN  com.google.gerrit.httpd.ProjectBasicAuthFilter : Authentication failed for [username]
com.google.gerrit.server.auth.NoSuchUserException: No such user: [username]

I think https://groups.google.com/forum/#!topic/repo-discuss/0yej8sQDcPo may be related. Also, based on the user reply in that thread, I'm leaning towards this:

[2019-05-28 07:01:10,154] [HTTP-3707] WARN  com.google.gerrit.httpd.ProjectBasicAuthFilter : Authentication failed for [username]
com.google.gerrit.server.auth.NoSuchUserException: No such user: [username]

being the cause.

This comment was removed by Paladox.

Mentioned in SAL (#wikimedia-operations) [2019-06-25T20:33:11Z] <thcipriani> restarting gerrit due to T224448

This happened twice in the past 24 hours.

Make that 3 times in 24 hours.

activeThreads2019-06-25.png (499×1 px, 26 KB)

I've now rolled back from gerrit 2.15.14 to gerrit 2.15.13. It seems that the increase in emails being triggered by account changes has exacerbated this issue quite a bit.

Investigated this a bit today. I was hoping with 3 incidents in one day that the trigger for this event might be obvious.

EU AM

  • 06:50: SendEmail-2 Parked as normal
  • 06:5x: SOMETHING TRIGGERS THE ISSUE
  • 07:00: First jstack thread dump that shows SendEmail-2 owning a lock
  • 07:29: Gerrit has 10 threads; it had been < 1 for most of the time prior to this, and Gerrit never drops below 10 threads until the restart
  • 07:30: First jstack thread dump that shows an HTTP thread blocked on SendEmail-2's lock
  • 08:40: Thread dump shows SendEmail-2 blocking ~35 HTTP threads (everything except anon http, presumably)
  • 09:24: _joe_ restarts Gerrit https://tools.wmflabs.org/sal/log/AWuN84VUEHTBTPG-tGWP

US PM

  • 19:10: jstack thread dump shows both SendEmail threads parked as normal
  • 19:1x: SOMETHING TRIGGERS THE ISSUE
  • 19:20: jstack thread dump shows SendEmail-1 owning a lock
  • 19:25: grafana shows 4 threads, previously < 1; never goes back down until restart
  • 19:30: jstack thread dump shows HTTP thread waiting on SendEmail-1 lock
  • 20:33: thcipriani restarts gerrit https://tools.wmflabs.org/sal/log/AWuQV6WnOwpQ-3Pkr4V5

US PM2

  • 20:40: jstack thread dump shows both SendEmail threads parked as normal
  • 20:4x: SOMETHING TRIGGERS THE ISSUE
  • 20:50: jstack thread dump shows SendEmail-2 owning a lock
  • 20:56: grafana shows 9 threads, never comes back down
  • 21:00: jstack thread dump shows an HTTP thread and, oddly, SendEmail-1 waiting on the lock
  • 21:23: thcipriani restarts gerrit on 2.15.13 https://tools.wmflabs.org/sal/log/AWuQhXtqEHTBTPG-tqUL

Digging

I decided to dig through a bunch of logs, looking for commonality across the three periods where the issue was triggered:

  • exim logs
    • no errors
    • all events that started, completed
  • All-Users repo
    • nothing in common between the 3 time periods
    • nothing happened at all in this database during the first time period
  • gerrit error_log
    • nothing noteworthy/nothing common to all 3 time periods
    • timeout during push
    • failed login attempts
  • ssh logs
    • the only thing in common is that people are pushing up patches (surprise!)
    • not even a noteworthy amount of patches for any period (max 4, min 1)
  • http logs
    • I'll keep digging, so far nothing common among non-GET requests AFAICT

Observations

Mentioned in SAL (#wikimedia-operations) [2019-07-22T18:15:10Z] <thcipriani> restarting gerrit due to T224448

- parking to wait for <0x00000006d75ed7e0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2089)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2046)
at com.google.common.cache.LocalCache.get(LocalCache.java:3943)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3967)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4952)
at com.google.gerrit.server.account.AccountCacheImpl.get(AccountCacheImpl.java:85)
at com.google.gerrit.server.account.InternalAccountDirectory.fillAccountInfo(InternalAccountDirectory.java:69)
at com.google.gerrit.server.account.AccountLoader.fill(AccountLoader.java:91)
at com.google.gerrit.server.change.ChangeJson.formatQueryResults(ChangeJson.java:

That is still in the local account cache. It might well just be a nasty deadlock bug in com.google.common.cache / Guava, or the SendEmail thread has some bug and sometimes fails to release the local account cache lock :-\ Which brings us back to the upstream bug https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

It had a patch, https://gerrit-review.googlesource.com/c/gerrit/+/154130/, but that got reverted by https://gerrit-review.googlesource.com/c/gerrit/+/162870/ :-\

Gerrit went unresponsive again today, and I had to restart it.
@Paladox later confirmed it was the threads problem.

activeThreads.png (499×1 px, 24 KB)

Happened once again this morning:

activeThreads2019-08-09.png (499×1 px, 20 KB)

Gerrit got stuck again this morning. There was a stuck SendEmail task (from gerrit show-queue -w -q) which I tried to kill (ssh gerrit kill XXX), but that didn't do the trick. jstack:

https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDkvMTEvLS1nZXJyaXQtanN0YWNrLnR4dC0tNy01LTg=

The HTTP threads and the SendEmail-2 thread have:

- parking to wait for  <0x000000056fe77528> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

And all those traces point to the usual com.google.gerrit.server.account.AccountCacheImpl -> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad -> java.util.concurrent.locks.ReentrantLock.lock.

Question asked on IRC: should a monitor/alarm be added for the active thread count, coupled with a runbook about what to do when it happens (getting a thread dump, etc.)? Not talking about a runbook for this specific problem, but a generic one that we could re-use in the future if anything similar happens. A list of outstanding issues could also be added to the runbook, so people who are unaware will be one click away from getting up to speed (instead of reading SAL, tasks, etc.).

Question asked on IRC: should a monitor/alarm be added for the active thread count, coupled with a runbook about what to do when it happens (getting a thread dump, etc.)? Not talking about a runbook for this specific problem, but a generic one that we could re-use in the future if anything similar happens. A list of outstanding issues could also be added to the runbook, so people who are unaware will be one click away from getting up to speed (instead of reading SAL, tasks, etc.).

Or maybe even some sort of auto-healing: if XXXX parameter is above YY, restart Gerrit.

And that happened again https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDkvMTEvLS1nZXJyaXQtanN0YWNrLnR4dC0tMTEtNTEtNDI=

Apparently it is only HTTP threads this time, but they still refer to the account cache.

# grep  -B3 'parking to wait for.*0x000000056eb8d848' gerrit-jstack.txt|egrep '^"'|cut -b-19
"HTTP-12344" #12344
"HTTP-12338" #12338
"HTTP-12337" #12337
"HTTP-12330" #12330
"HTTP-12329" #12329
"HTTP-12328" #12328
"HTTP-12325" #12325
"HTTP-12323" #12323
"HTTP-12322" #12322
"HTTP-12321" #12321
"HTTP-12317" #12317
"HTTP-12308" #12308
"HTTP-12236" #12236
"HTTP-12235" #12235
"HTTP-12234" #12234
"HTTP-12233" #12233
"HTTP-12160" #12160
"HTTP-12159" #12159
"HTTP-12158" #12158
"HTTP-12157" #12157
"HTTP-12156" #12156
"HTTP-12155" #12155
"HTTP-12152" #12152
"HTTP-12151" #12151
"HTTP-12150" #12150
"HTTP-12149" #12149
"HTTP-12148" #12148
"HTTP-12147" #12147
"HTTP-12144" #12144
"HTTP-12143" #12143
"HTTP-12142" #12142
"HTTP-12141" #12141
"HTTP-12139" #12139
"HTTP-12052" #12052
"HTTP-12048" #12048
"HTTP-12047" #12047
"HTTP-12045" #12045
"HTTP-12042" #12042
"HTTP-12040" #12040
"HTTP-12036" #12036
"HTTP-12034" #12034
"HTTP-12031" #12031
"HTTP-11831" #11831
"HTTP-11781" #11781
"HTTP-11780" #11780
"HTTP-11776" #11776
"HTTP-11775" #11775
"HTTP-11770" #11770
"HTTP-11518" #11518
"HTTP-11513" #11513
"HTTP-11511" #11511
"HTTP-11458" #11458
"HTTP-11457" #11457
"HTTP-11455" #11455
"HTTP-11274" #11274
"HTTP-11191" #11191
"HTTP-11095" #11095
"HTTP-10496" #10496

Mentioned in SAL (#wikimedia-operations) [2019-09-11T11:59:41Z] <hashar> Restarting Gerrit due to deadlock in the account cache # T224448

hashar renamed this task from Gerrit http threads stuck behind sendemail thread to Gerrit account cache has a faulty reentrant lock causing http/sendemail threads to stall completely. Sep 11 2019, 12:03 PM
hashar updated the task description.
hashar moved this task from Backlog to Reported Upstream on the Upstream board.

Traces from today's incident.


Trace with locks (jstack -l); these are the usual traces we have had so far.

The HTTP threads are blocked on a lock held by SendEmail-2, which is itself parked on:

"SendEmail-2" #243 prio=5 os_prio=0 tid=0x00007fd830004800 nid=0x2740 waiting on condition [0x00007fd8e4321000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000002c0218158> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   Locked ownable synchronizers:
    - <0x00000002ca2807c8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

The other SendEmail thread is parked on the lock held by SendEmail-2, just like the HTTP threads.


jstack -m, which includes native traces; this is available now that I have the OpenJDK debugging symbols installed (openjdk-8-dbg).

The SendEmail-2 thread has nid=0x273f, which is 10047:

----------------- 10047 -----------------
0x00007fdaa99b004f  __pthread_cond_wait + 0xbf
0x00007fdaa8a9451a  Unsafe_Park + 0xfa
0x00007fda922fabea  <Unknown compiled code>

So hmm. I don't know :)

There is still a lock that does not show up. Maybe there is a way to ask the JVM for a list of every single lock and get a trace of which part of the code set it. But that is out of my league.
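
There is at least a partial answer to that: ThreadMXBean can report, per thread, the ownable synchronizers (e.g. ReentrantLocks) it currently holds, which is the same data as jstack's "Locked ownable synchronizers" section, though not the stack trace at the moment the lock was acquired. A rough sketch of such a report, which could in principle be run periodically alongside the existing 10-minute dumps:

import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: ask the JVM which threads currently own which j.u.c. locks
// (AbstractOwnableSynchronizer-based, e.g. ReentrantLock). This mirrors
// jstack's "Locked ownable synchronizers" output but can be collected
// programmatically.
public class LockOwnerReport {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isSynchronizerUsageSupported()) {
            System.err.println("JVM does not support synchronizer usage monitoring");
            return;
        }
        // true, true => include locked monitors and locked synchronizers
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            LockInfo[] owned = info.getLockedSynchronizers();
            if (owned.length == 0) {
                continue;
            }
            System.out.println(info.getThreadName() + " [" + info.getThreadState() + "]"
                    + (info.getLockInfo() != null ? " waiting on " + info.getLockInfo() : ""));
            for (LockInfo lock : owned) {
                System.out.println("  holds " + lock);
            }
        }
    }
}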

I have found a similar bug report for Jira: https://jira.atlassian.com/browse/JRASERVER-63834

What I note in my last dump is that SendEmail-2 uses sun.misc.Unsafe.park().

"SendEmail-2" #243 prio=5 os_prio=0 tid=0x00007fd830004800 nid=0x2740 waiting on condition [0x00007fd8e4321000]
   java.lang.Thread.State: WAITING (parking)                                                  
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000002c0218158> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   ...
   Locked ownable synchronizers:
    - <0x00000002ca2807c8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

And that thread holds the lock, apparently on a cache segment of the account cache, which ends up blocking everything. The $100 question, and I will happily pay that in person, is why that thread is itself waiting on a lock (0x00000002c0218158) that is nowhere to be found?!

Maybe something in the system "unsafely" dies and fails to release the lock :-\ But it is not clear from the trace what would be causing it.
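
One way to convince ourselves that this jstack signature really means "the lock was leaked during an earlier job and the thread then went back to its idle wait" is to reproduce it artificially. A self-contained sketch follows (nothing here is Gerrit or Guava code): the worker leaks a ReentrantLock while handling one job and then parks again in its queue's take(), which waits on an internal ConditionObject, producing exactly the WAITING (parking) plus "Locked ownable synchronizers" combination seen for SendEmail-2.

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

// Artificial reproduction of the jstack signature: a worker thread leaks a
// ReentrantLock while processing one job, then goes back to queue.take().
// jstack -l on this process shows the worker WAITING (parking) on the queue's
// internal AbstractQueuedSynchronizer$ConditionObject, with the leaked lock
// listed under "Locked ownable synchronizers" -- same shape as SendEmail-2.
public class LeakedLockRepro {
    static final ReentrantLock cacheLock = new ReentrantLock();
    static final LinkedBlockingQueue<Runnable> jobs = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws Exception {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    // Idle wait: parks on the queue's notEmpty ConditionObject.
                    jobs.take().run();
                }
            } catch (InterruptedException ignored) {
            }
        }, "SendEmail-repro");
        worker.start();

        // One job that acquires the lock and "forgets" to release it.
        jobs.put(() -> cacheLock.lock());

        Thread.sleep(1000);
        System.out.println("Lock still held by a thread: " + cacheLock.isLocked());
        // Take a thread dump now (jstack -l <pid>) to see the signature.
        Thread.sleep(Long.MAX_VALUE);
    }
}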

I / we should dig into the other thread dumps we have to check whether that is a common pattern, namely one of the SendEmail threads being parked via Unsafe.park().

After a chat with @Paladox and @thcipriani:

Whatever the issue is, it seems to be out of our league. The issue could be in Gerrit or in the Google Guava cache library. We have been pointed towards replacing the Guava cache with Caffeine, which has a better design and is actively maintained. Thus the upstream issue we filed will be declined: https://github.com/google/guava/issues/3602

A patch to add Caffeine had to be reverted but maybe it can be restored.

Original change: https://gerrit-review.googlesource.com/c/gerrit/+/154130

Reverts: https://gerrit-review.googlesource.com/q/If65560b4a9bfcf0a03decaedd83ad000c6b28f4f because of a circular dependency somewhere in Gerrit 2.14 / 2.15 :-\

Revert due to: https://bugs.chromium.org/p/gerrit/issues/detail?id=8464

Status Update:

Upstream have been very kind and gave us a workaround after we asked for one here: https://github.com/google/guava/issues/3602#issuecomment-538104189

The workaround is: https://github.com/google/guava/issues/3602#issuecomment-538119157

I've gone ahead and uploaded it here: https://gerrit-review.googlesource.com/c/gerrit/+/239436 (for 2.15)!

I've uploaded a test patch so I could test this on master: https://gerrit-review.googlesource.com/c/gerrit/+/239494 (works locally)!

So I'm going to push this upstream, hoping that we won't get any objections. At least one of the Gerrit maintainers called it reasonable, but is unsure whether others will reject it.

I have restarted Gerrit again today as it was unresponsive.

screenshot-gerrit.wikimedia.org-2019.10.09-07_14_30.png (519×1 px, 59 KB)

Mentioned in SAL (#wikimedia-operations) [2019-10-10T16:04:38Z] <thcipriani> restarting gerrit due to T224448

Change 542174 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] Gerrit: Up the "accounts" cache to unlimited

https://gerrit.wikimedia.org/r/542174

Change 542174 abandoned by Paladox:
Gerrit: Up the "accounts" cache to unlimited

https://gerrit.wikimedia.org/r/542174

Since we moved to new hardware (with increased resources) things have looked much better! Threads have been much lower and we haven't had to restart for over 3 weeks!

More good news: upstream has a change on the master branch to switch to the Caffeine cache: https://gerrit-review.googlesource.com/c/gerrit/+/244612 !
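
For context, this is roughly the shape of the Guava-to-Caffeine swap being done upstream; the identifiers below are made up for illustration and are not Gerrit's actual account cache wiring. Caffeine's builder deliberately mirrors Guava's CacheBuilder, so the call sites change very little.

// Illustrative only; not Gerrit's real cache wiring.
//
// Guava style (what Gerrit 2.15 uses):
//   LoadingCache<Integer, Object> cache =
//       CacheBuilder.newBuilder().maximumSize(1024).build(CacheLoader.from(loader::load));
//
// Caffeine replacement (separate, actively maintained library with a similar API):
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

class AccountCacheSketch {
    interface Loader { Object load(Integer accountId); } // hypothetical loader

    static LoadingCache<Integer, Object> build(Loader loader) {
        return Caffeine.newBuilder()
                .maximumSize(1024)      // same tuning knobs as the Guava builder
                .build(loader::load);   // loader passed as a lambda/method reference
    }
}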

Interestingly this may have been a kernel bug: https://gerrit-review.googlesource.com/c/gerrit/+/153090/4#message-10cc8f42c24c28b1cb29c441c9aff13684555654

So when we upgraded to buster, that came with a new kernel that included the fix!

Interestingly this may have been a kernel bug: https://gerrit-review.googlesource.com/c/gerrit/+/153090/4#message-10cc8f42c24c28b1cb29c441c9aff13684555654

So when we upgraded to buster, that came with a new kernel that included the fix!

That commit is from 2014. It has been in the kernel since version 3.18 and in all the 4.x kernels (based on the tags listed on that commit page). We definitely had the patch on cobalt: Jessie comes with a patched 3.16, and Wikimedia runs 4.9.0 anyway (backported from Stretch). :)

thcipriani claimed this task.

This hasn't happened in 2 months; i.e., since we migrated to gerrit1001 from cobalt.

I do think this is still a problem upstream; however, whatever was causing the problem on our system seems to no longer be happening.

Optimistically closing this task.