
Gerrit account cache has a faulty reentrant lock causing http/sendemail threads to stall completely
Closed, Resolved · Public

Description

Symptoms:

  • Gerrit is not responsive
  • Thread count skyrocketed
  • Monitoring probe Gerrit Health Check on gerrit.wikimedia.org is CRITICAL

Most probably these indicate that HTTP threads are all deadlocked, waiting for a lock that is never released. An obvious symptom is a SendEmail task being stuck (from ssh -p 29418 gerrit.wikimedia.org gerrit show-queue -w -q).

Upstream tasks:

Workaround:

Restart Gerrit :-\

ssh cobalt.wikimedia.org sudo systemctl restart gerrit

And log:

!log Restarting Gerrit T224448

Gerrit threads have been getting stuck behind a single SendEmail thread.

activeThreadsStuck.png (499×1 px, 22 KB)

(see the thread dump: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvMjcvLS1qc3RhY2stMTktMDUtMjctMjItNTItNTIuZHVtcC0tMjItNTMtMjA=)

The only way to resolve this issue is to restart Gerrit.

This issue is superficially similar to T131189; however:

  1. send-email doesn't show up in gerrit show-queue -w --by-queue (tried using various flags)
  2. lsof for the gerrit process at the time of these problems didn't show any smtp connections

I've tried killing the offending thread via the JavaMelody interface to no avail.

Ongoing upstream discussion: https://groups.google.com/forum/#!msg/repo-discuss/pBMh09-XJsw/vuhDiuTWAAAJ


Summary by @hashar

- parking to wait for <0x00000006d75ed7e0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2089)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2046)
at com.google.common.cache.LocalCache.get(LocalCache.java:3943)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3967)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4952)
at com.google.gerrit.server.account.AccountCacheImpl.get(AccountCacheImpl.java:85)
at com.google.gerrit.server.account.InternalAccountDirectory.fillAccountInfo(InternalAccountDirectory.java:69)
at com.google.gerrit.server.account.AccountLoader.fill(AccountLoader.java:91)
at com.google.gerrit.server.change.ChangeJson.formatQueryResults(ChangeJson.java:

That is still in the local account cache. It might well just be a nasty deadlock bug in com.google.common.cache / Guava, or the SendEmail thread has some bug and sometimes fails to release the local account cache lock :-\ Which brings us back to the upstream bug https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

It had a patch, https://gerrit-review.googlesource.com/c/gerrit/+/154130/, but that got reverted by https://gerrit-review.googlesource.com/c/gerrit/+/162870/ :-\

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-05-27T23:19:28Z] <thcipriani> gerrit back after restarting due to T224448

Mentioned in SAL (#wikimedia-operations) [2019-05-28T08:29:34Z] <volans> restarting gerrit due to stack threads - T224448

As suggested by @dcausse let's try to capture a jstack next time it happens:

sudo -u gerrit2 jstack $(pidof java)

Mentioned in SAL (#wikimedia-operations) [2019-05-28T08:40:25Z] <volans> T224448 sudo cumin -b 15 -p 95 'R:git::clone' 'run-puppet-agent -q --failed-only'

We have plenty of stacktraces already and the one in this task description matches :-]

The sendEmail thread is on hold waiting for some lock, which is a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject, and we have not been able to identify who/what sets that lock nor why it is not released.

sendEmail does access the account cache and that sets a reentrant lock on it. Any thread that uses the account cache ends up blocked as well, waiting for sendEmail to release the lock, which it never does because it is itself blocked on another, unidentified lock :-\
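
To illustrate the cascade being described, here is a minimal, self-contained Java sketch. None of this is Gerrit or Guava code and the thread names only mimic the dumps: one thread takes a ReentrantLock and then parks without ever unlocking, and every other thread needing that lock queues up inside lock(), which is where the jstack traces show the HTTP threads stuck (LocalCache$Segment.lockedGetOrLoad).

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative demo only (not Gerrit/Guava code): one thread never releases a
// ReentrantLock and every other thread needing the same lock stalls forever.
public class StuckLockDemo {
    private static final ReentrantLock accountCacheSegmentLock = new ReentrantLock();

    public static void main(String[] args) throws InterruptedException {
        // "SendEmail": grabs the lock, then blocks on something else without unlocking.
        Thread sendEmail = new Thread(() -> {
            accountCacheSegmentLock.lock();
            LockSupport.park(); // waits indefinitely, lock still held
        }, "SendEmail-1");
        sendEmail.start();

        // "HTTP" threads: each parks inside lock(), queued behind SendEmail-1.
        for (int i = 0; i < 5; i++) {
            new Thread(() -> {
                accountCacheSegmentLock.lock(); // never returns in this demo
                try {
                    // would read the account cache here
                } finally {
                    accountCacheSegmentLock.unlock();
                }
            }, "HTTP-" + i).start();
        }

        // Stand-in for the rising active-thread count seen on the Grafana graphs.
        TimeUnit.SECONDS.sleep(1);
        System.out.println("Threads queued on the lock: "
                + accountCacheSegmentLock.getQueueLength());
        // The process now hangs, mirroring the production stall.
    }
}

Running jstack against this toy process shows the HTTP-* threads parked inside ReentrantLock.lock(), the same shape as the production dumps.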

Here SendEmail-1 is not "blocked", it's just waiting for jobs; however, the dump says:
Locked ownable synchronizers:
- <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
Meaning that this SendEmail-1 thread holds the ReentrantLock that is blocking the HTTP threads.
So something happened previously in that thread that caused the lock to still be held.

I've tried killing the offending thread via the JavaMelody interface to no avail.

If JavaMelody uses Thread.stop() to kill threads then it might cause deadlocks. Was the thread dump here generated before or after trying to kill the SendEmail-1 thread?

Here SendEmail-1 is not "blocked", it's just waiting for jobs; however, the dump says:
Locked ownable synchronizers:
- <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
Meaning that this SendEmail-1 thread holds the ReentrantLock that is blocking the HTTP threads.
So something happened previously in that thread that caused the lock to still be held.

Is it also possible that it's some kind of blocking read causing a deadlock that doesn't show up in dumps? i.e., https://dzone.com/articles/java-concurrency-hidden-thread

I've tried killing the offending thread via the JavaMelody interface to no avail.

If JavaMelody uses Thread.stop() to kill threads then it might cause deadlocks. Was the thread dump here generated before or after trying to kill the SendEmail-1 thread?

Before.

I have other threaddumps from previous times this has happened as well (all posted on the upstream discussion that has been mostly fruitless):

2019-04-17: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMTcvLS1qc3RhY2stMTktMDQtMTctMjAtNTgtMDIuZHVtcC0tMjEtNDctNA==
2019-04-23: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMjMvLS1qc3RhY2stMTktMDQtMjMtMjEtMTItMDQuZHVtcC0tMjEtMTItNDU=
2019-05-08: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvOC8tLWpzdGFjay0xOS0wNS0wOC0yMC0xMC0yMC5kdW1wLS0yMC0xMC01Ng==

Also, I take a thread dump every 10 minutes on cobalt. Today, the first time I see the problem is in the dump from 07:10 UTC:
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDUvMjgvLS1qc3RhY2stMTktMDUtMjgtMDctMTAtMDMuZHVtcC0tMTMtNDYtNDk=

This looks to be the same as what someone else had https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

This does look like the same symptom. Also I do suspect something to do with guava. This problem has been present AFAICT on versions 2.15.8-2.15.13 (possibly before but whatever the trigger of this is wasn't present then).

Here SendEmail-1 is not "blocked", it's just waiting for jobs; however, the dump says:
Locked ownable synchronizers:
- <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
Meaning that this SendEmail-1 thread holds the ReentrantLock that is blocking the HTTP threads.
So something happened previously in that thread that caused the lock to still be held.

Is it also possible that it's some kind of blocking read causing a deadlock that doesn't show up in dumps? i.e., https://dzone.com/articles/java-concurrency-hidden-thread

Here it's clear that SendEmail-1 has held the lock but failed to release it. Also, it is the responsibility of the Guava LocalCache not to let this happen. Since SendEmail-1 is clearly not in a section of code where the lock can legitimately be held, I suspect a bug in Guava or in the loading method of the account info (if LocalCache does not allow hard failures).
If it's due to an Error being thrown from the SendEmail-1 thread, I'd check the logs for errors just before the thread count started to rise.
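
To make that suspected failure class concrete, here is a hedged sketch (a made-up loader, not Guava's actual LocalCache internals): if an Error escapes after lock() but before the try/finally that is supposed to unlock, the ReentrantLock stays held by that thread forever, even after it goes back to idling.

import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: the *class* of bug being suspected, i.e. an Error escaping
// between lock() and the finally-protected unlock. Guava's real LocalCache is
// structured differently; this is not its code.
public class LeakedLockSketch {
    private final ReentrantLock segmentLock = new ReentrantLock();

    Object getOrLoad(String key) {
        segmentLock.lock();
        // BUG: anything thrown *here*, before the try block is entered,
        // leaves segmentLock permanently held by this thread.
        preLoadBookkeeping(key); // imagine this throws an OutOfMemoryError
        try {
            return expensiveLoad(key);
        } finally {
            segmentLock.unlock();
        }
    }

    // The safe idiom: nothing between lock() and try, so every exit path unlocks.
    Object getOrLoadSafely(String key) {
        segmentLock.lock();
        try {
            preLoadBookkeeping(key);
            return expensiveLoad(key);
        } finally {
            segmentLock.unlock();
        }
    }

    private void preLoadBookkeeping(String key) { /* hypothetical step */ }
    private Object expensiveLoad(String key)    { return key; /* hypothetical loader */ }
}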

Here SendEmail-1 is not "blocked", it's just waiting for jobs; however, the dump says:
Locked ownable synchronizers:
- <0x00000001c13617f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
Meaning that this SendEmail-1 thread holds the ReentrantLock that is blocking the HTTP threads.
So something happened previously in that thread that caused the lock to still be held.

Is it also possible that it's some kind of blocking read causing a deadlock that doesn't show up in dumps? i.e., https://dzone.com/articles/java-concurrency-hidden-thread

Here it's clear that SendEmail-1 has held the lock but failed to release it. Also, it is the responsibility of the Guava LocalCache not to let this happen. Since SendEmail-1 is clearly not in a section of code where the lock can legitimately be held, I suspect a bug in Guava or in the loading method of the account info (if LocalCache does not allow hard failures).

Ah, then it may well be the bug linked by @Paladox (although I haven't dug into that thread deeply -- I am nominally on vacation)

If it's due to an Error being thrown from the SendEmail-1 thread, I'd check the logs for errors just before the thread count started to rise.

FWIW, I didn't see an issue in the thread dump from 07:00 UTC, but I did see the issue in the thread dump from 07:10 UTC. There are only two items in the error log from those 10 minutes and they both seem pretty innocuous:

thcipriani@cobalt:~$ grep -A1 '2019-05-28 07:0' /var/log/gerrit/error_log
[2019-05-28 07:01:10,154] [HTTP-3707] WARN  com.google.gerrit.httpd.ProjectBasicAuthFilter : Authentication failed for [username]
com.google.gerrit.server.auth.NoSuchUserException: No such user: [username]
--
[2019-05-28 07:09:37,737] [HTTP-2244] WARN  com.google.gerrit.httpd.ProjectBasicAuthFilter : Authentication failed for [username]
com.google.gerrit.server.auth.NoSuchUserException: No such user: [username]

I think https://groups.google.com/forum/#!topic/repo-discuss/0yej8sQDcPo may be related. Also, based on the user reply in that thread, I'm leaning towards this:

[2019-05-28 07:01:10,154] [HTTP-3707] WARN  com.google.gerrit.httpd.ProjectBasicAuthFilter : Authentication failed for [username]
com.google.gerrit.server.auth.NoSuchUserException: No such user: [username]

being the cause.

This comment was removed by Paladox.

Mentioned in SAL (#wikimedia-operations) [2019-06-25T20:33:11Z] <thcipriani> restarting gerrit due to T224448

This happened twice in the past 24 hours.

Make that 3 times in 24 hours.

activeThreads2019-06-25.png (499×1 px, 26 KB)

I've now rolled back from gerrit 2.15.14 to gerrit 2.15.13. It seems that the increase in emails being triggered by account changes has exacerbated this issue quite a bit.

Investigated this a bit today. I was hoping with 3 incidents in one day that the trigger for this event might be obvious.

EU AM

  • 06:50: SendEmail-2 Parked as normal
  • 06:5x: SOMETHING TRIGGERS THE ISSUE
  • 07:00: First jstack thread dump that shows SendEmail-2 owning a lock
  • 07:29: Gerrit has 10 threads; it had been < 1 for most of the time prior to this, and Gerrit never drops below 10 threads until the restart
  • 07:30: First jstack thread dump that shows an HTTP thread blocked on SendEmail-2's lock
  • 08:40: Thread dump shows SendEmail-2 blocking ~35 HTTP threads (everything except anon http, presumably)
  • 09:24: _joe_ restarts Gerrit https://tools.wmflabs.org/sal/log/AWuN84VUEHTBTPG-tGWP

US PM

  • 19:10: jstack thread dump shows both SendEmail threads parked as normal
  • 19:1x: SOMETHING TRIGGERS THE ISSUE
  • 19:20: jstack thread dump shows SendEmail-1 owning a lock
  • 19:25: grafana shows 4 threads, previously < 1; never goes back down until restart
  • 19:30: jstack thread dump shows HTTP thread waiting on SendEmail-1 lock
  • 20:33: thcipriani restarts gerrit https://tools.wmflabs.org/sal/log/AWuQV6WnOwpQ-3Pkr4V5

US PM2

  • 20:40: jstack thread dump shows both SendEmail threads parked as normal
  • 20:4x: SOMETHING TRIGGERS THE ISSUE
  • 20:50: jstack thread dump shows SendEmail-2 owning a lock
  • 20:56: grafana shows 9 threads, never comes back down
  • 21:00: jstack thread dump shows an HTTP thread and, oddly, SendEmail-1 waiting on the lock
  • 21:23: thcipriani restarts gerrit on 2.15.13 https://tools.wmflabs.org/sal/log/AWuQhXtqEHTBTPG-tqUL

Digging

I decided to dig through a bunch of logs, looking for commonality across the three periods where the issue was triggered:

  • exim logs
    • no errors
    • all events that started, completed
  • All-Users repo
    • nothing in common between the 3 time periods
    • nothing happened at all in this database during the first time period
  • gerrit error_log
    • nothing noteworthy/nothing common to all 3 time periods
    • timeout during push
    • failed login attempts
  • ssh logs
    • the only thing in common is that people are pushing up patches (surprise!)
    • not even a noteworthy amount of patches for any period (max 4, min 1)
  • http logs
    • I'll keep digging, so far nothing common among non-GET requests AFAICT

Observations

Mentioned in SAL (#wikimedia-operations) [2019-07-22T18:15:10Z] <thcipriani> restarting gerrit due to T224448

- parking to wait for <0x00000006d75ed7e0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2089)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2046)
at com.google.common.cache.LocalCache.get(LocalCache.java:3943)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3967)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4952)
at com.google.gerrit.server.account.AccountCacheImpl.get(AccountCacheImpl.java:85)
at com.google.gerrit.server.account.InternalAccountDirectory.fillAccountInfo(InternalAccountDirectory.java:69)
at com.google.gerrit.server.account.AccountLoader.fill(AccountLoader.java:91)
at com.google.gerrit.server.change.ChangeJson.formatQueryResults(ChangeJson.java:

That is still in the local account cache. It might well just be a nasty deadlock bug in com.google.common.cache / Guava, or the SendEmail thread has some bug and sometimes fails to release the local account cache lock :-\ Which brings us back to the upstream bug https://bugs.chromium.org/p/gerrit/issues/detail?id=7645

It had a patch, https://gerrit-review.googlesource.com/c/gerrit/+/154130/, but that got reverted by https://gerrit-review.googlesource.com/c/gerrit/+/162870/ :-\

Gerrit went unresponsive again today, and I had to restart it.
@Paladox later confirmed it was the threads problem.

activeThreads.png (499×1 px, 24 KB)

Happened once again this morning:

activeThreads2019-08-09.png (499×1 px, 20 KB)

Gerrit got stuck again this morning. There was a stuck SendEmail task (from gerrit show-queue -w -q) which I tried to kill (ssh gerrit kill XXX), but that didn't do the trick. jstack:

https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDkvMTEvLS1nZXJyaXQtanN0YWNrLnR4dC0tNy01LTg=

The HTTP threads and the SendEmail-2 thread have:

- parking to wait for  <0x000000056fe77528> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

And all those traces point to the usual com.google.gerrit.server.account.AccountCacheImpl -> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad -> java.util.concurrent.locks.ReentrantLock.lock.

Question asked on IRC: should a monitor/alarm be added for the active thread count, coupled with a runbook about what to do when it happens (getting a thread dump, etc.)? Not talking about a runbook for this specific problem, but a generic one that we could re-use in the future if anything similar happens. A list of outstanding issues could also be added to the runbook, so people who are unaware will be one click away from getting up to speed (instead of reading SAL, tasks, etc.).

Question asked on IRC: should a monitor/alarm be added for the active thread count, coupled with a runbook about what to do when it happens (getting a thread dump, etc.)? Not talking about a runbook for this specific problem, but a generic one that we could re-use in the future if anything similar happens. A list of outstanding issues could also be added to the runbook, so people who are unaware will be one click away from getting up to speed (instead of reading SAL, tasks, etc.).

Or maybe even some sort of auto-healing: if XXXX parameter is above YY, restart Gerrit.

And that happened again https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDkvMTEvLS1nZXJyaXQtanN0YWNrLnR4dC0tMTEtNTEtNDI=

Apparently it is only HTTP threads this time, but they still refer to the account cache.

# grep  -B3 'parking to wait for.*0x000000056eb8d848' gerrit-jstack.txt|egrep '^"'|cut -b-19
"HTTP-12344" #12344
"HTTP-12338" #12338
"HTTP-12337" #12337
"HTTP-12330" #12330
"HTTP-12329" #12329
"HTTP-12328" #12328
"HTTP-12325" #12325
"HTTP-12323" #12323
"HTTP-12322" #12322
"HTTP-12321" #12321
"HTTP-12317" #12317
"HTTP-12308" #12308
"HTTP-12236" #12236
"HTTP-12235" #12235
"HTTP-12234" #12234
"HTTP-12233" #12233
"HTTP-12160" #12160
"HTTP-12159" #12159
"HTTP-12158" #12158
"HTTP-12157" #12157
"HTTP-12156" #12156
"HTTP-12155" #12155
"HTTP-12152" #12152
"HTTP-12151" #12151
"HTTP-12150" #12150
"HTTP-12149" #12149
"HTTP-12148" #12148
"HTTP-12147" #12147
"HTTP-12144" #12144
"HTTP-12143" #12143
"HTTP-12142" #12142
"HTTP-12141" #12141
"HTTP-12139" #12139
"HTTP-12052" #12052
"HTTP-12048" #12048
"HTTP-12047" #12047
"HTTP-12045" #12045
"HTTP-12042" #12042
"HTTP-12040" #12040
"HTTP-12036" #12036
"HTTP-12034" #12034
"HTTP-12031" #12031
"HTTP-11831" #11831
"HTTP-11781" #11781
"HTTP-11780" #11780
"HTTP-11776" #11776
"HTTP-11775" #11775
"HTTP-11770" #11770
"HTTP-11518" #11518
"HTTP-11513" #11513
"HTTP-11511" #11511
"HTTP-11458" #11458
"HTTP-11457" #11457
"HTTP-11455" #11455
"HTTP-11274" #11274
"HTTP-11191" #11191
"HTTP-11095" #11095
"HTTP-10496" #10496

Mentioned in SAL (#wikimedia-operations) [2019-09-11T11:59:41Z] <hashar> Restarting Gerrit due to deadlock in the account cache # T224448

hashar renamed this task from Gerrit http threads stuck behind sendemail thread to Gerrit account cache has a faulty reentrant lock causing http/sendemail threads to stall completely. Sep 11 2019, 12:03 PM
hashar updated the task description.
hashar moved this task from Backlog to Reported Upstream on the Upstream board.

Traces from today's incident.


Trace with locks (jstack -l); these are the usual traces we have had so far.

The HTTP threads are blocked on a lock held by SendEmail-2, which is itself parked on:

"SendEmail-2" #243 prio=5 os_prio=0 tid=0x00007fd830004800 nid=0x2740 waiting on condition [0x00007fd8e4321000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000002c0218158> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   Locked ownable synchronizers:
    - <0x00000002ca2807c8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

The other SendEmail thread is parked on the lock held by SendEmail-2, just like the HTTP threads.


jstack -m, which includes native traces; this is available now that I have the OpenJDK debugging symbols installed (openjdk-8-dbg).

The SendEmail-2 thread has nid=0x273f, which is 10047:

----------------- 10047 -----------------
0x00007fdaa99b004f  __pthread_cond_wait + 0xbf
0x00007fdaa8a9451a  Unsafe_Park + 0xfa
0x00007fda922fabea  <Unknown compiled code>

So hmm. I don't know :)

There is still a lock that does not show up. Maybe there is a way to ask the JVM for a list of every single lock and get a trace of which part of the code set it. But that is out of my league.
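
There is at least a partial answer to that: ThreadMXBean can report, per thread, the ownable synchronizers (e.g. ReentrantLocks) it currently holds, which is the same data as jstack's "Locked ownable synchronizers" section, though not the stack trace at the moment the lock was acquired. A rough sketch of such a report, which could in principle be run periodically alongside the existing 10-minute dumps:

import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: ask the JVM which threads currently own which j.u.c. locks
// (AbstractOwnableSynchronizer-based, e.g. ReentrantLock). This mirrors
// jstack's "Locked ownable synchronizers" output but can be collected
// programmatically.
public class LockOwnerReport {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isSynchronizerUsageSupported()) {
            System.err.println("JVM does not support synchronizer usage monitoring");
            return;
        }
        // true, true => include locked monitors and locked synchronizers
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            LockInfo[] owned = info.getLockedSynchronizers();
            if (owned.length == 0) {
                continue;
            }
            System.out.println(info.getThreadName() + " [" + info.getThreadState() + "]"
                    + (info.getLockInfo() != null ? " waiting on " + info.getLockInfo() : ""));
            for (LockInfo lock : owned) {
                System.out.println("  holds " + lock);
            }
        }
    }
}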

I have found a similar bug report for Jira: https://jira.atlassian.com/browse/JRASERVER-63834

What I note in my last dump is that SendEmail-2 uses sun.misc.Unsafe.park().

"SendEmail-2" #243 prio=5 os_prio=0 tid=0x00007fd830004800 nid=0x2740 waiting on condition [0x00007fd8e4321000]
   java.lang.Thread.State: WAITING (parking)                                                  
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000002c0218158> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   ...
   Locked ownable synchronizers:
    - <0x00000002ca2807c8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

And that thread holds the lock, apparently on a cache segment of the account cache, which ends up blocking everything. The $100 question, and I will happily pay that in person, is why that thread is itself waiting on a lock (0x00000002c0218158) that is nowhere to be found?!

Maybe something in the system "unsafely" dies and fails to release the lock :-\ But it is not clear from the trace what would be causing it.
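
One way to convince ourselves that this jstack signature really means "the lock was leaked during an earlier job and the thread then went back to its idle wait" is to reproduce it artificially. A self-contained sketch follows (nothing here is Gerrit or Guava code): the worker leaks a ReentrantLock while handling one job and then parks again in its queue's take(), which waits on an internal ConditionObject, producing exactly the WAITING (parking) plus "Locked ownable synchronizers" combination seen for SendEmail-2.

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

// Artificial reproduction of the jstack signature: a worker thread leaks a
// ReentrantLock while processing one job, then goes back to queue.take().
// jstack -l on this process shows the worker WAITING (parking) on the queue's
// internal AbstractQueuedSynchronizer$ConditionObject, with the leaked lock
// listed under "Locked ownable synchronizers" -- same shape as SendEmail-2.
public class LeakedLockRepro {
    static final ReentrantLock cacheLock = new ReentrantLock();
    static final LinkedBlockingQueue<Runnable> jobs = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws Exception {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    // Idle wait: parks on the queue's notEmpty ConditionObject.
                    jobs.take().run();
                }
            } catch (InterruptedException ignored) {
            }
        }, "SendEmail-repro");
        worker.start();

        // One job that acquires the lock and "forgets" to release it.
        jobs.put(() -> cacheLock.lock());

        Thread.sleep(1000);
        System.out.println("Lock still held by a thread: " + cacheLock.isLocked());
        // Take a thread dump now (jstack -l <pid>) to see the signature.
        Thread.sleep(Long.MAX_VALUE);
    }
}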

I / we should dig into the other thread dumps we have to check whether that is a common pattern, namely one of the SendEmail threads being parked via Unsafe.park().

After a chat with @Paladox and @thcipriani:

Whatever the issue is, it seems to be out of our league. The issue could be in Gerrit or in the Google Guava cache library. We have been pointed towards replacing the Guava cache with Caffeine, which has a better design and is actively maintained. Thus the upstream issue we filed will be declined: https://github.com/google/guava/issues/3602

A patch to add Caffeine had to be reverted but maybe it can be restored.

Original change: https://gerrit-review.googlesource.com/c/gerrit/+/154130

Reverts: https://gerrit-review.googlesource.com/q/If65560b4a9bfcf0a03decaedd83ad000c6b28f4f because of a circular dependency somewhere in Gerrit 2.14 / 2.15 :-\

Revert due to: https://bugs.chromium.org/p/gerrit/issues/detail?id=8464

Status Update:

Upstream have been very kind and gave us a workaround after we asked for one here: https://github.com/google/guava/issues/3602#issuecomment-538104189

The workaround is: https://github.com/google/guava/issues/3602#issuecomment-538119157

I've gone ahead and uploaded it here: https://gerrit-review.googlesource.com/c/gerrit/+/239436 (for 2.15)!

I've uploaded a test patch so I could test this on master: https://gerrit-review.googlesource.com/c/gerrit/+/239494 (works locally)!

So I'm going to push this upstream, hoping that we won't get any objections. At least one of the Gerrit maintainers called it reasonable, but is unsure whether others will reject it.

I have restarted Gerrit again today as it was unresponsive.

screenshot-gerrit.wikimedia.org-2019.10.09-07_14_30.png (519×1 px, 59 KB)

Mentioned in SAL (#wikimedia-operations) [2019-10-10T16:04:38Z] <thcipriani> restarting gerrit due to T224448

Change 542174 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] Gerrit: Up the "accounts" cache to unlimited

https://gerrit.wikimedia.org/r/542174

Change 542174 abandoned by Paladox:
Gerrit: Up the "accounts" cache to unlimited

https://gerrit.wikimedia.org/r/542174

Since we moved to new hardware (with increased resources) things have looked much better! Threads have been much lower and we haven't had to restart for over 3 weeks!

More good news: upstream has a change on the master branch to switch to the Caffeine cache: https://gerrit-review.googlesource.com/c/gerrit/+/244612 !
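
For context, this is roughly the shape of the Guava-to-Caffeine swap being done upstream; the identifiers below are made up for illustration and are not Gerrit's actual account cache wiring. Caffeine's builder deliberately mirrors Guava's CacheBuilder, so the call sites change very little.

// Illustrative only; not Gerrit's real cache wiring.
//
// Guava style (what Gerrit 2.15 uses):
//   LoadingCache<Integer, Object> cache =
//       CacheBuilder.newBuilder().maximumSize(1024).build(CacheLoader.from(loader::load));
//
// Caffeine replacement (separate, actively maintained library with a similar API):
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

class AccountCacheSketch {
    interface Loader { Object load(Integer accountId); } // hypothetical loader

    static LoadingCache<Integer, Object> build(Loader loader) {
        return Caffeine.newBuilder()
                .maximumSize(1024)      // same tuning knobs as the Guava builder
                .build(loader::load);   // loader passed as a lambda/method reference
    }
}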

Interestingly this may have been a kernel bug: https://gerrit-review.googlesource.com/c/gerrit/+/153090/4#message-10cc8f42c24c28b1cb29c441c9aff13684555654

So when we upgraded to buster, that came with a new kernel that included the fix!

Interestingly this may have been a kernel bug: https://gerrit-review.googlesource.com/c/gerrit/+/153090/4#message-10cc8f42c24c28b1cb29c441c9aff13684555654

So when we upgraded to buster, that came with a new kernel that included the fix!

That commit is from 2014. It has been in the kernel since version 3.18 and in all the 4.x kernels (based on the tags listed on that commit page). We definitely had the patch on cobalt: Jessie comes with a patched 3.16, and Wikimedia runs 4.9.0 anyway (backported from Stretch). :)

thcipriani claimed this task.

This hasn't happened in 2 months; i.e., since we migrated to gerrit1001 from cobalt.

I do think this is still a problem upstream; however, whatever was causing the problem on our system seems to no longer be happening.

Optimistically closing this task.