Gerrit thread use GC thrashing
Closed, Resolved · Public

Assigned To: thcipriani
Authored By: thcipriani, Apr 15 2019, 6:20 PM

Description

Gerrit has run out of HTTP threads and needed a restart several times over the past few weeks. Initially I blamed the upgrade to 2.15.12 [0]; however, our thread woes have continued after downgrading to a version on which we were previously stable (2.15.8).

Watching javamelody monitoring over the period, I think symptoms are similar to T148478 -- that is, gerrit seems to run fine for a few days and then we hit some kind of condition that causes GC thrashing.

GC thrashing slows down HTTP response times, which in turn increases concurrent thread use (threads stay active longer, hence more threads run in parallel).

These longer-than-average HTTP times seem to begin when we hit the gerrit heap limit and GC (I guess) brings memory use back down. Memory use then climbs back up to our heap size very rapidly, triggering Java GC thrashing.

I think we'll need to monitor Java GC more closely and fine-tune some values in terms of gerrit thread use. The gerrit scaling manual has some advice [1] on how to do this.

[0]. https://groups.google.com/forum/#!msg/repo-discuss/pBMh09-XJsw/k2L6ZAgcCAAJ
[1]. https://gerrit.googlesource.com/homepage/+/md-pages/docs/Scaling.md#java-heap-and-gc

Event Timeline

Change 504073 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] Gerrit: Enable gc logging

https://gerrit.wikimedia.org/r/504073

Just to clarify, when you say GC, do you mean Java virtual machine garbage collection or git repository object garbage collection?

Java virtual machine garbage collection.

Today, when I got up, I noticed that around 10UTC we started to build up thread-use again. You can see what happens to our memory use around 10UTC -- this is what I think is thrashing:

gc-thrashing-today.png (499×1 px, 46 KB)

Paladox triaged this task as High priority. Apr 15 2019, 8:31 PM

Change 504073 merged by CDanis:
[operations/puppet@production] Gerrit: Enable gc logging

https://gerrit.wikimedia.org/r/504073
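
For reference, GC logging on Java 8 is typically enabled with flags along these lines (how and where the puppet change above sets them isn't shown here, and the log path is an assumption); in gerrit.config they would go under container.javaOptions:

  [container]
      # log path is an assumption for illustration
      javaOptions = -Xloggc:/var/log/gerrit/jvm_gc.gerrit.log
      javaOptions = -XX:+PrintGCDetails
      javaOptions = -XX:+PrintGCDateStamps
      javaOptions = -XX:+UseGCLogFileRotation
      javaOptions = -XX:NumberOfGCLogFiles=5
      javaOptions = -XX:GCLogFileSize=20M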

From the "Scaling Gerrit" link above:

Operations on git repositories can consume lots of memory. If you consume more memory than your java heap, your server may either run out of memory and fail, or simply thrash forever while java gciing. Large fetches such as clones tend to be the largest RAM consumers on a Gerrit system.

The majority of traffic is clones/fetch/pull git ops. It does seem like our packedGitLimit may be a bit small to keep these running smoothly.

core.packedGitLimit
Maximum number of bytes to load and cache in memory from pack files. If JGit needs to access more than this many bytes it will unload less frequently used windows to reclaim memory space within the process.

It's currently set to 2g. Recommended sizes vary: I've seen recommendations of 1g-4g up to 1/2 heap size.

Looking at our top 10 most downloaded repos -- their packfiles sum to 1.996g so we're cutting it pretty close.

thcipriani@cobalt:~$ awk -F'[. ]' '/upload-pack/ {print $8}' /var/log/gerrit/sshd_log | sort | uniq -c | sort -rn | head
   1571 /mediawiki/core
   1441 /operations/puppet
    678 /wikibase/termbox
    662 /mediawiki/extensions/Wikibase
    552 /mediawiki/services/parsoid
    384 /mediawiki/extensions/WikibaseLexeme
    376 /mediawiki/extensions/WikibaseMediaInfo
    324 /mediawiki/skins/MinervaNeue
    280 /operations/mediawiki-config
    241 /mediawiki/extensions/ContentTranslation
gerrit2@cobalt:~$ du -chs /srv/gerrit/git/mediawiki/core.git/objects/pack/
841M    total
gerrit2@cobalt:~$ du -chs /srv/gerrit/git/operations/puppet.git/objects/pack/
457M    total
gerrit2@cobalt:~$ du -chs /srv/gerrit/git/wikibase/termbox.git/objects/pack/
21M     total
gerrit2@cobalt:~$ du -chs /srv/gerrit/git/mediawiki/extensions/Wikibase.git/objects/pack/    
171M    total
gerrit2@cobalt:~$ du -chs /srv/gerrit/git/mediawiki/services/parsoid.git/objects/pack/
208M    total
gerrit2@cobalt:~$ du -chs /srv/gerrit/git/mediawiki/extensions/WikibaseLexeme.git/objects/pack/
20M     total
gerrit2@cobalt:~$ du -chs /srv/gerrit/git/mediawiki/extensions/WikibaseMediaInfo.git/objects/pack/
14M     total
gerrit2@cobalt:~$ du -chs /srv/gerrit/git/mediawiki/skins/MinervaNeue.git/objects/pack/
18M     total
gerrit2@cobalt:~$ du -chs /srv/gerrit/git/operations/mediawiki-config.git/objects/pack/
210M    total
gerrit2@cobalt:~$ du -chs /srv/gerrit/git/mediawiki/extensions/ContentTranslation.git/objects/pack/
36M     total

I imagine a lot of fetches come from mirroring, which means they are cron'd to run over some large set of repos, which means we're probably ejecting useful packfiles from cache quite often.

In terms of "right-sizing" our packedGitLimit -- there's a long tail of fetches (913 unique repos fetched today) -- as of 2019-04-15T21:15 there are 109 repos that have been fetched more than 20 times (guessing they're mirrored). Pack sizes vary a lot, so it's hard to even wager a guess -- the median packfile size of the top 11 repos is 36M. 109 repos at 36MB of packed files per repo puts us at 3.924g.
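
A rough sketch of how that count can be reproduced from the same sshd_log (this mirrors the awk invocation earlier in this comment; it is not necessarily the exact command used):

  # repos fetched more than 20 times in today's log
  awk -F'[. ]' '/upload-pack/ {print $8}' /var/log/gerrit/sshd_log | sort | uniq -c | awk '$1 > 20' | wc -l
  # back-of-the-envelope sizing: 109 repos x 36 MB median packfile ≈ 3.9 GB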

Once again, around 10UTC JVM GC started kicking in and thread use is up and has not dropped back down:

usedMemory2019-04-16T15:38:15.png (499×1 px, 42 KB)

activeThreads2019-04-16T15:38:15.png (499×1 px, 22 KB)

At least this seems to validate cause and effect here.

Should check traffic starting around 10UTC for yesterday and today -- figure out if there's any pattern that might help inform how to tune the system.

Change 504448 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] gerrit: increase core.packedGitLimit

https://gerrit.wikimedia.org/r/504448

Change 504448 merged by Dzahn:
[operations/puppet@production] gerrit: increase core.packedGitLimit

https://gerrit.wikimedia.org/r/504448
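
For context, the knob lives in the core section of gerrit.config; a hedged sketch (the value actually merged is in the change above and isn't shown here -- 4g is just an illustration based on the sizing estimate earlier):

  [core]
      packedGitLimit = 4g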

Here's the GCEasy report from around the time gerrit started thrashing today:
https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMTYvLS1qdm1fZ2MuZ2Vycml0LmxvZy4xLS0yMC01NS0zMQ==&channel=WEB

A threaddump from right before and right after the first GC at 10UTC:
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMTYvLS1qc3RhY2stMTktMDQtMTYtMDktNTAtMDMuZHVtcC0tMjItMjUtMjE7Oy0tanN0YWNrLTE5LTA0LTE2LTEwLTAwLTAyLmR1bXAtLTIyLTI1LTIx

The notable change I see in the threaddumps is that there is some blocking on reading a packfile in the second dump. That lock clears in subsequent dumps. That kind of BLOCKING doesn't seem uncommon -- it shows up in 13% of the threaddump files I have (currently about 550 files from 5 or so days).

From looking at http requests per minute in javamelody, over 1 year, I see that traffic has increased a lot:

https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=httpHitsRate

Screenshot from 2019-04-16 19-22-21.png (517×1 px, 27 KB)

Each simultaneous request allocates a significant chunk of ram. I think we need to increase the heap size a bit and possibly reduce the size of the thread pool to limit memory use.

One thing that's curious about that graph is that the mean stays roughly the same, but the max request per minute has been going up. I'm not sure what that's indicative of: there are minutes where we're serving more traffic, but overall the same amount of traffic? Is there more traffic clustered around the hour than there used to be, but overall the same traffic?

Here's the GCEasy report from around the time gerrit started thrashing today:
https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMTYvLS1qdm1fZ2MuZ2Vycml0LmxvZy4xLS0yMC01NS0zMQ==&channel=WEB

A threaddump from right before and right after the first GC at 10UTC:
https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMTYvLS1qc3RhY2stMTktMDQtMTYtMDktNTAtMDMuZHVtcC0tMjItMjUtMjE7Oy0tanN0YWNrLTE5LTA0LTE2LTEwLTAwLTAyLmR1bXAtLTIyLTI1LTIx

The notable change I see in the threaddumps is that there is some blocking on reading a packfile in the second dump. That lock clears in subsequent dumps. That kind of BLOCKING doesn't seem uncommon -- it shows up in 13% of the threaddump files I have (currently about 550 files from 5 or so days).

@Gehel you were able to comment on T148478 previously -- your insights on that task seem to be true for this task as well. Could I ask if anything stands out to you looking at these reports?

Some past behavior we have observed is a reentrant lock being held on the account cache, potentially exacerbated when a lot of concurrent HTTP requests are made.

@mmodell's finding about the HTTP max request rate shows that the peak has doubled. So there must be changes somewhere that are hitting Gerrit hard. One source we have noticed on another task is bots running on wmflabs that clone all MediaWiki repositories; the other obvious one is CI / Quibble.

If HTTP requests are a concern, we can probably parse the Apache access logs and try to figure out some correlation / top users, etc.


Would it make sense to set up a read-only replica such as git.wikimedia.org to offload Gerrit?

The bots/scripts running on WMCS could easily be made to point to that mirror.

Zuul / zuul-merger really makes a bunch of assumptions that it is interacting with the canonical repository, but Zuul has some support for waiting for replication. The CI jobs do clone from gerrit.wikimedia.org before checking out the patch from the zuul-merger, though if replication is delayed/broken, CI would no longer be testing the proper tip of the branches :-(

Also, Phabricator uses apachetop, which might be helpful to see what is going on live (rather than looking at JavaMelody) :]
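
As a purely hypothetical illustration (no such mirror exists yet, and the hostname/path here are assumptions), repointing a bot's or service's checkout at a read-only mirror would be a one-liner:

  git remote set-url origin https://git.wikimedia.org/r/mediawiki/core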

Change 504611 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] gerrit: lower sshd threadpool size

https://gerrit.wikimedia.org/r/504611

Out of 623k HTTPS requests in today's access logs:

Requests | IP                        | DNS PTR
   84110 | 172.16.1.221              | codesearch4.codesearch.eqiad.wmflabs.
   69921 | 2620:0:861:102:10:64:16:8 | phab1001.eqiad.wmnet.
   51736 | 172.16.1.85               | extdist-02.extdist.eqiad.wmflabs.
   51736 | 172.16.1.84               | extdist-01.extdist.eqiad.wmflabs.
   51676 | 172.16.1.86               | extdist-03.extdist.eqiad.wmflabs.
   16465 | 172.16.5.187              | integration-slave-docker-1051
   16116 | 172.16.5.162              | integration-slave-docker-1048
   14709 | 172.16.5.181              | integration-slave-docker-1050
   13660 | 172.16.1.36               | integration-slave-docker-1041
   13579 | 172.16.0.26               | integration-slave-docker-1054
   12990 | 172.16.6.184              | integration-slave-docker-1043
   12909 | 172.16.3.86               | integration-slave-docker-1040
   11672 | 172.16.7.168              | integration-slave-docker-1034
   10847 | 172.16.5.190              | integration-slave-docker-1052
    9705 | xxxxx                     | some public internet IP
    8786 | 172.16.3.87               | integration-slave-docker-1037

Probably codesearch ( https://codesearch.wmflabs.org/ ), Phabricator and extdist ( https://www.mediawiki.org/wiki/Extension:ExtensionDistributor ) could be moved to use a mirror.

The CI slaves do hammer Gerrit :-/

Note that this is for any HTTP request, not just git-upload-pack. But the result is similar when filtering for upload-pack.

Not taken into account: git fetches from the zuul-mergers, which are done over ssh with the jenkins-bot user.
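
For anyone wanting to reproduce a per-IP count like the table above, a rough sketch against the Apache access logs (the log path, and the assumption that the client IP is the first field, may not match the actual setup):

  awk '{print $1}' /var/log/apache2/*access*.log | sort | uniq -c | sort -rn | head -16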

Change 504611 merged by Dzahn:
[operations/puppet@production] gerrit: lower sshd threadpool size

https://gerrit.wikimedia.org/r/504611
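
For reference, the ssh worker pools are configured in the sshd section of gerrit.config; a hedged sketch (the values actually set by the change above aren't shown here, so the numbers below are illustrative):

  [sshd]
      # interactive ssh worker threads; illustrative value
      threads = 8
      # separate, smaller pool used by Non-Interactive Users; illustrative value
      batchThreads = 2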

@hashar: yeah I think a read-only mirror might make sense; we could have codesearch and several other things reading from that instead of the master. I'd like to get a better idea of what codesearch is doing: is it following the changes stream or just periodically fetching all refs? I don't really know, but I plan to look into it (or maybe @Legoktm can comment about that)

according to https://www.mediawiki.org/wiki/Codesearch :

codesearch uses hound as the search implementation. It indexes the origin/master branch of all specified repositories, and updates them every hour. Each search profile is backed by a different instance of hound. A small Python Flask server acts as the proxy for the web frontend, backed by gunicorn, to the individual hound instances.

So if it's just fetching origin/master, that shouldn't be too bad, however, if it's using bare repos then it might be fetching more than that unintentionally.

Requests | IP                        | DNS PTR
   69921 | 2620:0:861:102:10:64:16:8 | phab1001.eqiad.wmnet.

I realized that phab was just doing anonymous http cloning, and therefore using the same threadpool as regular users. For mediawiki/core and operations/puppet I changed the clone url to be an authorized HTTP url using the phab user. I put the phab user into the Non-Interactive Users group. So now it's sharing the Batch threadpool rather than the primary threadpool. That may take pressure off a bit.

The CI slaves do hammer Gerrit :-/

I updated the repos we use for --reference on those machines (for operations/puppet and mediawiki/core) via cumin yesterday evening, so that's maybe a little helpful.

I upgraded Gerrit to 2.15.12 in preparation for plugins still in development. I'd like to not change too many things at once, but I am a bit stuck.

Prior to the upgrade I ran into the same issue I've seen previously in 2.15.11 and 2.15.12 [0] -- we were still running 2.15.8 at the time.

I continued with the upgrade as scheduled, as I think the real issue is the one outlined in this task -- GC thrashing, threads piling up, gerrit becoming unresponsive -- rather than a specific version.

I managed to capture a threaddump from the moment tasks started piling up: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMTcvLS1qc3RhY2stMTktMDQtMTctMjAtNTgtMDIuZHVtcC0tMjEtNDctNA==

I'll post this upstream to see if it reveals any insights.

[0]. https://groups.google.com/d/msg/repo-discuss/pBMh09-XJsw/vuhDiuTWAAAJ

Another thing to note in the gerrit show-caches output:

  Name                          |Entries              |  AvgGet |Hit Ratio|
                                |   Mem   Disk   Space|         |Mem  Disk|
--------------------------------+---------------------+---------+---------+
  ...
  projects                      |  2048               |  12.9ms | 94%     |
  ...

In our gerrit.config we have:

[cache "projects"]
    memoryLimit = 2048
    loadOnStartup = true

So it seems we've hit the cache limit (2048), and are still only serving 94% of requests from memory.

From https://groups.google.com/d/msg/repo-discuss/ojur5qz8BGo/NNjCR-bYBQAJ

we consider a host "unhealthy" if the cache is not fully loaded, especially the projects and groups.

We have 2221 projects, so it looks like we may need to raise our projects memoryLimit a bit. 3096 seems like a value that would give us some headroom. It seems like we have room on the heap for this as well.
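
That is, something along these lines in gerrit.config (3096 is the proposal above; the merged change may have ended up with a different value):

  [cache "projects"]
      memoryLimit = 3096
      loadOnStartup = true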

Change 504912 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] gerrit: increase projects cache

https://gerrit.wikimedia.org/r/504912

The changeid_project cache is, today, looking like it's in poor shape:

  Name                          |Entries              |  AvgGet |Hit Ratio|                                                                            
                                |   Mem   Disk   Space|         |Mem  Disk|                                                                            
--------------------------------+---------------------+---------+---------+                                                                            
  changeid_project              |  2048               |         | 78%     |
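
For anyone following along, output like the above comes from the show-caches ssh command (standard Gerrit ssh port assumed):

  ssh -p 29418 gerrit.wikimedia.org gerrit show-caches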

Change 504912 merged by CRusnov:
[operations/puppet@production] gerrit: increase projects cache

https://gerrit.wikimedia.org/r/504912

I managed to capture a threaddump from the moment tasks started piling up: https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMTcvLS1qc3RhY2stMTktMDQtMTctMjAtNTgtMDIuZHVtcC0tMjEtNDctNA==

I'll post this upstream to see if it reveals any insights.

[0]. https://groups.google.com/d/msg/repo-discuss/pBMh09-XJsw/vuhDiuTWAAAJ

Yesterday, threads locked up in the same way as they did on 2019-04-17:

https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMjMvLS1qc3RhY2stMTktMDQtMjMtMjEtMTItMDQuZHVtcC0tMjEtMTItNDU=

That is, HTTP-x threads blocked on SendEmail-1. I don't see anything unusual in exim logs at that time. Some "address does not exist" errors, but those are not uncommon.

@hashar: yeah I think a read-only mirror might make sense; we could have codesearch and several other things reading from that instead of the master. I'd like to get a better idea of what codesearch is doing: is it following the changes stream or just periodically fetching all refs? I don't really know, but I plan to look into it (or maybe @Legoktm can comment about that)

My understanding is that codesearch's usage of git is not ideal, there's a really good explanation upstream about the problems: https://github.com/hound-search/hound/issues/249

As you found, codesearch currently pulls/polls every hour. I can increase that delay as needed, just let me know what it should be. While we want search indexes to be up to date, Gerrit being up is significantly more important.

And it should be relatively trivial for codesearch to switch over to a read-only mirror.

Change 506452 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] gerrit: raise changeid_project cache

https://gerrit.wikimedia.org/r/506452

Change 506452 merged by CDanis:
[operations/puppet@production] gerrit: raise changeid_project cache

https://gerrit.wikimedia.org/r/506452

Change 327763 had a related patch set uploaded (by Paladox; owner: Paladox):
[operations/puppet@production] gerrit: Enable G1 GC

https://gerrit.wikimedia.org/r/327763

It seems like we've tuned a good number of gerrit parameters at this point and we're still experiencing GC thrashing (although less than previously), which means we ought to start looking at JVM GC tuning.

Looking at a report from today:
https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTkvMDQvMjkvLS1qdm1fZ2MuZ2Vycml0LmxvZy41LS0xMi0zMS0yMQ==&channel=WEB

Some observations:

  • We are currently using Parallel GC, which is optimized for throughput as opposed to latency
  • We suffer from many concurrent full GCs

heap-usage-after.png (503×867 px, 47 KB)

  • Concurrent full GCs are not reclaiming bytes

bytes-reclaimed-full-gc.png (501×876 px, 29 KB)

  • Times of full GCs co-vary with times of increased HTTP latency
  • We have over-allocated metaspace by at least an order of magnitude

jvm-peak-allocation.png (284×733 px, 19 KB)

Thoughts/recommendations:

  • Increased full GCs mean decreased throughput, and the increased HTTP times would seem to be a direct result of that
  • Would like to work from the most basic tuning to the most fine-grained tuning
  • I'd like to try the G1 GC to decrease the number of GCs at the expense of throughput. Many of the helpful folks on the upstream mailing list who run large gerrit instances use this GC with success.
  • In conjunction with the GC change, I'd like to try setting -XX:MaxGCPauseMillis to a lowish number. Our current 90th percentile GC pause is 300ms; that may be a reasonable number to start with (see the sketch after this list).
  • Decrease metaspace
    • this doesn't seem like such a dire need, but something to look at
  • Seems like we run out of oldgen space with some frequency; however, that may be due to younggen promotion rather than oldgen space problems
    • We either need to add size to YoungGen, OldGen, or the heap
    • I'd like other opinions before doing any tuning that fine-grained
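
A hedged sketch of what the corresponding container.javaOptions in gerrit.config might look like (the pause target and metaspace cap below are illustrative, not the values of any merged change):

  [container]
      javaOptions = -XX:+UseG1GC
      javaOptions = -XX:MaxGCPauseMillis=300
      # illustrative cap only; see the metaspace observation above
      javaOptions = -XX:MaxMetaspaceSize=512m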

Change 327763 merged by Dzahn:
[operations/puppet@production] gerrit: Enable G1 GC

https://gerrit.wikimedia.org/r/327763

17:07 < mutante> Inst openjdk-8-jdk [8u181-b13-2] (8u212-b01-1~deb8u1 Wikimedia:8/jessie-wikimedia [amd64]) []
17:07 < mutante> then we would be 8u212
..

17:10 < mutante> !log cobalt (gerrit) upgrading openjdk 8 minor version
17:13 < mutante> !log restarting gerrit

I have noticed 2 problems:

  1. GC spiral of doom
  2. HTTP threads get stuck behind some lock held by a SendEmail thread.

I had assumed they were related problems since:

  • GC Spiral of Doom causes a pile-up of threads
  • HTTP threads getting stuck behind SendEmail piles up threads
  • Thread pile-up has (at least until the pile-up today) meant a non-responsive gerrit

Now I'm not sure if I *do* think they are the same issue since, after upgrading to use G1GC, HTTP threads began to get stuck behind SendEmail; however, HTTP response times remained low and gerrit seemed responsive, AFAICT. Eventually gerrit would have run out of threads in the threadpool and become unresponsive, but that is different from GC or CPU causing slow response times. This makes me think: maybe there are two issues here.

17:07 < mutante> Inst openjdk-8-jdk [8u181-b13-2] (8u212-b01-1~deb8u1 Wikimedia:8/jessie-wikimedia [amd64]) []
17:07 < mutante> then we would be 8u212
..

17:10 < mutante> !log cobalt (gerrit) upgrading openjdk 8 minor version
17:13 < mutante> !log restarting gerrit

This seems to be the same symptom that we had: i.e., the SendEmail thread remained stuck. In our case (see fastthread.io links in this task) we were not stuck in socketRead0 (according to the stack traces in the dumps). I have wondered why the sendemail thread wasn't killed by a timeout; hopefully the jvm change resolves that. I even tried killing the SendEmail thread once and (same experience as the person from that task mentions) no tasks waiting on that thread ever finished, and I still had to restart gerrit.

I'm hopeful that the upgrade to 8u212 will resolve the SendEmail problem.

I'm also hopeful that G1GC will help mitigate any GC thrashing.

Will continue to monitor.

Change 507858 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] gerrit: bump heap limit

https://gerrit.wikimedia.org/r/507858

Change 507858 merged by CDanis:
[operations/puppet@production] gerrit: bump heap limit

https://gerrit.wikimedia.org/r/507858
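
For context, the heap limit is set via container.heapLimit in gerrit.config; a hedged sketch (the actual before/after values are in the change above and aren't shown here, so the number below is purely illustrative):

  [container]
      heapLimit = 32g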

Regarding the SendEmail thread, it took a while to remember about it, but T131189 was about SendEmail having stuck TCP connections eventually blocking the task and thus the thread pool. Alexandros used gdb to close the sockets "manually" :D The task is worth reading.

Our Gerrit currently has sendemail.connectTimeout = 30 seconds and sendemail.threadPoolSize = 2. Apparently sending an email holds a lock on the account cache, so I would imagine that whenever an issue occurs when sending email, that in turn blocks all HTTP requests that have to acquire a lock on the account cache. It is really just me imagining things. Potentially, once the email has been prepared, the account cache could be released before actually sending it. If that is the issue at all.

If it ever occurs again, it would be interesting to check/monitor established smtp connections (via netstat / lsof). I would also love to have detailed statistics about each of the queues, but that is really for another task.
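
For the record, a sketch of the kind of check described above (the process-name match is an assumption about how the gerrit daemon appears in the process table):

  # established smtp connections held by the gerrit JVM
  sudo lsof -a -p "$(pgrep -d, -f GerritCodeReview)" -i tcp:smtp
  sudo netstat -tnp | grep ':25 '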

Meanwhile, it seems switching to G1GC properly mitigated the slowness issue?

thcipriani claimed this task.

Regarding the SendEmail thread, it took a while to remember about it, but T131189 was about SendEmail having stuck TCP connections eventually blocking the task and thus the thread pool. Alexandros used gdb to close the sockets "manually" :D The task is worth reading.

I did see that task; unfortunately, the symptoms are slightly different in this case:

  1. send-email doesn't show up in gerrit show-queue -w --by-queue (tried using various flags)
  2. lsof for the gerrit process at the time of these problems didn't show any smtp connections

Our Gerrit currently has sendemail.connectTimeout = 30 seconds and sendemail.threadPoolSize = 2. Apparently sending an email holds a lock on the account cache, so I would imagine that whenever an issue occurs when sending email, that in turn blocks all HTTP requests that have to acquire a lock on the account cache. It is really just me imagining things. Potentially, once the email has been prepared, the account cache could be released before actually sending it. If that is the issue at all.

If it ever occurs again, it would be interesting to check/monitor established smtp connections (via netstat / lsof). I would also love to have detailed statistics about each of the queues, but that is really for another task.

I did check with lsof the last time this happened; however, there was no connection to an smtp port. Also, there was no mail in the exim backlog, nothing in the logs for exim that would indicate a problem. Sending a simple email from that server during the problem worked from the commandline almost instantly. Restarting exim seemed to have no effect either.

Upstream suggested we might be running out of file descriptors, which would make sense given those symptoms, but JavaMelody has a file descriptor graph that showed we were doing just fine in terms of open file descriptors during the last incident.

I'm still at a loss as to what could be the root cause.

Meanwhile, it seems switching to G1GC properly mitigated the slowness issue?

Yes, it seems to have. Multiple concurrent GCs don't seem to happen, and our total GC pause looks a lot healthier. I will close this task and open a new one for the SendEmail issue if/when it recurs.

@thcipriani now deserves some well-earned time off after all the madness this task has caused :-]

As a side note, we now have JavaMelody being exposed on grafana: https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody . That might prove to be helpful in the future

I agree with @hashar, @thcipriani did a fantastic job at tuning gerrit!

I have missed the follow-up about creating a mirror for our git repositories. I have filed it as T226240.