
Gerrit crashed due to out of Heap
Closed, Resolved, Public

Description

Today Gerrit required a restart. According to the error log, the cause was heap exhaustion.

[2019-06-05 22:07:36,449] [Thread-24] ERROR com.google.gerrit.pgm.Daemon : Thread Thread-24 threw exception
java.lang.OutOfMemoryError: Java heap space
        at java.lang.Integer.valueOf(Integer.java:832)
        at sun.nio.ch.EPollPort$EventHandlerTask.poll(EPollPort.java:223)
        at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:268)
        at java.lang.Thread.run(Thread.java:748)
[2019-06-05 22:07:36,455] [HTTP-140393] WARN  org.eclipse.jetty.servlet.ServletHandler : Error for /r/mediawiki/extensions/LiquidThreads/info/refs
java.lang.OutOfMemoryError: Java heap space
[2019-06-05 22:07:36,455] [HTTP-140054] WARN  /r : Internal error during upload-pack from /srv/gerrit/git/operations/puppet.git
java.lang.OutOfMemoryError: Java heap space
[2019-06-05 22:07:36,455] [HTTP-139507] WARN  org.eclipse.jetty.servlet.ServletHandler : Error for /r/mediawiki/extensions/JsonConfig/info/refs
java.lang.OutOfMemoryError: Java heap space
[2019-06-05 22:07:36,454] [accounts NRT] ERROR com.google.gerrit.pgm.Daemon : Thread accounts NRT threw exception
java.lang.OutOfMemoryError: Java heap space
[2019-06-05 22:07:36,454] [HTTP-140373] WARN  org.eclipse.jetty.servlet.ServletHandler : Error for /r/mediawiki/extensions/Echo/info/refs
java.lang.OutOfMemoryError: Java heap space

Event Timeline

Some curious stuff in the monitoring data:

I have gc info from right when this happened: https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTkvMDYvNS8tLWp2bV9nYy5nZXJyaXQubG9nLjcuY3VycmVudC0tMjMtMjgtNTE=&channel=WEB

It looks like there was a sudden spike in heap usage starting at 22:06:03, which triggered a full GC (causing a huge pause) at 22:06:53.

(Attached image: gerrit-gc-pause-zoom.png)

The first OutOfMemoryError happened at 22:07:36:

thcipriani@cobalt:~$ grep -B1 -i 'heap' /var/log/gerrit/error_log | head -n2
[2019-06-05 22:07:36,449] [Thread-24] ERROR com.google.gerrit.pgm.Daemon : Thread Thread-24 threw exception
java.lang.OutOfMemoryError: Java heap space
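The same lookup can be sketched in Python, which also makes it easy to measure how long Gerrit survived after the full GC. This is only an illustration: the function name `find_oom_events` and the embedded `SAMPLE` lines (copied from the log excerpt in the description) are my own choices, and the full GC time is the 22:06:53 value read off the GC report above, not something parsed from the log.

```python
import re
from datetime import datetime

# Matches error_log header lines such as:
# [2019-06-05 22:07:36,449] [Thread-24] ERROR com.google... : message
HEADER = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\] \[([^\]]+)\] (\w+)\s+(.*)$"
)

def find_oom_events(lines):
    """Return (timestamp, thread, level) for each header line that is
    followed by a java.lang.OutOfMemoryError line."""
    events = []
    last_header = None
    for line in lines:
        m = HEADER.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            last_header = (ts, m.group(2), m.group(3))
        elif "java.lang.OutOfMemoryError" in line and last_header is not None:
            events.append(last_header)
            last_header = None  # ignore the stack trace lines that follow
    return events

# Sample lines copied from the error log excerpt in the description.
SAMPLE = """\
[2019-06-05 22:07:36,449] [Thread-24] ERROR com.google.gerrit.pgm.Daemon : Thread Thread-24 threw exception
java.lang.OutOfMemoryError: Java heap space
        at java.lang.Integer.valueOf(Integer.java:832)
[2019-06-05 22:07:36,455] [HTTP-140054] WARN  /r : Internal error during upload-pack from /srv/gerrit/git/operations/puppet.git
java.lang.OutOfMemoryError: Java heap space
[2019-06-05 22:07:36,454] [accounts NRT] ERROR com.google.gerrit.pgm.Daemon : Thread accounts NRT threw exception
java.lang.OutOfMemoryError: Java heap space
""".splitlines()

events = find_oom_events(SAMPLE)
first = min(events, key=lambda e: e[0])  # log lines are not strictly ordered

# Full GC time taken from the GC report (assumed, not parsed from a log).
full_gc = datetime(2019, 6, 5, 22, 6, 53)
seconds_after_full_gc = (first[0] - full_gc).total_seconds()

print(first[0].isoformat(sep=" ", timespec="milliseconds"), first[1])
print(f"first OOM {seconds_after_full_gc:.3f}s after the full GC")
```

Taking the minimum timestamp rather than the first match matters here because the log lines are slightly out of order (a 22:07:36,455 entry appears before a 22:07:36,454 one).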

Gerrit is on a new server now with 64GB RAM but we still need to tune settings to make use of that.

https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=usedMemory

Change 545381 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] gerrit: increase heap_size from 20G to 32G

https://gerrit.wikimedia.org/r/545381

Change 545381 had a related patch set uploaded (by Paladox; owner: Dzahn):
[operations/puppet@production] gerrit: increase heap_size from 20G to 32G

https://gerrit.wikimedia.org/r/545381

Change 545381 merged by Dzahn:
[operations/puppet@production] gerrit: increase heap_size from 20G to 32G

https://gerrit.wikimedia.org/r/545381

Mentioned in SAL (#wikimedia-operations) [2019-10-24T00:03:18Z] <mutante> restarting gerrit to increase heap_size from 20G to 32G (T225166 T222391)

Gerrit has been restarted and is now running with a 32G heap_size:

[gerrit1001:~] $ sudo systemctl status gerrit | grep Xm
           └─16713 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xmx32g -Xms32g
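For anyone who wants to script this check, here is a rough Python sketch that pulls the heap flags out of a JVM command line and confirms that -Xms and -Xmx match. The helper name `heap_flags` is hypothetical, and the command line is simply the one shown in the systemctl output above.

```python
import re

def heap_flags(cmdline):
    """Extract -Xmx / -Xms values from a JVM command line.

    Returns a dict such as {"Xmx": (32, "g")}; the unit is lowercased.
    """
    flags = {}
    for token in cmdline.split():
        m = re.fullmatch(r"-(Xmx|Xms)(\d+)([kKmMgG])", token)
        if m:
            flags[m.group(1)] = (int(m.group(2)), m.group(3).lower())
    return flags

# Command line as reported by systemctl status on gerrit1001.
CMD = "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -XX:+UseG1GC -Xmx32g -Xms32g"

flags = heap_flags(CMD)
fixed_size_heap = flags.get("Xmx") == flags.get("Xms")
print(flags, "fixed-size heap:", fixed_size_heap)
```

With -Xms equal to -Xmx the JVM starts at the full heap size instead of growing into it.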

This is one of those tickets where it's hard to decide when to close it. Is "absence of crash" for X time enough, or do we want more tuning?

The comments on the Gerrit change suggest that yes, we want more tuning. Quotes:

Hashar: "And due to -XX:G1NewSizePercent=15 , the Eden space would grow from 3G to 4,8G which is probably fine :]"

Tyler: "NewGen space should grow proportionally from 3G to 5G (automagically), but we should also grow core.packedGitLimit proportionally. I like the idea of doing that in separate, discrete steps and monitoring impact."
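The arithmetic behind both quotes can be checked with a few lines of Python. This is a sketch under the assumption that -XX:G1NewSizePercent=15 sets the young generation floor to 15% of the heap; the function name `min_young_gen_gib` is made up for illustration. The current core.packedGitLimit value is not stated in this task, so only the scaling factor is computed.

```python
def min_young_gen_gib(heap_gib, new_size_percent=15):
    """Young generation floor in GiB for a given heap size,
    assuming it is new_size_percent of the total heap."""
    return heap_gib * new_size_percent / 100

old_heap, new_heap = 20, 32

for heap in (old_heap, new_heap):
    print(f"{heap}G heap -> {min_young_gen_gib(heap):.1f}G young generation floor")

# Factor by which heap-proportional settings (such as core.packedGitLimit)
# would grow if scaled along with the heap.
scale = new_heap / old_heap
print(f"scale factor: {scale}")
```

So 3G becomes 4.8G, which Tyler rounds to 5G, and a proportional increase of core.packedGitLimit would be a factor of 1.6.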

hashar subscribed.

The heap size has been increased from 20G to 32G (https://gerrit.wikimedia.org/r/545381).

We have a new server which is much more powerful.

We did a lot more tweaking recently (owl bot spam, setting up a replica and shifting workload to it, larger Eden space via -XX:G1NewSizePercent=15, etc.).

We also have a better understanding of Gerrit memory pressure.

As a result, I am declaring this task fixed thanks to all the fixes and enhancements above, made over the last few months. If we encounter an out of memory error again, I guess we will want a fresh new task :-]

The issue was a memory leak I found in Gerrit (T263008). It was addressed in Gerrit 3.2.7, which we deployed in February 2021.