gerrit-replica was down for a while and no one noticed
Closed, Resolved (Public)

Description

gerrit-replica was down during the weekend, bringing codesearch down with it (T267507: Codesearch down (for a while)), and no one noticed. It should have some monitoring, paging, etc.

I also suggest moving it to eqiad, as lots of tools in eqiad use gerrit-replica (maybe have two nodes per dc?)

Event Timeline

It apparently ran out of memory, which is T263008. I have started it again as part of unrelated routine maintenance to restart it after a Java upgrade.

We do have a monitoring probe that checks the Gerrit process is present.

Last state change: 2020-06-27 19:00:30

PROCS OK: 1 process with regex args '^/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site'

But apparently, after the process ran out of Java heap space, Java was still running; otherwise the probe would surely have flagged the issue. When I issued a service restart, systemd still knew about the process and sent it a SIGKILL. From the journal:

Nov 07 13:43:35 gerrit2001 java[30197]: java.lang.OutOfMemoryError: Java heap space
Nov 07 13:43:35 gerrit2001 java[30197]: Dumping heap to /srv/gerrit/java_pid30197.hprof ...
Nov 07 13:47:02 gerrit2001 java[30197]: Heap dump file created [35616147146 bytes in 206.962 secs]
Nov 09 07:18:04 gerrit2001 systemd[1]: Stopping Gerrit code review tool...
Nov 09 07:19:34 gerrit2001 systemd[1]: gerrit.service: State 'stop-sigterm' timed out. Killing.
Nov 09 07:19:34 gerrit2001 systemd[1]: gerrit.service: Killing process 30197 (java) with signal SIGKILL.
Nov 09 07:19:34 gerrit2001 systemd[1]: gerrit.service: Main process exited, code=killed, status=9/KILL
Nov 09 07:19:34 gerrit2001 systemd[1]: gerrit.service: Failed with result 'timeout'.
Nov 09 07:19:34 gerrit2001 systemd[1]: Stopped Gerrit code review tool.
Nov 09 07:19:34 gerrit2001 systemd[1]: Started Gerrit code review tool.

So I guess the question is why Java does not exit after the heap dump.
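
The process-presence probe quoted earlier stayed green for the whole outage because the JVM was still alive, just not serving. A minimal sketch of a probe that exercises the service over HTTP instead, assuming Python 3; the URL in the usage note, the timeout, and the function names are hypothetical and are not the actual Wikimedia check:

```python
import urllib.error
import urllib.request

# Nagios-style exit codes.
OK, CRITICAL = 0, 2


def classify(status):
    """Map an HTTP status (or None when no response arrived) to a probe result."""
    if status is None:
        return CRITICAL, "CRITICAL: no response (connection failed or timed out)"
    if 200 <= status < 300:
        return OK, f"OK: HTTP {status}"
    return CRITICAL, f"CRITICAL: HTTP {status}"


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def run_probe(url, timeout=10.0):
    """Return (exit_code, message) for a single probe of url."""
    return classify(fetch_status(url, timeout))
```

A wrapper script could call `run_probe("https://gerrit-replica.example.org/")`, print the message, and pass the code to `sys.exit()`. A JVM that is stuck after an OutOfMemoryError would most likely show up here as a timeout or an error status, which a process check cannot see.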

@hashar as FYI there was a systemd restart before yours, but an OOM is not enough to kill a JVM (you'd need something like -XX:+ExitOnOutOfMemoryError IIRC). We'd also need to improve the Grafana dashboards; maybe it was me without too much coffee, but finding metrics about gerrit2001 was not easy :)

That is the exact same issue we had with ElasticSearch in T76090. We need to pass -XX:+ExitOnOutOfMemoryError.
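
Once the flag is deployed, it is worth confirming that the running JVM actually picked it up. A minimal sketch, assuming Python 3 on Linux, that inspects the NUL-separated arguments in /proc/<pid>/cmdline; the helper names are hypothetical, and it only sees flags given on the command line (not ones supplied through an environment variable such as JAVA_TOOL_OPTIONS):

```python
from pathlib import Path

EXIT_ON_OOM = "-XX:+ExitOnOutOfMemoryError"


def parse_cmdline(raw):
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments."""
    return [arg.decode("utf-8", errors="replace") for arg in raw.split(b"\0") if arg]


def has_flag(raw, flag=EXIT_ON_OOM):
    """Return True if flag appears as its own argument in the raw cmdline bytes."""
    return flag in parse_cmdline(raw)


def process_has_flag(pid, flag=EXIT_ON_OOM):
    """Read /proc/<pid>/cmdline (Linux only) and check for flag."""
    return has_flag(Path(f"/proc/{pid}/cmdline").read_bytes(), flag)
```

For example, `process_has_flag(30197)` would check the Java process seen in the journal above, if it were still running.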

Change 640074 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] gerrit: exit the JVM after OutOfMemoryError

https://gerrit.wikimedia.org/r/640074

Hard lesson learned :] Thank you for confirming!

Change 640074 merged by Elukey:
[operations/puppet@production] gerrit: exit the JVM after OutOfMemoryError

https://gerrit.wikimedia.org/r/640074

Mentioned in SAL (#wikimedia-operations) [2020-11-09T09:52:33Z] <hashar> Restarting Gerrit on gerrit1001 and gerrit2001 in order to have the JVM to exit after OutOfMemory # T267517

Should be good now.