Proximate reason why we want to make this change
(Context: Original name for this ticket was Elasticsearch (omega cluster) failed with OOME on elastic1096)
Elasticsearch (omega cluster) failed on elastic1096 with an OutOfMemoryError (see the journald logs below). Puppet restarted the process and the Elasticsearch instance is running again. It is worth checking memory consumption and possibly adjusting the heap size.
Note: the systemd logs are dominated by trace-level GC output, which is mostly noise.
AC:
- Current optimizations are removed and upstream defaults are used
- Cluster is restarted to reload configuration
- We don't hit an OOME within 2 weeks
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age ] GC(14233) - age 10: 15464 bytes, 3589384 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age ] GC(14233) - age 11: 499352 bytes, 4088736 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age ] GC(14233) - age 12: 775672 bytes, 4864408 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age ] GC(14233) - age 13: 609240 bytes, 5473648 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age ] GC(14233) - age 14: 131232 bytes, 5604880 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age ] GC(14233) - age 15: 223488 bytes, 5828368 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Leaving safepoint region
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Total time for which application threads were stopped: 0.5278124 seconds, Stopping threads took: 0.0001074 seconds
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: java.lang.OutOfMemoryError: Java heap space
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: Dumping heap to /srv/elasticsearch/production-search-omega-eqiad/java_pid1483043.hprof ...
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Application time: 0.0005076 seconds
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Entering safepoint region: HeapDumper
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: [1431084.930s][info ][safepoint] Leaving safepoint region
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: [1431084.930s][info ][safepoint] Total time for which application threads were stopped: 11.9172981 seconds, Stopping threads took: 0.0000750 seconds
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: Heap dump file created [6878669594 bytes in 11.918 secs]
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: Terminating due to java.lang.OutOfMemoryError: Java heap space
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: [1431084.939s][info ][safepoint] Application time: 0.0083047 seconds
Sep 30 12:35:08 elastic1096 systemd[1]: elasticsearch_7@production-search-omega-eqiad.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Sep 30 12:35:08 elastic1096 systemd[1]: elasticsearch_7@production-search-omega-eqiad.service: Failed with result 'exit-code'.
Sep 30 12:35:08 elastic1096 systemd[1]: elasticsearch_7@production-search-omega-eqiad.service: Consumed 18h 17min 919ms CPU time.
Changing GC options
Tuning has been done in the past, and we are not confident that much of it is still relevant. Let's go back to the defaults, gather some data, and decide from there what adjustments are needed.
As part of this effort, let's also look at the OpenSearch and/or Logstash options and decide whether to carry our Elasticsearch changes forward into those projects.
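For the "gather some data" step, a rough sketch of how the journald output could be summarized before and after the change. It assumes the safepoint and OutOfMemoryError lines keep the format shown in the logs above; the function name summarize and the 1-second threshold are illustrative choices, not anything that exists in our tooling.

```python
import re

# Safepoint summary lines, as they appear in the journald output above.
PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: (\d+\.\d+) seconds"
)


def summarize(text, threshold=1.0):
    """Summarize stop-the-world pauses and OOMEs in a chunk of log text.

    threshold is a hypothetical cut-off (in seconds) for a "long" pause.
    """
    # finditer works whether entries are one per line or run together.
    pauses = [float(m.group(1)) for m in PAUSE_RE.finditer(text)]
    return {
        "pause_count": len(pauses),
        "max_pause_s": max(pauses, default=0.0),
        "long_pauses": [p for p in pauses if p >= threshold],
        "oom_seen": "java.lang.OutOfMemoryError" in text,
    }
```

It could be fed the output of journalctl -u elasticsearch_7@production-search-omega-eqiad.service, for example via summarize(sys.stdin.read()). Note that safepoints are not only GC pauses: the 11.9 s pause in the logs above is the HeapDumper safepoint, so long pauses around an OOME should be read with that in mind.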