
Reset to upstream java GC options and remove redundant JVM options
Open, Needs Triage, Public

Description

Proximate reason why we want to make this change

(Context: Original name for this ticket was Elasticsearch (omega cluster) failed with OOME on elastic1096)

Elasticsearch (omega cluster) failed on elastic1096 with an OutOfMemoryError (see logs from journald below). The process was restarted by puppet and the elasticsearch instance is running again. It might be worth checking memory consumption and maybe adapting heap size.

Note: the systemd logs seem to be full of GC logs, which are more noisy than anything else.
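The age/safepoint lines in the journal excerpt below suggest GC logging is being emitted to stdout at trace level, which is why it floods journald. For comparison, upstream ES7 ships a jvm.options line that routes the same GC logging to a rotated file instead of stdout (line recalled from the upstream jvm.options template; verify against the installed file before relying on it):

```
## Upstream ES7 default (JDK 9+): GC logging goes to a rotated file, not stdout
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
```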

AC:

  • Remove the current optimizations and use the upstream defaults
  • Cluster is restarted to reload the configuration
  • We don't hit an OOME within 2 weeks
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  10:      15464 bytes,    3589384 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  11:     499352 bytes,    4088736 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  12:     775672 bytes,    4864408 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  13:     609240 bytes,    5473648 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  14:     131232 bytes,    5604880 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  15:     223488 bytes,    5828368 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Leaving safepoint region
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Total time for which application threads were stopped: 0.5278124 seconds, Stopping threads took: 0.0001074 seconds
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: java.lang.OutOfMemoryError: Java heap space
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: Dumping heap to /srv/elasticsearch/production-search-omega-eqiad/java_pid1483043.hprof ...
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Application time: 0.0005076 seconds
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Entering safepoint region: HeapDumper
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: [1431084.930s][info ][safepoint] Leaving safepoint region
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: [1431084.930s][info ][safepoint] Total time for which application threads were stopped: 11.9172981 seconds, Stopping threads took: 0.0000750 seconds
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: Heap dump file created [6878669594 bytes in 11.918 secs]
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: Terminating due to java.lang.OutOfMemoryError: Java heap space
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: [1431084.939s][info ][safepoint] Application time: 0.0083047 seconds
Sep 30 12:35:08 elastic1096 systemd[1]: elasticsearch_7@production-search-omega-eqiad.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Sep 30 12:35:08 elastic1096 systemd[1]: elasticsearch_7@production-search-omega-eqiad.service: Failed with result 'exit-code'.
Sep 30 12:35:08 elastic1096 systemd[1]: elasticsearch_7@production-search-omega-eqiad.service: Consumed 18h 17min 919ms CPU time.
Changing GC options

Tuning has been done in the past, and we're no longer confident much of it is still relevant. Let's go back to the defaults, gather some data, and see what adjustments are needed from there.

As part of this effort, let's also take a look at the opensearch and/or logstash options and see whether we want to carry our elasticsearch changes forward into those projects.

Event Timeline

Elasticsearch no longer recommends disabling explicit GC; the linked commit message describes a scenario that could explain the OOM listed above. We should therefore drop this setting across our Elastic hosts.
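Concretely, the setting at issue is the HotSpot flag that turns System.gc() into a no-op. One plausible reading of the upstream change is that Netty reclaims direct buffers via explicit GC, so keeping the flag can let memory pile up. A line like the following in the per-instance jvm.options (exact path and layout assumed) is what would be dropped:

```
## Drop this flag: with it set, explicit System.gc() calls become no-ops,
## which can prevent Netty's direct-buffer cleanup from running.
-XX:+DisableExplicitGC
```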

We should probably start by resetting to upstream default configuration and see how it works.

MPhamWMF set the point value for this task to 3.
Gehel removed the point value for this task.

Change 838248 had a related patch set uploaded (by Gehel; author: Bking):

[operations/puppet@production] elastic: change java GC options to default for ES7

https://gerrit.wikimedia.org/r/838248

RKemper renamed this task from Elasticsearch (omega cluster) failed with OOME on elastic1096 to Reset to default java JC options. Thu, Nov 10, 8:59 PM
RKemper updated the task description.
RKemper updated the task description.
RKemper renamed this task from Reset to default java JC options to Reset to default java GC options. Thu, Nov 10, 9:03 PM
RKemper renamed this task from Reset to default java GC options to Reset to upstream java GC options and remove redundant JVM options. Thu, Nov 10, 9:07 PM

Change 838248 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: change java GC options to default for ES7

https://gerrit.wikimedia.org/r/838248

Change 855719 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] opensearch/logstash: make default gc options same as ES 7

https://gerrit.wikimedia.org/r/855719

Change 838248 merged by Bking:

[operations/puppet@production] elastic: change java GC options to default for ES7

https://gerrit.wikimedia.org/r/838248

In addition to the patches listed above, we wrote a new test as well.

Keeping this open, as we still need to restart all clusters to apply the new JVM options.

Mentioned in SAL (#wikimedia-operations) [2022-11-18T15:52:40Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-18T15:52:58Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-18T15:55:35Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-18T15:59:24Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-18T16:18:51Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-18T16:49:19Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-21T14:38:33Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-21T17:59:28Z] <bking@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-21T18:00:19Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-21T19:34:37Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-21T23:02:33Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-22T14:43:39Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply config changes - bking@cumin2002 - T319020

Rough way of validating that the extraneous JVM options have been removed:

ps aux | grep '[e]lastic' | grep java | tr ' ' '\n' | grep . | sort > java.out

This captures the JVM options of the 2 running java processes on each host, newline-split and sorted. Before the change, each redundant option is passed twice per process, so it is expected to show up 4 times in total. One example of a redundant option is -Dio.netty.recycler.maxCapacityPerThread=0.

After the change, this and the other redundant options should appear only twice (once per process).
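The count check above can be sketched as a small helper over the captured file (the `count_flag` function name is made up; the file name `java.out` and the example flag come from the comment above, and the sample capture below is fabricated for illustration):

```shell
# Count exact occurrences of one JVM flag in the captured, newline-split
# argument list (java.out from the ps pipeline above).
count_flag() {
  # $1: capture file; $2: flag to count (exact, whole-line match)
  grep -cFx -e "$2" "$1"
}

# Fabricated sample capture standing in for two java processes that each
# pass the flag once (i.e. the expected post-change state).
printf '%s\n' \
  '-Dio.netty.recycler.maxCapacityPerThread=0' \
  '-Xms10g' \
  '-Dio.netty.recycler.maxCapacityPerThread=0' > java.out
count_flag java.out '-Dio.netty.recycler.maxCapacityPerThread=0'  # prints 2
```

A count of 4 before the restart and 2 after would match the expectation described above.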

Mentioned in SAL (#wikimedia-operations) [2022-11-22T17:17:33Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply config changes - bking@cumin2002 - T319020

Will create a new task for

  • we don't hit an OOME within 2 weeks
  • gather some data

See original comment for more details.

Mentioned in SAL (#wikimedia-operations) [2022-11-23T17:42:41Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart; prev restart was done before some hosts had ran puppet - ryankemper@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-23T17:44:06Z] <ryankemper> [Elastic] T319020 Kicked off rolling restart of cloudelastic to apply new heap size 8->10G; see ryankemper@cumin1001 tmux session cloudelastic_restarts

Mentioned in SAL (#wikimedia-operations) [2022-11-23T18:12:17Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart; prev restart was done before some hosts had ran puppet - ryankemper@cumin1001 - T319020