
Reset to upstream java GC options and remove redundant JVM options
Open, Needs Triage, Public

Description

Proximate reason why we want to make this change

(Context: Original name for this ticket was Elasticsearch (omega cluster) failed with OOME on elastic1096)

Elasticsearch (omega cluster) failed on elastic1096 with an OutOfMemoryError (see logs from journald below). The process was restarted by puppet and the elasticsearch instance is running again. It might be worth checking memory consumption and maybe adapting heap size.

Note: the systemd logs seem to be full of GC logs, which are more noisy than anything else.
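The age/safepoint lines in the journal excerpt below suggest GC logging is being emitted to stdout at trace level, which is why it floods journald. For comparison, upstream ES7 ships a jvm.options line that routes the same GC logging to a rotated file instead of stdout (line recalled from the upstream jvm.options template; verify against the installed file before relying on it):

```
## Upstream ES7 default (JDK 9+): GC logging goes to a rotated file, not stdout
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
```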

AC:

  • Remove the current optimizations and use the upstream defaults
  • Cluster is restarted to reload the configuration
  • We don't hit an OOME within 2 weeks
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  10:      15464 bytes,    3589384 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  11:     499352 bytes,    4088736 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  12:     775672 bytes,    4864408 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  13:     609240 bytes,    5473648 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  14:     131232 bytes,    5604880 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431072.490s][trace][gc,age   ] GC(14233) - age  15:     223488 bytes,    5828368 total
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Leaving safepoint region
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Total time for which application threads were stopped: 0.5278124 seconds, Stopping threads took: 0.0001074 seconds
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: java.lang.OutOfMemoryError: Java heap space
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: Dumping heap to /srv/elasticsearch/production-search-omega-eqiad/java_pid1483043.hprof ...
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Application time: 0.0005076 seconds
Sep 30 12:34:56 elastic1096 elasticsearch[1483043]: [1431073.013s][info ][safepoint] Entering safepoint region: HeapDumper
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: [1431084.930s][info ][safepoint] Leaving safepoint region
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: [1431084.930s][info ][safepoint] Total time for which application threads were stopped: 11.9172981 seconds, Stopping threads took: 0.0000750 seconds
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: Heap dump file created [6878669594 bytes in 11.918 secs]
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: Terminating due to java.lang.OutOfMemoryError: Java heap space
Sep 30 12:35:08 elastic1096 elasticsearch[1483043]: [1431084.939s][info ][safepoint] Application time: 0.0083047 seconds
Sep 30 12:35:08 elastic1096 systemd[1]: elasticsearch_7@production-search-omega-eqiad.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Sep 30 12:35:08 elastic1096 systemd[1]: elasticsearch_7@production-search-omega-eqiad.service: Failed with result 'exit-code'.
Sep 30 12:35:08 elastic1096 systemd[1]: elasticsearch_7@production-search-omega-eqiad.service: Consumed 18h 17min 919ms CPU time.
Changing GC options

Tuning has been done in the past, and we're no longer confident much of it is still relevant. Let's go back to the defaults, gather some data, and see what adjustments are needed from there.

As part of this effort, let's also take a look at the opensearch and/or logstash options and see whether we want to carry our elasticsearch changes forward into those projects.

Event Timeline

Elasticsearch no longer recommends disabling explicit GC; the linked commit message describes a scenario that could explain the OOM listed above. We should therefore drop this setting across our Elastic hosts.
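Concretely, the setting at issue is the HotSpot flag that turns System.gc() into a no-op. One plausible reading of the upstream change is that Netty reclaims direct buffers via explicit GC, so keeping the flag can let memory pile up. A line like the following in the per-instance jvm.options (exact path and layout assumed) is what would be dropped:

```
## Drop this flag: with it set, explicit System.gc() calls become no-ops,
## which can prevent Netty's direct-buffer cleanup from running.
-XX:+DisableExplicitGC
```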

We should probably start by resetting to upstream default configuration and see how it works.

MPhamWMF set the point value for this task to 3.
Gehel removed the point value for this task.

Change 838248 had a related patch set uploaded (by Gehel; author: Bking):

[operations/puppet@production] elastic: change java GC options to default for ES7

https://gerrit.wikimedia.org/r/838248

RKemper renamed this task from Elasticsearch (omega cluster) failed with OOME on elastic1096 to Reset to default java JC options. Thu, Nov 10, 8:59 PM
RKemper updated the task description.
RKemper updated the task description.
RKemper renamed this task from Reset to default java JC options to Reset to default java GC options. Thu, Nov 10, 9:03 PM
RKemper renamed this task from Reset to default java GC options to Reset to upstream java GC options and remove redundant JVM options. Thu, Nov 10, 9:07 PM

Change 838248 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: change java GC options to default for ES7

https://gerrit.wikimedia.org/r/838248

Change 855719 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] opensearch/logstash: make default gc options same as ES 7

https://gerrit.wikimedia.org/r/855719

Change 838248 merged by Bking:

[operations/puppet@production] elastic: change java GC options to default for ES7

https://gerrit.wikimedia.org/r/838248

In addition to the patches listed above, we wrote a new test as well.

Keeping this open, as we still need to restart all clusters to apply the new JVM options.

Mentioned in SAL (#wikimedia-operations) [2022-11-18T15:52:40Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-18T15:52:58Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-18T15:55:35Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-18T15:59:24Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-18T16:18:51Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-18T16:49:19Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-21T14:38:33Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-21T17:59:28Z] <bking@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-21T18:00:19Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-21T19:34:37Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply config changes - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-21T23:02:33Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - bking@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-22T14:43:39Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply config changes - bking@cumin2002 - T319020

Rough way of validating that the extraneous JVM options have been removed:

ps aux | grep '[e]lastic' | grep java | tr ' ' '\n' | grep . | sort > java.out

This captures the JVM options of the 2 running java processes on each host, newline-split and sorted. Before the change, each redundant option is passed twice per process, so it is expected to show up 4 times in total. One example of a redundant option is -Dio.netty.recycler.maxCapacityPerThread=0.

After the change, this and the other redundant options should appear only twice (once per process).
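The count check above can be sketched as a small helper over the captured file (the `count_flag` function name is made up; the file name `java.out` and the example flag come from the comment above, and the sample capture below is fabricated for illustration):

```shell
# Count exact occurrences of one JVM flag in the captured, newline-split
# argument list (java.out from the ps pipeline above).
count_flag() {
  # $1: capture file; $2: flag to count (exact, whole-line match)
  grep -cFx -e "$2" "$1"
}

# Fabricated sample capture standing in for two java processes that each
# pass the flag once (i.e. the expected post-change state).
printf '%s\n' \
  '-Dio.netty.recycler.maxCapacityPerThread=0' \
  '-Xms10g' \
  '-Dio.netty.recycler.maxCapacityPerThread=0' > java.out
count_flag java.out '-Dio.netty.recycler.maxCapacityPerThread=0'  # prints 2
```

A count of 4 before the restart and 2 after would match the expectation described above.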

Mentioned in SAL (#wikimedia-operations) [2022-11-22T17:17:33Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply config changes - bking@cumin2002 - T319020

Will create a new task for

  • we don't hit an OOME within 2 weeks
  • gather some data

See original comment for more details.

Mentioned in SAL (#wikimedia-operations) [2022-11-23T17:42:41Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart; prev restart was done before some hosts had ran puppet - ryankemper@cumin1001 - T319020

Mentioned in SAL (#wikimedia-operations) [2022-11-23T17:44:06Z] <ryankemper> [Elastic] T319020 Kicked off rolling restart of cloudelastic to apply new heap size 8->10G; see ryankemper@cumin1001 tmux session cloudelastic_restarts

Mentioned in SAL (#wikimedia-operations) [2022-11-23T18:12:17Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster restart; prev restart was done before some hosts had ran puppet - ryankemper@cumin1001 - T319020