
Investigate and fix GC issues on cloudelastic machines
Open, Medium, Public

Description

Since cloudelastic started receiving production updates, cloudelastic1001 has suffered from abnormal GC behavior, causing slowdowns on this cluster and making it hard for it to keep up with live updates.

Symptoms
  • the number of old GC operations increases to very high values (up to 300 ops/hour)
  • the GC logs report Full GC (Ergonomics) events with very high pause times
2019-08-29T06:52:28.077+0000: 484409.002: [Full GC (Ergonomics) [PSYoungGen: 13380608K->1802631K(14554624K)] [ParOldGen: 31457212K->31456931K(31457280K)] 44837820K->33259562K(46011904K), [Metaspace: 77274K->77274K(90112K)], 6.6373629 secs] [Times: user=92.26 sys=0.50, real=6.64 secs] 
2019-08-29T06:52:34.714+0000: 484415.639: Total time for which application threads were stopped: 6.6400531 seconds, Stopping threads took: 0.0002179 seconds
2019-08-29T06:52:38.751+0000: 484419.676: Total time for which application threads were stopped: 0.0024733 seconds, Stopping threads took: 0.0001990 seconds
2019-08-29T06:52:39.476+0000: 484420.401: [Full GC (Ergonomics) [PSYoungGen: 13380608K->1782371K(14554624K)] [ParOldGen: 31456931K->31457151K(31457280K)] 44837539K->33239522K(46011904K), [Metaspace: 77274K->77274K(90112K)], 7.2366785 secs] [Times: user=100.64 sys=0.77, real=7.24 secs] 
2019-08-29T06:52:46.713+0000: 484427.638: Total time for which application threads were stopped: 7.2393857 seconds, Stopping threads took: 0.0001504 seconds
2019-08-29T06:52:49.902+0000: 484430.827: Total time for which application threads were stopped: 0.0024902 seconds, Stopping threads took: 0.0001788 seconds
2019-08-29T06:52:50.391+0000: 484431.316: [Full GC (Ergonomics) [PSYoungGen: 13380608K->1844540K(14554624K)] [ParOldGen: 31457151K->31456756K(31457280K)] 44837759K->33301296K(46011904K), [Metaspace: 77274K->77274K(90112K)], 7.6572372 secs] [Times: user=106.90 sys=0.73, real=7.65 secs] 
2019-08-29T06:52:58.048+0000: 484438.973: Total time for which application threads were stopped: 7.6619386 seconds, Stopping threads took: 0.0019666 seconds
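For reference, the total stop-the-world time in an excerpt like the one above can be summed with a small script (a sketch; the regex assumes the `Total time for which application threads were stopped` line format shown in these logs):

```python
import re

# Matches the stopped-time lines in the GC log excerpt above.
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)

def total_pause_seconds(log_text):
    """Sum all stop-the-world pauses reported in a GC log excerpt."""
    return sum(float(m.group(1)) for m in STOPPED_RE.finditer(log_text))

log = """\
2019-08-29T06:52:34.714+0000: 484415.639: Total time for which application threads were stopped: 6.6400531 seconds, Stopping threads took: 0.0002179 seconds
2019-08-29T06:52:46.713+0000: 484427.638: Total time for which application threads were stopped: 7.2393857 seconds, Stopping threads took: 0.0001504 seconds
"""
print(round(total_pause_seconds(log), 4))  # 13.8794 seconds of wall clock spent paused
```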
What has been tried so far
  • Set NewRatio to 2 to increase the old gen size and try to prevent the GC from reshaping its memory layout
  • Increase the heap to 45G, as we thought the heap was saturated
  • Reduce refresh_interval and remove custom index merging limits to reduce segment count. Reduced from ~65k to ~58k. No obvious effect on tracked memory buckets.

None of these attempts worked.
We still do not know why the GC behaves like this on these machines.

What we haven't tried:

  • Switch to G1GC
  • Deactivate UseAdaptiveSizePolicy -XX:-UseAdaptiveSizePolicy
  • Reduce replica count from 2 to 1 to reduce cluster data size
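For reference, the two JVM-side options above would translate into jvm.options entries roughly like these (a sketch; not yet tested on this cluster, and the two are alternatives rather than a combination):

```
## Option 1: switch to G1GC
-XX:+UseG1GC

## Option 2: keep the current collector but disable adaptive generation resizing
-XX:-UseAdaptiveSizePolicy
```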
Workaround

Restart the affected JVM when the GC starts misbehaving

Immediate actions:

  • T231516: Alert when a JVM hits more than 100 ops/hour

Event Timeline

dcausse created this task. Aug 29 2019, 7:14 AM
Restricted Application added a subscriber: Aklapper. Aug 29 2019, 7:14 AM
dcausse triaged this task as High priority. Aug 29 2019, 7:14 AM
dcausse updated the task description.
dcausse moved this task from needs triage to Ops / SRE on the Discovery-Search board.
dcausse updated the task description. Aug 29 2019, 8:28 AM
dcausse updated the task description.

Mentioned in SAL (#wikimedia-operations) [2019-08-29T17:15:20Z] <dcausse> restarted elasticsearch on cloudelastic1004 (T231517)

Looking into the graphs, it seems to me that the underlying problem is that the min heap keeps growing over time. When the node gets to 2/3 (~30.8GB) heap used the old GC goes crazy. We can re-apply NewRatio with the 45G heap and the old gen will be able to grow by a few more GB, but unless we can figure out what the final steady state value is for the old gen we can only really keep trying larger values.
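The 2/3 figure follows from how NewRatio splits the heap: with -XX:NewRatio=2 the old gen gets NewRatio/(NewRatio+1) of it. A quick check (sizes in GB; the 46G figure is the committed heap visible in the logs above):

```python
def old_gen_size_gb(heap_gb, new_ratio):
    """Old gen capacity implied by -XX:NewRatio (old:young = new_ratio:1)."""
    return heap_gb * new_ratio / (new_ratio + 1)

# 46G committed heap with NewRatio=2 -> ~30.7G of old gen,
# matching the ~30.8GB mark where the old GC goes crazy.
print(round(old_gen_size_gb(46, 2), 1))  # 30.7
```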

The terms data doesn't seem large enough to account for the heap growth, in particular because terms memory doesn't decrease when a node is restarted, while used heap drops by 10-20GB. We're a bit puzzled about where this memory is going.

Mentioned in SAL (#wikimedia-operations) [2019-08-29T18:23:23Z] <ebernhardson> restart elasticsearch on cloudelastic1001 (T231517)

Mentioned in SAL (#wikimedia-operations) [2019-08-29T18:41:48Z] <ebernhardson> set index.merge.scheduler.max_thread_count to null to accept default values on cloudelastic-chi (T231517)

Mentioned in SAL (#wikimedia-operations) [2019-08-29T18:50:15Z] <ebernhardson> restart elasticsearch on cloudelastic1002 (T231517)

Mentioned in SAL (#wikimedia-operations) [2019-08-29T18:59:00Z] <ebernhardson> restart elasticsearch on cloudelastic1003 (T231517)

Not sure whether it's helping, but I increased refresh_interval on all cloudelastic-chi indices to 5 minutes and removed their index.merge.scheduler.max_thread_count settings (was 1, now takes the default of 3) to see if we could cut back on the number of tiny segments. Segment count dropped from 65k to 57k, about 10%. This might be a minor memory saving, but likely very little, as the tradeoff is an increase in IndexWriter buffers from ~200M/server to ~1GB/server.

EBernhardson updated the task description. Aug 29 2019, 7:31 PM
EBernhardson updated the task description. Sep 3 2019, 4:01 PM

Reducing replica count from 2 to 1 had a dramatic effect on the cluster. Things are generally looking happy now.

Dzahn added a subscriber: Dzahn. Mar 13 2020, 12:28 AM
00:27 <+icinga-wm> ACKNOWLEDGEMENT - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 179 gt 100 daniel_zahn 
                   https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
                   https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
00:27 <+icinga-wm> ACKNOWLEDGEMENT - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 210.5 gt 100 daniel_zahn 
                   https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
                   https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
00:27 <+icinga-wm> ACKNOWLEDGEMENT - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 134.2 gt 100 daniel_zahn 
                   https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
                   https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
00:27 <+icinga-wm> ACKNOWLEDGEMENT - Old JVM GC check - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 107.8 gt 100 daniel_zahn 
                   https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
                   https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad

The current occurrences are due to a reindexing procedure that is running across all wikis/clusters. The reindexing process is implemented internally in Elasticsearch and does respond to backpressure, but it's still pushing hard enough to keep triggering these GC alerts.

Change 583106 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/mediawiki-config@master] [cirrus] force cloudelastic replica count to 1

https://gerrit.wikimedia.org/r/583106

Mentioned in SAL (#wikimedia-operations) [2020-03-24T17:10:38Z] <ebernhardson> update cloudelastic-chi replica counts from 2 to 1 T231517

elukey added a subscriber: elukey. Mar 25 2020, 5:07 PM

After the above changes the GC timings are way better, no more flapping alerts.

elukey lowered the priority of this task from High to Medium. Mar 25 2020, 5:07 PM

Change 583106 merged by jenkins-bot:
[operations/mediawiki-config@master] [cirrus] force cloudelastic replica count to 1

https://gerrit.wikimedia.org/r/583106

Mentioned in SAL (#wikimedia-operations) [2020-03-26T11:21:12Z] <dcausse@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T231517: [cirrus] force cloudelastic replica count to 1 (duration: 01m 06s)

Mentioned in SAL (#wikimedia-operations) [2020-03-26T11:23:07Z] <dcausse@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T231517: [cirrus] force cloudelastic replica count to 1 (duration: 01m 05s)

The issue seems intermittent, and since a big re-index is in progress we cannot really do much. The long-term fix should be to add more nodes to the cluster, waiting for rack/setup/deploy after T233720.

Mentioned in SAL (#wikimedia-operations) [2020-03-31T17:38:00Z] <elukey> restart elasticsearch_6@cloudelastic-chi-eqiad.service on cloudelastic1001 to see if it recovers from a trashing/gc state - T231517

Mentioned in SAL (#wikimedia-operations) [2020-04-06T07:35:51Z] <elukey> restart elasticsearch_6@cloudelastic-chi-eqiad on cloudelastic1003 as attempt to fix heavy GC runs (old gen) - T231517

elukey added a comment. Apr 9 2020, 7:53 AM

Adding a couple of notes after a chat with Gehel this morning:

  • it seems the issue also happens to ES in the ELK stack, with the same pattern. It also happened on some elastic-prod-chi nodes, but resolved by itself.
    • due to https://phabricator.wikimedia.org/T167685 we are not using CMS, so before thinking about G1 it would be nice to re-add CMS and see how it performs (the bug seemed to be present in ES 5.3; we are now at 5.6)

Change 587978 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Java CMS GC for cloudelastic chi cluster

https://gerrit.wikimedia.org/r/587978

production-search-psi on elastic1052 alerted today at 22:20 UTC; could be a similar issue. The psi cluster is one of those with 10G heaps.

Change 587978 merged by Elukey:
[operations/puppet@production] Set Java CMS GC for cloudelastic chi cluster

https://gerrit.wikimedia.org/r/587978

Mentioned in SAL (#wikimedia-operations) [2020-04-14T07:33:36Z] <elukey> apply CMS GC settings to chi on cloudelastic1001 - T231517

Mentioned in SAL (#wikimedia-operations) [2020-04-14T08:49:30Z] <elukey> restart elastic-chi on cloudelastic1001 with -XX:NewSize=10G - T231517

elukey added a comment. Apr 14 2020, 9:17 AM

Today I applied the CMS options that were present in jvm.options (but commented) on cloudelastic1001 chi: -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly

A strange issue appeared: the Eden/Survivor max heap space was limited to something like 1.2G (out of a 45G heap). Metrics for the curious:

https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&from=1586846576778&to=1586855038634

I am not an expert on CMS, but the result looked really weird and also caused a ton of young gen collections. The major drawback of the intense GC activity was a mild increase in CPU usage on the host; I didn't observe anything else horrible. After a bit of research, I came across this great explanation: https://github.com/elastic/elasticsearch/issues/29504#issuecomment-381126640

Tested the following: java -Xms45G -Xmx45G -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version | grep -i "ParallelGCThreads"
The number returned was 15, which multiplied by 64M gives ~960MB, very close to the Eden limit. So, TL;DR, it seems the young gen's size was calculated with the right-hand side of the following:

MIN2(max_heap/(NewRatio+1), ScaleForWordSize(young_gen_per_worker * parallel_gc_threads))

Made another test (currently ongoing) setting -XX:NewSize=10G in jvm.options and restarted ES again. Results look better now (visible in the graphs linked above). Notes:

  • there is more space for the old/tenured gen in the heap, and the GC will trigger when occupancy crosses the 75% mark.
  • the young gen's GC time is higher than before, but my speculation is that part of that time is spent concurrently with the application threads (so little of it comes from threads being stopped for marking/updating/etc.)
  • overall heap usage seems to have decreased by ~5G

Puppet is currently disabled on 1001 and the settings are applied manually; if we decide to go further I'll puppetize the change properly. The alternative is to roll back and wait for Java 11 / G1 :)

Change 588753 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set 10G for the JVM's young gen size to cloudelastic-chi

https://gerrit.wikimedia.org/r/588753

Change 588753 merged by Elukey:
[operations/puppet@production] Set 10G for the JVM's young gen size to cloudelastic-chi

https://gerrit.wikimedia.org/r/588753

Mentioned in SAL (#wikimedia-operations) [2020-04-15T07:35:30Z] <elukey> restart cloudelastic-chi on cloudelastic1002 to apply new jvm settings - T231517

Change 589007 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::elasticsearch::cloudelastic: add more heap space for young gen

https://gerrit.wikimedia.org/r/589007

Mentioned in SAL (#wikimedia-operations) [2020-04-16T15:54:54Z] <elukey> restart chi on cloudelastic1001 with -XX:NewRatio=3 - T231517

Change 589007 abandoned by Elukey:
role::elasticsearch::cloudelastic: add more heap space for young gen

Reason:
Better to use NewRatio as suggested by Gehel!

https://gerrit.wikimedia.org/r/589007

Change 589356 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Use -XX:NewRatio=3 for cloudelastic-chi instead of setting a specific size

https://gerrit.wikimedia.org/r/589356

Change 589356 merged by Elukey:
[operations/puppet@production] Use -XX:NewRatio=3 for cloudelastic-chi instead of setting a specific size

https://gerrit.wikimedia.org/r/589356

Change 589472 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] elasticsearch::instance: avoid UseCMSInitiatingOccupancyOnly

https://gerrit.wikimedia.org/r/589472

Change 589472 merged by Elukey:
[operations/puppet@production] elasticsearch::instance: avoid UseCMSInitiatingOccupancyOnly

https://gerrit.wikimedia.org/r/589472

Mentioned in SAL (#wikimedia-operations) [2020-04-27T07:25:42Z] <elukey> roll restart elastic-chi on cloudelastic100[1-4] to pick up the last JVM GC settings - T231517

Change 592621 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::elasticsearch::cloudelastic: set CMS occupancy fraction to 85

https://gerrit.wikimedia.org/r/592621

Change 592621 merged by Elukey:
[operations/puppet@production] role::elasticsearch::cloudelastic: set CMS occupancy fraction to 85

https://gerrit.wikimedia.org/r/592621

Dzahn removed a subscriber: Dzahn. Apr 27 2020, 1:10 PM

Mentioned in SAL (#wikimedia-operations) [2020-04-27T13:10:49Z] <elukey> roll restart elastic on cloudelastic-chi again to pick up new JVM settings - T231517

Some notes so we don't forget what was tested today. First, the cluster was roll-restarted to bring every node to the same config, namely:

  • no UseCMSInitiatingOccupancyOnly
  • NewRatio=3 (old gen 3 times the size of the young gen)
  • CMSInitiatingOccupancyFraction=75 (was already set in the CMS config of jvm.options)

The above turned out to be a problem: despite no longer using UseCMSInitiatingOccupancyOnly, the JVM was still forcing old gen collections whenever the old gen pool crossed the 75% mark. We then increased the threshold to 85%, which brought old gen runs down to zero, except the few times occupancy crossed 85%, when some runs occurred to free memory. Overall it seems good; let's see what happens when we reindex!
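To see where those thresholds sit in absolute terms (a worked example, assuming the 46G committed heap and NewRatio=3 described above):

```python
def cms_trigger_gb(heap_gb, new_ratio, occupancy_fraction):
    """Old gen occupancy (in GB) at which CMS starts a collection."""
    old_gen_gb = heap_gb * new_ratio / (new_ratio + 1)
    return old_gen_gb * occupancy_fraction / 100

# NewRatio=3 gives a ~34.5G old gen out of a 46G heap.
print(round(cms_trigger_gb(46, 3, 75), 1))  # 25.9 GB: the old 75% trigger
print(round(cms_trigger_gb(46, 3, 85), 1))  # 29.3 GB: the new 85% trigger
```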

Sadly the new settings are not enough to prevent the same thrashing behavior during re-indexing. The CMS collector ends up in very long old gen runs, failing to complete the concurrent mark and sweep phase and logging:

(concurrent mode failure): 35299987K->35388694K(35389440K), 14.6819334 secs] 45916819K->36162273K(46006272K), [Metaspace: 80934K->80934K(94208K)], 14.6821468 secs] [Times: user=18.82 sys=0.02, real=14.68 secs]

This is what we think is happening (after some chats on IRC):

  • on one side, cloudelastic is clearly overloaded; we'll focus on adding the new nodes ASAP in T249062
  • on the GC side, it seems that too much pressure from new object allocation saturates the young gen pool, leading to frequent GC runs that promote objects to the old gen sooner. In turn, the old gen fills up quickly, the CMS / Throughput collectors cannot keep up, and the JVM ends up thrashing.

On the GC side, one test I'd like to run is NewRatio=2, to see whether giving more space to the young gen helps.