
Reduce impact of GC pauses on elasticsearch response time
Closed, Resolved · Public

Description

The 99th percentile of elasticsearch response time is significantly higher than the 95th percentile. We have a full GC on each node roughly every hour, with pause times between 1 and 1.5 seconds. With 32 servers, this means a full GC somewhere in the cluster every ~2 minutes, so GC pauses could explain the tail response times we see. It makes sense to try to optimize them.
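As a rough back-of-the-envelope check (a sketch only; the ~1.25 s average pause is just the midpoint of the 1-1.5 s range above):

```lang=python
# Rough estimate of how much full-GC pause time the cluster sees.
# Assumptions: 32 nodes, 1 full GC per node per hour, ~1.25 s average pause
# (midpoint of the observed 1-1.5 s range). Purely illustrative numbers.
nodes = 32
full_gcs_per_node_per_hour = 1
avg_pause_s = 1.25

cluster_gcs_per_hour = nodes * full_gcs_per_node_per_hour   # 32
minutes_between_gcs = 60 / cluster_gcs_per_hour             # ~1.9, i.e. "every ~2 minutes"

# Fraction of wall-clock time during which some node is in a full GC pause
# (ignoring overlapping pauses). A query fanning out to many shards has a
# good chance of touching a paused node during that window.
paused_fraction = nodes * avg_pause_s / 3600                # ~0.011, i.e. ~1.1%

print(f"one full GC somewhere in the cluster every {minutes_between_gcs:.1f} min")
print(f"some node is paused ~{paused_fraction:.1%} of the time")
```

That ~1% of the time is in the right ballpark to show up at the 99th percentile without moving the 95th, which is consistent with what we observe.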

Ideas to investigate:

  • moving to G1: strongly discouraged by elastic, might lead to data corruption, see Garbage First
  • reducing heap size: we seem to have less than 10GB of long-lived objects but a configured heap size of 30GB; reducing the heap might give us shorter but more frequent full GCs
  • tuning the new / old ratio: increasing the size of the young generation might help keep objects out of the old generation and reduce the number of full GCs (this is borderline black magic and very much needs to be validated in testing)
  • aggressive timeouts on queries: elasticsearch supports client timeouts and will return partial results if some shards do not answer fast enough. For the completion suggester, where the median response time is 20ms and users are probably not willing to wait more than 100ms, this probably makes sense (see the sketch after this list). Note that this might hide other issues, so we need metrics on partial results to track potential problems.
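A minimal sketch of what such a client-side timeout could look like, assuming plain HTTP access to the cluster; the host, index and field names are placeholders, and the exact behaviour of the timeout for the completion suggester would need to be verified:

```lang=python
import requests

# Placeholder host/index/field; adjust for the real cluster.
ES = "http://localhost:9200"
INDEX = "some_wiki_titlesuggest"

query = {
    "timeout": "100ms",   # give up on shards that take longer than this
    "suggest": {
        "title": {
            "prefix": "barack ob",
            "completion": {"field": "suggest"},
        }
    },
}

resp = requests.post(f"{ES}/{INDEX}/_search", json=query).json()

# Partial results are still returned when the timeout fires; counting these
# is the metric we want, so that the timeout does not silently hide issues.
if resp.get("timed_out"):
    print("partial results, shard summary:", resp["_shards"])
```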

Event Timeline

Restricted Application added a subscriber: Aklapper.
debt triaged this task as Medium priority. · Jan 26 2017, 11:17 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.

Once we're done with GC on WDQS, let's see if this is still needed and/or still affecting latency.

Mentioned in SAL (#wikimedia-operations) [2018-06-04T12:49:09Z] <gehel> restart elastic2001 to enable G1 GC - T156137

Since the bugs seen with G1 were in early versions of Java 8 and we have since upgraded to much newer JVMs everywhere, let's try G1 and see if we gain anything.

Change 437231 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: enable G1 garbage collector

https://gerrit.wikimedia.org/r/437231

moving to G1: strongly discouraged by elastic, might lead to data corruption, see Garbage First

This used to be the case, but has been resolved. From the elastic docs:

Early versions of the HotSpot JVM that shipped with JDK 8 are known to have issues that can lead to index corruption when the G1GC collector is enabled. The versions impacted are those earlier than the version of HotSpot that shipped with JDK 8u40.

We are well past 8u40 so should be in the clear.
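A quick way to double-check which HotSpot update a node is actually running (a minimal sketch assuming a JDK 8 build on the host; the JVM section of the elasticsearch nodes info API gives the same answer):

```lang=python
import re
import subprocess

# `java -version` prints to stderr, e.g.: openjdk version "1.8.0_171"
out = subprocess.run(["java", "-version"], capture_output=True, text=True).stderr

m = re.search(r'version "1\.8\.0_(\d+)"', out)
if m:
    update = int(m.group(1))
    print(f"JDK 8u{update}: {'past 8u40, OK for G1' if update >= 40 else 'older than 8u40'}")
else:
    print("not a JDK 8 build:", out.strip())
```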

Looking at GC logs with G1 on elastic2001: It looks like pool sizes are fairly unstable, with the GC trading Eden for Tenured (and vice versa) at a fairly high rate.

[Attached image: pool sizes after GC.png]

I'm not entirely sure what we should do about it. Even on elastic2001, which does not have user traffic, I still see long GC pauses (slightly shorter and with fewer occurrences than with CMS, so slightly better, but probably not significant either way).
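For reference, the pause figures above come out of the GC logs; a minimal sketch of extracting them, assuming JDK 8 style logs with -XX:+PrintGCApplicationStoppedTime enabled and a hypothetical log path:

```lang=python
import re

# Hypothetical path; the actual GC log location is whatever jvm.options sets.
LOG = "/var/log/elasticsearch/gc.log"

# Lines produced by -XX:+PrintGCApplicationStoppedTime look like:
#   ...: Total time for which application threads were stopped: 1.2345678 seconds, ...
PAUSE = re.compile(r"Total time for which application threads were stopped: ([0-9.]+) seconds")

pauses = []
with open(LOG) as log:
    for line in log:
        m = PAUSE.search(line)
        if m:
            pauses.append(float(m.group(1)))

pauses.sort()
if pauses:
    print(f"{len(pauses)} stop-the-world pauses, max {pauses[-1] * 1000:.0f} ms, "
          f"p99 {pauses[int(len(pauses) * 0.99)] * 1000:.0f} ms")
```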

I also see high kernel times (with both CMS and G1). I'm not sure how to investigate that; do we have another process consuming significant CPU (seems unlikely)?

This looks good enough to enable on a node in eqiad to see how it behaves with more load. Further tuning I would like to try:

  • enable parallel reference processing (-XX:+ParallelRefProcEnabled): reference processing seems to be a significant part of the time spent in GC
  • increase heap size
  • possibly remove -Xms

The ideas above will need to be revisited depending on what we see under load.

Mentioned in SAL (#wikimedia-operations) [2018-06-11T08:27:32Z] <gehel> restart elastic1020 to enable G1 GC - T156137

Change 437231 merged by Gehel:
[operations/puppet@production] elasticsearch: enable G1 garbage collector

https://gerrit.wikimedia.org/r/437231

Looking at 14 hours of GC logs on elastic1020, I can see that the max pause time was ~400 ms, not bad!

[Attached image: pause times G1 elastic1020.png]

Looking at the allocation rate, it looks like that server was reasonably busy for most of that interval, so the numbers above are probably meaningful. The patch to activate G1 everywhere is merged, but it will only take effect at the next cluster restart.

This task is not yet ready to be closed. I'll wait to see what the impact is once we've deployed G1 everywhere, and we might be able to do some more tuning.

It looks like G1 introduced some issues where we see hot threads stuck on map access in thread locals inside of Lucene (a strange problem). Let's at least roll back G1 and see if the problem disappears.

Change 443388 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] Revert "elasticsearch: enable G1 garbage collector"

https://gerrit.wikimedia.org/r/443388

Change 443388 merged by Gehel:
[operations/puppet@production] Revert "elasticsearch: enable G1 garbage collector"

https://gerrit.wikimedia.org/r/443388

Cluster has still not been restarted. I was waiting for another reason to restart it so that we reduce the number of restarts, but if nothing comes up, I'll do the restart anyway.

Mentioned in SAL (#wikimedia-operations) [2018-07-17T18:39:00Z] <ebernhardson> ban elastic1030 from eqiad search cluster for latency issues. Likely related to T156137

This happened again today on elastic1030.

Calling /_cache/clear did not help; I still see the same pattern, with several threads (~20) stuck in ThreadLocalMap.getEntryAfterMiss, while a reasonably loaded node never reports a thread stuck in this method.
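A minimal sketch of checking for this pattern via the hot threads API (the host is a placeholder; any node in the cluster can answer the nodes API):

```lang=python
import requests

ES = "http://localhost:9200"   # placeholder host

# Plain-text dump of the busiest threads on every node.
text = requests.get(f"{ES}/_nodes/hot_threads", params={"threads": 20}).text

stuck = [line for line in text.splitlines() if "getEntryAfterMiss" in line]
print(f"{len(stuck)} hot-thread frames mention ThreadLocalMap.getEntryAfterMiss")
```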

Mentioned in SAL (#wikimedia-operations) [2018-07-18T10:10:56Z] <dcausse> T156137: banning elastic1030 from the cluster (high load)

Mentioned in SAL (#wikimedia-operations) [2018-07-18T12:52:10Z] <dcausse> T156137: restarting elastic1030 to disable G1GC

Mentioned in SAL (#wikimedia-operations) [2018-07-23T20:45:24Z] <ebernhar|lunch> T156137: restarting elastic1027 to disable G1GC

Mentioned in SAL (#wikimedia-operations) [2018-07-23T20:51:21Z] <ebernhar|lunch> T156137: restarting elastic1035 to disable G1GC

Mentioned in SAL (#wikimedia-operations) [2018-07-24T14:47:51Z] <dcausse> T156137: banning elastic1031 due to high load (same "getEntryAfterMiss" symptoms)

Mentioned in SAL (#wikimedia-operations) [2018-07-24T15:59:17Z] <dcausse> T156137: restarting elasticsearch on elastic1031 to disable G1GC

Mentioned in SAL (#wikimedia-operations) [2018-07-25T07:35:12Z] <gehel> rolling restart of elasticsearch / cirrus / codfw to disable G1 - T156137

Mentioned in SAL (#wikimedia-operations) [2018-07-25T08:02:28Z] <gehel> pausing elasticsearch cluster restart on codfw - T156137

Mentioned in SAL (#wikimedia-operations) [2018-07-25T09:20:25Z] <gehel> restarting elasticsearch cluster restart on codfw - T156137

Mentioned in SAL (#wikimedia-operations) [2018-07-25T15:49:00Z] <gehel> elasticsearch cluster restart on codfw completed - T156137

Mentioned in SAL (#wikimedia-operations) [2018-07-26T09:27:48Z] <gehel> rolling restart of elasticsearch / cirrus / eqiad to disable G1 - T156137

Mentioned in SAL (#wikimedia-operations) [2018-07-26T16:08:51Z] <gehel> rolling restart of elasticsearch / cirrus / eqiad completed - T156137

Gehel removed Gehel as the assignee of this task. · Jun 10 2020, 8:19 AM
Gehel claimed this task.

Nothing has happened on this for a really long time; let's close and reopen if we see real issues again.