Investigate perf regression after elasticsearch 5.3.2 deployment
Closed, Resolved · Public

Description

Elasticsearch 5.3.2 seems to have caused a visible perf regression on query percentiles:

  • fulltext: +20ms
  • morelike: +40ms
  • compsuggest: +2ms

Grafana - Elasticsearch Percentiles - Google Chrome_009.png (1×2 px, 310 KB)

Young GC activity seems to have jumped as well while the amount of heap used seems to have decreased:

Grafana - Elasticsearch Memory - Google Chrome_010.png (535×2 px, 204 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 358383 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elasticsearch: remove UseConcMarkSweepGC

https://gerrit.wikimedia.org/r/358383

Gehel triaged this task as High priority. Jun 12 2017, 2:57 PM

Mentioned in SAL (#wikimedia-operations) [2017-06-13T14:22:28Z] <gehel> restarting elasticsearch on relforge to validate GC configuration - T167636

Mentioned in SAL (#wikimedia-operations) [2017-06-13T15:09:12Z] <gehel> applying new GC configuration on elastic1018 - T167636

elastic1018 is looking good, with significantly lower GC times than the other nodes (see Grafana). The next step is to roll the change out to the whole cluster.
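For anyone who wants to cross-check the Grafana graphs, the same comparison can be sketched from the JSON returned by the Elasticsearch nodes stats API (`GET _nodes/stats/jvm`). This is only a rough sketch: the helper name, the node names and the numbers in `sample` are made up for illustration and are not measurements from this cluster. The counters are cumulative since JVM start, so the average per collection is easier to compare across nodes restarted at different times than the raw totals.

```python
def young_gc_ms_per_collection(nodes_stats):
    """Map node name -> average young-GC time in ms per collection.

    `nodes_stats` is the parsed JSON body of GET _nodes/stats/jvm.
    """
    result = {}
    for node in nodes_stats["nodes"].values():
        young = node["jvm"]["gc"]["collectors"]["young"]
        count = young["collection_count"]
        total_ms = young["collection_time_in_millis"]
        # Guard against a freshly started node with no collections yet.
        result[node["name"]] = total_ms / count if count else 0.0
    return result


# Illustrative input only: hypothetical node names and invented counters.
sample = {
    "nodes": {
        "id-a": {
            "name": "node-a",
            "jvm": {"gc": {"collectors": {"young": {
                "collection_count": 200,
                "collection_time_in_millis": 4000,
            }}}},
        },
        "id-b": {
            "name": "node-b",
            "jvm": {"gc": {"collectors": {"young": {
                "collection_count": 200,
                "collection_time_in_millis": 9000,
            }}}},
        },
    }
}

averages = young_gc_ms_per_collection(sample)
for name, avg_ms in sorted(averages.items()):
    print(f"{name}: {avg_ms:.1f} ms per young collection")
```

A node running the new GC configuration would be expected to show a lower average than its neighbours if the change helps.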

Change 358383 merged by Gehel:
[operations/puppet@production] elasticsearch: remove UseConcMarkSweepGC

https://gerrit.wikimedia.org/r/358383

The change is merged. It only takes effect after a full cluster restart, so this task cannot be closed until that is done. The cluster restart is planned to start on Monday, June 19th.

debt claimed this task.
debt subscribed.

Resolving; the final restart of the clusters will happen on Monday, June 19, 2017.