
Investigate druid historical nodes GC spikes
Closed, ResolvedPublic

Assigned To
Authored By
JAllemandou
Wed, Feb 4, 3:32 PM
Referenced Files
F71914439: image.png
Thu, Feb 12, 2:22 PM
F71913620: image.png
Thu, Feb 12, 2:22 PM
F71698600: image.png
Fri, Feb 6, 7:29 AM
F71691534: image.png
Thu, Feb 5, 2:11 PM

Description

When investigating DDoS mitigation using Turnilo/Druid, SREs have experienced slow-downs. @elukey and @CDanis pointed out that the historical nodes' GC was spiking at that time.
We wish to investigate as follows:

  • Reproduce/validate (find a query that generates GC spikes)
  • Get GC logs while running the query and analyze them.
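For the second step, GC logging can be enabled on the Historical JVMs with flags along these lines. This is only a sketch: the log path is hypothetical, and the right flag set depends on the JDK version the cluster runs, which isn't stated in this task.

```
# JDK 8 style (hypothetical log path; adjust for the actual Druid service)
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/druid/historical-gc.log

# JDK 9+ unified-logging equivalent
-Xlog:gc*:file=/var/log/druid/historical-gc.log:time,uptime,level,tags
```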

Related Objects

Status: Resolved
Assigned: JAllemandou

Event Timeline

I have run a few queries and managed to trigger the GC spike. Interestingly, heap-space usage doesn't change at all during the spikes, so the system is likely allocating a lot of short-lived objects.

image.png (1×2 px, 638 KB)

@JAllemandou just to be on the same page - did you notice a slowdown in Turnilo when the spike happened?

I have experienced a slow-down, yes, and query response time is higher, even if there are also some other artifacts at play with these metrics.

image.png (1×2 px, 626 KB)

@JAllemandou nice, thanks! I just added a series of panels to the Druid Grafana dashboard showing the usage of the various JVM memory pools. I haven't spotted anything so far, but it may be useful. I somewhat expected the Young gen / Eden memory pool to increase in size, but I didn't find that in the metrics. At this point the best info may only come from enabling the GC logs, to understand what's happening.

I have run an investigation, collecting GC logs while running queries that cause high GC spikes, and nothing special comes out of it. It simply appears that the Historical process needs more young-heap space. I have quickly reviewed the current config we have for the cluster and will suggest changes early next week.
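For reference, pause times can be pulled out of GC logs with a quick script; below is a minimal sketch assuming JDK 9+ unified GC logging (`-Xlog:gc*`) with lines shaped like `[12.345s][info][gc] GC(42) Pause Young (Normal) (G1 Evacuation Pause) 3072M->512M(12288M) 45.678ms`. The log format and field names are assumptions, not taken from this task.

```python
import re

# Match the summary line of a GC pause in JDK 9+ unified logging
# (assumed format; adjust the pattern if the cluster logs differ).
PAUSE_RE = re.compile(
    r"GC\((?P<gc_id>\d+)\)\s+Pause\s+(?P<kind>\w+).*?"
    r"(?P<before>\d+)M->(?P<after>\d+)M\((?P<total>\d+)M\)\s+"
    r"(?P<ms>\d+\.\d+)ms"
)

def pause_times(lines):
    """Yield (gc_id, kind, pause_ms, freed_mb) for each GC pause line."""
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            yield (
                int(m.group("gc_id")),
                m.group("kind"),
                float(m.group("ms")),
                int(m.group("before")) - int(m.group("after")),
            )

if __name__ == "__main__":
    sample = [
        "[12.345s][info][gc] GC(42) Pause Young (Normal) "
        "(G1 Evacuation Pause) 3072M->512M(12288M) 45.678ms",
    ]
    for gc_id, kind, ms, freed in pause_times(sample):
        print(f"GC({gc_id}) {kind}: {ms:.1f} ms, freed {freed} MB")
```

Sorting the resulting tuples by pause time makes the spikes easy to line up against the query timestamps.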

Change #1238695 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Update druid config to try to fix GC spikes

https://gerrit.wikimedia.org/r/1238695

Change #1238695 merged by Btullis:

[operations/puppet@production] Update druid config to try to fix GC spikes

https://gerrit.wikimedia.org/r/1238695

Change #1238768 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Update druid analytics config

https://gerrit.wikimedia.org/r/1238768

Change #1238768 merged by Btullis:

[operations/puppet@production] Update druid analytics config

https://gerrit.wikimedia.org/r/1238768

Change #1239027 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Update druid analytics memory configuration

https://gerrit.wikimedia.org/r/1239027

Change #1239027 merged by Btullis:

[operations/puppet@production] Update druid analytics memory configuration

https://gerrit.wikimedia.org/r/1239027

JAllemandou mentioned this in Unknown Object (Task).

With the latest changes to memory configuration, we have significantly lowered the GC time for regex queries.
Take this query: https://w.wiki/HodB, run over 30 days (thank you @CDanis!), as an example.
The original GC times as captured by Chris:

image.png (1×1 px, 318 KB)

While now (query response time is less than a minute):
image.png (1×1 px, 198 KB)
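The merged patches aren't quoted in this thread, but "Historical needs more young-heap space" translates to a jvm.config change of roughly this shape. All values below are illustrative only, not the actual patch contents:

```
# druid historical jvm.config (hypothetical sizes, not the merged change)
-server
-Xms12g
-Xmx12g
-Xmn6g      # enlarge the young generation so short-lived query objects
            # die in minor collections instead of driving GC spikes
```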

Closing this as resolved, please re-open as needed!

JAllemandou closed subtask Unknown Object (Task) as Resolved.Fri, Feb 13, 4:51 PM