
Investigate and improve memory allocation rates of WDQS
Closed, Resolved · Public

Description

While investigating GC times of WDQS (T175919), it became clear that memory allocation rates peak at more than the GC can reasonably cope with. Investigating what is allocating memory in Blazegraph, and why, might allow us to reduce the allocation rate and improve the stability and scalability of the service.

Ideas:

  • configure a memory profiler on labs and / or production (jprofiler ?)
  • instrument code to keep track of memory allocation per query, at least in some specific places
  • identify patterns in queries that are more expensive in terms of memory and limit those
  • ...
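
As a rough illustration of the per-query instrumentation idea, here is a minimal sketch that measures how many bytes the calling thread allocates while a task runs. It assumes a HotSpot JVM, where `com.sun.management.ThreadMXBean.getThreadAllocatedBytes` is available; the class name `AllocationTracker` and its method are hypothetical and not part of Blazegraph.

```java
import java.lang.management.ManagementFactory;

public class AllocationTracker {
    // HotSpot-specific extension of the standard ThreadMXBean (assumption: running on HotSpot).
    private static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    /** Runs the task on the calling thread and returns the bytes that thread allocated meanwhile. */
    public static long allocatedBytes(Runnable task) {
        long id = Thread.currentThread().getId();
        long before = THREADS.getThreadAllocatedBytes(id);
        task.run();
        return THREADS.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        // Stand-in for a query: allocate 64 buffers of 16 KiB each.
        long bytes = allocatedBytes(() -> {
            byte[][] chunks = new byte[64][];
            for (int i = 0; i < chunks.length; i++) {
                chunks[i] = new byte[16 * 1024];
            }
        });
        System.out.println("allocated at least " + bytes + " bytes");
    }
}
```

This only counts allocations made on the calling thread. A query whose work fans out to other threads would need the same measurement on each worker, with the results summed per query.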

Event Timeline

Restricted Application added a subscriber: Aklapper.
Ottomata triaged this task as Medium priority. · Jan 16 2018, 7:36 PM

Investigation on T192759 led to some interesting discoveries.

The Blazegraph Journal uses an unbounded executor service. Under high load (either because of more queries or more expensive queries), this executor creates a large number of threads for a short duration. We found cases where more than 500 threads are created and then destroyed after 1 minute (see graph). This can explain the peaks in allocation rates that we see. It is also a very expensive way to implement a work queue, since it delegates the queuing to the OS instead of keeping a work queue on the Java side.
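
To make the contrast concrete, here is a minimal sketch of a bounded alternative built only on `java.util.concurrent`: a fixed number of worker threads, a bounded queue on the Java side, and back-pressure on the submitter when the queue is full. The class name, the pool size and the queue capacity are illustrative assumptions. This is not the Blazegraph code, nor the patch mentioned below.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedExecutorSketch {
    /** Fixed worker count plus a bounded queue; overflow tasks run on the submitting thread. */
    public static ThreadPoolExecutor newBoundedExecutor(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,                          // core == max: the pool never grows past this
                60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),   // work queue kept on the Java side
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Submits short tasks and returns {tasks completed, largest pool size observed}. */
    public static int[] run(int tasks, int threads, int queueCapacity) {
        ThreadPoolExecutor pool = newBoundedExecutor(threads, queueCapacity);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] { done.get(), pool.getLargestPoolSize() };
    }

    public static void main(String[] args) {
        int[] result = run(500, 8, 32);
        System.out.println("completed=" + result[0] + " largestPoolSize=" + result[1]);
    }
}
```

With an unbounded pool, such as the one returned by `Executors.newCachedThreadPool()`, a burst of 500 slow tasks can start up to 500 threads. Here the thread count stays at or below the configured limit, and the submitter slows down instead.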

I'll push a patch to allow configuring this executor, but this is deep into a code base I am unfamiliar with.

Note that [[ https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/journal/Journal.java#L3968-L3983 | com.bigdata.journal.Journal.readPoolSize ]] might help mitigate the issue.

Change 433724 had a related patch set uploaded (by Gehel; owner: Gehel):
[wikidata/query/blazegraph@master] Journal's executor service should be bounded and configureable.

https://gerrit.wikimedia.org/r/433724

Change 433724 abandoned by Gehel:
Journal's executor service should be bounded and configureable.

Reason:
see https://github.com/blazegraph/database/issues/91

https://gerrit.wikimedia.org/r/433724