Page MenuHomePhabricator

investigate GC times on wikidata query service
Closed, ResolvedPublic

Description

A quick look at GC logs on wdqs1004 show that almost 5% of time is spent in GC. That seems pretty high. During the last week, 2 full GC took more than 20 seconds (not good, but if it is only 2 times per week, we can probably live with it). Over the same time frame, heap after GC almost always go back to less than 8GB, with a minimum of around 2GB, this indicates that we probably have oversized the heap by a factor between 2 and 8.

The first step is probably to reduce the heap to 10GB and go further down from there.

Also note that we are using G1 on wdqs, which I personally understand much less than the more traditional GCs...

Details

Related Gerrit Patches:
operations/puppet : productionwdqs: cleanup JVM options for blazegraph
wikidata/query/rdf : masterAlign GC parameters with puppet.
wikidata/query/rdf : masterMove GC log options from puppet to the standard run script.
operations/puppet : productionwdqs: set G1 new size to 20%
operations/puppet : productionwdqs: cleanup JVM options
operations/puppet : productionwdqs: remove PrintPLAB from GC logging
operations/puppet : productionwdqs: add timestamp to GC logs
wikidata/query/rdf : masterRemove minimal heap size.
operations/puppet : productionwdqs: GC tuning
operations/puppet : productionwdqs: garbage collection tuning
operations/puppet : productionwdqs: reduce blazegraph heap size to 10GB
operations/puppet : productionwdqs: increase heap size to 16GB
operations/puppet : productionwdqs: send GC logs to file
operations/puppet : productionwdqs: fix typo in GC_LOGS
operations/puppet : productionwdqs: GC tuning
operations/puppet : productionwdqs: reduce blazegraph heap size to 12GB
operations/puppet : productionwdqs: reduce heap size to 12GB

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added projects: Wikidata, Discovery. · View Herald TranscriptSep 14 2017, 1:37 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 378227 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: reduce heap size to 12GB

https://gerrit.wikimedia.org/r/378227

Gehel added a comment.Sep 15 2017, 9:41 AM

Looking at wdqs-updater GC logs on wdqs1004, in the last 7 days:

  • heap before GC peaks at ~1.4GB (with a few higher peaks at ~2GB)
  • heap after full GC is ~512MB
  • max heap size is configured at 2GB
  • allocation rate over that period is ~70MB/s (but probably peaks much higher, the updater has a fairly spiky workload over short periods of time)
  • throughput = 99.807% (less than 0.2% of the time is spent in GC)

Overall, this seems pretty good.

We could probably reduce heap to 512MB easily, which would free an additional 1.5GB for disk cache. Probably not a game changer, but always nice.

The heap was bumped from 8G because there were some OOMs with heavy queries (some of them still use a bit of heap even if most of the data uses Blazegraph's own allocator). So let's not be over-zealous in reducing it yet. 12G could still be fine.

Change 378227 merged by Gehel:
[operations/puppet@production] wdqs: reduce heap size to 12GB

https://gerrit.wikimedia.org/r/378227

Mentioned in SAL (#wikimedia-operations) [2017-09-26T12:44:39Z] <gehel> restarting blazegraph on wdqs1004 / wdqs2001 for heap resive - T175919

Mentioned in SAL (#wikimedia-operations) [2017-09-26T15:06:23Z] <gehel> restarting blazegraph on all wdqs nodes for heap resize - T175919

Change 380768 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: reduce blazegraph heap size to 12GB

https://gerrit.wikimedia.org/r/380768

Change 380768 merged by Gehel:
[operations/puppet@production] wdqs: reduce blazegraph heap size to 12GB

https://gerrit.wikimedia.org/r/380768

Gehel added a comment.Sep 26 2017, 6:21 PM

comparing 6h of GC logs from before the reduction in heap size with 2h40m of GC logs after the reduction, here are a few observations:

  • GC overhead increased from 6.74% to 8.12%
  • allocation rate increased from ~900mb/sec to 1.2gb/sec (allocation rate is the rate at which the application creates objects, so not directly related to the change, but explains the GC overhead increase)
  • the allocation rate is sufficiently different than what I noted earlier that I suspect I read it wrong the first time:
  • allocation rate over that period is ~70MB/s (but probably peaks much higher, the updater has a fairly spiky workload over short periods of time)

Allocating ~1gb/sec is definitely a high rate! We probably need to start taking out a profiler and see if we can optimize blazegraph. We might be able to gain some throughput by further reducing the heap size or doing magical GC tuning, but we need to work on reducing the allocation rate as well.

Gehel added a comment.Sep 27 2017, 1:35 PM

Looking at 21 hours of GC logs on wdqs1004, I still see 1.3GB/sec allocation rate (almost 100TB allocated over 21 hours!). GC overhead is still high at ~8%. It still looks like the heap is oversized, so I'll try to reduce more to see if that increases throughput (but I doubt it will improve enough). We had 5 stop the world events longer than 10 seconds (all ~30 seconds) over the last 21h.

Change 380972 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: reduce blazegraph heap size to 10GB

https://gerrit.wikimedia.org/r/380972

Gehel added a comment.Sep 27 2017, 2:02 PM

For comparison, wdqs2001 (which is seeing much les user traffic, but the same amount of writes) is showing a GC overhead of ~3.2%, allocation rate of 200mb/sec, max GC time of 1.7 seconds.

Obvious solution to improve throughput: stop sending user traffic to our servers :)

More seriously, this indicates that the write traffic does not appear to be the main issue (good news!).

I see some advice in the blazgraph docs for performance tuning, but it seems mostly related to IO and even more to write load. It might make sense to increase the "write retention queue", since reducing disk writes might indirectly reduce memory throughput as well (just guessing).

I was expecting to see some read caches that might be increased, but I don't find them in the docs.

Gehel added a comment.Sep 27 2017, 3:15 PM

Doing some quick math:

  • wdqs eqiad cluster has between 20 and 40 requests per seconds according to our varnish stats
  • 3 servers in that cluster
  • let's assume requests are routed equally to all servers, that's 10 req/second/server
  • 1.3 GB/sec allocation rate
  • 130MB allocated per request
  • the calculation above do not take writes into account, to the allocation per request is probably quite a bit lower
  • 130MB / request is high, but not entirely crazy

Note that I also find anecdotal evidence through Google that an allocation rate of 1GB/core is sustainable (this does not match my experience, where I've seen issues much below that).

I need to dig more into G1 documentation to see how to tune generations sizes to optimize for short lived objects (which seems to be our case).

I think we should be careful about cutting down heap size. The peaks is what I am worried about, and Java has a nasty habit of getting stuck once the heap is exhausted. I'd rather it not happen in the middle of the night because somebody run some heavy query...

Smalyshev triaged this task as Medium priority.Sep 27 2017, 5:01 PM
Gehel added a comment.Sep 28 2017, 8:06 AM

Yes, we should be careful, but at the moment the logs indicate that a heap size of 3GB should be OK (looking at heap after GC). That's an oversizing by a factor of 3. Worst case, we have -XX:+ExitOnOutOfMemoryError configured, so we should exit and restart cleanly in most cases (yes, there are a few cases where we might run into crazy GC overhead without throwing OOME, but mostly in case of oversized heap, not undersized).

No I don't think 3G is nearly enough. We had 8G and I had to bump it because some queries killed the server. Also, in my experience, one of the failure modes specifically on low memory is not clean OOME but rather progressive performance degradation where the memory does not technically get to 0 but management starts taking more time and the server is essentially stuck and unable to serve the regular load.

I am not sure why having more memory leads to longer GC btw - if that memory is unallocated, what GC is doing with it? What the time is spent on?

Gehel added a comment.EditedSep 29 2017, 4:14 PM

There are multiple issues which can come from oversized heap. In particular:

  • If heap is available, it will be used, garbage collection will just start later.
  • Graph traversal in not linear (it is O(n) to O(n^2), depending on how cyclic the graphs is), traversing smaller graphs more often can be more efficient than traversing large graphs less often.
  • Long full GC pauses, since full GC is run less often, it has to do more once it kicks in (we do see GC pauses of over 30'').
  • It makes reasoning about object promotion harder than it should be.

Oversized heap is the main perf issue I have seen with Java applications. I'm not saying it is the only one here. But I would like to ensure that we have a heap reasonably sized before trying other tricks...

Note that yes, if we do have specific queries which requires much more memory than other, we need to size for those. I'm not proposing to go down to this 3GB limit, I'm just proposing to leave a bit less head room. If we do have queries consuming 5GB, we might want to have a look at limiting those as well. Or at least, we'll need to find a strategy to keep the service running efficiently...

So I took a look at our logs from Sep 29 with http://gceasy.io/ and I got the following conclusions (could be totally misguided, please tell me if so):

  • The log file covers about 5 to 6 hours. I think this should be enough but maybe if there are daily patterns it could be useful to have a larger data set to look integratively on. We could do this probably by bumping rotated file size to ~100M from 20M.
  • GC consumes 7-8% of CPU time. Have no idea if it's normal for Java service or not.
  • There's four long pause GC events I found in the logs for the last day. One took 40s. From what I see, three of them were in quick succession (around 3:08 am) - which indicates the same failure scenario I alluded to before - instead of failing the allocation, JVM starts obsessively GCing, trying to eke out that last byte. This is the failure mode I am concerned the most about. Not sure what's the alternative here - as I understand, Java has no safe out of memory failure mode? It's either slow death by GC or fast death by OOM?
  • Cross-referencing these events with WDQS dashboards shows that these events blocked updates, causing several minute lag in updates. Which means the server was essentially in coma for several minutes, but then recovered. Good news is it did recover, bad news is it should never happen in the first place.
  • Heap usage after GC is reported just under 6G. I think that means making heap size less than 6G is asking for trouble. There are several peaks reaching 12G.
  • GC times mostly stay under 5sec, but there are many over 1sec, it probably would be a good idea to keep them down.
  • Old Gen size stays mostly between 5-6G, which seems to indicate that is what long-term Blazegraph data structures need (unless I misunderstand what Old Gen means). There are several spikes, which I suspect are due to long-running queries that take so much time that normally transient structures get promoted to old gen, but I may be wrong here. This conclusion also is supported by the fact that the peaks are generally short-lived (which I interpret as query finally finishing or timing out and its memory allocation being released).
  • Young gen size goes from 6-7G before GC to under 1G after GC. That means there's probably some space for reducing the heap size if we want to have more frequent but less costly cleanups.
  • Not sure why our main GC cause is "GCLocker Initiated GC". Maybe because Blazegraph uses Java native process heap for its internal memory manager, instead of Java object heap, so each time it does something there it's a JNI call?

Some ideas:

  • Can we play with -XX:MaxTenuringThreshold and see if it can get any better results? Old gen size shows some volatility, which IMHO may mean some objects which are not supposed to be "old" get into the "old" space. From my understanding of how "old" is supposed to work, Old gen space should be more stable size-wise?
  • Should we adjust -Xmn and/or -XX:NewRatio to maybe give more space to old gen without growing young gen space too much?
  • Should we set -XX:MaxGCPauseMillis? I know it's not a hard limit but maybe hinting the VM that 40s GC is not OK would help?
  • Should we adjust XX:InitiatingHeapOccupancyPercent so that GC would try to collect young gen sooner, thus potentially resulting in shorter GCs?
Gehel added a comment.Oct 3 2017, 5:59 PM

Comments inline. Keep in mind that my understanding of GC is limited, I am most probably wrong in a lot of what I write below. And my understanding of G1 is even more limited...

So I took a look at our logs from Sep 29 with http://gceasy.io/ and I got the following conclusions (could be totally misguided, please tell me if so):

  • The log file covers about 5 to 6 hours. I think this should be enough but maybe if there are daily patterns it could be useful to have a larger data set to look integratively on. We could do this probably by bumping rotated file size to ~100M from 20M.

gceasy allows you to upload multiple files in a .tar.gz archive, so no need to change the config

  • GC consumes 7-8% of CPU time. Have no idea if it's normal for Java service or not.

No, it is not. Anything above 5% is suspicious. For comparison, elastic1023 is at a little bit over 1%.

  • There's four long pause GC events I found in the logs for the last day. One took 40s. From what I see, three of them were in quick succession (around 3:08 am) - which indicates the same failure scenario I alluded to before - instead of failing the allocation, JVM starts obsessively GCing, trying to eke out that last byte. This is the failure mode I am concerned the most about. Not sure what's the alternative here - as I understand, Java has no safe out of memory failure mode? It's either slow death by GC or fast death by OOM?

Actually, there are many failure modes (what fun would it be without them).

When no heap is available for allocation (either not enough space, or too much fragmentation), GC is run. If after a certain amount of effort not enough memory can be freed to do the allocation, then an OOME is raised. That's the easy scenario.

As long as the GC can free up memory for allocation, no OOME occurs. But if the application continues to ask for allocation, GC will start again. Trouble begins when the GC rate is close to the allocation rate, basically, the GC can just keep up, throughput decrease as the GC pressure raises, keeping things mostly in balance.

Consecutive full GC is still another problem. The main reason for full GC (as far as I understand) is when old gen needs to be collected. Full GC implies a longer stop-the-world phase. The cause is usually too early promotion, or a too large (and thus more expensive to collect) old gen. It does not necessarily means that no heap is available, but more that it isn't available in the right place. To minimize this, we should ensure that only long lived objects make it to the old gen and that no premature promotion occurs. Note that too much allocation can result in the same symptoms, with new allocation taking all the new gen space and pushing too much to old gen.

  • Cross-referencing these events with WDQS dashboards shows that these events blocked updates, causing several minute lag in updates. Which means the server was essentially in coma for several minutes, but then recovered. Good news is it did recover, bad news is it should never happen in the first place.

This makes me think that we have an increased allocation rate for some time (a few queries allocating much more, or allocating objects more expensive to collect). Once the workload is back to normal, the JVM recovers. That's only an assumption, and I'm not sure how to invalidate it...

  • Heap usage after GC is reported just under 6G. I think that means making heap size less than 6G is asking for trouble. There are several peaks reaching 12G.

What is important is the minima of heap after GC, which on the logs I sampled seems to be fairly consistently between 2 and 3 GB. The minima are important, because in most cases, GC will only free what it needs, leaving a lot of garbage in place. The minima is when we are the closest of not having any garbage left... By slowly minimizing the overall size of heap, we will see better what the reasonable limit is.

  • GC times mostly stay under 5sec, but there are many over 1sec, it probably would be a good idea to keep them down.

I'm used to GC taking less than 100ms in most cases. Again, elastic1023 as comparison: 99.849% in less than 1 sec, avg GC time: 50ms, longest GC ~3 seconds.

  • Old Gen size stays mostly between 5-6G, which seems to indicate that is what long-term Blazegraph data structures need (unless I misunderstand what Old Gen means). There are several spikes, which I suspect are due to long-running queries that take so much time that normally transient structures get promoted to old gen, but I may be wrong here. This conclusion also is supported by the fact that the peaks are generally short-lived (which I interpret as query finally finishing or timing out and its memory allocation being released).

Again, what is interesting are the minima of old gen after GC, which I see between 2 and 3 GB.

  • Young gen size goes from 6-7G before GC to under 1G after GC. That means there's probably some space for reducing the heap size if we want to have more frequent but less costly cleanups.

Agreed!

  • Not sure why our main GC cause is "GCLocker Initiated GC". Maybe because Blazegraph uses Java native process heap for its internal memory manager, instead of Java object heap, so each time it does something there it's a JNI call?

I don't understand that one either and it looks suspicious to me. elastic1023 has "GCLocker Initiated GC" as a cause for 0.05% of GCs.

Some ideas:

  • Can we play with -XX:MaxTenuringThreshold and see if it can get any better results? Old gen size shows some volatility, which IMHO may mean some objects which are not supposed to be "old" get into the "old" space. From my understanding of how "old" is supposed to work, Old gen space should be more stable size-wise?

Yes, but only after tuning the overall heap size to something reasonable.

  • Should we adjust -Xmn and/or -XX:NewRatio to maybe give more space to old gen without growing young gen space too much?

My intuition tells me that we should probably reduce old gen and increase young to avoid premature promotion. But yes, we can play with it. Again, only after having a reasonable overall heap size.

  • Should we set -XX:MaxGCPauseMillis? I know it's not a hard limit but maybe hinting the VM that 40s GC is not OK would help?

I think it is very unlikely that this will help. Basically, giving a max pause time will decrease throughput to improve latency, we are already not good on throughput.

  • Should we adjust XX:InitiatingHeapOccupancyPercent so that GC would try to collect young gen sooner, thus potentially resulting in shorter GCs?

Maybe worth playing with...

In the end, we also need to understand why GC is working so much harder and need more tuning. There might be good reasons for this, but without having some understanding of the kind of workload the GC is seeing, it is hard to know what to do. Or to understand if the GC issues we are seeing are indeed a cause, or just a consequence.

Gehel added a comment.Oct 3 2017, 6:10 PM

It would be interesting to see if allocation rate goes up when we see the JVM locking up. gceasy does not graph allocation rate over time, but I think that JClarity Censum might do that (closed source, but demo license available).

Gehel added a comment.Oct 4 2017, 9:54 AM

See

for an example of problematic GC log.

Gehel added a comment.Oct 4 2017, 12:50 PM

for reference:

JVM:

gehel@wdqs1004:~$ java -version
openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-8u141-b15-1~bpo8+1-b15)
OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)

JVM options (full command line):

java -server -XX:+UseG1GC -Xms12g -Xmx12g
-Xloggc:/var/log/wdqs/wdqs-blazegraph_jvm_gc.%p.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
-XX:+PrintGCCause -XX:+PrintGCApplicationStoppedTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=20M
-Dlogback.configurationFile=/etc/wdqs/logback-wdqs-blazegraph.xml
-Dcom.bigdata.rdf.sail.webapp.ConfigParams.propertyFile=/etc/wdqs/RWStore.properties
-Dorg.eclipse.jetty.server.Request.maxFormContentSize=200000000
-Dcom.bigdata.rdf.sparql.ast.QueryHints.analytic=true
-Dcom.bigdata.rdf.sparql.ast.QueryHints.analyticMaxMemoryPerQuery=1073741824
-DASTOptimizerClass=org.wikidata.query.rdf.blazegraph.WikibaseOptimizers
-Dorg.wikidata.query.rdf.blazegraph.inline.literal.WKTSerializer.noGlobe=2
-Dcom.bigdata.rdf.sail.webapp.client.RemoteRepository.maxRequestURLLength=7168
-Dcom.bigdata.rdf.sail.sparql.PrefixDeclProcessor.additionalDeclsFile=/srv/deployment/wdqs/wdqs/prefixes.conf
-Dorg.wikidata.query.rdf.blazegraph.mwapi.MWApiServiceFactory.config=/srv/deployment/wdqs/wdqs/mwservices.json
-Dcom.bigdata.rdf.sail.webapp.client.HttpClientConfigurator=org.wikidata.query.rdf.blazegraph.ProxiedHttpConnectionFactory
-Dhttp.userAgent=Wikidata Query Service; https://query.wikidata.org/
-Dhttp.proxyHost=webproxy.eqiad.wmnet -Dhttp.proxyPort=8080
-XX:+ExitOnOutOfMemoryError
-jar jetty-runner-9.2.3.v20140905.jar --host localhost --port 9999 --path /bigdata blazegraph-service-0.2.5-SNAPSHOT.war
Gehel added a comment.Oct 4 2017, 3:22 PM

Using a demo version of Jclarity Censum, I see the following:

Problems

Premature promotion:

There are a number of possible causes for this problem:

  1. Survivor spaces are too small.
  2. Young gen is too small.
  3. The -XX:MaxTenuringThreshold flag may have been set too low.

There a number of possible solutions for this problem:

  1. Alter the size of the young space via the -XX:NewRatio property.
  2. Alter the size of Survivor Spaces (relative to Eden) via the -XX:SurvivorRatio=<N> property using the information provided by the Tenuring graphs. This flag works to divide Young into N + 2 chunks. N chunks will be assigned to Eden and 1 chunk each will be assigned to the To and From spaces respectively.
  3. Alter the -XX:MaxTenuringThreshold property.

Application Throughput:

The recommendation is a bit generic:

How you tune will depend upon your performance requirements and the environment into which the application is deployed in. In general terms, smaller live set sizes combined with smaller memory pools will reduce GC pause times. However, small memory pools will increase the frequency of garbage collections. Additionally, smaller memory pool sizes will result in more frequent collections and place you at risk of running out of memory. High rates of allocation will also result in more frequent collections and place you at risk of running out of memory. Consequently, the general recommendation is to first work to minimize allocation rates and the volume of data that needs to be retained in Java heap. The second step is to tune the garbage collector to reduce GC overhead and minimize pause times.

Minimizing allocation rate looks like a good idea, but not something easy to do :)

High kernel times:

This looks suspicious... and indicates that other applications are competing for resources. Surprisingly it seems that IO contention can be an issue during GC, but it could be something else.

Informative

Heap Too Small Indicators

4 To space exhausted event were seen. 1 is just after application startup and is probably not significant. The other 3 are at roughly the same time, preceding a full GC. It looks like there is a change of pattern at that time, which probably indicate that we have to problems:

  1. high GC overhead all the time
  2. sudden rise in allocation at specific times, probably related to more expensive queries (I might be entirely wrong with this one)

There is an allocation rate graph in the report, but I have no idea how to read it...

Gehel added a comment.Oct 4 2017, 5:23 PM

A thread on Friends of JClarity suggests:

  • adding -XX:+ParallelRefProcEnabled (since we seem to have really high Reference processing times
  • turn off PrintTenuringThreshold, since it is rarely useful with G1.
  • check transparent huge pages are disabled
  • activate PrintAdaptiveSizePolicy and PrintReferenceGC to get more information
  • not reducing the size of the heap (this one is counter intuitive to me...)
Gehel added a comment.Oct 4 2017, 5:31 PM

Transparent huge pages seem to be disabled:

gehel@wdqs1004:~$ cat /sys/kernel/mm/transparent_hugepage/enabled 
always [madvise] never

Patch coming up to test other options.

Change 382195 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: GC tuning

https://gerrit.wikimedia.org/r/382195

The -XX:MaxTenuringThreshold flag may have been set too low.

I'd try to see if this one helps, because BG work pattern would be allocating tons of stuff when query is processed, and then releasing it after it's done. But the query can be processed for a whole minute, so I imagine some of this stuff would get into old gen.

This looks suspicious... and indicates that other applications are competing for resources.

The only other app there is updater, which spends most of it's time waiting for either Blazegraph or Wikidata. Weird.

Change 382195 merged by Gehel:
[operations/puppet@production] wdqs: GC tuning

https://gerrit.wikimedia.org/r/382195

Mentioned in SAL (#wikimedia-operations) [2017-10-05T08:53:18Z] <gehel> rolling restart of wdqs to pick up new GC options - T175919

Change 382397 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: fix typo in GC_LOGS

https://gerrit.wikimedia.org/r/382397

Change 382397 merged by Gehel:
[operations/puppet@production] wdqs: fix typo in GC_LOGS

https://gerrit.wikimedia.org/r/382397

Change 382401 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: send GC lgos to file

https://gerrit.wikimedia.org/r/382401

Change 382401 merged by Gehel:
[operations/puppet@production] wdqs: send GC logs to file

https://gerrit.wikimedia.org/r/382401

Gehel added a comment.Oct 5 2017, 9:42 AM

New GC configuration deployed, all servers restarted. I'll wait a few hours and I'll have a look at GC logs to find out if we see improvements.

For reference, the JVM options before:

-XX:+UseG1GC
-Xms12g
-Xmx12g
-Xloggc:/var/log/wdqs/wdqs-blazegraph_jvm_gc.%p.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCCause
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=20M
-XX:+ExitOnOutOfMemoryError

and after the change:

-XX:+UseG1GC
-Xms12g
-Xmx12g
-Xloggc:/var/log/wdqs/wdqs-blazegraph_jvm_gc.%p.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintReferenceGC
-XX:+PrintGCCause
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=20M
-XX:+ParallelRefProcEnabled
-XX:+UnlockExperimentalVMOptions
-XX:G1MaxNewSizePercent=25
-XX:+ExitOnOutOfMemoryError
Gehel added a comment.Oct 5 2017, 12:53 PM

Looking at Grafana, we can already see a decrease in overall GC time. Looking good!

debt awarded a token.Oct 5 2017, 4:40 PM

Looking at a few days of logs, after the recent configuration changes:

  • overall, things look better, GC overhead is under control, < 5%
  • we still have long GC pauses (between 25 and 35 seconds), somewhat correlated with high allocation rate, but fairly rare (~ 4 / day)
  • allocation rate peaks at 11GB/s, I think it is reasonable to expect GC to have issue at this level of allocation

I think the current configuration is good enough for our use case, we can close this task and start looking into the allocation by blazegraph if we want to improve more.

I wonder if we could correlate peak allocation rates with certain queries (which would probably be long) since we have all queries now in logstash.

Gehel added a comment.Oct 11 2017, 8:06 AM

According to Kirk Pepperdine, we might have run into a G1 bug... I'll try to help the JVM guys debug it, and we might get a long term solution at some point...

Change 384557 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: increase heap size to 16GB

https://gerrit.wikimedia.org/r/384557

Change 384557 merged by Gehel:
[operations/puppet@production] wdqs: increase heap size to 16GB

https://gerrit.wikimedia.org/r/384557

Change 380972 abandoned by Gehel:
wdqs: reduce blazegraph heap size to 10GB

Reason:
Seems that increasing our heap is a better solution than decreasing

https://gerrit.wikimedia.org/r/380972

Change 384663 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: cleanup JVM options

https://gerrit.wikimedia.org/r/384663

Mentioned in SAL (#wikimedia-operations) [2017-10-17T08:45:55Z] <gehel> restarting blazegraph on wdqs1004 for GC tuning (adding -XX:+G1PrintRegionLivenessInfo) - T175919

Mentioned in SAL (#wikimedia-operations) [2017-10-17T11:12:00Z] <gehel> restarting wdqs-updater on wdqs1004 - T175919

Change 385364 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: garbage collection tuning

https://gerrit.wikimedia.org/r/385364

Change 385364 merged by Gehel:
[operations/puppet@production] wdqs: garbage collection tuning

https://gerrit.wikimedia.org/r/385364

Mentioned in SAL (#wikimedia-operations) [2017-10-23T13:19:16Z] <gehel> rolling restart of wdqs for GC tuning - T175919

Change 386132 had a related patch set uploaded (by Gehel; owner: Guillaume Lederrey):
[operations/puppet@production] wdqs: add timestamp to GC logs

https://gerrit.wikimedia.org/r/386132

Change 386170 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: GC tuning

https://gerrit.wikimedia.org/r/386170

Change 386172 had a related patch set uploaded (by Gehel; owner: Gehel):
[wikidata/query/rdf@master] Remove minimal heap size.

https://gerrit.wikimedia.org/r/386172

Change 386170 merged by Gehel:
[operations/puppet@production] wdqs: GC tuning

https://gerrit.wikimedia.org/r/386170

Mentioned in SAL (#wikimedia-operations) [2017-10-24T14:24:30Z] <gehel> restarting blazegraph on wdqs2001 for GC config change - T175919

Mentioned in SAL (#wikimedia-operations) [2017-10-24T15:09:20Z] <gehel> restarting blazegraph on all wdqs nodes for GC config change - T175919

Change 386172 merged by jenkins-bot:
[wikidata/query/rdf@master] Remove minimal heap size.

https://gerrit.wikimedia.org/r/386172

Change 386132 merged by Gehel:
[operations/puppet@production] wdqs: add timestamp to GC logs

https://gerrit.wikimedia.org/r/386132

Mentioned in SAL (#wikimedia-operations) [2017-10-25T09:07:04Z] <gehel> rolling restart of all wdqs nodes for GC config change - T175919

Change 386791 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: remove PrintPLAB from GC logging

https://gerrit.wikimedia.org/r/386791

Change 386791 merged by Gehel:
[operations/puppet@production] wdqs: remove PrintPLAB from GC logging

https://gerrit.wikimedia.org/r/386791

Mentioned in SAL (#wikimedia-operations) [2017-10-30T08:32:23Z] <gehel> rolling restart of wdqs for config reload - T175919

Change 384663 abandoned by Gehel:
wdqs: cleanup JVM options

https://gerrit.wikimedia.org/r/384663

Change 388017 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: set G1 new size to 20%

https://gerrit.wikimedia.org/r/388017

Change 388017 merged by Gehel:
[operations/puppet@production] wdqs: set G1 new size to 20%

https://gerrit.wikimedia.org/r/388017

Mentioned in SAL (#wikimedia-operations) [2017-11-02T10:38:39Z] <gehel> rolling restart of wdqs nodes for GC tuning - T175919

Change 388023 had a related patch set uploaded (by Gehel; owner: Gehel):
[wikidata/query/rdf@master] Move GC log options from puppet to the standard run script.

https://gerrit.wikimedia.org/r/388023

Change 388023 merged by jenkins-bot:
[wikidata/query/rdf@master] Move GC log options from puppet to the standard run script.

https://gerrit.wikimedia.org/r/388023

Gehel added a comment.Dec 4 2017, 1:04 PM

I am mostly happy with the current GC options. It would make sense to move those back from puppet to wdqs code base, so that they can be reused by other deployment of wdqs.

Change 388026 had a related patch set uploaded (by Smalyshev; owner: Gehel):
[operations/puppet@production] wdqs: cleanup JVM options for blazegraph

https://gerrit.wikimedia.org/r/388026

Change 405723 had a related patch set uploaded (by Gehel; owner: Gehel):
[wikidata/query/rdf@master] Align GC parameters with puppet.

https://gerrit.wikimedia.org/r/405723

Change 405723 merged by jenkins-bot:
[wikidata/query/rdf@master] Align GC parameters with puppet.

https://gerrit.wikimedia.org/r/405723

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board.Mar 5 2018, 4:16 PM

Change 388026 merged by Gehel:
[operations/puppet@production] wdqs: cleanup JVM options for blazegraph

https://gerrit.wikimedia.org/r/388026

Mentioned in SAL (#wikimedia-operations) [2018-03-12T15:51:30Z] <gehel> restart blazegraph on wdqs2001 to validate new config - T175919

Gehel closed this task as Resolved.Mar 12 2018, 3:52 PM
Gehel claimed this task.

Last cleanup has been done, this can be closed.