
Investigate G1GC pause times
Closed, Resolved (Public)

Description

We are averaging almost 600ms collection time for the young generation, with occasional spikes of over 1s. These collection times might translate to increased request latency or reduced throughput; they should be investigated, and a plan formed to lower them (including, possibly, a move back to CMS).

Additionally, some effort should be made to determine if adequate performance can be obtained using a smaller heap size (current size: 16GB).

See also:

Event Timeline

Eevans raised the priority of this task from to Needs Triage.
Eevans updated the task description.
Eevans added a subscriber: Eevans.

Investigation (WIP)

I've begun researching G1GC to determine how it should work, what to measure, how it is performing in our cluster, and what (if anything) can be done to improve this performance.

G1GC vs CMS

To date, we have measured only accumulated collection time and a count of GC runs, and we visualize collection time as a rate (using a non-negative derivative with a 5-minute moving average). Total collection time is an interesting metric, but it is likely a much better indicator of throughput than of latency impact. However, since that is all we have historical data for, it is all we can use for comparisons with CMS (without re-enabling it for more testing).

It's somewhat difficult to come up with a perfect comparison of collection time. The last weeks and months have seen significant changes in the size of the Cassandra working set (both up and down), and we've had many problems with abnormally high load. However:

CMS (period of April 1-30):

During this period the cluster was somewhat smaller than it was at its recent peak (~750GB compression load), and it was a period of relative stability. Either way, the numbers are also more favorable than both March and May, so given the data we have, this seems like a best-case scenario for CMS.

CMS

http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-gc?from=1427897579848&to=1430408905767&var-node=All&panelId=5&fullscreen

ParNew

http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-gc?from=1427897579848&to=1430408905767&var-node=All&panelId=33&fullscreen

Collection time averaged between 4.67 and 6.10 seconds for the month (with max times over 16 seconds).

G1GC (for period of July 20-27):

Again, this period may not make for an ideal comparison. The cluster is smaller than it was prior to this, but with efforts under way to cull past revisions, there has also been a considerable amount of compaction activity that wouldn't be a part of the steady-state.

http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-gc?from=1437431548650&to=1438036348651&var-node=All&panelId=37&fullscreen

Collection time averaged between 512ms and 562ms (and that includes an exceptional event, at 2015-07-27 04:15 UTC, which took down 3 nodes with OOM errors and generated a considerable amount of GC activity). Also worth noting is that, other than some collections that occurred during the aforementioned event, there has been no full GC activity whatsoever (all collections have been young and mixed).

Other sources of data

Rather than looking strictly at accumulated collection time, I think it makes sense to visualize the collections as events. A scatter plot (parsed from logged GC information) seems to work well for this.
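
For reference, here is a rough sketch of how pause events can be pulled out of the GC logs for plotting; the log path and line format below are assumptions (JDK 8 G1 with date stamps), not the exact tooling used:

  # Extract (timestamp, pause-seconds) pairs from G1 pause lines such as:
  #   2015-07-27T04:15:00.123+0000: 12345.678: [GC pause (G1 Evacuation Pause) (young), 0.2581234 secs]
  grep -h 'GC pause' /var/log/cassandra/gc.log* \
    | sed -n 's/^\([^ ]*\): .*GC pause.*, \([0-9.]*\) secs\].*/\1 \2/p' \
    > gc_pauses.txt    # two columns: ISO timestamp, pause duration in seconds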

Nominal GC activity

normal.png (612×817 px, 35 KB)

Abnormal GC activity

abnormal.png (612×812 px, 42 KB)

GC Logging

The JVM is capable of logging extremely detailed information on each GC event, and the cost of enabling this appears negligible.

Note: GC logging was enabled in https://gerrit.wikimedia.org/r/227355
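
For context, JDK 7/8 GC logging is typically enabled with cassandra-env.sh additions along these lines; this is a generic sketch, not necessarily the exact set in the change above:

  JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"        # path is an example
  JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
  JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
  JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
  JVM_OPTS="$JVM_OPTS -XX:+UseGCLogFileRotation"
  JVM_OPTS="$JVM_OPTS -XX:NumberOfGCLogFiles=10"                # rotation values are examples
  JVM_OPTS="$JVM_OPTS -XX:GCLogFileSize=10M"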

To be continued...

Change 227355 had a related patch set uploaded (by Eevans):
enabled GC logging

https://gerrit.wikimedia.org/r/227355

Patch to lower the GC initiation threshold slightly at https://gerrit.wikimedia.org/r/#/c/227335/.
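
For reference, the knob in question is -XX:InitiatingHeapOccupancyPercent, which defaults to 45 on JDK 8. Illustratively (the value below is an example only, not the one in the patch):

  JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=40"    # example value only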

This issue was opened on the assumption that we were spending too much time in evacuation, and that it was affecting latency. Based on the stated reasoning for this, I'd say it's likely to increase the overall collection time (by forcing them to occur more frequently).

Is this a reaction to the OOMs that occurred earlier this morning? This seems like an aberrant event, one that might not be survivable at all (there is bound to be a point where allocations can exceed collection capacity at any setting). The previous week's worth of data suggests to me that the norm is extremely consistent collection times, all of which are at or below the soft real-time target, accomplished entirely with young and mixed collections (no full GCs). Optimizing the collector for any such future events might be futile anyway; I'd rather we spent the time identifying the underlying cause, and addressing that.

If this change is a try-and-see, an experiment, then how do we gauge success? If it increases collection time, how much is too much? If we go a week without an OOM, does that mean that it fixed the problem, or that we haven't had another such event? Basically, what are we looking for, and how do we objectively define success?

Eevans triaged this task as Medium priority.
Eevans added a subscriber: fgiunchedi.

@Eevans, it might also be worth looking at the heap dumps on 1003, 1004 and 1005 to see which items were involved in the OOM.

@Eevans, it might also be worth looking at the heap dumps on 1003, 1004 and 1005 to see which items were involved in the OOM.

Great idea; definitely worth a shot.

This issue was opened on the assumption that we were spending too much time in evacuation, and that it was affecting latency. Based on the stated reasoning for this, I'd say it's likely to increase the overall collection time (by forcing them to occur more frequently).

For timeouts, we are concerned about *max* collection time. Short collections occurring slightly more frequently are not much of an issue in that context, unless it gets to the point where collection activity is slowing down progress enough for timeouts to occur.

We are seeing occasional GC spikes with G1GC (in the 6s accumulated GC time region), in my experience typically accompanied by bursts of GC events in the logs. The heap usage graph shows frequent heap usage spikes above 80%, sometimes 90% of the configured heap size.

Success, to my mind, would be a reduction in bursty GC events. I would define those as long individual collections (> target), or many short consecutive collections that slow progress to the point of causing timeouts. In Grafana, fewer bursts over 2s per 24 hours would be nice. In Logstash, fewer timeout errors per 24 hours would be nice.
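
As a sketch of how the first kind of burst (long individual collections) could be counted from the GC logs, assuming the usual log location and JDK 8 G1 log format:

  # Count individual pauses longer than 2 seconds over the window covered by the logs
  grep -h 'GC pause' /var/log/cassandra/gc.log* \
    | sed -n 's/.*, \([0-9.]*\) secs\].*/\1/p' \
    | awk '$1 > 2 {n++} END {print n+0, "pauses over 2s"}'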

Change 227355 merged by Filippo Giunchedi:
enabled GC logging

https://gerrit.wikimedia.org/r/227355

@Eevans, it might also be worth looking at the heap dumps on 1003, 1004 and 1005 to see which items were involved in the OOM.

Great idea; definitely worth a shot.

I took a look at the heap dump from restbase1003, here is what I found:

82% of the heap is occupied by HeapByteBuffers that are the value attribute of BufferCell and BufferExpiringCell objects (aka "column values").

Nearly all the column values that I tracked down references for are associated with so-called "range slices" (slices of columns across ranges of rows), which pretty clearly links this to the revision culling script (it does a select "_domain", key, rev, tid, token("_domain", key) as "_token" from data where token("_domain", key) [condition]).

Additionally, ~90% of these (by size) are made up of values in excess of 5,000,000 bytes. Of these, the ones that I tracked down references for were all data-parsoid JSON blobs in the wikipedia domain.

So it would appear that the culling script encountered some very wide row(s) (wide in bytes, if nothing else) prior to these OOM events. This is consistent with the GC logs, which showed enormous allocation after enormous allocation (eden) prior to entering the death spiral.
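
For completeness: the dumps analyzed here were presumably the ones written at OOM time (e.g. via -XX:+HeapDumpOnOutOfMemoryError); an on-demand dump for comparison could be captured with standard JDK tooling, along these lines (paths and process lookup are assumptions):

  # Dump only live (reachable) objects from a running Cassandra process
  jmap -dump:live,format=b,file=/srv/cassandra-heap.hprof "$(pgrep -f CassandraDaemon)"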

See also: https://phabricator.wikimedia.org/T94121 (Investigate huge max partition size in cfhistograms output)

@Eevans, thanks for looking into that! Your findings corroborate the earlier theory that the burst of deletions was caused by the thin-out script finally succeeding in processing one of the extra-wide rows.

GC collections from restbase1005 over the past ~24 hours.

rb1005_2015-07-31.png (114 KB)

Note: JDK downgrade took place at 13:39 on 07/30.

For comparison, here are plots of 1003, and 1004:

1003, which somehow seems to have "righted" itself without intervention:

rb1003_2015-07-31.png (81 KB)

1004, by accident, was not restarted, and so had not seen the JDK downgrade applied:

rb1004_2015-07-31.png (39 KB)

Here is 1004 again, without the outliers:

rb1004_2015-07-31_without_outliers.png (149 KB)

To summarize a discussion w/ @fgiunchedi on IRC, the following is the plan for addressing the remainder of this issue:

Now that we've brushed up on the science of G1, and have established what to measure and how, we propose to do the following:

Once the cleanups have completed, the nodes have been upgraded to 2.1.8, and any related compaction activity has ceased, work can commence. Waiting for the cleanup to conclude will rule out any influence that process might have, and provide a firm baseline for the working set size. Waiting until after the upgrade to 2.1.8 will provide additional logging of wide rows encountered during compaction (which might prove useful in correlating with large allocations).

A single node will be manually configured to add -XX:+PrintAdaptiveSizePolicy to the GC logging options. This is likely too verbose for day-to-day use, but should provide insight into why certain regions are included in a CSet, or what triggered a marking cycle. Humongous allocations are also something that should be thoroughly investigated (any object 4MB or over would be considered humongous in our environment, and we've seen evidence such objects exist). If necessary, an investigation of humongous objects might also require that we add -XX:+UnlockExperimentalVMOptions -XX:+G1ReclaimDeadHumongousObjectsAtYoungGC -XX:G1LogLevel=finest -XX:+G1TraceReclaimDeadHumongousObjectsAtYoungGC to the list of temporary GC logging options, as sketched below.
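
Concretely, as temporary cassandra-env.sh overrides on the test node, that would look something like this (flag names as given above; a per-node sketch, not a reviewed change):

  JVM_OPTS="$JVM_OPTS -XX:+PrintAdaptiveSizePolicy"
  # Only if humongous allocations need a closer look:
  JVM_OPTS="$JVM_OPTS -XX:+UnlockExperimentalVMOptions"
  JVM_OPTS="$JVM_OPTS -XX:+G1ReclaimDeadHumongousObjectsAtYoungGC"
  JVM_OPTS="$JVM_OPTS -XX:G1LogLevel=finest"
  JVM_OPTS="$JVM_OPTS -XX:+G1TraceReclaimDeadHumongousObjectsAtYoungGC"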

An analysis of GC activity over some period of time will then be made, and targeted config changes iteratively applied toward the goals of a) reducing the heap size from 16G, b) reducing accumulated collection time, and c) reducing 99p collection times.

Since GC tuning is something of a black art, I expect that some of this will be nothing more than trial and error; some of these experiments will fail, either by generating no immediate observable effect, or worse, by creating a regression. Negative results are results though, so even after reverting such changes, they may help us better understand the problem space. It goes without saying that all of this must be done while minimizing impact to users.

We will of course follow up with Gerrit change(s) to permanently apply any improvements, on an as-needed basis.

Comments, suggestions, etc, welcome.

Regarding GC tuning, one suggestion I'd like to make is to take a day or two to establish a baseline with CMS and ~6G heap. We could then see if CMS with a smaller heap already satisfies our max pause time and timeout requirements, especially with the smaller instance sizes we are using now.

Regarding GC tuning, one suggestion I'd like to make is to take a day or two to establish a baseline with CMS and ~6G heap. We could then see if CMS with a smaller heap already satisfies our max pause time and timeout requirements, especially with the smaller instance sizes we are using now.

I'd really like to do that, yes. However, all signs seem to point at G1GC being The Future with respect to workloads like these (upstream Cassandra moving to it as the default is but one data point). Any reasonable decision should include a thorough evaluation of it, and considering that we are already running G1 and have a decent baseline from which to begin, I'd like to start here.

Regarding GC tuning, one suggestion I'd like to make is to take a day or two to establish a baseline with CMS and ~6G heap. We could then see if CMS with a smaller heap already satisfies our max pause time and timeout requirements, especially with the smaller instance sizes we are using now.

I'd really like to do that, yes. However, all signs seem to point at G1GC being The Future with respect to workloads like these (upstream Cassandra moving to it as the default is but one data point). Any reasonable decision should include a thorough evaluation of it, and considering that we are already running G1 and have a decent baseline from which to begin, I'd like to start here.

+1. A concrete (use-case) comparison with CMS would be desirable, but given the point we are currently at, it makes more sense to treat G1GC @ 16GB as the baseline, keeping in mind that we'd ideally like to lower the heap size.

Personally, I think G1GC goes more in the direction we want than CMS. That is not to say we should discard it.

While I agree that G1GC is promising and might even outperform CMS for small heaps with the right tuning, my concern is more about cost / benefit in the short term. Just switching to CMS and letting it run for two days is fairly cheap, perhaps an hour of work to tweak the configs. If the outcome of that experiment is that CMS works as-is, then we have the option of deferring the full G1GC evaluation until later, confidently order hardware with 16G RAM per instance, and tackle urgent tasks like multi-instance preparation and multi-DC replication preps. We will also have data to compare G1GC to, so that we can actually say something meaningful about the relative performance for our specific workload.

@Eevans, how long do you think it will take to establish whether we can use G1GC with 16G total RAM per instance, and whether we are better off using it for small instances? Do you think we can just try G1GC with a (say) 7G heap?

Regarding GC tuning, one suggestion I'd like to make is to take a day or two to establish a baseline with CMS and ~6G heap. We could then see if CMS with a smaller heap already satisfies our max pause time and timeout requirements, especially with the smaller instance sizes we are using now.

+1 to more data on G1 vs CMS. What kind of max pause times and timeout requirements are we aiming at, btw?

While I agree that G1GC is promising and might even outperform CMS for small heaps with the right tuning, my concern is more about cost / benefit in the short term. Just switching to CMS and letting it run for two days is fairly cheap, perhaps an hour of work to tweak the configs. If the outcome of that experiment is that CMS works as-is, then we have the option of deferring the full G1GC evaluation until later, confidently order hardware with 16G RAM per instance, and tackle urgent tasks like multi-instance preparation and multi-DC replication preps.

This is a really important area for us. So many past discussions of stability and performance have centered around GC that I feel confident it warrants spending the time on now rather than deferring yet again. That doesn't mean that future discussions of stability and performance centered around GC won't happen (I am sure they will), but hopefully those issues will be less problematic and less urgent, and we'll know more and speculate less.

@Eevans, how long do you think it will take to establish whether we can use G1GC with 16G total RAM per instance, and whether we are better off using it for small instances? Do you think we can just try G1GC with a (say) 7G heap?

The plan is to incrementally lower it, while analyzing compaction activity and adjusting accordingly. You want some time in between iterations obviously, so I would say "a few days" (wall time, not man-hours). If we can establish that 2.1.8 is ready to be moved to production (it's on the testing cluster now), then work could begin as early as next week. @fgiunchedi, what are your thoughts on that timeline?

@Eevans, all the data and information I have seen shows that sufficiently small instances are *not* suffering from the GC scaling issues we saw in the past with large instances and heaps. My hope is that with multiple and small instances, we can avoid investing too much time in trying to push the GC envelope.

Before making a decision on memory sizes needed, I would like to sanity-check our old setup at 8G heap size with CMS. Since this is a known configuration with low risk, I would actually propose to just let that run over the weekend. If 8G/CMS request timeout levels are already okay-ish (no worse than current timeout levels with G1GC), then we have all the data we need for our hardware decision by Monday. We can then play with heap / instance sizes or G1GC later to further improve timeouts.

We decided not to run CMS over the weekend, so I did some quick experiments on 1001 and 1003 today to get an idea for memory needs.

1001:

  • CMS
  • 6G heap, 1G new gen

1003:

  • G1GC
  • 6, 5, finally 4G heap
  • JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=20"
  • JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=150"
  • commented out key_cache_size_in_mb in cassandra.yaml

Metrics (experiment started ~18 UTC):

(five metrics screenshots attached)

Observations:

  • G1GC seems to perform quite well even at small heap sizes. There doesn't seem to be much of a performance reason to use CMS any more.
  • Lowering InitiatingHeapOccupancyPercent smoothed GC times.
  • Lowering MaxGCPauseMillis smoothed GC times as well, with relatively modest impact on CPU usage and a slight positive impact on p99 read / write latency.
  • At moderate load during the experiment, there was little latency or GC difference between 6 and 5G heap. However, GC times and latencies went noticeably up at 4G.
  • RSS hovered around 12G with 5G heap, with most of this in off-heap allocations. Increasing indexing intervals slightly could reduce off-heap usage a bit.
  • Total RAM used, as reported by free, was 14466976 kB on 1001 and 12193832 kB on 1003. This includes memory used by RESTBase.
  • CPU usage hovered around 300-400%, with occasional bursts to 600%.

Overall, it seems that running four moderately-sized and -loaded instances in 64G of RAM is feasible, but tight. 80 or 96G would give us a bit more headroom for a relatively modest extra cost.
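
Rough arithmetic behind "feasible, but tight" (an approximation based on the figures above): at ~12G RSS per instance with a 5G heap, four Cassandra instances come to roughly 48G before counting RESTBase, the page cache, and everything else, which leaves little slack in 64G; 80 or 96G restores a more comfortable margin.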

If there are no objections, I'll begin working on this tomorrow, using the process outlined here and restbase1001 as the guinea pig (in preparation, I have added the -XX:+PrintAdaptiveSizePolicy logging option to restbase1001).

@Eevans, it would be great to get this going. How long do you think this investigation will take overall?

Also, as a small reminder, https://gerrit.wikimedia.org/r/#/c/227335/ is still open. The results I saw in my brief test with very small heaps were encouraging, and seemed to corroborate others' experiences. I'll leave it up to you to decide whether you want to test this on large heaps. I would appreciate a review, though.

@Eevans, it would be great to get this going. How long do you think this investigation will take overall?

I'm not sure; several days at least, I would think (in span, not effort). There are a number of things to try, and it would help to collect/log for some time between iterations. I'll keep the issue updated as I go, though.

Also, as a small reminder, https://gerrit.wikimedia.org/r/#/c/227335/ is still open. The results I saw in my brief test with very small heaps were encouraging, and seemed to corroborate others' experiences. I'll leave it up to you to decide whether you want to test this on large heaps. I would appreciate a review, though.

I'll gather data on initiating occupancy, too.

As an update on my progress:

Yesterday I set the heap to 12GB on restbase1001, and today I lowered it again to 8GB, in order to measure the impact (in isolation) of smaller heap sizes. Before the end of the day, I will re-enable Puppet and restore the baseline settings.

I've begun to compile findings here: https://people.wikimedia.org/~eevans/T106619/; This is a work-in-progress, so I wouldn't encourage drawing too many conclusions just yet.

Grafana doesn't show much of a change in latencies or timeouts:

(three Grafana screenshots attached)

Over the course of the last couple of weeks I conducted a number of experiments, the individual results of which can be found here. I used restbase1001 as a guinea pig, sample periods lasting approximately 24 hours, with results compared against a baseline created from our current GC settings. The objective throughout was to determine if performance parity (or an improvement) could be achieved at a lower heap size.

Takeaways:

First off: conventional wisdom is that G1GC is tuning-free, and over-tuning is an oft-cited source of performance issues. I've found this to be mostly true: the defaults at any given heap size generally seem quite close to optimal.

Our biggest source of GC issues under G1 is so-called Humongous objects. Any object that is 50% of the region size or larger is considered Humongous, and is allocated directly into the old generation. Humongous objects fully occupy one or more regions, and are only evacuated during a cleanup or a full GC. This can result in very inefficient use of memory, fragmentation, and ultimately, allocation failures. Bottom line: Humongous objects are handled exceptionally by G1GC, and performance suffers when their allocations are anything other than exceptional.
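
One quick way to gauge how frequent these are (a sketch; assumes -XX:+PrintAdaptiveSizePolicy is enabled and the usual log location, and the exact G1Ergonomics wording can vary by JDK build):

  # Humongous allocations show up in the G1Ergonomics output,
  # e.g. "source: concurrent humongous allocation"
  grep -c 'humongous allocation' /var/log/cassandra/gc.log*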

Larger regions can help by causing more of these large objects to be allocated normally, but larger regions have similar effects on efficiency for the smaller objects (i.e. by requiring more data to be copied for each byte that is evacuated). If this inefficiency results in too many to-space failures, then increasing the heap size seems the only recourse.

With a region size of 4MB (the default for 8GB heaps), the number of Humongous allocations is significant, and its effects are obvious. Increasing the region size to 8MB helps, but the number of to-space failures, and the long pauses they create (1-4s), are still a problem. Decreasing the occupancy threshold (to 35%) has no effect here, and an analysis of the logs suggests that this is because 35% is below the live set.

Therefore it is my conclusion that 8GB is too small for optimal performance under this workload; my current efforts are focused on 12GB heaps.

restbase1001 is currently configured for a 12GB heap with 8MB region sizes. Tomorrow, I will introduce a lower initiating occupancy threshold, and by Thursday I expect to have a final recommendation.
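
For clarity, the trial configuration described above amounts to something like the following (written here as plain JVM flags; in practice the heap size is typically set via MAX_HEAP_SIZE in cassandra-env.sh):

  JVM_OPTS="$JVM_OPTS -Xms12g -Xmx12g"
  JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=8m"
  # Next iteration, per the above: a lower initiating occupancy threshold, e.g.
  # JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=35"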

Therefore it is my conclusion that 8GB is too small for optimal performance under this workload

The data I got over an afternoon in T106619#1519002 suggested that peak pause times, read and write latency didn't change significantly when using G1GC with heaps as small as 6G and lowered pause time targets & initiating thresholds. Aggregate GC times remained below 1s, and latencies and timeouts didn't change significantly.

It looks like you didn't test a similar configuration. Do you think it would be worth testing a similar config (8G heap, perhaps 30% initiating threshold, MaxGCPauseMillis=150) for more than a couple of hours?

It looks like you didn't test a similar configuration. Do you think it would be worth testing a similar config (8G heap, perhaps 30% initiating threshold, MaxGCPauseMillis=150) for more than a couple of hours?

It couldn't hurt; I'll see to it.

Yesterday's run was with a 12G heap, 8M region size, and an initiating occupancy threshold of 35%. Definitely the best results I've seen so far. Zero to-space failures, zero full GCs, 258ms average collection, only one pause that exceeded the soft real-time target (and by less than 20ms), and easily the most even distribution of collections to date.

collections.2015-09-02_to_2015-09-03.png (27 KB)

Therefore it is my conclusion that 8GB is too small for optimal performance under this workload

The data I got over an afternoon in T106619#1519002 suggested that peak pause times, read and write latency didn't change significantly when using G1GC with heaps as small as 6G and lowered pause time targets & initiating thresholds. Aggregate GC times remained below 1s, and latencies and timeouts didn't change significantly.

It looks like you didn't test a similar configuration. Do you think it would be worth testing a similar config (8G heap, perhaps 30% initiating threshold, MaxGCPauseMillis=150) for more than a couple of hours?

restbase1001 is currently setup for this, I'll update with results tomorrow.

I ran restbase1001 for ~24 hours (from 2015-09-03T21:18:37+0000 to 2015-09-04T21:55:51+0000) with -Xmx8g -Xms8g -XX:MaxGCPauseMillis=150 -XX:InitiatingHeapOccupancyPercent=30. During that time there were 6 to-space failures, and the average collection time was nearly 1.5 times that of the next highest node.
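
For reference, evacuation failures like these are flagged directly in the G1 pause lines, so a rough count (assuming the usual log location) is:

  grep -c 'to-space exhausted' /var/log/cassandra/gc.log*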

collections.2015-09-03_to_2015-09-04.png (21 KB)

Screenshot from 2015-09-05 20-16-54.png (297 KB)

Change 236391 had a related patch set uploaded (by Eevans):
updated gc settings

https://gerrit.wikimedia.org/r/236391

@Eevans, what was the impact on request latencies and timeouts? Were there more GC pauses over 500ms / 1s compared to other nodes?

@Eevans, what was the impact on request latencies and timeouts?

As with most of the other tests, there wasn't enough of a difference in latencies that I'd hazard correlating them.

Were there more GC pauses over 500ms / 1s compared to other nodes?

Yes.

  • 1002: 3.8425163 secs, 3.7546157 secs
  • 1003: 5.3459563 secs
  • 1004: 1.3661091 secs
  • 1005: 3.5257712 secs, 4.9188115 secs
  • 1006: None
  • 1007: 0.6086962 secs
  • 1008: 1.4050795 secs
  • 1009: None

Change 236391 merged by Filippo Giunchedi:
cassandra: updated gc settings

https://gerrit.wikimedia.org/r/236391

Change 238462 had a related patch set uploaded (by Filippo Giunchedi):
Revert "cassandra: updated gc settings"

https://gerrit.wikimedia.org/r/238462

Change 238462 merged by Filippo Giunchedi:
Revert "cassandra: updated gc settings"

https://gerrit.wikimedia.org/r/238462

Change 248960 had a related patch set uploaded (by Eevans):
cassandra: updated gc settings

https://gerrit.wikimedia.org/r/248960

Change 248960 merged by Filippo Giunchedi:
cassandra: updated gc settings

https://gerrit.wikimedia.org/r/248960

@Eevans, could you check if this is still a problem?

GWicke lowered the priority of this task from Medium to Low. Jul 11 2017, 8:44 PM
GWicke edited projects, added Services (later); removed Services.

@Eevans, could you check if this is still a problem?

GC has been performing quite well for some time; closing.