Testing of the Brotli compressor in T125906 revealed that impressive gains were possible with large Brotli windows, and correspondingly large Cassandra chunk_length_kb values. Using such large values for chunk_length_kb, though, will require a corresponding increase in the G1GC region size to prevent an abnormal number of humongous allocations. It would be useful to test these larger region sizes in isolation from upcoming changes (T126629, T125904), while the cluster state is quiescent.
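As a rough illustration of the relationship (a sketch, not Cassandra's actual allocation accounting): HotSpot's G1 classifies any single allocation of at least half a region as humongous, so a chunk's decompression buffer needs to stay under that threshold. The helper below is hypothetical, just to make the arithmetic concrete.

```python
MIB = 1024 * 1024

def is_humongous(alloc_bytes: int, region_bytes: int) -> bool:
    """G1 treats an allocation of >= 50% of the region size as humongous."""
    return alloc_bytes >= region_bytes // 2

# A 4 MiB chunk buffer (chunk_length_kb = 4096) against two region sizes:
print(is_humongous(4 * MIB, 4 * MIB))    # humongous with 4 MiB regions
print(is_humongous(4 * MIB, 32 * MIB))   # comfortably normal with 32 MiB regions
```

By this rule, moving to 32M regions raises the humongous threshold to 16 MiB, which is why the larger chunk sizes motivate the larger regions.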
restbase1013-a.eqiad.wmnet was live-hacked to have a 32M region size for ~2 days. GC plot below.
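For reference, a region-size override like this is normally applied through the Cassandra JVM options; the exact file and surrounding flags below are an assumption, not a record of the actual live-hack:

```
# e.g. in jvm.options / cassandra-env.sh (hypothetical reconstruction)
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
```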
Surprisingly, average collection time seems a bit lower than usual at this setting.
Though, once compared to other instances, the "improvement" looks a little suspect.
It's odd here that 1014-a dipped at the same time (no change was made there). It's also odd that both dipped to a level so consistent with the other hosts.
This is just speculation, but since load is often not distributed perfectly throughout the cluster, perhaps the restart of 1013-a merely forced a shift in traffic. If so, the inference would be that the change in region size had little to no effect (the same outcome as the prior 16M test).
It might be interesting to repeat this test.
The scope of this ticket was to determine the effect of larger G1GC region sizes in the production environment, independent of other changes. I think that goal has been accomplished, so I'm closing.
Yeah, I think it's pretty clear that increasing the region size does not materially affect GC performance either way. That's good news, as it eliminates one of the concerns we had about using larger regions to support large compression block sizes. We now know that any change in GC performance would be down to the change in compression block size, not a consequence of changing region size.