
Isolated testing of GC settings for aggressive Cassandra chunk_length_kb values
Closed, Resolved · Public

Description

Testing of the Brotli compressor in T125906 revealed that impressive gains were possible with large Brotli windows, and correspondingly large Cassandra chunk_length_kb values. Using such large values for chunk_length_kb, though, will require a corresponding increase in G1GC region size to prevent an abnormal number of Humongous allocations. It would be useful to test these larger region sizes in isolation from the upcoming changes (T126629, T125904), while the cluster state is quiescent.
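
For reference, a minimal sketch of how the two settings relate. G1 treats any allocation of at least half a region as humongous, so the region size needs to be more than twice the compression chunk size. The paths, keyspace/table names, and the 4 MB chunk value below are illustrative placeholders, not the values under test here:

```
# Sketch only; file paths, table/keyspace names, and the chunk size are
# placeholders, not the values under test in this task.

# A 4 MB compression chunk wants a region size larger than 8 MB, i.e. 16m
# (set in the Cassandra JVM options, e.g. cassandra-env.sh / jvm.options):
#   -XX:+UseG1GC
#   -XX:G1HeapRegionSize=16m

# Corresponding table-level compression setting (2.x-style CQL syntax;
# DeflateCompressor stands in for whichever compressor is under test):
cqlsh -e "ALTER TABLE some_keyspace.some_table
          WITH compression = {'sstable_compression': 'DeflateCompressor',
                              'chunk_length_kb': '4096'};"
```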

Event Timeline

Eevans triaged this task as Medium priority. Apr 27 2016, 8:07 PM
Eevans moved this task from Backlog to Next on the Cassandra board.

Mentioned in SAL [2016-09-06T20:56:46Z] <urandom> T133805: Disabling Puppet for GC experiment on restbase1013.eqiad.wmnet

Mentioned in SAL [2016-09-06T20:59:15Z] <urandom> T133805: Restarting Cassandra to apply G1 region size of 16M on restbase1013-a.eqiad.wmnet

Mentioned in SAL [2016-09-09T15:15:22Z] <urandom> T133805: Re-enabling Puppet, forcing run, and restarting Cassandra to restore 8M region size on restbase1013-a.eqiad.wmnet
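
For posterity, the live-hack recorded in the SAL entries above amounted to roughly the following (a sketch; the config location and systemd unit name are assumptions for a multi-instance host like restbase1013):

```
# Sketch of the procedure; paths and the unit name are assumptions.
sudo puppet agent --disable "T133805: GC region size experiment"

# Bump the region size for the -a instance only (the default here is 8m),
# by editing that instance's JVM options:
#   -XX:G1HeapRegionSize=16m

sudo systemctl restart cassandra-a    # hypothetical unit name

# To restore: re-enable Puppet, force a run to revert the local edit, restart.
sudo puppet agent --enable
sudo puppet agent --test
sudo systemctl restart cassandra-a
```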

restbase1013-a.eqiad.wmnet was live-hacked to have a 16M region size for ~2½ days. GC plot below (there is no difference that I can discern).

Screenshot from 2016-09-09 10-03-02.png (GC plot, 16M region size test)

NOTE: I plan to run a similar test at 32M next week (once some planned restarts are out of the way).

Mentioned in SAL [2016-09-12T19:12:13Z] <urandom> T133805: Disabling Puppet for GC experiment on restbase1013.eqiad.wmnet

Mentioned in SAL [2016-09-12T19:13:31Z] <urandom> T133805: Restarting Cassandra to apply G1 region size of 32M on restbase1013-a.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2016-09-14T18:17:07Z] <urandom> T133805: Re-enabling Puppet, forcing run, and restarting Cassandra to restore 8M region size on restbase1013-a.eqiad.wmnet

restbase1013-a.eqiad.wmnet was live-hacked to have a 32M region size for ~2 days. GC plot below.

Surprisingly, average collection time seems a bit lower than usual at this setting.

Screenshot from 2016-09-14 13-29-46.png (GC plot, 32M region size test)

Though, once compared to other instances, the "improvement" looks a little suspect.

Screenshot from 2016-09-14 13-39-20.png (GC comparison across instances)

It's odd that 1014-a dipped at the same time, and from/to similar levels, even though no change was made there. It's also odd that both dipped to a level so consistent with the other hosts.

This is just speculation, but since load is often not distributed perfectly throughout the cluster, perhaps the restart of 1013-a merely forced a shift in traffic? If so, the inference would be that the change in region size had little to no effect (the same as in the prior 16M test).

It might be interesting to repeat this test.
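
If it is repeated, one way to cross-check the dashboard numbers would be to sample GC stats per instance directly (a sketch; the host list is just the pair discussed above, and multi-instance hosts may need an explicit JMX port per instance):

```
# Sample per-instance GC stats as a cross-check on the Grafana panels.
for host in restbase1013 restbase1014; do
    echo "== ${host}-a =="
    ssh "${host}.eqiad.wmnet" nodetool gcstats   # add -p <jmx port> if needed
done
```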

Eevans moved this task from Blocked to In-Progress on the Cassandra board.

The scope of this ticket was to determine the effect of larger G1GC region sizes in the production environment, independent of other changes. I think that goal has been accomplished, so I'm closing.

Yeah, I think it's pretty clear that increasing the region size does not materially affect GC performance either way. That's good news, as it eliminates one of the concerns we had about using larger regions to support large compression block sizes. We now know that any change in GC performance would be down to the change in compression block size, and not a consequence of changing the region size.