
Investigate aberrant disk read throughput in Cassandra (affects 2.2.x and 3.x)
Open, High, Public

Description

Original summary

Upon upgrading the first host in production (restbase1007) to Cassandra 2.2.6, very high levels of disk read throughput were encountered (10x or more). Ultimately it required setting disk_access_mode to mmap_index_only to restore normal levels. Since memory-mapped decompression reads were an important new feature of 2.2 for us, we should figure out why this happens, and what is needed to correct it.
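For reference, the work-around is a single setting in cassandra.yaml. A minimal sketch (the comments are illustrative, not a copy of our production config):

```
# cassandra.yaml (sketch): restrict memory-mapped I/O to index files only,
# so compressed data files are read via standard buffered I/O instead of mmap.
# The default, "auto", memory-maps both indexes and data files where possible.
disk_access_mode: mmap_index_only
```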

See also: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160531-RESTBase

Update

This issue still affects Cassandra 3, so it is still relevant. The work-around from 2.2 still works, but the option might no longer be officially supported (@Eevans: is this accurate?).

Event Timeline

Mentioned in SAL [2016-06-09T13:26:39Z] <urandom> Restarting Cassandra on xenon.eqiad.wmnet to apply 2G file cache : T137419

Mentioned in SAL [2016-06-09T14:07:39Z] <urandom> Re-enabling puppet on xenon.eqiad.wmnet, forcing a run, and restarting Cassandra : T137419

As an experiment, I started 3 separate dump processes in the Staging environment, two of which used title name offsets ('Fa' and 'Sa' respectively). The increased concurrency and request distribution were sufficient to reproduce the issues observed in production.

As can be seen in the attached plot, when disk_access_mode was set to mmap_index_only at ~10:55 on praseodymium and cerium, disk read throughput dropped to tens of MBps, while on xenon (disk_access_mode: auto) throughput remained at several hundred MBps.

Screenshot from 2016-06-09 16-02-13.png (attached plot of per-node disk read throughput, 63 KB)

Additionally, over the course of this run I incrementally increased the value of file_cache_size_in_mb from the default of 512, to 768, 1024, and 2048 (after bumping heap size to 6G). This value controls the amount of memory allocated to o.a.c.service.FileCacheService, which caches o.a.c.io.util.RandomAccessReader instances. When RAR instances are evicted from FCS, the corresponding buffers are unmapped, so there was some hope that increasing the cache size might in turn reduce the number of page faults. Unfortunately, I saw no indication that this was the case; the rate of major page faults remained relatively constant throughout (~280/s, compared to ~15/s on the other two nodes).
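For context, the progression described above maps to a single cassandra.yaml knob; a sketch of the final state tried (the heap change itself is made in cassandra-env.sh/jvm.options, not in this file):

```
# cassandra.yaml (sketch): size of the cache backing o.a.c.service.FileCacheService.
# Values tried during this run: 512 (the default), then 768, 1024, and finally 2048,
# the last step only after raising the JVM heap to 6G.
file_cache_size_in_mb: 2048
```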

Mentioned in SAL [2016-06-17T15:54:58Z] <urandom> Restarting Cassandra on xenon.eqiad.wmnet to enable large pages : T137419

Mentioned in SAL [2016-06-17T15:59:30Z] <urandom> Starting html dumps from xenon.eqiad.wmnet and cerium.eqiad.wmnet : T137419

Mentioned in SAL [2016-06-17T18:56:36Z] <urandom> Restarting Cassandra on xenon.eqiad.wmnet with -XX:+PreserveFramePointer : T137419

Mentioned in SAL [2016-06-17T20:35:51Z] <urandom> Disabling puppet on xenon.eqiad.wmnet : T137419

Mentioned in SAL [2016-06-17T20:39:11Z] <urandom> Restarting Cassandra on xenon.eqiad.wmnet to apply -XX:+PreserveFramePointer : T137419

Mentioned in SAL [2016-06-17T21:21:45Z] <urandom> Reenabling puppet and resetting configuration on xenon.eqiad.wmnet : T137419

Do we still care about this for 2.2, or should we just drop it & revisit as part of 3.x or Scylla testing later?

GWicke lowered the priority of this task from High to Low. Oct 12 2016, 5:43 PM
GWicke added a project: Services (later).

Change 361506 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Limit mmap to indexes (work-around for abnormal page faults)

https://gerrit.wikimedia.org/r/361506

Change 361506 merged by Dzahn:
[operations/puppet@production] Limit mmap to indexes (work-around for abnormal page faults)

https://gerrit.wikimedia.org/r/361506

@Eevans, is this relevant for 3.11?

Yes, it is.

GWicke renamed this task from Investigate aberrant disk read throughput in Cassandra 2.2.6 to Investigate aberrant disk read throughput in Cassandra 3 (originally 2.2.6). Jul 18 2017, 9:53 PM
GWicke updated the task description.
Eevans renamed this task from Investigate aberrant disk read throughput in Cassandra 3 (originally 2.2.6) to Investigate aberrant disk read throughput in Cassandra (affects 2.2.x and 3.x). Oct 6 2017, 4:41 PM

@Eevans I believe this is done?

Actually, no. We maintain a setting in our configs (2.2 & 3.11) that has since been omitted upstream. In the absence of this setting, the number of major page faults (and, correspondingly, disk read throughput) skyrockets. Typically, when a setting like this is removed, it's under the assumption that the default is The Right Thing™ in all circumstances, and the ability to override it will at some point be removed.
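Concretely, what we carry is the same one-line override described in the task summary, even though the setting no longer appears in the upstream example cassandra.yaml (a sketch; the comment reflects our local reasoning, not upstream documentation):

```
# Kept in our 2.2 and 3.11 cassandra.yaml templates despite being omitted from the
# upstream example config. Without it the effective mode falls back to "auto" and
# major page faults (and disk read throughput) climb sharply on our workload.
disk_access_mode: mmap_index_only
```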

This is actually something we should be following up on with upstream more aggressively.

> This is actually something we should be following up on with upstream more aggressively.

@Eevans: Hi, do you plan to do this? Asking as you have been the task assignee for a while now.

Eevans removed Eevans as the assignee of this task. Mar 3 2020, 3:51 PM

> This is actually something we should be following up on with upstream more aggressively.

> @Eevans: Hi, do you plan to do this? Asking as you have been the task assignee for a while now.

The issue is real (and important), so I hate to close it (even if it's obvious that we're not making the time to work on it).

Eevans raised the priority of this task from Low to High. Jun 7 2021, 7:59 PM

Elevating priority in the hope that it attracts more attention.