
Investigate aberrant disk read throughput in Cassandra (affects 2.2.x and 3.x)
Open, High, Public

Description

Original summary

Upon upgrading the first host in production (restbase1007) to Cassandra 2.2.6, very high levels of disk read throughput were encountered (10x or more). Ultimately it required setting disk_access_mode to mmap_index_only to restore normal levels. Since memory-mapped decompression reads were an important new feature of 2.2 for us, we should figure out why this happens, and what is needed to correct it.
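For reference, the work-around is a single setting in cassandra.yaml. A minimal sketch (the comments are illustrative, not a copy of our production config):

```
# cassandra.yaml (sketch): restrict memory-mapped I/O to index files only,
# so compressed data files are read via standard buffered I/O instead of mmap.
# The default, "auto", memory-maps both indexes and data files where possible.
disk_access_mode: mmap_index_only
```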

See also: https://wikitech.wikimedia.org/wiki/Incident_documentation/20160531-RESTBase

Update

This issue still affects Cassandra 3, so it is still relevant. The work-around from 2.2 still works, but the option might no longer be officially supported (@Eevans: is this accurate?).

Event Timeline

Mentioned in SAL [2016-06-09T13:26:39Z] <urandom> Restarting Cassandra on xenon.eqiad.wmnet to apply 2G file cache : T137419

Mentioned in SAL [2016-06-09T14:07:39Z] <urandom> Re-enabling puppet on xenon.eqiad.wmnet, forcing a run, and restarting Cassandra : T137419

As an experiment, I started 3 separate dump processes in the Staging environment, two of which used title name offsets ('Fa' and 'Sa' respectively). The increased concurrency and request distribution were sufficient to reproduce the issues observed in production.

As can be seen in the attached plot, when disk_access_mode was set to mmap_index_only at ~10:55 on praseodymium and cerium, disk read throughput dropped to tens of MBps, while on xenon (disk_access_mode: auto) throughput remained at several hundred MBps.

Screenshot from 2016-06-09 16-02-13.png (attached plot of per-node disk read throughput, 63 KB)

Additionally, over the course of this run I incrementally increased the value of file_cache_size_in_mb from the default of 512, to 768, 1024, and 2048 (after bumping heap size to 6G). This value controls the amount of memory allocated to o.a.c.service.FileCacheService, which caches o.a.c.io.util.RandomAccessReader instances. When RAR instances are evicted from FCS, the corresponding buffers are unmapped, so there was some hope that increasing the cache size might in turn reduce the number of page faults. Unfortunately, I saw no indication that this was the case; the rate of major page faults remained relatively constant throughout (~280/s, compared to ~15/s on the other two nodes).
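For context, the progression described above maps to a single cassandra.yaml knob; a sketch of the final state tried (the heap change itself is made in cassandra-env.sh/jvm.options, not in this file):

```
# cassandra.yaml (sketch): size of the cache backing o.a.c.service.FileCacheService.
# Values tried during this run: 512 (the default), then 768, 1024, and finally 2048,
# the last step only after raising the JVM heap to 6G.
file_cache_size_in_mb: 2048
```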

Mentioned in SAL [2016-06-17T15:54:58Z] <urandom> Restarting Cassandra on xenon.eqiad.wmnet to enable large pages : T137419

Mentioned in SAL [2016-06-17T15:59:30Z] <urandom> Starting html dumps from xenon.eqiad.wmnet and cerium.eqiad.wmnet : T137419

Mentioned in SAL [2016-06-17T18:56:36Z] <urandom> Restarting Cassandra on xenon.eqiad.wmnet with -XX:+PreserveFramePointer : T137419

Mentioned in SAL [2016-06-17T20:35:51Z] <urandom> Disabling puppet on xenon.eqiad.wmnet : T137419

Mentioned in SAL [2016-06-17T20:39:11Z] <urandom> Restarting Cassandra on xenon.eqiad.wmnet to apply -XX:+PreserveFramePointer : T137419

Mentioned in SAL [2016-06-17T21:21:45Z] <urandom> Reenabling puppet and resetting configuration on xenon.eqiad.wmnet : T137419

Do we still care about this for 2.2, or should we just drop it & revisit as part of 3.x or Scylla testing later?

GWicke lowered the priority of this task from High to Low. Oct 12 2016, 5:43 PM
GWicke added a project: Services (later).

Change 361506 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] Limit mmap to indexes (work-around for abnormal page faults)

https://gerrit.wikimedia.org/r/361506

Change 361506 merged by Dzahn:
[operations/puppet@production] Limit mmap to indexes (work-around for abnormal page faults)

https://gerrit.wikimedia.org/r/361506

@Eevans, is this relevant for 3.11?

Yes, it is.

GWicke renamed this task from Investigate aberrant disk read throughput in Cassandra 2.2.6 to Investigate aberrant disk read throughput in Cassandra 3 (originally 2.2.6). Jul 18 2017, 9:53 PM
GWicke updated the task description.
Eevans renamed this task from Investigate aberrant disk read throughput in Cassandra 3 (originally 2.2.6) to Investigate aberrant disk read throughput in Cassandra (affects 2.2.x and 3.x). Oct 6 2017, 4:41 PM

@Eevans I believe this is done?

Actually, no. We maintain a setting in our configs (2.2 & 3.11) that has since been omitted upstream. In the absence of this setting, the number of major page faults (and, correspondingly, disk read throughput) skyrockets. Typically, when a setting like this is removed, it's under the assumption that the default is The Right Thing™ in all circumstances, and the ability to override it will at some point be removed.
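Concretely, what we carry is the same one-line override described in the task summary, even though the setting no longer appears in the upstream example cassandra.yaml (a sketch; the comment reflects our local reasoning, not upstream documentation):

```
# Kept in our 2.2 and 3.11 cassandra.yaml templates despite being omitted from the
# upstream example config. Without it the effective mode falls back to "auto" and
# major page faults (and disk read throughput) climb sharply on our workload.
disk_access_mode: mmap_index_only
```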

This is actually something we should be following up on with upstream more aggressively.

> This is actually something we should be following up on with upstream more aggressively.

@Eevans: Hi, do you plan to do this? Asking as you have been the task assignee for a while now.

Eevans removed Eevans as the assignee of this task. Mar 3 2020, 3:51 PM

> This is actually something we should be following up on with upstream more aggressively.

> @Eevans: Hi, do you plan to do this? Asking as you have been the task assignee for a while now.

The issue is real (and important), so I hate to close it (even if it's obvious that we're not making the time to work on it).

Eevans raised the priority of this task from Low to High. Jun 7 2021, 7:59 PM

Elevating priority in the hope that it attracts more attention.