
Use Date-Tiered Compaction Strategy for time series and revisioned data
Closed, Resolved · Public

Description

The investigation into timeouts in T116933 showed that leveled compaction's inability to ignore sstables containing tombstones for a partition key is likely causing timeouts and memory pressure for keys that are extremely frequently rerendered. A subsequent experiment in staging shows promising performance with the Date-Tiered Compaction Strategy (DTCS). A key benefit of using this strategy is its non-overlapping sstable hierarchy by writetime, which lets it avoid reading older tombstones when it can establish that their writetime is further in the past based on the hierarchy.

Configuring individual tables to use DTCS is fairly simple:

alter table data WITH compaction = {'class': 'DateTieredCompactionStrategy', 'base_time_seconds':'3600', 'max_sstable_age_days':'365'} and gc_grace_seconds = 86400;

For this reason, I'm proposing to start cautiously testing this in production as well, starting with small projects. The metrics collected under realistic conditions will let us make an informed decision about using DTCS by default.
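Should production metrics regress, switching a table back is just as simple. A minimal sketch of the rollback, assuming the table previously used leveled compaction; the option value shown is an assumed default, not necessarily what was configured before:

-- Illustrative rollback only; the sstable_size_in_mb value is an assumption, not the production setting.
ALTER TABLE data
WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '160'};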

Event Timeline

GWicke raised the priority of this task to High.
GWicke updated the task description.
GWicke added projects: Services, RESTBase.
GWicke added subscribers: GWicke, Eevans, mobrovac and 2 others.
GWicke added a comment. Edited Oct 29 2015, 10:34 PM

I have gone ahead and switched the following tables to DTCS (size, per node; an example of the per-table statement follows the list):

  • local_group_wikiquote_T_title__revisions (81MB)
  • local_group_wikinews_T_parsoid_html (763MB)
  • local_group_wikivoyage_T_parsoid_html (1.85GB)
  • local_group_default_T_parsoid_html (12.24GB)
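For illustration, the switch for the first of these would look roughly as follows, assuming each listed name is a keyspace containing a data table and that the options match the command in the description:

-- Assumed form of the per-table switch; keyspace layout and option values mirror the description above.
ALTER TABLE "local_group_wikiquote_T_title__revisions".data
WITH compaction = {'class': 'DateTieredCompactionStrategy', 'base_time_seconds': '3600', 'max_sstable_age_days': '365'}
AND gc_grace_seconds = 86400;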

Remaining candidates, based on size:

  • local_group_wiktionary_T_parsoid_html (14.94GB)
  • local_group_wikipedia_T_parsoid_section_offsets (15.82GB)
  • local_group_wikipedia_T_title__revisions (36.99GB)
  • local_group_wikimedia_T_parsoid_html (44.53GB)
  • local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ (251.69GB)
  • local_group_wikipedia_T_parsoid_html (479.48GB)

With the metrics for local_group_default_T_parsoid_html continuing to look okay & compaction mostly done, I have now converted local_group_wiktionary_T_parsoid_html as well.
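A quick way to confirm which strategy a given table is on (assuming the Cassandra 2.1-era system tables; newer versions expose this via system_schema.tables instead):

-- Checks the active compaction strategy for one keyspace (Cassandra 2.1 schema tables assumed).
SELECT columnfamily_name, compaction_strategy_class
FROM system.schema_columnfamilies
WHERE keyspace_name = 'local_group_wiktionary_T_parsoid_html';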

SSTables per read for local_group_wiktionary_T_parsoid_html looks good so far.

Last night I also switched local_group_wikipedia_T_title__revisions.

Latencies on restbase100[7-9] in particular are very much affected by overall iowait, which in turn is dominated by wikipedia html and data-parsoid compactions. As a result, the minimal effect we are seeing from switching to DTCS seems to be within the margin of error at this point.

We just discussed next steps on IRC. There is one mid-sized table left (local_group_wikimedia_T_parsoid_html, 44.53GB), which we plan to convert on Monday. After that, we'd like to wait for a couple of days to get more data for the decision on whether to switch Wikipedia data-parsoid and HTML. If we decide to go ahead, we should be able to kick that off the following Monday.

GWicke updated the task description. Oct 30 2015, 5:08 PM
GWicke set Security to None.

I did a bit more research on DTCS over the weekend. Summary: Both theory and practice show DTCS to be a good fit for our current use cases. There are issues that will eventually become relevant to us, but it looks like they will be addressed before they impact us:

  • handling of really old data (> 1 year)
  • handling of significant revision backfilling that uses the original revision's timestamp as the writetime

Detail:

One issue is DTCS's use of the minimum timestamp to determine window membership. This has the consequence that renders using an old revision's timestamp as their writetime would force a merge with large sstables in an old window, even if most data is new. This would be very inefficient. Fortunately, we currently write all revisions using the current time, so this isn't directly impacting us (apart from read repair). This does, however, reduce data locality for old revisions, as new sstables will end up spanning range keys that include rather old revisions, weakening the correlation between timestamp and revision. This isn't the end of the world, as performance of old-revision access isn't that critical in any case. Filling in one article at a time can also localize revisions in a single sstable, even if it is technically not the correct time window. See also: https://issues.apache.org/jira/browse/CASSANDRA-10280
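One way to spot-check that renders really are written with the current time rather than a revision's timestamp is to compare writetime() against the wall clock. A sketch with hypothetical column names (the actual schema may differ):

-- Hypothetical spot-check; "key", rev and value are illustrative column names, not the actual schema.
SELECT "key", rev, writetime(value)
FROM data
WHERE "key" = 'Some_article_title'
LIMIT 5;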

Another issue is the interplay between DTCS's max_sstable_age_days and bootstraps on old data. By default, sstables containing data written more than a year ago are no longer compacted. This somewhat bounds the maximum size of sstables, and avoids spending resources on compactions that aren't really needed. However, after bootstraps with vnodes we typically end up with very small sstables, and really need those to be compacted. There is discussion about bounding the max time window in https://issues.apache.org/jira/browse/CASSANDRA-10280, and Cassandra 2.1.12 will start to use size-tiered compaction within a single time window.
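If small post-bootstrap sstables full of old data become a problem before those fixes land, one stopgap would be to raise the age cap so that such sstables remain eligible for compaction. Purely illustrative; the value is an assumption, not a recommendation:

-- Illustrative stopgap only: a higher age cap keeps old, small post-bootstrap sstables compactable.
ALTER TABLE data
WITH compaction = {'class': 'DateTieredCompactionStrategy', 'base_time_seconds': '3600', 'max_sstable_age_days': '1000'};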

This comment summarizes the current state of the discussion quite well, and provides links to other tasks.

local_group_wikimedia_T_parsoid_html was switched as well, and is looking good.

Remaining major keyspaces:

  • local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ (251.69GB)
  • local_group_wikipedia_T_parsoid_html (479.48GB)

Performance in staging is looking rather good:

https://grafana.wikimedia.org/dashboard/db/restbase-staging-cassandra?from=1446349523229&to=1446482723230

Compaction activity and read latencies are consistently low, despite a full forced re-render of enwiki running.

local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ is now switched to DTCS as well. Re-compaction is ongoing, and will take a while longer. Read latency on the column family has been stable so far.

Patch for a basic update pattern hint option & option migrations at https://github.com/wikimedia/restbase-mod-table-cassandra/pull/162.

Note: SSTables per read has been increasing globally; spotted because the related Graphite alarm has been firing.

GWicke added a comment. Edited Nov 11 2015, 6:07 PM

Even after everything is fully compacted to the lowest level (multi-TB SSTables are currently being built on several nodes), the SSTable per read metric will likely be slightly higher with DTCS than with LCS. However, the extra SSTables seem to be small & in page cache, so this doesn't necessarily imply higher read latencies, memory pressure, or iowait.

Graphs: p50 read latency (last 30 days), p99 read latency (last 30 days), and iowait (last ~25 days; metrics seem to have gone missing before that).

Yes, @fgiunchedi and I chatted about the need to (re)establish some "new normals" for some of these metrics.

I think it follows from the way the algorithms work that we should expect a slightly higher SSTables/read metric in the worst case, and a lower SSTables/read in the typical case. The graphs would seem to bear that out as well.

Pending compactions is another metric I wouldn't be too alarmed over. It is an estimate, and the larger the individual compactions (and thus the lower the rate of completed compactions), the less accurate those estimates become. This likely explains why they're so much spikier.

Tracked in T118976: establish new thresholds for cassandra alarms after switching restbase to dtcs.

GWicke closed this task as Resolved. Oct 4 2016, 6:50 PM
GWicke claimed this task.