
Use Date-Tiered Compaction Strategy for time series and revisioned data
Closed, Resolved · Public

Description

The investigation into timeouts in T116933 showed that leveled compaction's inability to ignore sstables containing tombstones for a partition key is likely causing timeouts and memory pressure for keys that are extremely frequently rerendered. A subsequent experiment in staging shows promising performance with the Date-Tiered Compaction Strategy (DTCS). A key benefit of using this strategy is its non-overlapping sstable hierarchy by writetime, which lets it avoid reading older tombstones when it can establish that their writetime is further in the past based on the hierarchy.

Configuring individual tables to use DTCS is fairly simple:

alter table data WITH compaction = {'class': 'DateTieredCompactionStrategy', 'base_time_seconds':'3600', 'max_sstable_age_days':'365'} and gc_grace_seconds = 86400;

For this reason, I'm proposing to start cautiously testing this in production as well, starting with small projects. The metrics collected under realistic conditions will let us make an informed decision about using DTCS by default.
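Should production metrics regress, switching a table back is just as simple. A minimal sketch of the rollback, assuming the table previously used leveled compaction; the option value shown is an assumed default, not necessarily what was configured before:

-- Illustrative rollback only; the sstable_size_in_mb value is an assumption, not the production setting.
ALTER TABLE data
WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '160'};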

Event Timeline

GWicke raised the priority of this task to High.
GWicke updated the task description.
GWicke added projects: Services, RESTBase.
GWicke added subscribers: GWicke, Eevans, mobrovac and 2 others.
GWicke added a comment. Edited Oct 29 2015, 10:34 PM

I have gone ahead and switched the following tables to DTCS (size, per node; an example of the per-table statement follows the list):

  • local_group_wikiquote_T_title__revisions (81MB)
  • local_group_wikinews_T_parsoid_html (763MB)
  • local_group_wikivoyage_T_parsoid_html (1.85GB)
  • local_group_default_T_parsoid_html (12.24GB)
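For illustration, the switch for the first of these would look roughly as follows, assuming each listed name is a keyspace containing a data table and that the options match the command in the description:

-- Assumed form of the per-table switch; keyspace layout and option values mirror the description above.
ALTER TABLE "local_group_wikiquote_T_title__revisions".data
WITH compaction = {'class': 'DateTieredCompactionStrategy', 'base_time_seconds': '3600', 'max_sstable_age_days': '365'}
AND gc_grace_seconds = 86400;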

Remaining candidates, based on size:

  • local_group_wiktionary_T_parsoid_html (14.94GB)
  • local_group_wikipedia_T_parsoid_section_offsets (15.82GB)
  • local_group_wikipedia_T_title__revisions (36.99GB)
  • local_group_wikimedia_T_parsoid_html (44.53GB)
  • local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ (251.69GB)
  • local_group_wikipedia_T_parsoid_html (479.48GB)

With the metrics for local_group_default_T_parsoid_html continuing to look okay & compaction mostly done, I have now converted local_group_wiktionary_T_parsoid_html as well.
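A quick way to confirm which strategy a given table is on (assuming the Cassandra 2.1-era system tables; newer versions expose this via system_schema.tables instead):

-- Checks the active compaction strategy for one keyspace (Cassandra 2.1 schema tables assumed).
SELECT columnfamily_name, compaction_strategy_class
FROM system.schema_columnfamilies
WHERE keyspace_name = 'local_group_wiktionary_T_parsoid_html';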

SSTables per read for local_group_wiktionary_T_parsoid_html looks good so far.

Last night I also switched local_group_wikipedia_T_title__revisions.

Latencies on restbase100[7-9] in particular are very much affected by overall iowait, which in turn is dominated by wikipedia html and data-parsoid compactions. As a result, the minimal effect we are seeing from switching to DTCS seems to be within the margin of error at this point.

We just discussed next steps on IRC. There is one mid-sized table left (local_group_wikimedia_T_parsoid_html, 44.53GB), which we plan to convert on Monday. After that, we'd like to wait for a couple of days to get more data for the decision on whether to switch Wikipedia data-parsoid and HTML. If we decide to go ahead, we should be able to kick that off the following Monday.

GWicke updated the task description. Oct 30 2015, 5:08 PM
GWicke set Security to None.

I did a bit more research on DTCS over the weekend. Summary: Both theory and practice show DTCS to be a good fit for our current use cases. There are issues that will eventually become relevant to us, but it looks like they will be addressed before they impact us:

  • handling of really old data (> 1 year)
  • handling of significant revision backfilling that uses the original revision's timestamp as the writetime

Detail:

One issue is DTCS's use of the minimum timestamp to determine window membership. This has the consequence that renders using an old revision's timestamp as their writetime would force a merge with large sstables in an old window, even if most data is new. This would be very inefficient. Fortunately, we currently write all revisions using the current time, so this isn't directly impacting us (apart from read repair). This does, however, reduce data locality for old revisions, as new sstables will end up spanning range keys that include rather old revisions, weakening the correlation between timestamp and revision. This isn't the end of the world, as performance of old-revision access isn't that critical in any case. Filling in one article at a time can also localize revisions in a single sstable, even if it is technically not the correct time window. See also: https://issues.apache.org/jira/browse/CASSANDRA-10280
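One way to spot-check that renders really are written with the current time rather than a revision's timestamp is to compare writetime() against the wall clock. A sketch with hypothetical column names (the actual schema may differ):

-- Hypothetical spot-check; "key", rev and value are illustrative column names, not the actual schema.
SELECT "key", rev, writetime(value)
FROM data
WHERE "key" = 'Some_article_title'
LIMIT 5;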

Another issue is the interplay between DTCS's max_sstable_age_days and bootstraps on old data. By default, sstables containing data written more than a year ago are no longer compacted. This somewhat bounds the maximum size of sstables, and avoids spending resources on compactions that aren't really needed. However, after bootstraps with vnodes we typically end up with very small sstables, and really need those to be compacted. There is discussion about bounding the max time window in https://issues.apache.org/jira/browse/CASSANDRA-10280, and Cassandra 2.1.12 will start to use size-tiered compaction within a single time window.
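If small post-bootstrap sstables full of old data become a problem before those fixes land, one stopgap would be to raise the age cap so that such sstables remain eligible for compaction. Purely illustrative; the value is an assumption, not a recommendation:

-- Illustrative stopgap only: a higher age cap keeps old, small post-bootstrap sstables compactable.
ALTER TABLE data
WITH compaction = {'class': 'DateTieredCompactionStrategy', 'base_time_seconds': '3600', 'max_sstable_age_days': '1000'};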

This comment summarizes the current state of the discussion quite well, and provides links to other tasks.

local_group_wikimedia_T_parsoid_html was switched as well, and is looking good.

Remaining major keyspaces:

  • local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ (251.69GB)
  • local_group_wikipedia_T_parsoid_html (479.48GB)

Performance in staging is looking rather good:

https://grafana.wikimedia.org/dashboard/db/restbase-staging-cassandra?from=1446349523229&to=1446482723230

Compaction activity and read latencies are consistently low, despite a full forced re-render of enwiki running.

local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ is now switched to DTCS as well. Re-compaction is ongoing, and will take a while longer. Read latency on the column family has been stable so far.

Patch for a basic update pattern hint option & option migrations at https://github.com/wikimedia/restbase-mod-table-cassandra/pull/162.

Note: SSTables per read has been increasing globally; spotted because the related Graphite alarm has been firing.

GWicke added a comment. Edited Nov 11 2015, 6:07 PM

Even after everything is fully compacted to the lowest level (multi-TB SSTables are currently being built on several nodes), the SSTable per read metric will likely be slightly higher with DTCS than with LCS. However, the extra SSTables seem to be small & in page cache, so this doesn't necessarily imply higher read latencies, memory pressure, or iowait.

Graphs: p50 read latency (last 30 days), p99 read latency (last 30 days), and iowait (last ~25 days; metrics seem to have gone missing before that).

Yes, @fgiunchedi and I chatted about the need to (re)establish some "new normals" for some of these metrics.

I think it follows from the way the algorithms work that we should expect a slightly higher SSTables/read metric in the worst case, and a lower SSTables/read in the typical case. The graphs would seem to bear that out as well.

Pending compactions is another metric I wouldn't be too alarmed over. It is an estimate, and the larger the individual compactions (and thus the lower the rate of completed compactions), the less accurate those estimates become. This likely explains why they're so much spikier.

Tracked in T118976: establish new thresholds for cassandra alarms after switching restbase to dtcs.

GWicke closed this task as Resolved. Oct 4 2016, 6:50 PM
GWicke claimed this task.