
Review mobile apps storage strategy, and look into retention policy select scaling
Closed, Resolved, Public

Description

There are some other issues here that we should look into:

  • The retention policy is set up to keep renders for 24 hours. It might make sense to change this to keep only one render, or at least only the renders from a short window (say, 10 minutes).
  • We only want to retain renders of the latest revision, so it might be worth considering using key_value storage instead of key_rev_value (see the sketch below).
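
For illustration, here is a rough CQL sketch of the difference (column names are taken from the query further down; the exact schema generated by restbase-mod-table-cassandra will differ in detail):

  -- key_rev_value: all renders (tid) of all revisions (rev) of a title share one
  -- partition, so expired renders and their tombstones pile up per title.
  CREATE TABLE data_key_rev_value (
      "_domain" text,
      "key" text,
      "rev" int,
      "tid" timeuuid,
      "value" blob,
      PRIMARY KEY (("_domain", "key"), "rev", "tid")
  ) WITH CLUSTERING ORDER BY ("rev" DESC, "tid" DESC);

  -- key_value: one row per title holding only the latest render, which keeps
  -- partitions small and makes the retention select trivial.
  CREATE TABLE data_key_value (
      "_domain" text,
      "key" text,
      "tid" timeuuid,
      "value" blob,
      PRIMARY KEY (("_domain", "key"))
  );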

Independently, it's worth looking into whether we can further optimize the select used by the retention policy, so that it is less easily overwhelmed and less prone to timing out.

Event Timeline

GWicke raised the priority of this task to High.
GWicke updated the task description.
GWicke added projects: RESTBase, Services.

Most of the request timeouts in logstash are from user dashboards that are constantly re-rendered by update jobs. They normally transclude fast-changing pages like https://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion, which triggers the re-renders.

The query looks like this:

select "key","rev","tid","latestTid","value","content-type","content-sha256","content-location","tags","_domain","_del" from "local_group_wikipedia_T_parsoid_html"."data" where "key" = ? AND "rev" = ? AND "_domain" = ?

It doesn't contain a limit, which is odd. Originally I thought it came from background updates, but the absence of the limit makes me less sure about this. The relevant accesses in key_rev_bucket seem to specify a limit as well.

Edit: I think these queries come from https://github.com/wikimedia/restbase/blob/master/mods/key_rev_value.js#L200-L205. The limit: 1 there is converted to fetchSize: 1 in getRaw, and dbu.buildGetQuery then ignores the query.limit parameter, so no LIMIT ends up in the generated query. A PR to clarify this in dbutils is at https://github.com/wikimedia/restbase-mod-table-cassandra/pull/161.
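
For comparison, with the limit actually passed through to CQL (a sketch, assuming buildGetQuery simply appends query.limit), the generated statement would end in an explicit LIMIT:

  select "key","rev","tid","latestTid","value","content-type","content-sha256","content-location","tags","_domain","_del"
      from "local_group_wikipedia_T_parsoid_html"."data"
      where "key" = ? AND "rev" = ? AND "_domain" = ? LIMIT 1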

Further investigation shows that requests for some of these articles with a lot of expired renders reliably fail, despite limit: 1 or fetchSize: 1. The likely cause is Cassandra reading all the tombstones in fully expired SSTable levels, which would also explain the higher-than-usual memory pressure and read latencies.
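
One way to quantify this (a diagnostic sketch; the key/rev values below are placeholders, not taken from the logs) is to enable tracing in cqlsh and re-run the select for one of the affected titles; the trace then reports how many tombstone cells had to be read:

  TRACING ON;
  select "key","rev","tid" from "local_group_wikipedia_T_parsoid_html"."data"
      where "_domain" = 'en.wikipedia.org' and "key" = 'Some_dashboard_page' and "rev" = 12345 limit 1;
  -- the trace output then contains a line along the lines of
  -- "Read 1 live and 50000 tombstone cells", which points at the tombstone scan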

Given a fixed re-render rate, the factors that will affect the number of tombstones kept around per partition are:

  • gc_grace interval, currently set to seven days.
  • Timeliness of compaction after gc_grace elapses. We are using leveled compaction, which generally ensures fairly timely compaction at the cost of more IO.
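
Both settings are visible in the table definition; a quick check from cqlsh (table name taken from the query above):

  DESCRIBE TABLE "local_group_wikipedia_T_parsoid_html"."data";
  -- the output includes, among other properties:
  --   gc_grace_seconds = 604800     -- i.e. seven days
  --   compaction = {'class': 'LeveledCompactionStrategy', ...}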

The most obvious solution would be to reduce gc_grace. The downside is the risk of deleted data coming back when a node re-joins after being down for more than max_hint_window_in_ms (default 3 hours). While a node being out or slow for a couple of hours does happen in practice, it seems much less likely that a node would be out for more than ~12 hours. I thus wonder if we could set both max_hint_window_in_ms and gc_grace to ~12 hours, combined with a rule to wipe (and re-bootstrap) nodes that are down for longer than that. During periods of general trouble likely to last more than 12 hours, we could increase gc_grace dynamically as needed. Relying more on hints should be especially viable with the new hint storage in 3.0, but there have also been optimizations since 2.1.1 that should make windows of more than 3 hours supportable at our relatively modest update rates. This discussion has at least one user reporting hint windows of 6 hours or more, but cautioning against going all the way to > 24 hours in Cassandra 2.1.
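
Concretely, the gc_grace part of the proposed change would look roughly like this (12 hours = 43200 seconds; the hint window is a cassandra.yaml setting on each node rather than CQL):

  ALTER TABLE "local_group_wikipedia_T_parsoid_html"."data"
      WITH gc_grace_seconds = 43200;
  -- and in cassandra.yaml:
  --   max_hint_window_in_ms: 43200000    # 12 hours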

After looking some more into date-tiered compaction, it actually appears to be a very good fit for key_rev_value and key_rev buckets, and more generally for anything that is versioned and predominantly written in timestamp order. Even with out-of-order writes (backfilling) of old revisions, access to new revisions should be reliably fast thanks to the non-overlapping timestamp ranges for new revisions. The non-overlapping property avoids the need to read through large numbers of tombstones in older SSTables, along with the associated memory pressure / OOM risk and timeouts.

To evaluate performance for our use case in practice, I have switched the parsoid html and data-parsoid buckets to the DateTieredCompactionStrategy. I have also started a forced re-render of all enwiki articles (with no-cache), to simulate a typical re-render workload.
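
The switch itself is a per-table compaction setting; roughly (shown here without sub-options, which then fall back to the DateTieredCompactionStrategy defaults):

  ALTER TABLE "local_group_wikipedia_T_parsoid_html"."data"
      WITH compaction = {'class': 'DateTieredCompactionStrategy'};
  -- the data-parsoid bucket's table was altered analogously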

The re-compaction is now mostly finished.

First results:

Read latency: [pasted graph attachment]

Pchelolo claimed this task.
Pchelolo subscribed.

OK, looks like this is done: we already use key_value for mobile apps and use DateTieredCompactionStrategy. Closing.