
Low-latency current revision storage
Closed, Resolved · Public

Description

As a sub-task of T120171, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner.

Option 1: Retention policies using application-level TTLs

This approach uses a schema identical to that of the current storage model, one that utilizes so-called wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs only in how it approaches retention.

Since renders are keyed on a type-1 UUID, retaining a single current render, and (at least) 24 hours worth of past renders, is as simple as batching a range delete with new renders, using a tid predicate 24 hours less than the one being inserted.
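
As a rough sketch of such a batch (assuming a wide-row data table keyed (("_domain", title), revision, tid), with the cutoff TID computed by the application as the new render's TID minus 24 hours):

BEGIN BATCH
  -- store the new render
  INSERT INTO data ("_domain", title, revision, tid, html, data_parsoid, section_offsets)
  VALUES (?, ?, ?, ?, ?, ?, ?);
  -- drop renders of this revision older than the 24-hour cutoff
  DELETE FROM data WHERE "_domain" = ? AND title = ? AND revision = ? AND tid < ?;
APPLY BATCH;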

Limiting revisions is slightly more challenging, since the revision is an integer and no temporal context exists. As a result, additional storage is used to establish this relationship, mapping timestamps to corresponding revisions. Records in this timeline are keyed by domain (on the assumption that MediaWiki sharding would never be more granular than this). Updates to the timeline can be performed probabilistically, if necessary. TTLs can be applied to prevent unbounded growth.
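
A hypothetical sketch of such a timeline table (the table name, clustering order, and 10-day TTL are illustrative only):

CREATE TABLE revision_timeline (
  "_domain" text,
  ts timestamp,
  revision int,
  PRIMARY KEY ("_domain", ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND default_time_to_live = 864000;

-- find the revision that was current as of the 24-hour cutoff
SELECT revision FROM revision_timeline
WHERE "_domain" = ? AND ts <= ? LIMIT 1;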

See https://www.mediawiki.org/wiki/RESTBase/StorageDesign#Retention_policies_using_application-level_TTLs for a more thorough explanation.

Option 2: Separate TTL table with range deletes on latest content

This option is very similar to the previous one, but it avoids the need for the timeline index. It uses two tables with schemas identical to the one used now: a latest table with no TTL on the data, and a ttl table with a table-level TTL of 24 hours.
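
A rough sketch of the two tables (column set copied from the existing schema; the exact key layout is an assumption):

CREATE TABLE latest (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title), revision, tid)
);

-- identical schema, but rows expire automatically after 24 hours
CREATE TABLE ttl (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title), revision, tid)
) WITH default_time_to_live = 86400;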

On read, the data is first read from the latest table, falling back to the ttl table, and finally to generating the content on demand. If the data was generated on demand, it is written to both the latest and the ttl table. If the data was found only in the ttl table, TTL refreshing may be applied if it is about to expire (TTL refreshing simply writes the data to the ttl table again, potentially with a new TID).
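
As a sketch, the read path maps to queries like these (table layouts as assumed above; the query shape for a fully-specified request is shown, and the generate-on-demand step is application logic, described in comments):

-- 1. try the durable latest table first
SELECT html, data_parsoid, section_offsets FROM latest
WHERE "_domain" = ? AND title = ? AND revision = ? AND tid = ?;

-- 2. on a miss, fall back to the TTL table
SELECT html, data_parsoid, section_offsets, TTL(html) AS remaining FROM ttl
WHERE "_domain" = ? AND title = ? AND revision = ? AND tid = ?;

-- 3. on a miss here as well, re-generate the content and write it to both tables;
--    if the row was found only in the ttl table and its remaining TTL is small,
--    re-insert it into the ttl table (possibly with a new TID) to refresh the TTL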

On update, the following algorithm is applied:

  • Read the currently latest render from the latest table
  • Write it to the TTL table
  • Write a new latest render to the latest table
  • Write the new latest render to the TTL table (to avoid a race condition: if two concurrent updates read the same previously-latest render and then both write new renders, one of the new renders would otherwise never be stashed in the TTL table)
  • Apply range deletes for previous renders of the revision just written, and for previous revisions (if the latest_hash retention policy is used)

No explicit deletes are made in the TTL table; unbounded growth is prevented by the table-level TTL.
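
The range deletes in the last step of the update algorithm might look like the following (assuming the latest table layout sketched above, and the latest_hash policy that retains only the newest revision):

-- drop all renders of older revisions for this title
DELETE FROM latest WHERE "_domain" = ? AND title = ? AND revision < ?;
-- drop older renders of the revision just written
DELETE FROM latest WHERE "_domain" = ? AND title = ? AND revision = ? AND tid < ?;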

Open questions:

  • How would the TTL table behave under this scheme? It receives as many writes as the latest table, and the writes might arrive out of order (in the case of edits to arbitrarily old revisions)

Option 3: Table-per-query

This approach materializes views of results using distinct tables, each corresponding to a query.

Queries / Tables

  • The most current render of the most current revision (table: current)
  • The most current render of a specific revision (table: by_rev)
  • A specific render of a specific revision (table: by_tid)
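
A hypothetical sketch of the three tables (column set borrowed from the existing schema; the keys and the 24-hour TTLs are assumptions):

-- the most current render of the most current revision: one row per title
CREATE TABLE current (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title))
);

-- the most current render of a specific revision (ephemeral)
CREATE TABLE by_rev (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title), revision)
) WITH default_time_to_live = 86400;

-- a specific render of a specific revision (ephemeral)
CREATE TABLE by_tid (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title), revision, tid)
) WITH default_time_to_live = 86400;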

Algorithm

Data in the current table must be durable, but the contents of by_rev and by_tid can be ephemeral (and should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in current has been superseded by something more recent. There are two ways of accomplishing this: either a) copying the values on a read from current, or b) copying them on update, prior to replacing a value in current. Neither of these strategies is ideal.

For example, with non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy must be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity, and latency. Copy-on-update over-commits by default, copying from current for every new render, regardless of the probability it will be edited, but happens asynchronously without impacting user requests, and can be done reliably. This proposal uses the copy-on-update approach.

See https://www.mediawiki.org/wiki/RESTBase/StorageDesign#Table-per-query and this document for details.



Event Timeline


I don't have strong feelings about truncating the parsoid tables, except for a concern that it sets an expectation that the OOMs will cease; I don't think there is any reason to believe that either of T153588 or T156155 could have been avoided by truncating the tables. These were the result of accumulated tombstones within a single revision, something we'd continue to be vulnerable to even if we're only retaining the current revisions (possibly more so given the other proposed changes: latest_hash, TTL renewal, etc.).

In case there is any doubt about this, this can be used to reproduce the issue. The resulting heap dump looks identical to what we found in T153588:

[Screenshot: heap dump, 2017-01-30 17-26-03.png (250 KB)]

The conditions for this are a function of the size of the value, the number of renders for a revision, and the efficacy of tombstone GC; it is possible for this to occur on an otherwise empty database with only a single stored revision.

@Eevans, I don't think there was any doubt about this being possible given enough re-renders in a short amount of time. However, this does not tell us much about whether this would matter to us in practice when storing current revisions only, and assuming upper bounds on re-render rates as discussed in T156199#2973062.

In any case, we just discussed this in our 1:1, and you made it clear that you would prefer not to rely on tombstone GC happening within a bounded time. @Pchelolo started a design discussion for schemas that avoid this at the cost of some complexity, so let's look at those options, and then consider our next steps in Thursday's team meeting.

Eevans updated the task description.

Changed the table names to be a bit more generic, in line with the cassandra table implementation.

One big area we have not defined yet is how the mapping between high-level table schemas & cassandra schemas would work. For the concrete case we are considering here, the only realistic option I see right now is to keep defining revision as a range key in the table spec, and then modify the Cassandra schema for the data table to omit this key when the latest_hash retention policy is set. The ttl retention policy is fairly easy to handle as well (only using the ttl table). I'm less sure about the other retention policies, or the case where no retention policy is defined at all.
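
For the concrete case, omitting the revision range key could mean something like this for the data table (purely illustrative; whether tid remains a clustering key is itself an open question):

-- with the latest_hash retention policy, revision is demoted to a regular column,
-- so only renders of the single latest revision are kept per title
CREATE TABLE data (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title), tid)
);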

Another question is what we will do about listing support. A lot of the schema changes we are discussing here are making listings impossible. Should we generally remove support for listings from the table spec, reflecting the lowest common denominator?

Edit: Added a section with some thoughts on this in the next steps doc.

Since the If-Modified-Since timestamp has a one second resolution, we can numerically add the revision number to the write time to guard against sub-second race conditions.

We'll probably need to come up with something a little more clever; Revision ID is an unsigned int and simply adding the revision to a microseconds representation of the timestamp will skew the writetime (increasing) into the future (which I don't think we want).

Since the If-Modified-Since timestamp has a one second resolution, we can numerically add the revision number to the write time to guard against sub-second race conditions.

We'll probably need to come up with something a little more clever; Revision ID is an unsigned int and simply adding the revision to a microseconds representation of the timestamp will skew the writetime (increasing) into the future (which I don't think we want).

We could make sure that whatever we derive from the revision is less than one second's worth of microseconds.
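
For example (a sketch of the arithmetic, done application-side before issuing the write; the latest table layout is assumed):

-- writetime_us = (unix_seconds * 1000000) + (revision % 1000000)
-- e.g. unix_seconds = 1485800000 and revision = 123456789
--      -> writetime_us = 1485800000456789, i.e. the skew stays below one second
INSERT INTO latest ("_domain", title, revision, tid, html, data_parsoid, section_offsets)
VALUES (?, ?, ?, ?, ?, ?, ?)
USING TIMESTAMP 1485800000456789;

The idea, as I read it, is that writes only need to be disambiguated within the same one-second If-Modified-Since window, so a sub-second offset derived from the revision is enough.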

CREATE TABLE ttl (
  "_domain" text,
  time_window bigint,
  title text,
  revision int,
  tid timeuuid,
  html text,
  data_parsoid text,
  section_offsets text,
  PRIMARY KEY (("_domain", time_window, title, revision), tid)
);

What was the reason for using tid as a cluster key here (I mean as opposed to another component of the partition key)? In other words, is there any reason not to use the following instead?

CREATE TABLE ttl (
  "_domain" text,
  time_window bigint,
  title text,
  revision int,
  tid timeuuid,
  html text,
  data_parsoid text,
  section_offsets text,
  PRIMARY KEY (("_domain", time_window, title, revision, tid))
);

What was the reason for using tid as a cluster key here (I mean as opposed to another component of the partition key)? In other words, is there any reason not to use the following instead?

We wouldn't always have a tid in a request. For example, requests for HTML with a revision but without a TID (e.g. when one is starting to edit a specific revision of the content).
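
Concretely, with tid as a clustering column the newest render of a revision can be fetched without knowing its TID (which would be impossible if tid were folded into the partition key), e.g.:

SELECT html, data_parsoid, section_offsets FROM ttl
WHERE "_domain" = ? AND time_window = ? AND title = ? AND revision = ?
ORDER BY tid DESC LIMIT 1;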

Eevans updated the task description.
Eevans updated the task description.
Eevans updated the task description.

Without going deeper into the requirements that this ticket assumes to be true (I'm not sure all of them are justified, but that's another topic), I would say that the "application-level TTLs" option seems the best way to go for a few reasons:

  1. It doesn't multiply the write load
  2. It doesn't significantly increase the complexity of the schema
  3. It should mostly keep the amount of storage we use in check
  4. It seems simple enough not to trip over race conditions.

I would also add that the requirement stated on the mediawiki.org page

"Storage of arbitrarily old revisions (on request), for a TTL period (at least), from the time of the request" is, to my understanding, only valid in the context of Visual Editor. If that's the case, I would strongly suggest we save arbitrary old revisions only if the request has some 'secret' hashing in the query that signals it's coming from VE.

Else, it would only take a pretty persistent bot to render enough old revisions to both cripple storage performance and fill it up.

Option: Separate TTL table with rage deletes on latest content

Rage deletes indeed. :)

Thanks for adding Option 2 (TTL table, range deletes on latest, no timeline)! One advantage of this option that @Pchelolo & I were pretty excited about is that this should generalize well to arbitrary schemas, which will make it a lot easier to implement the table storage spec.

Alternative (slightly tweaked) update sequence for option 2:

  • read previously-latest render

BATCH:

  • ttl table: write previously latest render to ttl table
  • latest table: insert using TID derived timestamp
  • latest table: range delete from latest table
    • smaller revision numbers
    • same revision, lower tid
    • (generally, any key range that is lower, at each range index level)
  • ttl table: insert using TID
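
A rough CQL sketch of that batch (using the ttl schema proposed earlier in this thread, an assumed latest layout keyed (("_domain", title), revision, tid), and an application-side TID-derived timestamp; the initial read of the previously-latest render happens outside the batch):

BEGIN BATCH
  -- stash the previously-latest render in the ttl table
  INSERT INTO ttl ("_domain", time_window, title, revision, tid, html, data_parsoid, section_offsets)
  VALUES (?, ?, ?, ?, ?, ?, ?, ?);
  -- write the new latest render, using a timestamp derived from its TID
  INSERT INTO latest ("_domain", title, revision, tid, html, data_parsoid, section_offsets)
  VALUES (?, ?, ?, ?, ?, ?, ?) USING TIMESTAMP ?;
  -- range-delete anything lower than the new render in the latest table
  DELETE FROM latest WHERE "_domain" = ? AND title = ? AND revision < ?;
  DELETE FROM latest WHERE "_domain" = ? AND title = ? AND revision = ? AND tid < ?;
  -- stash the new render in the ttl table as well
  INSERT INTO ttl ("_domain", time_window, title, revision, tid, html, data_parsoid, section_offsets)
  VALUES (?, ?, ?, ?, ?, ?, ?, ?);
APPLY BATCH;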

Considering all three options it seems option 1 > option 2 > option 3.
In particular, for option 2 I'm not sure the pros outweigh the cons, as it seems more complex and higher-latency.

Further, I couldn't find any estimation of latency requirements for current vs historical storage and estimated size of "current" storage. Also how revisions requested by VE (assuming it is the most latency-sensitive rb client) are distributed, IOW how they would fall into current vs historical buckets

Further, I couldn't find any estimation of latency requirements for current vs historical storage and estimated size of "current" storage.

Good questions.

For latency, I think the short (unsatisfying?) answer is that for current storage it should be at least as good as what we have now (https://grafana.wikimedia.org/dashboard/db/restbase?fpanelId=11&fullscreen&orgId=1), while archive storage can be somewhat more latent. But you're right, we should put real numbers to this.

For storage size of current data, I'm not sure how to do much better than a SWAG. Average content size x number-of-titles x overhead(s) x 1.5 (per storage group). Or maybe we can infer it from the dev env. I'll try to work on this.

Also how revisions requested by VE (assuming it is the most latency-sensitive rb client) are distributed, IOW how they would fall into current vs historical buckets

Right; I'm curious about this as well. It feels like something that is pretty exceptional. If so, maybe it should be handled that way.

We got a good idea of size & expected latency right after we set up the RB cluster, and filled it with current revisions. Size for current revisions (all domains) was ~400G compressed total, access latency (from memory) < 20ms, with p50 <5ms.

Considering all three options it seems option 1 > option 2 > option 3.
In particular, for option 2 I'm not sure the pros outweigh the cons, as it seems more complex and higher-latency.

Further, I couldn't find any estimation of latency requirements for current vs historical storage and estimated size of "current" storage. Also how revisions requested by VE (assuming it is the most latency-sensitive rb client) are distributed, IOW how they would fall into current vs historical buckets

Added some latency requirements to the document

Option #1 has been implemented, and is fully live in production; closing this issue.