
Low-latency current revision storage
Closed, Resolved · Public

Description

As a sub-task of T120171, this task discusses steps towards storing current revisions only, in a reliable, low-maintenance, and low-latency manner.

Option 1: Retention policies using application-level TTLs

This approach uses a schema identical to that of the current storage model, one that utilizes so-called wide rows to model a one-to-many relationship between a title and its revisions, and a one-to-many relationship between each revision and its corresponding renders. It differs only in how it approaches retention.

Since renders are keyed on a type-1 UUID, retaining a single current render, and (at least) 24 hours worth of past renders, is as simple as batching a range delete with new renders, using a tid predicate 24 hours less than the one being inserted.
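
As a rough sketch of such a batch (assuming a wide-row data table keyed (("_domain", title), revision, tid), with the cutoff TID computed by the application as the new render's TID minus 24 hours):

BEGIN BATCH
  -- store the new render
  INSERT INTO data ("_domain", title, revision, tid, html, data_parsoid, section_offsets)
  VALUES (?, ?, ?, ?, ?, ?, ?);
  -- drop renders of this revision older than the 24-hour cutoff
  DELETE FROM data WHERE "_domain" = ? AND title = ? AND revision = ? AND tid < ?;
APPLY BATCH;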

Limiting revisions is slightly more challenging, since the revision is an integer and no temporal context exists. As a result, additional storage is used to establish this relationship, mapping timestamps to corresponding revisions. Records in this timeline are keyed by domain (on the assumption that MediaWiki sharding would never be more granular than this). Updates to the timeline can be performed probabilistically, if necessary. TTLs can be applied to prevent unbounded growth.
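
A hypothetical sketch of such a timeline table (the table name, clustering order, and 10-day TTL are illustrative only):

CREATE TABLE revision_timeline (
  "_domain" text,
  ts timestamp,
  revision int,
  PRIMARY KEY ("_domain", ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND default_time_to_live = 864000;

-- find the revision that was current as of the 24-hour cutoff
SELECT revision FROM revision_timeline
WHERE "_domain" = ? AND ts <= ? LIMIT 1;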

See https://www.mediawiki.org/wiki/RESTBase/StorageDesign#Retention_policies_using_application-level_TTLs for a more thorough explanation.

Option 2: Separate TTL table with range deletes on latest content

This option is very similar to the previous one, but it avoids the need for the timeline index. It uses two tables with schemas identical to the one used now: a latest table with no TTL on the data, and a ttl table with a table-level TTL of 24 hours.
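
A rough sketch of the two tables (column set copied from the existing schema; the exact key layout is an assumption):

CREATE TABLE latest (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title), revision, tid)
);

-- identical schema, but rows expire automatically after 24 hours
CREATE TABLE ttl (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title), revision, tid)
) WITH default_time_to_live = 86400;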

On read, the data is first read from the latest table, falling back to the ttl table, and finally to generating the content on demand. If the data was generated on demand, it is written to both the latest and the ttl table. If the data was found only in the ttl table, TTL refreshing may be applied if it is about to expire (TTL refreshing simply writes the data to the ttl table again, potentially with a new TID).
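
As a sketch, the read path maps to queries like these (table layouts as assumed above; the query shape for a fully-specified request is shown, and the generate-on-demand step is application logic, described in comments):

-- 1. try the durable latest table first
SELECT html, data_parsoid, section_offsets FROM latest
WHERE "_domain" = ? AND title = ? AND revision = ? AND tid = ?;

-- 2. on a miss, fall back to the TTL table
SELECT html, data_parsoid, section_offsets, TTL(html) AS remaining FROM ttl
WHERE "_domain" = ? AND title = ? AND revision = ? AND tid = ?;

-- 3. on a miss here as well, re-generate the content and write it to both tables;
--    if the row was found only in the ttl table and its remaining TTL is small,
--    re-insert it into the ttl table (possibly with a new TID) to refresh the TTL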

On update, the following algorithm is applied:

  • Read the currently latest render from the latest table
  • Write it to the TTL table
  • Write a new latest render to the latest table
  • Write the new latest render to the TTL table (to avoid a race condition: if two concurrent updates read the same previously-latest render and then both write new renders, one of the new renders would otherwise never be stashed in the TTL table)
  • Apply range deletes for previous renders of the revision just written, and for previous revisions (if the latest_hash retention policy is used)

No explicit deletes are made in the TTL table; unbounded growth is prevented by the table-level TTL.
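
The range deletes in the last step of the update algorithm might look like the following (assuming the latest table layout sketched above, and the latest_hash policy that retains only the newest revision):

-- drop all renders of older revisions for this title
DELETE FROM latest WHERE "_domain" = ? AND title = ? AND revision < ?;
-- drop older renders of the revision just written
DELETE FROM latest WHERE "_domain" = ? AND title = ? AND revision = ? AND tid < ?;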

Open questions:

  • How would the TTL table behave under this scheme? It receives as many writes as the latest table, and the writes might arrive out of order (in the case of edits to arbitrarily old revisions)

Option 3: Table-per-query

This approach materializes views of results using distinct tables, each corresponding to a query.

Queries / Tables

  • The most current render of the most current revision (table: current)
  • The most current render of a specific revision (table: by_rev)
  • A specific render of a specific revision (table: by_tid)
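
A hypothetical sketch of the three tables (column set borrowed from the existing schema; the keys and the 24-hour TTLs are assumptions):

-- the most current render of the most current revision: one row per title
CREATE TABLE current (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title))
);

-- the most current render of a specific revision (ephemeral)
CREATE TABLE by_rev (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title), revision)
) WITH default_time_to_live = 86400;

-- a specific render of a specific revision (ephemeral)
CREATE TABLE by_tid (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title), revision, tid)
) WITH default_time_to_live = 86400;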

Algorithm

Data in the current table must be durable, but the contents of by_rev and by_tid can be ephemeral (and should be, to prevent unbounded growth), lasting only for a time-to-live after the corresponding value in current has been superseded by something more recent. There are two ways of accomplishing this: either a) copying the values on a read from current, or b) copying them on update, prior to replacing a value in current. Neither of these strategies is ideal.

For example, with non-VE use-cases, copy-on-read is problematic due to the write-amplification it creates (think: HTML dumps). Additionally, in order to fulfill the VE contract, the copy must be done in-line to ensure the values are there for the forthcoming save, introducing additional transaction complexity, and latency. Copy-on-update over-commits by default, copying from current for every new render, regardless of the probability it will be edited, but happens asynchronously without impacting user requests, and can be done reliably. This proposal uses the copy-on-update approach.

See https://www.mediawiki.org/wiki/RESTBase/StorageDesign#Table-per-query and this document for details.



Event Timeline


I don't have strong feelings about truncating the parsoid tables, except for a concern that it sets an expectation that the OOMs will cease; I don't think there is any reason to believe that either of T153588 or T156155 could have been avoided by truncating the tables. These were the result of accumulated tombstones within a single revision, something we'd continue to be vulnerable to even if we're only retaining the current revisions (possibly more so given the other proposed changes: latest_hash, TTL renewal, etc.).

In case there is any doubt about this, this can be used to reproduce the issue. The resulting heap dump looks identical to what we found in T153588:

[Screenshot: heap dump, 2017-01-30 17-26-03.png (250 KB)]

The conditions for this are a function of the size of the value, the number of renders for a revision, and the efficacy of tombstone GC; it is possible for this to occur on an otherwise empty database with only a single stored revision.

@Eevans, I don't think there was any doubt about this being possible given enough re-renders in a short amount of time. However, this does not tell us much about whether this would matter to us in practice when storing current revisions only, and assuming upper bounds on re-render rates as discussed in T156199#2973062.

In any case, we just discussed this in our 1:1, and you made it clear that you would prefer not to rely on tombstone GC happening within a bounded time. @Pchelolo started a design discussion for schemas that avoid this at the cost of some complexity, so let's look at those options, and then consider our next steps in Thursday's team meeting.

Eevans updated the task description.

Changed the table names to be a bit more generic, in line with the cassandra table implementation.

One big area we have not defined yet is how the mapping between high-level table schemas & cassandra schemas would work. For the concrete case we are considering here, the only realistic option I see right now is to keep defining revision as a range key in the table spec, and then modify the Cassandra schema for the data table to omit this key when the latest_hash retention policy is set. The ttl retention policy is fairly easy to handle as well (only using the ttl table). I'm less sure about the other retention policies, or the case where no retention policy is defined at all.
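
For the concrete case, omitting the revision range key could mean something like this for the data table (purely illustrative; whether tid remains a clustering key is itself an open question):

-- with the latest_hash retention policy, revision is demoted to a regular column,
-- so only renders of the single latest revision are kept per title
CREATE TABLE data (
  "_domain" text, title text, revision int, tid timeuuid,
  html text, data_parsoid text, section_offsets text,
  PRIMARY KEY (("_domain", title), tid)
);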

Another question is what we will do about listing support. A lot of the schema changes we are discussing here are making listings impossible. Should we generally remove support for listings from the table spec, reflecting the lowest common denominator?

Edit: Added a section with some thoughts on this in the next steps doc.

Since the If-Modified-Since timestamp has a one second resolution, we can numerically add the revision number to the write time to guard against sub-second race conditions.

We'll probably need to come up with something a little more clever; Revision ID is an unsigned int and simply adding the revision to a microseconds representation of the timestamp will skew the writetime (increasing) into the future (which I don't think we want).

Since the If-Modified-Since timestamp has a one second resolution, we can numerically add the revision number to the write time to guard against sub-second race conditions.

We'll probably need to come up with something a little more clever; Revision ID is an unsigned int and simply adding the revision to a microseconds representation of the timestamp will skew the writetime (increasing) into the future (which I don't think we want).

We could make sure that whatever we derive from the revision is less than one second's worth of microseconds.
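
For example (a sketch of the arithmetic, done application-side before issuing the write; the latest table layout is assumed):

-- writetime_us = (unix_seconds * 1000000) + (revision % 1000000)
-- e.g. unix_seconds = 1485800000 and revision = 123456789
--      -> writetime_us = 1485800000456789, i.e. the skew stays below one second
INSERT INTO latest ("_domain", title, revision, tid, html, data_parsoid, section_offsets)
VALUES (?, ?, ?, ?, ?, ?, ?)
USING TIMESTAMP 1485800000456789;

The idea, as I read it, is that writes only need to be disambiguated within the same one-second If-Modified-Since window, so a sub-second offset derived from the revision is enough.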

CREATE TABLE ttl (
  "_domain" text,
  time_window bigint,
  title text,
  revision int,
  tid timeuuid,
  html text,
  data_parsoid text,
  section_offsets text,
  PRIMARY KEY (("_domain", time_window, title, revision), tid)
);

What was the reason for using tid as a cluster key here (I mean as opposed to another component of the partition key)? In other words, is there any reason not to use the following instead?

CREATE TABLE ttl (
  "_domain" text,
  time_window bigint,
  title text,
  revision int,
  tid timeuuid,
  html text,
  data_parsoid text,
  section_offsets text,
  PRIMARY KEY (("_domain", time_window, title, revision, tid))
);

What was the reason for using tid as a cluster key here (I mean as opposed to another component of the partition key)? In other words, is there any reason not to use the following instead?

We wouldn't always have a tid in a request. For example, requests for HTML with a revision but without a TID (e.g. when one is starting to edit a specific revision of the content).
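
Concretely, with tid as a clustering column the newest render of a revision can be fetched without knowing its TID (which would be impossible if tid were folded into the partition key), e.g.:

SELECT html, data_parsoid, section_offsets FROM ttl
WHERE "_domain" = ? AND time_window = ? AND title = ? AND revision = ?
ORDER BY tid DESC LIMIT 1;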

Eevans updated the task description.
Eevans updated the task description.
Eevans updated the task description.

Without going deeper into the requirements that this ticket assumes to be true (I'm not sure all of them are justified, but that's another topic), I would say that the "application-level TTLs" option seems the best way to go for a few reasons:

  1. It doesn't multiply the write load
  2. It doesn't significantly increase the complexity of the schema
  3. It should mostly keep the amount of storage we use in check
  4. It seems simple enough not to trip over race conditions.

I would also add that the requirement stated on the mediawiki.org page

"Storage of arbitrarily old revisions (on request), for a TTL period (at least), from the time of the request" is, to my understanding, only valid in the context of Visual Editor. If that's the case, I would strongly suggest we save arbitrary old revisions only if the request has some 'secret' hashing in the query that signals it's coming from VE.

Else, it would only take a pretty persistent bot to render enough old revisions to both cripple storage performance and fill it up.

Option: Separate TTL table with rage deletes on latest content

Rage deletes indeed. :)

Thanks for adding Option 2 (TTL table, range deletes on latest, no timeline)! One advantage of this option that @Pchelolo & I were pretty excited about is that this should generalize well to arbitrary schemas, which will make it a lot easier to implement the table storage spec.

Alternative (slightly tweaked) update sequence for option 2:

  • read previously-latest render

BATCH:

  • ttl table: write previously latest render to ttl table
  • latest table: insert using TID derived timestamp
  • latest table: range delete from latest table
    • smaller revision numbers
    • same revision, lower tid
    • (generally, any key range that is lower, at each range index level)
  • ttl table: insert using TID
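
A rough CQL sketch of that batch (using the ttl schema proposed earlier in this thread, an assumed latest layout keyed (("_domain", title), revision, tid), and an application-side TID-derived timestamp; the initial read of the previously-latest render happens outside the batch):

BEGIN BATCH
  -- stash the previously-latest render in the ttl table
  INSERT INTO ttl ("_domain", time_window, title, revision, tid, html, data_parsoid, section_offsets)
  VALUES (?, ?, ?, ?, ?, ?, ?, ?);
  -- write the new latest render, using a timestamp derived from its TID
  INSERT INTO latest ("_domain", title, revision, tid, html, data_parsoid, section_offsets)
  VALUES (?, ?, ?, ?, ?, ?, ?) USING TIMESTAMP ?;
  -- range-delete anything lower than the new render in the latest table
  DELETE FROM latest WHERE "_domain" = ? AND title = ? AND revision < ?;
  DELETE FROM latest WHERE "_domain" = ? AND title = ? AND revision = ? AND tid < ?;
  -- stash the new render in the ttl table as well
  INSERT INTO ttl ("_domain", time_window, title, revision, tid, html, data_parsoid, section_offsets)
  VALUES (?, ?, ?, ?, ?, ?, ?, ?);
APPLY BATCH;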

Considering all three options it seems option 1 > option 2 > option 3.
In particular, for option 2 I'm not sure the pros outweigh the cons, as it seems more complex and higher-latency.

Further, I couldn't find any estimation of latency requirements for current vs historical storage and estimated size of "current" storage. Also how revisions requested by VE (assuming it is the most latency-sensitive rb client) are distributed, IOW how they would fall into current vs historical buckets

Further, I couldn't find any estimation of latency requirements for current vs historical storage and estimated size of "current" storage.

Good questions.

For latency, I think the short (unsatisfying?) answer is that for current storage it should be at least as good as what we have now (https://grafana.wikimedia.org/dashboard/db/restbase?fpanelId=11&fullscreen&orgId=1), while archive storage can be somewhat more latent. But you're right, we should put real numbers to this.

For storage size of current data, I'm not sure how to do much better than a SWAG. Average content size x number-of-titles x overhead(s) x 1.5 (per storage group). Or maybe we can infer it from the dev env. I'll try to work on this.

Also how revisions requested by VE (assuming it is the most latency-sensitive rb client) are distributed, IOW how they would fall into current vs historical buckets

Right; I'm curious about this as well. It feels like something that is pretty exceptional. If so, maybe it should be handled that way.

We got a good idea of size & expected latency right after we set up the RB cluster, and filled it with current revisions. Size for current revisions (all domains) was ~400G compressed total, access latency (from memory) < 20ms, with p50 <5ms.

Considering all three options it seems option 1 > option 2 > option 3.
In particular, for option 2 I'm not sure the pros outweigh the cons, as it seems more complex and higher-latency.

Further, I couldn't find any estimation of latency requirements for current vs historical storage and estimated size of "current" storage. Also how revisions requested by VE (assuming it is the most latency-sensitive rb client) are distributed, IOW how they would fall into current vs historical buckets

Added some latency requirements to the document

Option #1 has been implemented, and is fully live in production; closing this issue.