
Need to change data model for >~300 wikis in single cassandra cluster
Closed, ResolvedPublic

Description

I have bad news from the hundreds-of-wikis front:

  • Creating about 20 column families for each of 800 domains finishes in ~2 hours on my laptop, but eats up a lot of RAM and even makes cqlsh fail to load the full schema info within the default 2s timeout.
  • On the three-node test cluster, creating the same number of CFs took about two days. Towards the end the VMs were showing signs of heap pressure (lots of >200ms ParNew pauses, ~12s CMS collections) and heavy compaction activity on the system.schema_columns CF.

So it's clear that modeling 800 wikis with separate CFs is not really feasible.

The simplest way forward seems to be to define a composite partition key (domain, key) for all tables (this should already be supported by our backend, using two or more 'hash' index members) & then share them across all (or sets of) domains. For our primary access patterns this should perform better than the current model does with even low hundreds of wikis.
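
As a rough illustration (the table name, columns and value are made up for this sketch, not our actual schema), the shared physical layout could look something like this in CQL:

-- illustrative only: one shared table holding rows for many domains
CREATE TABLE shared_bucket (
    domain text,   -- e.g. 'en.wikipedia.org'
    key text,      -- the logical row key, e.g. a page title
    rev int,
    value blob,
    PRIMARY KEY ((domain, key), rev)
);

The composite partition key (domain, key) hashes each (domain, key) pair independently, so rows from all domains spread evenly across the cluster.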

The downside is that it's no longer as easy to enumerate & delete data for a given domain. We could address this by creating a secondary index hashed on domain and possibly some bucket key (to spread the wide row around).
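
A sketch of what such an index CF might look like (the name and bucketing scheme here are hypothetical):

-- maps a domain (plus a small bucket value to spread the wide row)
-- back to the logical keys stored in the shared table
CREATE TABLE shared_bucket_by_domain (
    domain text,
    bucket int,    -- e.g. a small hash of key, so one domain spans several rows
    key text,
    PRIMARY KEY ((domain, bucket), key)
);

Enumerating or deleting a domain then means walking all of its buckets and resolving the referenced keys in the main table.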

It's a bit unfortunate that we are discovering this pretty late in the game. Changing the data model might not actually be *that* hard (probably doable in days), but we'd have to re-test etc. before deploying. The alternative would be to deploy to Wikipedias only for now (that's ~270 wikis), and then migrate the data over once we have changed the data model.

This RFC commit has a sketch of the schema changes this would require.


Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description. (Show Details)
GWicke subscribed.
GWicke added a project: RESTBase.
GWicke set Security to None.
GWicke edited subscribers, added: mobrovac, Eevans; removed: Aklapper.

AFAIK, Cassandra, like other big-data solutions, is tuned more for stuffing a lot of data into a (relatively) small number of CFs, so I am not too surprised that it doesn't handle 800 of them well. The proposal in the task description is an obvious, immediate solution to this problem, but IMHO it's the opposite extreme of what we have in place now. How about a more middle-ground solution based on per-wiki traffic? We should have metrics telling us about views/edits for each wiki, right? If so, we could group the wikis into CFs so that each table has more or less the same load. So, instead of having 1 or 800 CFs, we'd have, say, 20:

  • en.wp in CF#1
  • ru.wp + es.wp in CF#2
  • du.wp + fr.wp + pt.wp in CF#3
  • etc.

This still means we'd have to adjust our code, but it could give us more balanced access patterns to CFs.

@mobrovac, we can definitely have several buckets (that's what I meant by 'sets of wikis'). I think finding the right granularity for such groups involves a trade-off between config & management complexity and isolation.

From a performance perspective I'm pretty sure we could just toss everything into a single CF & it would work just fine, as the hashing spreads the data across nodes & the per-node data structures aren't that huge. It's mostly GC that's the limiting factor in the JVM, and the memory pressure depends more on the amount of hot data per node. But I agree that there could be advantages in a moderate amount of partitioning, perhaps with an eye towards a future separation of clusters, or a phased deployment of schema changes. Let's just not over-complicate things for now. We can always split later with a fall-back read scheme.

After thinking about this a bit more I'm actually leaning towards handling this transparently in the Cassandra table storage backend, by

  • maintaining a mapping from domain to 'table group', 'table backend' (or whatever we want to call it) in the config, and
  • prepending the actual domain as an additional hash key to each schema layout (similar to this example, but in the table storage backend).
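
Physically, for each storage group and logical table the backend could then create a single Cassandra table along these lines (illustrative CQL; the "_domain" column name, the table name and the other columns are assumptions, not the final schema):

CREATE TABLE wikipedia_group_some_table (
    "_domain" text,   -- injected transparently by the table storage backend
    key text,
    rev int,
    value blob,
    PRIMARY KEY (("_domain", key), rev)
);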

Advantages:

  • We prevent limitations of a specific backend from leaking across the table storage interface.
  • We stay true to the principle of expressing index patterns in logical tables while leaving backends to implement this physically any way they like.
  • Other backends like SQLite or MySQL with different scaling characteristics are probably better off with tables-per-domain.

In the restbase-mod-table-cassandra options, this could perhaps look like this:

storageGroups:
  - name: test.group
    domains: /^test\..*\.org/
    # storage settings for test group
  - name: wikipedia.group
    domains: /\.wikipedia\.org$/
    # storage settings for wikipedia group

One challenge with this scheme is that we lose the ability to retrieve a randomly ordered list of primary keys per domain. We currently use this ability as a cheap implementation for page and revision listings. Options:

  • Punt for now & disable those listings until implemented properly.
  • Define a native secondary index on a copy of the domain field, and query with an added _domain_copy = 'domain' restriction. This replicates the current randomly ordered key listing, which is probably not what we want in the longer term anyway. It's the cheapest to implement, though (see the sketch after this list).
  • Define a secondary index using a separate CF (and our secondary index logic) that maps domain to the remainder of the primary key per logical table. This gives us range queries on keys, which is very desirable. The challenge is that we'll need to bucket things to avoid everything being stored in a single wide row on one replica set.
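
For reference, the second option could amount to something like this in CQL (table and column names are again illustrative, matching the sketch above):

-- keep a plain copy of the domain outside the partition key & index it natively
ALTER TABLE wikipedia_group_some_table ADD "_domain_copy" text;
CREATE INDEX ON wikipedia_group_some_table ("_domain_copy");

-- randomly ordered key listing for one domain
SELECT key FROM wikipedia_group_some_table WHERE "_domain_copy" = 'en.wikipedia.org';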

We just discussed this, and decided to go ahead & attempt to build a minimal solution today & tomorrow.

This is now deployed and confirmed working beautifully. Closing.