I have bad news from the hundreds-of-wikis front:
- Creating about 20 column families for each of 800 domains (~16,000 CFs in total) finishes in ~2 hours on my laptop, but eats up a lot of RAM and even makes cqlsh fail to connect within its hardcoded 2s timeout.
- On the three-node test cluster, creating the same number of CFs took about two days. Towards the end the VMs were showing signs of heap pressure (lots of >200ms ParNew pauses and ~12s CMS collections) and heavy compaction activity on the system.schema_columns CF.
So it's clear that modeling 800 wikis with separate CFs is not really feasible.
The simplest way forward seems to be to define a composite partition key `(domain, key)` for all tables (this should be supported by our backend already, using two or more 'hash' index members) & then share those tables across all (or sets of) domains. For our primary access patterns this should actually perform better than the current model does with even low hundreds of wikis.
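Roughly, something like this (illustrative names only, not the actual schema from the PR): instead of one table per domain, a single shared table whose partition key combines domain and key, so the main read path still hits a single partition.

```cql
-- Hypothetical sketch of a shared table; column names are invented for
-- illustration. The partition key is the composite (domain, key).
CREATE TABLE data (
    domain text,       -- e.g. 'en.wikipedia.org'
    key text,          -- e.g. the page title
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY ((domain, key), rev, tid)
);

-- The primary access pattern still reads a single partition:
SELECT value FROM data WHERE domain = 'en.wikipedia.org' AND key = 'Foo';
```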
The downside is that it's no longer as easy to enumerate & delete the data for a given domain. We could address this by creating a secondary index hashed on domain, possibly combined with some bucket key (to spread what would otherwise be a single wide row per domain across the cluster); see the sketch below.
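Again just as an illustration (the bucket count and names are made up), the per-domain enumeration could look like a separate lookup table bucketed on a hash of the key:

```cql
-- Hypothetical per-domain lookup table; bucketing keeps any one row from
-- growing into a single huge wide row for a large domain.
CREATE TABLE data_by_domain (
    domain text,
    bucket int,        -- e.g. hash(key) % 16, spreads entries over 16 rows
    key text,
    PRIMARY KEY ((domain, bucket), key)
);

-- Enumerating (and then deleting) a domain means scanning each bucket:
SELECT key FROM data_by_domain WHERE domain = 'en.wikipedia.org' AND bucket = 0;
```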
It's a bit unfortunate that we're discovering this so late in the game. Changing the data model might not actually be *that* hard (probably doable in days), but we'd have to re-test etc. before deploying. The alternative would be to deploy to wikipedias only for now (that's ~270 wikis), and then migrate the data over once we have changed the data model.
https://github.com/wikimedia/restbase/pull/202 has a sketch of the schema changes.