We discussed this so far in https://rt.wikimedia.org/Ticket/Display.html?id=8824, but I'm now moving this over to Phabricator so that we get some broader access and the ability to edit the summary.
I think we have enough information in T76370 to start thinking about specs. First iteration:
- start with 5 nodes in eqiad; could use misc hardware in codfw for cross-DC replication testing at first
- powerful CPU (performance is largely CPU-bound)
- 48-64G RAM
- 2TB SSD space per node with at least 1000 rated erase cycles per cell (ex: [Samsung 850 PRO Series 1TB](http://www.newegg.com/Product/Product.aspx?Item=N82E16820147362&nm_mc=KNC-GoogleAdwords-PC&cm_mmc=KNC-GoogleAdwords-PC-_-pla-_-Internal+SSDs-_-N82E16820147362&gclid=CLSU3_7ntcICFQQSMwodXxYAag) @$630; perhaps we could even get away with the cheaper Samsung 840 EVO 1TB at around $400)
- 10Gbit would be nice (can saturate 1Gbit even on the old test hosts with requests for large pages), but realistically with sufficient nodes & the expected traffic pattern we should also be able to get by with 1Gbit; I imagine it still makes a significant price difference.
## Thoughts about storage space and SSDs
HTML is relatively bulky compared to wikitext; based on the information so far I would expect current enwiki revisions alone to take up at least ~60G. This means that just current revisions across all projects will already use up a good chunk of 1TB, and additional HTML variants for mobile etc. will use up further space. These numbers are with the default LZ4 compression, and we can improve things a bit by enabling deflate. Really big gains, however, require an algorithm with a sliding window larger than deflate's 32k, such as LZMA, so that the repetition between bulky HTML revisions can be picked up. At level 1, LZMA compression takes about 4-5 times as much CPU as deflate; decompression might even be faster than deflate.
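To illustrate the window-size point, here is a rough sketch with synthetic data (not actual Parsoid HTML; sizes and vocabulary are made up): deflate's 32 KB window cannot reach back to a previous revision of a large page, while LZMA's much larger dictionary encodes the near-duplicate second revision as essentially one long match.

```python
import lzma
import random
import string
import zlib

# Synthetic stand-in for two revisions of a large HTML page.
random.seed(42)
vocab = [''.join(random.choices(string.ascii_lowercase, k=8)) for _ in range(200)]
# One ~80 KB revision -- larger than deflate's 32 KB sliding window.
rev1 = '<html><body>' + ' '.join(random.choices(vocab, k=9000)) + '</body></html>'
# A second revision with a single small edit, as after a typical save.
rev2 = rev1.replace(vocab[0], 'edited', 1)
blob = (rev1 + rev2).encode()

deflate_size = len(zlib.compress(blob, 6))
lzma_size = len(lzma.compress(blob, preset=1))

# Deflate cannot reach back ~80 KB to the previous revision, so it pays
# for the near-duplicate twice; LZMA deduplicates across revisions.
print(deflate_size, lzma_size)
```

On data like this, LZMA at preset 1 comes out well ahead of deflate because the second revision compresses to almost nothing.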
It's too early to make very precise predictions, but I think about 3TB of unreplicated storage will be the minimum to start with. We currently use a replication factor of three (so that we can do quorum reads and get some amount of read scaling), but could consider dropping this to two & single-node operations for the initial caching use case if necessary to save space. I would rather not go in that direction though, mainly to avoid hot spots in the read path.
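For reference, the quorum arithmetic behind that trade-off (this is just Cassandra's standard majority formula, floor(RF/2)+1):

```python
def quorum(rf: int) -> int:
    # Cassandra's QUORUM consistency level: a majority of replicas.
    return rf // 2 + 1

# RF=3: quorum is 2 of 3, so one replica can be down or slow.
print(quorum(3))
# RF=2: quorum is 2 of 2 -- every quorum read hits both replicas,
# so RF=2 in practice means consistency level ONE (single-node reads),
# which is where the hot-spot concern comes from.
print(quorum(2))
```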
Much of this data will be relatively cold, so I think we'll be fine with a fairly high storage density per node (2TB). With five nodes and RF=3, this works out to a bit over 3TB of unreplicated storage. We can add more nodes as needed to grow capacity later.
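The capacity figure above is simply:

```python
nodes = 5
per_node_tb = 2.0      # usable SSD space per node
rf = 3                 # replication factor

unreplicated_tb = nodes * per_node_tb / rf
print(round(unreplicated_tb, 2))  # 3.33 -- "a bit over 3TB"
```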
Cassandra performs only sequential writes, which keeps the number of flash sector erase cycles low (no write amplification). Our write volumes and thus SSTable merge traffic should be fairly moderate. We could probably get away with cheap consumer-grade SSDs with low erase cycle specs for this application, especially if we are using a replication factor of three & are not close to the space limit all the time. All data is checksummed in Cassandra, so any corruption would be detected.
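A back-of-envelope endurance estimate supports this; note the daily write volume below (including compaction rewrite traffic) is an assumed figure for illustration, not a measurement:

```python
# Rough SSD endurance estimate for a 2TB node.
capacity_tb = 2.0        # SSD space per node
rated_cycles = 1000      # rated erase cycles per cell (consumer-grade spec)
daily_writes_tb = 0.5    # ASSUMED writes/day, incl. SSTable compaction

total_endurance_tb = capacity_tb * rated_cycles       # ~2000 TB written
lifetime_years = total_endurance_tb / daily_writes_tb / 365
print(round(lifetime_years, 1))
```

Even at a sustained 0.5 TB/day of writes per node, the drives would last on the order of a decade, so erase-cycle specs are unlikely to be the limiting factor here.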
- Benchmark results in T76370
- [Cassandra hardware planning docs](http://www.datastax.com/documentation/cassandra/2.1/cassandra/planning/architecturePlanningHardware_c.html); since we'll be storing a long tail of old revisions that are rarely accessed we can use more storage capacity per node.