We discussed this so far in https://rt.wikimedia.org/Ticket/Display.html?id=8824, but I'm now moving this over to Phabricator so that we get some broader access and the ability to edit the summary. The procurement itself is tracked in https://rt.wikimedia.org/Ticket/Display.html?id=9007.
I think we have enough information in T76370 to start thinking about specs. First iteration:
- start with 5 nodes in eqiad; could use misc hardware in codfw for cross-DC replication testing at first
- powerful CPU (performance is largely CPU-bound)
- 48-64G RAM
- 2TB SSD space per node with at least 1000 rated erase cycles per cell (ex: [Samsung 850 PRO Series 1Tb](http://www.newegg.com/Product/Product.aspx?Item=N82E16820147362&nm_mc=KNC-GoogleAdwords-PC&cm_mmc=KNC-GoogleAdwords-PC-_-pla-_-Internal+SSDs-_-N82E16820147362&gclid=CLSU3_7ntcICFQQSMwodXxYAag) @$630, perhaps we could even get away with the cheaper Samsung 840 EVO 1TB at around $400). (Side note: [This German web site](http://www.heise.de/preisvergleich/?cat=hdssd&xf=252_1000&sort=r) is handy to list SSDs by criteria like price / GB.)
- 10Gbit would be nice (can saturate 1Gbit even on the old test hosts with requests for large pages), but realistically with sufficient nodes & the expected traffic pattern we should also be able to get by with 1Gbit; I imagine it still makes a significant price difference.
## Thoughts about storage space and SSDs
HTML is relatively bulky compared to wikitext; based on the info so far I would expect only current enwiki revisions to take up at least ~60G. This means that just current revisions across all projects will already use up a good chunk of 1G. Additional HTML variants for mobile etc will use up additional space. These numbers are with the default lz4 compression, and we can improve things a bit by enabling deflate. Really big gains from compression require an algorithm with a larger than 32k sliding window such as LZMA to pick up the repetitions between bulky HTML revisions. [Benchmarks](http://stephane.lesimple.fr/blog/2010-07-20/lzop-vs-compress-vs-gzip-vs-bzip2-vs-lzma-vs-lzma2xz-benchmark-reloaded.html) suggest that at level 1 LZMA compression takes about 4-5 times more CPU than deflate at level 3 (or about as much as deflate at level 9); decompression might be faster than deflate if the output is significantly smaller. Cassandra doesn't currently support lzma compression. It does provide an interface to plug in additional algorithms though, which is something we could consider doing in the longer term if nobody else gets there first. Worth talking to datastax about this.
It's too early to make very precise predictions, but I'd guess that about 3TB of unreplicated storage will be about the minimum for the start. We currently use a replication factor of three (so that we can use quorum reads, and get some amount of read scaling), but could consider dropping this to two & single-node operations for the initial caching use case if necessary to save space. I would rather not go in that direction though, mainly to avoid hot spots in the read path.
Much of this data will be relatively cold, so I think we'll be fine with a fairly high storage density per node (2TB). At five replicas and RF=3, this works out to a bit over 3G of unreplicated storage. We can add more nodes as needed to grow capacity later.
Cassandra performs only sequential writes, which keeps the number of flash sector erase cycles low (no write amplification from partial sector writes). Our write volumes and thus SSTable merge traffic should be fairly moderate. We could probably get away with cheap consumer-grade SSDs with low erase cycle specs for this application, especially if we are using a replication factor of three & are not close to the space limit all the time. All data is checksummed in Cassandra, so issues will be detected early.
- Benchmark results in T76370
- [Cassandra hardware planning docs](http://www.datastax.com/documentation/cassandra/2.1/cassandra/planning/architecturePlanningHardware_c.html); since we'll be storing a long tail of old revisions that are rarely accessed we can use more storage capacity per node.