We discussed this so far in https://rt.wikimedia.org/Ticket/Display.html?id=8824, but I'm now moving this over to Phabricator so that we get some broader access and the ability to edit the summary. The procurement itself is tracked in https://rt.wikimedia.org/Ticket/Display.html?id=9007.
I think we have enough information in T76370 to spec & order:
- start with 6 nodes in eqiad; could use misc hardware in codfw for cross-DC replication testing at first
- powerful CPU (performance is largely CPU-bound)
- 48-64G RAM
- 3TB JBOD SSD space per node with at least 1000 rated erase cycles per cell
- [Samsung 840 EVO 1TB](http://www.newegg.com/Product/Product.aspx?Item=N82E16820147251&cm_re=Samsung_840_EVO-_-20-147-251-_-Product) @$420: Since Cassandra does purely sequential writes, even this drive should be fine. Anandtech predicts [31 years of life at 100G of sequential writes per day](http://www.anandtech.com/show/7173/samsung-ssd-840-evo-review-120gb-250gb-500gb-750gb-1tb-models-tested/3).
- [Samsung 850 PRO Series 1TB](http://www.newegg.com/Product/Product.aspx?Item=N82E16820147362&nm_mc=KNC-GoogleAdwords-PC&cm_mmc=KNC-GoogleAdwords-PC-_-pla-_-Internal+SSDs-_-N82E16820147362&gclid=CLSU3_7ntcICFQQSMwodXxYAag) @$630.
- Intel options > 480G are [around 2x+ more expensive per GB](http://www.heise.de/preisvergleich/?cat=hdssd&sort=r&xf=252_480~1035_Intel#xf_top)
- Side note: [This German web site](http://www.heise.de/preisvergleich/?cat=hdssd&xf=252_1000&sort=r) is handy to list SSD models by criteria like price / GB.
- The OS can use a small RAID-5 (or RAID-1) across a small partition on each of the disks. The puppetization works well, so even placing the OS on a RAID-0 would be okay from a reliability POV, but it could mean slightly more work when bringing a failed node back up.
- 10Gbit would be nice (we can saturate 1Gbit even on the old test hosts with requests for large pages), but realistically, with sufficient nodes and the expected traffic pattern, we should be able to get by with 1Gbit; I imagine 10Gbit still makes a significant price difference.
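To put rough numbers on the 1Gbit question, here's a small back-of-the-envelope sketch in Python; the average response size and per-node request rate are made-up placeholders, not measurements:

```python
# Rough bandwidth headroom check for 1Gbit vs 10Gbit.
# The response size and request rate are hypothetical placeholders.

GBIT_BYTES = 1e9 / 8              # bytes/s on a 1Gbit link, ignoring overhead

avg_response_bytes = 150 * 1024   # assumed average HTML response size
peak_requests_per_sec = 500       # assumed peak request rate per node

needed = avg_response_bytes * peak_requests_per_sec
print(f"peak egress per node: {needed / 1e6:.0f} MB/s "
      f"({needed / GBIT_BYTES:.0%} of a 1Gbit link)")
```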
## Thoughts about storage space and SSDs
HTML is relatively bulky compared to wikitext; based on the info so far, enwiki alone will use more than 100G just for current HTML and data-parsoid. Across all projects, we will already use close to 2TB of storage. Additional HTML variants for mobile etc. will use up further space.

These numbers are with the default lz4 compression, and we can improve things a bit by enabling deflate. Really big gains from compression require an algorithm with a larger-than-32k sliding window, such as LZMA, to pick up the repetition between bulky HTML revisions. [Benchmarks](http://stephane.lesimple.fr/blog/2010-07-20/lzop-vs-compress-vs-gzip-vs-bzip2-vs-lzma-vs-lzma2xz-benchmark-reloaded.html) suggest that LZMA compression at level 1 takes about 4-5 times more CPU than deflate at level 3 (or about as much as deflate at level 9); decompression might even be faster than deflate if the output is significantly smaller. Cassandra doesn't currently support LZMA compression, but it does provide an interface to plug in additional algorithms, which is something we could consider doing in the longer term if nobody else gets there first. Worth talking to DataStax about this.
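To illustrate the sliding-window point, here's a minimal Python sketch using the stdlib zlib and lzma modules; the synthetic "revisions" are placeholders (not real Parsoid output), each made larger than deflate's 32k window so deflate can't reference the previous copy while LZMA can:

```python
import hashlib
import lzma
import zlib

# ~64k of not-very-compressible "page content", larger than deflate's 32k window
base = b"".join(hashlib.sha256(str(i).encode()).digest().hex().encode()
                for i in range(1000))
# successive revisions are nearly identical copies of the same page
revisions = [base + b"<p>edit %d</p>" % i for i in range(20)]
blob = b"".join(revisions)        # several revisions in one compression chunk

print("raw bytes:      ", len(blob))
print("deflate level 3:", len(zlib.compress(blob, 3)))
print("lzma preset 1:  ", len(lzma.compress(blob, preset=1)))
```

On data like this, deflate only removes redundancy within each revision, while LZMA's larger dictionary also picks up the near-duplicate previous revisions.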
Based on the info so far, 6TB of unreplicated storage will be about the minimum for the start. We will need more space for revisions eventually, but by then we'll have more information from the first deploy to refine the order for the second round. We currently use a replication factor of three (so that we can use quorum reads and get some amount of read scaling), but could consider dropping this to two and single-node operations for the initial caching use case if necessary to save space. Let's not plan based on that though, as it's good to have a little bit of reserve.
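As a quick sanity check of those numbers (using only the estimates from this ticket):

```python
# Raw and per-node storage for ~6TB of unreplicated data across 6 nodes,
# at the current replication factor of 3 and the fallback of 2 mentioned above.
unreplicated_tb = 6.0
nodes = 6

for rf in (3, 2):
    raw = unreplicated_tb * rf
    print(f"RF={rf}: {raw:.0f} TB raw, {raw / nodes:.1f} TB per node")
```

At RF=3 that works out to 3TB per node, i.e. the JBOD spec above.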
Storage density can be fairly high, as most of those revisions are very rarely accessed, and benchmark data so far shows good throughput with limited CPU resources. Cassandra performs only sequential writes, which keeps the number of flash sector erase cycles low (no write amplification from partial sector writes). Our write volumes and thus SSTable merge traffic are fairly moderate, especially relative to the storage capacity we need. We could be fine with cheap consumer-grade SSDs with low erase cycle specs for this application, especially if we are using a replication factor of three & are not close to the space limit all the time. All data is checksummed in Cassandra, so issues will be detected early.
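A rough endurance estimate along the same lines as the Anandtech prediction linked above; only the erase cycle rating comes from the spec in this ticket, the daily write volume and write amplification factor are assumptions:

```python
# Back-of-the-envelope SSD endurance: capacity * rated erase cycles divided by
# effective daily writes. Sequential-only writes keep write amplification low.
capacity_gb = 1000            # e.g. a 1TB drive
rated_erase_cycles = 1000     # per-cell rating from the spec above
daily_writes_gb = 100         # assumed sustained writes incl. compactions
write_amplification = 1.5     # assumed; sequential writes keep this low

total_writable_gb = capacity_gb * rated_erase_cycles
years = total_writable_gb / (daily_writes_gb * write_amplification) / 365
print(f"estimated drive life: ~{years:.0f} years")
```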
See also:
- Benchmark results in T76370
- [Cassandra hardware planning docs](http://www.datastax.com/documentation/cassandra/2.1/cassandra/planning/architecturePlanningHardware_c.html); since we'll be storing a long tail of old revisions that are rarely accessed, we can use more storage capacity per node.