We discussed this so far in https://rt.wikimedia.org/Ticket/Display.html?id=8824, but I'm now moving this over to Phabricator so that we get some broader access and the ability to edit the summary. The procurement itself is tracked in https://rt.wikimedia.org/Ticket/Display.html?id=9007.
I think we have enough information in T76370 to start thinking about specs. First iteration:
- start with 56 nodes in eqiad; could use misc hardware in codfw for cross-DC replication testing at first
- powerful CPU (performance is largely CPU-bound)
- 48-64G RAM
- 2TB- 3TB SSD space per node with at least 1000 rated erase cycles per cell (ex: [Samsung 850 PRO Series 1Tb](http://www.newegg.com/Product/Product.aspx?Item=N82E16820147362&nm_mc=KNC-GoogleAdwords-PC&cm_mmc=KNC-GoogleAdwords-PC-_-pla-_-Internal+SSDs-_-N82E16820147362&gclid=CLSU3_7ntcICFQQSMwodXxYAag) @$630,; perhapsince cassandra is doing purely sequential writes we cshould even get awaybe fine with the cheaper Samsung 840 EVO 1TB at around $400 ([endurance testanandtech predicts 31 years life at 100G sequential writes per day](http://www.anandtech.com/show/7173/samsung-ssd-840-evo-review-120gb-250gb-500gb-750gb-1tb-models-tested/3)). (Side note: [This German web site](http://www.heise.de/preisvergleich/?cat=hdssd&xf=252_1000&sort=r) is handy to list SSD models by criteria like price / GB.)
- 10Gbit would be nice (can saturate 1Gbit even on the old test hosts with requests for large pages), but realistically with sufficient nodes & the expected traffic pattern we should also be able to get by with 1Gbit; I imagine it still makes a significant price difference.
## Thoughts about storage space and SSDs
HTML is relatively bulky compared to wikitext; based on the info so far I would expect onlyenwiki alone will use more than 100G just for current enwiki revisions to take up at least ~60G.HTML and data-parsoid. Across all projects, This means that just current revisions across all projectswe will already use up a good chunk of 1Gclose to 2TB of storage. Additional HTML variants for mobile etc will use up additional space. These numbers are with the default lz4 compression, and we can improve things a bit by enabling deflate. Really big gains from compression require an algorithm with a larger than 32k sliding window such as LZMA to pick up the repetitions between bulky HTML revisions. [Benchmarks](http://stephane.lesimple.fr/blog/2010-07-20/lzop-vs-compress-vs-gzip-vs-bzip2-vs-lzma-vs-lzma2xz-benchmark-reloaded.html) suggest that at level 1 LZMA compression takes about 4-5 times more CPU than deflate at level 3 (or about as much as deflate at level 9); decompression might be faster than deflate if the output is significantly smaller. Cassandra doesn't currently support lzma compression. It does provide an interface to plug in additional algorithms though, which is something we could consider doing in the longer term if nobody else gets there first. Worth talking to datastax about this.
It's too early to make very precise predictionsBased on the info so far, 6TB of unreplicated storage will be about the minimum for the start. We will need more space for revisions eventually, but I'd guess that about 3TB of unreplicated storage will be about the minimumby then we'll have more information from the first deploy to refine the order for the startecond round. We currently use a replication factor of three (so that we can use quorum reads, and get some amount of read scaling), but could consider dropping this to two & single-node operations for the initial caching use case if necessary to save space. I would ratherLets not go in that directionplan based on that though, mainlyas it's good to avoid hot spots in the read pathhave a little bit of reserve if necessary.
Much of this data willStorage density can be relatively coldfairly high, so I think we'll be fine with a fairly high storage density per node (2TB).as most of those revisions are very rarely accessed, At five replicas and RF=3,and benchmark data so far shows good throughput with limited CPU resources. this works out to a bit over 3G of unreplicated storage.Cassandra performs only sequential writes, We can add more nodes as needed to grow capacity later.
Cassandra performs only sequenwhich keeps the number of flash sector erase cycles low (no write amplification from partial sector writes,). which keeps the number of flash sector erase cycles low (no write amplification from partial sector writes).Our write volumes and thus SSTable merge traffic are fairly moderate, Our write volumes and thus SSTable merge traffic should be fairly moderateespecially relative to the storage capacity we need. We could probably get awaybe fine with cheap consumer-grade SSDs with low erase cycle specs for this application, especially if we are using a replication factor of three & are not close to the space limit all the time. All data is checksummed in Cassandra, so issues will be detected early.
See also:
- Benchmark results in T76370
- [Cassandra hardware planning docs](http://www.datastax.com/documentation/cassandra/2.1/cassandra/planning/architecturePlanningHardware_c.html); since we'll be storing a long tail of old revisions that are rarely accessed we can use more storage capacity per node.