
RESTBase production hardware
Closed, ResolvedPublic

Description

We discussed this so far in https://rt.wikimedia.org/Ticket/Display.html?id=8824, but I'm now moving this over to Phabricator so that we get some broader access and the ability to edit the summary. The procurement itself is tracked in https://rt.wikimedia.org/Ticket/Display.html?id=9007.

I think we have enough information in T76370 to spec & order:

  • start with 6 nodes in eqiad; could use misc hardware in codfw for cross-DC replication testing at first
  • powerful CPU (performance is largely CPU-bound)
  • 48-64G RAM
  • 3TB JBOD SSD space per node with at least 1000 rated erase cycles per cell
  • 10Gbit would be nice (can saturate 1Gbit even on the old test hosts with requests for large pages), but realistically with sufficient nodes & the expected traffic pattern we should also be able to get by with 1Gbit; I imagine it still makes a significant price difference.

Thoughts about storage space and SSDs

HTML is relatively bulky compared to wikitext; based on the information so far, enwiki alone will use more than 100G just for current HTML and data-parsoid. Across all projects we will already use close to 2TB of storage, and additional HTML variants for mobile etc. will consume more space on top of that. These numbers are with the default lz4 compression; we can improve things a bit by enabling deflate.

Really big gains from compression require an algorithm with a larger-than-32k sliding window, such as LZMA, to pick up the repetitions between bulky HTML revisions. Benchmarks suggest that LZMA compression at level 1 takes about 4-5 times more CPU than deflate at level 3 (or about as much as deflate at level 9); decompression might actually be faster than deflate if the output is significantly smaller. Cassandra doesn't currently support LZMA compression, but it does provide an interface to plug in additional algorithms, which is something we could consider doing in the longer term if nobody else gets there first. Worth talking to DataStax about this.
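The window-size point above can be demonstrated with the stdlib alone. This is an illustrative sketch, not a RESTBase benchmark: the "revisions" are synthetic random blocks sized so that corresponding bytes sit 64 KB apart, beyond deflate's 32 KB window but within LZMA's dictionary.

```python
import random
import zlib
import lzma

random.seed(42)
# One ~64 KB "revision" of effectively incompressible content.
base = bytes(random.getrandbits(8) for _ in range(64 * 1024))

# Ten near-identical revisions: matching bytes are 64 KB apart, so
# deflate's 32 KB sliding window can never reference the previous revision.
revisions = [base[:i] + b"x" + base[i + 1:] for i in range(10)]
data = b"".join(revisions)

deflated = zlib.compress(data, 3)          # deflate level 3, 32 KB window
lzma_out = lzma.compress(data, preset=1)   # LZMA preset 1, larger dictionary

# LZMA should come out far smaller, since it can reference earlier revisions.
print(len(data), len(deflated), len(lzma_out))
```

With real HTML revisions the gap is smaller (deflate still exploits intra-revision redundancy), but the cross-revision repetition is exactly what only the large-window algorithm can reach.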

Based on the info so far, 6TB of unreplicated storage will be about the minimum for the start. We will need more space for revisions eventually, but by then we'll have more information from the first deploy to refine the order for the second round. We currently use a replication factor of three (so that we can use quorum reads and get some amount of read scaling), but could consider dropping this to two and single-node operations for the initial caching use case if necessary to save space. Let's not plan based on that though, as it's good to have a bit of reserve.
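The capacity arithmetic behind these numbers can be sketched as follows (illustrative only; the function name is mine, and it ignores compaction headroom, snapshots, and other real-world overhead):

```python
def usable_tb(nodes, tb_per_node, replication_factor):
    """Unreplicated capacity of a cluster, ignoring overhead such as
    compaction headroom and snapshots."""
    return nodes * tb_per_node / replication_factor

# 6 nodes x 3 TB of SSD each:
print(usable_tb(6, 3, 3))  # RF=3 -> 6.0 TB unreplicated, the stated minimum
print(usable_tb(6, 3, 2))  # RF=2 -> 9.0 TB, the space saved by dropping RF
```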

Storage density can be fairly high, as most of those revisions are very rarely accessed, and benchmark data so far shows good throughput with limited CPU resources. Cassandra performs only sequential writes, which keeps the number of flash sector erase cycles low (no write amplification from partial sector writes). Our write volumes and thus SSTable merge traffic are fairly moderate, especially relative to the storage capacity we need. We could be fine with cheap consumer-grade SSDs with low erase cycle specs for this application, especially if we are using a replication factor of three & are not close to the space limit all the time. All data is checksummed in Cassandra, so issues will be detected early.
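A back-of-the-envelope endurance estimate supports the "cheap consumer SSDs are fine" argument; the numbers below are assumptions for illustration, not measured RESTBase write volumes:

```python
def drive_lifetime_years(capacity_tb, erase_cycles, write_tb_per_day,
                         write_amplification=1.0):
    """Years until the rated erase cycles are exhausted.

    Total writable data is roughly capacity * rated erase cycles;
    sequential-only workloads keep write amplification close to 1.
    """
    total_writable_tb = capacity_tb * erase_cycles
    return total_writable_tb / (write_tb_per_day * write_amplification) / 365

# A hypothetical 1 TB drive rated for 1000 cycles, absorbing 0.5 TB/day of
# sequential Cassandra writes (compaction traffic included), WA ~= 1:
print(round(drive_lifetime_years(1, 1000, 0.5), 1))  # ~5.5 years
```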


Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description. (Show Details)
GWicke changed Security from none to None.
GWicke updated the task description. (Show Details)
GWicke added a subscriber: GWicke.
GWicke updated the task description. (Show Details)Dec 8 2014, 7:04 AM
GWicke updated the task description. (Show Details)Dec 8 2014, 7:13 AM
GWicke updated the task description. (Show Details)Dec 8 2014, 5:07 PM
GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)Dec 8 2014, 11:40 PM

@mark, @RobH, @faidon: Do you need any additional information? It would be great to get this started asap.

GWicke updated the task description. (Show Details)Dec 10 2014, 4:54 PM

With enwiki dumped alphabetically until 'G' disk usage is 37G for html and 18G for data-parsoid, both with LZ4 compression. This means that my previous estimate of 60G for enwiki current revisions is definitely too low, at least using LZ4. We'll likely use somewhere between 1 and 2TB for current HTML & data-parsoid alone. We need to adjust our storage capacity upwards, probably closer to 6TB of unreplicated storage.
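The revised estimate follows from simple extrapolation. A sketch of the arithmetic, where the fraction of enwiki titles covered by 'A'..'G' is an assumed placeholder, not a measured value:

```python
dumped_gb = 37 + 18             # html + data-parsoid through 'G', with LZ4
assumed_fraction_dumped = 0.25  # hypothetical share of titles in A-G

estimated_enwiki_gb = dumped_gb / assumed_fraction_dumped
print(estimated_enwiki_gb)  # 220.0 -> well above the old 60 GB estimate
```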

GWicke triaged this task as High priority.Dec 10 2014, 5:42 PM
GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)Dec 10 2014, 5:50 PM
GWicke moved this task from Scheduled to Blocked on the Scrum-of-Scrums board.
mark added a comment.Dec 11 2014, 1:49 PM

What are the following numbers based on?

  • 5 nodes initially, any significance to the nr 5? Does even/odd matter?
  • >= 1000 erase cycles per cell, what's that based on?

It sounds like you're shooting for boxes with 2 SSDs each.

I'd like to do 10Gbps here, and we should be able to handle that in both data centers.

I'd also prefer to do codfw from the start as well, and we don't have a lot of misc hardware available there either. What's the worry here?

mark added a comment.Dec 11 2014, 3:23 PM

It seems the description changed since what I was commenting on. 3 TB per node now instead of 2 TB, and 6 nodes instead of 5, right?

GWicke added a comment.EditedDec 11 2014, 4:18 PM

It seems the description changed since what I was commenting on. 3 TB per node now instead of 2 TB, and 6 nodes instead of 5, right?

Yes, we overlapped. I tweaked the specs to account for the increased space need found in testing.

In T76986#841531, @mark wrote:

What are the following numbers based on?

  • 5 nodes initially, any significance to the nr 5? Does even/odd matter?

Even/odd does not matter beyond the basic replication factor. The sixth node got in there for the extra capacity. We can always add nodes later.

  • >= 1000 erase cycles per cell, what's that based on?

That basically excludes USB thumb drives, but includes most consumer SSDs. Erase cycles largely determine how many writes each flash block in an SSD can take before dying. An SSD used for random rewrites (Varnish, MySQL) needs higher ratings, as each small write involves the erase of a full flash block. An application doing only sequential writes (like Cassandra), on the other hand, incurs far fewer erase cycles.
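The difference can be made concrete with a worst-case sketch. The erase-block size is an assumption for illustration; real drives and FTLs vary:

```python
ERASE_BLOCK_KB = 256  # assumed flash erase-block size

def erase_block_writes(total_kb_written, write_size_kb, sequential):
    if sequential:
        # Blocks are filled end to end before being erased.
        return total_kb_written / ERASE_BLOCK_KB
    # Worst case for random rewrites: every small write forces a
    # read-modify-write of a whole erase block.
    return total_kb_written / write_size_kb

seq = erase_block_writes(1024 * 1024, 4, sequential=True)   # 1 GB sequential
rnd = erase_block_writes(1024 * 1024, 4, sequential=False)  # 1 GB of 4 KB rewrites
print(seq, rnd, rnd / seq)  # random costs 64x more erase-block writes here
```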

It sounds like you're shooting for boxes with 2 SSDs each.

I bumped that up to three to get more storage density for relatively cold data.

I'd like to do 10Gbps here, and we should be able to handle that in both data centers.

Okay, great. That avoids the network becoming the bottleneck.

I'd also prefer to do codfw from the start as well, and we don't have a lot of misc hardware available there either. What's the worry here?

The main advantage of doing it in two phases could be that we can fine-tune the second order based on more data. But then on the other hand a bigger order for both might end up cheaper per box. I'm fine with either.

GWicke updated the task description. (Show Details)Dec 11 2014, 7:25 PM
mark added a comment.Dec 12 2014, 2:41 PM
  • >= 1000 erase cycles per cell, what's that based on?

That basically excludes USB thumb drives, but includes most consumer SSDs. Erase cycles largely determine how many writes each flash block in an SSD can take before dying. An SSD used for random rewrites (Varnish, MySQL) needs higher ratings, as each small write involves the erase of a full flash block. An application doing only sequential writes (like Cassandra), on the other hand, incurs far fewer erase cycles.

I'm aware of all this, I'm just wondering why you specified it like that with that number. :)

It sounds like you're shooting for boxes with 2 SSDs each.

I bumped that up to three to get more storage density for relatively cold data.

Ok. What disk config are you looking for? Direct SATA or RAID?

The main advantage of doing it in two phases could be that we can fine-tune the second order based on more data. But then on the other hand a bigger order for both might end up cheaper per box. I'm fine with either.

Mmh, besides the SSDs these will likely use a fairly typical misc server config, so worst case we can repurpose them for that.

As you know, we're working on getting some initial quotes for this.

In T76986#844287, @mark wrote:
  • >= 1000 erase cycles per cell, what's that based on?

That basically excludes USB thumb drives, but includes most consumer SSDs. Erase cycles largely determine how many writes each flash block in an SSD can take before dying. An SSD used for random rewrites (Varnish, MySQL) needs higher ratings, as each small write involves the erase of a full flash block. An application doing only sequential writes (like Cassandra), on the other hand, incurs far fewer erase cycles.

I'm aware of all this, I'm just wondering why you specified it like that with that number. :)

It's admittedly a somewhat arbitrary and unscientific number, but I figured that having some quantitative ballpark could be helpful when evaluating the durability of different low-end models. I'm pretty sure there are models out there with lower ratings, and not just USB thumb drives ;)

Ok. What disk config are you looking for? Direct SATA or RAID?

Direct SATA. A striped LVM volume will work, but we could also experiment with Cassandra's JBOD support by cutting up the physical volume along drive boundaries. The latter can potentially offer better availability & faster repair in the case of a single-disk failure.

The main advantage of doing it in two phases could be that we can fine-tune the second order based on more data. But then on the other hand a bigger order for both might end up cheaper per box. I'm fine with either.

Mmh, besides the SSDs these will likely use a fairly typical misc server config, so worst case we can repurpose them for that.

Yeah, makes sense.

As you know, we're working on getting some initial quotes for this.

Thank you for getting this out so quickly, it's much appreciated!

GWicke added a comment.EditedJan 5 2015, 7:36 PM

@RobH: What is the latest ETA for this hardware?

Edit: Just talked about this in the ops meeting; end of January seems to be more likely at this point, the order is not yet out.

RobH added a comment.Jan 5 2015, 8:15 PM

As Gabriel updates, we just pushed the order for mgmt approval today. We've only recently begun ordering HP systems, but our limited history shows about 3 weeks from order to delivery. Once it is ordered, I'll make certain to update this task with the info.

RobH claimed this task.Jan 5 2015, 8:15 PM
RobH added a comment.Jan 7 2015, 9:00 PM

The order for the hardware has been placed today via RT#9049. The procurement tickets will be the last thing we migrate from RT to Phab, so there was some disconnect in keeping this properly up to date (my fault).

The delivery lead time for this is presently 2-3 weeks, but I'll update with more information as I get it. (I'll keep this task assigned to me so it stays on my radar.)

@RobH, any news?

We are aiming for a release before mid-February. VE performance work (top priority project) depends on RESTBase being available ASAP, so moving fast on this would be great.

RobH added a comment.Jan 30 2015, 7:17 PM

At the time of order, it was a 2-3 week lead time for shipment. As that has passed and I have no further update, I've pinged our HP VAR via email (just now.) I'll update ticket with his reply when received.

Latest update from VAR:

The servers have arrived at the VAR. The Samsung SSDs and drive carriers are supposed to arrive tomorrow, 2/3.

Kitting of the drives into the sleds should be completed and the completed order shipped by Tuesday 2/3 or Wednesday 2/4, for delivery to Ashburn on Thursday 2/5 or Friday 2/6.

We will provide another update tomorrow once we confirm delivery of the SSDs.

Once the VAR provides further updates, I'll pass them along in this task.

@RobH, thanks for the update! It looks like we are still on track for mid-February deploy, but it's getting tighter with about a week left for racking & node bring-up if things actually arrive by Friday.

RobH added a comment.Feb 4 2015, 12:18 AM

Update from vendor: These have shipped and are due to arrive onsite @ eqiad on Thursday, 2015-02-05.

These have arrived on-site. What are the requirements for racking? Do these need to be spread across rows and/or racks?

These need to be spread across rows and (ideally) racks. Our replica placement is by row, with the goal of having one copy of each bit of data in a separate row.

Just to document the latest status:

  • HP forgot 10G ethernet, shipping modules for arrival tomorrow
  • Racking early next week
  • Setup: Debian Jessie, small (~20G RAID-1) partition for / with bulk of SSDs as RAID-0 on top of LVM.
gerritbot added a subscriber: gerritbot.

Change 190182 had a related patch set uploaded (by Filippo Giunchedi):
restbase: switch to new partitioning scheme

https://gerrit.wikimedia.org/r/190182

Patch-For-Review

Change 190182 merged by Filippo Giunchedi:
restbase: switch to new partitioning scheme

https://gerrit.wikimedia.org/r/190182

Change 190190 had a related patch set uploaded (by Filippo Giunchedi):
restbase: adjust partman recipe

https://gerrit.wikimedia.org/r/190190

Patch-For-Review

Change 190190 merged by Filippo Giunchedi:
restbase: adjust partman recipe

https://gerrit.wikimedia.org/r/190190

Change 190426 had a related patch set uploaded (by Filippo Giunchedi):
restbase: provision restbase/cassandra role

https://gerrit.wikimedia.org/r/190426

Patch-For-Review

Change 190426 merged by Filippo Giunchedi:
restbase: provision restbase/cassandra role

https://gerrit.wikimedia.org/r/190426

RobH removed RobH as the assignee of this task.Feb 13 2015, 6:55 PM

As the actual hardware request via this ticket is done, I pulled myself off the assigned list. I cannot quite resolve it yet, since we have to rack and make the last two available.

Once those two are racked and the systems available for use, this can be resolved.

RobH renamed this task from RESTBase production hardware to RESTBase production hardware - 4 of 6 ready.Feb 13 2015, 6:56 PM
mobrovac moved this task from Backlog to Blocked / others on the RESTBase board.Feb 13 2015, 7:06 PM

Change 191339 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: set rack/dc/cluster name

https://gerrit.wikimedia.org/r/191339

Patch-For-Review

GWicke renamed this task from RESTBase production hardware - 4 of 6 ready to RESTBase production hardware - 3 of 6 ready.Feb 23 2015, 4:44 PM
GWicke renamed this task from RESTBase production hardware - 3 of 6 ready to RESTBase production hardware - 4 of 6 ready.Feb 23 2015, 7:09 PM

4 of 6 servers are now online and serving requests.

The remaining two are:

  • restbase1001: needs to be racked
  • restbase1006: T89639 (faulty disk controller)

clarification: restbase1001 is up and racked but currently running into an issue with the debian installer and network cards, tracked at T90236

restbase1001 is online

GWicke renamed this task from RESTBase production hardware - 4 of 6 ready to RESTBase production hardware - 5 of 6 ready.Feb 25 2015, 6:56 PM
GWicke removed a project: Scrum-of-Scrums.
Ottomata removed a subscriber: Ottomata.Mar 9 2015, 4:29 PM
yuvipanda added a subscriber: yuvipanda.
GWicke closed this task as Resolved.Mar 17 2015, 7:16 PM

Resolving with restbase1006 now back in operation.

GWicke renamed this task from RESTBase production hardware - 5 of 6 ready to RESTBase production hardware.Jul 7 2015, 10:58 PM
Restricted Application added a subscriber: Matanya. Jul 7 2015, 10:58 PM
RobH mentioned this in Unknown Object (Task).Feb 9 2016, 7:18 PM