
RESTBase production hardware
Closed, ResolvedPublic

Description

We discussed this so far in https://rt.wikimedia.org/Ticket/Display.html?id=8824, but I'm now moving this over to Phabricator so that we get some broader access and the ability to edit the summary. The procurement itself is tracked in https://rt.wikimedia.org/Ticket/Display.html?id=9007.

I think we have enough information in T76370 to spec & order:

  • start with 6 nodes in eqiad; could use misc hardware in codfw for cross-DC replication testing at first
  • powerful CPU (performance is largely CPU-bound)
  • 48-64G RAM
  • 3TB JBOD SSD space per node with at least 1000 rated erase cycles per cell
  • 10Gbit would be nice (can saturate 1Gbit even on the old test hosts with requests for large pages), but realistically with sufficient nodes & the expected traffic pattern we should also be able to get by with 1Gbit; I imagine it still makes a significant price difference.

Thoughts about storage space and SSDs

HTML is relatively bulky compared to wikitext; based on the information so far, enwiki alone will use more than 100G just for current HTML and data-parsoid. Across all projects we will already use close to 2TB of storage, and additional HTML variants for mobile etc. will consume more space on top of that. These numbers are with the default lz4 compression; we can improve things a bit by enabling deflate.

Really big gains from compression require an algorithm with a larger-than-32k sliding window, such as LZMA, to pick up the repetitions between bulky HTML revisions. Benchmarks suggest that LZMA compression at level 1 takes about 4-5 times more CPU than deflate at level 3 (or about as much as deflate at level 9); decompression might actually be faster than deflate if the output is significantly smaller. Cassandra doesn't currently support LZMA compression, but it does provide an interface to plug in additional algorithms, which is something we could consider doing in the longer term if nobody else gets there first. Worth talking to DataStax about this.
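The window-size point above can be demonstrated with the stdlib alone. This is an illustrative sketch, not a RESTBase benchmark: the "revisions" are synthetic random blocks sized so that corresponding bytes sit 64 KB apart, beyond deflate's 32 KB window but within LZMA's dictionary.

```python
import random
import zlib
import lzma

random.seed(42)
# One ~64 KB "revision" of effectively incompressible content.
base = bytes(random.getrandbits(8) for _ in range(64 * 1024))

# Ten near-identical revisions: matching bytes are 64 KB apart, so
# deflate's 32 KB sliding window can never reference the previous revision.
revisions = [base[:i] + b"x" + base[i + 1:] for i in range(10)]
data = b"".join(revisions)

deflated = zlib.compress(data, 3)          # deflate level 3, 32 KB window
lzma_out = lzma.compress(data, preset=1)   # LZMA preset 1, larger dictionary

# LZMA should come out far smaller, since it can reference earlier revisions.
print(len(data), len(deflated), len(lzma_out))
```

With real HTML revisions the gap is smaller (deflate still exploits intra-revision redundancy), but the cross-revision repetition is exactly what only the large-window algorithm can reach.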

Based on the info so far, 6TB of unreplicated storage will be about the minimum for the start. We will need more space for revisions eventually, but by then we'll have more information from the first deploy to refine the order for the second round. We currently use a replication factor of three (so that we can use quorum reads and get some amount of read scaling), but could consider dropping this to two and single-node operations for the initial caching use case if necessary to save space. Let's not plan based on that though, as it's good to have a bit of reserve.
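The capacity arithmetic behind these numbers can be sketched as follows (illustrative only; the function name is mine, and it ignores compaction headroom, snapshots, and other real-world overhead):

```python
def usable_tb(nodes, tb_per_node, replication_factor):
    """Unreplicated capacity of a cluster, ignoring overhead such as
    compaction headroom and snapshots."""
    return nodes * tb_per_node / replication_factor

# 6 nodes x 3 TB of SSD each:
print(usable_tb(6, 3, 3))  # RF=3 -> 6.0 TB unreplicated, the stated minimum
print(usable_tb(6, 3, 2))  # RF=2 -> 9.0 TB, the space saved by dropping RF
```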

Storage density can be fairly high, as most of those revisions are very rarely accessed, and benchmark data so far shows good throughput with limited CPU resources. Cassandra performs only sequential writes, which keeps the number of flash sector erase cycles low (no write amplification from partial sector writes). Our write volumes and thus SSTable merge traffic are fairly moderate, especially relative to the storage capacity we need. We could be fine with cheap consumer-grade SSDs with low erase cycle specs for this application, especially if we are using a replication factor of three & are not close to the space limit all the time. All data is checksummed in Cassandra, so issues will be detected early.
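A back-of-the-envelope endurance estimate supports the "cheap consumer SSDs are fine" argument; the numbers below are assumptions for illustration, not measured RESTBase write volumes:

```python
def drive_lifetime_years(capacity_tb, erase_cycles, write_tb_per_day,
                         write_amplification=1.0):
    """Years until the rated erase cycles are exhausted.

    Total writable data is roughly capacity * rated erase cycles;
    sequential-only workloads keep write amplification close to 1.
    """
    total_writable_tb = capacity_tb * erase_cycles
    return total_writable_tb / (write_tb_per_day * write_amplification) / 365

# A hypothetical 1 TB drive rated for 1000 cycles, absorbing 0.5 TB/day of
# sequential Cassandra writes (compaction traffic included), WA ~= 1:
print(round(drive_lifetime_years(1, 1000, 0.5), 1))  # ~5.5 years
```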


Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description. (Show Details)
GWicke changed Security from none to None.
GWicke updated the task description. (Show Details)
GWicke added a subscriber: GWicke.
GWicke updated the task description. (Show Details)Dec 8 2014, 7:04 AM
GWicke updated the task description. (Show Details)Dec 8 2014, 7:13 AM
GWicke updated the task description. (Show Details)Dec 8 2014, 5:07 PM
GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)Dec 8 2014, 11:40 PM

@mark, @RobH, @faidon: Do you need any additional information? It would be great to get this started asap.

GWicke updated the task description. (Show Details)Dec 10 2014, 4:54 PM

With enwiki dumped alphabetically until 'G' disk usage is 37G for html and 18G for data-parsoid, both with LZ4 compression. This means that my previous estimate of 60G for enwiki current revisions is definitely too low, at least using LZ4. We'll likely use somewhere between 1 and 2TB for current HTML & data-parsoid alone. We need to adjust our storage capacity upwards, probably closer to 6TB of unreplicated storage.
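The revised estimate follows from simple extrapolation. A sketch of the arithmetic, where the fraction of enwiki titles covered by 'A'..'G' is an assumed placeholder, not a measured value:

```python
dumped_gb = 37 + 18             # html + data-parsoid through 'G', with LZ4
assumed_fraction_dumped = 0.25  # hypothetical share of titles in A-G

estimated_enwiki_gb = dumped_gb / assumed_fraction_dumped
print(estimated_enwiki_gb)  # 220.0 -> well above the old 60 GB estimate
```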

GWicke triaged this task as High priority.Dec 10 2014, 5:42 PM
GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)Dec 10 2014, 5:50 PM
GWicke moved this task from Scheduled to Blocked on the Scrum-of-Scrums board.
mark added a comment.Dec 11 2014, 1:49 PM

What are the following numbers based on?

  • 5 nodes initially, any significance to the nr 5? Does even/odd matter?
  • >= 1000 erase cycles per cell, what's that based on?

It sounds like you're shooting for boxes with 2 SSDs each.

I'd like to do 10Gbps here, and we should be able to handle that in both data centers.

I'd also prefer to do codfw from the start as well, and we don't have a lot of misc hardware available there either. What's the worry here?

mark added a comment.Dec 11 2014, 3:23 PM

It seems the description changed since what I was commenting on. 3 TB per node now instead of 2 TB, and 6 nodes instead of 5, right?

GWicke added a comment.EditedDec 11 2014, 4:18 PM

It seems the description changed since what I was commenting on. 3 TB per node now instead of 2 TB, and 6 nodes instead of 5, right?

Yes, we overlapped. I tweaked the specs to account for the increased space need found in testing.

In T76986#841531, @mark wrote:

What are the following numbers based on?

  • 5 nodes initially, any significance to the nr 5? Does even/odd matter?

Even/odd does not matter beyond the basic replication factor. The sixth node got in there for the extra capacity. We can always add nodes later.

  • >= 1000 erase cycles per cell, what's that based on?

That basically excludes USB thumb drives, but includes most consumer SSDs. Erase cycles largely determine how many writes each flash block in an SSD can take before dying. An SSD used for random rewrites (Varnish, MySQL) needs higher ratings, as each small write involves the erase of a full flash block. An application doing only sequential writes (like Cassandra), on the other hand, incurs far fewer erase cycles.
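The difference can be made concrete with a worst-case sketch. The erase-block size is an assumption for illustration; real drives and FTLs vary:

```python
ERASE_BLOCK_KB = 256  # assumed flash erase-block size

def erase_block_writes(total_kb_written, write_size_kb, sequential):
    if sequential:
        # Blocks are filled end to end before being erased.
        return total_kb_written / ERASE_BLOCK_KB
    # Worst case for random rewrites: every small write forces a
    # read-modify-write of a whole erase block.
    return total_kb_written / write_size_kb

seq = erase_block_writes(1024 * 1024, 4, sequential=True)   # 1 GB sequential
rnd = erase_block_writes(1024 * 1024, 4, sequential=False)  # 1 GB of 4 KB rewrites
print(seq, rnd, rnd / seq)  # random costs 64x more erase-block writes here
```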

It sounds like you're shooting for boxes with 2 SSDs each.

I bumped that up to three to get more storage density for relatively cold data.

I'd like to do 10Gbps here, and we should be able to handle that in both data centers.

Okay, great. That avoids the network becoming the bottleneck.

I'd also prefer to do codfw from the start as well, and we don't have a lot of misc hardware available there either. What's the worry here?

The main advantage of doing it in two phases could be that we can fine-tune the second order based on more data. But then on the other hand a bigger order for both might end up cheaper per box. I'm fine with either.

GWicke updated the task description. (Show Details)Dec 11 2014, 7:25 PM
mark added a comment.Dec 12 2014, 2:41 PM
  • >= 1000 erase cycles per cell, what's that based on?

That basically excludes USB thumb drives, but includes most consumer SSDs. Erase cycles largely determine how many writes each flash block in an SSD can take before dying. An SSD used for random rewrites (Varnish, MySQL) needs higher ratings, as each small write involves the erase of a full flash block. An application doing only sequential writes (like Cassandra), on the other hand, incurs far fewer erase cycles.

I'm aware of all this, I'm just wondering why you specified it like that with that number. :)

It sounds like you're shooting for boxes with 2 SSDs each.

I bumped that up to three to get more storage density for relatively cold data.

Ok. What disk config are you looking for? Direct SATA or RAID?

The main advantage of doing it in two phases could be that we can fine-tune the second order based on more data. But then on the other hand a bigger order for both might end up cheaper per box. I'm fine with either.

Mmh, besides the SSDs these will likely use a fairly typical misc server config, so worst case we can repurpose them for that.

As you know, we're working on getting some initial quotes for this.

In T76986#844287, @mark wrote:
  • >= 1000 erase cycles per cell, what's that based on?

That basically excludes USB thumb drives, but includes most consumer SSDs. Erase cycles largely determine how many writes each flash block in an SSD can take before dying. An SSD used for random rewrites (Varnish, MySQL) needs higher ratings, as each small write involves the erase of a full flash block. An application doing only sequential writes (like Cassandra), on the other hand, incurs far fewer erase cycles.

I'm aware of all this, I'm just wondering why you specified it like that with that number. :)

It's admittedly a somewhat arbitrary and unscientific number, but I figured that having some quantitative ballpark could be helpful when evaluating the durability of different low-end models. I'm pretty sure there are models out there with lower ratings, and not just USB thumb drives ;)

Ok. What disk config are you looking for? Direct SATA or RAID?

Direct SATA. A striped LVM volume will work, but we could also experiment with Cassandra's JBOD support by cutting up the physical volume along drive boundaries. The latter can potentially offer better availability & faster repair in the case of a single-disk failure.

The main advantage of doing it in two phases could be that we can fine-tune the second order based on more data. But then on the other hand a bigger order for both might end up cheaper per box. I'm fine with either.

Mmh, besides the SSDs these will likely use a fairly typical misc server config, so worst case we can repurpose them for that.

Yeah, makes sense.

As you know, we're working on getting some initial quotes for this.

Thank you for getting this out so quickly, it's much appreciated!

GWicke added a comment.EditedJan 5 2015, 7:36 PM

@RobH: What is the latest ETA for this hardware?

Edit: Just talked about this in the ops meeting; end of January seems to be more likely at this point, the order is not yet out.

RobH added a comment.Jan 5 2015, 8:15 PM

As Gabriel updates, we just pushed the order for mgmt approval today. We've only recently begun ordering HP systems, but our limited history shows about 3 weeks from order to delivery. Once it is ordered, I'll make certain to update this task with the info.

RobH claimed this task.Jan 5 2015, 8:15 PM
RobH added a comment.Jan 7 2015, 9:00 PM

The order for the hardware has been placed today via RT#9049. The procurement tickets will be the last thing we migrate from RT to Phab, so there was some disconnect in keeping this properly up to date (my fault).

The delivery lead time for this is presently 2-3 weeks, but I'll update with more information as I get it. (I'll keep this task assigned to me so it stays on my radar.)

@RobH, any news?

We are aiming for a release before mid-February. VE performance work (top priority project) depends on RESTBase being available ASAP, so moving fast on this would be great.

RobH added a comment.Jan 30 2015, 7:17 PM

At the time of order, it was a 2-3 week lead time for shipment. As that has passed and I have no further update, I've pinged our HP VAR via email (just now.) I'll update ticket with his reply when received.

Latest update from VAR:

The servers have arrived at the VAR. The Samsung SSDs and drive carriers are supposed to arrive tomorrow, 2/3.

Kitting of the drives into the sleds should be completed and the completed order shipped by Tuesday 2/3 or Wednesday 2/4, for delivery to Ashburn on Thursday 2/5 or Friday 2/6.

We will provide another update tomorrow once we confirm delivery of the SSDs.

Once the VAR provides further updates, I'll pass them along in this task.

@RobH, thanks for the update! It looks like we are still on track for mid-February deploy, but it's getting tighter with about a week left for racking & node bring-up if things actually arrive by Friday.

RobH added a comment.Feb 4 2015, 12:18 AM

Update from vendor: These have shipped and are due to arrive onsite @ eqiad on Thursday, 2015-02-05.

These have arrived on-site. What are the requirements for racking? Do these need to be spread across rows and/or racks?

These need to be spread across rows and (ideally) racks. Our replica placement is by row, with the goal of having one copy of each bit of data in a separate row.

Just to document the latest status:

  • HP forgot 10G ethernet, shipping modules for arrival tomorrow
  • Racking early next week
  • Setup: Debian Jessie, small (~20G RAID-1) partition for / with bulk of SSDs as RAID-0 on top of LVM.
gerritbot added a subscriber: gerritbot.

Change 190182 had a related patch set uploaded (by Filippo Giunchedi):
restbase: switch to new partitioning scheme

https://gerrit.wikimedia.org/r/190182

Patch-For-Review

Change 190182 merged by Filippo Giunchedi:
restbase: switch to new partitioning scheme

https://gerrit.wikimedia.org/r/190182

Change 190190 had a related patch set uploaded (by Filippo Giunchedi):
restbase: adjust partman recipe

https://gerrit.wikimedia.org/r/190190

Patch-For-Review

Change 190190 merged by Filippo Giunchedi:
restbase: adjust partman recipe

https://gerrit.wikimedia.org/r/190190

Change 190426 had a related patch set uploaded (by Filippo Giunchedi):
restbase: provision restbase/cassandra role

https://gerrit.wikimedia.org/r/190426

Patch-For-Review

Change 190426 merged by Filippo Giunchedi:
restbase: provision restbase/cassandra role

https://gerrit.wikimedia.org/r/190426

RobH removed RobH as the assignee of this task.Feb 13 2015, 6:55 PM

As the actual hardware request via this ticket is done, I pulled myself off the assigned list. I cannot quite resolve it yet, since we have to rack and make the last two available.

Once those two are racked and the systems available for use, this can be resolved.

RobH renamed this task from RESTBase production hardware to RESTBase production hardware - 4 of 6 ready.Feb 13 2015, 6:56 PM
mobrovac moved this task from Backlog to Blocked / others on the RESTBase board.Feb 13 2015, 7:06 PM

Change 191339 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: set rack/dc/cluster name

https://gerrit.wikimedia.org/r/191339

Patch-For-Review

GWicke renamed this task from RESTBase production hardware - 4 of 6 ready to RESTBase production hardware - 3 of 6 ready.Feb 23 2015, 4:44 PM
GWicke renamed this task from RESTBase production hardware - 3 of 6 ready to RESTBase production hardware - 4 of 6 ready.Feb 23 2015, 7:09 PM

4 of 6 servers are now online and serving requests.

The remaining two are:

  • restbase1001: needs to be racked
  • restbase1006: T89639 (faulty disk controller)

clarification: restbase1001 is up and racked but currently running into an issue with the debian installer and network cards, tracked at T90236

restbase1001 is online

GWicke renamed this task from RESTBase production hardware - 4 of 6 ready to RESTBase production hardware - 5 of 6 ready.Feb 25 2015, 6:56 PM
GWicke removed a project: Scrum-of-Scrums.
Ottomata removed a subscriber: Ottomata.Mar 9 2015, 4:29 PM
yuvipanda added a subscriber: yuvipanda.
GWicke closed this task as Resolved.Mar 17 2015, 7:16 PM

Resolving with restbase1006 now back in operation.

GWicke renamed this task from RESTBase production hardware - 5 of 6 ready to RESTBase production hardware.Jul 7 2015, 10:58 PM
Restricted Application added a subscriber: Matanya. Jul 7 2015, 10:58 PM
RobH mentioned this in Unknown Object (Task).Feb 9 2016, 7:18 PM