Expand RESTBase cluster capacity
Closed, Resolved · Public

Description

Looking at the first production data, it appears that our current RESTBase Cassandra cluster does not have a lot of margin on IO bandwidth and storage capacity. With more tuning we might be able to stretch it a bit further, but it is clear that we'll need more capacity sooner rather than later.

We are leaning towards moving to a setup with multiple Cassandra instances per hardware node, aiming for a maximum compressed load of ~600G per instance. With 1T SSDs this could be configured as:

  • one instance per physical SSD, JBOD
  • all instances sharing a RAID-0 (current setup)
  • some other RAID level

JBOD would give us a good amount of failure isolation, but does not allow us to share the aggregate IO bandwidth between instances. RAID levels other than RAID-0 lose disk capacity, and are not strictly needed as the data is already replicated 3-way across the cluster.

For the current cluster, the main decision we have to make is whether we want to add one SSD to the existing nodes or not. (There is one slot left.)

See T97692 for the storage need projection for FY2015/16. Bottom line is: ~35T of additional storage in the next fiscal year, for a total of 53T per cluster.

From a replica placement perspective, it makes sense to add nodes in increments of three, one in each of the three rows.
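As a rough illustration of how the projection and the increments-of-three constraint translate into an order size (the per-node usable capacity below is an assumption for illustration, not a spec):

```python
# Rough sizing sketch (illustration only, not the actual plan): turn the
# projected additional storage into a node count, then round up to a
# multiple of three so each of the three rows gets the same increment.
import math

additional_tb = 35       # additional storage projected for FY2015/16 (T97692)
usable_per_node_tb = 3   # assumption: usable (compressed) capacity per new node

nodes = math.ceil(additional_tb / usable_per_node_tb)   # 12
nodes = math.ceil(nodes / 3) * 3                        # round up to full rows: 12
print(f"order at least {nodes} nodes")
```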

RT tracking of the current order: https://rt.wikimedia.org/Ticket/Display.html?id=9506

Event Timeline

GWicke triaged this task as Medium priority. Mar 24 2015, 7:16 PM
GWicke updated the task description. (Show Details)

What "does not have a lot of margin on IO bandwidth and storage capacity" mean exactly? Which resource is near exhaustion (IOPS, bandwidth or capacity) in your analysis and how long do you guys think we have until it becomes critical?

If we're buying more, can we also figure out our plans (& procurement) for codfw? How would an additional DC fit in our plans? Would we still keep 3 replicas in eqiad (e.g. for I/O throughput reasons) or would we do 2+2? In any case, we should start procuring that soon and we should have the x2 factor in mind for budgetary purposes.

Why do we need multiple Cassandra instances per hardware node?

What "does not have a lot of margin on IO bandwidth and storage capacity" mean exactly? Which resource is near exhaustion (IOPS, bandwidth or capacity) in your analysis and how long do you guys think we have until it becomes critical?

The most critical bit at this point is capacity, followed by bandwidth and IOPS in roughly equal parts.

Currently the nodes already use around 900G out of 2.5T. With several optimizations implemented last week and deletions of older content, the growth rate has slowed for now, but it is clear that even with only a single HTML render per edit revision we'll eventually use up all of the current storage space. We would also like to add additional content types and projects, but don't have a lot of spare storage capacity to accommodate this.

Cassandra write workloads are essentially bandwidth-bound. We managed to lower write-driven iowait to (typically) below 1% by enabling the trickle_fsync feature, which forces constant flushing.

We do see high IOPS rates during random reads and repairs, along with an increase in iowait and read latency while they run. While the disks are not maxed out all the time, it looks likely that some of the latency increase we are seeing is caused by limited IOPS and bandwidth. We have stopped running periodic anti-entropy repairs for now to avoid these latency increases. It would be nice to have enough headroom to run them again.

If we're buying more, can we also figure out our plans (& procurement) for codfw? How would an additional DC fit in our plans? Would we still keep 3 replicas in eqiad (e.g. for I/O throughput reasons) or would we do 2+2? In any case, we should start procuring that soon and we should have the x2 factor in mind for budgetary purposes.

We currently have six nodes in eqiad, and would like to add three for a total of nine. For codfw we could either set up an identical cluster, or look for possible savings with slightly cheaper hardware (1G networking might actually be sufficient after all). We'll need the same amount of SSD space if we are going to replicate to codfw.

Why do we need multiple Cassandra instances per hardware node?

GCs don't scale to arbitrary heap sizes / garbage production rates without increasing pause times. This means that the load per Cassandra instance needs to be limited to keep latencies reasonable. Running multiple Cassandra instances lets us lower the load per instance, and can thus let us use the available CPU and IO resources more effectively without hitting the GC latency issues.

GCs don't scale to arbitrary heap sizes / garbage production rates without increasing pause times. This means that the load per Cassandra instance needs to be limited to keep latencies reasonable. Running multiple Cassandra instances lets us lower the load per instance, and can thus let us use the available CPU and IO resources more effectively without hitting the GC latency issues.

Heh, sounds like a hack to bypass a software problem. It is going to complicate configuration on multiple fronts btw, isn't it?

Heh, sounds like a hack to bypass a software problem.

More precisely, to bypass a Java design feature(TM) :P

It is going to complicate configuration on multiple fronts btw, isn't it?

Yup. Which is why we need to figure out together with you guys what the best course of action would be.

Re init scripts: we could consider using systemd instances (template units) to avoid having to create separate init scripts. This feature lets us spin up one instance per config file.

Edit: Moved this to T95253: Finish conversion to multiple Cassandra instances per hardware node.

What "does not have a lot of margin on IO bandwidth and storage capacity" mean exactly? Which resource is near exhaustion (IOPS, bandwidth or capacity) in your analysis and how long do you guys think we have until it becomes critical?

The most critical bit at this point is capacity, followed by bandwidth and IOPS in roughly equal parts.

Currently the nodes already use around 900G out of 2.5T. With several optimizations implemented last week and deletions of older content, the growth rate has slowed for now, but it is clear that even with only a single HTML render per edit revision we'll eventually use up all of the current storage space. We would also like to add additional content types and projects, but don't have a lot of spare storage capacity to accommodate this.

Do you have figures on the expected growth rate? In other words, how long can we expect the next batch of machines to last, so that we can plan accordingly?

Why do we need multiple Cassandra instances per hardware node?

GCs don't scale to arbitrary heap sizes / garbage production rates without increasing pause times. This means that the load per Cassandra instance needs to be limited to keep latencies reasonable. Running multiple Cassandra instances lets us lower the load per instance, and can thus let us use the available CPU and IO resources more effectively without hitting the GC latency issues.

Out of curiosity, how many do you think would make sense?

Do you have figures on the expected growth rate?

This depends on the factors mentioned in the summary. The disk space graphs in http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad are currently broken (T95627), but my guess would be that the current growth for HTML/data-parsoid content is < 10G/day. There are changes like LZMA compression (T93496) that could potentially reduce the rate by 1/2, so exact predictions are difficult.

We also don't have precise data on how much space storing all wikitext revisions will take, as we have never done so before. We can however start with current revisions, and then slowly add older revisions as capacity permits. We know that wikitext is a lot smaller than HTML.

We do have some flexibility to adjust the scope of what we store to fit the space we have. This means that we can avoid running out of space at the cost of pushing back things we'd like to do.

Out of curiosity, how many (instances) do you think would make sense?

We were considering 2-4.

Update: Current growth rates are back around 20G/day per node, primarily because of some recent changes in Parsoid output. With more aggressive render thinning we can probably stabilize things below 10G/day in the longer term, which at the current disk usage would give us at least 100 days of runway before we run out of space.

Update: Current growth rates are back around 20G/day per node, primarily because of some recent changes in Parsoid output. With more aggressive render thinning we can probably stabilize things below 10G/day in the longer term, which at the current disk usage would give us at least 100 days of runway before we run out of space.

Sounds good, so ~10G/day/machine with some variance on top, I suppose? Stored indefinitely? So yeah, a batch of 3 machines would last ~180 days.
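A minimal sketch of the runway arithmetic above; the usable headroom per new machine is an assumed figure for illustration, not something measured here:

```python
# Back-of-the-envelope runway: how long a new machine lasts if it fills up at
# a steady per-machine growth rate. The headroom figure is an assumption.
growth_gb_per_day = 10    # stabilized per-machine growth rate discussed above
headroom_gb = 1800        # assumption: ~1.8T of usable space per new node

runway_days = headroom_gb / growth_gb_per_day
print(f"~{runway_days:.0f} days per machine")   # -> ~180 days
```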

Do you have figures on the expected growth rate?
Out of curiosity, how many (instances) do you think would make sense?

We were considering 2-4.

OK, let's track the multiple-instances-per-node work in T95253.

So, assuming servers with single-instance Cassandra nodes, what does the capacity/procurement planning look like for a) the short term (this FY) and b) the long term (next FY, up to the next 14 months)? Keep codfw in mind for all of that.

So, assuming servers with single-instance Cassandra nodes, what does the capacity/procurement planning look like for a) the short term (this FY) and b) the long term (next FY, up to the next 14 months)? Keep codfw in mind for all of that.

I assume we have some fairly standard hardware options to choose from, is that information available somewhere? It might be useful to factor in the available hardware options (and corresponding cost) to find the sweet spot in terms of node size and quantity.

I assume we have some fairly standard hardware options to choose from, is that information available somewhere? It might be useful to factor in the available hardware options (and corresponding cost) to find the sweet spot in terms of node size and quantity.

For this kind of procurement, we generally build to order from Dell or HP. @RobH can share more about how our procurement process works, as well as help with pricing estimates.

Re requirements:

Storage is hard to predict exactly (still quite a few variables that can influence it), but I think it's safe to say at this point that we'll at least need to double the storage capacity in the next 14 months, assuming we continue to store one render per revision, and start adding additional content types and wikitext. I doubt that this would be sufficient to store all wikitext revisions, but that in turn depends heavily on other work like LZMA compression.

Based on our GC / latency data so far, each instance should have no more than 1T of storage (aiming for ~750G working load), along with 8+ cores and 16+G of RAM. If we go with one instance per hardware node, that means that we'd need 2-3x (4x if we added storage to existing nodes) as many hardware nodes as we have right now. As @Eevans says, to determine whether that's cost-effective we have to look at the numbers.
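To make those per-instance limits concrete, here is a hedged sketch of how a larger box would be carved up; the 3T / 64G node spec used here is only an example, not a decided configuration:

```python
# Sketch: how many Cassandra instances a single hardware node would need so
# that no instance exceeds the per-instance limits mentioned above. The node
# spec (3T SSD, 64G RAM) is an example, not a final configuration.
import math

node_storage_tb, node_ram_gb = 3, 64
max_storage_per_instance_tb = 1      # "no more than 1T of storage" per instance
min_ram_per_instance_gb = 16         # "16+G of RAM" per instance

min_instances = math.ceil(node_storage_tb / max_storage_per_instance_tb)  # 3
max_instances = node_ram_gb // min_ram_per_instance_gb                    # 4
print(f"{min_instances}-{max_instances} instances per node")  # within the 2-4 range discussed earlier
```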

See also the original hardware ticket at T76986.

If we're buying more, can we also figure out our plans (& procurement) for codfw? How would an additional DC fit in our plans? Would we still keep 3 replicas in eqiad (e.g. for I/O throughput reasons) or would we do 2+2? In any case, we should start procuring that soon and we should have the x2 factor in mind for budgetary purposes.

That's going to be governed by consistency requirements, I think. We use local quorum in eqiad right now by default, and that requires a minimum of 3 replicas (a 2+2 split would mean giving up DC-local quorum reads/writes, or resorting to global quorum and introducing cross-datacenter latencies).

Likewise, the minimum number of replicas in codfw will be influenced by the goals of that cluster, and in turn by the consistency level it takes to achieve them. What is the purpose of the cluster in codfw? Is it a read-only hot spare? Disaster recovery? Geo-aware active-active?

TL;DR I suspect we need 3 replicas in codfw, too.

I'm also in favor of staying with three replicas for now. We could consider writing with consistency TWO and reading with ONE, but this would likely increase latency for writes and reads (while increasing throughput). It also opens up the possibility of inconsistent reads if a single node is down temporarily and rejoins the cluster in a slightly outdated state.
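For reference, the quorum arithmetic behind the 2+2 vs. 3-replica discussion, sketched from standard Cassandra consistency semantics rather than from our actual configuration:

```python
# LOCAL_QUORUM needs floor(RF/2) + 1 replicas; a read is guaranteed to see the
# latest write only when read + write consistency levels together exceed RF.
def quorum(rf: int) -> int:
    return rf // 2 + 1

for rf in (2, 3):
    q = quorum(rf)
    print(f"RF={rf}: quorum={q}, tolerates {rf - q} unavailable replica(s)")
# RF=2: quorum=2, tolerates 0 -> any down node blocks local quorum
# RF=3: quorum=2, tolerates 1

# With RF=3, writes at TWO and reads at ONE give R + W = 3, which is not > RF,
# so a read can miss the latest write -- the inconsistent-read window above.
```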

What is the purpose of the cluster in codfw? Is it a read-only hot spare? Disaster recovery? Geo-aware active-active?

Eventually, the idea is to read from several replica DCs, while writing to a single primary DC per domain. In the short term, we'd get the ability to quickly switch over to codfw in case eqiad goes down.

@RobH, to evaluate the main options (single instance vs. multi-instance), could you get rough price estimates for the following? For comparison purposes, these pretend that we didn't have any hardware in eqiad. In reality we'll continue to use the existing nodes, especially if we continue to use the same kind of hardware (variant A). (A rough capacity cross-check of these variants is sketched after the lists below.)

Variant A: Multi-instance, large HW nodes

18 nodes with the following specs:

  • 64 G RAM
  • 16 cores, similar to current nodes
  • 10G ethernet
  • 3T SSD storage each (e.g. 3x Samsung 850 Evo, $400 each), no RAID controller or hot-swap bays necessary

Variant B: Single instance, small HW nodes

54 nodes with the following specs:

  • 16 G RAM
  • 6+ cores
  • 1G ethernet
  • 1T SSD storage each, no RAID controller or hot-swap bays necessary

Additionally, it would be nice to check a mid-way:

Variant C: Multi-instance, medium-sized HW nodes

28 nodes with the following specs:

  • 32-48 G RAM
  • 8-12 cores
  • 1G or 10G ethernet
  • 2T SSD storage each (e.g. 2x Samsung 850 Evo, $400 each), no RAID controller or hot-swap bays necessary
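A quick cross-check of the raw SSD capacity each variant provisions (simple multiplication of the figures above; the idea that they target a comparable total is an inference, not a stated requirement):

```python
# Raw SSD capacity per variant, from the node counts and per-node storage above.
variants = {
    "A (18 nodes x 3T)": 18 * 3,
    "B (54 nodes x 1T)": 54 * 1,
    "C (28 nodes x 2T)": 28 * 2,
}
for name, tb in variants.items():
    print(f"Variant {name}: {tb}T raw SSD")   # -> 54T, 54T, 56T
```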

Update from IRC: the RT ticket has another variant quote already. I've requested that Gabriel sync up with Filippo on a hardware specification discussion before I keep requesting more variants from our vendor(s).

We reviewed this request during our operations meeting earlier today. Going forward, we'd like Filippo, Gabriel, and possibly Faidon (if needed) to review the performance of the current systems and potential alternative specifications. Once we have a unified discussion, we can get more quotes generated.

Not trying to block progress, we just want to make sure we don't end up generating 10+ quotes, since each one requires some back and forth to leverage pricing (and can add 24 hours to each request).

Edit/addition: some of those 'Gabriel's should have been 'Services', since Marko was in the ops meeting, not Gabriel, etc.

Based on our GC / latency data so far, each instance should have no more than 1T of storage (aiming for ~750G working load), along with 8+ cores and 16+G of RAM. If we go with one instance per hardware node, that means that we'd need 2-3x (4x if we added storage to existing nodes) as many hardware nodes as we have right now. As @Eevans says, to determine whether that's cost-effective we have to look at the numbers.

So, given that we have 6 Cassandra nodes currently, how did you arrive at the quantity of 54 single-instance nodes below with that spec? Assuming 3x as many nodes as currently, times two data centers, we'd need 36 in total, right? What am I missing here?

What am I missing here?

The comparison is for slightly upgraded storage of 27T, equivalent to three additional (9 total) nodes of the current spec.

For comparison purposes, these pretend that we didn't have any hardware in eqiad. In reality we'll continue to use the existing nodes, especially if we continue to use the same kind of hardware (variant A).

Some sample systems from Dell, based on the storage need projections for the next fiscal year in T97692:

  • 8x 2.5" disk 1U hot-swap chassis
  • 12 cores
  • 64G RAM
  • 10G ethernet
  • no disks
  • $3900 + $1200 SSDs (3x Samsung 850 Evo)
    • 36T / 12 box upgrade for eqiad: $61200
    • 54T / 18 boxes for codfw: $91800

  • 8x 2.5" disk chassis, 1U
  • 8 cores
  • 32G RAM
  • 1G ethernet
  • no disks
  • $2500 + $800 SSDs (2x Samsung 850 Evo)
    • 36T / 18 box upgrade eqiad: $59400
    • 54T / 27 box cluster codfw: $89100

The latter system *might* work for a single-instance setup. We'd need to test max latencies / timeouts at load.
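The per-box and total figures above multiply out consistently; a minimal cross-check, assuming the SSDs are 1T drives (inferred from the capacity totals, not stated explicitly):

```python
# Cross-check of the two Dell sample configs: cost and raw SSD capacity for the
# eqiad upgrade and the codfw cluster, assuming 1T drives.
configs = [
    # (label, base $, SSD $, SSDs per box, eqiad boxes, codfw boxes)
    ("12-core / 64G / 10G", 3900, 1200, 3, 12, 18),
    ("8-core / 32G / 1G",   2500,  800, 2, 18, 27),
]
for label, base, ssd, ssds, eqiad, codfw in configs:
    per_box = base + ssd
    print(f"{label}: eqiad {eqiad * ssds}T for ${per_box * eqiad}, "
          f"codfw {codfw * ssds}T for ${per_box * codfw}")
# -> 36T for $61200 / 54T for $91800, and 36T for $59400 / 54T for $89100
```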

OK, as per the related T97692 we're aiming at 54T per cluster, thus +36T in eqiad and +54T in codfw, or 90T total, split among 30 new machines for option 1 or 45 for option 2 above (which corresponds roughly to the quotes we already have).

@GWicke re: option 2 above, how much hardware would we need to realistically test that?

re: option 2 above, how much hardware would we need to realistically test that?

@fgiunchedi, we are already getting close to that with the existing boxes. The largest one currently has a load of close to 1.1T, and realistically we won't run those nodes at more than 1.7T or so when there is 2T of storage total. The latency is still bearable in normal operation and with the currently fairly low request load. There are only a few dozen request timeouts per day. We also just tweaked the GC settings, which might help improve max latencies slightly more.

If we wanted an even more realistic scenario we could add a box with similar RAM / CPU specs and two SSDs to the existing cluster.

For the overall costs, it would be interesting to know an order-of-magnitude number for housing a 1U box for a couple of years.

I'll snag this and get the updated quotes put onto the RT ticket later today.

RobH raised the priority of this task from Medium to High. May 4 2015, 6:48 PM

Just discussed this in the ops meeting. The plan is to:

  1. order three of the smaller systems now (~R320 from https://phabricator.wikimedia.org/T93790#1251095), and see how single C* instances do on them, and later
  2. follow up with the bigger expansion using the data gained on those three nodes.

The order for these has been placed, and is tracked via https://rt.wikimedia.org/Ticket/Display.html?id=9337

I'll resolve this once the order has arrived and/or the new task for installation(s) and setup is linked.

The SSDs have an ETA of 2015-05-27.
The servers have an ETA of 2015-06-17.

RobH lowered the priority of this task from High to Medium. May 29 2015, 10:05 PM

The RESTBase servers and SSDs have arrived on-site.

RobH updated the task description. (Show Details)
GWicke moved this task from Backlog to In progress on the RESTBase board.

I'm resolving the hardware order request, as we are now working on implementation.

We have consolidated a lot of information from this task and others in https://wikitech.wikimedia.org/wiki/Cassandra/Hardware.

GWicke updated the task description. (Show Details)

This has been reopened; the wikitech page has details on the current specifications (we had a meeting with Services earlier today). The result is a new round of quote generation for review (not final ordering quite yet).

The new quotes are on https://rt.wikimedia.org/Ticket/Display.html?id=9506

Looking at the current quotes in the spreadsheet, it seems best to move forward with quote 712030866 for 3 systems. We could order with an extra drive carrier (712030866), allowing us to easily upgrade to more memory and an additional SSD later if we need to.

As discussed on the mail thread, my proposal is to go with 6 nodes (15 by end-of-year) with 8 cores, 96G RAM and 4 Samsung SSDs each. This variant gives us the best cost / performance ratio, and should allow us to reach our annual goals within the budget.

I updated the capacity estimation based on the latest information in T97692#1538147, and also updated the spreadsheet to match. The good news is that our optimizations (see T93751) managed to reduce our storage needs quite significantly. Without full revision storage, we might be able to get by with only nine boxes (2x8 core, 96G RAM) in codfw.

RobH changed the task status from Open to Stalled. Aug 26 2015, 11:54 PM

The approved config on https://rt.wikimedia.org/Ticket/Display.html?id=9506 has been submitted to the leasing vendor for order.

This order is estimated to ship out on 9/9 with a delivery date of 9/15.

https://rt.wikimedia.org/Ticket/Display.html?id=9506 has been resolved; the hardware has been racked and is now being set up via T112683.