Expand RESTBase cluster capacity
Closed, Resolved · Public

Description

Looking at the first production data, it appears that our current RESTBase Cassandra cluster does not have a lot of margin on IO bandwidth and storage capacity. With more tuning we might be able to stretch it a bit further, but it is clear that we'll need more capacity sooner rather than later.

We are leaning towards moving to a setup with multiple Cassandra instances per hardware node, aiming for a maximum compressed load of ~600G per instance. With 1T SSDs this could be configured as:

  • one instance per physical SSD, JBOD
  • all instances sharing a RAID-0 (current setup)
  • some other RAID level

JBOD would give us a good amount of failure isolation, but does not allow us to share the aggregate IO bandwidth between instances. RAID levels other than RAID-0 lose disk capacity, and are not strictly needed as the data is already replicated 3-way across the cluster.

For the current cluster, the main decision we have to make is whether we want to add one SSD to the existing nodes or not. (There is one slot left.)

See T97692 for the storage need projection for FY2015/16. Bottom line is: ~35T of additional storage in the next fiscal year, for a total of 53T per cluster.

From a replica placement perspective, it makes sense to add nodes in increments of three, one in each of the three rows.
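As a rough illustration of how the projection and the increments-of-three constraint translate into an order size (the per-node usable capacity below is an assumption for illustration, not a spec):

```python
# Rough sizing sketch (illustration only, not the actual plan): turn the
# projected additional storage into a node count, then round up to a
# multiple of three so each of the three rows gets the same increment.
import math

additional_tb = 35       # additional storage projected for FY2015/16 (T97692)
usable_per_node_tb = 3   # assumption: usable (compressed) capacity per new node

nodes = math.ceil(additional_tb / usable_per_node_tb)   # 12
nodes = math.ceil(nodes / 3) * 3                        # round up to full rows: 12
print(f"order at least {nodes} nodes")
```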

RT tracking of the current order: https://rt.wikimedia.org/Ticket/Display.html?id=9506

Event Timeline

GWicke triaged this task as Medium priority. Mar 24 2015, 7:16 PM
GWicke updated the task description. (Show Details)

What "does not have a lot of margin on IO bandwidth and storage capacity" mean exactly? Which resource is near exhaustion (IOPS, bandwidth or capacity) in your analysis and how long do you guys think we have until it becomes critical?

If we're buying more, can we also figure out our plans (& procurement) for codfw? How would an additional DC fit in our plans? Would we still keep 3 replicas in eqiad (e.g. for I/O throughput reasons) or would we do 2+2? In any case, we should start procuring that soon and we should have the x2 factor in mind for budgetary purposes.

Why do we need multiple Cassandra instances per hardware node?

What "does not have a lot of margin on IO bandwidth and storage capacity" mean exactly? Which resource is near exhaustion (IOPS, bandwidth or capacity) in your analysis and how long do you guys think we have until it becomes critical?

The most critical bit at this point is capacity, followed by bandwidth and IOPS in roughly equal parts.

Currently the nodes already use around 900G out of 2.5T. With several optimizations implemented last week and deletions of older content, the growth rate has slowed for now, but it is clear that even with only a single HTML render per edit revision we'll eventually use up all of the current storage space. We would also like to add additional content types and projects, but don't have a lot of spare storage capacity to accommodate this.

Cassandra write workloads are essentially bandwidth-bound. We managed to lower write-driven iowait to (typically) below 1% by enabling the trickle_fsync feature, which forces constant flushing.

We do see high IOPS rates during random reads and repairs, along with an increase in iowait and read latency while they run. While the disks are not maxed out all the time, it looks likely that some of the latency increase we are seeing is caused by limited IOPS and bandwidth. We have stopped running periodic anti-entropy repairs for now to avoid these latency increases. It would be nice to have enough headroom to run them again.

If we're buying more, can we also figure out our plans (& procurement) for codfw? How would an additional DC fit in our plans? Would we still keep 3 replicas in eqiad (e.g. for I/O throughput reasons) or would we do 2+2? In any case, we should start procuring that soon and we should have the x2 factor in mind for budgetary purposes.

We currently have six nodes in eqiad, and would like to add three for a total of nine. For codfw we could either set up an identical cluster, or look for possible savings with slightly cheaper hardware (1G networking might actually be sufficient after all). We'll need the same amount of SSD space if we are going to replicate to codfw.

Why do we need multiple Cassandra instances per hardware node?

GCs don't scale to arbitrary heap sizes / garbage production rates without increasing pause times. This means that the load per Cassandra instance needs to be limited to keep latencies reasonable. Running multiple Cassandra instances lets us lower the load per instance, and can thus let us use the available CPU and IO resources more effectively without hitting the GC latency issues.

GCs don't scale to arbitrary heap sizes / garbage production rates without increasing pause times. This means that the load per Cassandra instance needs to be limited to keep latencies reasonable. Running multiple Cassandra instances lets us lower the load per instance, and can thus let us use the available CPU and IO resources more effectively without hitting the GC latency issues.

Heh, sounds like a hack to bypass a software problem. It is going to complicate configuration on multiple fronts btw, isn't it?

Heh, sounds like a hack to bypass a software problem.

More precisely, to bypass a Java design feature(TM) :P

It is going to complicate configuration on multiple fronts btw, isn't it?

Yup. Which is why we need to figure out together with you guys what the best course of action would be.

Re init scripts: we could consider using systemd instances (template units) to avoid having to create separate init scripts. This feature lets us spin up one instance per config file.

Edit: Moved this to T95253: Finish conversion to multiple Cassandra instances per hardware node.

What "does not have a lot of margin on IO bandwidth and storage capacity" mean exactly? Which resource is near exhaustion (IOPS, bandwidth or capacity) in your analysis and how long do you guys think we have until it becomes critical?

The most critical bit at this point is capacity, followed by bandwidth and IOPS in roughly equal parts.

Currently the nodes already use around 900G out of 2.5T. With several optimizations implemented last week and deletions of older content, the growth rate has slowed for now, but it is clear that even with only a single HTML render per edit revision we'll eventually use up all of the current storage space. We would also like to add additional content types and projects, but don't have a lot of spare storage capacity to accommodate this.

Do you have figures on the expected growth rate? In other words, how long can we expect the next batch of machines to last, so that we can plan accordingly?

Why do we need multiple Cassandra instances per hardware node?

GCs don't scale to arbitrary heap sizes / garbage production rates without increasing pause times. This means that the load per Cassandra instance needs to be limited to keep latencies reasonable. Running multiple Cassandra instances lets us lower the load per instance, and can thus let us use the available CPU and IO resources more effectively without hitting the GC latency issues.

Out of curiosity, how many do you think would make sense?

Do you have figures on the expected growth rate?

This depends on the factors mentioned in the summary. The disk space graphs in http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad are currently broken (T95627), but my guess would be that the current growth for HTML/data-parsoid content is < 10G/day. There are changes like LZMA compression (T93496) that could potentially reduce the rate by 1/2, so exact predictions are difficult.

We also don't have precise data on how much space storing all wikitext revisions will take, as we have never done so before. We can however start with current revisions, and then slowly add older revisions as capacity permits. We know that wikitext is a lot smaller than HTML.

We do have some flexibility to adjust the scope of what we store to fit the space we have. This means that we can avoid running out of space at the cost of pushing back things we'd like to do.

Out of curiosity, how many (instances) do you think would make sense?

We were considering 2-4.

Update: Current growth rates are back around 20G/day per node, primarily because of some recent changes in Parsoid output. With more aggressive render thinning we can probably stabilize things below 10G/day in the longer term, which at the current disk usage would give us at least 100 days of runway before we run out of space.

Update: Current growth rates are back around 20G/day per node, primarily because of some recent changes in Parsoid output. With more aggressive render thinning we can probably stabilize things below 10G/day in the longer term, which at the current disk usage would give us at least 100 days of runway before we run out of space.

Sounds good, so ~10G/day/machine with some variance on top, I suppose? Stored indefinitely? So yeah, a batch of 3 machines would last ~180 days.
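A minimal sketch of the runway arithmetic above; the usable headroom per new machine is an assumed figure for illustration, not something measured here:

```python
# Back-of-the-envelope runway: how long a new machine lasts if it fills up at
# a steady per-machine growth rate. The headroom figure is an assumption.
growth_gb_per_day = 10    # stabilized per-machine growth rate discussed above
headroom_gb = 1800        # assumption: ~1.8T of usable space per new node

runway_days = headroom_gb / growth_gb_per_day
print(f"~{runway_days:.0f} days per machine")   # -> ~180 days
```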

Do you have figures on the expected growth rate?
Out of curiosity, how many (instances) do you think would make sense?

We were considering 2-4.

OK, let's track the multiple-instances-per-node work in T95253.

So, assuming servers with single-instance Cassandra nodes, what does the capacity/procurement planning look like for a) the short term (this FY) and b) the long term (next FY, up to the next 14 months)? Keep codfw in mind for all of that.

So, assuming servers with single-instance Cassandra nodes, what does the capacity/procurement planning look like for a) the short term (this FY) and b) the long term (next FY, up to the next 14 months)? Keep codfw in mind for all of that.

I assume we have some fairly standard hardware options to choose from, is that information available somewhere? It might be useful to factor in the available hardware options (and corresponding cost) to find the sweet spot in terms of node size and quantity.

I assume we have some fairly standard hardware options to choose from, is that information available somewhere? It might be useful to factor in the available hardware options (and corresponding cost) to find the sweet spot in terms of node size and quantity.

For this kind of procurement, we generally build to order from Dell or HP. @RobH can share more about how our procurement process works, as well as help with pricing estimates.

Re requirements:

Storage is hard to predict exactly (still quite a few variables that can influence it), but I think it's safe to say at this point that we'll at least need to double the storage capacity in the next 14 months, assuming we continue to store one render per revision, and start adding additional content types and wikitext. I doubt that this would be sufficient to store all wikitext revisions, but that in turn depends heavily on other work like LZMA compression.

Based on our GC / latency data so far, each instance should have no more than 1T of storage (aiming for ~750G working load), along with 8+ cores and 16+G of RAM. If we go with one instance per hardware node, that means that we'd need 2-3x (4x if we added storage to existing nodes) as many hardware nodes as we have right now. As @Eevans says, to determine whether that's cost-effective we have to look at the numbers.
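To make those per-instance limits concrete, here is a hedged sketch of how a larger box would be carved up; the 3T / 64G node spec used here is only an example, not a decided configuration:

```python
# Sketch: how many Cassandra instances a single hardware node would need so
# that no instance exceeds the per-instance limits mentioned above. The node
# spec (3T SSD, 64G RAM) is an example, not a final configuration.
import math

node_storage_tb, node_ram_gb = 3, 64
max_storage_per_instance_tb = 1      # "no more than 1T of storage" per instance
min_ram_per_instance_gb = 16         # "16+G of RAM" per instance

min_instances = math.ceil(node_storage_tb / max_storage_per_instance_tb)  # 3
max_instances = node_ram_gb // min_ram_per_instance_gb                    # 4
print(f"{min_instances}-{max_instances} instances per node")  # within the 2-4 range discussed earlier
```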

See also the original hardware ticket at T76986.

If we're buying more, can we also figure out our plans (& procurement) for codfw? How would an additional DC fit in our plans? Would we still keep 3 replicas in eqiad (e.g. for I/O throughput reasons) or would we do 2+2? In any case, we should start procuring that soon and we should have the x2 factor in mind for budgetary purposes.

That's going to be governed by consistency requirements, I think. We use local quorum in eqiad right now by default, and that requires a minimum of 3 replicas (a 2+2 split would mean giving up DC-local quorum reads/writes, or resorting to global quorum and introducing cross-datacenter latencies).

Likewise, the minimum number of replicas in codfw will be influenced by the goals of that cluster, and in turn by the consistency level it takes to achieve them. What is the purpose of the cluster in codfw? Is it a read-only hot spare? Disaster recovery? Geo-aware active-active?

TL;DR I suspect we need 3 replicas in codfw, too.

I'm also in favor of staying with three replicas for now. We could consider writing with consistency TWO and reading with ONE, but this would likely increase latency for writes and reads (while increasing throughput). It also opens up the possibility of inconsistent reads if a single node is down temporarily and rejoins the cluster in a slightly outdated state.
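For reference, the quorum arithmetic behind the 2+2 vs. 3-replica discussion, sketched from standard Cassandra consistency semantics rather than from our actual configuration:

```python
# LOCAL_QUORUM needs floor(RF/2) + 1 replicas; a read is guaranteed to see the
# latest write only when read + write consistency levels together exceed RF.
def quorum(rf: int) -> int:
    return rf // 2 + 1

for rf in (2, 3):
    q = quorum(rf)
    print(f"RF={rf}: quorum={q}, tolerates {rf - q} unavailable replica(s)")
# RF=2: quorum=2, tolerates 0 -> any down node blocks local quorum
# RF=3: quorum=2, tolerates 1

# With RF=3, writes at TWO and reads at ONE give R + W = 3, which is not > RF,
# so a read can miss the latest write -- the inconsistent-read window above.
```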

What is the purpose of the cluster in codfw? Is it a read-only hot spare? Disaster recovery? Geo-aware active-active?

Eventually, the idea is to read from several replica DCs, while writing to a single primary DC per domain. In the short term, we'd get the ability to quickly switch over to codfw in case eqiad goes down.

@RobH, to evaluate the main options (single instance vs. multi-instance), could you get rough price estimates for the following? For comparison purposes, these pretend that we didn't have any hardware in eqiad. In reality we'll continue to use the existing nodes, especially if we continue to use the same kind of hardware (variant A). (A rough capacity cross-check of these variants is sketched after the lists below.)

Variant A: Multi-instance, large HW nodes

18 nodes with the following specs:

  • 64 G RAM
  • 16 cores, similar to current nodes
  • 10G ethernet
  • 3T SSD storage each (e.g. 3x Samsung 850 Evo, $400 each), no RAID controller or hot-swap bays necessary

Variant B: Single instance, small HW nodes

54 nodes with the following specs:

  • 16 G RAM
  • 6+ cores
  • 1G ethernet
  • 1T SSD storage each, no RAID controller or hot-swap bays necessary

Additionally, it would be nice to check a mid-way:

Variant C: Multi-instance, medium-sized HW nodes

28 nodes with the following specs:

  • 32-48 G RAM
  • 8-12 cores
  • 1G or 10G ethernet
  • 2T SSD storage each (e.g. 2x Samsung 850 Evo, $400 each), no RAID controller or hot-swap bays necessary
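A quick cross-check of the raw SSD capacity each variant provisions (simple multiplication of the figures above; the idea that they target a comparable total is an inference, not a stated requirement):

```python
# Raw SSD capacity per variant, from the node counts and per-node storage above.
variants = {
    "A (18 nodes x 3T)": 18 * 3,
    "B (54 nodes x 1T)": 54 * 1,
    "C (28 nodes x 2T)": 28 * 2,
}
for name, tb in variants.items():
    print(f"Variant {name}: {tb}T raw SSD")   # -> 54T, 54T, 56T
```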

Update from IRC: the RT ticket has another variant quote already. I've requested that Gabriel sync up with Filippo on a hardware specification discussion before I keep requesting more variants from our vendor(s).

We reviewed this request during our operations meeting earlier today. Going forward, we'd like Filippo, Gabriel, and possibly Faidon (if needed) to review the performance of the current systems and potential alternative specifications. Once we have a unified discussion, we can get more quotes generated.

Not trying to block progress, we just want to make sure we don't end up generating 10+ quotes, since each one requires some back and forth to leverage pricing (and can add 24 hours to each request).

Edit/addition: some of those 'Gabriel's should have been 'Services', since Marko was in the ops meeting, not Gabriel, etc.

Based on our GC / latency data so far, each instance should have no more than 1T of storage (aiming for ~750G working load), along with 8+ cores and 16+G of RAM. If we go with one instance per hardware node, that means that we'd need 2-3x (4x if we added storage to existing nodes) as many hardware nodes as we have right now. As @Eevans says, to determine whether that's cost-effective we have to look at the numbers.

So, given that we have 6 Cassandra nodes currently, how did you arrive at the quantity of 54 single-instance nodes below with that spec? Assuming 3x as many nodes as currently, times two data centers, we'd need 36 in total, right? What am I missing here?

What am I missing here?

The comparison is for slightly upgraded storage of 27T, equivalent to three additional (9 total) nodes of the current spec.

For comparison purposes, these pretend that we didn't have any hardware in eqiad. In reality we'll continue to use the existing nodes, especially if we continue to use the same kind of hardware (variant A).

Some sample systems from Dell, based on the storage need projections for the next fiscal year in T97692:

  • 8x 2.5" disk 1U hot-swap chassis
  • 12 cores
  • 64G RAM
  • 10G ethernet
  • no disks
  • $3900 + $1200 SSDs (3x Samsung 850 Evo)
    • 36T / 12 box upgrade for eqiad: $61200
    • 54T / 18 boxes for codfw: $91800

  • 8x 2.5" disk chassis, 1U
  • 8 cores
  • 32G RAM
  • 1G ethernet
  • no disks
  • $2500 + $800 SSDs (2x Samsung 850 Evo)
    • 36T / 18 box upgrade eqiad: $59400
    • 54T / 27 box cluster codfw: $89100

The latter system *might* work for a single-instance setup. We'd need to test max latencies / timeouts at load.
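The per-box and total figures above multiply out consistently; a minimal cross-check, assuming the SSDs are 1T drives (inferred from the capacity totals, not stated explicitly):

```python
# Cross-check of the two Dell sample configs: cost and raw SSD capacity for the
# eqiad upgrade and the codfw cluster, assuming 1T drives.
configs = [
    # (label, base $, SSD $, SSDs per box, eqiad boxes, codfw boxes)
    ("12-core / 64G / 10G", 3900, 1200, 3, 12, 18),
    ("8-core / 32G / 1G",   2500,  800, 2, 18, 27),
]
for label, base, ssd, ssds, eqiad, codfw in configs:
    per_box = base + ssd
    print(f"{label}: eqiad {eqiad * ssds}T for ${per_box * eqiad}, "
          f"codfw {codfw * ssds}T for ${per_box * codfw}")
# -> 36T for $61200 / 54T for $91800, and 36T for $59400 / 54T for $89100
```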

OK, as per the related T97692 we're aiming at 54T per cluster, thus +36T in eqiad and +54T in codfw, or 90T total, split among 30 new machines for option 1 or 45 for option 2 above (which corresponds roughly to the quotes we already have).

@GWicke re: option 2 above, how much hardware would we need to realistically test that?

re: option 2 above, how much hardware would we need to realistically test that?

@fgiunchedi, we are already getting close to that with the existing boxes. The largest one currently has a load of close to 1.1T, and realistically we won't run those nodes at more than 1.7T or so when there is 2T of storage total. The latency is still bearable in normal operation and with the currently fairly low request load. There are only a few dozen request timeouts per day. We also just tweaked the GC settings, which might help improve max latencies slightly more.

If we wanted an even more realistic scenario we could add a box with similar RAM / CPU specs and two SSDs to the existing cluster.

For the overall costs, it would be interesting to know an order-of-magnitude number for housing a 1U box for a couple of years.

I'll snag this and get the updated quotes put onto the RT ticket later today.

RobH raised the priority of this task from Medium to High. May 4 2015, 6:48 PM

Just discussed this in the ops meeting. The plan is to:

  1. order three of the smaller systems now (~R320 from https://phabricator.wikimedia.org/T93790#1251095), and see how single C* instances do on them, and later
  2. follow up with the bigger expansion using the data gained on those three nodes.

The order for these has been placed, and is tracked via https://rt.wikimedia.org/Ticket/Display.html?id=9337

I'll resolve this once the order has arrived and/or the new task for installation(s) and setup is linked.

The SSDs have an ETA of 2015-05-27.
The servers have an ETA of 2015-06-17.

RobH lowered the priority of this task from High to Medium. May 29 2015, 10:05 PM

The RESTBase servers and SSDs have arrived on-site.

RobH updated the task description. (Show Details)
GWicke moved this task from Backlog to In progress on the RESTBase board.

I'm resolving the hardware order request, as we are now working on implementation.

We have consolidated a lot of information from this task and others in https://wikitech.wikimedia.org/wiki/Cassandra/Hardware.

GWicke updated the task description. (Show Details)

This has been reopened; the wikitech page has details on the current specifications (we had a meeting with Services earlier today). The result is a new round of quote generation for review (not final ordering quite yet).

The new quotes are on https://rt.wikimedia.org/Ticket/Display.html?id=9506

Looking at the current quotes in the spreadsheet, it seems best to move forward with quote 712030866 for 3 systems. We could order with an extra drive carrier (712030866), allowing us to easily upgrade to more memory and an additional SSD later if we need to.

As discussed on the mail thread, my proposal is to go with 6 nodes (15 by end-of-year) with 8 cores, 96G RAM and 4 Samsung SSDs each. This variant gives us the best cost / performance ratio, and should allow us to reach our annual goals within the budget.

I updated the capacity estimation based on the latest information in T97692#1538147, and also updated the spreadsheet to match. The good news is that our optimizations (see T93751) managed to reduce our storage needs quite significantly. Without full revision storage, we might be able to get by with only nine boxes (2x8 core, 96G RAM) in codfw.

RobH changed the task status from Open to Stalled. Aug 26 2015, 11:54 PM

The approved config on https://rt.wikimedia.org/Ticket/Display.html?id=9506 has been submitted to the leasing vendor for order.

This order is estimated to ship out on 9/9 with a delivery date of 9/15.

https://rt.wikimedia.org/Ticket/Display.html?id=9506 has been resolved; the hardware has been racked and is now being set up via T112683.