
9x or 15x additional Cassandra/RESTBase nodes
Closed, Resolved · Public

Authored by Eevans · Jul 11 2016, 4:50 PM

Description

To accommodate the planned cluster expansions for the coming fiscal year, we will need additional machines (specs matching those of the existing production hardware):

  • 6x or 12x 16-way HPs (8 cores w/ hyperthreading?), 128 GB RAM, and 5 TB of SSD storage.
  • 3x the same, for the staging environment.

The last order of these was on T130218.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.

@RobH could you prepare quotes for this? Thanks!

RobH mentioned this in Unknown Object (Task). Jul 21 2016, 5:05 PM
RobH added a subtask: Unknown Object (Task).

In discussing this with @GWicke the question of quantity came up again. Since Cassandra is deployed across two data-centers with three racks each, the cluster needs to be upgraded in multiples of six (6, 12, 18, etc). If we have the budget for it, then expanding by 12 new machines now would save us a lot of repeated effort down the road.
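To make the rack math concrete, here is a minimal sketch (the topology constants mirror the 2 DCs × 3 racks described above; the balanced-per-rack growth constraint is what forces multiples of six):

# Illustrative only: why the cluster grows in multiples of six. With 2 data
# centers and 3 racks per DC, and every rack taking the same number of new
# nodes so replica placement stays balanced, the smallest increment is 2*3 = 6.

DATACENTERS = 2
RACKS_PER_DC = 3

def balanced_expansion_sizes(max_total=30):
    step = DATACENTERS * RACKS_PER_DC  # 6 nodes per balanced increment
    return list(range(step, max_total + 1, step))

print(balanced_expansion_sizes())  # [6, 12, 18, 24, 30]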

In the last order, we didn't need to order SSDs for all of the systems in codfw, since rebalancing freed up enough existing Samsung SSDs for 2 of the 3 new hosts. The third host received Intel S3610 SSDs, which are the operations standard for production SSDs.

Has there been any followup testing on the differences between these SSDs? Please advise.

@RobH: We don't have very conclusive data, as those nodes aren't seeing any read traffic. iowait from writes is lower on the Intel SSDs, but read latency looks about the same (based on compaction reads only).

To avoid blocking on this discussion, I would recommend getting quotes for both options.

Unfortunately, the Intel disks ended up in restbase2009.codfw.wmnet, where we don't typically see much traffic. However, I did some bootstraps in the containing rack over the last few days, and so do have some data.

Rack d consists of restbase2005, restbase2006, and restbase2009. As of July 27, each had two instances. On July 27 @ ~19:00 I began an unthrottled bootstrap of 2005-c. On July 28 @ ~20:30 I started a bootstrap of 2006-c, and on July 29 @ ~18:30 I began the final bootstrap of 2009-c.
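(For context, "unthrottled" here means lifting Cassandra's streaming throughput cap while the new instance joins; a minimal sketch of how that could be scripted, assuming nodetool is on the PATH. The host list and the 200 Mb/s figure used to restore the cap are illustrative, not a record of the actual commands run:)

# Sketch: remove the stream throughput cap on the nodes that stream data to the
# joining instance, then restore it once the bootstrap has finished.
import subprocess

SOURCE_HOSTS = ["restbase2005.codfw.wmnet", "restbase2006.codfw.wmnet"]  # illustrative

def set_stream_throughput(host, mb_per_sec):
    # nodetool setstreamthroughput 0 disables throttling entirely
    subprocess.run(["nodetool", "-h", host, "setstreamthroughput", str(mb_per_sec)],
                   check=True)

for host in SOURCE_HOSTS:
    set_stream_throughput(host, 0)    # unthrottled for the duration of the bootstrap
# ... start the new instance and let it bootstrap ...
for host in SOURCE_HOSTS:
    set_stream_throughput(host, 200)  # restore the stock Cassandra default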

The differences in iowait values are quite pronounced:

Screenshot from 2016-08-01 14-53-05.png (758×1 px, 105 KB)

As is throughput:

Screenshot from 2016-08-01 14-57-09.png (753×1 px, 149 KB)
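(For reference, the iowait plotted above is the share of CPU time the kernel spends waiting on outstanding I/O. A minimal sketch of sampling it directly on a host, independent of the dashboards used for these graphs:)

# Sketch: iowait percentage over a sampling interval, from the aggregate "cpu"
# line of /proc/stat (field order per proc(5): user nice system idle iowait ...).
import time

def read_cpu_times():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

def iowait_percent(interval=5.0):
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0  # index 4 == iowait

print("iowait: %.1f%%" % iowait_percent())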

GWicke renamed this task from "6x additional Cassandra/RESTBase nodes" to "9x or 15x additional Cassandra/RESTBase nodes". Aug 1 2016, 8:07 PM
GWicke updated the task description.

Ok, this just shifted the quantity from 6 to 12 (as previously mentioned), along with another 3 for staging.

I want to clarify: is this for 6 (or 12) at each site (codfw and eqiad), and then an additional 3 staging nodes located where?

Please advise,

@RobH: The current staging nodes are in eqiad, and so far the assumption has been that we would replace the old existing hardware.

Ok, so 6 or 12 nodes (depending on final pricing) for codfw and eqiad each, plus an additional 3 nodes in eqiad.

Thanks!

> Ok, so 6 or 12 nodes (depending on final pricing) for codfw and eqiad each, plus an additional 3 nodes in eqiad.

No, sorry, that would be 6 or 12 hosts total (over both DCs) + 3 in eqiad for staging.

So only half of each to each DC, plus the staging in eqiad.

Yes, cluster expansions have to be in multiples of 6 (we have to add nodes to each rack, and there are 3 racks in each DC). So the minimum is 6, and the next increment is 12. And then yes, we're interested in replacing staging (in eqiad) with 3 new nodes, and thought it might be easier for everyone to combine the orders (provided it's not cost-prohibitive).

To follow up on the DC question: The staging nodes could also be placed in codfw. The existing staging nodes in either DC won't have enough storage, so we'll likely need to drop one DC anyway.

During last week's Ops-Services sync meeting, there was some mention of soon-to-be decommissioned Varnish machines in esams that we might be able to repurpose as staging nodes. Could someone tell me the specs of these machines?

From IRC:

14:07 < urandom> robh: apparently there are some varnish servers that are coming down, or may have already come down, in esams.
                 They're old, and out of warranty, but it has been suggested that we might use them for restbase staging instead
                 of purchasing new machines.  I was curious if you knew anything about them.
14:08 <@robh> I don't sorry.  I recall you asking that on a task but I didnt know the answer
14:08 <@robh> I imagine either bblack or mark would know about esams decoms.

It seems it was @mark's idea to use these systems, so I'm assigning this request to him to field the question about using the esams Varnish decoms for staging.

Please feel free to assign back to me after input!

To summarize a conversation with @mark on IRC, there are 10 systems total, at least 3 of which satisfy the following:

  • 2 ea. Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz (6 cores ea., 12 total)
  • 96G RAM
  • 2x 300G SSDs
  • Dell H710 RAID controller (JBOD is not an option with this controller)

It sounds as if there might be some SSDs in the remaining 7 systems that could be scavenged and put into those 3, but no more than 2x 300G worth per machine (possibly less).

Note: As these machines are being hosted in esams, it will be a requirement that they do not host any publicly accessible services.

Note: We cannot expect the same level of data-center support for these nodes that we get in eqiad and codfw.

Small correction: JBOD is not an option on this controller. :) You have to make RAID arrays, although single-disk RAID0 can work.

Mentioned in SAL [2016-09-07T19:01:23Z] <urandom> T139961: Starting RESTBase htmldumper processes in codfw (read testing)

Mentioned in SAL [2016-09-07T19:40:53Z] <urandom> T139961: Actually starting RESTBase htmldumper processes in codfw (read testing)

Mentioned in SAL [2016-09-07T21:04:44Z] <urandom> T139961: Stopping RESTBase htmldumper in codfw

I ran some dumps from codfw in order to generate read traffic and determine whether the Intel SSDs out-perform the Samsungs on reads as well as writes. A total of 9 dumps were run; not enough to seriously stress the cluster, but enough to reach a request rate comparable to eqiad's.

Screenshot from 2016-09-07 15-59-51.png (709×1 px, 87 KB)

Screenshot from 2016-09-07 15-57-27.png (699×1 px, 91 KB)

Screenshot from 2016-09-07 16-03-39.png (684×1 px, 134 KB)

This is the only interesting plot IMO. There wasn't enough stress put on the cluster to really drive up read latency (on any of the hosts), but the effect on iowait does seem definitive.

Screenshot from 2016-09-07 15-59-00.png (693×1 px, 210 KB)

Here are some graphs of read latency during the test run:

p50:

pasted_file (1×1 px, 344 KB)

p95:

pasted_file (1×1 px, 686 KB)

p99:

pasted_file (1×1 px, 607 KB)

There is no noticeable difference in median read latencies.

We do see some differences in tail latencies. On average, p95 is 10 ms lower for the Intels, and p99 is ~50 ms lower.
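(For anyone reproducing this from raw numbers rather than the dashboards, a minimal sketch of a p50/p95/p99 comparison; the latency samples below are synthetic placeholders, not measurements from these hosts:)

# Sketch: nearest-rank percentiles over two sets of per-request read latencies (ms).
import random

def percentile(samples, p):
    ordered = sorted(samples)
    rank = max(1, round(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

random.seed(0)
samsung_ms = [random.gammavariate(2.0, 8.0) for _ in range(10000)]  # synthetic
intel_ms = [random.gammavariate(2.0, 6.0) for _ in range(10000)]    # synthetic

for label, data in (("samsung", samsung_ms), ("intel", intel_ms)):
    print(label, [round(percentile(data, p), 1) for p in (50, 95, 99)])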

From a user perspective, the main benefit of the Intels seems to be a reduction in tail latencies during writes.

The importance of this depends on the write volume, as well as the severity of the latency spikes. Write volume per disk is projected to drop significantly from cluster expansion and compression improvements.
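(A back-of-the-envelope sketch of that projection; the incoming write rate, replication factor, node counts, and compression ratios below are illustrative assumptions, not measured values:)

# Per-node disk write volume scales with replication and compression ratio, and
# inversely with the number of nodes per DC. All numbers here are illustrative.

def per_node_write_mbs(dc_write_mbs, replication_factor, nodes_in_dc, compression_ratio):
    return dc_write_mbs * replication_factor * compression_ratio / nodes_in_dc

before = per_node_write_mbs(dc_write_mbs=100.0, replication_factor=3,
                            nodes_in_dc=9, compression_ratio=0.5)
after = per_node_write_mbs(dc_write_mbs=100.0, replication_factor=3,
                           nodes_in_dc=12, compression_ratio=0.35)
print("per-node writes: %.1f MB/s -> %.1f MB/s" % (before, after))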

One factor we have ignored a bit in this discussion is the longer term plan to separate current revisions from archival storage (see T120171). Current revision storage handles the vast majority of accesses while using a small part of overall space, and thus benefits especially from performance improvements like better hardware & geo-distributed replication. Older revisions use up a lot of space, but have fewer accesses & lower latency requirements.

While we could still consider using rotating disks for archival storage, relative price trends, along with factors like performance and reliability, are making low/mid-range SSDs like the Samsungs an increasingly attractive option for that use case. Current revision storage, on the other hand, could make good use of performance improvements from faster SSDs, and possibly geo-replication.

With this in mind, I think it makes a lot of sense to stay with the Samsungs for the current cluster, and then consider faster SSDs like the Intels for current revision storage later.

In case the AMS procurement hasn't happened yet, it might make sense to also consider the old AQS nodes (see T147460) in eqiad for use as staging hosts. Specs are:

  • 48G RAM
  • 12 cores (24 threads)

These currently have rotating disks, so would need SSDs as well. There seem to be quite a few disk slots in these boxes, although likely 3.5".

During Thursday's (2016-10-13) ops-services-syncup meeting, a final decision was made to use Intel SSDs. To the best of my knowledge, there are no further blockers to proceeding here.

This task was actually filled months ago, and I neglected to clean up and resolve it (the actual service implementation was done via other tasks).

RobH closed subtask Unknown Object (Task) as Resolved. Jun 12 2017, 7:55 PM