
Staging / Test environment(s) for RESTBase
Closed, ResolvedPublic

Description

The staging environment for RESTBase is currently a special case in that it is hosted on real hardware in the production network. This was done because changes being staged often require access to resources comparable to production, a requirement that cannot be (easily) met in labs. Unfortunately, this environment is still too far from production specs to be useful in many cases.

For example:

  • Available per-instance storage isn't enough, even though we only load data for a single wiki. We typically do not have enough working space to, for example, perform operations that rewrite SSTables (scrub, upgradesstables, etc.); see the sketch after this list.
  • The amount of memory is considerably lower than what is available in production, requiring us to use heap sizes and GC settings that do not even approximate what is used in production.
  • The specs are quite different across the cluster, preventing the use of a homogeneous configuration (for example, eqiad machines are only able to host a single instance, while codfw machines can host 2).
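To make the working-space point concrete, here is a minimal sketch of the headroom arithmetic, assuming (as a worst case) that an SSTable-rewriting operation can temporarily need free space comparable to the size of the data being rewritten. The volume and data sizes are illustrative assumptions, not measurements from the staging hosts.

```python
# Rough headroom check for SSTable-rewriting operations (scrub, upgradesstables,
# major compaction). Worst case, old and new SSTables coexist until the rewrite
# completes, so free space comparable to the live data may be needed.
# All figures are illustrative assumptions, not measured values.

def headroom_ok(disk_capacity_gb: float, live_data_gb: float,
                rewrite_factor: float = 1.0) -> bool:
    """True if a rewrite needing rewrite_factor x live data of temporary
    space still fits in the remaining capacity."""
    free_gb = disk_capacity_gb - live_data_gb
    return free_gb >= rewrite_factor * live_data_gb

# Hypothetical single-wiki staging instance: 400G loaded on a 500G volume.
print(headroom_ok(disk_capacity_gb=500, live_data_gb=400))   # False: rewrite won't fit
print(headroom_ok(disk_capacity_gb=2000, live_data_gb=400))  # True: a 2T SSD leaves room
```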

When the staging environment comes up short, we do our best, then move the remainder of testing to production (and I fear that this has become so routine at this point that we don't even stop to acknowledge that this is what we are doing).

In addition to the Staging use-case, there is the more development-focused, ad hoc testing. The requirements here are for all intents and purposes the same, but for lack of better options, we typically perform this sort of testing in the staging environment. This isn't awesome either; ideally we'd maintain the staging environment identically to production, with code and configuration that we expect to move to production RSN (i.e. changes staged for production).

Ask

Staging

6 dedicated, always-on hosts that can be deployed in a multi-datacenter configuration (in a perfect world, 3 in one data-center and 3 in another, though this could be simulated easily enough), at roughly half the specs we're using in production (8-core, 48-64G RAM, 2T SSDs).

This could be older, disparate hardware no longer under warranty.

Rationale

We need to emulate the multi-datacenter configuration used in production, including the ability to apply DC-local quorums, which mandates a minimum of 6 hosts. Back-of-the-napkin: if the current production hosts have the resources to host 4 instances (which seems likely at this point), then machines with half the specs should be able to host 2 instances (the minimum needed to properly test multi-instance configurations).
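For reference, a small sketch of the quorum arithmetic behind the 6-host minimum. The replication factor of 3 per data-center is an assumption made here to mirror a typical NetworkTopologyStrategy setup, not a figure stated elsewhere in this task.

```python
# With 3 replicas per data-center (assumed), LOCAL_QUORUM = floor(RF/2) + 1 = 2,
# so each DC needs at least 3 nodes to keep serving quorum requests with one
# node down, and two DCs are needed to exercise DC-local consistency at all.

REPLICAS_PER_DC = 3   # assumption: production-style replication factor
DATA_CENTERS = 2      # e.g. an eqiad-like and a codfw-like DC

def local_quorum(rf: int) -> int:
    return rf // 2 + 1

nodes_per_dc = REPLICAS_PER_DC            # at least RF distinct nodes per DC
min_hosts = nodes_per_dc * DATA_CENTERS   # -> 6
down_tolerated = nodes_per_dc - local_quorum(REPLICAS_PER_DC)  # -> 1

print(f"LOCAL_QUORUM={local_quorum(REPLICAS_PER_DC)}, "
      f"minimum hosts={min_hosts}, nodes per DC that may be down={down_tolerated}")
```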

Testing

Generally, the machine requirements here match those of Staging above, with some exceptions. For example, depending on the nature of the test, it might be the case that a full 6 machines aren't needed. Most performance tests could be done with 3 machines (6 would largely be for tests of a functional nature).

Additionally, ephemeral configurations would be acceptable, and in some ways even desirable. Unlike the Staging use-case, which is meant to emulate production, the need for a test environment exists only for the duration of the testing, and ideally it'd be easy to re-baseline the environment between tests. Obviously, all of this is possible with a dedicated, static environment, but it may also open the door to other solutions.
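As a rough illustration of what re-baselining between tests could look like on a dedicated host, here is a minimal sketch; the service unit name and data directories are placeholders, not the actual staging layout.

```python
# Hypothetical re-baseline of a test Cassandra instance between runs: stop the
# service, wipe its on-disk state, and restart so the node comes up clean.
# Paths and unit name are placeholders, not the real staging configuration.
import shutil
import subprocess
from pathlib import Path

SERVICE = "cassandra"                        # placeholder systemd unit name
DATA_DIRS = [Path("/srv/cassandra/data"),    # placeholder data directories
             Path("/srv/cassandra/commitlog")]

def rebaseline() -> None:
    subprocess.run(["systemctl", "stop", SERVICE], check=True)
    for d in DATA_DIRS:
        if d.exists():
            shutil.rmtree(d)     # drop all on-disk state
        d.mkdir(parents=True)    # recreate empty, ready for a clean start
    subprocess.run(["systemctl", "start", SERVICE], check=True)

if __name__ == "__main__":
    rebaseline()
```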

Event Timeline

faidon triaged this task as Medium priority. May 26 2016, 5:41 PM
faidon added a project: procurement.
faidon added a subscriber: mark.

We don't have spare or older hardware close to the requested specification of a single CPU (not stated, assumed) with 8 cores, 48-64G RAM, and 2T SSDs. The large RAM and SSD storage are out of spec for any spare systems in both locations. So this would likely have to be a specific purchase to meet the requested specifications.

We have the old restbase1001-1003 left in eqiad, which exceed the CPU requirement and meet the RAM requirement. They have no disks installed, so disks would have to be purchased. We don't have any such spare machines in codfw.

In reviewing this request, it isn't clear to me how these machines would be administered.

Would these be normal production machines, on production vlans, and handled by operations and puppet?

The repeated use of testing and emulation makes me think this would be something that wouldn't fall into production, and then would be a labs candidate. Is this something that would be better handled with labs on bare metal?

> In reviewing this request, it isn't clear to me how these machines would be administered.

> Would these be normal production machines, on production vlans, and handled by operations and puppet?

I think we're at an early enough stage that nothing has been decided, but not having them on the production vlans has already been listed as a feature. And the staging nodes in particular would need to be operated as closely as possible to how the production environment is, so "handled by Ops": TBD; "managed by Puppet": most definitely.

> The repeated use of testing and emulation makes me think this would be something that wouldn't fall into production, and then would be a labs candidate. Is this something that would be better handled with labs on bare metal?

Bare-metal-labs is something that came up. As I remember, there were some unknowns here, but it wasn't ruled out.

I think that covers the basics! I'll create a procurement task as a blocker to this, and gather some pricing info.

A good first step here might be the procurement of 3 additional machines, instead of the full 6. This doesn't provide parity with our multi-datacenter production configuration, but it would be adequate for the majority of our performance testing (it would, for example, allow us to properly test T125904: Brotli compression for Cassandra).

Additionally, T139961 was submitted for the order of 6 (or 12) new machines for a production cluster expansion. If there aren't significant savings to be had from the smaller specs listed above, perhaps it would be easiest to further expand that procurement by an additional 3 hosts (to either 9 or 15 machines total).

Another option for large-scale Cassandra testing that we can pursue in parallel is using cloud infrastructure like GCE. A recent demo showed about 1000 Cassandra nodes on GCE with the new Kubernetes PetSet abstraction (see also T136385: Research: Investigate Cassandra Kubernetification using upcoming PetSet abstraction in K8s 1.3). Additionally, there is https://github.com/scylladb/cassandra-test-and-deploy, a framework for testing Cassandra and ScyllaDB on EC2.

One option that has been made available:

There are 10 Varnish machines that are coming down in esams, 3 of which could be made available to us (see T139961#2541110).

Pros:

  • Basically free; these machines would likely need to have disks ordered for them, but otherwise would not cost us anything

Cons:

  • Owing to their location, data-center support would be minimal
  • The machines are 4+ years old and entirely out of warranty (spares?)
  • We can only get 3, so either we'd have to reduce to a single data-center configuration for now, or somehow "fake" a multi-datacenter config (I'm not sure how that would look)
  • Generation of traffic, collection of metrics, latency measurements, etc would all need to traverse The Pond (though perhaps this isn't an issue in practice)
  • Owing to their location, we'd need to be more careful with respect to exposed services and PII

Disk options (+costs), assuming >= 2T usable as specified above:

| Option | Usable space | Cost (per disk × count) | Cost/GB | Per machine | Total |
|---|---|---|---|---|---|
| 3 Samsungs | ~2.7T | $514.31 * 3 | ~$0.56 | $1542.93 | $4628.79 |
| 4 Samsungs w/ 20% space reserved | ~2.8T | $514.31 * 4 | ~$0.72 | $2057.24 | $6171.72 |
| 2 Intels | ~2.9T | $1069.20 * 2 | ~$0.72 | $2138.40 | $6415.20 |
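For clarity, the Cost/GB column is the per-machine disk cost divided by the usable space. The quick check below just reproduces the figures quoted in the table (treating 1T as 1024G), assuming 3 machines per option.

```python
# Sanity check of the table above: cost/GB = per-machine disk cost / usable space,
# total = per-machine cost * 3 machines. Prices and capacities are the quoted ones.
options = [
    ("3 Samsungs",                       2.7, 514.31 * 3),
    ("4 Samsungs w/ 20% space reserved", 2.8, 514.31 * 4),
    ("2 Intels",                         2.9, 1069.20 * 2),
]

for name, usable_tb, per_machine in options:
    cost_per_gb = per_machine / (usable_tb * 1024)
    print(f"{name}: ${per_machine:.2f}/machine, ~${cost_per_gb:.2f}/GB, "
          f"${per_machine * 3:.2f} total for 3 machines")
```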

> Owing to their location, data-center support would be minimal

This looks like the biggest issue to me. Since the staging cluster expansion is blocking brotli testing, it would be great to make progress on this soonish.

@mark is effectively the only on-site engineer for AMS, and has many other things on his plate. This means that it might be harder to predict how long it would take to install new SSDs.

Another option I am wondering about is upgrading storage in the existing codfw staging nodes. What is the CPU / RAM in those like?

> Owing to their location, data-center support would be minimal

> This looks like the biggest issue to me. Since the staging cluster expansion is blocking brotli testing, it would be great to make progress on this soonish.

> @mark is effectively the only on-site engineer for AMS, and has many other things on his plate. This means that it might be harder to predict how long it would take to install new SSDs.

> Another option I am wondering about is upgrading storage in the existing codfw staging nodes. What is the CPU / RAM in those like?

Dell PowerEdge R420
2 @ Intel(R) Xeon(R) CPU E5-2440 0 @ 2.40GHz (6 core (12 total), 12 threads (24 total))
32GB memory
2 @ 500GB disks (7200RPM SATA)

So we'd need disks (as with esams), and memory as well.

Eevans created subtask Unknown Object (Task). Aug 25 2016, 8:29 PM

From https://phabricator.wikimedia.org/T139961#2541110 re: the available AMS nodes:

To summarize a conversation with @mark on IRC, there are 10 systems total, at least 3 of which satisfy the following:

However, Analytics has recently decommissioned their legacy AQS cluster consisting of 3 nodes:

Compared to the esams nodes, these have slower procs, are on the lower end of the memory requirement, and would still require that we purchase disks; there may also be an issue with the performance of the RAID controller (which I am told is still required, even in a JBOD/software RAID configuration). That said, they are located in eqiad, not esams, and so do not suffer the same issues vis-a-vis an on-site engineer and PII sensitivity.

RAID performance issues notwithstanding, would this be an easier/better option for Ops?

GWicke edited projects, added Services (next); removed Services.

@mark, @faidon, @RobH: Could you comment on the AQS cluster option? The staging cluster expansion is a blocker for brotli compression testing and -rollout (T147961), so we would like to make this decision fairly soon.

> From https://phabricator.wikimedia.org/T139961#2541110 re: the available AMS nodes:

> To summarize a conversation with @mark on IRC, there are 10 systems total, at least 3 of which satisfy the following:

> However, Analytics has recently decommissioned their legacy AQS cluster consisting of 3 nodes:

> Compared to the esams nodes, these have slower procs, are on the lower end of the memory requirement, and would still require that we purchase disks; there may also be an issue with the performance of the RAID controller (which I am told is still required, even in a JBOD/software RAID configuration). That said, they are located in eqiad, not esams, and so do not suffer the same issues vis-a-vis an on-site engineer and PII sensitivity.

> RAID performance issues notwithstanding, would this be an easier/better option for Ops?

Assuming we're talking about aqs1001-1003, and Analytics is willing to decommission them at this time: I have no objections to using those 3 servers for this purpose.

> Assuming we're talking about aqs1001-1003, and Analytics is willing to decommission them at this time: I have no objections to using those 3 servers for this purpose.

Already created https://phabricator.wikimedia.org/T147926 for their repurpose :)

We've been able to find some H710 controllers, which we can swap for the H310s. That should allow these 3 boxes to be used with decent I/O performance, and seems the best option at this point.

@mark: That sounds great, thank you! Do you need anything else from us for the disk procurement?

I think this is ready to proceed with procurement and setup — @RobH?

RobH changed the task status from Open to Stalled. Oct 27 2016, 4:04 PM

> I think this is ready to proceed with procurement and setup — @RobH?

The linked task T147926 has the reclaim-to-spares steps, which include upgrading the controllers. Once that is done, they'll be allocated for this use.

I'm stalling this until that time (I expect it should be a day or two).

Eevans mentioned this in Unknown Object (Task). Oct 27 2016, 6:30 PM

Please note that the 3 R720xd systems on the sub-task won't fit SSDs into the LFF hot-swap slots.

That means these three old aqs100[123] hosts are only useful if Services can use them in this staging allocation with their spinning HDDs; no SSDs would be available.

Since this is testing/staging, and the load would be split across 12 spindles with a hardware RAID controller, would these still work?

Assigning back to @Eevans for feedback (since he was the original requester). Please provide feedback and assign back to me for followup, thanks!

@RobH, we need SSDs in these boxes. Could we use generic 2.5"->3.5" adapters to fit 2.5" SSDs in 3.5" slots?

> @RobH, we need SSDs in these boxes. Could we use generic 2.5"->3.5" adapters to fit 2.5" SSDs in 3.5" slots?

Unfortunately no, they won't work/fit for these hosts. (We investigated that on sub-task T147926.)

As far as I can tell, we don't have any spare systems that are out of warranty and meet the testing/staging host criteria.

Is this a general limitation, or some limitation of the particular adapter used? We'd save a lot of money and time if we could use these hosts, so maybe it would be worth trying with a second adapter? There are some that are especially geared towards hot swap setups, such as this one.

That particular adapter may work; I'd advise we purchase just one and test it with spare non-S3610 SSDs.

Since this is an out-of-warranty testing host, those old spare SSDs may be good enough for testing/staging as well, but that can be determined later.

RobH created subtask Unknown Object (Task). Nov 10 2016, 7:54 PM
RobH created subtask Unknown Object (Task). Nov 17 2016, 5:22 PM
RobH closed subtask Unknown Object (Task) as Declined. Nov 18 2016, 6:38 PM

@RobH, could you update this task with a summary of the progress so far & ideally an estimate of the ETA for these boxes?

@GWicke: Task T151075 tracks the setup and installation of these hosts. The hosts have been selected, and the SSDs were ordered via task T150968. They have already arrived, but they haven't been installed into the systems yet. I've pinged @Cmjohnson to find out the status of the SSD install; he will check tomorrow.

Please note that restbase-test1001 has an issue detecting one of the disks, but the other two are ready for use (restbase-test100[23]).

I've not reassigned the setup sub-task T151075 yet, since restbase-test1001 hasn't been installed yet.

Resolving this request for hardware, as it's all been allocated, and 2 of the 3 are available now (the other one should be ready tomorrow).

@RobH: Excellent, thanks for the update!

RobH closed subtask Unknown Object (Task) as Resolved. Jun 12 2017, 7:54 PM
RobH closed subtask Unknown Object (Task) as Resolved.