The staging environment for RESTBase is currently a special case in that it is hosted on real hardware in the production network. This was done because changes being staged often require access to resources comparable to those of production, a requirement that cannot (easily) be met in labs. Unfortunately, this environment is still too far from production specs to be useful in many cases.
For example:
- Available per-instance storage isn't enough, even though we only load data for a single wiki. We typically do not have enough working space to perform operations that rewrite SSTables (scrub, upgradesstables, etc.).
- The amount of memory is considerably lower than what is available in production, requiring us to use heap sizes and GC settings that do not even approximate what is used in production.
- The specs are quite different across the cluster, preventing the use of a homogeneous configuration (for example: eqiad machines are only able to host a single instance, codfw machines 2 instances).
When the staging environment comes up short, we do our best, then move the remainder of testing to production (and I fear that this has become so routine that we no longer even stop to acknowledge that this is what we are doing).
In addition to the Staging use-case, there is the more development-focused, ad hoc testing. The requirements here are for all intents and purposes the same, but for lack of better options, we typically perform this sort of testing in the staging environment. This isn't awesome either; ideally we'd maintain the staging environment identically to production, with code and configuration that we expect to move to production RSN (i.e. changes staged for production).
Ask
Staging
6 dedicated hosts (always-on) that can be deployed in a multi-datacenter configuration (in a perfect world, 3 in one data-center and 3 in another, though this could be simulated easily enough), of roughly half the specs we're using in production (8-core, 48-64G RAM, 2T SSDs).
This could be older, disparate hardware no longer under warranty.
Rationale
We need to emulate the multi-datacenter configuration used in production, including the ability to apply DC-local quorums, which mandates a minimum of 6 hosts. Back-of-napkin, if the current production hosts have the resources to host 4 instances (which seems likely at this point), then machines with half the specs should be able to host 2 instances (the minimum needed to properly test multi-instance).
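The arithmetic behind these numbers can be sketched as follows. This is only an illustration, and it assumes a replication factor of 3 in each of 2 data-centers, which is not stated above but is consistent with the 6-host minimum. The function names are hypothetical, and the instance estimate assumes capacity scales linearly with specs, which is a rough approximation at best.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def min_hosts(datacenters: int, rf_per_dc: int) -> int:
    """Fewest hosts that put each replica on its own machine in every DC."""
    return datacenters * rf_per_dc


def instances_per_host(prod_instances: int, spec_ratio: float) -> int:
    """Scale the production instance count linearly with the hardware ratio."""
    return int(prod_instances * spec_ratio)


# Assumed values, not taken from the request itself.
DATACENTERS = 2
RF_PER_DC = 3

print("DC-local quorum size:", quorum(RF_PER_DC))
print("minimum hosts:", min_hosts(DATACENTERS, RF_PER_DC))
print("instances per half-spec host:", instances_per_host(4, 0.5))
```

Under these assumptions a DC-local quorum needs 2 of the 3 local replicas, so each data-center can lose one host and still serve quorum reads and writes, which is the behaviour the staging cluster would need to reproduce.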
Testing
Generally, machine requirements here match those of Staging above, with some exceptions. For example, depending on the nature of the test, a full 6 machines might not be needed. Most performance tests could be run with 3 machines (6 would largely be for tests of a functional nature).
Additionally, ephemeral configurations would be acceptable, and in some ways even desirable. Unlike the Staging use-case, which is meant to emulate production, the need for a test environment exists only for the duration of the testing, and ideally it'd be easy to re-baseline the environment between tests. Obviously, all of this is possible with a dedicated, static environment, but these looser requirements may open the door to other solutions as well.