
Beta Cluster Tech Decision Forum
Open, Needs Triage · Public

Description

Problem

Beta is a cluster of 72 virtual machines running on Wikimedia Cloud Services, and it’s the closest thing we have to a production-like staging environment. But it’s not very production-like, and it’s not maintainable.

What does the future look like if this is achieved?

  • We run the mainline branch of MediaWiki, extensions, and skins in a subset of our production infrastructure.
  • After code is merged, it gets automatically deployed to a subset of production.
  • It’s linked to production data and production wikis, but is only accessible via specially crafted requests (e.g., via X-Wikimedia-Debug headers; see the sketch below).
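
For illustration only, a “specially crafted request” routed to such an environment might look like the sketch below. The header value and target URL are hypothetical placeholders, not a decided interface.

```
# Illustrative sketch only: the X-Wikimedia-Debug value and URL below are
# hypothetical placeholders, not a decided interface.
import requests

resp = requests.get(
    "https://en.wikipedia.org/wiki/Special:Version",
    headers={
        # Asks the edge to route this request to a post-merge staging backend
        # instead of the regular appserver pool (value is hypothetical).
        "X-Wikimedia-Debug": "backend=staging-mainline.example",
    },
    timeout=10,
)
print(resp.status_code)
```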

What happens if we do nothing?

We built about half of the virtual machines in the Beta Cluster (30/72) so long ago that their operating system no longer receives security updates. If we do nothing, our only pre-production environment will have half its instances removed and won’t work anymore.

🧐 Why?

  • Beta cluster is valuable as a pre-production environment—it’s a place where developers can test the mainline branch of their code.
    • In a 2018 survey, roughly 94% of developers agreed they use Beta for at least some of their testing
    • Two-thirds of developers rely on it to make informed decisions about deployment
  • Beta is not maintainable in its current state
    • We’ve temporarily dedicated resources to “fixing” Beta in the past. Those people have always managed to make improvements, but the problem of maintaining Beta only grows over time.
    • The Beta Cluster has been up for code stewardship review since 2019 (T215217)

Beta tracks our production infrastructure and our production environment only grows.

Growth of the operations/puppet repository

Year        | Files          | Lines of code*      | People required to build (COCOMO)**
2016        | 2,732          | 139,430             | 20.5
2017        | 3,708 (+976)   | 185,394 (+45,964)   | 24.6 (+4.1)
2018        | 4,709 (+1,001) | 234,510 (+49,116)   | 28.7 (+4.1)
2019        | 5,421 (+712)   | 262,015 (+27,505)   | 30.9 (+2.2)
2020        | 6,025 (+604)   | 308,442 (+46,427)   | 34.3 (+3.4)
2021        | 6,349 (+324)   | 355,107 (+46,664)   | 37.6 (+3.3)
Avg. change | +723 files/year | +43,135 LoC/year   | +3.4 people/year

* SLOC count via https://github.com/boyter/scc
** COCOMO (Constructive Cost Model), using the organic model: https://en.wikipedia.org/wiki/COCOMO

You should view the actual numbers above with skepticism. But the trend is evident: our production environment grows every year.
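
For reference, the people-per-year figures can be approximately reproduced from the lines-of-code counts with the organic-model COCOMO formulas. This is a sketch: the coefficients below are assumed to match scc's defaults, and are not stated anywhere in this task.

```
# Sketch: approximate reproduction of the "people required to build" column.
# Coefficients are assumed to match scc's organic-model COCOMO defaults.
def cocomo_people(lines_of_code: int) -> float:
    kloc = lines_of_code / 1000
    effort = 3.2 * kloc ** 1.05        # estimated effort in person-months
    schedule = 2.5 * effort ** 0.38    # estimated schedule in months
    return effort / schedule           # average people working in parallel

for year, loc in [(2016, 139_430), (2021, 355_107)]:
    print(year, round(cocomo_people(loc), 1))  # -> 2016 20.5, 2021 37.6
```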

We haven’t added resources to Beta, and we’ve burned out the people we’ve assigned as they struggle to keep up with the pace of change.

Event Timeline

+1 to the idea of deprecating the Beta Cluster, once there's a better replacement. I would suggest retitling this task to make the objective clear, either "Blue-green deployment" if the goal is to focus on a single replacement infrastructure, or "Deprecate Beta Cluster" if the goal is simply to get rid of the cluster by any possible means.

I see this task is brand new, so this is just a recommendation for future work, but as you refine the RFC please link to existing work around this topic, such as T168843: Continuous Delivery, https://wikitech.wikimedia.org/wiki/Deployment_pipeline, and so on, to improve the quality of discussion here.

thcipriani renamed this task from Beta Cluster to Beta Cluster Tech Decision Forum. May 13 2022, 4:40 PM
thcipriani updated the task description.
thcipriani added subscribers: akosiaris, kostajh, cscott and 10 others.
thcipriani added a subscriber: Joe.

Am I reading this correctly that the proposed Beta replacement would be directly using and updating production data and also share the production bottlenecks (such as databases)?

In T308283#7927987, @Majavah wrote:

Am I reading this correctly that the proposed Beta replacement would be directly using and updating production data and also share the production bottlenecks (such as databases)?

You are reading that right.

The biggest unknown around this idea is data.

Using a separate database has pros and cons:

  • (pro) ability to perform destructive end-to-end testing
  • (pro) could be updated on-the-fly with an automated process without worry
  • (con) limits our ability to reproduce production scenarios
  • Unknowns: what data? Where would it come from and where would it live?

Likewise using production data has pros and cons:

  • (con) may be destructive to production data (although, if this is a post-merge environment, we may find that out with the train in a less controlled way)
  • (con) New code can access PII (although, again, in a post-merge environment, this code will be doing that within an hour to a week anyway)
  • (pro) new code can be manually tested with production data
  • (pro) no maintenance of a secondary system
  • Unknowns: how to update the database schema?

You are reading that right.

Thank you for that clarification! I have a couple of comments regarding your points though.

  • (con) limits our ability to reproduce production scenarios

Does it? We already have the ability to hack mwdebug servers to run whatever code we need for debugging purposes. I don't think the alternative (merging debugging code to master and then reverting it) is any better.

  • (con) may be destructive to production data (although, if this is a post-merge environment, we may find that out with the train in a less controlled way)

This doesn't take into account that some code is behind a feature flag and only enabled when the developer manually enables it. For example, the extension I help maintain, CentralAuth, is a security-critical component that we can't test on a single isolated wiki in the production realm. There are two major use cases where a proper staging environment gives us something that local testing can't:

  • There will always be weird edge cases that we can't think of when testing things locally. These are usually created by small behavior changes when the code is updated, or when new data is introduced on new entries but can't be backfilled to old rows. A widely used staging environment will surface some of these edge cases that would otherwise only be found in production.
  • Some components, such as various levels of caching, are difficult to replicate locally with production-like behavior. Being able to test new code with the components and overall setup similar to production makes me more confident that the production deployment will not have any serious problems.

In T308283#7928242, @Majavah wrote:

You are reading that right.

Thank you for that clarification! I have a couple of comments regarding your points though.

  • (con) limits our ability to reproduce production scenarios

Does it? We already have the ability to hack mwdebug servers to run whatever code we need for debugging purposes. I don't think the alternative (merging debugging code to master and then reverting it) is any better.

This was something mentioned by @cscott about Parsoid testing. But you're right that many (most? some? unknown) developers have access to mwdebug machines.

  • (con) may be destructive to production data (although, if this is a post-merge environment, we may find that out with the train in a less controlled way)

This doesn't take into account that some code is behind a feature flag and only enabled when the developer manually enables it. For example, the extension I help maintain, CentralAuth, is a security-critical component that we can't test on a single isolated wiki in the production realm. There are two major use cases where a proper staging environment gives us something that local testing can't:

  • There will always be weird edge cases that we can't think of when testing things locally. These are usually created by small behavior changes when the code is updated, or when new data is introduced on new entries but can't be backfilled to old rows. A widely used staging environment will surface some of these edge cases that would otherwise only be found in production.
  • Some components, such as various levels of caching, are difficult to replicate locally with production-like behavior. Being able to test new code with the components and overall setup similar to production makes me more confident that the production deployment will not have any serious problems.

I'm trying to follow what you're saying, but I'm not clear: Are you saying that rollout on the train is more controlled than I think it is? Or that a true pre-production environment would be helpful for testing? Or something else?

Regarding data and databases in production: you could potentially have a dedicated database section and make sure the appservers in the test environment can only connect to that section. We already have a test-s4 section. This would need some work with regard to our grants and dedicated hardware (2 boxes per DC = 4), but it's not impossible to do.