
deployment-prep: Code stewardship request
Open, Medium · Public

Description

Intro

Deployment-prep, also known as the beta cluster, is a Cloud VPS project originally created by technical volunteers {{cn}}. In the years since, it has become a resource used by technical volunteers, the Wikimedia CI pipeline {{cn}}, Foundation staff, and manual testers. It is not, however, proactively maintained by any Foundation staff in their staff capacity.

This is a "weird" stewardship request because this project is not technically part of the Wikimedia production environment. It is also not exactly a single software system. Instead it is a shared environment used by multiple stakeholders to validate code and configuration changes in a non-production environment. A decision to sunset the beta cluster would be highly disruptive if it did not come along with a proposal to build a replacement environment of some type. This environment however has spent years in a state of uncertain maintainership and the code stewardship process seems like the most mature process we have to discuss the merits of the project and how it might be better supported/resourced going forward.

Issues

  • Unclear ownership of the Cloud VPS project instances (meaning that there are a large number of project admins, but little to no documentation about which people are taking care of which instances)
  • Production Puppet code is used, but often needs customization to work within the various constraints of the Cloud VPS project environment, which requires special +2 rights. No holder of such rights is currently active in the project.
  • Not all Wikimedia production software changes are deployed in this environment {{cn}}
  • Puppet failures triggered by upstream configuration changes can remain for days or weeks before being addressed, potentially blocking further testing of code and configuration changes

Event Timeline

bd808 created this task. · Feb 4 2019, 10:44 PM
Krenair added a subscriber: Krenair.
bd808 updated the task description. · Feb 4 2019, 10:56 PM
greg added a subscriber: greg. · Feb 7 2019, 10:34 PM
Bawolff added a subscriber: Bawolff. Edited · Feb 19 2019, 5:26 PM

Deployment-prep, also known as the beta cluster, is a Cloud VPS project originally created by technical volunteers {{cn}}

Pretty sure that's not true (in particular, I was under the impression that the idea originated with https://en.labs.wikimedia.org ~2009 with the usability initiative & flagged revisions, and then later morphed into its current form under the guidance of the platform team https://www.mediawiki.org/w/index.php?title=QA_and_testing/Labs_plan&oldid=534819 in 2012). However, I suppose history doesn't matter; what matters is today.


TBH: I'm kind of surprised to see this task. I don't follow betalabs too much, but I thought it was pretty well accepted that it was a RelEng responsibility generally, and that cloud was only responsible for providing the underlying cloud VPS platform. Is that not the general understanding?

greg added a comment. · Feb 19 2019, 5:53 PM

TBH: I'm kind of surprised to see this task. I don't follow betalabs too much, but I thought it was pretty well accepted that it was a RelEng responsibility generally, and that cloud was only responsible for providing the underlying cloud VPS platform. Is that not the general understanding?

FTR: Responsibility of the services/pieces that make up the Beta Cluster[0] lies with the people who maintain them in production; there's no other way it could conceivably work.

The question is basically: how can a team tasked with everything else it is doing (deployments, tooling, CI, etc.) keep up with an SRE team of 20 people when maintaining a shadow environment? The answer is: it can't :) This is a long-understood problem/imbalance for both SRE and RelEng (as in, we all see the problem and have no good answer).

To focus on the future: what is possible and what is needed will change as we migrate more and more parts of our infrastructure to the Deployment Pipeline. We (RelEng and SRE) should scope out what that is and how it impacts the Beta Cluster in the short, medium, and long term (read: that's the conversation that should happen to move this stewardship review forward).

[0] https://wikitech.wikimedia.org/wiki/Help:Labs_labs_labs#Beta_Cluster

jeena added a subscriber: jeena. · Feb 20 2019, 12:46 AM

The question is basically: how can a team tasked with everything else it is doing (deployments, tooling, CI, etc.) keep up with an SRE team of 20 people when maintaining a shadow environment?

A question from someone uninitiated: by what rule do the "production environment" and the "shadow environment" have to be (as it sounds here) maintained by different groups? Given the stated purpose, "validate code and configuration changes in a non-production environment", the maximum benefit would come from identical setups, which would arguably be easiest to achieve by using the same tech stack and know-how (i.e. the same people). Thanks.

Jrbranaa moved this task from In Review to Prioritized on the Code-Stewardship-Reviews board.

Normally I'm all for celebrating birthdays.... I can't believe this baby is one year old already... Let us not need to celebrate its second birthday (dark, I know) :-/

Mvolz added a subscriber: Mvolz. · Mar 20 2020, 8:03 PM
Krinkle added a subscriber: Krinkle. Edited · May 11 2020, 3:08 PM

Some services and products are maintained by their owners in both the production data centers and the Beta Cluster (most Product teams, and in Tech: Perf, Analytics, and a few others).

For some other services this is not the case, which halts much development and testing whenever problems crop up.

A non-urgent but recent example to illustrate this is T139044: Enable GTID on beta cluster mariaDB once upgraded.

dpifke added a subscriber: dpifke. · Oct 12 2020, 6:00 PM
Joe added a subscriber: Joe. · Oct 13 2020, 6:29 AM

Normally I'm all for celebrating birthdays.... I can't believe this baby is one year old already... Let us not need to celebrate its second birthday (dark, I know) :-/

We are a few months away from the second birthday party; is anything moving on this front? :) I ask as I was just reminded how bad a job we're collectively doing at keeping deployment-prep healthy (see T257118#6536304).

I had a conversation about this ticket with @nskaggs, and I felt I should post an update here after our conversation.

The problem of stewardship for beta cluster is really a series of problems:

  1. Beta means different things to different people
  2. Maintenance of beta
  3. Sunsetting beta

Beta means different things to different people

In 2018 a few folks on Release-Engineering-Team conducted a survey on the uses of the beta cluster. From the survey we identified the following uses of the Beta Cluster:

  • Showcasing new work
  • End-to-end/unit testing of changes in isolation
  • Manual QA, quick iteration on bug fixes
  • Long-term testing of alpha features & services in an integrated environment
  • Test how changes integrate with a production-like environment before release
  • Test the deployment procedure
  • Test performance regressions
  • Test integration of changes with production-like data
  • Test with live traffic

The first thing to notice is that some of these use-cases work against one another. Testing isolated changes cannot be done alongside long-term testing of alpha features. New services and new extensions that are not in production make the environment less "production-like". New versions of production software in beta make beta less stable. But delayed upgrades of production software in beta might also leave beta unstable.

Beta has many purposes but not a single primary purpose -- it's used for everything: it's a tragedy of the commons. There has never been a shared understanding of what "production-like" means for the beta cluster. It likely means different things to different people.

There is no single perfect thing for beta to become because it's doing so many things currently. There is no perfect beta cluster, only perfect beta clusters tailored for their use-cases. Back in 2015 the idea of "Beta Cluster as a Service" (BCaaS [bɪˈkɒz]) had some minor traction, but for all the reasons mentioned in T215217#4965494 it didn't happen.

Maintenance

Production is maintained by a group of 23 people (SRE) dedicated to keeping that environment running, up-to-date, and safe. Release-Engineering-Team used to pretend that we could keep pace with production as a group of 7 people who are also responsible for CI, deployment, code review, and development environments, but that has proven not to work in practice. The environment is also different enough from production that the folks familiar with production are not able to maintain it productively.

A fantastic example of the kind of maintenance problem we have was me breaking beta a few hours ago (T267439: MediaWiki beta varnish is down) -- an upstream puppet patch broke puppet in beta, and when I fixed puppet it caused problems with packages I'd never heard of. There is a lot of specialized knowledge needed to keep production running, and it just gets more specialized all the time.

Currently there is a project to move existing services (as well as MediaWiki) through the deployment-pipeline and into kubernetes in production. This is making beta cluster even less production-like: there is no k8s in beta and no team has a plan to build or maintain one.

My stance on beta cluster has been: Release-Engineering-Team cares if the beta cluster is broken, and we'll try to wrangle the appropriate people to help. This is very different from the kind of active maintenance that beta needs to fight entropy.

Sunsetting Beta

Another finding from the 2018 survey was that 80% of respondents said that they "agree" or "mostly agree" with the statement, "I depend on Beta Cluster for some of my regular testing needs". This past week the beta cluster found 3 release train blockers that never hit production. Beta is important and currently has no replacement. Many of its instances are pets, not cattle.

Beta is also definitely an ongoing pain point for both Release-Engineering-Team and cloud-services-team.

Sunsetting beta requires a plan to replace the use-cases of beta with something more maintainable. We're in the midst of a large transition in production, containerizing our services. There is a staging cluster for services that will likely supplant some portion of beta's use-cases (a "production-like" environment). The remaining use-cases will likely fall into the realm of local development and (possibly) something that utilizes existing containers to allow developers to share changes with one another -- something akin to the existing patchdemo project. This was a major recommendation made as part of the exploration of existing local development tooling. As we begin to supplant the use-cases of the beta cluster, we can form a more fully realized plan for shutting it down.

dpifke added a comment. · Sat, Nov 7, 1:38 AM

As someone who considers beta essential to my role, I'll add a data point with my use case.

I have root on the webperf hosts, but those are configured via puppet and I don't have +2 rights in operations/puppet. But I do have root in beta, so I'm able to cherry-pick patches there for testing. (Even with our puppet linter and compiler infrastructure, it's extremely difficult to craft working patches without some way to test them, which requires having a puppetmaster and hosts with the affected roles.)
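For readers less familiar with this workflow, here is a minimal sketch of what cherry-picking a patch onto the beta puppetmaster for testing typically looks like. The checkout path, the Gerrit change ref, and the exact commands are illustrative assumptions, not taken from this task; the real paths on the deployment-prep puppetmaster may differ.

```
# On the deployment-prep puppetmaster, in the local operations/puppet checkout.
# The path below is an assumption; adjust to the actual checkout location.
cd /var/lib/git/operations/puppet

# Fetch a change under review from Gerrit and cherry-pick it locally.
# "12/345612/3" is a placeholder change ref, not a real change.
sudo git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/12/345612/3
sudo git cherry-pick FETCH_HEAD

# Then, on a beta instance that carries the affected role, run the agent
# to see whether the patched manifests actually apply:
sudo puppet agent --test
```

If the run fails or the services don't behave as expected, the patch can be amended in Gerrit and re-picked, iterating until it works, without touching production.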

A specific example: upgrading the performance team services to use Python 3 (T267269) requires a series of inter-dependent patches to update both our code and some system library dependencies. The puppet changes took several patchsets to get right, e.g. figuring out why services weren't being restarted. It would have been extremely painful to iterate on this in production.

Some pain points I've experienced:

  1. Often, the first step in testing a puppet patch is to get beta back to a working state, pre-patch. For example: T244776#6364483 (Swift in beta had been mostly broken for some time).
  2. Sometimes, differences between production and beta create problems unique to beta. For example: T248041 (puppetmaster OOMs).
  3. Long-lived divergences between beta and production can be a problem, e.g. merge conflicts. For example: T244624. It'd be nice to have a clear policy about when it's OK to un-cherry-pick someone else's patch. (My stance on this re: my patches is in T245402#6517866 - please un-cherry-pick at will).

For the most part, I budget for the above when scoping testing of patches. Certainly not having a testing environment—or having a less permissive test environment without root access—would be way worse than the unrelated issues I've had to fix along the way.

There's a tragedy of the commons, but there are also economies from having a shared environment. I'm not sure it would be reasonable to expect someone to spin up e.g. their own Swift stack whenever they wanted to test a related change. Given our current dependence on puppet in production, I'm not sure spinning up a usable local testing environment for most services is even possible.