
deployment-prep: Code stewardship request
Open, Medium · Public

Description

Intro

Deployment-prep, also known as the beta cluster, is a Cloud VPS project originally created by technical volunteers {{cn}}. In the years since, it has become a resource used by technical volunteers, the Wikimedia CI pipeline {{cn}}, Foundation staff, and manual testers. It is not, however, proactively maintained by any Foundation staff in their staff capacity.

This is a "weird" stewardship request because this project is not technically part of the Wikimedia production environment. It is also not exactly a single software system. Instead it is a shared environment used by multiple stakeholders to validate code and configuration changes in a non-production environment. A decision to sunset the beta cluster would be highly disruptive if it did not come along with a proposal to build a replacement environment of some type. This environment however has spent years in a state of uncertain maintainership and the code stewardship process seems like the most mature process we have to discuss the merits of the project and how it might be better supported/resourced going forward.

Issues

  • Unclear ownership of the Cloud VPS project instances (meaning that there are a large number of project admins, but little to no documentation about which people are taking care of which instances)
  • Production Puppet code is used, but often needs customization to work within the various constraints of the Cloud VPS project environment, which requires special +2 rights. No holder of such rights is currently active in the project.
  • Not all Wikimedia production software changes are deployed in this environment {{cn}}
  • Puppet failures triggered by upstream configuration changes can remain for days or weeks before being addressed, potentially blocking further testing of code and configuration changes

Event Timeline

bd808 created this task. · Feb 4 2019, 10:44 PM
Krenair added a subscriber: Krenair.
bd808 updated the task description. · Feb 4 2019, 10:56 PM
greg added a subscriber: greg. · Feb 7 2019, 10:34 PM
Bawolff added a subscriber: Bawolff. Edited · Feb 19 2019, 5:26 PM

Deployment-prep, also known as the beta cluster, is a Cloud VPS project originally created by technical volunteers {{cn}}

Pretty sure that's not true (in particular, I was under the impression that the idea originated with https://en.labs.wikimedia.org ~2009 with the usability initiative & flagged revisions, and then later morphed into its current form under the guidance of the platform team https://www.mediawiki.org/w/index.php?title=QA_and_testing/Labs_plan&oldid=534819 in 2012). However, I suppose history doesn't matter; what matters is today.


TBH: I'm kind of surprised to see this task. I don't follow betalabs too much, but I thought it was pretty well accepted that it was a RelEng responsibility generally, and that cloud was only responsible for providing the underlying cloud VPS platform. Is that not the general understanding?

greg added a comment. · Feb 19 2019, 5:53 PM

TBH: I'm kind of surprised to see this task. I don't follow betalabs too much, but I thought it was pretty well accepted that it was a RelEng responsibility generally, and that cloud was only responsible for providing the underlying cloud VPS platform. Is that not the general understanding?

FTR: Responsibility of the services/pieces that make up the Beta Cluster[0] lies with the people who maintain them in production; there's no other way it could conceivably work.

The question is basically: how can a team tasked with everything else it is doing (deployments, tooling, CI, etc.) keep up with an SRE team of 20 people when maintaining a shadow environment? The answer is: it can't :) This is a long-understood problem/imbalance for both SRE and RelEng (as in, we all see the problem and have no good answer).

To focus on the future: what is possible and what is needed will change as we migrate more and more parts of our infrastructure to the Deployment Pipeline. We (RelEng and SRE) should scope out what that is and how it impacts the Beta Cluster in the short, medium, and long term (read: that's the conversation that should happen to move this stewardship review forward).

[0] https://wikitech.wikimedia.org/wiki/Help:Labs_labs_labs#Beta_Cluster

jeena added a subscriber: jeena. · Feb 20 2019, 12:46 AM

The question is basically: how can a team tasked with everything else it is doing (deployments, tooling, CI, etc.) keep up with an SRE team of 20 people when maintaining a shadow environment?

A question from someone uninitiated: by what rule do the "production environment" and the "shadow environment" have to be (as it sounds here) maintained by different groups? Given the stated purpose, "validate code and configuration changes in a non-production environment", the maximum benefit would come from identical setups, which would arguably be easiest to achieve by using the same tech stack and know-how (i.e. the same people). Thanks.

Jrbranaa moved this task from In Review to Prioritized on the Code-Stewardship-Reviews board.

Normally I'm all for celebrating birthdays.... I can't believe this baby is one year old already... Let us not need to celebrate its second birthday (dark, I know) :-/

Mvolz added a subscriber: Mvolz. · Mar 20 2020, 8:03 PM
Krinkle added a subscriber: Krinkle. Edited · May 11 2020, 3:08 PM

Some services and products are maintained by their owners in both the production data centers and the Beta Cluster (most Product teams, and in Tech: Perf, Analytics, and a few others).

For some other services this is not the case, which halts much development and testing whenever problems crop up.

A non-urgent but recent example to illustrate this is T139044: Enable GTID on beta cluster mariaDB once upgraded.

dpifke added a subscriber: dpifke. · Oct 12 2020, 6:00 PM
Joe added a subscriber: Joe. · Oct 13 2020, 6:29 AM

Normally I'm all for celebrating birthdays.... I can't believe this baby is one year old already... Let us not need to celebrate its second birthday (dark, I know) :-/

We are a few months away from the second birthday party; is anything moving on this front? :) I ask as I was just reminded how bad a job we're collectively doing at keeping deployment-prep healthy (see T257118#6536304).

I had a conversation about this ticket with @nskaggs, and I felt I should post an update here after our conversation.

The problem of stewardship for beta cluster is really a series of problems:

  1. Beta means different things to different people
  2. Maintenance of beta
  3. Sunsetting beta

Beta means different things to different people

In 2018 a few folks on Release-Engineering-Team conducted a survey on the uses of the beta cluster. From the survey we identified the following uses of the Beta Cluster:

  • Showcasing new work
  • End-to-end/unit testing of changes in isolation
  • Manual QA, quick iteration on bug fixes
  • Long-term testing of alpha features & services in an integrated environment
  • Test how changes integrate with a production-like environment before release
  • Test the deployment procedure
  • Test performance regressions
  • Test integration of changes with production-like data
  • Test with live traffic

The first thing to notice is that some of these use-cases work against one another. Testing isolated changes cannot be done alongside long-term testing of alpha features. New services and new extensions that are not in production make the environment less "production-like". New versions of production software in beta make beta less stable. But delayed upgrades of production software in beta might also leave beta unstable.

Beta has many purposes but not a single primary purpose -- it's used for everything: it's a tragedy of the commons. There has never been a shared understanding of what "production-like" means for the beta cluster. It likely means different things to different people.

There is no single perfect thing for beta to become because it's doing so many things currently. There is no perfect beta cluster, only perfect beta clusters tailored for their use-cases. Back in 2015 the idea of "Beta Cluster as a Service" (BCaaS [bɪˈkɒz]) had some minor traction, but for all the reasons mentioned in T215217#4965494 it didn't happen.

Maintenance

Production is maintained by a group of 23 people (SRE) dedicated to keeping that environment running, up-to-date, and safe. Release-Engineering-Team used to pretend that we could keep pace with production as a group of 7 people who are also responsible for CI, deployment, code review, and development environments, but that has proven not to work in practice. The environment is also different enough from production that the folks familiar with production are not able to maintain it productively.

A fantastic example of the kind of maintenance problem we have was me breaking beta a few hours ago (T267439: MediaWiki beta varnish is down) -- an upstream puppet patch broke puppet in beta, and when I fixed puppet it caused problems with packages I'd never heard of. There is a lot of specialized knowledge needed to keep production running, and it just gets more specialized all the time.

Currently there is a project to move existing services (as well as MediaWiki) through the deployment-pipeline and into kubernetes in production. This is making beta cluster even less production-like: there is no k8s in beta and no team has a plan to build or maintain one.

My stance on beta cluster has been: Release-Engineering-Team cares if the beta cluster is broken, and we'll try to wrangle the appropriate people to help. This is very different from the kind of active maintenance that beta needs to fight entropy.

Sunsetting Beta

Another finding from the 2018 survey was that 80% of respondents said that they "agree" or "mostly agree" with the statement, "I depend on Beta Cluster for some of my regular testing needs". This past week the beta cluster found 3 release train blockers that never hit production. Beta is important and currently has no replacement. Many of its instances are pets, not cattle.

Beta is also definitely an ongoing pain point for both Release-Engineering-Team and cloud-services-team.

Sunsetting beta requires a plan to replace the use-cases of beta with something more maintainable. We're in the midst of a large transition in production, containerizing our services. There is a staging cluster for services that will likely supplant some portion of beta's use-cases (a "production-like" environment). The remaining use-cases will likely fall into the realm of local development and (possibly) something that utilizes existing containers to allow developers to share changes with one another -- something akin to the existing patchdemo project. This was a major recommendation made as part of the exploration of existing local development tooling. As we begin to supplant the use-cases of the beta cluster, we can form a more fully realized plan for shutting it down.

dpifke added a comment. · Sat, Nov 7, 1:38 AM

As someone who considers beta essential to my role, I'll add a data point with my use case.

I have root on the webperf hosts, but those are configured via puppet and I don't have +2 rights in operations/puppet. But I do have root in beta, so I'm able to cherry-pick patches there for testing. (Even with our puppet linter and compiler infrastructure, it's extremely difficult to craft working patches without some way to test them, which requires having a puppetmaster and hosts with the affected roles.)
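For readers less familiar with this workflow, here is a minimal sketch of what cherry-picking a patch onto the beta puppetmaster for testing typically looks like. The checkout path, the Gerrit change ref, and the exact commands are illustrative assumptions, not taken from this task; the real paths on the deployment-prep puppetmaster may differ.

```
# On the deployment-prep puppetmaster, in the local operations/puppet checkout.
# The path below is an assumption; adjust to the actual checkout location.
cd /var/lib/git/operations/puppet

# Fetch a change under review from Gerrit and cherry-pick it locally.
# "12/345612/3" is a placeholder change ref, not a real change.
sudo git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/12/345612/3
sudo git cherry-pick FETCH_HEAD

# Then, on a beta instance that carries the affected role, run the agent
# to see whether the patched manifests actually apply:
sudo puppet agent --test
```

If the run fails or the services don't behave as expected, the patch can be amended in Gerrit and re-picked, iterating until it works, without touching production.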

A specific example: upgrading the performance team services to use Python 3 (T267269) requires a series of inter-dependent patches to update both our code and some system library dependencies. The puppet changes took several patchsets to get right, e.g. figuring out why services weren't being restarted. It would have been extremely painful to iterate on this in production.

Some pain points I've experienced:

  1. Often, the first step in testing a puppet patch is to get beta back to a working state, pre-patch. For example: T244776#6364483 (Swift in beta had been mostly broken for some time).
  2. Sometimes, differences between production and beta create problems unique to beta. For example: T248041 (puppetmaster OOMs).
  3. Long-lived divergences between beta and production can be a problem, e.g. merge conflicts. For example: T244624. It'd be nice to have a clear policy about when it's OK to un-cherry-pick someone else's patch. (My stance on this re: my patches is in T245402#6517866 - please un-cherry-pick at will).

For the most part, I budget for the above when scoping testing of patches. Certainly not having a testing environment—or having a less permissive test environment without root access—would be way worse than the unrelated issues I've had to fix along the way.

There's a tragedy of the commons, but there are also economies from having a shared environment. I'm not sure it would be reasonable to expect someone to spin up e.g. their own Swift stack whenever they wanted to test a related change. Given our current dependence on puppet in production, I'm not sure spinning up a usable local testing environment for most services is even possible.