
deployment-prep (beta cluster): Code stewardship request
Open, Medium, Public

Description

Intro

Deployment-prep, also known as the beta cluster, is a Cloud VPS project originally created by the Platform team together with Ops (now SRE) as a final-stage test environment for new features. In the years since, it has become a resource used by technical volunteers, the Wikimedia CI pipeline {{cn}}, Foundation staff, and manual testers. It is not, however, proactively maintained by any Foundation staff in their staff capacity.

This is a "weird" stewardship request because this project is not technically part of the Wikimedia production environment. It is also not exactly a single software system; instead, it is a shared environment used by multiple stakeholders to validate code and configuration changes in a non-production environment. A decision to sunset the beta cluster would be highly disruptive if it did not come along with a proposal to build a replacement environment of some type. This environment, however, has spent years in a state of uncertain maintainership, and the code stewardship process seems like the most mature process we have to discuss the merits of the project and how it might be better supported and resourced going forward.

Issues

  • Unclear ownership of the Cloud VPS project instances (meaning that there are a large number of project admins, but little to no documentation about which people are taking care of which instances)
  • Production Puppet code is used, but it often needs customization to work within the various constraints of the Cloud VPS project environment, which requires special +2 rights. No holder of such rights is currently active in the project.
  • Not all Wikimedia production software changes are deployed in this environment {{cn}}
  • Puppet failures triggered by upstream configuration changes can remain for days or weeks before being addressed, potentially blocking further testing of code and configuration changes

Event Timeline

Normally I'm all for celebrating birthdays.... I can't believe this baby is one year old already... Let us not need to celebrate its second birthday (dark, I know) :-/

Some services and products are maintained by their owners in both production data centers and in Beta Cluster alike (most Product teams, and in Tech: Perf, Analytics, and a few others).

For some other services this is not the case, which halts much development and testing whenever problems crop up.

A non-urgent but recent example to illustrate this is T139044: Enable GTID on beta cluster mariaDB once upgraded.

> Normally I'm all for celebrating birthdays.... I can't believe this baby is one year old already... Let us not need to celebrate its second birthday (dark, I know) :-/

We are a few months away from the second birthday party, is anything moving on this front? :) I ask as I was just reminded how bad of a job we're collectively doing at keeping deployment-prep healthy (see T257118#6536304).

I had a conversation about this ticket with @nskaggs and felt I should post an update here afterwards.

The problem of stewardship for beta cluster is really a series of problems:

  1. Beta means different things to different people
  2. Maintenance of beta
  3. Sunsetting beta

Beta means different things to different people

In 2018 a few folks on Release-Engineering-Team conducted a survey on the uses of the beta cluster. From the responses we identified the following uses:

  • Showcasing new work
  • End-to-end/unit testing of changes in isolation
  • Manual QA, quick iteration on bug fixes
  • Long-term testing of alpha features & services in an integrated environment
  • Test how changes integrate with a production-like environment before release
  • Test the deployment procedure
  • Test performance regressions
  • Test integration of changes with production-like data
  • Test with live traffic

The first thing to notice is that some of these use-cases work against one another. Testing isolated changes cannot be done alongside long-term testing of alpha features. New services and new extensions that are not in production make the environment less "production-like". New versions of production software in beta make beta less stable, but delayed upgrades of production software might also leave beta unstable.

Beta has many purposes but not a single primary purpose -- it's used for everything: it's a tragedy of the commons. There has never been a shared understanding of what "production-like" means for the beta cluster. It likely means different things to different people.

There is no single perfect thing for beta to become because it's doing so many things currently. There is no perfect beta cluster, only perfect beta clusters tailored to their use-cases. Back in 2015 the idea of “Beta Cluster as a Service” (BCaaS [bɪˈkɒz]) had some minor traction, but for all the reasons mentioned in T215217#4965494 it didn't happen.

Maintenance

Production is maintained by a group of 23 people (SRE) dedicated to keeping that environment running, up to date, and safe. Release-Engineering-Team used to pretend that we could keep pace with production as a group of 7 people who are also responsible for CI, deployment, code review, and development environments, but that has proven not to work in practice. The environment is also different enough from production that folks familiar with production are not able to maintain it productively.

A fantastic example of the kind of maintenance problem we have was me breaking beta a few hours ago (T267439: MediaWiki beta varnish is down): an upstream puppet patch broke puppet in beta, and when I fixed puppet it caused problems with packages I had never heard of. A lot of specialized knowledge is needed to keep production running, and it only gets more specialized over time.

Currently there is a project to move existing services (as well as MediaWiki) through the deployment-pipeline and into kubernetes in production. This is making beta cluster even less production-like: there is no k8s in beta and no team has a plan to build or maintain one.

My stance on beta cluster has been: Release-Engineering-Team cares if beta cluster is broken, and we'll try to wrangle the appropriate people to help. This is very different from the kind of active maintenance that beta needs to fight entropy.

Sunsetting Beta

Another finding from the 2018 survey was that 80% of respondents said they "agree" or "mostly agree" with the statement, "I depend on Beta Cluster for some of my regular testing needs". This past week the beta cluster caught 3 release train blockers before they ever hit production. Beta is important and currently has no replacement. Many of its instances are pets, not cattle.

Beta is also definitely an ongoing pain point for both Release-Engineering-Team and cloud-services-team.

Sunsetting beta requires a plan to replace its use-cases with something more maintainable. We're in the midst of a large transition in production: containerizing our services. There is a staging cluster for services that will likely supplant some portion of beta's use-cases (a "production-like" environment). The remaining use-cases will likely fall into the realm of local development and (possibly) something that uses existing containers to let developers share changes with one another -- something akin to the existing patchdemo project. This was a major recommendation made as part of the exploration of existing local development tooling. As we supplant the use-cases of the beta cluster we can form a more fully realized plan for shutting it down.

As someone who considers beta essential to my role, I'll add a data point with my use case.

I have root on the webperf hosts, but those are configured via puppet and I don't have +2 rights in operations/puppet. But I do have root in beta, so I'm able to cherry-pick patches there for testing. (Even with our puppet linter and compiler infrastructure, it's extremely difficult to craft working patches without some way to test them, which requires having a puppetmaster and hosts with the affected roles.)
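
(For readers unfamiliar with this workflow, here is a minimal sketch of what cherry-picking a puppet patch into beta looks like. The checkout path and the Gerrit change ref below are assumptions/placeholders, not verified values for this project.)

  # On the deployment-prep puppetmaster; the checkout path is an assumption,
  # so adjust it to wherever the project's local operations/puppet clone lives.
  cd /var/lib/git/operations/puppet
  # Fetch the change under review from Gerrit (NN/NNNNNN/P is a placeholder ref)
  # and cherry-pick it onto the local branch.
  sudo git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/NN/NNNNNN/P
  sudo git cherry-pick FETCH_HEAD

  # Then, on an instance that carries the affected role: dry-run first, apply after.
  sudo puppet agent --test --noop
  sudo puppet agent --test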

A specific example: upgrading the performance team services to use Python 3 (T267269) requires a series of inter-dependent patches to update both our code and some system library dependencies. The puppet changes took several patchsets to get right, e.g. figuring out why services weren't being restarted. It would have been extremely painful to iterate on this in production.

Some pain points I've experienced:

  1. Often, the first step in testing a puppet patch is to get beta back to a working state, pre-patch. For example: T244776#6364483 (Swift in beta had been mostly broken for some time).
  2. Sometimes, differences between production and beta create problems unique to beta. For example: T248041 (puppetmaster OOMs).
  3. Long-lived divergences between beta and production can be a problem, e.g. merge conflicts. For example: T244624. It'd be nice to have a clear policy about when it's OK to un-cherry-pick someone else's patch (my stance on this re: my patches is in T245402#6517866 - please un-cherry-pick at will); see the sketch after this list for how to inspect the current cherry-picks.
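
A rough sketch of that check follows; it assumes the puppetmaster's local clone lives at the path shown and tracks the upstream production branch.

  # On the puppetmaster: list the locally cherry-picked commits that upstream
  # production does not have (checkout path and branch name are assumptions).
  cd /var/lib/git/operations/puppet
  git fetch origin
  git log --oneline origin/production..HEAD

  # Removing one stale cherry-pick without disturbing the others can be done
  # with an interactive rebase; delete the line for the unwanted commit:
  # git rebase -i origin/production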

For the most part, I budget for the above when scoping testing of patches. Certainly not having a testing environment—or having a less permissive test environment without root access—would be way worse than the unrelated issues I've had to fix along the way.

There's a tragedy of the commons, but there are also economies from having a shared environment. I'm not sure it would be reasonable to expect someone to spin up e.g. their own Swift stack whenever they wanted to test a related change. Given our current dependence on puppet in production, I'm not sure spinning up a usable local testing environment for most services is even possible.

I know this is an ongoing issue, but if I were to do some maintenance on beta (read only or temporary outage, T268628), who should I notify?

> I know this is an ongoing issue, but if I were to do some maintenance on beta (read only or temporary outage, T268628), who should I notify?

For brief outages, I'd think #wikimedia-releng (and the related SAL) is probably sufficient - that's where I look when something isn't working to see if someone else is already fixing it.

Cloud folks: it'd be cool if there were automatically-generated email lists based on project membership, e.g. a way to address a reply to all recipients of the "Puppet failure on ..." emails.

> Cloud folks: it'd be cool if there were automatically-generated email lists based on project membership, e.g. a way to address a reply to all recipients of the "Puppet failure on ..." emails.

T47828: Implement mail aliases for Cloud-VPS projects (<novaproject>@wmcloud.org)

After some discussion, the Release Engineering and Quality and Test Engineering teams have decided to make QTE the "Product Owners" of Beta Cluster. This decision comes as part of a larger testing infrastructure effort. The details of what this means and how we will proceed will come out over the course of the coming weeks. In the meantime, this task will be marked as Resolved, as its primary objective was to address the lack of "Code Stewardship" or, more aptly, "Product Ownership".

Apologies for posting on this closed task, but is there any news on the above? Some sort of ETA on an announcement, details, etc.?

> The details of what this means and how we will proceed will come out over the course of the coming weeks.

Was this ever done?

taavi removed Jrbranaa as the assignee of this task.
taavi added a subscriber: Jrbranaa.

Boldly re-opening this task, given the details mentioned in T215217#6665452 have not been published (it's been several months now, outside the "weeks" range) and the primary problem of beta cluster being unmaintained and broken is still an issue.

I just linked this task to someone today after explaining that, afaik, the code ownership of beta is stalled and I don't know why. So an open status makes the most sense indeed.

@Majavah - yes, this work has stalled due to a shift in my priorities over the last few months. However, it's back on the "front burner". I think it makes sense for this task to remain open until a plan has been pulled together and published.

Just noting here that the deployment-prep Cloud VPS project currently has 30 instances running Debian Stretch, which must be upgraded or deleted by May 2022. These include:

  • The entire media storage (Swift) cluster
  • The entire ElasticSearch cluster
  • kafka-main and kafka-jumbo clusters, responsible for the MW job queue, purging cached pages, and other tasks, plus Zookeeper responsible for providing authentication to all Kafka clusters
  • Multiple miscellaneous support services

Is anyone going to work on those?
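
For anyone who wants to enumerate the affected instances, here is a rough sketch; the instance names below are hypothetical placeholders (Horizon/openstack is the authoritative list), and the FQDN pattern is assumed to be <instance>.deployment-prep.eqiad1.wikimedia.cloud.

  # Check the Debian release on a set of deployment-prep instances over ssh.
  # The hostnames here are placeholders, not the real instance names.
  for host in deployment-example-01 deployment-example-02; do
    echo "== ${host} =="
    ssh "${host}.deployment-prep.eqiad1.wikimedia.cloud" \
      'grep PRETTY_NAME /etc/os-release'
  done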

Should we consider it necessary to have support for longer, https://deb.freexian.com/extended-lts/ could be an option. Note, both the timeframe and specific support would have to be defined. I would also caution this is NOT a "solution" to avoiding upgrading these instances. However, it could be part of a plan to upgrade if needed.

In T215217#7796938, @Majavah wrote:

> Just noting here that the deployment-prep Cloud VPS project currently has 30 instances running Debian Stretch, which must be upgraded or deleted by May 2022.

Thanks for raising this, @Majavah, and for all the work you've done on beta — it's in a better place than you found it.


As I mentioned in T215217#6610236, Release-Engineering-Team cares if Beta is down; however, we're not resourced to rebuild all of beta (which is what needs to happen now).

My current plan is to draft something for the tech decision forum so we can figure it out together.

> Should we consider it necessary to have support for longer, https://deb.freexian.com/extended-lts/ could be an option. Note, both the timeframe and specific support would have to be defined. I would also caution this is NOT a "solution" to avoiding upgrading these instances. However, it could be part of a plan to upgrade if needed.

If this is an acceptable solution to buy time, I'm in favor of doing this.

In the time that this would buy, we can figure out how to sustain beta (I hope).

Pinging because one month has passed since the last comment on this.

For everyone's info, currently no Code-Stewardship-Reviews are taking place as there is no clear path forward and as this is not prioritized work.
(Entirely personal opinion: I also assume lack of decision authority due to WMF not having a CTO currently. However, discussing this is off-topic for this task.)

I would like to point out that, especially on dewiki, beta is actively used downstream for development of templates, modules, JavaScript, etc., with permissions elevated in comparison to production. It would be a pity to lose these capabilities.

kostajh renamed this task from deployment-prep: Code stewardship request to deployment-prep (beta cluster): Code stewardship request. Feb 27 2024, 11:29 AM
kostajh subscribed.

Beta cluster is actively used to test the Commons app. Must be annoying for testers that upload.wikimedia.beta.wmflabs.org has been down for weeks. I frequently use it to test gadgets. Found a train blocker or two while doing so. But I don't see a solution either. ;-(

@Shashankiitbhu @Sebastian_Berlin-WMSE: here's the reason why upload.wikimedia.beta.wmflabs.org isn't working.