
deployment-prep (beta cluster): Code stewardship request
Closed, Resolved, Public

Assigned To: Lferreira
Authored By: bd808, Feb 4 2019, 10:44 PM

Description

Intro

Deployment-prep, also known as the beta cluster, is a Cloud VPS project originally created by the Platform team together with Ops (now SRE) as a final-stage test environment for new features. In the years since, it has become a resource used by technical volunteers, the Wikimedia CI pipeline {{cn}}, Foundation staff, and manual testers. It is not, however, proactively maintained by any Foundation staff in their staff capacity.

This is a "weird" stewardship request because this project is not technically part of the Wikimedia production environment. It is also not exactly a single software system. Instead it is a shared environment used by multiple stakeholders to validate code and configuration changes in a non-production environment. A decision to sunset the beta cluster would be highly disruptive if it did not come along with a proposal to build a replacement environment of some type. This environment however has spent years in a state of uncertain maintainership and the code stewardship process seems like the most mature process we have to discuss the merits of the project and how it might be better supported/resourced going forward.

Issues

  • Unclear ownership of the Cloud VPS project instances (meaning that there are a large number of project admins, but little to no documentation about which people are taking care of which instances)
  • Production Puppet code is used, but often needs customization to work within the various constraints of the Cloud VPS project environment, which requires special +2 rights. No holder of such rights is currently active in the project.
  • Not all Wikimedia production software changes are deployed in this environment {{cn}}
  • Puppet failures triggered by upstream configuration changes can remain for days or weeks before being addressed, potentially blocking further testing of code and configuration changes


Event Timeline

There are a very large number of changes, so older changes are hidden.

As someone who considers beta essential to my role, I'll add a data point with my use case.

I have root on the webperf hosts, but those are configured via puppet and I don't have +2 rights in operations/puppet. But I do have root in beta, so I'm able to cherry-pick patches there for testing. (Even with our puppet linter and compiler infrastructure, it's extremely difficult to craft working patches without some way to test them, which requires having a puppetmaster and hosts with the affected roles.)
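
Roughly, that test cycle looks something like the following sketch. The puppetmaster checkout path, change number, and patchset are placeholders/assumptions, not values from this task; the actual layout of the deployment-prep puppetmaster may differ.

```
# On the deployment-prep puppetmaster (path is an assumption; adjust to the
# actual operations/puppet checkout used by the project puppetmaster).
cd /var/lib/git/operations/puppet

# Fetch and cherry-pick the Gerrit change under test. Change "123456",
# patchset "3", and the resulting refs/changes path are placeholders.
sudo git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/56/123456/3
sudo git cherry-pick FETCH_HEAD

# Then, on an instance that carries the affected role, force a puppet run
# and check the resulting diff/apply output.
sudo puppet agent --test
```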

A specific example: upgrading the performance team services to use Python 3 (T267269) requires a series of inter-dependent patches to update both our code and some system library dependencies. The puppet changes took several patchsets to get right, e.g. figuring out why services weren't being restarted. It would have been extremely painful to iterate on this in production.

Some pain points I've experienced:

  1. Often, the first step in testing a puppet patch is to get beta back to a working state, pre-patch. For example: T244776#6364483 (Swift in beta had been mostly broken for some time).
  2. Sometimes, differences between production and beta create problems unique to beta. For example: T248041 (puppetmaster OOMs).
  3. Long-lived divergences between beta and production can be a problem, e.g. merge conflicts. For example: T244624. It'd be nice to have a clear policy about when it's OK to un-cherry-pick someone else's patch. (My stance on this re: my patches is in T245402#6517866 - please un-cherry-pick at will).

For the most part, I budget for the above when scoping testing of patches. Certainly not having a testing environment—or having a less permissive test environment without root access—would be way worse than the unrelated issues I've had to fix along the way.

There's a tragedy of the commons, but there are also economies from having a shared environment. I'm not sure it would be reasonable to expect someone to spin up e.g. their own Swift stack whenever they wanted to test a related change. Given our current dependence on puppet in production, I'm not sure spinning up a usable local testing environment for most services is even possible.

I know this is an ongoing issue, but if I were to do some maintenance on beta (read only or temporary outage, T268628), who should I notify?


For brief outages, I'd think #wikimedia-releng (and the related SAL) is probably sufficient - that's where I look when something isn't working to see if someone else is already fixing it.

Cloud folks: it'd be cool if there were automatically-generated email lists based on project membership, e.g. a way to address a reply to all recipients of the "Puppet failure on ..." emails.


T47828: Implement mail aliases for Cloud-VPS projects (<novaproject>@wmcloud.org)

After some discussion, the Release Engineering and Quality and Test Engineering teams have decided to make QTE the "Product Owners" of BetaCluster. This decision comes as part of a larger testing infrastructure effort. The details of what this means and how we will proceed will come out over the course of the coming weeks. In the meantime, this task will be marked as Resolved as the primary objective of this task was to address the lack of "Code Stewardship" or more aptly "Product Ownership".

Apologies for posting on this closed task, but is there any news on the above, some sort of eta on an announcement, details, etc?

The details of what this means and how we will proceed will come out over the course of the coming weeks.

Was this ever done?

taavi removed Jrbranaa as the assignee of this task.
taavi added a subscriber: Jrbranaa.

Boldly re-opening this task, given that the details mentioned in T215217#6665452 have not been published (it's been several months now, well outside the "weeks" range) and the primary problem of the beta cluster being unmaintained and broken is still an issue.

I just linked this task to someone today after explaining that, as far as I know, the code ownership of beta is stalled and I don't know why. So an open status makes the most sense indeed.

@Majavah - yes, this work has stalled due to a shift in my priorities over the last few months. However, it's back on the "front burner". I think it makes sense for this task to remain open until a plan has been pulled together and published.

Just noting here that the deployment-prep Cloud VPS project currently has 30 instances running Debian Stretch, which must be upgraded or deleted by May 2022. These include:

  • The entire media storage (Swift) cluster
  • The entire ElasticSearch cluster
  • The kafka-main and kafka-jumbo clusters, responsible for the MW job queue, purging cached pages, and other tasks, plus Zookeeper, which provides authentication to all Kafka clusters
  • Multiple miscellaneous support services

Is anyone going to work on those?

Should we decide that longer support is necessary, https://deb.freexian.com/extended-lts/ could be an option. Note that both the timeframe and the specific support would have to be defined. I would also caution that this is NOT a "solution" for avoiding the upgrade of these instances. However, it could be part of an upgrade plan if needed.
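
For illustration only, opting an instance into such a repository would amount to adding an extra apt source along the lines of the sketch below. The exact suite name, components, and archive signing key must come from Freexian's own documentation; the values here are assumptions, not verified configuration.

```
# Hypothetical sketch: enable an Extended LTS apt source on a Stretch instance.
# Suite name and components below are assumptions; take the real values (and
# the keyring setup) from Freexian's ELTS documentation.
echo 'deb http://deb.freexian.com/extended-lts stretch main contrib non-free' \
  | sudo tee /etc/apt/sources.list.d/freexian-elts.list
sudo apt-get update
```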

In T215217#7796938, @Majavah wrote:

Just noting here that the deployment-prep Cloud VPS project currently has 30 instances running Debian Stretch, which must be upgraded or deleted by May 2022.

Thanks for raising this, @Majavah, and for all the work you've done on beta; it's in a better place than you found it.


As I mentioned in T215217#6610236, Release-Engineering-Team cares if Beta is down; however, we're not resourced to rebuild all of beta (which is what needs to happen now).

My current plan is to draft something for the tech decision forum so we can figure it out together.

Should we decide that longer support is necessary, https://deb.freexian.com/extended-lts/ could be an option. Note that both the timeframe and the specific support would have to be defined. I would also caution that this is NOT a "solution" for avoiding the upgrade of these instances. However, it could be part of an upgrade plan if needed.

If this is an acceptable solution to buy time, I'm in favor of doing this.

In the time that this would buy, we can figure out how to sustain beta (I hope).

Pinging because one month has passed since the last comment on this.

For everyone's info, currently no Code-Stewardship-Reviews are taking place as there is no clear path forward and as this is not prioritized work.
(Entirely personal opinion: I also assume lack of decision authority due to WMF not having a CTO currently. However, discussing this is off-topic for this task.)

I would like to point out that, especially on dewiki, beta is actively used downstream for the development of templates, modules, JavaScript, etc., with elevated permissions compared to production. It would be a pity to lose these capabilities.

kostajh renamed this task from "deployment-prep: Code stewardship request" to "deployment-prep (beta cluster): Code stewardship request". Feb 27 2024, 11:29 AM
kostajh subscribed.

Beta cluster is actively used to test the Commons app. Must be annoying for testers that upload.wikimedia.beta.wmflabs.org has been down for weeks. I frequently use it to test gadgets. Found a train blocker or two while doing so. But I don't see a solution either. ;-(

@Shashankiitbhu @Sebastian_Berlin-WMSE: here's the reason why upload.wikimedia.beta.wmflabs.org isn't working.

Sunsetting beta requires a plan to replace the use-cases of beta with something more maintainable. We're in the midst of a large transition in production, containerizing our services. There is a staging cluster for services that will likely supplant some portion of beta's use-cases (a "production-like" environment). The remaining use-cases will likely fall into the realm of local development and (possibly) something that utilizes existing containers to allow developers to share changes with one another -- something akin to the existing patchdemo project. This was a major recommendation that was made as part of the exploration of existing local development tooling. As we begin to supplant the use-cases of beta cluster in the future we can form a more fully realized plan about shutting it down.

This idea from @thcipriani and others is how we are actively dealing with the "problem" of deployment-prep today. Projects like T369112: Group -1 pre-train QTE validation environment and Catalyst are happening as part of the 2024-2025 Wikimedia Foundation Annual Plan to incrementally find better supported homes for various use cases in deployment-prep. Data from past surveys on the general topic led us to pick these projects as our first experiments. Future technical community surveys will be used to check on how we did as well as help us find the next and the next and the next focus area. Eventually we will be able to look back and see that we have collectively moved past the problem of deployment-prep ownership by sunsetting the project one use case at a time.

@bd808 to add to your point: We're near another inflection point for deployment-prep: soon the puppet code for configuring MediaWiki in deployment-prep is going to be retired and no longer used in the production environment. By the end of the calendar year, we expect to have moved all of production (hopefully, all on Kubernetes) to PHP 8.x.

No one is tasked with maintaining deployment-prep, so I doubt anyone will pick up the (quite lengthy) job of updating our puppet code to configure MediaWiki for PHP 8, in the absence of a need on the production side of things.


I have been thinking about this general problem as well this quarter. I am interested in putting a Kubernetes cluster into deployment-prep, possibly by using Magnum and OpenTofu to provision the service. In theory we could then get folks to add config for this new environment to their Helm charts for MediaWiki and other things that are hosted via Kubernetes in the production realm.
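
As a sketch of what that provisioning could look like, here is the equivalent flow expressed with the OpenStack CLI rather than OpenTofu (an OpenTofu configuration would declare the same Magnum resources). Every name, image, flavor, and network below is a placeholder, not something that exists in deployment-prep today.

```
# Sketch only: create a Magnum cluster template, then a cluster from it.
# All names, images, flavors, and networks are placeholders.
openstack coe cluster template create \
  --coe kubernetes \
  --image placeholder-k8s-image \
  --external-network placeholder-external-net \
  --master-flavor placeholder-small-flavor \
  --flavor placeholder-worker-flavor \
  k8s-beta-template

openstack coe cluster create \
  --cluster-template k8s-beta-template \
  --master-count 1 \
  --node-count 3 \
  deployment-prep-k8s

# Once the cluster is ACTIVE, fetch a kubeconfig for it.
openstack coe cluster config deployment-prep-k8s
```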


This is a great idea. The 'staging' wikikube cluster would not be dissimilar to this environment, at least from the helm chart configuration perspective. Having a values-beta.yaml helmfile for a service sounds pretty nice and easy enough to maintain.
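
As a hypothetical usage example, if a service's helmfile defined a "beta" environment backed by such a values-beta.yaml, a deployer could target it simply by selecting that environment. The service name below is just an example.

```
# Hypothetical usage, assuming the chart's helmfile defines a "beta" environment
# backed by a values-beta.yaml. Preview the diff, then apply it.
helmfile --environment beta --selector name=mobileapps diff
helmfile --environment beta --selector name=mobileapps apply
```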


I think the problem is larger, especially when it comes to MediaWiki. We'll need to build a new image building pipeline (including a docker registry) just for deployment-prep, for instance, and we'd possibly need to adapt the mediawiki chart and make it more complex. It's a significant amount of work and I can't see it as justified for any team; it would surely be much more work than porting the MediaWiki puppet code forward. The only reason that could justify it is if the organization decides we want to officially support deployment-prep rather than replace it.


beta couldn't use the same MW image?


This is a complex topic that we certainly will not hash out on a semi-related ticket in a few comments. I choose not to believe, however, that we will collectively let deployment-prep perish without any practical substitutes at all. I cannot currently say if that will mean that we find a hero (thanks @Southparkfan for being the latest in this role!), invest more paid human hours in deployment-prep, block various SRE decomm projects until they have suitable replacements, or (likely) use some combination of these and other things to keep entropy from winning. I do however have great faith in the collective intelligence and determination of the paid and volunteer technical contributors to the Wikimedia movement. As a group we can accomplish anything we choose to[*].

[*]: subject to the laws of nature; price and participation may vary; not available in all multiverses; tax, title, and dealer doc fees not included.

We'll need to build a new image building pipeline (including a docker registry)

beta couldn't use the same MW image?

Not as built today because of the inclusion of production secrets including embargoed security patches. We are thinking about how to change this too.

The Developer Experience Group is officially taking stewardship of Beta Cluster and its product offering. As the next step, we will upgrade the beta cluster to PHP 8.1 and communicate how we are further evolving the testing platform offering.

A few examples of the work we are doing were already shared by @bd808 earlier in July, though at this point we have more information and believe that Beta Cluster won't necessarily be sunsetted; rather, its use cases will be trimmed down:

This idea from @thcipriani and others is how we are actively dealing with the "problem" of deployment-prep today. Projects like T369112: Group -1 pre-train QTE validation environment and Catalyst are happening as part of the 2024-2025 Wikimedia Foundation Annual Plan to incrementally find better supported homes for various use cases in deployment-prep. Data from past surveys on the general topic led us to pick these projects as our first experiments. Future technical community surveys will be used to check on how we did as well as help us find the next and the next and the next focus area. Eventually we will be able to look back and see that we have collectively moved past the problem of deployment-prep ownership by sunsetting the project one use case at a time.

Lferreira claimed this task.