deployment-prep: Code stewardship request
Open, Normal, Public

Description

Intro

Deployment-prep, also known as the beta cluster, is a Cloud VPS project originally created by technical volunteers {{cn}}. In the years since, it has become a resource that is used by technical volunteers, the Wikimedia CI pipeline {{cn}}, Foundation staff, and manual testers. It is not, however, proactively maintained by any Foundation staff in their staff capacity.

This is a "weird" stewardship request because this project is not technically part of the Wikimedia production environment. It is also not exactly a single software system. Instead, it is a shared environment used by multiple stakeholders to validate code and configuration changes in a non-production environment. A decision to sunset the beta cluster would be highly disruptive if it did not come with a proposal to build a replacement environment of some type. This environment, however, has spent years in a state of uncertain maintainership, and the code stewardship process seems like the most mature process we have for discussing the merits of the project and how it might be better supported/resourced going forward.

Issues

  • Unclear ownership of the Cloud VPS project instances (meaning that there are a large number of project admins, but little to no documentation about which people are taking care of which instances)
  • Production Puppet code is used, but it often needs customization to work within the various constraints of the Cloud VPS project environment. Merging such customizations requires special +2 rights, and no holder of such rights is currently active in the project.
  • Not all Wikimedia production software changes are deployed in this environment {{cn}}
  • Puppet failures triggered by upstream configuration changes can remain for days or weeks before being addressed, potentially blocking further testing of code and configuration changes

Event Timeline

bd808 created this task. Feb 4 2019, 10:44 PM
Restricted Application added a subscriber: Aklapper. Feb 4 2019, 10:44 PM
Krenair added a subscriber: Krenair.
bd808 updated the task description. Feb 4 2019, 10:56 PM
greg added a subscriber: greg. Feb 7 2019, 10:34 PM
Bawolff added a subscriber: Bawolff. Edited · Feb 19 2019, 5:26 PM

Deployment-prep, also known as the beta cluster, is a Cloud VPS project originally created by technical volunteers {{cn}}

Pretty sure that's not true (in particular, I was under the impression that the idea originated with https://en.labs.wikimedia.org ~2009 with the usability initiative & flagged revisions, and then later morphed into its current form under the guidance of the platform team https://www.mediawiki.org/w/index.php?title=QA_and_testing/Labs_plan&oldid=534819 in 2012). However, I suppose history doesn't matter; what matters is today.


TBH: I'm kind of surprised to see this task. I don't follow betalabs too much, but I thought it was pretty well accepted that it was a RelEng responsibility generally, and that cloud was only responsible for providing the underlying Cloud VPS platform. Is that not the general understanding?

greg added a comment. Feb 19 2019, 5:53 PM

TBH: I'm kind of surprised to see this task. I don't follow betalabs too much, but I thought it was pretty well accepted that it was a RelEng responsibility generally, and that cloud was only responsible for providing the underlying Cloud VPS platform. Is that not the general understanding?

FTR: Responsibility for the services/pieces that make up the Beta Cluster[0] lies with the people who maintain them in production; there's no other way it could conceivably work.

The question is basically: how can a team tasked with everything else it is doing (deployments, tooling, CI, etc) keep up with an SRE team of 20 people when maintaining a shadow environment? It can't, is the answer :) This is a long-understood problem/imbalance by both SRE and RelEng (as in, we all see the problem and have no good answer).

To focus on the future: what is possible and what is needed will change as we migrate more and more parts of our infrastructure to the Deployment Pipeline. We (RelEng and SRE) should scope out what that change is and how it impacts Beta Cluster in the short, medium, and long term (read: that's the conversation that should happen to move this stewardship review forward).

[0] https://wikitech.wikimedia.org/wiki/Help:Labs_labs_labs#Beta_Cluster

jeena added a subscriber: jeena. Feb 20 2019, 12:46 AM
Restricted Application added a subscriber: Liuxinyu970226. Mar 20 2019, 12:43 PM

The question is basically: how can a team tasked with everything else it is doing (deployments, tooling, CI, etc) keep up with an SRE team of 20 people when maintaining a shadow environment?

Question from someone uninitiated: what rule requires the "production environment" and the "shadow environment" to be maintained by different groups (that's what it sounds like)? Given the stated purpose, "validate code and configuration changes in a non-production environment", the greatest benefit would come from identical setups, which would arguably be easiest to achieve by using the same tech stack and know-how (i.e. people). Thanks.

Jrbranaa moved this task from In Review to Prioritized on the Code-Stewardship-Reviews board.