Use Beta cluster as a true canary for code deployments (epic)
Open, Medium, Public

Assigned To

None

Authored By

	greg
	Jul 16 2013, 11:48 PM

Description

Beta Cluster is awesome. It is catching a lot of breakages that would otherwise hit users. We're grateful for it.

What are the specific limitations of Beta Cluster that are preventing us from wholeheartedly trusting a breakage on Beta Cluster as a blocker for wider deployment? Either mark those as blockers of this bug or report them and mark them as blockers :-)

Details

Reference: bz51494

Related Objects
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here.

Status	Assigned	Task
		· · ·
Open	None	T53494 Use Beta cluster as a true canary for code deployments (epic)
Resolved	None	T52622 Special:NewPagesFeed intermittently fails on beta cluster; causes test failure
Resolved	matthiasmullie	T52623 Entering AFTv5 feedback causes error
Resolved	• AlexMonk-WMF	T64835 Setup a Swift cluster on beta-cluster to match production
Resolved	Reedy	T64836 Make use of twemproxy
Stalled	None	T53497 Setup monitoring for Beta Cluster (tracking)
Resolved	yuvipanda	T54357 Set up graphite monitoring for the beta cluster
Resolved	None	T62058 implement master-slave DB for beta labs
Resolved	hashar	T65538 enable SSL/https support again
Resolved	Krenair	T50501 beta: Get SSL certificates for *.{projects}.beta.wmflabs.org
Resolved	bd808	T65746 Use scap to deploy on apaches
Open	None	T87220 Minimize infrastructure differences between Beta Cluster and production
Resolved	Krinkle	T90983 ResourceLoader debug urls should bypass cache when they change
Duplicate	None	T130045 use a deployment branch for beta
Declined	None	T130047 integrate browsertests with beta deployment
		· · ·

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 22 2014, 2:06 AM

• bzimport added projects: Beta-Cluster-Infrastructure, Tracking-Neverending.

• bzimport set Reference to bz51494.

• bzimport added a subscriber: Unknown Object (MLST).

greg created this task. Jul 16 2013, 11:48 PM

I understand it generally runs master rather than the current deployment branch, so it's not really useful for testing changes which happen outside of the normal CI cycle.

It uses Varnish for text instead of Squid, exposing known bugs that do not occur in production.

It apparently uses a different set of extensions to production.

It uses a different deployment system to production, which makes it difficult to reproduce bugs related to non-atomic code tree update.
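
To illustrate the class of bug meant by "non-atomic code tree update", here is a minimal Python sketch. It is not taken from either deployment system; the file names and helper functions are hypothetical. It only shows the general hazard: when a code tree is overwritten one file at a time, anything reading the tree between two writes sees a mix of old and new versions.

```python
import os
import tempfile


def write_tree(root, version, names):
    """Create every file in `names` under `root`, each containing `version`."""
    os.makedirs(root, exist_ok=True)
    for name in names:
        with open(os.path.join(root, name), "w") as f:
            f.write(version)


def read_versions(root, names):
    """Return the set of version strings a reader sees across the whole tree."""
    seen = set()
    for name in names:
        with open(os.path.join(root, name)) as f:
            seen.add(f.read())
    return seen


def in_place_update(root, version, names, observe):
    """Overwrite files one at a time (non-atomic at the tree level).

    `observe` is called after each file is written, standing in for a
    request that happens to be served in the middle of the update.
    """
    observations = []
    for name in names:
        with open(os.path.join(root, name), "w") as f:
            f.write(version)
        observations.append(observe())
    return observations


def demo():
    # Hypothetical file names, purely for illustration.
    names = ["a.php", "b.php", "c.php"]
    with tempfile.TemporaryDirectory() as tmp:
        tree = os.path.join(tmp, "code")
        write_tree(tree, "old", names)
        return in_place_update(
            tree, "new", names, lambda: read_versions(tree, names)
        )


if __name__ == "__main__":
    for step, seen in enumerate(demo(), start=1):
        print(step, sorted(seen))
```

A reader that observes more than one version in a single pass has hit the mixed-tree window; an environment whose update mechanism never opens that window cannot reproduce bugs that depend on it.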

(In reply to comment #1)

I understand it generally runs master rather than the current deployment
branch, so it's not really useful for testing changes which happen outside of
the normal CI cycle.

Right, if I'm understanding you correctly: Beta Cluster won't catch things that aren't first merged to master for some amount of time before reaching the production cluster.

It uses Varnish for text instead of Squid, exposing known bugs that do not
occur in production.

Unfortunate, but hopefully production text will switch to Varnish soon enough. Do you think it is worth switching Beta Cluster (back?) to Squid for the time being? I guess that depends on how long the Varnish text transition ends up taking...

It apparently uses a different set of extensions to production.

Some of this is by design (e.g. Flow), but I'm curious now which extensions differ and why...

It uses a different deployment system to production, which makes it difficult
to reproduce bugs related to non-atomic code tree update.

Right, maybe the wording of "true canary for code deployments" wasn't the best. Maybe "true canary for production"? Deployment on Beta Cluster will differ from production unless and until production moves to a Continuous Deployment system; there is no way around that. Luckily, experience with Beta Cluster should help inform that transition.

(In reply to comment #1)

I understand it generally runs master rather than the current deployment
branch, so it's not really useful for testing changes which happen outside of
the normal CI cycle.

When we created beta, the aim was to catch bugs before they land in wmf branches. We used test.wikipedia.org to test a wmf branch before syncing. Maybe we could set up some more wikis that would use the wmf branches as well.

It uses Varnish for text instead of Squid, exposing known bugs that do not
occur in production.

That followed a discussion I had with Mark over IRC. Since text Varnish was (and is) going to land in production, it seemed like a good idea to test it on beta first. We did discover a few bugs and I think it helped move Varnish text forward.

I would prefer we not revert to Squid: its configuration is not handled via Puppet and I don't think it is worth the effort.

It apparently uses a different set of extensions to production.

There might be some differences. IIRC CheckUser has been explicitly disabled. But if an extension is missing we should add it in and configure it for beta.

It uses a different deployment system to production, which makes it difficult
to reproduce bugs related to non-atomic code tree update.

We use a shared NFS export (/data/project) which is where deployment-bastion (aka tin) and the apaches/jobrunner are reading files from. So we just git pull and have instant deploy, just like we used to do a while ago with Zwinger.

Back in January 2013, we had git-deploy on beta to stage it before deploying it in production. The project is apparently stalled and had some issues with labs, so we reverted to the NFS share. With Sartoris apparently getting some attention, the people working on it could well migrate beta to Sartoris.

Additionally, the reason we are not using scap is that it depends on a Debian package and a myriad of Puppet changes. I don't have merge rights on operations/puppet.git and eventually got fed up trying to get changes merged, so I just abandoned the idea of using scap.

yuvipanda closed subtask T54357: Set up graphite monitoring for the beta cluster as Resolved. Nov 25 2014, 12:11 PM

hashar added a project: Release-Engineering-Team. Nov 25 2014, 1:59 PM

hashar set Security to None.

greg moved this task from To Triage to Backlog on the Beta-Cluster-Infrastructure board. Nov 25 2014, 8:52 PM

zeljkofilipin unsubscribed. Nov 26 2014, 10:00 AM

greg updated the task description. Feb 5 2015, 6:12 PM

Nemo_bis added a subtask: T87220: Minimize infrastructure differences between Beta Cluster and production. Jul 4 2015, 9:28 AM

Restricted Application added a subscriber: Luke081515. Jul 4 2015, 9:28 AM

Nemo_bis added a subtask: T90983: ResourceLoader debug urls should bypass cache when they change. Jul 4 2015, 9:30 AM

IMHO issues such as T90983 are inherent in the nature of Beta, and it's not the only "hopeless" blocker of this bug. Perhaps it would be easier to scope down Beta and give up on the "clone production" goal: focus on a few things and do these well with less effort.

In particular, I think Beta is great for finding PHP errors and letting users do usability testing/exploratory testing. For that sort of thing, even a single-instance MediaWiki would suffice: what really matters for devs and users is

automatic deployment of recent changes (done),
shared maintenance for the sysop part (mostly existing now),
prefilled content to have something to test on (mostly missing now; Beta is empty).

hashar changed the status of subtask T64835: Setup a Swift cluster on beta-cluster to match production from Open to Stalled. Oct 8 2015, 10:27 AM

Restricted Application added a subscriber: Aklapper. Oct 8 2015, 10:27 AM

hashar changed the status of subtask T53497: Setup monitoring for Beta Cluster (tracking) from Open to Stalled. Oct 30 2015, 10:51 PM

hashar changed the status of subtask T64835: Setup a Swift cluster on beta-cluster to match production from Stalled to Open. Jan 20 2016, 2:14 PM

Krinkle closed subtask T90983: ResourceLoader debug urls should bypass cache when they change as Resolved. Feb 21 2016, 2:33 AM

JanZerebecki added a subtask: T130045: use a deployment branch for beta. Mar 15 2016, 6:27 PM

JanZerebecki added a subtask: T130047: integrate browsertests with beta deployment. Mar 15 2016, 6:43 PM

Krinkle unsubscribed. Mar 15 2016, 6:50 PM

Luke081515 moved this task from INBOX to Backlog (ARCHIVED) on the Release-Engineering-Team board. Mar 22 2016, 6:27 PM

greg moved this task from Backlog (ARCHIVED) to Epics (ARCHIVED) on the Release-Engineering-Team board. May 31 2016, 3:35 PM

• AlexMonk-WMF closed subtask T64835: Setup a Swift cluster on beta-cluster to match production as Resolved. Jul 13 2016, 8:37 PM

greg removed a parent task: T4007: [DO NOT USE] Tracking bug [superseded by #Tracking]. Jul 13 2016, 9:19 PM

Nemo_bis edited projects, added Epic; removed Tracking-Neverending. Jul 24 2016, 8:22 PM

greg moved this task from Backlog to Epics / Tracking on the Beta-Cluster-Infrastructure board. Aug 5 2016, 8:48 PM

hashar closed subtask T50501: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org as Resolved. Aug 31 2016, 3:15 PM

Jdforrester-WMF renamed this task from Use Beta cluster as a true canary for code deployments (tracking) to Use Beta cluster as a true canary for code deployments (epic). Apr 30 2018, 7:57 PM

hashar unsubscribed. Jul 31 2018, 10:37 PM

DannyS712 subscribed. Feb 27 2019, 2:40 AM

Jdforrester-WMF subscribed. Apr 9 2019, 5:38 PM

hashar closed subtask T130047: integrate browsertests with beta deployment as Declined. Apr 17 2019, 1:15 PM

AfroThundr3007730 subscribed. Jun 5 2019, 2:54 AM

• Phabricator_maintenance edited projects, added Release-Engineering-Team-TODO; removed Release-Engineering-Team. Jun 12 2019, 11:40 PM

• Phabricator_maintenance moved this task from Should be empty (use Release-Engineering-Team) to Epics on the Release-Engineering-Team-TODO board. Jun 12 2019, 11:41 PM

greg added a project: Release-Engineering-Team. Jun 21 2019, 10:35 PM

Aklapper changed the status of subtask T53497: Setup monitoring for Beta Cluster (tracking) from Stalled to Open. May 19 2020, 4:00 PM

greg changed the status of subtask T53497: Setup monitoring for Beta Cluster (tracking) from Open to Stalled. May 19 2020, 4:13 PM

Krinkle removed a subtask: T59583: Make deployment prep have continuous replication lag. Apr 16 2021, 7:16 PM

thcipriani edited projects, added Release-Engineering-Team (thcipriani-workboard-fiddling); removed Release-Engineering-Team, Release-Engineering-Team-TODO. Apr 20 2021, 3:44 AM