
Use Beta cluster as a true canary for code deployments (epic)
Open, Medium, Public

Description

Beta Cluster is awesome. It is catching a lot of breakages that would otherwise hit users. We're grateful for it.

What are the specific limitations of Beta Cluster that are preventing us from whole-heartedly trusting a breakage on Beta Cluster as a blocker for wider deployment? Either mark those as blockers of this bug or report them and mark them as blockers :-)

Details

Reference
bz51494

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 2:06 AM
bzimport set Reference to bz51494.
bzimport added a subscriber: Unknown Object (MLST).

I understand it generally runs master rather than the current deployment branch, so it's not really useful for testing changes which happen outside of the normal CI cycle.

It uses Varnish for text instead of Squid, exposing known bugs that do not occur in production.

It apparently uses a different set of extensions to production.

It uses a different deployment system to production, which makes it difficult to reproduce bugs related to non-atomic code tree update.

(In reply to comment #1)

I understand it generally runs master rather than the current deployment
branch, so it's not really useful for testing changes which happen outside of
the normal CI cycle.

Right, if I'm understanding you correctly: Beta Cluster won't catch changes that haven't first been on master for some amount of time before reaching the production cluster.

It uses Varnish for text instead of Squid, exposing known bugs that do not
occur in production.

Unfortunate, but hopefully the switch of production text to Varnish will happen soon enough. Do you think it is worth switching Beta Cluster (back?) to Squid for the time being? I guess that depends on how long the Varnish text transition will end up taking...

It apparently uses a different set of extensions to production.

Some of this is by design (e.g. Flow), but I'm curious now which extensions differ and why...

It uses a different deployment system to production, which makes it difficult
to reproduce bugs related to non-atomic code tree update.

Right, maybe the wording of "true canary for code deployments" wasn't the best. Maybe "true canary for production"? Deploying will be different on Beta Cluster until/when/if production moves to a Continuous Deployment system; there is no way to get around that. Luckily, experience with Beta Cluster should help inform that transition.

(In reply to comment #1)

I understand it generally runs master rather than the current deployment
branch, so it's not really useful for testing changes which happen outside of
the normal CI cycle.

When we created beta, the aim was to catch bugs before they land in wmf branches. We used test.wikipedia.org to test out the wmf branch before syncing. Maybe we could set up some more wikis that would use the wmf branches as well.

It uses Varnish for text instead of Squid, exposing known bugs that do not
occur in production.

That followed a discussion I had with Mark over IRC. Since text Varnish was (and is) going to land in production, it seemed like a good idea to test it on beta first. We did discover a few bugs and I think it helped move Varnish text forward.

I would prefer we do not revert to Squid: its configuration is not handled via puppet and I don't think it is worth the effort.

It apparently uses a different set of extensions to production.

There might be some differences. IIRC CheckUser has been explicitly disabled. But if an extension is missing we should add it in and configure it for beta.

It uses a different deployment system to production, which makes it difficult
to reproduce bugs related to non-atomic code tree update.

We use a shared NFS export (/data/project) which is where deployment-bastion (aka tin) and the apaches/jobrunner are reading files from. So we just git pull and have instant deploy, just like we used to do a while ago with Zwinger.
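To make the "non-atomic code tree update" point concrete, here is a minimal sketch in Python. It is not the Beta Cluster or production tooling; the file names, the `release-*` directories, and the `current` symlink are all hypothetical. It only shows why updating a shared directory in place (as a `git pull` on a shared export does) can briefly expose a mix of old and new files to readers, and how building a complete tree and repointing a symlink avoids that window.

```python
import os
import tempfile

FILES = ("a.php", "b.php")  # hypothetical files that must agree on a version


def write_tree(root, version):
    """Write a tiny 'code tree' in which every file holds the same version."""
    os.makedirs(root, exist_ok=True)
    for name in FILES:
        with open(os.path.join(root, name), "w") as f:
            f.write(version)


def read_versions(root):
    """Return the set of versions a reader (e.g. a web server) sees right now."""
    versions = set()
    for name in FILES:
        with open(os.path.join(root, name)) as f:
            versions.add(f.read())
    return versions


def in_place_update(root, version):
    """Overwrite files one by one, yielding what a reader sees after each step."""
    for name in FILES:
        with open(os.path.join(root, name), "w") as f:
            f.write(version)
        yield read_versions(root)


def atomic_switch(base, link_name, version):
    """Build a complete new tree, then repoint a symlink with a single rename."""
    new_root = os.path.join(base, "release-" + version)
    write_tree(new_root, version)
    tmp_link = os.path.join(base, link_name + ".tmp")
    os.symlink(new_root, tmp_link)
    # Assumption: POSIX filesystem, where rename over an existing name is atomic.
    os.replace(tmp_link, os.path.join(base, link_name))


def demo():
    base = tempfile.mkdtemp()

    # In-place update of a shared directory: readers can see mixed versions.
    shared = os.path.join(base, "shared")
    write_tree(shared, "v1")
    seen_in_place = list(in_place_update(shared, "v2"))

    # Symlink switch: readers see either the old tree or the new one.
    atomic_switch(base, "current", "v1")
    before = read_versions(os.path.join(base, "current"))
    atomic_switch(base, "current", "v2")
    after = read_versions(os.path.join(base, "current"))
    return seen_in_place, before, after


if __name__ == "__main__":
    print(demo())
```

In the in-place case the first observation contains both versions at once, which is the kind of transient state that is hard to reproduce when the two environments deploy differently.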

Back in January 2013, we had git-deploy on beta to stage it before deploying in production. The project is apparently stalled and had some issues with labs, so we reverted to the NFS share. With Sartoris apparently getting some attention, the people working on it could well migrate beta to Sartoris.

Additionally, the reason we are not using scap is that it depends on a Debian package and a myriad of puppet changes. I don't have merge rights on operations/puppet.git and eventually got fed up trying to get changes merged in, so I just abandoned the idea of using scap.

IMHO issues such as T90983 are inherent in the nature of Beta, and that one is not the only "hopeless" blocker of this bug. Perhaps it would be easier to scope Beta down and give up on the "clone production" goal: focus on a few things and do these well with less effort.

In particular, I think Beta is great for finding PHP errors and letting users do usability testing/exploratory testing. For that sort of thing, even a single-instance MediaWiki would suffice: what really matters for devs and users is

  • automatic deployment of recent changes (done),
  • shared maintenance for the sysop part (mostly existing now),
  • prefilled content to have something to test on (mostly missing now; Beta is empty).
Jdforrester-WMF renamed this task from Use Beta cluster as a true canary for code deployments (tracking) to Use Beta cluster as a true canary for code deployments (epic). Apr 30 2018, 7:57 PM