
Experiment with continuous deployment using Blubberoid
Open, Stalled, Medium, Public

Description

The Release Engineering team wants to experiment with continuous deployment of services, automatically from Gerrit (after a +2 code review) to a container in Kubernetes. We've chosen Blubberoid as the service to experiment with. This is the umbrella task for that work. Parts of the work will be added here as sub-tasks.

The acceptance criterion is: after a RelEng team member with +2 rights in Gerrit votes +2 on a change for Blubberoid, the change gets built, tested, and deployed to Kubernetes fully automatically, within five minutes.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I think we should start by looking at the steps that follow a developer pushing a new change to Gerrit. The push should trigger Zuul to run a Jenkins job that builds Blubberoid, runs its unit tests, and builds, publishes, and tags the Docker image(s). Then things stall until a reviewer votes +2 for code review. That vote will trigger Zuul to run another Jenkins job, which deploys the previously tagged Docker image to Kubernetes and changes the K8s or load balancer configuration to use containers running the new image instead of the previous one.

We'll want a manually triggered rollback that takes the new image out of use and puts the previous image back into service.

I'm unsure as to the details, but does this approach seem sensible overall? Which parts do we have already that we can reuse?

Reedy renamed this task from Experiment with continuous deployment using Blubberoid to Experiment with continuous deployment using Blubberoid. Jan 18 2019, 7:33 PM

This is a first rough draft of an outline of a continuous deployment (CDep) pipeline for the Blubberoid service, meant as a starting point for discussion. The goal is to fully automate everything from a +2 code review vote until the change runs in production. Note that this is for Blubberoid ONLY, and for deploying it to Kubernetes ONLY.

  • PUSH: Developer pushes a patch (or set of patches) to Gerrit as refs/for/master. Gerrit creates an entry in its database to track the change. Potential reviewers are added and notified.

    This should work already and require no changes.
  • BUILD: Zuul notices a new patch set, and triggers a job to build the patch set and run any unit tests and other tests that can be run from the build tree. (FIXME: does this need to be specified in more detail? should a Docker image be built as well at this stage?). The job runs on a Jenkins worker. If the job fails, Jenkins notifies Gerrit, which records a -1 verification vote on the change (in this case the pipeline fails and stops, but can be restarted if the developer pushes a new patch set).

    Except for details of the job that gets triggered, this should already work.
  • REVISE: Time passes. Reviewers may request changes. Developer may push new versions of the patch set, which start the pipeline from the beginning.

    As far as I understand, this already works.
  • APPROVE: A suitably authorized reviewer votes +2 for code review. Zuul notices the +2 vote and triggers a job that merges the change to master, rebuilds, retests, builds Docker images, and publishes the images in the image store, tagged suitably.

    As I understand, this should work, except maybe for the details of the job.
  • DEPLOY: Zuul also triggers a job (or the same job continues with this) to deploy the built Docker image to production Kubernetes and to migrate traffic to containers running the new image.

    This is new. We need a Jenkins worker (or a whole new Jenkins instance?) that is permitted to deploy to production. What permissions do we need for that?

(I've labelled the steps so we can refer to them easily.)
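To make the labelled steps easier to discuss, here is a minimal sketch of them as a small state machine in Python. It only illustrates the ordering described above. The event names (`zuul_triggered`, `tests_passed`, and so on) and the `STOPPED` state are hypothetical labels invented for this sketch, not Gerrit, Zuul, or Jenkins identifiers.

```python
from enum import Enum


class Stage(Enum):
    PUSH = "PUSH"
    BUILD = "BUILD"
    REVISE = "REVISE"
    APPROVE = "APPROVE"
    DEPLOY = "DEPLOY"
    STOPPED = "STOPPED"  # hypothetical: build failed, waiting for a new patch set


# (current stage, event) -> next stage. Event names are made up for illustration.
TRANSITIONS = {
    (Stage.PUSH, "zuul_triggered"): Stage.BUILD,
    (Stage.BUILD, "tests_passed"): Stage.REVISE,
    (Stage.BUILD, "tests_failed"): Stage.STOPPED,     # Gerrit records a -1 verification vote
    (Stage.STOPPED, "new_patch_set"): Stage.PUSH,     # developer restarts the pipeline
    (Stage.REVISE, "new_patch_set"): Stage.PUSH,      # pipeline starts from the beginning
    (Stage.REVISE, "code_review_plus_2"): Stage.APPROVE,
    (Stage.APPROVE, "images_published"): Stage.DEPLOY,
}


def run(events, start=Stage.PUSH):
    """Walk the pipeline through a sequence of events and return the final stage."""
    stage = start
    for event in events:
        key = (stage, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not valid in stage {stage.name}")
        stage = TRANSITIONS[key]
    return stage


if __name__ == "__main__":
    happy_path = ["zuul_triggered", "tests_passed",
                  "code_review_plus_2", "images_published"]
    print(run(happy_path).name)
```

In this sketch a +2 vote is only accepted in REVISE, so a change whose build failed cannot reach DEPLOY without a new patch set. Whether the real pipeline should enforce that is an open question for this task.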

Unless there is some part I am missing, I don't think we'll need to be changing any k8s configuration regarding load balancers, etc.

If we use helm to upgrade the release as it appears we do during deployments currently, there won't be a previous deployment that we'll need to manage. Then, in the event of a rollback, we can use helm to roll back to the previous revision. That involves knowing which revision we would like to roll back to, which we can use helm history to find out.
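As a rough illustration of picking the rollback target, the sketch below chooses a revision from `helm history` output and builds the matching `helm rollback` command without running it. It assumes the history was fetched as JSON (for example with `helm history <release> -o json`) and that each entry carries `revision` and `status` fields. The release name `blubberoid` and the sample data are made up for the example.

```python
import json


def previous_revision(history_json):
    """Return the newest superseded revision older than the deployed one."""
    entries = json.loads(history_json)
    deployed = [e["revision"] for e in entries if e["status"] == "deployed"]
    if not deployed:
        raise ValueError("no deployed revision found")
    current = max(deployed)
    older = [e["revision"] for e in entries
             if e["status"] == "superseded" and e["revision"] < current]
    if not older:
        raise ValueError("no earlier revision to roll back to")
    return max(older)


def rollback_command(release, history_json):
    """Build (but do not execute) the helm rollback command line."""
    return ["helm", "rollback", release, str(previous_revision(history_json))]


# Made-up sample shaped like the assumed JSON output of `helm history`.
SAMPLE_HISTORY = json.dumps([
    {"revision": 1, "status": "superseded"},
    {"revision": 2, "status": "superseded"},
    {"revision": 3, "status": "deployed"},
])

if __name__ == "__main__":
    print(rollback_command("blubberoid", SAMPLE_HISTORY))
```

A real job would also need to decide what to do when the previous revision was itself a failed deploy. This sketch skips anything that is not marked superseded.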

My understanding is that this is missing the token for Jenkins to be able to push a deployment to K8s.

greg triaged this task as Medium priority. May 9 2019, 11:49 PM

Change 634354 had a related patch set uploaded (by Jeena Huneidi; owner: Jeena Huneidi):
[operations/deployment-charts@master] Helmfile for continuous deployment

https://gerrit.wikimedia.org/r/634354

Change 658750 had a related patch set uploaded (by Jeena Huneidi; owner: Jeena Huneidi):
[operations/puppet@production] [WIP] Apply global helmfile after pull

https://gerrit.wikimedia.org/r/658750

Change 658750 abandoned by Jeena Huneidi:
[operations/puppet@production] [WIP] Apply global helmfile after pull

Reason:
After discussion in the pipeline repo, it's indeed not secure enough to trust a +2 from Gerrit. It would also be best to have an automatic rollback strategy before implementing continuous delivery.

https://gerrit.wikimedia.org/r/658750

thcipriani changed the task status from Open to Stalled. Mar 30 2021, 8:24 PM
thcipriani subscribed.

Discussed with serviceops -- pausing this work pending the rollback strategy and GitLab work.