
Experiment with hosted kubernetes solutions for Beta
Open, Stalled, Normal, Public

Description

As we're moving more services through the Deployment Pipeline to production, beta is beginning to suffer.

There are several proposed solutions; let's see if using an existing hosted k8s solution is viable.

Problems

  1. The Deployment Pipeline is currently unable to perform system tests that incorporate both a change to a service and an existing MediaWiki installation; it is limited to e2e testing of the service itself.
  2. A k8s cluster that integrates with Beta Cluster (that is, one with secure network ingress/egress between deployed pods/services and existing deployment-prep instances) would allow the Deployment Pipeline to perform this kind of testing. However, at this time neither SRE nor RelEng can commit to maintaining an in-house k8s cluster for this purpose.

Proposal

Experiment with [third party k8s provider] to evaluate its potential as a third-party hosted k8s cluster that can:

  1. Provide a k8s cluster that the Deployment Pipeline can target as part of its graduated deployment/testing strategy.
  2. Securely integrate with Beta Cluster at a network level.
  3. Run e2e helm tests that exercise service changes and existing MediaWiki deployments in Beta Cluster together.

Evaluation

A very basic test for teasing out whether any third party k8s is viable could be:

  1. Can our existing Mathoid helm chart be used to deploy there?
  2. If not, how much refactoring would the chart(s) need? More precisely, can we make them work with both [third party k8s] and WMF k8s without too much divergence?
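One rough way to put a number on "too much divergence" is to compare the values overrides each cluster would need for the same chart. The sketch below is only an illustration under assumptions: the values dictionaries, their keys, and the helper names `flatten` and `divergent_keys` are hypothetical and are not taken from the Mathoid chart.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel for "key not set in this values file"


def flatten(values: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested values into dotted paths, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def divergent_keys(a: Dict[str, Any], b: Dict[str, Any]) -> List[str]:
    """Return the sorted dotted paths whose values differ between a and b."""
    flat_a, flat_b = flatten(a), flatten(b)
    return sorted(
        key
        for key in flat_a.keys() | flat_b.keys()
        if flat_a.get(key, _MISSING) != flat_b.get(key, _MISSING)
    )


# Hypothetical overrides for the same chart on two clusters.
wmf_values = {"service": {"type": "NodePort", "port": 8080}, "replicas": 1}
hosted_values = {"service": {"type": "LoadBalancer", "port": 8080}, "replicas": 1}

print(divergent_keys(wmf_values, hosted_values))
```

A short list of differing keys would suggest that one chart can serve both clusters with per-cluster overrides; a long list would point toward more substantial refactoring.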

Event Timeline

Restricted Application added a subscriber: Aklapper. May 8 2019, 4:50 PM
thcipriani assigned this task to dduvall. May 8 2019, 4:51 PM
thcipriani triaged this task as Normal priority.

assigning to @dduvall based on hangout discussion

Krenair added a subscriber: Krenair. May 8 2019, 4:56 PM
dduvall updated the task description. May 8 2019, 5:41 PM
dduvall updated the task description. May 8 2019, 5:48 PM
dduvall updated the task description. May 8 2019, 6:01 PM
dduvall updated the task description.
jeena added a subscriber: jeena. May 8 2019, 6:05 PM
dduvall removed dduvall as the assignee of this task. Jun 7 2019, 11:55 PM

I was able to run Mathoid just fine on GKE using the latest chart. What remains of this experiment, however, is getting Beta Cluster's MediaWiki talking to a service deployed to GKE (or Amazon EKS, if that makes sense for us policy- and budget-wise) and vice versa; basically, the networking part.

A couple of options:

  1. VPN between the deployment-prep labs project and Google GKE using IPsec and Google VPC. This might involve more investment to set up initially, but once (if) it's working, Beta instances should be able to freely communicate with anything deployed on GKE. There may be a DNS component as well.
  2. Ingress on both ends, with communication over the public internet. This might be easier to set up initially but requires additional ingress configuration for each service. MediaWiki/service communication shouldn't include anything sensitive, but it's possible that it does.
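Whichever option is tried, a simple first check is whether a TCP connection can be opened in each direction at all. The following is a minimal sketch of such a probe, not a tested procedure; the function name `can_reach` is hypothetical, and the host and port in the usage note are placeholders rather than real endpoints.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Run from a Beta instance against the service's address on the hosted cluster, and from a pod against a Beta instance, something like `can_reach("service.example.invalid", 443)` would show whether the VPN route or the ingress is reachable before any MediaWiki configuration is changed. It only tests TCP reachability; it says nothing about TLS, authentication, or whether the service responds correctly.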

Unassigning while on leave. Anyone should feel free to pick this up.

greg added a subscriber: greg.

Moving to our current kanban board so @jeena can chat with @dduvall when he's back next week.

jeena removed jeena as the assignee of this task. Wed, Sep 18, 6:00 PM
jeena moved this task from Doing to INBOX on the Release-Engineering-Team-TODO (201909) board.
Bstorm added a subscriber: Bstorm. Wed, Sep 18, 7:01 PM

If this project ends up integrating with WMCS-managed stuff at all (Beta cluster -- does that mean deployment-prep?), I'd at least be interested in being a fly on the wall. I'm kind of curious what people come up with in general for our use or understanding, but if we are doing any peering or VPN with things in Cloud, I definitely would like to know to see how it impacts things.

bd808 added a subscriber: bd808. Wed, Sep 18, 7:19 PM
thcipriani changed the task status from Open to Stalled. Wed, Sep 18, 8:58 PM

> If this project ends up integrating with WMCS-managed stuff at all (Beta cluster -- does that mean deployment-prep?), I'd at least be interested in being a fly on the wall. I'm kind of curious what people come up with in general for our use or understanding, but if we are doing any peering or VPN with things in Cloud, I definitely would like to know to see how it impacts things.

Yep, that was the tentative plan here; however, as things stand, the Deployment Pipeline project, which was the reason we started down this path, wasn't funded as part of the annual planning process.

I'm going to call this task stalled for the moment since we don't have the bandwidth to work on it :(

I would like to say that if we are considering integrating external clouds into deployment-prep, we should sort out access to them for existing deployment-prep members, and for new ones going forward, before committing to anything. I don't want to end up in a situation where part of deployment-prep can only be administered from inside the wikimedia.org Google domain or something.

> Beta cluster -- does that mean deployment-prep?

Yes.