turn up 'aux' k8s cluster for o11y and other "ancillary"/"supportive" services
Closed, Resolved, Public

Description

The defined scope of the 'wikikube' cluster explicitly excludes monitoring tools like Grafana, Kibana, etc.

This means that we need somewhere else to run the centralized parts of Jaeger (writing into storage, and the query + UI elements for retrieving trace data).

We discussed and rejected the idea of running these in a custom Docker / docker-compose setup on either a VM or a baremetal host. We decided it would be about as much work -- and much more reusable -- to simply turn up a small, new k8s cluster, which we decided to term 'aux'. (Another option that was considered was "ancillary", which was thought more expressive but much longer to type.)

The scope for this cluster, at least to start with, would be confined to just observability tools and other SRE-supported critical infrastructure services (for example, Netbox would be considered in-scope). We can broaden this later, but the intent is to avoid the cluster becoming a 'junk drawer'.

With aux as the cluster name prefix, we also decided upon aux-k8s-etcd, aux-k8s-ctrl, aux-k8s-worker as the machine name prefixes, similar to what Data Engineering did with dse-k8s-.

The initial plan is to run all of this on Ganeti, in just one of the core clusters, and to start with just a couple of worker nodes.

So, on eqiad Ganeti, we need to turn up:

  • 3x aux-k8s-etcd nodes, 1G RAM, 1vcpu each
  • 2x aux-k8s-ctrl nodes, 4G RAM, 1vcpu each
  • 2x aux-k8s-worker nodes, 16G RAM, 8vcpu each
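
For capacity planning on the Ganeti side, the request above can be summed up as in the minimal sketch below. The `NodeGroup` class and `totals` helper are illustrative names only (they are not part of any Wikimedia or Ganeti tooling), and "G" is taken as written in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    """One line of the request: a machine name prefix and a per-VM size."""
    prefix: str   # machine name prefix from the task description
    count: int    # number of VMs requested
    ram_g: int    # RAM per VM, in "G" as written in the task
    vcpus: int    # vCPUs per VM


# The three groups requested for eqiad Ganeti, copied from the list above.
GROUPS = [
    NodeGroup("aux-k8s-etcd", count=3, ram_g=1, vcpus=1),
    NodeGroup("aux-k8s-ctrl", count=2, ram_g=4, vcpus=1),
    NodeGroup("aux-k8s-worker", count=2, ram_g=16, vcpus=8),
]


def totals(groups):
    """Return (total VMs, total RAM in G, total vCPUs) across all groups."""
    vms = sum(g.count for g in groups)
    ram = sum(g.count * g.ram_g for g in groups)
    vcpus = sum(g.count * g.vcpus for g in groups)
    return vms, ram, vcpus


if __name__ == "__main__":
    vms, ram, vcpus = totals(GROUPS)
    print(f"{vms} VMs, {ram}G RAM, {vcpus} vCPUs in total")
```

In total that is 7 VMs, 43G of RAM, and 21 vCPUs, most of it in the two worker nodes.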

Event Timeline

CDanis added a subscriber: jhathaway.

@jhathaway have you had the opportunity to work with our Ganeti installation yet? If not, please take a look at the instructions and start turning up some nodes :) You can file the provisioning tickets as sub-tasks of this one.

sounds good, I'll grab this one

My thanks for spawning this new cluster that is clearly needed.

I've gone ahead and added https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#aux to make sure we document the tight scope and stated goal and avoid the 'junk drawer' issue.

In case you haven't found them already, docs for spinning up a new cluster are at: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New

Change 850586 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] aux-k8s: ctrl & wrkr roles

https://gerrit.wikimedia.org/r/850586

Change 850586 merged by JHathaway:

[operations/puppet@production] aux-k8s: ctrl & wrkr roles

https://gerrit.wikimedia.org/r/850586

Change 850604 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] aux-k8s: disable lvs

https://gerrit.wikimedia.org/r/850604

Change 850604 merged by JHathaway:

[operations/puppet@production] aux-k8s: disable lvs

https://gerrit.wikimedia.org/r/850604

Change 853004 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/deployment-charts@master] aux-k8s: initial values

https://gerrit.wikimedia.org/r/853004

Change 853004 merged by JHathaway:

[operations/deployment-charts@master] aux-k8s: initial values

https://gerrit.wikimedia.org/r/853004

Change 853009 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/deployment-charts@master] aux-k8s: env config

https://gerrit.wikimedia.org/r/853009

Change 853009 merged by JHathaway:

[operations/deployment-charts@master] aux-k8s: env config

https://gerrit.wikimedia.org/r/853009

I've just happened upon this cluster too and I think that it's an excellent idea. Thanks @CDanis and @jhathaway for taking the trouble.

If I can be of any help, please do feel free to include me in any reviews or discussions about it, especially given that this is another cluster that's based on the wikikube/ml-serve/dse-k8s model.

Hopefully, we will be able to make effective use of the fact that both dse-k8s and aux-k8s will be starting out with very limited workloads and might therefore be useful for testing upgrade paths etc. that would be common to all of these clusters.

Change 854110 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/homer/public@master] aux-k8s: add BGP config for calico

https://gerrit.wikimedia.org/r/854110

Change 855039 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] aux-k8s: use default cni-config

https://gerrit.wikimedia.org/r/855039

Change 855039 merged by JHathaway:

[operations/puppet@production] aux-k8s: use default cni-config

https://gerrit.wikimedia.org/r/855039

Change 855092 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/deployment-charts@master] aux-k8s: remove istio mesh values

https://gerrit.wikimedia.org/r/855092

Change 855092 merged by JHathaway:

[operations/deployment-charts@master] aux-k8s: remove istio mesh values

https://gerrit.wikimedia.org/r/855092

Change 854110 merged by JHathaway:

[operations/homer/public@master] aux-k8s: add BGP config for calico

https://gerrit.wikimedia.org/r/854110

Change 856694 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] aux-k8s: fix bgp_peers

https://gerrit.wikimedia.org/r/856694

Change 856694 merged by JHathaway:

[operations/puppet@production] aux-k8s: fix bgp_peers

https://gerrit.wikimedia.org/r/856694

Change 857009 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/deployment-charts@master] aux-k8s: remove CoreDNS affinity rules

https://gerrit.wikimedia.org/r/857009

Change 857009 merged by JHathaway:

[operations/deployment-charts@master] aux-k8s: remove CoreDNS affinity rules

https://gerrit.wikimedia.org/r/857009

Change 857035 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/deployment-charts@master] aux-k8s: fix affinity for coredns

https://gerrit.wikimedia.org/r/857035

Change 857035 merged by JHathaway:

[operations/deployment-charts@master] aux-k8s: fix affinity for coredns

https://gerrit.wikimedia.org/r/857035

Change 857043 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] aux-k8s: add pki intermediate for cfssl

https://gerrit.wikimedia.org/r/857043

Change 857043 merged by JHathaway:

[operations/puppet@production] aux-k8s: add pki intermediate for cfssl

https://gerrit.wikimedia.org/r/857043

Change 857045 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] aux-k8s: add deployment service

https://gerrit.wikimedia.org/r/857045

Change 857045 merged by JHathaway:

[operations/puppet@production] aux-k8s: add deployment service

https://gerrit.wikimedia.org/r/857045

Change 857668 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] aux-k8s: allow kubepods to talk to pki

https://gerrit.wikimedia.org/r/857668

Change 857668 merged by JHathaway:

[operations/puppet@production] aux-k8s: allow kubepods to talk to pki

https://gerrit.wikimedia.org/r/857668

Change 857786 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/deployment-charts@master] aux-k8s: fix pod ips for network policies

https://gerrit.wikimedia.org/r/857786

Change 857786 merged by JHathaway:

[operations/deployment-charts@master] aux-k8s: fix pod ips for network policies

https://gerrit.wikimedia.org/r/857786

Change 858395 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] aux-k8s: monitor eqiad BGP sessions

https://gerrit.wikimedia.org/r/858395

Change 858395 merged by JHathaway:

[operations/puppet@production] aux-k8s: monitor eqiad BGP sessions

https://gerrit.wikimedia.org/r/858395

The cluster is up and operational. All known bugs and misconfigurations have been resolved, though I suspect there are some unknown ones! Feel free to try deploying workloads to the cluster.

Change 978129 had a related patch set uploaded (by CDanis; author: Chris Danis):

[operations/deployment-charts@master] [aux-k8s-eqiad] add kube-state-metrics

https://gerrit.wikimedia.org/r/978129

Change 978129 merged by jenkins-bot:

[operations/deployment-charts@master] [aux-k8s-eqiad] add kube-state-metrics

https://gerrit.wikimedia.org/r/978129