Cloud services enhancement proposal: Prometheus metrics for Toolforge/Toolsbeta/Paws Kubernetes clusters
Closed, Resolved (Public)

Description

Proposal Title: Prometheus metrics for Toolforge/Toolsbeta/Paws Kubernetes clusters

Brief description: Proposal to add local Prometheus instances for Kubernetes system metrics in the VPS projects that host kubeadm-based Kubernetes clusters, and to hook those instances up to Alertmanager.

Why:
The first part of this proposal involves splitting PAWS out of tools-prometheus and building tools-prometheus instances in the toolsbeta project. PAWS is included in the Tools Prometheus mostly for historical reasons; splitting it off reduces coupling between those projects and simplifies building a copy in toolsbeta. Having Prometheus in toolsbeta would let us test changes without worrying about breaking the production environment. Kubernetes metrics consume a fair bit of space, so for now we don't want them in the shared metricsinfra install.

The second part involves hooking those local Prometheus instances into the metricsinfra Alertmanager instance so that we are notified when the metrics indicate a problem. This expands the scope of metricsinfra a bit (bring-your-own-Prometheus for alerts), but I think that's fine in this case: Alertmanager handles multiple Prometheus instances just fine, the Karma dashboard ACLs should be able to handle this if we tag those metrics correctly, and it means we have one dashboard (prometheus-alerts.wmcloud.org) instead of four.
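
To make the tagging idea concrete, here is a rough sketch of what each local Prometheus might be configured with. It is only an illustration, not the actual Puppet-managed config: the `project` label name, the Alertmanager target address, and the helper function are all assumptions. The `global.external_labels` and `alerting.alertmanagers` keys are standard Prometheus configuration; everything else is made up for the example.

```python
import json

# Hypothetical address of the shared Alertmanager; the real endpoint is not
# specified in this task.
SHARED_ALERTMANAGER_TARGETS = ["alertmanager.example.invalid:9093"]


def build_prometheus_config(project, alertmanager_targets):
    """Return a Prometheus config fragment (as a plain dict) for one project.

    The external label is attached to every alert this Prometheus sends, so a
    shared Alertmanager and dashboard could tell the projects apart.
    The label name "project" is an assumption for this sketch.
    """
    return {
        "global": {
            "external_labels": {"project": project},
        },
        "alerting": {
            "alertmanagers": [
                {"static_configs": [{"targets": list(alertmanager_targets)}]},
            ],
        },
    }


# One local Prometheus per project, all pointing at the same Alertmanager.
configs = {
    name: build_prometheus_config(name, SHARED_ALERTMANAGER_TARGETS)
    for name in ("tools", "toolsbeta", "paws")
}

print(json.dumps(configs["paws"], indent=2))
```

With something like this, the only per-project difference is the label value, which is what dashboard ACLs could key on.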

Risks:

  • Incomplete/missing metrics make service troubleshooting harder
  • Incomplete/missing alerts can lead to (more) user-visible service downtime
  • Prometheus needs read-only Kubernetes credentials - risk of cluster compromise?
  • Hooking up to the metricsinfra Alertmanagers in theory gives admins in those projects cloud-wide powers

Design documentation: none yet

More info:
Open question: where to manage alert rules?

Event Timeline

Change 774381 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] paws: add paws prometheus role/profile

https://gerrit.wikimedia.org/r/774381

Change 774382 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] paws: add haproxy routing for prometheus

https://gerrit.wikimedia.org/r/774382

Change 774381 merged by Vivian Rook:

[operations/puppet@production] paws: add paws prometheus role/profile

https://gerrit.wikimedia.org/r/774381

Change 774382 merged by David Caro:

[operations/puppet@production] paws: add haproxy routing for prometheus

https://gerrit.wikimedia.org/r/774382

We discussed this in the WMCS team meeting today, and pretty much agreed with this idea.

Change 778622 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::paws::prometheus: add kubernetes prometheus jobs

https://gerrit.wikimedia.org/r/778622

Change 778673 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::prometheus: remove paws jobs

https://gerrit.wikimedia.org/r/778673

Change 778622 merged by Vivian Rook:

[operations/puppet@production] P:wmcs::paws::prometheus: add kubernetes prometheus jobs

https://gerrit.wikimedia.org/r/778622

Change 778673 merged by Vivian Rook:

[operations/puppet@production] P:toolforge::prometheus: remove paws jobs

https://gerrit.wikimedia.org/r/778673

Change 779474 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::prometheus: simplify prometheus config

https://gerrit.wikimedia.org/r/779474

Change 779474 merged by David Caro:

[operations/puppet@production] P:toolforge::prometheus: simplify prometheus config

https://gerrit.wikimedia.org/r/779474

Change 788303 had a related patch set uploaded (by Majavah; author: Majavah):

[labs/private@master] ssl: Add dummy key for toolsbeta k8s prometheus

https://gerrit.wikimedia.org/r/788303

Change 788305 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::prometheus: add toolsbeta support

https://gerrit.wikimedia.org/r/788305

Change 788303 merged by Andrew Bogott:

[labs/private@master] ssl: Add dummy key for toolsbeta k8s prometheus

https://gerrit.wikimedia.org/r/788303

Change 788305 merged by David Caro:

[operations/puppet@production] P:toolforge::prometheus: add toolsbeta support

https://gerrit.wikimedia.org/r/788305

Change 795192 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:metricsinfra::alertmanager: proxy access for trusted projects

https://gerrit.wikimedia.org/r/795192

Change 802104 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint

https://gerrit.wikimedia.org/r/802104

Change 795192 merged by David Caro:

[operations/puppet@production] P:metricsinfra::alertmanager: proxy access for trusted projects

https://gerrit.wikimedia.org/r/795192

Change 802104 merged by Filippo Giunchedi:

[operations/puppet@production] P:(toolforge|wmcs::paws)::prometheus: configure alertmanager endpoint

https://gerrit.wikimedia.org/r/802104

Change 890489 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] alerts: Allow customizing the git repository info

https://gerrit.wikimedia.org/r/890489

Change 890490 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::prometheus: deploy alert rules from GitLab

https://gerrit.wikimedia.org/r/890490

Change 890489 merged by Filippo Giunchedi:

[operations/puppet@production] alerts: Allow customizing the git repository info

https://gerrit.wikimedia.org/r/890489

Change 890490 merged by David Caro:

[operations/puppet@production] P:toolforge::prometheus: deploy alert rules from GitLab

https://gerrit.wikimedia.org/r/890490