Proposal Title: Prometheus metrics for Toolforge/Toolsbeta/PAWS Kubernetes clusters
Brief description: Proposal to add local Prometheus instances for Kubernetes system metrics in the VPS projects hosting kubeadm-based Kubernetes clusters, and to hook those instances up to Alertmanager.
Why:
The first part of this proposal involves splitting PAWS off from tools-prometheus and building a counterpart of tools-prometheus in the toolsbeta project. PAWS is included in tools-prometheus mostly for historical reasons; splitting it off reduces coupling between those projects and simplifies building a copy in toolsbeta. Having Prometheus in toolsbeta would let us test changes without worrying about breaking the production environment. Kubernetes metrics consume a fair bit of space, so we currently don't want those in the shared metricsinfra install.
The second part involves hooking those local Prometheus instances into the metricsinfra Alertmanager instance so we are notified when the metrics indicate a problem. This expands the scope of metricsinfra a bit (bring-your-own-Prometheus for alerts), but I think that's fine in this case: Alertmanager supports multiple Prometheus instances just fine, the Karma dashboard ACLs should be able to handle this if we tag those metrics right, and this means that we have one dashboard (prometheus-alerts.wmcloud.org) instead of four.
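As a rough illustration of the second part, a project-local Prometheus could point at the shared Alertmanager and tag everything it sends with a project label. This is only a sketch: the `project` label name, its value, and the Alertmanager hostname are placeholders I made up, not the actual metricsinfra setup. The `global.external_labels` and `alerting.alertmanagers` keys themselves are standard Prometheus configuration.

```yaml
# prometheus.yml fragment for a project-local instance (sketch only)
global:
  # External labels are attached to every alert this instance sends,
  # so Alertmanager routing and Karma ACLs could key off them.
  external_labels:
    project: toolsbeta   # assumed label name and value

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Placeholder address; the real metricsinfra Alertmanager
            # endpoint(s) would go here.
            - alertmanager.example.invalid:9093
```

With one such label per project, the four Prometheus instances remain distinguishable on the shared dashboard.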
Risks:
- Incomplete/missing metrics make service troubleshooting harder
- Incomplete/missing alerts can lead to (more) user-visible service downtime
- Prometheus needs read-only Kubernetes credentials - risk of cluster compromise?
- Hooking up to the metricsinfra Alertmanagers in theory gives admins in those projects cloud-wide powers
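On the credentials risk, the read-only access could be scoped to roughly what Prometheus' Kubernetes service discovery and scraping need. The following ClusterRole is a minimal sketch along the lines of commonly used Prometheus RBAC setups; the role name is hypothetical, and the exact resource list would depend on which metrics we end up scraping.

```yaml
# Sketch of a read-only ClusterRole for Prometheus (name is hypothetical)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-readonly
rules:
  # Discovery and scraping of cluster objects; note that secrets
  # are deliberately not listed.
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  # Access to the API server's own /metrics endpoint
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
```

A role like this limits the damage if the credentials leak, though it still exposes cluster metadata such as pod and service names.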
Design documentation: none yet
More info:
Open question: where to manage alert rules?