Description
We can't rely on metricsinfra to alert us if metricsinfra itself breaks. When it's mature enough to be usable by the general Cloud VPS community, we'll want to add external monitoring to it so we're aware of any issues inside it.
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | aborrero | T296411 cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet |
| Resolved | | aborrero | T297596 have cloud hardware servers in the cloud realm using a dedicated LB layer |
| Resolved | | taavi | T341060 openstack eqiad1: introduce cloud-private and cloudlb |
| Open | | tappof | T395441 Port all Icinga checks to Prometheus/Alertmanager: preparation |
| Open | | tappof | T321808 Port all Icinga checks to Prometheus/Alertmanager |
| Open | Goal | None | T328502 Move WMCS off of Icinga and introduce alertmanager |
| Open | | None | T345983 Remove Icinga checks for Cloud VPS projects (not: infrastructure) |
| Open | | None | T313030 [toolforge.infra] Replace Toolschecker alerts with Prometheus based ones |
| Open | | None | T347148 Determine how to monitor services in cloud-private / cloudlb |
| Open | | None | T288053 Add external meta-monitoring for metricsinfra |
Event Timeline
I am going to be working on this. The general plan is that there'll be a VM in metricsinfra that hosts a toolschecker-style web service to do a bunch of checks, and the cloud Prometheus instance can monitor that (via the outbound web proxy).
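For illustration only, here is a rough sketch of what a toolschecker-style check service could look like. It is not the app from the patches below. The check paths, the listen address, and the assumption that Prometheus and Alertmanager answer on their default local ports (9090 and 9093) at `/-/healthy` are all hypothetical. Each check is exposed as a URL that returns 200 on success and 503 on failure, so an external prober only needs to look at the HTTP status.

```python
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx response, malformed URL.
        return False


# Hypothetical check registry: the paths and target URLs are assumptions
# made for this sketch, not the real metricsinfra layout.
CHECKS: Dict[str, Callable[[], bool]] = {
    "/checks/prometheus": lambda: http_ok("http://localhost:9090/-/healthy"),
    "/checks/alertmanager": lambda: http_ok("http://localhost:9093/-/healthy"),
}


def run_check(
    path: str, checks: Dict[str, Callable[[], bool]] = CHECKS
) -> Tuple[int, str]:
    """Map a request path to an (HTTP status, body) pair."""
    check = checks.get(path)
    if check is None:
        return 404, "unknown check\n"
    try:
        ok = check()
    except Exception:
        # A check that crashes is reported as failing, not as a server error.
        ok = False
    return (200, "OK\n") if ok else (503, "FAIL\n")


class CheckHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = run_check(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Serve the checks until interrupted (address and port are placeholders)."""
    HTTPServer((host, port), CheckHandler).serve_forever()
```

Calling `serve()` starts the service. An external Prometheus could then probe each `/checks/...` URL and alert on anything other than a 200.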
Change 966804 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] P:wmcs::metricsinfra: add meta monitoring app skeleton
Change 966805 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] P:wmcs::metriscinfra: haproxy: add route for meta monitor service
Change 982788 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/alerts@master] team-wmcs: metricsinfra: page when alertmanager is unreachable
Change 982788 merged by jenkins-bot:
[operations/alerts@master] team-wmcs: metricsinfra: page when alertmanager is unreachable