
Add external meta-monitoring for metricsinfra
Open, Medium, Public

Description

We can't rely on metricsinfra to alert us if metricsinfra itself breaks. When it's mature enough to be usable by the general Cloud VPS community, we'll want to add external monitoring to it so we're aware of any issues inside it.

Event Timeline

taavi triaged this task as Lowest priority.
taavi added a project: cloud-services-team.

I am going to be working on this. The general plan is that there'll be a VM in metricsinfra that hosts a toolschecker-style web service to do a bunch of checks, and the cloud Prometheus instance can monitor that (via the outbound web proxy).
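
A minimal sketch of what a toolschecker-style check service along these lines could look like, using only the Python standard library. Everything in it is an assumption for illustration, not the actual implementation from the patches on this task: the check names, the URL paths, the port, and the convention of answering 200 on success and 503 on failure are all hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


# Hypothetical checks. A real check would probe a metricsinfra component;
# these two only simulate a healthy and a broken result.
def check_always_ok() -> bool:
    return True


def check_always_fails() -> bool:
    raise RuntimeError("simulated failure")


# Map of URL path to check function (paths are made up for this sketch).
CHECKS: Dict[str, Callable[[], bool]] = {
    "/example/ok": check_always_ok,
    "/example/fail": check_always_fails,
}


def run_check(
    path: str, checks: Dict[str, Callable[[], bool]] = CHECKS
) -> Tuple[int, str]:
    """Run the check registered for `path`; return (HTTP status, body)."""
    check = checks.get(path)
    if check is None:
        return 404, "unknown check\n"
    try:
        healthy = check()
    except Exception as exc:
        # A crashing check is reported as a failure, not as a server error.
        return 503, f"FAIL: {exc}\n"
    return (200, "OK\n") if healthy else (503, "FAIL\n")


class CheckHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = run_check(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Serve checks on localhost; a proxy in front would expose them."""
    HTTPServer(("127.0.0.1", port), CheckHandler).serve_forever()
```

Calling `serve()` starts the service. An external prober would then request each path and alert on any non-200 response, so one HTTP status per check is all the monitoring side needs to understand.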

Change 966804 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::metricsinfra: add meta monitoring app skeleton

https://gerrit.wikimedia.org/r/966804

Change 966805 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::metriscinfra: haproxy: add route for meta monitor service

https://gerrit.wikimedia.org/r/966805

Change 982788 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/alerts@master] team-wmcs: metricsinfra: page when alertmanager is unreachable

https://gerrit.wikimedia.org/r/982788

Change 982788 merged by jenkins-bot:

[operations/alerts@master] team-wmcs: metricsinfra: page when alertmanager is unreachable

https://gerrit.wikimedia.org/r/982788

taavi removed taavi as the assignee of this task. Jun 25 2024, 3:35 PM