
Build Prometheus service for use by all Cloud VPS projects and their instances
Open, High, Public

Description

In T250206 (Deploy a proof of concept prometheus server in cloudvps to replace shinken), we built a proof-of-concept project to design a Prometheus-based replacement for the Shinken monitoring used by a subset of projects. Later we realized that T210993 (Deprecate Diamond collectors in Cloud VPS) means we also need to migrate basic dashboards to Prometheus, as in T264920 (Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus).

We now need to redesign and build out the POC so that it scales to collecting at least basic instance health information for all instances in all projects, with data retained for a reasonable amount of time (at least 3 months, ideally 1 year or more).
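As a rough aid for sizing the retention target, the Prometheus storage documentation suggests estimating disk needs as retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies that formula; the instance count, series per instance, and scrape interval are hypothetical placeholders, not measurements from Cloud VPS.

```python
# Back-of-the-envelope Prometheus disk sizing.
# Formula (from the Prometheus storage docs):
#   disk = retention_seconds * samples_per_second * bytes_per_sample
# All input numbers used below are illustrative assumptions.

SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(instances, series_per_instance, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    """Return the estimated on-disk size in bytes for the given retention."""
    # Each series yields one sample per scrape interval.
    samples_per_second = instances * series_per_instance / scrape_interval_s
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical fleet: 1000 instances, 500 series each, 60 s scrapes.
    for days in (90, 365):
        gib = estimate_disk_bytes(1000, 500, 60, days) / 2**30
        print(f"{days} days retention: ~{gib:.0f} GiB")
```

Real usage depends on how many series the exporters actually expose and on label churn, so the result is only a starting point for capacity planning.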

Features needed beyond POC:

Related Objects

Status  Subtype  Assigned  Task
Resolved  fgiunchedi
Resolved  colewhite
Stalled  None
Open  None
Open  None
Open  None
Resolved  JHedden
Resolved  JHedden
Resolved  Bstorm
Resolved  bd808
Resolved  Andrew
Declined  None
Open  None
Resolved  nskaggs
Open  Majavah
Open  None
Resolved  Majavah
Open  Majavah
Open  None
Open  None
Open  None
Open  None
Open  None
Open  None
Open  None
Open  None
Open  Majavah
Open  None
Open  None
Resolved  Majavah
Resolved  Majavah
Open  None
Open  None
Open  Andrew
Open  None

Event Timeline

The POC project was a bit ahead of the Observability team's work on similar alerting for the production realm. This build-out should include examining the profiles and modules that have since been built for production to see how much of the POC can be replaced with shared setup.

bd808 triaged this task as High priority. Oct 20 2020, 5:23 PM

Mentioned in SAL (#wikimedia-cloud) [2021-07-14T10:35:46Z] <majavah> undeploy old ingress T266050

Wrong task, was meant to be T264221.