Build Prometheus service for use by all Cloud VPS projects and their instances
Closed, Resolved (Public)

Description

In T250206 (Deploy a proof of concept prometheus server in cloudvps to replace shinken), we built a proof of concept (POC) project to design a Prometheus-based replacement for the Shinken monitoring of a subset of projects. Later we realized that T210993 (Deprecate Diamond collectors in Cloud VPS) means that we also need to migrate basic dashboards to Prometheus, for example T264920 (Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus).

We now need to redesign and build out the POC so that it scales to collecting at least basic instance health information for all instances in all projects, and to retaining that data for a reasonable amount of time (3+ months for sure, 1+ year ideally).
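To get a feel for what the retention targets above imply for storage, here is a rough sizing sketch. It uses the rule of thumb from the Prometheus storage documentation (disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample, with about 1 to 2 bytes per sample). The instance count, series count per instance, and scrape interval are hypothetical placeholders, not measurements from Cloud VPS.

```python
SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(instances, series_per_instance, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    """Rough Prometheus disk estimate: retention * ingest rate * sample size.

    All inputs are assumptions supplied by the caller; bytes_per_sample
    defaults to the upper end of the documented 1-2 byte range.
    """
    # Each series yields one sample per scrape interval.
    samples_per_second = instances * series_per_instance / scrape_interval_s
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical fleet: 1000 instances, 500 series each, scraped every 60 s.
    for label, days in (("3 months", 90), ("1 year", 365)):
        size = estimate_disk_bytes(1000, 500, 60, days)
        print(f"{label}: {size / 1e9:.1f} GB")
```

Because the estimate is linear in every input, going from 3 months to 1 year of retention multiplies the required space by about four at the same scrape interval and series count.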

Features needed beyond POC:

Related Objects

Status      Assigned
Resolved    fgiunchedi
Resolved    colewhite
Resolved    MoritzMuehlenhoff
Open        None
Resolved    taavi
Open        dcaro
Resolved    taavi
Resolved    taavi
Open        None
Resolved    taavi
Resolved    JHedden
Resolved    JHedden
Resolved    Bstorm
Resolved    bd808
Resolved    Andrew
Declined    None
Open        None
Resolved    nskaggs
Open        None
Open        None
Resolved    taavi
Resolved    taavi
Open        None
Open        None
Resolved    taavi
Open        None
Open        None
Resolved    taavi
Resolved    taavi
Open        None
Open        None
Resolved    taavi
Resolved    jbond
Resolved    taavi
Resolved    taavi
Resolved    taavi
Open        None
Open        taavi
Open        None
Resolved    taavi
Resolved    dcaro
Resolved    Andrew

Event Timeline

The POC project was a bit ahead of the Observability folks' work on similar alerting for the production realm. This build-out should include examining the profiles and modules that have since been built for prod to see how much of the POC can be replaced with shared setup.

bd808 triaged this task as High priority. Oct 20 2020, 5:23 PM

Mentioned in SAL (#wikimedia-cloud) [2021-07-14T10:35:46Z] <majavah> undeploy old ingress T266050

Wrong task, was meant to be T264221.

taavi updated the task description.

While there are still open feature requests as subtasks, I'm closing this task as the metricsinfra Prometheus service is now monitoring all instances and can replace Diamond.