Build Prometheus service for use by all Cloud VPS projects and their instances
Closed, Resolved; Public

Description

In T250206 (Deploy a proof of concept prometheus server in cloudvps to replace shinken), we built a proof of concept project to design a Prometheus-based replacement for the Shinken monitoring of a subset of projects. Later we realized that T210993 (Deprecate Diamond collectors in Cloud VPS) means that we also need to migrate basic dashboards to Prometheus, as tracked in T264920 (Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus).

We now need to redesign and build out the POC to scale to collecting at least basic instance health information for all instances in all projects for some reasonable amount of time (3+ months for sure, 1+ year ideally).
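As a rough aid for sizing the retention targets above, here is a minimal sketch of the usual Prometheus capacity estimate (retention time × ingested samples per second × bytes per sample, with Prometheus typically needing about 1-2 bytes per sample). The instance count, series per instance, and scrape interval are hypothetical placeholders, not measurements from Cloud VPS, and `estimate_disk_bytes` is an illustrative helper name.

```python
# Rough Prometheus disk-size estimate for a given retention period.
# All input numbers below are hypothetical placeholders, not
# measurements from Cloud VPS.

SECONDS_PER_DAY = 24 * 60 * 60


def estimate_disk_bytes(instances, series_per_instance, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    """Estimate on-disk bytes as retention * samples/s * bytes/sample."""
    # Each series yields one sample per scrape interval.
    samples_per_second = instances * series_per_instance / scrape_interval_s
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Compare the "3+ months" floor with the "1+ year" ideal.
    for days in (90, 365):
        size = estimate_disk_bytes(
            instances=1000,           # hypothetical instance count
            series_per_instance=500,  # hypothetical series per instance
            scrape_interval_s=60,     # hypothetical scrape interval
            retention_days=days,
        )
        print(f"{days} days: {size / 1e9:.1f} GB")
```

The estimate scales linearly in every input, so moving from 3 months to 1 year of retention roughly quadruples the disk needed unless the number of series or the scrape frequency is reduced.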

Features needed beyond POC:

Related Objects

Status Subtype Assigned Task
Resolved fgiunchedi
Resolved colewhite
Resolved MoritzMuehlenhoff
Resolved taavi
Resolved taavi
Open dcaro
Resolved taavi
Resolved taavi
Open None
Resolved taavi
Resolved JHedden
Resolved JHedden
Resolved Bstorm
Resolved bd808
Resolved Andrew
Declined None
Resolved nskaggs
Resolved taavi
Resolved jbond
Resolved taavi
Resolved taavi
Resolved taavi
Resolved taavi
Resolved dcaro
Resolved Andrew

Event Timeline

The POC project was a bit ahead of work by the Observability folks on similar alerting for the production realm. This build out should include examining the profiles and modules that have been built for prod now to see how much of the POC can be replaced with shared setup.

bd808 triaged this task as High priority. Oct 20 2020, 5:23 PM

Mentioned in SAL (#wikimedia-cloud) [2021-07-14T10:35:46Z] <majavah> undeploy old ingress T266050

Wrong task, was meant to be T264221.

taavi updated the task description.

While there are still open feature requests as subtasks, I'm closing this task as the metricsinfra Prometheus service is now monitoring all instances and can replace Diamond.

aborrero added a project: Epic.
aborrero subscribed.

I'm boldly reopening this to keep this "epic" task as the entry point for all the other subtasks.

No, please create new tasks to track future work. This task was for the initial buildout that is now complete.