Replacement needed for obsolete Diamond/Graphite monitoring of integration instances
Closed, Resolved (Public)

Description

The WMCS instances for the integration project do not have any metrics collected. As a result, we no longer have any monitoring available for the CI nodes (CPU, disk I/O, memory consumption).

Browsing Grafana Labs shows only 5 instances:

integration-agent-pkgbuilder-1001 (Buster)
integration-agent-pkgbuilder-1002 (Buster)
integration-castor03 (Stretch)
integration-cumin (Buster)
integration-puppetmaster-02 (Buster)

Essentially the metrics have vanished from the newish (T252071) Bullseye instances. The reason is https://gerrit.wikimedia.org/r/c/operations/puppet/+/691958/ :

Diamond isn't packaged for Bullseye and is deprecated in prod
so we're not likely to ever get it.

We need to restore some monitoring for the integration project.

Event Timeline

taavi subscribed.

The planned Diamond replacement is Prometheus, which is already tracked as T266050. While work on that hasn't been progressing a lot lately, integration is one of the few scraped projects of the current setup, which means that it should be available on the "metricsinfra prometheus" datasource available on grafana-cloud.wikimedia.org (aka grafana-labs).

I have been using the Cloud VPS Project Board dashboard, but it retrieves its metrics from Graphite, which is how I found out that the integration instances have vanished.

I have confirmed that one of the instances has the Prometheus exporter running, and found several other dashboards that seem to use Prometheus for data, but they don't seem to be functional for the integration project :-\ For example, https://grafana-cloud.wikimedia.org/d/000000590/instance-details has a bunch of fetch failures, and if I manually fill in a different project name nothing happens.

I am guessing the cloud-services-team is managing those boards.

bd808 renamed this task from integration instances have lost Diamond/Graphite monitoring to Replacement needed for obsolete Diamond/Graphite monitoring of integration instances. May 11 2022, 9:13 PM

That affects all Debian Bullseye instances, regardless of the WMCS project. We have the same issue with the gitlab-runners instances, whose disks are filling up, but no monitoring is available since they are Bullseye-based.

I have poked Cloud-Services IRC channel:

14:13:24 <hashar> hello, I am wondering how to get metrics collection enabled again for Bullseye + instances
14:13:27 <hashar>    For CI and Gitlab we have a fleet of instances which some time have CPU/Disk/IO/memory issues, they got collected with `diamond` and shown at  https://grafana-labs.wikimedia.org/d/000000059/cloud-vps-project-board  
14:13:58 <hashar> but `diamond` has been removed from Debian Bullseye and as such instances based on that version no more have any monitoring, which is a bit of a pain :D
14:14:26 <hashar> so my question is: could we package diamond for Bullseye and enable it again, or what is the alternative and why isn't it enabled by default for Bullseye instances
14:15:02 <hashar> my specific task is https://phabricator.wikimedia.org/T307655 "Replacement needed for obsolete Diamond/Graphite monitoring of integration instances" , but that affects any Bullseye instance
hashar changed the task status from Open to Stalled. Jun 16 2022, 2:35 PM

@Majavah indicated that integration already has a Prometheus scraper (https://prometheus.wmcloud.org/cloud/targets) and that he has added gitlab-runners as well.
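
While the dashboards are broken, scrape health could in principle be checked directly against the Prometheus HTTP API. The following is only a rough sketch: it assumes the API is served under the same /cloud prefix as the targets page and that targets carry a `project` label, neither of which is confirmed in this task. The helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the HTTP API lives under the same /cloud prefix as the targets page.
PROMETHEUS_BASE = "https://prometheus.wmcloud.org/cloud"


def build_query_url(promql, base=PROMETHEUS_BASE):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def down_instances(payload):
    """List instances whose `up` sample is 0 in an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return sorted(
        result["metric"].get("instance", "<unknown>")
        for result in payload["data"]["result"]
        if float(result["value"][1]) == 0.0
    )


def fetch_down_instances(project):
    """Query the live server (needs network access) for failing scrape targets."""
    # Assumption: scraped targets are labelled with `project`.
    url = build_query_url(f'up{{project="{project}"}}')
    with urlopen(url, timeout=10) as response:
        return down_instances(json.load(response))
```

Calling `fetch_down_instances("integration")` would then list the targets Prometheus currently fails to scrape, provided the label and URL assumptions hold.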

For the UI

https://grafana-cloud.wikimedia.org/d/000000590/instance-details?orgId=1 should have everything, but I suspect a recent Grafana upgrade has broken compatibility with the older Prometheus version we have. I filed T310799 to fix it.

You may be able to use https://grafana-cloud.wikimedia.org/d/8Npp-46Zz/project-overview?orgId=1&var-project=integration in the meantime, or https://grafana-cloud.wikimedia.org/d/8Npp-46Zz/project-overview?orgId=1&var-project=gitlab-runners

I am marking the task Stalled pending verification once T310799 has been completed.

hashar claimed this task.

Essentially that is solved. T310799 / T307465 will address the lack of project/host list on https://grafana-cloud.wikimedia.org/d/000000590/instance-details?orgId=1

Change 806233 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/docroot@master] integration: link to Prometheus dashboard for agents

https://gerrit.wikimedia.org/r/806233

Change 806233 merged by jenkins-bot:

[integration/docroot@master] integration: link to Prometheus dashboard for agents

https://gerrit.wikimedia.org/r/806233