Change Details

Create a new OpenStack project and Prometheus server to scrape metrics from the existing node-exporter running on virtual machines in the tools and cloudinfra projects. Once these projects are configured, WMCS will evaluate adding more CloudVPS projects to this configuration.

Initial steps to deploy the new monitoring stack:

[x] Create OpenStack project "metricsinfra" with wmcs-team as admins (T250210)
[x] Create a new virtual machine "prometheus01.metricsinfra.eqiad.wikimedia.cloud" (T250206#6056467)
[x] Configure Prometheus to discover scrape targets using the OpenStack service discovery (SD) configuration (https://gerrit.wikimedia.org/r/#/c/588803/)
[x] Update the existing tools and cloudinfra security groups to allow Prometheus to connect to the node-exporter running on TCP port 9100
[x] Configure a proxy to allow Grafana access to the Prometheus API: https://prometheus.wmflabs.org/cloud
[x] Add the metricsinfra Prometheus datasource to Grafana-labs
[x] Configure the Prometheus alert manager to monitor Puppet status (https://gerrit.wikimedia.org/r/c/operations/puppet/+/589716)
[x] Configure the alert manager rules to monitor host up/down state (https://gerrit.wikimedia.org/r/c/operations/puppet/+/589716)
[ ] Configure the alert manager rules to monitor disk capacity
[x] Configure the alert manager to notify wmcs-team by email and on IRC in #wikimedia-cloud-feed (https://gerrit.wikimedia.org/r/c/operations/puppet/+/591202)
[ ] Set up an IRC bot to use as an alert manager webhook, sending notifications to #wikimedia-cloud-feed

Once we have an idea of data retention and usage:

[ ] Update OpenStack service discovery to monitor either all projects or a specific list of projects, updating existing security groups as appropriate
[ ] Configure the new-project template with updated security group rules (if we decide to monitor all projects above)
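
The host up/down check listed above can also be spot-checked by hand through the Grafana proxy. The following is a minimal sketch, not part of the deployed configuration. It assumes the proxy exposes the standard Prometheus HTTP API (`/api/v1/query`) under the https://prometheus.wmflabs.org/cloud prefix; the function names are hypothetical and chosen only for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the proxy serves the standard Prometheus HTTP API under this prefix.
PROMETHEUS_BASE = "https://prometheus.wmflabs.org/cloud"


def build_query_url(expr, base=PROMETHEUS_BASE):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def down_instances(payload):
    """Return the sorted instance labels whose `up` sample is 0.

    `payload` is the decoded JSON body of an instant query for `up`.
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for sample in payload["data"]["result"]:
        # Each instant-vector sample carries a [timestamp, "value"] pair.
        _timestamp, value = sample["value"]
        if float(value) == 0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return sorted(down)


def fetch_down_instances(base=PROMETHEUS_BASE, timeout=10):
    """Query the live server; requires network access to the proxy."""
    with urlopen(build_query_url("up", base), timeout=timeout) as resp:
        return down_instances(json.load(resp))
```

Calling `fetch_down_instances()` from a host that can reach the proxy would list the scrape targets that Prometheus currently reports as down. The parsing is kept separate from the network call so that it can be exercised against a saved response.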