Exclude Magnum instances from metricsinfra monitoring
Closed, Resolved · Public

Description

The metricsinfra VM discovery logic needs to be updated, as Magnum breaks the assumption that all VMs run on a puppetized base image with node-exporter running.

Currently we use Prometheus's built-in OpenStack service discovery, with one job per project. The list of projects is managed by prometheus-manager, which is configured to exclude the trove project, our one existing case of non-Puppetized instances.
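For illustration, the job-per-project layout described above could be sketched roughly like this. This is not the actual prometheus-manager code: the function name, the job naming scheme, the region value, and the refresh interval are all assumptions, and the authentication fields that openstack_sd_configs needs are left out.

```python
import json

# Projects known to contain non-Puppetized instances (trove is the case named in this task).
EXCLUDED_PROJECTS = frozenset({"trove"})


def build_scrape_configs(projects, excluded=EXCLUDED_PROJECTS):
    """Return one OpenStack service discovery job per non-excluded project."""
    jobs = []
    for project in sorted(projects):
        if project in excluded:
            continue
        jobs.append({
            "job_name": f"{project}_node",  # hypothetical naming scheme
            "openstack_sd_configs": [{
                "role": "instance",
                "region": "example-region",  # assumption, not the real region
                "project_name": project,
                "port": 9100,  # node-exporter's default port
                "refresh_interval": "5m",  # assumption
                # identity_endpoint and credentials omitted in this sketch
            }],
        })
    return jobs


if __name__ == "__main__":
    # Hypothetical project list for demonstration only.
    print(json.dumps(build_scrape_configs(["tools", "trove", "metricsinfra"]), indent=2))
```

A per-project exclusion like this is too coarse for Magnum, because Magnum instances live inside ordinary projects alongside Puppetized VMs.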

To fix the issues, we would need to either

  1. figure out how to configure Prometheus's openstack_sd_config to exclude Magnum-managed instances, or
  2. figure out how to detect them in Python, and move the instance discovery to prometheus-manager (while still keeping the frequent update rate)

Event Timeline

And just to make it clear: Prometheus relabeling can be used to filter specific targets. So if it's possible to detect Magnum managed instances using any of the "meta labels" listed on the openstack_sd_config documentation, it's quite simple to add a filter to ignore those instances. That would be my preferred option.
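As a rough sketch of what such a filter might look like, the snippet below builds a relabel rule with the "drop" action and approximates how Prometheus would apply it. The meta label name and value used in the example are hypothetical: nothing in this task confirms that Magnum instances carry such a label. The `survives` helper is only an approximation of Prometheus's behaviour (fully anchored regex match, missing labels treated as empty strings) for a single source label.

```python
import re


def drop_rule(meta_label, pattern):
    """Build a relabel_configs entry that drops targets whose label matches."""
    return {"source_labels": [meta_label], "regex": pattern, "action": "drop"}


def survives(target_labels, rule):
    """Approximate Prometheus semantics for a single-label drop rule.

    Prometheus anchors the regex at both ends and treats a missing
    label as an empty string.
    """
    value = target_labels.get(rule["source_labels"][0], "")
    return re.fullmatch(rule["regex"], value) is None


if __name__ == "__main__":
    # Hypothetical label and value; no such Magnum marker is confirmed here.
    rule = drop_rule("__meta_openstack_tag_managed_by", "magnum")
    print(survives({"__meta_openstack_tag_managed_by": "magnum"}, rule))
    print(survives({"__meta_openstack_instance_name": "tools-bastion"}, rule))
```

The returned dictionary would go under `relabel_configs` in the relevant scrape job.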

Would the other way around not be more appropriate? That is, Prometheus would only discover instances with a particular tag, which allows monitoring on instances that we have root on and ignores anything else that gets stirred in. That way we only modify one deployment strategy, rather than Magnum, Trove, and all future images that don't fit the assumption.

Do you mean an opt-in tag to enable monitoring for an instance, or do you have a way to automatically tag all instances running our images? I think it is valuable that metricsinfra automatically monitors all Cloud VPS instances without any action needed from project members, and I would prefer to preserve that.

There does not seem to be a label to filter on the base image of the instance. I wonder if upstream would take a patch adding one. I think we could add metadata to the Glance images to mark which images are puppetized, and then use that to filter instances.

... or do you have a way to automatically tag all instances running our images?

I don't have a way, but that is what I am suggesting. Equally, I don't have a way to make all Magnum instances carry a label. I looked for one but came up blank. Additionally, it would likely be set at the template level, meaning that anyone who made their own template would have to realize that they need to add a particular tag. I suspect most wouldn't notice that, thus increasing the noise in Alertmanager. The same problem would come up if we introduced other images, such as allowing people to run their own image, or possibly another OpenStack project (so far we have Trove and Magnum). This feels like an n+1 problem, where we have to find ways to label new things as they come in. Whereas if we identify how to label the images that we control and want automatic monitoring on, that should be the entirety of the work.

Some progress here: I sent a PR to Prometheus which makes it possible to filter instances based on the image they're running. It was merged but hasn't made it into a release yet; I think we should be able to patch the debs we use to deploy Prometheus to include that specific patch.

After that, we need to add a tag to the puppetized glance images and update the metricsinfra config logic to pull the list of images with that tag and filter instances based on that.
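A minimal sketch of the filtering step, assuming the list of tagged image IDs has already been fetched from Glance. It turns the IDs into a relabel rule with the "keep" action. The label name `__meta_openstack_instance_image` is my understanding of what the upstream patch exposes and should be checked against the documentation for the Prometheus version in use; the function name is hypothetical and this is not the code from the linked changes.

```python
import uuid


def image_keep_rule(image_ids):
    """Build a relabel_configs entry keeping only instances on the given images."""
    if not image_ids:
        # An empty regex would keep only targets with no image label at all,
        # which is almost certainly not what is wanted.
        raise ValueError("at least one image ID is required")
    # Glance image IDs are UUIDs; normalising through uuid.UUID rejects anything
    # that could smuggle regex metacharacters into the pattern.
    canonical = sorted({str(uuid.UUID(image_id)) for image_id in image_ids})
    return {
        "source_labels": ["__meta_openstack_instance_image"],  # assumed label name
        "regex": "|".join(canonical),  # Prometheus anchors the regex at both ends
        "action": "keep",
    }


if __name__ == "__main__":
    # Made-up image IDs for demonstration only.
    print(image_keep_rule([
        "11111111-1111-1111-1111-111111111111",
        "00000000-0000-0000-0000-000000000000",
    ]))
```

Using "keep" rather than "drop" means new kinds of unmanaged instances are excluded by default, which matches the opt-in-by-image approach discussed earlier in the thread.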

Change 936373 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack: wmcs-image-create: tag created images

https://gerrit.wikimedia.org/r/936373

Running this shell one-liner on both deployments to add tags:

$ sudo wmcs-openstack image list -f json | jq -r '.[]|select(.Name|contains("debian"))|.ID' | xargs -L1 sudo wmcs-openstack image set --tag wmcs-puppetized
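For anyone less familiar with jq, the selection step of that one-liner is roughly equivalent to the Python below. It only covers the filtering (case-sensitive substring match on the image name), not the tagging itself. The sample JSON is made up and assumes the "ID" and "Name" keys that the jq expression reads.

```python
import json


def debian_image_ids(image_list_json):
    """Return IDs of images whose name contains 'debian' (case-sensitive)."""
    images = json.loads(image_list_json)
    return [image["ID"] for image in images if "debian" in image["Name"]]


if __name__ == "__main__":
    # Made-up sample in the shape the jq filter expects.
    sample = json.dumps([
        {"ID": "aaa", "Name": "debian-12.0-bookworm"},
        {"ID": "bbb", "Name": "magnum-fedora-coreos"},
    ])
    print(debian_image_ids(sample))
```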

Change 936374 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/metricsinfra/prometheus-manager@master] Import puppetized image IDs from Glance

https://gerrit.wikimedia.org/r/936374

Change 936374 merged by jenkins-bot:

[cloud/metricsinfra/prometheus-manager@master] Import puppetized image IDs from Glance

https://gerrit.wikimedia.org/r/936374

Change 936376 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs: metricsinfra: fix command usage

https://gerrit.wikimedia.org/r/936376

Change 936377 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/metricsinfra/prometheus-configurator@master] prometheus: add image id filter

https://gerrit.wikimedia.org/r/936377

Change 936377 merged by jenkins-bot:

[cloud/metricsinfra/prometheus-configurator@master] prometheus: add image id filter

https://gerrit.wikimedia.org/r/936377

Change 936373 merged by David Caro:

[operations/puppet@production] openstack: wmcs-image-create: tag created images

https://gerrit.wikimedia.org/r/936373

Change 936376 merged by David Caro:

[operations/puppet@production] P:wmcs: metricsinfra: fix command usage

https://gerrit.wikimedia.org/r/936376

taavi claimed this task.