
Add the conftool pooled/depooled status and weight into prometheus for each service
Closed, Declined (Public)

Description

We have conftool-data in puppet, which controls which hosts serve as backends for each service.

However, whether a host is pooled, depooled, or inactive is not stored in puppet, but in Etcd.
Similarly, the weight of a particular host in support of a service is recorded in Etcd and not in puppet.

It would be useful to have these service status values from Etcd reflected in Prometheus, so that we could, for example, exclude hosts that have been depooled from alertmanager rules.

We have a recommended mechanism for achieving this, which is the mini_textfile_exporter.

By using a confd file template with the service catalog, we will be able to write a file to each prometheus server that lists each service, along with its member hosts and their status values. This file will be updated in real time as a result of confctl commands and will be read by the prometheus mini-textfile-exporter.
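
To illustrate the kind of file such a template might produce, here is a minimal Python sketch that renders pooled status and weight in the Prometheus text exposition format. It is not the confd template from the change under review. The metric names (conftool_host_pooled, conftool_host_weight), the label names, the input layout, and the pooled values yes/no/inactive are all assumptions made for the example.

```python
import os
import tempfile

# Hypothetical metric names; the ticket does not specify any.
POOLED_METRIC = "conftool_host_pooled"
WEIGHT_METRIC = "conftool_host_weight"
# Assumed set of values for the 'pooled' field.
STATES = ("yes", "no", "inactive")


def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(pool_state):
    """Render {service: {host: {"pooled": str, "weight": int}}} as metrics text."""
    lines = [
        f"# HELP {POOLED_METRIC} 1 if the host is in the labelled state, else 0.",
        f"# TYPE {POOLED_METRIC} gauge",
    ]
    # All samples of one metric are emitted together, one series per state.
    for service in sorted(pool_state):
        for host in sorted(pool_state[service]):
            svc, hst = escape_label(service), escape_label(host)
            pooled = pool_state[service][host]["pooled"]
            for state in STATES:
                value = 1 if pooled == state else 0
                lines.append(
                    f'{POOLED_METRIC}{{service="{svc}",host="{hst}",state="{state}"}} {value}'
                )
    lines.append(f"# HELP {WEIGHT_METRIC} Weight of the host within the service.")
    lines.append(f"# TYPE {WEIGHT_METRIC} gauge")
    for service in sorted(pool_state):
        for host in sorted(pool_state[service]):
            svc, hst = escape_label(service), escape_label(host)
            weight = pool_state[service][host]["weight"]
            lines.append(f'{WEIGHT_METRIC}{{service="{svc}",host="{hst}"}} {weight}')
    return "\n".join(lines) + "\n"


def write_atomically(path, content):
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.replace(tmp_path, path)
```

With one series per state, an alert expression could in principle match only hosts whose "yes" series equals 1, which is the exclusion of depooled hosts described above. How that join would be written depends on the labels the real template emits.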

Event Timeline

I now have a draft CR for this, thanks to @jbond for his help.

However, whilst working on this, John identified a potential performance improvement to the wmflib::service::fetch function.
This is in https://gerrit.wikimedia.org/r/c/operations/puppet/+/799342 and is awaiting approval. @Joe has been added as a reviewer.

I'll wait and push my latest version once this dependent change has been approved and merged.

I've pushed what I believe will be a working confd template for this, but I'm unsure what to do about the rspec tests that I was using to develop locally.
https://gerrit.wikimedia.org/r/c/operations/puppet/+/776225

There is a slight hiccup with this change, as pointed out by @fgiunchedi on https://gerrit.wikimedia.org/r/c/operations/puppet/+/776225

I can say for sure though that 'confd' package isn't in Bullseye so this change will fail

I'm seeking advice as to the best way to proceed now. Presumably we'll need to have confd on bullseye at some point so I have offered to help with that, but I'll wait to see what others say.

Change 776225 had a related patch set uploaded (by Jbond; author: Btullis):

[operations/puppet@production] Add a host's confctl pooled status and weight per service to prometheus

https://gerrit.wikimedia.org/r/776225

@Joe has suggested a better alert trigger than the one I had previously used, so we no longer need this confctl/prometheus integration. I'll abandon the change and decline this ticket.

Change 776225 abandoned by Btullis:

[operations/puppet@production] Add a host's conftool pooled status and weight per service to prometheus

Reason:

No longer necessary

https://gerrit.wikimedia.org/r/776225