Backport prometheus-elasticsearch-exporter version 1.1.0 to buster-wikimedia
Closed, Declined · Public

Description

The prometheus-elasticsearch-exporter package on bullseye hosts is incompatible with the command-line options in the systemd unit file that we ship with puppet.

The result is that the prometheus-elasticsearch-exporter service fails to start on these nodes and puppet runs repeatedly fail while trying to start the service.

btullis@datahubsearch1001:~$ /usr/bin/prometheus-elasticsearch-exporter -es.uri=http://localhost:9200 -web.listen-address=:9108
prometheus-elasticsearch-exporter: error: unknown short flag '-e', try --help

Debian bullseye includes this package at version 1.1.0+ds-2

For buster and stretch we host version 1.0.4+ds-1 ourselves.

btullis@apt1001:~$ sudo -i reprepro ls prometheus-elasticsearch-exporter
prometheus-elasticsearch-exporter | 1.0.4+ds-1 | stretch-wikimedia | amd64, source
prometheus-elasticsearch-exporter | 1.0.2+ds-1 | stretch-wikimedia | amd64
prometheus-elasticsearch-exporter | 1.0.4+ds-1 |  buster-wikimedia | amd64, source

I think that the best way to solve this issue is to backport version 1.1.0+ds-2 (or similar) to buster and deploy the upgraded package to all hosts where it is currently running.

We will need to coordinate this with a change to this file:
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/prometheus/templates/initscripts/prometheus-elasticsearch-exporter.systemd.erb$8

...since the new version requires long-format (double-dash) options, e.g.

btullis@datahubsearch1001:~$ /usr/bin/prometheus-elasticsearch-exporter --help
usage: prometheus-elasticsearch-exporter [<flags>]

Flags:
  -h, --help                 Show context-sensitive help (also try --help-long and --help-man).
      --web.listen-address=":9114"
                             Address to listen on for web interface and telemetry.
      --web.telemetry-path="/metrics"
                             Path under which to expose metrics.
      --es.uri="http://localhost:9200"
                             HTTP API address of an Elasticsearch node.
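
As an illustration of the flag change (not part of the original task), the following Ruby sketch rewrites the legacy single-dash options from the failing command into the long format shown in the help output. The helper name `to_long_format` is hypothetical, and the listen address is kept at :9108 as in the failing command above.

```ruby
# Hypothetical helper, for illustration only: rewrite legacy single-dash
# long options (e.g. "-es.uri=...") into the double-dash form. Options that
# already start with "--" and one-letter short flags like "-h" are left alone.
def to_long_format(args)
  args.map do |arg|
    # One dash followed by a multi-character name is a legacy long option.
    arg.match?(/\A-[^-=][^=]+/) ? "-#{arg}" : arg
  end
end

legacy = ['-es.uri=http://localhost:9200', '-web.listen-address=:9108']
puts to_long_format(legacy).join(' ')
```

The rewritten arguments are the ones the unit file would need to pass on bullseye; the older packages on buster and stretch are the ones currently started with the single-dash form.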

Event Timeline

BTullis added a subscriber: herron.

I'm happy to do this work myself, but given that it touches so many other servers, I'd like to make sure that we get agreement and oversight first.

Other options to address the current issue include:

  • Downgrading the datahubsearch servers to buster and reimaging
  • Adding a conditional within the template, based on operatingsystemmajversion, that adds the extra dashes on bullseye

I'd prefer the backport to either of these, but other people may have stronger feelings.

We will also need to check the changelog for any other unintended consequences of upgrading to version 1.1.0.

My 2 cents: can't we just add a conditional in puppet so that the configuration is generated correctly on all hosts based on the OS version?

Yes we could, I just thought it would be preferable to converge on one version of the exporter, rather than maintain two different versions.
There are some changes to the metrics mentioned in the changelog, so this might cause an issue if we have different versions in use across a single cluster.

However, perhaps a conditional would fix the short-term issue and we could come back to the backport at another time.

Tagging @EBernhardson and @RKemper for further review. I know Erik recently added more metrics to the prometheus exporter, and I'd like him to check whether this issue is relevant.

Thanks for the task @BTullis!

I'm happy to do this work myself, but given that it touches so many other servers, I'd like to make sure that we get agreement and oversight first.

Other options to address the current issue include:

  • Downgrading the datahubsearch servers to buster and reimaging
  • Adding a conditional within the template, based on operatingsystemmajorversion, that adds the extra dashes on bullseye

I'd prefer the backport to either of these, but other people may have stronger feelings.

FWIW I think the second option will be a quicker path to address the issue on datahubsearch, but I do agree with you overall and am happy to help support either approach (or both).

Change 770005 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the prometheus elasticsearch exporter on bullseye

https://gerrit.wikimedia.org/r/770005

OK, I've added a conditional to the systemd unit template:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/770005

If we're happy to go ahead with this, it will fix the issue with datahubsearch and we can remove the parent ticket (T302818) from this one.
Whether to go forward with a backport can then be decided at another time.
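
For a concrete picture of this kind of conditional, here is a minimal sketch using Ruby's ERB, the templating language behind the .erb unit file. It is not the content of change 770005: the template text, the `os_major` variable, and the `render_unit` helper are all assumptions made for illustration. It also assumes that the OS major version is 10 on buster and 11 on bullseye.

```ruby
require 'erb'

# Hypothetical stand-in for the systemd unit template. The real template and
# the variable names used in change 770005 may differ.
UNIT_TEMPLATE = <<~'TEMPLATE'
  <%- prefix = os_major >= 11 ? '--' : '-' -%>
  [Service]
  ExecStart=/usr/bin/prometheus-elasticsearch-exporter <%= prefix %>es.uri=<%= es_uri %> <%= prefix %>web.listen-address=<%= listen_address %>
TEMPLATE

# Render the unit for a given Debian major version (assumed: 10 = buster,
# 11 = bullseye). The defaults mirror the command line quoted in the task.
def render_unit(os_major:, es_uri: 'http://localhost:9200', listen_address: ':9108')
  ERB.new(UNIT_TEMPLATE, trim_mode: '-').result_with_hash(
    os_major: os_major, es_uri: es_uri, listen_address: listen_address
  )
end

puts render_unit(os_major: 10) # single-dash options for the 1.0.x package
puts render_unit(os_major: 11) # double-dash options for the 1.1.0 package
```

Keeping the prefix in one variable means the two option styles cannot drift apart, and the conditional can be dropped once every host runs the newer package.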

Change 770005 merged by Btullis:

[operations/puppet@production] Fix the prometheus elasticsearch exporter on bullseye

https://gerrit.wikimedia.org/r/770005

I have fixed the immediate issue with the datahub servers, so I'll remove the parent task and the Data Catalog tag.
I'll leave the ticket open though, in case anyone else thinks that the backport is worth it. If not, feel free to decline and close.

colewhite edited projects, added Observability-Logging; removed observability.

From the discussion this morning, we would prefer to upgrade the exporter when upgrading to Bullseye unless there is some other issue that would necessitate a backport.

Thanks for adding puppet support!