Backport prometheus-elasticsearch-exporter version 1.1.0 to buster-wikimedia
Closed, Declined (Public)

Assigned To

None

Authored By

	BTullis
	Mar 11 2022, 12:08 PM

Description

We have a situation where prometheus-elasticsearch-exporter running on bullseye hosts is incompatible with the systemd unit file configuration that we ship with puppet.

The result is that the prometheus-elasticsearch-exporter service fails to start on these nodes and puppet runs repeatedly fail while trying to start the service.

btullis@datahubsearch1001:~$ /usr/bin/prometheus-elasticsearch-exporter -es.uri=http://localhost:9200 -web.listen-address=:9108
prometheus-elasticsearch-exporter: error: unknown short flag '-e', try --help

Debian bullseye includes this package at version 1.1.0+ds-2

For buster and stretch we host version 1.0.4+ds-1 ourselves.

btullis@apt1001:~$ sudo -i reprepro ls prometheus-elasticsearch-exporter
prometheus-elasticsearch-exporter | 1.0.4+ds-1 | stretch-wikimedia | amd64, source
prometheus-elasticsearch-exporter | 1.0.2+ds-1 | stretch-wikimedia | amd64
prometheus-elasticsearch-exporter | 1.0.4+ds-1 |  buster-wikimedia | amd64, source

I think that the best way to solve this issue is to backport version 1.1.0+ds-2 (or similar) to buster and deploy the upgraded package to all hosts where it is currently running.

We will need to coordinate this with a change to this file:
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/prometheus/templates/initscripts/prometheus-elasticsearch-exporter.systemd.erb$8

...since the new version requires long-format options.
e.g.

btullis@datahubsearch1001:~$ /usr/bin/prometheus-elasticsearch-exporter --help
usage: prometheus-elasticsearch-exporter [<flags>]

Flags:
  -h, --help                 Show context-sensitive help (also try --help-long and --help-man).
      --web.listen-address=":9114"
                             Address to listen on for web interface and telemetry.
      --web.telemetry-path="/metrics"
                             Path under which to expose metrics.
      --es.uri="http://localhost:9200"
                             HTTP API address of an Elasticsearch node.
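
Going by the flags listed in the help output above, the failing command from the top of this description would presumably need double dashes under 1.1.0. A minimal sketch that only builds and prints that command line (it does not start the exporter); the function name `exporter_cmd` is made up for illustration, and the URI and port are the values from the failing command:

```shell
#!/bin/sh
# Build the exporter command line with long-format (double-dash) flags,
# using only flags that appear in the --help output of version 1.1.0.
# ES_URI and LISTEN_ADDRESS are taken from the failing command above.
ES_URI="http://localhost:9200"
LISTEN_ADDRESS=":9108"

# Hypothetical helper: print the command line instead of executing it.
exporter_cmd() {
    printf '%s --es.uri=%s --web.listen-address=%s\n' \
        /usr/bin/prometheus-elasticsearch-exporter "$ES_URI" "$LISTEN_ADDRESS"
}

exporter_cmd
```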

Details

	Subject	Repo	Branch	Lines +/-
	Fix the prometheus elasticsearch exporter on bullseye	operations/puppet	production	+4 -0

Related Objects

Mentioned Here: T302818: Complete monitoring setup of datahubsearch nodes

Event Timeline

BTullis created this task. Mar 11 2022, 12:08 PM

Restricted Application added a subscriber: Aklapper. Mar 11 2022, 12:08 PM

BTullis added a parent task: T302818: Complete monitoring setup of datahubsearch nodes. Mar 11 2022, 12:08 PM

I'm happy to do this work myself, but given that it touches so many other servers, I'd like to make sure that we get agreement and oversight first.

Other options to address the current issue include:

Downgrading the datahubsearch servers to buster and reimaging
Adding a conditional within the template based on operatingsystemmajversion - which includes extra dashes

I'd prefer to backport than choose either of these, but other people may have stronger feelings.

We will also need to check the Changelog to see if there will be any other unintended consequences of an upgrade to version 1.1.0.
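
To illustrate the conditional option above: the real change would live in the Puppet ERB template, but the logic can be sketched in plain shell. This assumes, based on this task, that buster (10) and older run exporter 1.0.x with single-dash flags and that bullseye (11) and newer run 1.1.0 with double-dash flags. The helper names `flag_prefix` and `exporter_cmd` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: choose the flag prefix from the Debian major version.
# Assumption: versions below 11 ship exporter 1.0.x (single dash),
# version 11 and above ship 1.1.0 (double dash).
flag_prefix() {
    if [ "$1" -ge 11 ]; then
        printf '%s' '--'
    else
        printf '%s' '-'
    fi
}

# Hypothetical helper: print the command line for a given major version.
# Nothing is executed; the URI and port are the values used in this task.
exporter_cmd() {
    prefix=$(flag_prefix "$1")
    printf '%s %ses.uri=%s %sweb.listen-address=%s\n' \
        /usr/bin/prometheus-elasticsearch-exporter \
        "$prefix" "http://localhost:9200" "$prefix" ":9108"
}

exporter_cmd 10   # buster: single-dash flags
exporter_cmd 11   # bullseye: double-dash flags
```

Keying on the OS version keeps both package versions working from one template, at the cost of maintaining two flag styles until every host is on bullseye.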

My 2 cents: can't we just add a conditional in puppet so that the configuration is generated correctly on all hosts based on the OS version?

In T303599#7769833, @Volans wrote:

My 2 cents: can't we just add a conditional in puppet so that the configuration is generated correctly on all hosts based on the OS version?

Yes, we could. I just thought it would be preferable to converge on one version of the exporter, rather than maintain two different versions.
There are some changes to the metrics mentioned in the changelog, so this might cause an issue if we have different versions in use across a single cluster.

However, perhaps a conditional would fix the short-term issue and we could come back to the backport at another time.

Tagging @EBernhardson and @RKemper for further review. I know Erik added more metrics to prometheus exporter recently and I wanted him to check if this issue might be relevant.

Thanks for the task @BTullis!

In T303599#7769718, @BTullis wrote:

I'm happy to do this work myself, but given that it touches so many other servers, I'd like to make sure that we get agreement and oversight first.

Other options to address the current issue include:

Downgrading the datahubsearch servers to buster and reimaging

Adding a conditional within the template based on operatingsystemmajorversion - which includes extra dashes

I'd prefer to backport than choose either of these, but other people may have stronger feelings.

FWIW I think the second option will be a quicker path to address the issue on datahubsearch, but I do agree with you overall and am happy to help support either approach (or both).

Change 770005 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the prometheus elasticsearch exporter on bullseye

https://gerrit.wikimedia.org/r/770005

gerritbot added a project: Patch-For-Review. Mar 11 2022, 4:58 PM

OK, I've added a conditional to the systemd unit template:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/770005

If we're happy to go ahead with this it will fix the issue with datahubsearch and we can remove the parent ticket (T302818) from this one.
We can then decide at another time whether to go forward with a backport.

colewhite subscribed. Mar 11 2022, 10:22 PM

Change 770005 merged by Btullis:

[operations/puppet@production] Fix the prometheus elasticsearch exporter on bullseye

https://gerrit.wikimedia.org/r/770005

Maintenance_bot removed a project: Patch-For-Review. Mar 14 2022, 10:10 AM

BTullis removed a parent task: T302818: Complete monitoring setup of datahubsearch nodes. Mar 14 2022, 10:11 AM

I have fixed the immediate issue with the datahub servers, so I'll remove the parent task and the Data Catalog tag.
I'll leave the ticket open though, in case anyone else thinks that the backport is worth it. If not, feel free to decline and close.

herron awarded a token. Mar 16 2022, 2:03 PM

From the discussion this morning, we would prefer to upgrade the exporter when upgrading to Bullseye unless there is some other issue that would necessitate a backport.

Thanks for adding puppet support!
