
Complete monitoring setup of datahubsearch nodes
Closed, Resolved · Public

Description

Noticed an alert like the following on alerts.wikimedia.org on each of the 3 datahubsearch nodes:

Rate of JVM GC Old generation-s runs - datahubsearch1002-datahub

(Screenshot attached: image.png, 217 KB)

Event Timeline

This is due to the elasticsearch puppet code: icinga::monitor::elasticsearch::old_jvm_gc_checks

The prometheus masters are not configured to poll data from the datahub instances, so the monitor doesn't find any metric and it reports NaN.

Interesting. So the Datahub role includes profile::opensearch::server directly, whereas the other clusters use profile::opensearch::logstash, which in turn includes it. The logstash profile also defines the profile::prometheus::elasticsearch_exporter instances that expose the metrics, and those exporters are not running on the datahub instances at the moment (the Prometheus masters are configured to poll any host that includes profile::prometheus::elasticsearch_exporter, as far as I can see).

The code could be refactored to move the profile::prometheus::elasticsearch_exporter definition to profile::opensearch::server; I'm not sure whether it needs to stay in the logstash profile.

Right. Thanks @elukey. I think we may also need to include profile::opensearch::monitoring::base_checks, which sets up the rest of the Icinga monitoring.

I wonder if we should move the prometheus exporter part to a new class, e.g. profile::opensearch::monitoring::prometheus, which can then be included from both the logstash and datahub implementations.
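
A rough sketch of how that might hang together (hypothetical only; whether the exporter part is a simple include or a set of per-port resources depends on how profile::opensearch::logstash currently defines it):

# Hypothetical wrapper class, named as suggested above. The exporter
# definitions would move here from profile::opensearch::logstash, in whatever
# form they currently take; the include below is only a placeholder.
class profile::opensearch::monitoring::prometheus {
    include ::profile::prometheus::elasticsearch_exporter
}

Both the logstash and datahub implementations would then just need:

include ::profile::opensearch::monitoring::prometheus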

I would move this section:

file { '/usr/share/opensearch/plugins':
    ensure => 'directory',
    force  => true,
    owner  => 'root',
    group  => 'root',
    mode   => '0755',
} -> Class['opensearch']

...from profile::opensearch::logstash to the end of profile::opensearch::server.

What do you think @razzi ?

I'd be tempted just to rename this ticket to be something like: "Complete monitoring setup of datahubsearch nodes" since there are quite a few changes to make before we're finished.

> I would move this section:
>
> file { '/usr/share/opensearch/plugins':
>     ensure => 'directory',
>     force  => true,
>     owner  => 'root',
>     group  => 'root',
>     mode   => '0755',
> } -> Class['opensearch']
>
> ...from profile::opensearch::logstash to the end of profile::opensearch::server.

I've done this in https://gerrit.wikimedia.org/r/c/operations/puppet/+/768702

Awaiting a review from @colewhite although I think it's probably OK. There is one small question outstanding on the CR.

BTullis renamed this task from datahubsearch nodes alerting with "Rate of JVM GC Old generation-s runs" to Complete monitoring setup of datahubsearch nodes. (Mar 7 2022, 2:58 PM)
BTullis triaged this task as Medium priority.

FYI, prometheus-elasticsearch-exporter-9200.service is failing at the moment due to a misconfiguration:

✔️ root@datahubsearch1001:~$ journalctl -u prometheus-elasticsearch-exporter-9200.service | tail
Mar 11 08:03:50 datahubsearch1001 systemd[1]: Stopped Prometheus exporter for Elasticsearch.
Mar 11 08:03:50 datahubsearch1001 systemd[1]: Started Prometheus exporter for Elasticsearch.
Mar 11 08:03:50 datahubsearch1001 prometheus-elasticsearch-exporter[396273]: prometheus-elasticsearch-exporter: error: unknown short flag '-e', try --help
Mar 11 08:03:50 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Main process exited, code=exited, status=1/FAILURE
Mar 11 08:03:50 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Failed with result 'exit-code'.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Scheduled restart job, restart counter is at 5.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: Stopped Prometheus exporter for Elasticsearch.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Start request repeated too quickly.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Failed with result 'exit-code'.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: Failed to start Prometheus exporter for Elasticsearch.

This causes the service to try restarting 5 times on every puppet run (every 30 minutes).

Thanks @jcrespo. The cause seems to be a difference in the version of the prometheus-elasticsearch-exporter package between buster and bullseye.

The datahubsearch100* nodes are running bullseye, and they show this error when trying to run the exporter:

btullis@datahubsearch1001:~$ /usr/bin/prometheus-elasticsearch-exporter -es.uri=http://localhost:9200 -web.listen-address=:9108
prometheus-elasticsearch-exporter: error: unknown short flag '-e', try --help

This version expects the long-form options:

btullis@datahubsearch1001:~$ /usr/bin/prometheus-elasticsearch-exporter --help 2>&1|grep uri
      --es.uri="http://localhost:9200"

Whereas the version installed on other hosts, e.g. logstash1024, expects the short-form options:

btullis@logstash1024:~$ /usr/bin/prometheus-elasticsearch-exporter --help 2>&1|grep uri
  -es.uri string

The version on buster is 1.0.4+ds-1 and the version on bullseye is 1.1.0+ds-2+b5:

btullis@logstash1024:~$ apt-cache policy prometheus-elasticsearch-exporter
prometheus-elasticsearch-exporter:
  Installed: 1.0.4+ds-1
  Candidate: 1.0.4+ds-1
  Version table:
 *** 1.0.4+ds-1 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
btullis@datahubsearch1001:~$ apt-cache policy prometheus-elasticsearch-exporter
prometheus-elasticsearch-exporter:
  Installed: 1.1.0+ds-2+b5
  Candidate: 1.1.0+ds-2+b5
  Version table:
 *** 1.1.0+ds-2+b5 500
        500 http://mirrors.wikimedia.org/debian bullseye/main amd64 Packages
        100 /var/lib/dpkg/status

I think that the best option is probably to look at backporting 1.1.0+ds-2+b5 to buster and updating https://github.com/wikimedia/puppet/blob/production/modules/prometheus/templates/initscripts/prometheus-elasticsearch-exporter.systemd.erb#L8
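
For illustration, the sort of change that template would need (the --es.uri long form is confirmed by the --help output above; --web.listen-address is assumed to have been renamed the same way, so treat this as a sketch rather than the exact template line):

# Current arguments (short flags, accepted by 1.0.4 on buster):
/usr/bin/prometheus-elasticsearch-exporter -es.uri=http://localhost:9200 -web.listen-address=:9108

# Long-form equivalent needed by 1.1.0 on bullseye:
/usr/bin/prometheus-elasticsearch-exporter --es.uri=http://localhost:9200 --web.listen-address=:9108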

Change 770005 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the prometheus elasticsearch exporter on bullseye

https://gerrit.wikimedia.org/r/770005

Change 770005 merged by Btullis:

[operations/puppet@production] Fix the prometheus elasticsearch exporter on bullseye

https://gerrit.wikimedia.org/r/770005

BTullis removed a project: Data Pipelines.

All checks are green now that the prometheus exporter has been fixed. Marking this ticket as done.