
Complete monitoring setup of datahubsearch nodes
Closed, Resolved · Public

Description

Noticed an alert like the following on alerts.wikimedia.org on each of the 3 datahubsearch nodes:

Rate of JVM GC Old generation-s runs - datahubsearch1002-datahub

(Screenshot attached: image.png, 217 KB)

Event Timeline

This is due to the elasticsearch puppet code: icinga::monitor::elasticsearch::old_jvm_gc_checks

The prometheus masters are not configured to poll data from the datahub instances, so the monitor doesn't find any metric and it reports NaN.

Interesting. So the Datahub role includes profile::opensearch::server directly, whereas the other clusters use profile::opensearch::logstash, which in turn includes it. The logstash profile also defines the profile::prometheus::elasticsearch_exporter instances that expose the metrics, and those exporters are not running on the datahub instances at the moment (the Prometheus masters are configured to poll any host that includes profile::prometheus::elasticsearch_exporter, as far as I can see).

The code could be refactored to move the profile::prometheus::elasticsearch_exporter definition to profile::opensearch::server; I'm not sure whether it needs to stay in the logstash profile.

Right. Thanks @elukey. I think we may also need to include profile::opensearch::monitoring::base_checks, which sets up the rest of the Icinga monitoring.

I wonder if we should move the prometheus exporter part to a new class, e.g. profile::opensearch::monitoring::prometheus, which can then be included from both the logstash and datahub implementations.
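
A rough sketch of how that might hang together (hypothetical only; whether the exporter part is a simple include or a set of per-port resources depends on how profile::opensearch::logstash currently defines it):

# Hypothetical wrapper class, named as suggested above. The exporter
# definitions would move here from profile::opensearch::logstash, in whatever
# form they currently take; the include below is only a placeholder.
class profile::opensearch::monitoring::prometheus {
    include ::profile::prometheus::elasticsearch_exporter
}

Both the logstash and datahub implementations would then just need:

include ::profile::opensearch::monitoring::prometheus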

I would move this section:

file { '/usr/share/opensearch/plugins':
    ensure => 'directory',
    force  => true,
    owner  => 'root',
    group  => 'root',
    mode   => '0755',
} -> Class['opensearch']

...from profile::opensearch::logstash to the end of profile::opensearch::server.

What do you think @razzi ?

I'd be tempted just to rename this ticket to be something like: "Complete monitoring setup of datahubsearch nodes" since there are quite a few changes to make before we're finished.

> I would move this section:
>
> file { '/usr/share/opensearch/plugins':
>     ensure => 'directory',
>     force  => true,
>     owner  => 'root',
>     group  => 'root',
>     mode   => '0755',
> } -> Class['opensearch']
>
> ...from profile::opensearch::logstash to the end of profile::opensearch::server.

I've done this in https://gerrit.wikimedia.org/r/c/operations/puppet/+/768702

Awaiting a review from @colewhite although I think it's probably OK. There is one small question outstanding on the CR.

BTullis renamed this task from datahubsearch nodes alerting with "Rate of JVM GC Old generation-s runs" to Complete monitoring setup of datahubsearch nodes. (Mar 7 2022, 2:58 PM)
BTullis triaged this task as Medium priority.

FYI, prometheus-elasticsearch-exporter-9200.service is failing at the moment due to a misconfiguration:

✔️ root@datahubsearch1001:~$ journalctl -u prometheus-elasticsearch-exporter-9200.service | tail
Mar 11 08:03:50 datahubsearch1001 systemd[1]: Stopped Prometheus exporter for Elasticsearch.
Mar 11 08:03:50 datahubsearch1001 systemd[1]: Started Prometheus exporter for Elasticsearch.
Mar 11 08:03:50 datahubsearch1001 prometheus-elasticsearch-exporter[396273]: prometheus-elasticsearch-exporter: error: unknown short flag '-e', try --help
Mar 11 08:03:50 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Main process exited, code=exited, status=1/FAILURE
Mar 11 08:03:50 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Failed with result 'exit-code'.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Scheduled restart job, restart counter is at 5.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: Stopped Prometheus exporter for Elasticsearch.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Start request repeated too quickly.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Failed with result 'exit-code'.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: Failed to start Prometheus exporter for Elasticsearch.

This causes the service to try restarting 5 times on every puppet run (every 30 minutes).

Thanks @jcrespo. The cause seems to be a difference in the version of the prometheus-elasticsearch-exporter package between buster and bullseye.

The datahubsearch100* nodes are running bullseye, and they show this error when trying to run the exporter:

btullis@datahubsearch1001:~$ /usr/bin/prometheus-elasticsearch-exporter -es.uri=http://localhost:9200 -web.listen-address=:9108
prometheus-elasticsearch-exporter: error: unknown short flag '-e', try --help

This version expects the long-form options:

btullis@datahubsearch1001:~$ /usr/bin/prometheus-elasticsearch-exporter --help 2>&1|grep uri
      --es.uri="http://localhost:9200"

Whereas the version installed on other hosts, e.g. logstash1024, expects the short-form options:

btullis@logstash1024:~$ /usr/bin/prometheus-elasticsearch-exporter --help 2>&1|grep uri
  -es.uri string

The version on buster is 1.0.4+ds-1 and the version on bullseye is 1.1.0+ds-2+b5:

btullis@logstash1024:~$ apt-cache policy prometheus-elasticsearch-exporter
prometheus-elasticsearch-exporter:
  Installed: 1.0.4+ds-1
  Candidate: 1.0.4+ds-1
  Version table:
 *** 1.0.4+ds-1 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status
btullis@datahubsearch1001:~$ apt-cache policy prometheus-elasticsearch-exporter
prometheus-elasticsearch-exporter:
  Installed: 1.1.0+ds-2+b5
  Candidate: 1.1.0+ds-2+b5
  Version table:
 *** 1.1.0+ds-2+b5 500
        500 http://mirrors.wikimedia.org/debian bullseye/main amd64 Packages
        100 /var/lib/dpkg/status

I think that the best option is probably to look at backporting 1.1.0+ds-2+b5 to buster and updating https://github.com/wikimedia/puppet/blob/production/modules/prometheus/templates/initscripts/prometheus-elasticsearch-exporter.systemd.erb#L8
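
For illustration, the sort of change that template would need (the --es.uri long form is confirmed by the --help output above; --web.listen-address is assumed to have been renamed the same way, so treat this as a sketch rather than the exact template line):

# Current arguments (short flags, accepted by 1.0.4 on buster):
/usr/bin/prometheus-elasticsearch-exporter -es.uri=http://localhost:9200 -web.listen-address=:9108

# Long-form equivalent needed by 1.1.0 on bullseye:
/usr/bin/prometheus-elasticsearch-exporter --es.uri=http://localhost:9200 --web.listen-address=:9108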

Change 770005 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the prometheus elasticsearch exporter on bullseye

https://gerrit.wikimedia.org/r/770005

Change 770005 merged by Btullis:

[operations/puppet@production] Fix the prometheus elasticsearch exporter on bullseye

https://gerrit.wikimedia.org/r/770005

BTullis removed a project: Data Pipelines.

All checks are green now that the prometheus exporter has been fixed. Marking this ticket as done.