Noticed an alert like the following on alerts.wikimedia.org on each of the 3 datahubsearch nodes:
Rate of JVM GC Old generation-s runs - datahubsearch1002-datahub
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| Fix the prometheus elasticsearch exporter on bullseye | operations/puppet | production | +4 -0 |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | BTullis | T299910 Data Catalog MVP |
| Resolved | | razzi | T301382 Set up opensearch cluster for datahub |
| Resolved | | BTullis | T302818 Complete monitoring setup of datahubsearch nodes |
This is due to the elasticsearch puppet code: icinga::monitor::elasticsearch::old_jvm_gc_checks
The prometheus masters are not configured to poll data from the datahub instances, so the monitor doesn't find any metric and it reports NaN.
Interesting. So the Datahub role includes profile::opensearch::server directly, but the other clusters don't; they use profile::opensearch::logstash, which includes it. The logstash profile also defines the profile::prometheus::elasticsearch_exporter instances that expose the metrics, and these are not running on the datahub instances at the moment (the Prometheus masters are configured to poll any host that includes profile::prometheus::elasticsearch_exporter, afaics).
The code could be refactored to move the profile::prometheus::elasticsearch_exporter definition into profile::opensearch::server; I'm not sure whether it needs to stay in the logstash profile.
Right. Thanks @elukey. I think we may also need to include profile::opensearch::monitoring::base_checks, which sets up the rest of the Icinga monitoring.
I wonder if we should move the prometheus exporter part to a new class e.g. profile::opensearch::monitoring::prometheus which can then be included from both the logstash and datahub implementations.
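To make that concrete, here is a minimal sketch of what such a shared class could look like. It is purely illustrative: the class name follows the suggestion above, and the parameter name (prometheus_nodes) plus the exporter resource title and arguments are assumptions modelled on how the logstash profile declares its exporter instances, not the final implementation.

```
# Hypothetical sketch only - not the merged code.
# A shared monitoring profile wrapping the exporter, so that both the
# logstash and datahub opensearch roles can include it.
class profile::opensearch::monitoring::prometheus (
    Array[Stdlib::Host] $prometheus_nodes = lookup('prometheus_nodes'),
) {
    # Expose metrics from the local opensearch instance listening on :9200.
    # The resource title and parameters here are assumptions based on the
    # existing logstash profile and may not match the real defined type exactly.
    profile::prometheus::elasticsearch_exporter { 'localhost:9200':
        prometheus_nodes   => $prometheus_nodes,
        elasticsearch_port => 9200,
    }
}
```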
I would move this section:
```
file { '/usr/share/opensearch/plugins':
    ensure => 'directory',
    force  => true,
    owner  => 'root',
    group  => 'root',
    mode   => '0755',
} -> Class['opensearch']
```
...from profile::opensearch::logstash to the end of profile::opensearch::server.
What do you think, @razzi?
I'd be tempted just to rename this ticket to be something like: "Complete monitoring setup of datahubsearch nodes" since there are quite a few changes to make before we're finished.
> I would move this section:
> file { '/usr/share/opensearch/plugins': ensure => 'directory', force => true, owner => 'root', group => 'root', mode => '0755', } -> Class['opensearch']
> ...from profile::opensearch::logstash to the end of profile::opensearch::server.
I've done this in https://gerrit.wikimedia.org/r/c/operations/puppet/+/768702
Awaiting a review from @colewhite although I think it's probably OK. There is one small question outstanding on the CR.
FYI, prometheus-elasticsearch-exporter-9200.service is failing at the moment due to a misconfiguration:
```
root@datahubsearch1001:~$ journalctl -u prometheus-elasticsearch-exporter-9200.service | tail
Mar 11 08:03:50 datahubsearch1001 systemd[1]: Stopped Prometheus exporter for Elasticsearch.
Mar 11 08:03:50 datahubsearch1001 systemd[1]: Started Prometheus exporter for Elasticsearch.
Mar 11 08:03:50 datahubsearch1001 prometheus-elasticsearch-exporter[396273]: prometheus-elasticsearch-exporter: error: unknown short flag '-e', try --help
Mar 11 08:03:50 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Main process exited, code=exited, status=1/FAILURE
Mar 11 08:03:50 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Failed with result 'exit-code'.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Scheduled restart job, restart counter is at 5.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: Stopped Prometheus exporter for Elasticsearch.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Start request repeated too quickly.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: prometheus-elasticsearch-exporter-9200.service: Failed with result 'exit-code'.
Mar 11 08:03:51 datahubsearch1001 systemd[1]: Failed to start Prometheus exporter for Elasticsearch.
```
This causes the service to try restarting 5 times on every puppet run (every 30 minutes).
Thanks @jcrespo - The cause of it seems to be a difference in the version of the prometheus-elasticsearch-exporter between buster and bullseye.
The datahubsearch100* nodes are running bullseye and they show this when trying to run the exporter:
```
btullis@datahubsearch1001:~$ /usr/bin/prometheus-elasticsearch-exporter -es.uri=http://localhost:9200 -web.listen-address=:9108
prometheus-elasticsearch-exporter: error: unknown short flag '-e', try --help
```
This version wants to use long-form options:
```
btullis@datahubsearch1001:~$ /usr/bin/prometheus-elasticsearch-exporter --help 2>&1|grep uri
      --es.uri="http://localhost:9200"
```
Whereas the version that is installed on other hosts, e.g. logstash1024 wants to use the short-form options:
```
btullis@logstash1024:~$ /usr/bin/prometheus-elasticsearch-exporter --help 2>&1|grep uri
  -es.uri string
```
The version on buster is 1.0.4+ds-1 and the version on bullseye is 1.1.0+ds-2+b5.
```
btullis@logstash1024:~$ apt-cache policy prometheus-elasticsearch-exporter
prometheus-elasticsearch-exporter:
  Installed: 1.0.4+ds-1
  Candidate: 1.0.4+ds-1
  Version table:
 *** 1.0.4+ds-1 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages
       100 /var/lib/dpkg/status
```
```
btullis@datahubsearch1001:~$ apt-cache policy prometheus-elasticsearch-exporter
prometheus-elasticsearch-exporter:
  Installed: 1.1.0+ds-2+b5
  Candidate: 1.1.0+ds-2+b5
  Version table:
 *** 1.1.0+ds-2+b5 500
       500 http://mirrors.wikimedia.org/debian bullseye/main amd64 Packages
       100 /var/lib/dpkg/status
```
I think that the best option is probably to look at backporting 1.1.0+ds-2+b5 to buster and updating https://github.com/wikimedia/puppet/blob/production/modules/prometheus/templates/initscripts/prometheus-elasticsearch-exporter.systemd.erb#L8
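For illustration, the template change would essentially mean invoking the exporter with the long-form flags that the newer package accepts. Here is a minimal sketch of what the ExecStart line could become; the @elasticsearch_port and @listen_port variable names are placeholders rather than what the real template necessarily uses, and the merged patch below may have structured this differently:

```
# Sketch only: ExecStart line from the prometheus-elasticsearch-exporter
# systemd unit template, switched from short-form to long-form flags,
# since 1.1.0 rejects the short forms ("unknown short flag '-e'").
ExecStart=/usr/bin/prometheus-elasticsearch-exporter \
    --es.uri=http://localhost:<%= @elasticsearch_port %> \
    --web.listen-address=:<%= @listen_port %>
```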
Change 770005 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Fix the prometheus elasticsearch exporter on bullseye
Change 770005 merged by Btullis:
[operations/puppet@production] Fix the prometheus elasticsearch exporter on bullseye
All checks are green now that the prometheus exporter has been fixed. Marking this ticket as done.