
"ElasticSearch shard size check" icinga warnings on cloudelastic servers
Closed, Resolved (Public)

Description

cloudelastic100[1-6] are all showing this icinga warning:

"ElasticSearch shard size check - 9200"

They've been in this warning state for almost 10 days.

Event Timeline

Restricted Application added a subscriber: Aklapper.
EBernhardson triaged this task as Medium priority.
EBernhardson moved this task from needs triage to Current work on the Discovery-Search board.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/634391 was deployed (forgot to include the Bug: line linking back to this ticket).

However, the above does not seem to have completely resolved the issue. The cloudelastic warnings went away, but a critical went off for eqiad at the old threshold. Currently investigating that.

Change 636811 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: fix shard_size thresholds

https://gerrit.wikimedia.org/r/636811

To review: the initial patch for this ticket fixed the alerts for "ElasticSearch shard size check - 9200" on cloudelastic100[1-6].

There are two remaining criticals for ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet and search.svc.eqiad.wmnet. https://gerrit.wikimedia.org/r/636811 should fix those and is about to be deployed right now.

Change 636811 merged by Ryan Kemper:
[operations/puppet@production] cirrus: fix shard_size thresholds

https://gerrit.wikimedia.org/r/636811

Following the deploy of the above change, the criticals have not resolved after forcing an active check. I'll need to circle back tomorrow (Fri Nov 6) to look at why the change didn't take effect.

54309:Nov 6 04:46:31 icinga1001 puppet-agent[74283]: (/Stage[main]/Icinga::Monitor::Elasticsearch::Cirrus_cluster_checks/Icinga::Monitor::Elasticsearch::Base_checks[search.svc.eqiad.wmnet]/Monitoring::Service[elasticsearch_shard_size_check_search.svc.eqiad.wmnet:9243]/Nagios_service[icinga1001 elasticsearch_shard_size_check_search.svc.eqiad.wmnet:9243]/check_command) check_command changed 'check_elasticsearch_shard_size!https!9243!50!60!4' to 'check_elasticsearch_shard_size!https!9243!80!100!4'

Looks like the new changes did take effect.
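For readers unfamiliar with the `!`-separated check_command format in the puppet log line, here is a minimal Python sketch of how the fields appear to line up with the plugin's command-line flags. The mapping (scheme, port, warning, critical, timeout) is an assumption inferred from the manual plugin invocation recorded later in this ticket, not taken from the actual Icinga command definition, and `check_command_to_argv` is a hypothetical helper name.

```python
from typing import List


def check_command_to_argv(check_command: str, host: str) -> List[str]:
    """Expand a '!'-separated check_command into a plugin command line.

    Assumed field order (inferred from this ticket, not from the real
    command definition): name!scheme!port!warning!critical!timeout
    """
    name, scheme, port, warning, critical, timeout = check_command.split("!")
    return [
        f"/usr/lib/nagios/plugins/{name}.py",
        "--url", f"{scheme}://{host}:{port}",
        "--shard-size-warning", warning,
        "--shard-size-critical", critical,
        "--timeout", timeout,
    ]


old = check_command_to_argv(
    "check_elasticsearch_shard_size!https!9243!50!60!4", "search.svc.eqiad.wmnet"
)
new = check_command_to_argv(
    "check_elasticsearch_shard_size!https!9243!80!100!4", "search.svc.eqiad.wmnet"
)
print(" ".join(new))
```

Under that assumption, the change in the log amounts to moving the warning/critical pair from 50/60 to 80/100 while leaving the port and timeout untouched.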

By the way, here's what the nagios service definition looks like on icinga1001:

ryankemper@icinga1001:/etc$ vi nagios/nagios_service.cfg

define service {
        ## --PUPPET_NAME-- (called '_naginator_name' in the manifest)                icinga1001 elasticsearch_shard_size_check_search.svc.eqiad.wmnet:9243
        active_checks_enabled          1
        check_command                  check_elasticsearch_shard_size!https!9243!80!100!4
        check_freshness                0
        check_interval                 1440
        check_period                   24x7
        contact_groups                 admins,team-discovery
        host_name                      search.svc.eqiad.wmnet
        is_volatile                    0
        max_check_attempts             3
        notes_url                      https://wikitech.wikimedia.org/wiki/Search#If_it_has_been_indexed
        notification_interval          0
        notification_options           c,r,f
        notification_period            24x7
        notifications_enabled          1
        passive_checks_enabled         1
        retry_interval                 180
        service_description            ElasticSearch shard size check - 9243
        servicegroups                  alerting_eqiad

The last thing to figure out is why https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+shard+size+check+-+9243 is showing:

UNKNOWN - int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Turns out the above failure was transient; running the check again succeeded. For example:

ryankemper@icinga1001:/var/log$ /usr/lib/nagios/plugins/check_elasticsearch_shard_size.py --url https://search.svc.eqiad.wmnet:9243 --shard-size-warning 80 --shard-size-critical 100 --timeout 4
OK - All good!
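The UNKNOWN text above is the message Python produces when `int()` is called on `None`. The ticket does not show the plugin source or say which value was missing, so the following is only a sketch of how such an error can surface and how a check could guard against it; `naive_size` and `safe_size` are hypothetical names, not functions from check_elasticsearch_shard_size.py.

```python
def naive_size(raw):
    # Fails with TypeError when the upstream value is missing (None).
    return int(raw)


def safe_size(raw):
    # Treat a missing value as "no data" instead of raising.
    if raw is None:
        return None
    return int(raw)


try:
    naive_size(None)
    status = "OK"
except TypeError as exc:
    # The exact wording of the message varies slightly between Python versions.
    status = f"UNKNOWN - {exc}"

print(status)
print(safe_size(None), safe_size("42"))
```

A guard like `safe_size` lets the caller skip or report entries that have no size yet, rather than turning the whole check result into UNKNOWN.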

So this ticket is done; the new warning/critical thresholds are in effect and properly working.

All cloudelastic servers are showing up in Icinga with this warning again. For example:

Host: cloudelastic1004
Service: ElasticSearch shard size check - 9200
Status: WARNING
Last check: 2020-12-13 22:47:41
Duration: 2d 2h 32m 50s
Attempt: 3/3
Status information: WARNING - commonswiki_file_1594711825(81.90625gb)
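To make the threshold arithmetic behind this warning concrete, here is a small Python sketch of a warning/critical classification over shard sizes. It is not the plugin's code: the function name `classify`, the use of `>=` for the comparison, and the assumption that cloudelastic uses the same 80/100 GB thresholds seen for search.svc.eqiad.wmnet are all illustrative. The output strings imitate the check results quoted in this ticket.

```python
def classify(shards, warning_gb, critical_gb):
    """Return a Nagios-style status line for a {shard_name: size_gb} mapping."""
    def fmt(items):
        return ", ".join(f"{name}({size}gb)" for name, size in items)

    critical = [(n, s) for n, s in shards.items() if s >= critical_gb]
    warning = [(n, s) for n, s in shards.items() if warning_gb <= s < critical_gb]
    if critical:
        return f"CRITICAL - {fmt(critical)}"
    if warning:
        return f"WARNING - {fmt(warning)}"
    return "OK - All good!"


shards = {"commonswiki_file_1594711825": 81.90625}
print(classify(shards, warning_gb=80, critical_gb=100))
print(classify(shards, warning_gb=50, critical_gb=60))
```

With an 80 GB warning threshold, an 81.9 GB shard lands just inside the warning band, which is why growth in a single large index is enough to bring the alert back and prompt another threshold bump.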

Change 650021 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: bump es shard size alert thresholds

https://gerrit.wikimedia.org/r/650021

Change 650021 merged by Ryan Kemper:
[operations/puppet@production] cirrus: bump es shard size alert thresholds

https://gerrit.wikimedia.org/r/650021

Change 654917 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: bump es shard size alert thresholds

https://gerrit.wikimedia.org/r/654917

Change 654917 merged by Ryan Kemper:
[operations/puppet@production] cirrus: bump es shard size alert thresholds

https://gerrit.wikimedia.org/r/654917