cloudelastic100[1-6] are all showing this icinga warning:
"ElasticSearch shard size check - 9200"
They've been in this warning state for almost 10 days.
https://gerrit.wikimedia.org/r/c/operations/puppet/+/634391 was deployed (I forgot to include the Bug: line linking it back to this ticket).
However, that does not seem to have completely resolved the issue. The cloudelastic warnings went away, but a critical fired for eqiad at the old threshold. Currently investigating that.
Change 636811 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: fix shard_size thresholds
To review: the initial patch for this ticket fixed the alerts for "ElasticSearch shard size check - 9200" on cloudelastic100[1-6].
There are two remaining criticals for ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet and search.svc.eqiad.wmnet. https://gerrit.wikimedia.org/r/636811 should fix those and is about to be deployed right now.
Change 636811 merged by Ryan Kemper:
[operations/puppet@production] cirrus: fix shard_size thresholds
Following the deploy of the above change, the criticals have not resolved after forcing an active check. I'll need to circle back tomorrow (Fri Nov 6) to look at why the change didn't take effect.
54309:Nov 6 04:46:31 icinga1001 puppet-agent[74283]: (/Stage[main]/Icinga::Monitor::Elasticsearch::Cirrus_cluster_checks/Icinga::Monitor::Elasticsearch::Base_checks[search.svc.eqiad.wmnet]/Monitoring::Service[elasticsearch_shard_size_check_search.svc.eqiad.wmnet:9243]/Nagios_service[icinga1001 elasticsearch_shard_size_check_search.svc.eqiad.wmnet:9243]/check_command) check_command changed 'check_elasticsearch_shard_size!https!9243!50!60!4' to 'check_elasticsearch_shard_size!https!9243!80!100!4'
Looks like the new changes did take effect.
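For anyone comparing the old and new values, here is a minimal sketch that splits the two check_command strings from the puppet log into named fields. The field names (scheme, port, warning, critical, timeout) are my assumption, inferred from the order of the flags in the manual plugin invocation elsewhere in this ticket; the gb unit is assumed from the Icinga status text. ShardSizeCheck and parse_check_command are hypothetical helpers, not part of the puppet repo.

```python
from typing import NamedTuple


class ShardSizeCheck(NamedTuple):
    # Field names are assumptions; only the positional values come from the log.
    scheme: str
    port: int
    warning_gb: int
    critical_gb: int
    timeout_s: int


def parse_check_command(check_command: str) -> ShardSizeCheck:
    """Split a Nagios-style 'name!arg1!arg2...' string into named fields."""
    name, *args = check_command.split("!")
    if name != "check_elasticsearch_shard_size" or len(args) != 5:
        raise ValueError(f"unexpected check_command: {check_command!r}")
    scheme, port, warning, critical, timeout = args
    return ShardSizeCheck(scheme, int(port), int(warning), int(critical), int(timeout))


old = parse_check_command("check_elasticsearch_shard_size!https!9243!50!60!4")
new = parse_check_command("check_elasticsearch_shard_size!https!9243!80!100!4")

# Report only the fields that differ between the two definitions.
for field in ShardSizeCheck._fields:
    before, after = getattr(old, field), getattr(new, field)
    if before != after:
        print(f"{field}: {before} -> {after}")
```

Under those assumptions, only the warning and critical fields differ between the two strings, which matches the intent of the patch.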
By the way, here's what the nagios service definition looks like on icinga1001:
ryankemper@icinga1001:/etc$ vi nagios/nagios_service.cfg
define service {
        ## --PUPPET_NAME-- (called '_naginator_name' in the manifest) icinga1001 elasticsearch_shard_size_check_search.svc.eqiad.wmnet:9243
        active_checks_enabled          1
        check_command                  check_elasticsearch_shard_size!https!9243!80!100!4
        check_freshness                0
        check_interval                 1440
        check_period                   24x7
        contact_groups                 admins,team-discovery
        host_name                      search.svc.eqiad.wmnet
        is_volatile                    0
        max_check_attempts             3
        notes_url                      https://wikitech.wikimedia.org/wiki/Search#If_it_has_been_indexed
        notification_interval          0
        notification_options           c,r,f
        notification_period            24x7
        notifications_enabled          1
        passive_checks_enabled         1
        retry_interval                 180
        service_description            ElasticSearch shard size check - 9243
        servicegroups                  alerting_eqiad
The last thing to figure out is why https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+shard+size+check+-+9243 is showing:
UNKNOWN - int() argument must be a string, a bytes-like object or a number, not 'NoneType'
Turns out the above failure was transient; running the check again succeeded. For example:
ryankemper@icinga1001:/var/log$ /usr/lib/nagios/plugins/check_elasticsearch_shard_size.py --url https://search.svc.eqiad.wmnet:9243 --shard-size-warning 80 --shard-size-critical 100 --timeout 4
OK - All good!
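For context on the UNKNOWN state: that message is what Python raises when int() is handed None. I have not traced which value was None inside the plugin, so the following is only a sketch of the failure mode and of one way a check could guard against it. safe_int is a hypothetical helper, not code from check_elasticsearch_shard_size.py.

```python
from typing import Optional


def safe_int(raw: Optional[str]) -> Optional[int]:
    """Convert a value to int, passing None through instead of raising."""
    if raw is None:
        return None
    return int(raw)


# Reproduce the class of error seen in Icinga: int(None) raises TypeError.
try:
    int(None)  # type: ignore[arg-type]
    raised = False
except TypeError as exc:
    raised = True
    print(f"UNKNOWN - {exc}")

# With the guard, a missing value can be reported explicitly instead.
value = safe_int(None)
if value is None:
    print("UNKNOWN - size missing from response")
```

The exact wording of the TypeError message varies between Python versions, so the sketch prints whatever the interpreter reports.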
So this ticket is done; the new warning/critical thresholds are in effect and properly working.
All cloudelastic servers are showing up in Icinga again. For example:
cloudelastic1004 ElasticSearch shard size check - 9200 WARNING 2020-12-13 22:47:41 2d 2h 32m 50s 3/3 WARNING - commonswiki_file_1594711825(81.90625gb)
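To illustrate why that shard lands in WARNING, here is a rough sketch of the threshold comparison. It assumes warning/critical thresholds of 80 and 100 (the values from the 9243 check; the cloudelastic thresholds are not shown in this ticket and may differ) and assumes the comparison is "greater than or equal". classify is a hypothetical function, not the plugin's actual logic.

```python
def classify(size_gb: float, warning_gb: float, critical_gb: float) -> str:
    """Map a shard size to a Nagios-style state (assumed >= comparison)."""
    if size_gb >= critical_gb:
        return "CRITICAL"
    if size_gb >= warning_gb:
        return "WARNING"
    return "OK"


# Shard size taken from the Icinga output; thresholds are assumed.
state = classify(81.90625, warning_gb=80, critical_gb=100)
print(f"{state} - commonswiki_file_1594711825(81.90625gb)")
```

Since the shard is only slightly over the assumed warning threshold, a threshold bump clears the alert, but it will come back as the index keeps growing.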
Change 650021 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: bump es shard size alert thresholds
Change 650021 merged by Ryan Kemper:
[operations/puppet@production] cirrus: bump es shard size alert thresholds
Change 654917 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: bump es shard size alert thresholds
Change 654917 merged by Ryan Kemper:
[operations/puppet@production] cirrus: bump es shard size alert thresholds