"ElasticSearch shard size check" icinga warnings on cloudelastic servers
Closed, Resolved (Public)

Assigned To: RKemper

Authored By: Andrew, Oct 19 2020, 1:54 PM

Description

cloudelastic100[1-6] are all showing this icinga warning:

"ElasticSearch shard size check - 9200"

They've been in this warning state for almost 10 days.

Details

Subject	Repo	Branch	Lines +/-
cirrus: bump es shard size alert thresholds	operations/puppet	production	+2 -2
cirrus: bump es shard size alert thresholds	operations/puppet	production	+2 -2
cirrus: fix shard_size thresholds	operations/puppet	production	+25 -6


Related Objects

Mentioned In: T260083: Reshard commonswiki_file elasticsearch index

Event Timeline

Andrew created this task. Oct 19 2020, 1:54 PM

Restricted Application added a project: cloud-services-team (Kanban). Oct 19 2020, 1:54 PM

Restricted Application added a subscriber: Aklapper.

EBernhardson assigned this task to RKemper. Oct 19 2020, 3:18 PM

EBernhardson triaged this task as Medium priority.

EBernhardson moved this task from needs triage to Current work on the Discovery-Search board.

EBernhardson edited projects, added Discovery-Search (Current work); removed Discovery-Search.

RKemper moved this task from Incoming to Needs Reporting on the Discovery-Search (Current work) board. Oct 19 2020, 5:03 PM

https://gerrit.wikimedia.org/r/c/operations/puppet/+/634391 was deployed (forgot to include the Bug: line linking back to this ticket).

However, the above does not seem to have completely resolved the issue. The cloudelastic warnings went away, but a critical went off for eqiad at the old threshold. Currently investigating that.

Change 636811 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: fix shard_size thresholds

https://gerrit.wikimedia.org/r/636811

gerritbot added a project: Patch-For-Review. Oct 28 2020, 5:34 AM

So to review, the initial patch for this ticket fixed the alerts for "ElasticSearch shard size check - 9200" on cloudelastic100[1-6].

There are two remaining criticals for "ElasticSearch shard size check - 9243", on search.svc.codfw.wmnet and search.svc.eqiad.wmnet. https://gerrit.wikimedia.org/r/636811 should fix those and is about to be deployed.

Change 636811 merged by Ryan Kemper:
[operations/puppet@production] cirrus: fix shard_size thresholds

https://gerrit.wikimedia.org/r/636811

RKemper mentioned this in T260083: Reshard commonswiki_file elasticsearch index. Nov 6 2020, 4:31 AM

Following the deploy of the above change, the criticals did not resolve even after forcing an active check. I'll need to circle back tomorrow (Fri Nov 6) to look at why the change didn't take effect.

Maintenance_bot removed a project: Patch-For-Review. Nov 6 2020, 5:10 AM

54309:Nov 6 04:46:31 icinga1001 puppet-agent[74283]: (/Stage[main]/Icinga::Monitor::Elasticsearch::Cirrus_cluster_checks/Icinga::Monitor::Elasticsearch::Base_checks[search.svc.eqiad.wmnet]/Monitoring::Service[elasticsearch_shard_size_check_search.svc.eqiad.wmnet:9243]/Nagios_service[icinga1001 elasticsearch_shard_size_check_search.svc.eqiad.wmnet:9243]/check_command) check_command changed 'check_elasticsearch_shard_size!https!9243!50!60!4' to 'check_elasticsearch_shard_size!https!9243!80!100!4'

Looks like the new changes did take effect.
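For anyone reading the puppet log line above: Nagios/Icinga separates a check_command's name from its arguments with "!". Below is a minimal Python sketch that splits the old and new values and reports which fields changed. The field names (scheme, port, warning_gb, critical_gb, timeout_s) are my own labels, inferred from the manual plugin invocation later in this ticket, and parse_check_command is a hypothetical helper, not part of the puppet code or the plugin.

```python
def parse_check_command(check_command: str) -> dict:
    """Split a Nagios/Icinga check_command into its name and '!'-separated arguments."""
    name, *args = check_command.split("!")
    # Assumed positional meaning (labels are illustrative, not from the plugin itself).
    scheme, port, warning, critical, timeout = args
    return {
        "command": name,
        "scheme": scheme,
        "port": int(port),
        "warning_gb": int(warning),
        "critical_gb": int(critical),
        "timeout_s": int(timeout),
    }


old = parse_check_command("check_elasticsearch_shard_size!https!9243!50!60!4")
new = parse_check_command("check_elasticsearch_shard_size!https!9243!80!100!4")

# Keep only the fields whose value differs between the two definitions.
changed = {key: (old[key], new[key]) for key in old if old[key] != new[key]}
print(changed)  # {'warning_gb': (50, 80), 'critical_gb': (60, 100)}
```

Only the warning and critical thresholds differ, which matches the intent of the patch.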

By the way, here's what the nagios service definition looks like on icinga1001:

ryankemper@icinga1001:/etc$ vi nagios/nagios_service.cfg

define service {
        ## --PUPPET_NAME-- (called '_naginator_name' in the manifest)                icinga1001 elasticsearch_shard_size_check_search.svc.eqiad.wmnet:9243
        active_checks_enabled          1
        check_command                  check_elasticsearch_shard_size!https!9243!80!100!4
        check_freshness                0
        check_interval                 1440
        check_period                   24x7
        contact_groups                 admins,team-discovery
        host_name                      search.svc.eqiad.wmnet
        is_volatile                    0
        max_check_attempts             3
        notes_url                      https://wikitech.wikimedia.org/wiki/Search#If_it_has_been_indexed
        notification_interval          0
        notification_options           c,r,f
        notification_period            24x7
        notifications_enabled          1
        passive_checks_enabled         1
        retry_interval                 180
        service_description            ElasticSearch shard size check - 9243
        servicegroups                  alerting_eqiad

The last thing to figure out is why https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=search.svc.eqiad.wmnet&service=ElasticSearch+shard+size+check+-+9243 is showing:

UNKNOWN - int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Turns out the above was a transient failure; running the check again succeeded. For example:

ryankemper@icinga1001:/var/log$ /usr/lib/nagios/plugins/check_elasticsearch_shard_size.py --url https://search.svc.eqiad.wmnet:9243 --shard-size-warning 80 --shard-size-critical 100 --timeout 4
OK - All good!

So this ticket is done; the new warning/critical thresholds are in effect and properly working.
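As a side note on the UNKNOWN message above: that wording is what Python raises when int() is handed None. The ticket does not establish which value was None or why, so the following is only a sketch of how such a failure can arise and be guarded against. The "store" field name and both helper functions are hypothetical and not taken from check_elasticsearch_shard_size.py.

```python
def to_int_or_none(value):
    """Return int(value), or None when the value is missing."""
    if value is None:
        return None
    return int(value)


def shard_size_bytes(shard: dict):
    # 'store' is a hypothetical field name; a missing key yields None instead of raising.
    return to_int_or_none(shard.get("store"))


# Reproduce the class of error seen in Icinga: int(None) raises TypeError.
try:
    int(None)
except TypeError as exc:
    error_type = type(exc).__name__

print(error_type)
print(shard_size_bytes({"store": "123"}))
print(shard_size_bytes({}))
```

Whether the real check should skip such entries or report UNKNOWN is a design decision for the plugin; this only shows where the exception comes from.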

Gehel closed this task as Resolved. Nov 16 2020, 1:57 PM

All cloudelastic servers are showing this warning in Icinga again. For example:

Host: cloudelastic1004
Service: ElasticSearch shard size check - 9200
Status: WARNING
Last check: 2020-12-13 22:47:41
Duration: 2d 2h 32m 50s
Attempt: 3/3
Status information: WARNING - commonswiki_file_1594711825(81.90625gb)
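To make the status line above concrete, here is a rough Python sketch that pulls the index name and size out of the check output and compares it against warning/critical thresholds. The 80/100 values are borrowed from the eqiad 9243 check earlier in this ticket; the thresholds configured for cloudelastic on port 9200 at this point are not stated here, so treat them as an assumption. The helper names and the ">=" comparison are also illustrative, not the plugin's actual logic.

```python
import re


def parse_shard_sizes(output: str) -> dict:
    """Extract {index_name: size_in_gb} pairs from text like 'name(81.9gb)'."""
    pairs = re.findall(r"(\w+)\(([\d.]+)gb\)", output)
    return {name: float(size) for name, size in pairs}


def classify(size_gb: float, warning_gb: float, critical_gb: float) -> str:
    """Map a shard size to a Nagios-style state (assumes inclusive thresholds)."""
    if size_gb >= critical_gb:
        return "CRITICAL"
    if size_gb >= warning_gb:
        return "WARNING"
    return "OK"


status = "WARNING - commonswiki_file_1594711825(81.90625gb)"
sizes = parse_shard_sizes(status)

# Assumed thresholds: 80 GB warning, 100 GB critical.
states = {name: classify(size, 80, 100) for name, size in sizes.items()}
print(states)  # {'commonswiki_file_1594711825': 'WARNING'}
```

Under those assumed thresholds, a shard of about 81.9 GB sits just above the warning line, which is consistent with the follow-up patches bumping the thresholds.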

CBogen moved this task from Needs Reporting to In Progress on the Discovery-Search (Current work) board. Dec 14 2020, 7:43 PM

Change 650021 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: bump es shard size alert thresholds

https://gerrit.wikimedia.org/r/650021

gerritbot added a project: Patch-For-Review. Dec 17 2020, 5:50 AM

RKemper moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Dec 17 2020, 6:53 AM

Change 650021 merged by Ryan Kemper:
[operations/puppet@production] cirrus: bump es shard size alert thresholds

https://gerrit.wikimedia.org/r/650021

RKemper moved this task from Needs review to In Progress on the Discovery-Search (Current work) board. Jan 4 2021, 4:31 PM

Change 654917 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] cirrus: bump es shard size alert thresholds

https://gerrit.wikimedia.org/r/654917

Change 654917 merged by Ryan Kemper:
[operations/puppet@production] cirrus: bump es shard size alert thresholds

https://gerrit.wikimedia.org/r/654917

RKemper moved this task from In Progress to Needs Reporting on the Discovery-Search (Current work) board. Jan 14 2021, 8:30 PM

Gehel closed this task as Resolved. Jan 20 2021, 8:27 AM
