
Icinga should alert on free disk space < 15% (now < 12%) on Elasticsearch hosts
Closed, Resolved (Public)

Description

Elasticsearch logs warnings when trying to allocate shards while free disk space is below cluster.routing.allocation.disk.watermark.low. This creates a lot of unnecessary logging traffic to logstash and is a sign of an unbalanced cluster, which in turn leads to high variance in query time. In that case the cluster should be rebalanced, either by lowering cluster.routing.allocation.disk.watermark.high to force shards to move, or by moving shards around manually.
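
For reference, a rebalance can be forced through the cluster settings and reroute APIs. A minimal sketch, assuming the API is reachable on localhost:9200; the 80% value, index name and node names are placeholders, not what we actually run:

# temporarily lower the high watermark so shards get relocated off full nodes
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": { "cluster.routing.allocation.disk.watermark.high": "80%" }
}'

# or move a single shard by hand
curl -XPOST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
  "commands": [
    { "move": { "index": "enwiki_content", "shard": 0, "from_node": "elastic1001", "to_node": "elastic1002" } }
  ]
}'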

Having an early warning as an Icinga alert would enable us to react appropriately.

Event Timeline

It currently alerts at the defaults of 6% (warning) / 3% (critical) free:

# nrpe_check_disk_options   - Default options for checking disks.  Defaults to checking
#                             all disks and warning at < 6% and critical at < 3% free.

class base::monitoring::host

$nrpe_check_disk_options = '-w 6% -c 3% -l -e -A -i "/srv/sd[a-b][1-3]" --exclude-type=tracefs',

This can be overridden in Hiera; for example, the labs NFS servers use different values:

hieradata/role/common/labs/nfs/fileserver.yaml:base::monitoring::host::nrpe_check_disk_options: -w 10% -c 5% -l -e -A -i /run/lock/storage-replicate-.*/snapshot -i /exp/.*
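
Presumably an elastic-specific override would follow the same pattern; a hypothetical sketch only (the role file path is a guess, and the values actually used are in the patch below):

hieradata/role/common/elasticsearch.yaml:base::monitoring::host::nrpe_check_disk_options: -w 18% -c 15% -l -e -A -i "/srv/sd[a-b][1-3]" --exclude-type=tracefs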

Dzahn renamed this task from Icinga should alert on free disk space < 15% to Icinga should alert on free disk space < 15% on Elasticsearch hosts. Mar 29 2016, 10:56 PM
Dzahn claimed this task.

Change 280343 had a related patch set uploaded (by Dzahn):
elastic: change disk space monitoring to alert at 15%

https://gerrit.wikimedia.org/r/280343

Thanks @Dzahn! I had this on my not-too-urgent todo list, but you've been much faster than me! Great!

Change 280343 merged by Dzahn:
elastic: change disk space monitoring to alert at 15%

https://gerrit.wikimedia.org/r/280343

Needs confirmation on neon itself, in the generated Icinga config; I don't see it there just yet.

Dzahn triaged this task as Medium priority. Mar 30 2016, 7:34 PM
Dzahn removed a project: Patch-For-Review.

On the elastic hosts, the local NRPE command has been adjusted:

root@elastic1001:/etc/nagios/nrpe.d# cat check_disk_space.cfg 
# File generated by puppet. DO NOT edit by hand
command[check_disk_space]=/usr/lib/nagios/plugins/check_disk -w 18% -c 15% ....
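
Running the plugin by hand on the host shows exactly what Icinga will see; a sketch with the remaining flags copied from the base::monitoring::host default above (the generated line is truncated in the paste):

/usr/lib/nagios/plugins/check_disk -w 18% -c 15% -l -e -A -i "/srv/sd[a-b][1-3]" --exclude-type=tracefs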

On neon, this is generated in /etc/icinga/puppet_services.cfg:

define service {
# --PUPPET_NAME-- elastic1001 disk_space
    active_checks_enabled          1
    check_command                  nrpe_check!check_disk_space!10

The 10 is the timeout; check_disk_space is the NRPE command defined on the host itself, shown above.
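
The end-to-end path can also be exercised from neon itself; a sketch, assuming check_nrpe sits in the usual plugin directory and elastic1001 accepts NRPE connections from the Icinga host:

/usr/lib/nagios/plugins/check_nrpe -H elastic1001 -c check_disk_space -t 10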

< ebernhardson> mutante: thanks for the ping, but in general you don't have to worry about elasticsearch diskspace too much, the clustering will handle shuffling data from full machine to less full machines, the thing is the alert and the moment it starts auto-fixing are both 85%...

< ebernhardson> there was T136702 to change it, but never made a decision...


In the context above, I found this ticket again, where I had already set it to 15% specifically.

Re-opening. Should we simply change the numbers here?

It's -w 18% -c 15% now; what should we set it to instead to fix "the alert and the moment it starts auto-fixing are both 85%"?
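
Before picking new NRPE thresholds it would help to confirm which watermarks the cluster actually uses; a sketch, assuming the API on localhost:9200 (include_defaults needs Elasticsearch 5.0+; the stock defaults are low=85% and high=90% used unless overridden):

curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true' | grep watermark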

Change 321913 had a related patch set uploaded (by Dzahn):
icinga/cirrus: lower disk space crit threshold to 12%

https://gerrit.wikimedia.org/r/321913

Change 321913 merged by Dzahn:
icinga/cirrus: lower disk space crit threshold to 12%

https://gerrit.wikimedia.org/r/321913

Dzahn renamed this task from Icinga should alert on free disk space < 15% on Elasticsearch hosts to Icinga should alert on free disk space < 15% (now < 12%) on Elasticsearch hosts. Nov 16 2016, 7:08 PM
Dzahn closed this task as Resolved.

Lowered the alert threshold from 15% to 12% to address "the alert and the moment it starts auto-fixing are both 85%", and re-closing.

fgiunchedi subscribed.

Reopening, as I think this is happening again: a low disk space alert fired and auto-resolved 10 minutes later.

12:40 -icinga-wm:#wikimedia-operations- PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 18812 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
12:50 -icinga-wm:#wikimedia-operations- RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops

Screenshot attached: 2019-08-27-145807_1188x646_scrot.png (1188×646 px, 68 KB)

Should we raise it to 20%, or what would be a reasonable number?

Still happening from time to time (e.g. in September):

#wikimedia-operations_2019-09.log:2019-09-07T22:29:13 -icinga-wm:#wikimedia-operations- PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 23774 MB (4% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
#wikimedia-operations_2019-09.log:2019-09-07T22:46:35 -icinga-wm:#wikimedia-operations- RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
#wikimedia-operations_2019-09.log:2019-09-24T20:25:28 -icinga-wm:#wikimedia-operations- PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 27490 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
#wikimedia-operations_2019-09.log:2019-09-24T21:09:26 -icinga-wm:#wikimedia-operations- RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
#wikimedia-operations_2019-09.log:2019-09-25T11:54:45 -icinga-wm:#wikimedia-operations- PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 28154 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
#wikimedia-operations_2019-09.log:2019-09-25T12:13:51 -icinga-wm:#wikimedia-operations- RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
#wikimedia-operations_2019-09.log:2019-09-26T12:15:26 -icinga-wm:#wikimedia-operations- PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 27452 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
#wikimedia-operations_2019-09.log:2019-09-26T12:33:02 -icinga-wm:#wikimedia-operations- RECOVERY - Disk space on elastic1025 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops
#wikimedia-operations_2019-09.log:2019-09-26T13:11:30 -icinga-wm:#wikimedia-operations- PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 27115 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1025&var-datasource=eqiad+prometheus/ops

From the paste above, it looks like this was already set to the default alerting threshold of 5%.

We really need to figure out what to do with the elastic1025 alert; I think it has been firing even more aggressively lately.

All we can really do is wait for the servers that are replacing these. They were racked up just this week and should hopefully be added to the cluster soon: T230746

We can just set a really long downtime on them if we don't want to see the alerts until the servers are decommissioned.
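
A long downtime can be pushed straight into the Icinga external command file; a rough sketch only, where the command-file path, author and 90-day window are placeholders:

# on the Icinga host: silence all service alerts on elastic1025 for ~90 days
now=$(date +%s); end=$((now + 90*24*3600))
printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;elastic1025;%s;%s;1;0;%s;ops;waiting for T230746 replacements\n' \
  "$now" "$now" "$end" "$((90*24*3600))" > /var/lib/icinga/rw/icinga.cmd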

These servers (elastic1017-31) no longer have any data on them and are being decommissioned. The Elasticsearch servers are now all under 50% disk used.