we got paged because "nova-conductor" process on "labtestcontrol2001" isn't running.
i don't think a host with "test" in its name should send SMS ever, kind of by definition it can't be critical
let's disable that
Change 259073 had a related patch set uploaded (by Dzahn):
labtest: don't send SMS for test machines
Change 259073 abandoned by Dzahn:
labtest: don't send SMS for test machines
Reason:
dropping in favor of a more global solution at https://gerrit.wikimedia.org/r/#/c/259319/1
Change 260527 had a related patch set uploaded (by Dzahn):
disable paging for labtestcontrol2001 in hiera
Change 260529 had a related patch set uploaded (by Dzahn):
disable paging for labtest[neutron|services]2001
merged the hiera change above. ran puppet on neon, it would still add it, used puppetstoredconfigclean.rb to remove stored config. ran puppet again on neon, it removed all monitoring checks for the host. puppet on labtestcontrol2001 is disabled. wasn't sure if i can just enable it
after re-enabling puppet on labtestcontrol2001 and running it on neon, it now works as intended.
the same check for the nova-conductor process has the "sms" contact group for labcontrol2001 but does not have it for labtestcontrol2001 and the difference is the hiera setting. resolved
The labtestcontrol2001 host and many of its services are in scheduled downtime on Icinga. If this was fixed, my understanding is that the downtime should be removed. Is that correct?
Similar situation of scheduled downtime for the host and some of the services also for:
If they are properly set up to not page and the current status is OK, I don't see why they should stay in scheduled downtime.
Yes, it should be fine to re-enable these. The whole ticket was about not sending SMS, that is, about the special "sms" contact group not being added. Re-activating them should just re-enable non-SMS notifications, so email and IRC.
How to check directly on einsteinium, in the actually generated config, which checks are paging via the "sms" contact group:
[einsteinium:/etc/icinga] $ grep -B5 "sms,admins" puppet_services.cfg | grep check_command | sort | cut -d\! -f1,2 | uniq
check_command check_ircd
check_command check_ldap!dc=corp,dc=wikimedia,dc=org
check_command nrpe_check!check_check_nova_conductor_process
check_command nrpe_check!check_check_nova_network_process
check_command nrpe_check!check_hadoop-hdfs-journalnode
check_command nrpe_check!check_hadoop-hdfs-namenode
check_command nrpe_check!check_hadoop-yarn-resourcemanager
check_command nrpe_check!check_kafka
check_command nrpe_check!check_mariadb_disk_space
check_command nrpe_check!check_mariadb_slave_io_state_es1
...
proof that the same service "nova_conductor" gets different contact_groups whether it's on a "test" host or not:
[einsteinium:/etc/icinga] $ grep -A5 check_nova_conductor puppet_services.cfg | egrep '(contact_groups)|(host_name)'
contact_groups admins,sms,admins
host_name labcontrol1001
contact_groups admins,admins
host_name labtestcontrol2001
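The grep -A5/-B5 approach above depends on how many lines apart the directives happen to be. A rough sketch that instead reads each "define service" block as a unit and prints host plus check for everything that pages. This assumes puppet_services.cfg uses the usual Icinga object syntax with one directive per line and a closing "}" on its own line; the function name sms_services is made up for this example, and check_command arguments containing spaces would be cut off at the first space.

```shell
# Print "host_name<TAB>check_command" for every service whose
# contact_groups list contains the "sms" group.
# Assumption: standard Icinga/Nagios object blocks, one directive per line.
sms_services() {
  awk '
    # Start of a service block: "define service {" or "define service{"
    $1 == "define" && ($2 == "service" || $2 == "service{") {
      host = ""; cmd = ""; groups = ""; in_block = 1; next
    }
    in_block && $1 == "host_name"      { host = $2 }
    in_block && $1 == "check_command"  { cmd = $2 }
    in_block && $1 == "contact_groups" { groups = $2 }
    # End of block: report it if any contact group is exactly "sms"
    in_block && $1 == "}" {
      n = split(groups, g, ",")
      for (i = 1; i <= n; i++) {
        if (g[i] == "sms") { print host "\t" cmd; break }
      }
      in_block = 0
    }
  ' "$1"
}
```

Usage would be something like `sms_services /etc/icinga/puppet_services.cfg | grep labtest`, which should print nothing if no labtest* service is paging.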
Mentioned in SAL (#wikimedia-operations) [2016-11-29T00:20:25Z] <mutante> re-enabled icinga notifications for labtest* services (first double checked they are _not_ paging anymore) (T120047)
@Volans This should resolve it. I enabled the notifications again, for the services on these hosts and for the hosts themselves.