
labtestcontrol2001 should not make Icinga page us
Closed, Resolved · Public


we got paged because the "nova-conductor" process on "labtestcontrol2001" isn't running.

i don't think a host with "test" in its name should ever send SMS; kind of by definition it can't be critical

let's disable that

Event Timeline

Dzahn set the priority of this task to Needs Triage.
Dzahn updated the task description.
Dzahn added projects: SRE, Icinga, Cloud-VPS.
Dzahn subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.
Dzahn set Security to None.
chasemp triaged this task as Medium priority. Dec 3 2015, 2:23 PM
chasemp subscribed.

all silenced for now but yes agreed.

Change 259073 had a related patch set uploaded (by Dzahn):
labtest: don't send SMS for test machines

Change 259073 abandoned by Dzahn:
labtest: don't send SMS for test machines

dropping in favor of a more global solution at

Change 260527 had a related patch set uploaded (by Dzahn):
disable paging for labtestcontrol2001 in hiera

Change 260527 merged by Dzahn:
disable paging for labtestcontrol2001 in hiera

Change 260529 had a related patch set uploaded (by Dzahn):
disable paging for labtest[neutron|services]2001

Change 260529 merged by Dzahn:
disable paging for labtest[neutron|services]2001

merged the hiera change above and ran puppet on neon, but it would still add it, so i used puppetstoredconfigclean.rb to remove the stored config. ran puppet again on neon, and it removed all monitoring checks for the host. puppet on labtestcontrol2001 is disabled, and i wasn't sure if i can just enable it

@chasemp @Andrew can i enable puppet on labtestcontrol2001 (even if just for a while) or does that mess with testing?

after re-enabling puppet on labtestcontrol2001 and running it on neon, it now works as intended.

the same check for the nova-conductor process has the "sms" contact group for labcontrol2001 but does not have it for labtestcontrol2001 and the difference is the hiera setting. resolved

Volans subscribed.

The labtestcontrol2001 host and many of its services are in scheduled downtime on Icinga. If this was fixed, my understanding is that the downtime should be removed. Is that correct?

The host and some of its services are similarly in scheduled downtime for:

  • labtestnet2001: all ok
  • labtestmetal2001: all ok
  • labtestvirt2001: puppet in warning (not running), all other checks ok
  • labtestservices2001: all ok, also notifications are disabled on the host and some services
  • labtestneutron2001: all ok

If they are properly set up to not page and the current status is OK, I don't see why they should stay in scheduled downtime.

Yes, it should be fine to re-enable these. The whole ticket was about not sending SMS, that is, about the special "sms" contact group not being added. Re-activating them should just re-enable non-SMS notifications, so email and IRC.

How to check directly on einsteinium, in the actually generated config, which checks are paging via the "sms" contact group:

[einsteinium:/etc/icinga] $ grep -B5 "sms,admins" puppet_services.cfg | grep check_command | sort | cut -d\! -f1,2 | uniq
	check_command                  check_ircd
	check_command                  check_ldap!dc=corp,dc=wikimedia,dc=org
	check_command                  nrpe_check!check_check_nova_conductor_process
	check_command                  nrpe_check!check_check_nova_network_process
	check_command                  nrpe_check!check_hadoop-hdfs-journalnode
	check_command                  nrpe_check!check_hadoop-hdfs-namenode
	check_command                  nrpe_check!check_hadoop-yarn-resourcemanager
	check_command                  nrpe_check!check_kafka
	check_command                  nrpe_check!check_mariadb_disk_space
	check_command                  nrpe_check!check_mariadb_slave_io_state_es1
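The pipeline above lists paging check commands but not the hosts they run on, and it relies on contact_groups sitting within five lines of check_command. A minimal sketch of a block-aware variant follows. It assumes each service is a "define service { ... }" block with one directive per line and a closing "}" on its own line, as in the excerpts in this task. The function name sms_paging_services is hypothetical, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: print "host_name<TAB>check_command" for every service
# definition whose contact_groups contain the entry "sms".
# Assumes values contain no spaces (only the first word of each value is read).
sms_paging_services() {
    awk '
        # A new object starts; only track it if it is a service definition.
        $1 == "define" { in_svc = ($2 == "service" || $2 == "service{"); host = cmd = groups = "" }
        in_svc && $1 == "host_name"      { host = $2 }
        in_svc && $1 == "check_command"  { cmd = $2 }
        in_svc && $1 == "contact_groups" { groups = $2 }
        # Object ends; match "sms" only as a whole comma-separated entry.
        $1 == "}" {
            if (in_svc && index("," groups ",", ",sms,") > 0) print host "\t" cmd
            in_svc = 0
        }
    ' "$1"
}
```

It would be called with the path to the generated file, for example `sms_paging_services /etc/icinga/puppet_services.cfg`.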

proof that the same service "nova_conductor" gets different contact_groups depending on whether it's on a "test" host or not:

[einsteinium:/etc/icinga] $ grep -A5 check_nova_conductor puppet_services.cfg  | egrep '(contact_groups)|(host_name)'

	contact_groups                 admins,sms,admins
	host_name                      labcontrol1001

	contact_groups                 admins,admins
	host_name                      labtestcontrol2001
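The same comparison can be printed as one line per host, which is easier to scan when more hosts run the check. This is only a sketch under the same assumptions about the file layout (one directive per line inside "define service { ... }" blocks). The function name contact_groups_by_host is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "host_name<TAB>contact_groups" for every service
# whose check_command contains the given substring.
# Usage: contact_groups_by_host SUBSTRING CONFIG_FILE
contact_groups_by_host() {
    awk -v needle="$1" '
        # A new object starts; only track it if it is a service definition.
        $1 == "define" { in_svc = ($2 == "service" || $2 == "service{"); host = cmd = groups = "" }
        in_svc && $1 == "host_name"      { host = $2 }
        in_svc && $1 == "check_command"  { cmd = $2 }
        in_svc && $1 == "contact_groups" { groups = $2 }
        # Object ends; report it if the check command matches.
        $1 == "}" {
            if (in_svc && index(cmd, needle) > 0) print host "\t" groups
            in_svc = 0
        }
    ' "$2"
}
```

For example, `contact_groups_by_host check_nova_conductor puppet_services.cfg` run from /etc/icinga would be expected to show the "sms" group for the non-test host only, matching the grep output above.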

Mentioned in SAL (#wikimedia-operations) [2016-11-29T00:20:25Z] <mutante> re-enabled icinga notifications for labtest* services (first double checked they are _not_ paging anymore) (T120047)
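The "double check they are not paging" step before re-enabling notifications could be scripted roughly as follows. This is a sketch, not the check that was actually run. It assumes the generated services file uses one directive per line inside "define service { ... }" blocks with a single host per host_name. The function name assert_not_paging is hypothetical.

```shell
#!/bin/sh
# Hypothetical guard: succeed only if no service on a host whose name starts
# with PREFIX has "sms" in its contact_groups; otherwise list the offenders.
# Usage: assert_not_paging PREFIX CONFIG_FILE
assert_not_paging() {
    awk -v prefix="$1" '
        # A new object starts; only track it if it is a service definition.
        $1 == "define" { in_svc = ($2 == "service" || $2 == "service{"); host = cmd = groups = "" }
        in_svc && $1 == "host_name"      { host = $2 }
        in_svc && $1 == "check_command"  { cmd = $2 }
        in_svc && $1 == "contact_groups" { groups = $2 }
        # Object ends; flag it if the host matches the prefix and "sms" is a group.
        $1 == "}" {
            if (in_svc && index(host, prefix) == 1 && index("," groups ",", ",sms,") > 0) {
                print host "\t" cmd
                bad = 1
            }
            in_svc = 0
        }
        END { exit (bad ? 1 : 0) }
    ' "$2"
}
```

A possible use is `assert_not_paging labtest /etc/icinga/puppet_services.cfg && echo "safe to re-enable"`, so notifications are only re-enabled when the exit status is zero.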

@Volans This should resolve it. i enabled the notifications again, for the services on these hosts and the hosts themselves.

@chasemp we might see labtest* notifications again in email and IRC but not via SMS