
labtestcontrol2001 should not make Icinga page us
Closed, Resolved · Public

Description

we got paged because "nova-conductor" process on "labtestcontrol2001" isn't running.

i don't think a host with "test" in its name should send SMS ever, kind of by definition it can't be critical

let's disable that

Details

Related Gerrit Patches:
operations/puppet : production: disable paging for labtest[neutron|services]2001
operations/puppet : production: disable paging for labtestcontrol2001 in hiera
operations/puppet : production: labtest: don't send SMS for test machines

Event Timeline

Dzahn created this task. Dec 2 2015, 1:12 AM
Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description.
Dzahn added projects: Operations, Icinga, Cloud-VPS.
Dzahn added a subscriber: Dzahn.
Restricted Application added a project: Cloud-Services. Dec 2 2015, 1:12 AM
Restricted Application added subscribers: StudiesWorld, Aklapper.
Dzahn updated the task description. Dec 2 2015, 1:14 AM
Dzahn set Security to None.
chasemp triaged this task as Medium priority. Dec 3 2015, 2:23 PM
chasemp added a subscriber: chasemp.

all silenced for now but yes agreed.

Change 259073 had a related patch set uploaded (by Dzahn):
labtest: don't send SMS for test machines

https://gerrit.wikimedia.org/r/259073

Change 259073 abandoned by Dzahn:
labtest: don't send SMS for test machines

Reason:
dropping in favor of a more global solution at https://gerrit.wikimedia.org/r/#/c/259319/1

https://gerrit.wikimedia.org/r/259073

Dzahn claimed this task. Dec 17 2015, 1:38 AM

Change 260527 had a related patch set uploaded (by Dzahn):
disable paging for labtestcontrol2001 in hiera

https://gerrit.wikimedia.org/r/260527

Change 260527 merged by Dzahn:
disable paging for labtestcontrol2001 in hiera

https://gerrit.wikimedia.org/r/260527

Change 260529 had a related patch set uploaded (by Dzahn):
disable paging for labtest[neutron|services]2001

https://gerrit.wikimedia.org/r/260529

Change 260529 merged by Dzahn:
disable paging for labtest[neutron|services]2001

https://gerrit.wikimedia.org/r/260529

Dzahn added a comment. Dec 22 2015, 2:57 AM

merged the hiera change above. ran puppet on neon, it would still add it, used puppetstoredconfigclean.rb to remove stored config. ran puppet again on neon, it removed all monitoring checks for the host. puppet on labtestcontrol2001 is disabled. wasn't sure if i can just enable it

Dzahn added a subscriber: Andrew. Dec 22 2015, 3:01 AM

@chasemp @Andrew can i enable puppet on labtestcontrol2001 (even if just for a while) or does that mess with testing?

@Dzahn sure man go ahead thanks

Dzahn closed this task as Resolved. Dec 22 2015, 3:33 PM

after re-enabling puppet on labtestcontrol2001 and running it on neon, it now works as intended.

the same check for the nova-conductor process has the "sms" contact group on labcontrol2001 but not on labtestcontrol2001; the difference is the hiera setting. resolved

Volans reopened this task as Open. Nov 25 2016, 12:17 PM
Volans added a subscriber: Volans.

The labtestcontrol2001 host and many of its services are in scheduled downtime on Icinga. If this was fixed, my understanding is that the downtime should be removed. Is that correct?

The host and some of its services are similarly in scheduled downtime on:

  • labtestnet2001: all ok
  • labtestmetal2001: all ok
  • labtestvirt2001: puppet in warning (not running), all other checks ok
  • labtestservices2001: all ok, also notifications are disabled on the host and some services
  • labtestneutron2001: all ok

If they are properly set up not to page and their current status is OK, I don't see why they should stay in scheduled downtime.

Dzahn added a comment. Nov 28 2016, 9:46 PM

Yes, it should be fine to re-enable these. The whole ticket was about not sending SMS, that is, about the special "sms" contact group not being added. Re-activating them should just re-enable the non-SMS notifications, so email and IRC.

Dzahn added a comment. Edited Nov 28 2016, 10:08 PM

How to check directly on einsteinium, in the actually generated results, which checks are paging via the "sms" contact group:

[einsteinium:/etc/icinga] $ grep -B5 "sms,admins" puppet_services.cfg | grep check_command | sort | cut -d\! -f1,2 | uniq
	check_command                  check_ircd
	check_command                  check_ldap!dc=corp,dc=wikimedia,dc=org
	check_command                  nrpe_check!check_check_nova_conductor_process
	check_command                  nrpe_check!check_check_nova_network_process
	check_command                  nrpe_check!check_hadoop-hdfs-journalnode
	check_command                  nrpe_check!check_hadoop-hdfs-namenode
	check_command                  nrpe_check!check_hadoop-yarn-resourcemanager
	check_command                  nrpe_check!check_kafka
	check_command                  nrpe_check!check_mariadb_disk_space
	check_command                  nrpe_check!check_mariadb_slave_io_state_es1
...
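
The grep pipeline above relies on contact_groups appearing within five lines of check_command. As a rough sketch of the same check that reads whole service blocks instead, assuming the generated file uses standard Icinga `define service { ... }` object syntax: the sample config, the `!10` argument, `examplehost1001` and the function names below are made up for illustration and are not taken from einsteinium.

```python
import re

# Hypothetical sample in Icinga object syntax; not real data from einsteinium.
SAMPLE_CFG = """\
define service {
    check_command                  nrpe_check!check_check_nova_conductor_process!10
    contact_groups                 admins,sms,admins
    host_name                      labcontrol1001
}

define service {
    check_command                  nrpe_check!check_check_nova_conductor_process!10
    contact_groups                 admins,admins
    host_name                      labtestcontrol2001
}

define service {
    check_command                  check_ircd
    contact_groups                 admins,sms,admins
    host_name                      examplehost1001
}
"""

# Assumes no directive value contains a closing brace.
BLOCK_RE = re.compile(r"define\s+service\s*\{(.*?)\}", re.DOTALL)


def parse_services(text):
    """Return one dict of directive -> value per 'define service' block."""
    services = []
    for body in BLOCK_RE.findall(text):
        directives = {}
        for line in body.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)  # directive name, then its value
            if len(parts) == 2:
                directives[parts[0]] = parts[1].strip()
        services.append(directives)
    return services


def paging_check_commands(services, group="sms"):
    """Unique check commands of services whose contact_groups include `group`."""
    commands = set()
    for svc in services:
        groups = [g.strip() for g in svc.get("contact_groups", "").split(",")]
        if group in groups:
            # Keep the command and its first argument, like `cut -d! -f1,2`.
            commands.add("!".join(svc.get("check_command", "").split("!")[:2]))
    return sorted(commands)


for command in paging_check_commands(parse_services(SAMPLE_CFG)):
    print(command)
```

Matching "sms" as a whole list element, rather than the substring "sms,admins", means the result does not depend on the order in which the groups were written.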

proof that the same service "nova_conductor" gets different contact_groups depending on whether it's on a "test" host or not:

[einsteinium:/etc/icinga] $ grep -A5 check_nova_conductor puppet_services.cfg  | egrep '(contact_groups)|(host_name)'

	contact_groups                 admins,sms,admins
	host_name                      labcontrol1001

	contact_groups                 admins,admins
	host_name                      labtestcontrol2001
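
The comparison above can also be sketched as a per-host lookup. This is only an illustration under the assumption that each service sits in its own `define service { ... }` block; the inline sample repeats the two results shown above, and `contact_groups_by_host` is a hypothetical helper name.

```python
# Sample mirroring the two grep results above, wrapped in assumed block syntax.
SAMPLE = """\
define service {
    check_command                  nrpe_check!check_check_nova_conductor_process
    contact_groups                 admins,sms,admins
    host_name                      labcontrol1001
}
define service {
    check_command                  nrpe_check!check_check_nova_conductor_process
    contact_groups                 admins,admins
    host_name                      labtestcontrol2001
}
"""


def contact_groups_by_host(cfg_text, check_name):
    """Map host_name -> contact groups for services whose command mentions check_name."""
    result = {}
    current = None  # directives of the block being read, None outside a block
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("define service"):
            current = {}
        elif line.startswith("}") and current is not None:
            if check_name in current.get("check_command", ""):
                groups = current.get("contact_groups", "").split(",")
                host = current.get("host_name", "?")
                result[host] = [g.strip() for g in groups if g.strip()]
            current = None
        elif current is not None and line:
            parts = line.split(None, 1)  # directive name, then its value
            if len(parts) == 2:
                current[parts[0]] = parts[1].strip()
    return result


groups = contact_groups_by_host(SAMPLE, "check_nova_conductor")
# True where the host's check would page via the "sms" contact group.
pages = {host: "sms" in names for host, names in groups.items()}
print(pages)
```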

Mentioned in SAL (#wikimedia-operations) [2016-11-29T00:20:25Z] <mutante> re-enabled icinga notifications for labtest* services (first double checked they are _not_ paging anymore) (T120047)

Dzahn closed this task as Resolved. Nov 29 2016, 12:22 AM

@Volans This should resolve it. I enabled the notifications again, both for the services on these hosts and for the hosts themselves.

@chasemp we might see labtest* notifications again in email and IRC but not via SMS