While running the sre.dns.roll-restart-reboot-wikimedia-dns cookbook, the Icinga downtime could not be removed on some hosts (2 of 12). The command used was:
sudo cookbook sre.dns.roll-restart-reboot-wikimedia-dns --query 'A:wikidough' --reason 'gnutls update' --task-id 'T353057' --grace-sleep 120 restart_daemons
This happened on doh2001 and doh5002, but since it is probably related to a race condition in how Icinga checks and downtimes are handled, I presume it could have happened on any host. Some logs from the cookbook run:
...
----- OUTPUT of 'enable-puppet "r...n1001 - T353057"' -----
================
PASS |██████████████████████████████████████████████████████████| 100% (1/1) [00:05<00:00, 5.90s/hosts]
FAIL | | 0% (0/1) [00:05<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'enable-puppet "r...n1001 - T353057"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
[1/15, retrying in 3.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[2/15, retrying in 6.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[3/15, retrying in 9.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[4/15, retrying in 12.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[5/15, retrying in 15.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[6/15, retrying in 18.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[7/15, retrying in 21.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[8/15, retrying in 24.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[9/15, retrying in 27.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[10/15, retrying in 30.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[11/15, retrying in 33.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[12/15, retrying in 36.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[13/15, retrying in 39.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[14/15, retrying in 42.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
==> Failed to downtime hosts: Not all services are recovered: doh5002:Bird Internet Routing Daemon
Type "go" to proceed or "abort" to interrupt the execution
...
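For reference, the retry schedule visible in the log (15 attempts, with the delay growing by 3 seconds after each failure) can be sketched as below. This is only an illustration of the pattern inferred from the log output, not spicerack's actual implementation; the names `linear_backoff_delays` and `retry_linear` are hypothetical.

```python
import time
from typing import Callable, List


def linear_backoff_delays(tries: int = 15, step: float = 3.0) -> List[float]:
    """Delays slept between attempts: step, 2*step, ..., (tries-1)*step."""
    return [step * attempt for attempt in range(1, tries)]


def retry_linear(
    check: Callable[[], None],
    tries: int = 15,
    step: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call check() until it stops raising; re-raise after the last attempt."""
    for attempt in range(1, tries + 1):
        try:
            check()
            return
        except RuntimeError as exc:
            if attempt == tries:
                raise  # out of attempts, surface the last failure
            delay = step * attempt
            print(f"[{attempt}/{tries}, retrying in {delay:.2f}s] raised: {exc}")
            sleep(delay)


# Total time spent sleeping if every attempt fails (hypothetical helper above).
total_wait = sum(linear_backoff_delays())
print(f"total wait before giving up: {total_wait:.0f}s")
```

Under these assumptions the check is retried over at least 315 seconds of sleeping (a little over five minutes), and the Bird service on doh5002 was still not reported as recovered at the end of that window.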
Some example spicerack logs can be found on cumin1001:/home/fabfur/doh2001.log