sre.dns.roll-restart-reboot-wikimedia-dns cookbook sometimes cannot remove downtime
Closed, Resolved, Public

Description

During the execution of the sre.dns.roll-restart-reboot-wikimedia-dns cookbook (sudo cookbook sre.dns.roll-restart-reboot-wikimedia-dns --query 'A:wikidough' --reason 'gnutls update' --task-id 'T353057' --grace-sleep 120 restart_daemons), the Icinga downtime could not be removed on some hosts (2 of 12).

This happened for doh2001 and doh5002, but since it is probably related to a race condition in how Icinga checks and downtimes are performed, I presume it could have happened to any host. Some logs from the cookbook run:

...
----- OUTPUT of 'enable-puppet "r...n1001 - T353057"' -----
================
PASS |██████████████████████████████████████████████████████████| 100% (1/1) [00:05<00:00,  5.90s/hosts]
FAIL |                                                                  |   0% (0/1) [00:05<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'enable-puppet "r...n1001 - T353057"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
[1/15, retrying in 3.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[2/15, retrying in 6.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[3/15, retrying in 9.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[4/15, retrying in 12.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[5/15, retrying in 15.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[6/15, retrying in 18.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[7/15, retrying in 21.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[8/15, retrying in 24.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[9/15, retrying in 27.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[10/15, retrying in 30.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[11/15, retrying in 33.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[12/15, retrying in 36.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[13/15, retrying in 39.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
[14/15, retrying in 42.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: doh5002:Bird Internet Routing Daemon
==> Failed to downtime hosts: Not all services are recovered: doh5002:Bird Internet Routing Daemon
Type "go" to proceed or "abort" to interrupt the execution
...
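As a rough illustration of how long the cookbook waited before giving up, the retry delays in the log above grow by 3 seconds per attempt (3.00s, 6.00s, ... 42.00s). The sketch below only reproduces that arithmetic. The helper name linear_backoff_delays and the "delay = 3s x attempt number" rule are assumptions inferred from the log lines, not taken from the spicerack source.

```python
def linear_backoff_delays(tries: int, base_delay: float) -> list[float]:
    """Return the sleep durations between attempts.

    Assumption (inferred from the log, not from spicerack): the delay before
    retry N is base_delay * N, and there is no sleep after the final attempt.
    """
    return [base_delay * attempt for attempt in range(1, tries)]


# 15 attempts with a 3s base delay, matching the "[N/15, retrying in ...]" lines.
delays = linear_backoff_delays(tries=15, base_delay=3.0)
total = sum(delays)
print(f"{len(delays)} sleeps, {total:.0f}s total")
```

Under these assumptions the check slept 14 times for a total of 315 seconds (a bit over 5 minutes, not counting the time taken by the checks themselves) before reporting the failure.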

Some example spicerack logs can be found on cumin1001:/home/fabfur/doh2001.log

Event Timeline

The cookbook restarts bird in post_action(), which is called after action(), but the check for Icinga being optimal is part of action() itself. The check therefore runs while bird is still down.
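The ordering problem can be sketched with a toy model. Everything here is hypothetical: ToyRunner is not a spicerack or cookbook class, and the assumption that bird is stopped in pre_action() is only for illustration. The only parts taken from this task are that bird is started again in post_action() and that the Icinga check happens inside action().

```python
class ToyRunner:
    """Hypothetical stand-in for the cookbook runner; not real cookbook code."""

    def __init__(self) -> None:
        self.bird_running = True
        self.log: list[str] = []

    def pre_action(self) -> None:
        # Assumption for illustration: bird is stopped before the restart.
        self.bird_running = False
        self.log.append("pre_action: bird stopped")

    def action(self) -> bool:
        self.log.append("action: daemons restarted")
        # The Icinga "optimal" check runs here, before post_action().
        status = "optimal" if self.bird_running else "NOT optimal (bird down)"
        self.log.append(f"action: icinga check -> {status}")
        return self.bird_running

    def post_action(self) -> None:
        # bird only comes back here, after the check has already run.
        self.bird_running = True
        self.log.append("post_action: bird started")

    def run(self) -> bool:
        self.pre_action()
        ok = self.action()
        self.post_action()
        return ok


runner = ToyRunner()
check_passed = runner.run()
print("\n".join(runner.log))
print("check passed:", check_passed)
```

In this model the check always fails even though bird is running again by the time run() returns, which matches the "Not all services are recovered" retries seen in the log.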

Does puppet start bird if it's stopped?

The cookbook sets disable_puppet_on_restart=True and disable_puppet_on_reboot=True. As a result, it forces a puppet run on reboot but does not force one when restarting daemons.

Depending on who should start bird (the cookbook or puppet), we can adjust the code accordingly.

BCornwall changed the task status from Open to In Progress. Dec 20 2023, 5:24 PM
BCornwall claimed this task.
BCornwall triaged this task as Medium priority.
BCornwall moved this task from Backlog to Traffic team actively servicing on the Traffic board.

Some time ago we discussed stopping/starting bird.service via systemd dependencies - T336792 is related.

@ssingh Do you recall why we didn't implement BindsTo=/After= for bird.service? There'd be no reason to have this cookbook at all if that were implemented IMO since we could create a target and then run 'cumin doh* "systemctl restart wdns.target"' or similar. But I think we've been back and forth about this before too :)

BCornwall changed the task status from In Progress to Open. Dec 20 2023, 10:18 PM

@BCornwall: Can you share where you see bird.service not having BindsTo/After on doh*?

systemctl list-dependencies bird.service 
bird.service
● ├─anycast-healthchecker.service

anycast-hc then depends on dnsdist.service and pdns-recursor.service, the two components that make up Wikimedia DNS.

systemctl list-dependencies anycast-healthchecker.service 
anycast-healthchecker.service
● ├─dnsdist.service
● ├─pdns-recursor.service

@Volans: Puppet starts bird if it is stopped so I think we can just do that here.

Change 991637 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/cookbooks@master] dns: Don't disable puppet/bird on restarting wdns

https://gerrit.wikimedia.org/r/991637

@ssingh I think we shouldn't even bother with pre_action, post_action, and the disable_puppet_on_* options at all. We already have the systemd ordering, so restarts/reboots should be handled gracefully by systemd. Anything else is just complicating things.

BCornwall changed the task status from Open to In Progress. Jan 18 2024, 7:33 PM

Change 991637 merged by Ssingh:

[operations/cookbooks@master] dns: Don't disable puppet on restarting wdns

https://gerrit.wikimedia.org/r/991637