
Decommission cookbook failing to update DNS
Closed, Resolved, Public

Description

I ran the decommission cookbook for es1015 (T268810), but it fails at the DNS update step:

Generating the DNS records from Netbox data. It will take a couple of minutes.
2020-11-30 07:19:24,355 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox
2020-11-30 07:25:00,031 [INFO] Gathered 2192 devices from Netbox
2020-11-30 07:25:00,031 [INFO] Generating DNS records
2020-11-30 07:25:07,488 [INFO] Generated 12058 direct and reverse records (6029 each) in 25 direct zones and 168 reverse zones
2020-11-30 07:25:07,489 [INFO] Cloning /srv/netbox-exports/dns.git/ to /tmp/dns-c25pcHBldHM-6wpw1nde ...
2020-11-30 07:25:07,669 [INFO] Generating zonefile snippets to directory /tmp/dns-c25pcHBldHM-6wpw1nde
2020-11-30 07:25:08,340 [INFO] Committed changes: c8f31a4dd26fadd3c40127c331b5f84260ecc48f
2020-11-30 07:25:08,359 [INFO] Validating generated data
2020-11-30 07:25:08,359 [INFO] Commit details: {'insertions': 0, 'deletions': 2, 'lines': 2, 'files': 2}
commit c8f31a4dd26fadd3c40127c331b5f84260ecc48f
Author: generate-dns-snippets <noc@wikimedia.org>
Date:   Mon Nov 30 07:25:08 2020 +0000

    marostegui@cumin1001: es1015.eqiad.wmnet decommissioned, removing all IPs except the asset tag one

diff --git a/4.65.10.in-addr.arpa b/4.65.10.in-addr.arpa
index 1e6f271..ff06baf 100644
--- a/4.65.10.in-addr.arpa
+++ b/4.65.10.in-addr.arpa
@@ -21,7 +21,6 @@
 35  1H IN PTR wmf4703.mgmt.eqiad.wmnet.
 38  1H IN PTR es1013.mgmt.eqiad.wmnet.
 38  1H IN PTR wmf4706.mgmt.eqiad.wmnet.
-40  1H IN PTR es1015.mgmt.eqiad.wmnet.
 40  1H IN PTR wmf4708.mgmt.eqiad.wmnet.
 41  1H IN PTR wmf4709.mgmt.eqiad.wmnet.
 42  1H IN PTR es1017.mgmt.eqiad.wmnet.
diff --git a/mgmt.eqiad.wmnet b/mgmt.eqiad.wmnet
index 139b5b9..a55def5 100644
--- a/mgmt.eqiad.wmnet
+++ b/mgmt.eqiad.wmnet
@@ -348,7 +348,6 @@ elastic1065                              1H IN A 10.65.7.106
 elastic1066                              1H IN A 10.65.7.107
 elastic1067                              1H IN A 10.65.7.108
 es1013                                   1H IN A 10.65.4.38
-es1015                                   1H IN A 10.65.4.40
 es1017                                   1H IN A 10.65.4.42
 es1018                                   1H IN A 10.65.4.43
 es1019                                   1H IN A 10.65.4.44
METADATA: {"path": "/tmp/dns-c25pcHBldHM-6wpw1nde", "sha1": "c8f31a4dd26fadd3c40127c331b5f84260ecc48f", "insertions": 0, "deletions": 2, "lines": 2, "files": 2}
Have you checked that the diff is OK?
Type "done" to proceed
> done
2020-11-30 07:25:13,336 [INFO] Pushed with bitflags 256: 218dc28..c8f31a4
2020-11-30 07:25:13,403 [INFO] Temporary directory /tmp/dns-c25pcHBldHM-6wpw1nde removed.
Updating the Netbox passive copies of the repository on netbox2001.wikimedia.org
Updating the authdns copies of the repository on authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org
Deploying the updated zonefiles on authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org
Failed to run the sre.dns.netbox cookbook
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 351, in run
    dns_netbox_run(dns_netbox_args, spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/dns/netbox.py", line 132, in run
    git=AUTHDNS_DNS_CHECKOUT_PATH, netbox=AUTHDNS_NETBOX_CHECKOUT_PATH))
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 476, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 646, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
**Failed to run the sre.dns.netbox cookbook**: Cumin execution failed (exit_code=2)
ERROR: some step failed, check the task updates.
Updated Phabricator task T268810
END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)

After chatting with @elukey, it looks like none of the DNS servers get updated. I ran it again to see whether it was a one-time failure, but it failed at the same step.
All the steps before the DNS part completed fine.

Event Timeline

From /var/log/spicerack/sre/hosts/decommission.log on cumin1001:

2020-11-30 07:25:21,024 marostegui 6173 [INFO] Deploying the updated zonefiles on authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org
2020-11-30 07:25:21,027 marostegui 6173 [INFO] Executing commands [cumin.transports.Command('cd /srv/authdns/git && utils/deploy-check.py -g /srv/git/netbox_dns_snippets --deploy')] on '12' hosts: authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org
2020-11-30 07:25:50,835 marostegui 6173 [INFO] Completed command 'cd /srv/authdns/git && utils/deploy-check.py -g /srv/git/netbox_dns_snippets --deploy'
2020-11-30 07:25:50,854 marostegui 6173 [ERROR] 100.0% (12/12) of nodes failed to execute command 'cd /srv/authdns/...nippets --deploy': authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org
2020-11-30 07:25:50,854 marostegui 6173 [CRITICAL] 0.0% (0/12) success ratio (< 100.0% threshold) for command: 'cd /srv/authdns/...nippets --deploy'. Aborting.
2020-11-30 07:25:50,854 marostegui 6173 [CRITICAL] 0.0% (0/12) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
2020-11-30 07:25:50,854 marostegui 6173 [ERROR] Failed to run the sre.dns.netbox cookbook
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 351, in run
    dns_netbox_run(dns_netbox_args, spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/dns/netbox.py", line 132, in run
    git=AUTHDNS_DNS_CHECKOUT_PATH, netbox=AUTHDNS_NETBOX_CHECKOUT_PATH))
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 476, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 646, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
2020-11-30 07:25:50,856 marostegui 6173 [ERROR] **Failed to run the sre.dns.netbox cookbook**: Cumin execution failed (exit_code=2)

Ran the following from dns5001:

root@dns5001:/srv/authdns/git# utils/deploy-check.py
[..]
error: CNAME 'es2-master.eqiad.wmnet.' points to known same-zone NXDOMAIN 'es1015.eqiad.wmnet.'
fatal: Initial load of zone data failed

I think it would be really useful to have cumin's output in the debugging logs. Adding a note here to implement it, in case @Volans agrees :)
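To illustrate the kind of logging suggested here, the following is a minimal sketch that runs a command and copies its output into a log, line by line. It is not how spicerack or cumin work internally: it runs a local command with Python's standard `subprocess` module, and the names `run_and_log` and `deploy-debug` are hypothetical.

```python
import logging
import subprocess

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("deploy-debug")


def run_and_log(command):
    """Run a local command and log every line of its stdout and stderr.

    Output is logged at DEBUG when the command succeeds and at ERROR when it
    exits non-zero, so the reason for a failure ends up next to the traceback.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    level = logging.DEBUG if result.returncode == 0 else logging.ERROR
    for stream, text in (("stdout", result.stdout), ("stderr", result.stderr)):
        for line in text.splitlines():
            logger.log(level, "%s [%s] %s", command[0], stream, line)
    return result.returncode
```

With something along these lines, the `error: CNAME ...` line printed by the check would have appeared in the log file right next to the "Cumin execution failed" message.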

Change 644086 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update esX-master cnames

https://gerrit.wikimedia.org/r/644086

Change 644086 merged by Marostegui:
[operations/dns@master] wmnet: Update esX-master cnames

https://gerrit.wikimedia.org/r/644086

After deploying the above changeset to address what @elukey found, I ran the cookbook again. It failed for other reasons, but I think the DNS step has worked:

$ sudo cookbook sre.hosts.decommission es1015.eqiad.wmnet -t T268810
START - Cookbook sre.hosts.decommission
ATTENTION: the query does not match any host in PuppetDB or failed
Hostname expansion matches 1 hosts: es1015.eqiad.wmnet
Do you want to proceed anyway?
Type "done" to proceed
> done
ATTENTION: destructive action for 1 hosts: es1015.eqiad.wmnet
Are you sure to proceed?
Type "done" to proceed
> done
Looking for matches in puppetmaster1001.eqiad.wmnet:/var/lib/git/operations/puppet
conftool-data/node/eqiad.yaml:    kubernetes1015.eqiad.wmnet: ["recommendation-api"]
conftool-data/node/eqiad.yaml:    kubernetes1015.eqiad.wmnet: [kubesvc]
hieradata/role/eqiad/kubernetes/worker.yaml:- kubernetes1015.eqiad.wmnet
modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200:    fixed-address kubernetes1015.eqiad.wmnet;
modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200:    fixed-address es1015.eqiad.wmnet;
Looking for matches in puppetmaster1001.eqiad.wmnet:/srv/private
Looking for matches in deploy1001.eqiad.wmnet:/srv/mediawiki-staging
Found match(es) in the Puppet or mediawiki-config repositories (see above), proceed anyway?
Type "done" to proceed
> done
Looking for Kerberos credentials on KDC kadmin node.
No Kerberos credentials found.
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: ['es1015.eqiad.wmnet']
**Failed downtime host on Icinga (likely already removed)**
Management Password:
Host steps raised exception
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 339, in run
    _decommission_host(fqdn, spicerack, reason)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 99, in _decommission_host
    mgmt = spicerack.management().get_fqdn(fqdn)
  File "/usr/lib/python3/dist-packages/spicerack/management.py", line 46, in get_fqdn
    mgmt = self._internal_mgmt_fqdn(hostname)
  File "/usr/lib/python3/dist-packages/spicerack/management.py", line 86, in _internal_mgmt_fqdn
    raise ManagementError('Invalid management FQDN {mgmt} for {host}'.format(mgmt=mgmt, host=hostname))
spicerack.management.ManagementError: Invalid management FQDN es1015.mgmt.eqiad.wmnet for es1015.eqiad.wmnet
**Host steps raised exception**: Invalid management FQDN es1015.mgmt.eqiad.wmnet for es1015.eqiad.wmnet
Generating the DNS records from Netbox data. It will take a couple of minutes.
2020-11-30 07:41:45,403 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox
2020-11-30 07:45:05,310 [INFO] Gathered 2192 devices from Netbox
2020-11-30 07:45:05,310 [INFO] Generating DNS records
2020-11-30 07:45:13,875 [INFO] Generated 12058 direct and reverse records (6029 each) in 25 direct zones and 168 reverse zones
2020-11-30 07:45:13,876 [INFO] Cloning /srv/netbox-exports/dns.git/ to /tmp/dns-c25pcHBldHM-xn7s4vof ...
2020-11-30 07:45:14,063 [INFO] Generating zonefile snippets to directory /tmp/dns-c25pcHBldHM-xn7s4vof
2020-11-30 07:45:14,791 [INFO] Nothing to commit!
2020-11-30 07:45:15,080 [INFO] Temporary directory /tmp/dns-c25pcHBldHM-xn7s4vof removed.
METADATA: {"no_changes": true}
No changes to deploy.
ERROR: some step failed, check the task updates.
Updated Phabricator task T268810
END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)

I will leave it up to @elukey and @Volans to decide whether this task can already be closed as resolved. I have two more hosts to decommission in the next few days, so we can test with them if needed.
Thanks @elukey for troubleshooting this so fast!

Volans claimed this task.
Volans triaged this task as High priority.

Just for the record, what "fixed" the deploy was the manual authdns-update run after merging the change with the updated CNAME. The previous deploy failed because the CNAME pointed to a non-existent name, so gdnsd failed its pre-reload checks.
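The failure mode described above (a CNAME whose same-zone target no longer has any record) can be caught before a host's records are removed. The sketch below is a simplified illustration, not gdnsd's actual check: `find_dangling_cnames` is a hypothetical helper, it assumes one `name ttl IN type rdata` record per line, and it ignores `$` directives, wildcards and multi-line records. The sample zone uses placeholder addresses and an illustrative `example-alias` entry.

```python
def find_dangling_cnames(zone_text, origin):
    """Return (owner, target) pairs for CNAMEs pointing at a same-zone name
    that has no record of its own in zone_text."""
    origin = origin.rstrip(".").lower() + "."

    def absolute(name):
        # Expand a relative name against the zone origin.
        name = name.lower()
        if name == "@":
            return origin
        return name if name.endswith(".") else f"{name}.{origin}"

    owners = set()
    cnames = []
    for raw in zone_text.splitlines():
        fields = raw.split(";", 1)[0].split()  # drop comments, tokenize
        # Skip blanks, directives and lines that inherit the previous owner.
        if not fields or raw[:1].isspace() or fields[0].startswith("$"):
            continue
        if "IN" not in fields[1:]:
            continue
        idx = fields.index("IN", 1)
        if len(fields) < idx + 3:
            continue
        owner = absolute(fields[0])
        owners.add(owner)
        if fields[idx + 1].upper() == "CNAME":
            cnames.append((owner, absolute(fields[idx + 2])))

    return [
        (owner, target)
        for owner, target in cnames
        if target.endswith("." + origin) and target not in owners
    ]


# Illustrative data only: the addresses are documentation placeholders.
SAMPLE_ZONE = """\
es1013          1H IN A     192.0.2.13   ; placeholder address
es1017          1H IN A     192.0.2.17   ; placeholder address
example-alias   5M IN CNAME es1013.eqiad.wmnet.
es2-master      5M IN CNAME es1015.eqiad.wmnet.
"""

for owner, target in find_dangling_cnames(SAMPLE_ZONE, "eqiad.wmnet"):
    print(f"CNAME {owner} points to missing same-zone name {target}")
```

Running a check of this kind against the manually maintained zone before decommissioning would flag `es2-master` while the CNAME can still be updated first.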

In the general case of a transient failure on a subset of the authdns hosts, it's possible to force a re-deploy by following https://wikitech.wikimedia.org/wiki/DNS/Netbox#Force_update_generated_records

I'm marking this as resolved, as we're just about to re-enable cumin's output in spicerack in general (tracked in T212783).