
errors decom'ing VMs (was: pending diff in sre.dns.netbox cookbook)
Closed, ResolvedPublic

Description

I was running the decom cookbook and noticed a pending diff for a DNS update. I'm not sure I can proceed in this state, since I don't own these changes and they might have side effects:

diff --git a/0.192.10.in-addr.arpa b/0.192.10.in-addr.arpa
index 4151788..e9fdefa 100644
--- a/0.192.10.in-addr.arpa
+++ b/0.192.10.in-addr.arpa
@@ -219,3 +219,4 @@
 219 1H IN PTR aqs2003-b.codfw.wmnet.
 220 1H IN PTR aqs2004-a.codfw.wmnet.
 221 1H IN PTR aqs2004-b.codfw.wmnet.
+222 1H IN PTR ml-cache2001-a.codfw.wmnet.
diff --git a/16.192.10.in-addr.arpa b/16.192.10.in-addr.arpa
index a3889b8..bac3cf0 100644
--- a/16.192.10.in-addr.arpa
+++ b/16.192.10.in-addr.arpa
@@ -187,6 +187,7 @@
 187 1H IN PTR aqs2007-b.codfw.wmnet.
 188 1H IN PTR aqs2008-a.codfw.wmnet.
 189 1H IN PTR aqs2008-b.codfw.wmnet.
+190 1H IN PTR ml-cache2002-a.codfw.wmnet.
 191 1H IN PTR elastic2028.codfw.wmnet.
 192 1H IN PTR elastic2029.codfw.wmnet.
 193 1H IN PTR elastic2030.codfw.wmnet.
diff --git a/32.192.10.in-addr.arpa b/32.192.10.in-addr.arpa
index 86bef75..01ae7b6 100644
--- a/32.192.10.in-addr.arpa
+++ b/32.192.10.in-addr.arpa
@@ -69,6 +69,7 @@
 69  1H IN PTR elastic2066.codfw.wmnet.
 70  1H IN PTR elastic2071.codfw.wmnet.
 71  1H IN PTR restbase2025.codfw.wmnet.
+72  1H IN PTR ml-cache2003-a.codfw.wmnet.
 73  1H IN PTR restbase2025-a.codfw.wmnet.
 74  1H IN PTR restbase2025-b.codfw.wmnet.
 75  1H IN PTR restbase2025-c.codfw.wmnet.
diff --git a/codfw.wmnet b/codfw.wmnet
index e27eb7c..a61ec5c 100644
--- a/codfw.wmnet
+++ b/codfw.wmnet
@@ -612,10 +612,13 @@ miscweb2002                              1H IN A 10.192.16.211
 miscweb2002                              1H IN AAAA 2620:0:860:102:10:192:16:211
 ml-cache2001                             1H IN A 10.192.0.208
 ml-cache2001                             1H IN AAAA 2620:0:860:101:10:192:0:208
+ml-cache2001-a                           1H IN A 10.192.0.222
 ml-cache2002                             1H IN A 10.192.16.144
 ml-cache2002                             1H IN AAAA 2620:0:860:102:10:192:16:144
+ml-cache2002-a                           1H IN A 10.192.16.190
 ml-cache2003                             1H IN A 10.192.32.90
 ml-cache2003                             1H IN AAAA 2620:0:860:103:10:192:32:90
+ml-cache2003-a                           1H IN A 10.192.32.72
 ml-etcd2001                              1H IN A 10.192.16.44
 ml-etcd2001                              1H IN AAAA 2620:0:860:102:10:192:16:44
 ml-etcd2002                              1H IN A 10.192.32.48
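Each new A record in the diff above is paired with a PTR line in a reverse zone file. As a minimal sketch of that relationship (this is not the code behind sre.dns.netbox; the helper name `ptr_record` and the assumption that every reverse zone is a /24 named after the first three octets are mine), the mapping can be reproduced with Python's standard `ipaddress` module:

```python
import ipaddress


def ptr_record(fqdn: str, ipv4: str, ttl: str = "1H") -> tuple[str, str]:
    """Return (reverse zone name, PTR line) for an IPv4 A record.

    Hypothetical helper: assumes /24 reverse zones, as the file names
    in the diff (e.g. 0.192.10.in-addr.arpa) suggest.
    """
    addr = ipaddress.IPv4Address(ipv4)
    # reverse_pointer yields e.g. "222.0.192.10.in-addr.arpa"
    labels = addr.reverse_pointer.split(".")
    zone = ".".join(labels[1:])          # drop the host octet
    line = f"{labels[0]} {ttl} IN PTR {fqdn}."
    return zone, line


if __name__ == "__main__":
    for name, ip in [
        ("ml-cache2001-a.codfw.wmnet", "10.192.0.222"),
        ("ml-cache2002-a.codfw.wmnet", "10.192.16.190"),
        ("ml-cache2003-a.codfw.wmnet", "10.192.32.72"),
    ]:
        print(ptr_record(name, ip))
```

Comparing the result with the pending diff is a quick way to confirm that the forward and reverse entries belong to the same change.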

Event Timeline

Volans claimed this task.
Volans triaged this task as Medium priority.
Volans subscribed.

This has been committed as 392e48a.

18:40 < volans> the decom cookbook is currently broken for VMs as I mentioned it in the SRE meeting earlier
18:41 < volans> it will be fixed tomorrow at this point, sorry for the trouble

18:59 < elukey> I indeed forgot to run the DNS cookbook, my bad, all is good to be added etc..
18:59 < volans> I'm running the cookbook now
19:00 < volans> was about to anyway (just run the test run to check if there were multiple diffs)
..

^ So the "pending diff" part should be gone, but wait for the unrelated cookbook issue to be fixed tomorrow regardless.

Dzahn renamed this task from pending diff in sre.dns.netbox cookbook to errors decom'ing VMs (was: pending diff in sre.dns.netbox cookbook). Jul 5 2022, 6:34 PM
Dzahn reopened this task as Open.

Since Arnold is out I tried to run the cookbook again on gitlab2001. It failed to remove the VM from ganeti:

Downtimed host on Icinga/Alertmanager
Found Ganeti VM
Shutting down VM gitlab2001.codfw.wmnet in cluster codfw
----- OUTPUT of 'gnt-instance shu...2001.codfw.wmnet' -----
Failure: prerequisites not met for this operation:
error type: wrong_input, error details:
Selection filter does not match any instances
================
PASS |                                                                                  |   0% (0/1) [00:00<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.21hosts/s]
100.0% (1/1) of nodes failed to execute command 'gnt-instance shu...2001.codfw.wmnet': ganeti2022.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'gnt-instance shu...2001.codfw.wmnet'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
**Failed to shutdown VM, manually run gnt-instance remove on the Ganeti master for the codfw cluster**: Cumin execution failed (exit_code=2)
----- OUTPUT of 'systemctl start ...dfw_sync.service' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.14hosts/s]
FAIL |                                                                                  |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl start ...dfw_sync.service'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Started forced sync of VMs in Ganeti cluster codfw to Netbox
Sleeping for 20s to avoid race conditions...
Host gitlab2001.codfw.wmnet already missing on Debmonitor
Removed from DebMonitor
----- OUTPUT of 'puppet node clea...2001.codfw.wmnet' -----
gitlab2001.codfw.wmnet
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node clea...2001.codfw.wmnet'.
----- OUTPUT of 'puppet node deac...2001.codfw.wmnet' -----
Submitted 'deactivate node' for gitlab2001.codfw.wmnet with UUID 8f96ec7b-ca44-4209-9313-7b2458d62b9e
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node deac...2001.codfw.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Removed from Puppet master and PuppetDB
Issuing Ganeti remove command, it can take up to 15 minutes...
Removing VM gitlab2001.codfw.wmnet in cluster codfw. This may take a few minutes.
----- OUTPUT of 'gnt-instance rem...2001.codfw.wmnet' -----
Failure: prerequisites not met for this operation:
error type: unknown_entity, error details:
Instance 'gitlab2001.codfw.wmnet' not known
================
PASS |                                                                                  |   0% (0/1) [00:01<?, ?hosts/s]
FAIL |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.46s/hosts]
100.0% (1/1) of nodes failed to execute command 'gnt-instance rem...2001.codfw.wmnet': ganeti2022.codfw.wmnet
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'gnt-instance rem...2001.codfw.wmnet'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
**Failed to remove VM, manually run gnt-instance remove on the Ganeti master for the codfw cluster**: Cumin execution failed (exit_code=2)
----- OUTPUT of 'systemctl start ...dfw_sync.service' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.28hosts/s]
FAIL |                                                                                  |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl start ...dfw_sync.service'.

So while the cookbook reports "Found Ganeti VM", it _also_ says "Selection filter does not match any instances".

Edit: this is because the VM's FQDN is gitlab2001.wikimedia.org, not gitlab2001.codfw.wmnet.
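A pre-flight check could catch this kind of FQDN mismatch before any command is sent to Ganeti. The following is only a sketch under assumptions: `find_instances` is a hypothetical helper, the instance list is illustrative, and nothing here reflects how the decommission cookbook actually looks up VMs.

```python
def find_instances(name: str, known_instances: list[str]) -> list[str]:
    """Return known instance FQDNs that share the short hostname of `name`.

    Hypothetical helper: `known_instances` stands in for whatever list of
    instance names the cluster reports.
    """
    short = name.split(".", 1)[0]
    return [fqdn for fqdn in known_instances if fqdn.split(".", 1)[0] == short]


if __name__ == "__main__":
    # Illustrative data only, not real cluster output.
    known = ["gitlab2001.wikimedia.org", "ml-cache2001.codfw.wmnet"]
    requested = "gitlab2001.codfw.wmnet"

    matches = find_instances(requested, known)
    if requested not in matches:
        print(f"{requested} is not a known instance; candidates: {matches}")
```

With the names from this task, the check would flag gitlab2001.codfw.wmnet and suggest gitlab2001.wikimedia.org instead.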

Dzahn changed the task status from Open to In Progress. Jul 5 2022, 6:38 PM
Dzahn claimed this task.

Using the correct .wikimedia.org name, the output is now:

Found Ganeti VM
Shutting down VM gitlab2001.wikimedia.org in cluster codfw
...
VM removed

After that there were unexpected diffs in DNS for codfw.mgmt entries, but I confirmed with Papaul that they were expected and merged them.


END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
Updated Phabricator task T307142
END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab2001.wikimedia.org