
cloudcephosd1xxxx.private.eqiad.wikimedia.cloud
Closed, Resolved, Public

Description

I've re-imaged quite a lot of cloudcephosd hosts over the last few days: cloudcephosd1016 through 1024, not including (for some reason) 1021.

After the reimages, a diff has appeared in netbox for several private addresses:

2025-06-14 22:05:28,412 [INFO] Commit details: {'insertions': 0, 'deletions': 16, 'lines': 16, 'files': 5}                                                   
commit 1058996d79cf215e13b2f41748357c9be3c3f0af                                                                                                              
Author: generate-dns-snippets <noc@wikimedia.org>                                                                                                            
Date:   Sat Jun 14 22:05:28 2025 +0000                                                                                                                       
                                                                                                                                                             
    root@cumin1002: test                                                                                                                                     
                                                                                                                                                             
diff --git a/1.0.2.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa b/1.0.2.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa                                                             
index ec2efd53..86aeb714 100644                                                                                                                              
--- a/1.0.2.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa                                                                                                               
+++ b/1.0.2.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa                                                                                                               
@@ -8,6 +8,5 @@                                                                                                                                              
 0.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudcephosd1008.private.eqiad.wikimedia.cloud.                                                                   
 1.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudcephosd1009.private.eqiad.wikimedia.cloud.                                                                   
 5.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudcephosd1021.private.eqiad.wikimedia.cloud.                                                                   
-6.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudcephosd1022.private.eqiad.wikimedia.cloud.                                                                   
 7.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudrabbit1001.private.eqiad.wikimedia.cloud.                                                                    
 5.2.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudcontrol1011.private.eqiad.wikimedia.cloud.                                                                   
diff --git a/1.20.172.in-addr.arpa b/1.20.172.in-addr.arpa                                                                                                   
index e137327c..465917e8 100644                                                                                                                              
--- a/1.20.172.in-addr.arpa                                                                                                                                  
+++ b/1.20.172.in-addr.arpa                                                                                                                                  
@@ -8,6 +8,5 @@                                                                                                                                              
 10  1H IN PTR cloudcephosd1008.private.eqiad.wikimedia.cloud.                                                                                               
 11  1H IN PTR cloudcephosd1009.private.eqiad.wikimedia.cloud.                                                                                               
 15  1H IN PTR cloudcephosd1021.private.eqiad.wikimedia.cloud.                                                                                               
-16  1H IN PTR cloudcephosd1022.private.eqiad.wikimedia.cloud.                                                                                               
 17  1H IN PTR cloudrabbit1001.private.eqiad.wikimedia.cloud.                                                                                                
 25  1H IN PTR cloudcontrol1011.private.eqiad.wikimedia.cloud.                                                                                               
diff --git a/2.0.2.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa b/2.0.2.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa                                                             
index a899a02d..99ad618f 100644                                                                                                                              
--- a/2.0.2.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa                                                                                                               
+++ b/2.0.2.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa                                                                                                               
@@ -6,9 +6,6 @@                                                                                                                                              
 7.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudcephosd1011.private.eqiad.wikimedia.cloud.                                                                   
 8.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudcephosd1012.private.eqiad.wikimedia.cloud.                                                                   
 9.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudcephosd1013.private.eqiad.wikimedia.cloud.                                                                   
-3.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudcephosd1020.private.eqiad.wikimedia.cloud.                                                                   
-4.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudcephosd1023.private.eqiad.wikimedia.cloud.                                                                   
-5.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudcephosd1024.private.eqiad.wikimedia.cloud.                                                                   
 3.2.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudvirt1040.private.eqiad.wikimedia.cloud.                                                                      
 4.2.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudvirt1041.private.eqiad.wikimedia.cloud.                                                                      
 5.2.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR cloudvirt1042.private.eqiad.wikimedia.cloud.                                                                      
diff --git a/2.20.172.in-addr.arpa b/2.20.172.in-addr.arpa                                                                                                   
index 16a9cb9a..9bacdffd 100644                                                                                                                              
--- a/2.20.172.in-addr.arpa                                                                                                                                  
+++ b/2.20.172.in-addr.arpa                                                                                                                                  
@@ -6,9 +6,6 @@                                                                                                                                              
 7   1H IN PTR cloudcephosd1011.private.eqiad.wikimedia.cloud.                                                                                               
 8   1H IN PTR cloudcephosd1012.private.eqiad.wikimedia.cloud.                                                                                               
 9   1H IN PTR cloudcephosd1013.private.eqiad.wikimedia.cloud.                                                                                               
-13  1H IN PTR cloudcephosd1020.private.eqiad.wikimedia.cloud.                                                                                               
-14  1H IN PTR cloudcephosd1023.private.eqiad.wikimedia.cloud.                                                                                               
-15  1H IN PTR cloudcephosd1024.private.eqiad.wikimedia.cloud.                                                                                               
 23  1H IN PTR cloudvirt1040.private.eqiad.wikimedia.cloud.                                                                                                  
 24  1H IN PTR cloudvirt1041.private.eqiad.wikimedia.cloud.                                                                                                  
 25  1H IN PTR cloudvirt1042.private.eqiad.wikimedia.cloud.                                                                                                  
diff --git a/wikimedia.cloud-eqiad b/wikimedia.cloud-eqiad                                                                                                   
index e6847359..76f9d6d7 100644                                                                                                                              
--- a/wikimedia.cloud-eqiad                                                                                                                                  
+++ b/wikimedia.cloud-eqiad                                                                                                                                  
@@ -26,16 +26,8 @@ cloudcephosd1012.private.eqiad           1H IN A 172.20.2.8                                                                               
 cloudcephosd1012.private.eqiad           1H IN AAAA 2a02:ec80:a000:202::8                                                                                   
 cloudcephosd1013.private.eqiad           1H IN A 172.20.2.9                                                                                                 
 cloudcephosd1013.private.eqiad           1H IN AAAA 2a02:ec80:a000:202::9                                                                                   
-cloudcephosd1020.private.eqiad           1H IN A 172.20.2.13                                                                                                
-cloudcephosd1020.private.eqiad           1H IN AAAA 2a02:ec80:a000:202::13                                                                                  
 cloudcephosd1021.private.eqiad           1H IN A 172.20.1.15                                                                                                
 cloudcephosd1021.private.eqiad           1H IN AAAA 2a02:ec80:a000:201::15                                                                                  
-cloudcephosd1022.private.eqiad           1H IN A 172.20.1.16                                                                                                
-cloudcephosd1022.private.eqiad           1H IN AAAA 2a02:ec80:a000:201::16                                                                                  
-cloudcephosd1023.private.eqiad           1H IN A 172.20.2.14                                                                                                
-cloudcephosd1023.private.eqiad           1H IN AAAA 2a02:ec80:a000:202::14                                                                                  
-cloudcephosd1024.private.eqiad           1H IN A 172.20.2.15                                                                                                
-cloudcephosd1024.private.eqiad           1H IN AAAA 2a02:ec80:a000:202::15                                                                                  
 cloudcephosd1025.private.eqiad           1H IN A 172.20.3.5                                                                                                 
 cloudcephosd1025.private.eqiad           1H IN AAAA 2a02:ec80:a000:203::5                                                                                   
 cloudcephosd1026.private.eqiad           1H IN A 172.20.3.6                                                                                                 
METADATA: {"path": "/tmp/dns-c25pcHBldHM-tqgmrjor", "sha1": "1058996d79cf215e13b2f41748357c9be3c3f0af", "insertions": 0, "deletions": 16, "lines": 16, "files": 5}                                                                                                                                                        
================ 
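To see at a glance which hosts lose reverse records in a diff like the one above, the removed PTR lines can be extracted with a short script. This is only a sketch: the function name `removed_ptr_targets` and the regular expression are illustrative assumptions, and the sample lines are copied from the `2.20.172.in-addr.arpa` hunk.

```python
import re

# Matches a removed PTR line in a unified diff, e.g.
# "-16  1H IN PTR cloudcephosd1022.private.eqiad.wikimedia.cloud."
# File headers such as "--- a/2.20.172.in-addr.arpa" do not match.
REMOVED_PTR = re.compile(r"^-\S+\s+1H IN PTR (\S+?)\.\s*$")


def removed_ptr_targets(diff_text):
    """Return the sorted, de-duplicated PTR targets removed by a diff."""
    targets = set()
    for line in diff_text.splitlines():
        match = REMOVED_PTR.match(line)
        if match:
            targets.add(match.group(1))
    return sorted(targets)


# Sample lines copied from the diff above (trailing padding dropped).
SAMPLE = """\
--- a/2.20.172.in-addr.arpa
+++ b/2.20.172.in-addr.arpa
 9   1H IN PTR cloudcephosd1013.private.eqiad.wikimedia.cloud.
-13  1H IN PTR cloudcephosd1020.private.eqiad.wikimedia.cloud.
-14  1H IN PTR cloudcephosd1023.private.eqiad.wikimedia.cloud.
-15  1H IN PTR cloudcephosd1024.private.eqiad.wikimedia.cloud.
 23  1H IN PTR cloudvirt1040.private.eqiad.wikimedia.cloud.
"""

for name in removed_ptr_targets(SAMPLE):
    print(name)
```

Run against the full diff, the same approach should list cloudcephosd1020, 1022, 1023 and 1024, the four hosts whose records are removed.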

This raises several mysteries:

  1. Why is reimaging messing with those addresses at all? @Volans says that it's because of syncing with puppetdb, but I don't see evidence that those addresses were ever referenced in puppet.
  2. Why just those four hosts, when I re-imaged more than four? Why isn't there a diff for e.g. cloudcephosd1019 as well?
  3. What even are those .private.eqiad.wikimedia.cloud addresses in this context? I know we use addresses like that for routing within the cloud private network, but as far as I can tell they don't actually route within the ceph network. I can't ping any of them, for example, not even the ones for other OSD nodes that still have DNS records, like cloudcephosd1034. And not all ceph OSD nodes even have them:
root@cloudnet1005:~# dig +short cloudcephosd1034.private.eqiad.wikimedia.cloud
172.20.4.9
root@cloudnet1005:~# ping 172.20.4.9
PING 172.20.4.9 (172.20.4.9) 56(84) bytes of data.
^C
--- 172.20.4.9 ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 5108ms

root@cloudnet1005:~# dig +short cloudcephosd1035.private.eqiad.wikimedia.cloud
  4. What will happen when those records are removed? I'm about to find out the answer to that one, for a few hosts at least :(

Event Timeline

Bonus question: Is there some reason why it is good, actually, for the reimage script to make netbox changes and neither merge them nor alert the user about them?

Mentioned in SAL (#wikimedia-operations) [2025-06-14T22:23:57Z] <andrew@cumin1002> START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: T396940 - andrew@cumin1002"

Mentioned in SAL (#wikimedia-operations) [2025-06-14T22:24:02Z] <andrew@cumin1002> END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: T396940 - andrew@cumin1002"

Why is reimaging messing with those addresses at all? @Volans says that it's because of syncing with puppetdb, but I don't see evidence that those addresses were ever referenced in puppet.

After a reimage Netbox syncs the network interfaces present on the host and their related IPs using the host networking fact from PuppetDB.

Note that this operation doesn't set any DNS record (at most, it checks that the primary IP has the correct DNS record for the hostname).

The IPs and DNS records above in the example log were manually created: for example for 172.20.2.13 here and for 2a02:ec80:a000:202::13 here.
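As a quick cross-check of how those two manually created addresses relate to the reverse zone files in the diff, the PTR owner names can be derived with Python's standard `ipaddress` module. This is a small sketch; the helper name `ptr_name` is made up for illustration.

```python
import ipaddress


def ptr_name(address):
    """Return the reverse-DNS owner name for an IPv4 or IPv6 address."""
    return ipaddress.ip_address(address).reverse_pointer


for addr in ("172.20.2.13", "2a02:ec80:a000:202::13"):
    print(addr, "->", ptr_name(addr))
# 172.20.2.13 -> 13.2.20.172.in-addr.arpa
# 2a02:ec80:a000:202::13 -> 3.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.2.0.2.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa
```

These correspond to the `13` entry in `2.20.172.in-addr.arpa` and the `3.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0` entry in `2.0.2.0.0.0.0.a.0.8.c.e.2.0.a.2.ip6.arpa`, both of which the diff removes for cloudcephosd1020.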

Bonus question: Is there some reason why it is good, actually, for the reimage script to make netbox changes and neither merge them nor alert the user about them?

Under normal operations, nothing changed by the PuppetDB import script affects DNS, so there is no need to run the sre.dns.netbox cookbook afterwards. This is quite a special case, in which additional IPs with DNS entries were added and assigned to the hosts (not VIPs) even though they are not part of the hosts' puppetization.
To be on the safe side, we could add a check so that, if any IP that has a DNS name is deleted, the script alerts the user or runs the sre.dns.netbox cookbook automatically.
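The proposed safety check could look roughly like the following. This is a hypothetical sketch and not the actual import script: the `IPRecord` type, the `ips_needing_dns_update` function and the way deleted IPs are collected are all assumptions made for illustration. The first two addresses come from the diff above; the third is a made-up documentation address with no DNS name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IPRecord:
    """Hypothetical stand-in for an IP address object being deleted."""
    address: str
    dns_name: str  # empty string when the IP has no DNS name


def ips_needing_dns_update(deleted_ips):
    """Return the deleted IPs that carry a DNS name.

    A non-empty result means the DNS data will change, so the user should
    be alerted (or the sre.dns.netbox cookbook run).
    """
    return [ip for ip in deleted_ips if ip.dns_name]


deleted = [
    IPRecord("172.20.2.13", "cloudcephosd1020.private.eqiad.wikimedia.cloud"),
    IPRecord("2a02:ec80:a000:202::13", "cloudcephosd1020.private.eqiad.wikimedia.cloud"),
    IPRecord("192.0.2.10", ""),  # made-up example without a DNS name
]

affected = ips_needing_dns_update(deleted)
if affected:
    print("WARNING: deleted IPs with DNS names; the DNS cookbook needs a run:")
    for ip in affected:
        print(f"  {ip.address} ({ip.dns_name})")
```

Whether to only warn or to trigger the cookbook automatically is a design choice; a warning is the less surprising default, since it leaves the DNS change under the operator's control.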

All the changes are reported to the user. For example, this is the log of the changes applied for the reimage of cloudcephosd1020:

2025-06-14 05:10:21,165 andrew 1897838 [INFO] [info] Processing baremetal device cloudcephosd1020 (WMF5529)
2025-06-14 05:10:21,166 andrew 1897838 [INFO] [info] Creating interface enp175s0f0np0 for device cloudcephosd1020 (WMF5529)
2025-06-14 05:10:21,167 andrew 1897838 [INFO] [info] cloudcephosd1020.eqiad.wmnet assign_name: 10.64.20.17/24 already has correct DNS name.
2025-06-14 05:10:21,167 andrew 1897838 [INFO] [info] 10.64.20.17/24 is already primary for cloudcephosd1020 (WMF5529)
2025-06-14 05:10:21,167 andrew 1897838 [INFO] [info] cloudcephosd1020.eqiad.wmnet assign_name: 2620:0:861:118:10:64:20:17/64 already has correct DNS name.
2025-06-14 05:10:21,168 andrew 1897838 [INFO] [info] 2620:0:861:118:10:64:20:17/64 is already primary for cloudcephosd1020 (WMF5529)
2025-06-14 05:10:21,168 andrew 1897838 [INFO] [success] cloudsw1-d5-eqiad (WMF4808): remove cable from xe-0/0/5
2025-06-14 05:10:21,168 andrew 1897838 [INFO] [success] cloudcephosd1020 (WMF5529): created cable 53530
2025-06-14 05:10:21,169 andrew 1897838 [INFO] [warning] cloudcephosd1020 (WMF5529): assuming Black dac-passive because 10gbase-x-sfpp
2025-06-14 05:10:21,169 andrew 1897838 [INFO] [info] Creating interface enp175s0f1np1 for device cloudcephosd1020 (WMF5529)
2025-06-14 05:10:21,170 andrew 1897838 [INFO] [failure] Can't find parent prefix for 192.168.4.20/24.
2025-06-14 05:10:21,170 andrew 1897838 [INFO] [success] cloudsw1-d5-eqiad (WMF4808): remove cable from xe-0/0/47
2025-06-14 05:10:21,171 andrew 1897838 [INFO] [success] cloudcephosd1020 (WMF5529): created cable 5391
2025-06-14 05:10:21,171 andrew 1897838 [INFO] [warning] cloudcephosd1020 (WMF5529): assuming Black dac-passive because 10gbase-x-sfpp
2025-06-14 05:10:21,172 andrew 1897838 [INFO] [info] cloudcephosd1020: removing child interface no longer in puppet vlan1152
2025-06-14 05:10:21,172 andrew 1897838 [INFO] [info] cloudcephosd1020: removing interface no longer in puppet ens3f0np0
2025-06-14 05:10:21,172 andrew 1897838 [INFO] [info] cloudcephosd1020: removing interface no longer in puppet ens3f1np1
2025-06-14 05:10:21,173 andrew 1897838 [INFO] [info] cloudcephosd1020 (WMF5529) done.

Why just those four hosts, when I re-imaged more than four?

If I had to guess I'd say these allocations were made in Netbox when we first created the cloud-private networks, and we assigned an IP for every host existing at that point.

Further, I think the puppetization for cloud-private on ceph hosts was never done (not needed?), so after a reimage the hosts don't create those interfaces, and the PuppetDB import then deletes them from Netbox.
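The effect described here amounts to a set difference between what Netbox holds and what the host reports. The sketch below is a simplified illustration, not the real import script: `plan_sync` is a made-up name, and the interface names are taken from the cloudcephosd1020 log above.

```python
def plan_sync(netbox_interfaces, puppetdb_interfaces):
    """Return (to_create, to_remove) interface names, each sorted.

    Interfaces reported by the host but missing in Netbox are created;
    interfaces in Netbox that the host no longer reports are removed,
    together with any IPs attached to them.
    """
    to_create = sorted(puppetdb_interfaces - netbox_interfaces)
    to_remove = sorted(netbox_interfaces - puppetdb_interfaces)
    return to_create, to_remove


# Interface names as seen in the cloudcephosd1020 log above.
netbox = {"ens3f0np0", "ens3f1np1", "vlan1152"}
puppetdb = {"enp175s0f0np0", "enp175s0f1np1"}

to_create, to_remove = plan_sync(netbox, puppetdb)
print("create:", to_create)
print("remove:", to_remove)
```

Under this model, any IP (and its DNS name) that exists only on a removed interface disappears with it, which would explain the deletions in the diff.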

What even are those .private.eqiad.wikimedia.cloud addresses in this context?

The private.eqiad.wikimedia.cloud zone is used for host interfaces on the cloud-private networks.

What will happen when those records are removed?

Probably nothing? My guess is that the cloudceph hosts are not configured for the private vlans. It would probably be better if they did have connectivity to those vlans, so that all cloud hosts had it, but if it's not needed, it's not needed at the end of the day.

dcaro triaged this task as Medium priority. Jun 25 2025, 2:19 PM
taavi removed dcaro as the assignee of this task.