
gdnsd failures when converting services from active/passive to active/active
Open, Stalled, Medium, Public

Description

We are trying to convert the netbox service from active/passive to active/active. I created a change to update the service catalog and a change to update the DNS. I then followed the instructions for adding a new service (there were none for converting): I first deployed the service catalog change and ran puppet on all the DNS servers, which resulted in the following diff:

diff
--- /etc/gdnsd/discovery-geo-resources  2022-12-01 10:45:25.534169539 +0000
+++ /tmp/puppet-file20230220-11718-uhf72k       2023-02-20 14:03:09.370481312 +0000
@@ -295,7 +295,7 @@
     }
 }
 
-disc-geo-netbox => {
+disc-netbox => {
     map => discovery-map,
     service_types => discovery-state-netbox,
     dcmap => {

Info: Computing checksum on file /etc/gdnsd/discovery-geo-resources
Info: /Stage[main]/Profile::Dns::Auth::Discovery/File[/etc/gdnsd/discovery-geo-resources]: Filebucketed /etc/gdnsd/discovery-geo-resources to puppet with sum 025570ce5eea62c5a927c5c9e48c39de                                          
Notice: /Stage[main]/Profile::Dns::Auth::Discovery/File[/etc/gdnsd/discovery-geo-resources]/content: content changed '{md5}025570ce5eea62c5a927c5c9e48c39de' to '{md5}6c2c530297335c722c222615df772b10'
Info: /Stage[main]/Profile::Dns::Auth::Discovery/File[/etc/gdnsd/discovery-geo-resources]: Scheduling refresh of Service[gdnsd]                                                                                                         
Notice: /Stage[main]/Profile::Dns::Auth::Discovery/File[/etc/gdnsd/discovery-metafo-resources]/content: 
--- /etc/gdnsd/discovery-metafo-resources       2022-11-23 14:10:06.910243923 +0000
+++ /tmp/puppet-file20230220-11718-cq1ed8       2023-02-20 14:03:09.430481805 +0000
@@ -61,13 +61,6 @@
         fail => %geoip!disc-failoid,
     },
 }
-disc-netbox => {
-    datacenters => [ geo, fail ],
-    dcmap => {
-        geo => %geoip!disc-geo-netbox,
-        fail => %geoip!disc-failoid,
-    },
-}
 disc-parsoid-php => {
     datacenters => [ geo, fail ],
     dcmap => {

Info: Computing checksum on file /etc/gdnsd/discovery-metafo-resources
Info: /Stage[main]/Profile::Dns::Auth::Discovery/File[/etc/gdnsd/discovery-metafo-resources]: Filebucketed /etc/gdnsd/discovery-metafo-resources to puppet with sum d635d54f0ea96f0f4334d2dcb82cd098                                    
Notice: /Stage[main]/Profile::Dns::Auth::Discovery/File[/etc/gdnsd/discovery-metafo-resources]/content: content changed '{md5}d635d54f0ea96f0f4334d2dcb82cd098' to '{md5}a242567f46b8fd32d2c591f0e8fa06e3'
Info: /Stage[main]/Profile::Dns::Auth::Discovery/File[/etc/gdnsd/discovery-metafo-resources]: Scheduling refresh of Service[gdnsd]                                                                                                      
Notice: /Stage[main]/Profile::Dns::Auth::Discovery/Confd::File[/var/lib/gdnsd/discovery-netbox.state]/File[/etc/confd/conf.d/_var_lib_gdnsd_discovery-netbox.state.toml]/content: 
--- /etc/confd/conf.d/_var_lib_gdnsd_discovery-netbox.state.toml        2022-05-31 13:59:21.463177299 +0000
+++ /tmp/puppet-file20230220-11718-u9si99       2023-02-20 14:03:14.246521376 +0000
@@ -13,5 +13,5 @@
     ]
 
 prefix = "/conftool/v1"
-check_cmd = "/usr/local/bin/confd-lint-wrap /usr/local/bin/authdns-check-active-passive {{.src}}"
+

I then deployed the DNS change but received the following error:

sudo authdns-update                                                                      [12:51:34]
Updating authdns1001.wikimedia.org (self)...
Pulling the current revision from https://gerrit.wikimedia.org/r/operations/dns.git
Reviewing a21746632e4b7fb90cb4745ce5fd6b7d678ef492...

 templates/wmnet                           | 2 +-
 utils/mock_etc/discovery-geo-resources    | 1 +
 utils/mock_etc/discovery-metafo-resources | 1 -
 3 files changed, 2 insertions(+), 2 deletions(-)

diff --git templates/wmnet templates/wmnet
index 03333182..7ad7f7f6 100644
--- templates/wmnet
+++ templates/wmnet
@@ -783,7 +783,7 @@ inference       300/10 IN DYNA geoip!disc-inference
 k8s-ingress-staging  300/10 IN DYNA metafo!disc-k8s-ingress-staging
 k8s-ingress-wikikube-ro 300/10 IN DYNA geoip!disc-k8s-ingress-wikikube-ro
 k8s-ingress-wikikube-rw 300/10 IN DYNA metafo!disc-k8s-ingress-wikikube-rw
-netbox                  300/10 IN DYNA metafo!disc-netbox
+netbox                  300/10 IN DYNA geoip!disc-netbox
 ; We don't need a separate discovery address for netbox-extra
 ; however a new cname is useful to configure an internal vhost
 netbox-exports          300 IN CNAME netbox
diff --git utils/mock_etc/discovery-geo-resources utils/mock_etc/discovery-geo-resources
index 9c418ae0..f0737643 100644
--- utils/mock_etc/discovery-geo-resources
+++ utils/mock_etc/discovery-geo-resources
@@ -58,6 +58,7 @@ disc-helm-charts         => { map => mock, dcmap => { mock => 192.0.2.1 } }
 disc-api-gateway         => { map => mock, dcmap => { mock => 192.0.2.1 } }
 disc-similar-users       => { map => mock, dcmap => { mock => 192.0.2.1 } }
 disc-linkrecommendation  => { map => mock, dcmap => { mock => 192.0.2.1 } }
+disc-netbox              => { map => mock, dcmap => { mock => 192.0.2.1 } }
 disc-puppetdb-api        => { map => mock, dcmap => { mock => 192.0.2.1 } }
 disc-puppetboard         => { map => mock, dcmap => { mock => 192.0.2.1 } }
 disc-shellbox            => { map => mock, dcmap => { mock => 192.0.2.1 } }
diff --git utils/mock_etc/discovery-metafo-resources utils/mock_etc/discovery-metafo-resources
index ca41fb4c..ceb840af 100644
--- utils/mock_etc/discovery-metafo-resources
+++ utils/mock_etc/discovery-metafo-resources
@@ -32,4 +32,3 @@ disc-parsoid-php         => { datacenters => mock, dcmap => { mock => 192.0.2.1
 disc-toolhub             => { datacenters => mock, dcmap => { mock => 192.0.2.1 } }
 disc-k8s-ingress-staging => { datacenters => mock, dcmap => { mock => 192.0.2.1 } }
 disc-k8s-ingress-wikikube-rw => { datacenters => mock, dcmap => { mock => 192.0.2.1 } }
-disc-netbox              => { datacenters => mock, dcmap => { mock => 192.0.2.1 } }

Merge these changes? (yes/no)? yes
Updating f7bdb9d5..a2174663
Fast-forward
 templates/wmnet                           | 2 +-
 utils/mock_etc/discovery-geo-resources    | 1 +
 utils/mock_etc/discovery-metafo-resources | 1 -
 3 files changed, 2 insertions(+), 2 deletions(-)
Deploying via utils/deploy-check.py...
Assembling and testing data in /tmp/dns-check.6_bsr0e8
 -- Generating zonefiles from zone templates
 -- Processed 213 zones into directory /tmp/dns-check.6_bsr0e8/zones
OK: No tabs
Summary of violations:
    W001|MISSING_IP_FOR_NAME_AND_PTR: 256
    W002|MISSING_PTR_FOR_NAME_AND_IP: 30
    W105|TOO_MANY_PUBLIC_NAMES: 11
RESULT: 0 Errors, 297 Warnings, 1811 Ignored violations, 43 Ignored lines
 -- Copying automatically generated zone files under target tree
 -- Copying repo-driven real config files and admin_state
 -- Copying puppetized config and GeoIP from /etc/gdnsd
 -- Checking for illegal tabs in zonefiles
 -- Running zone_validator to check WMF rules
 -- Running /usr/sbin/gdnsd checkconf on /tmp/dns-check.6_bsr0e8
Traceback (most recent call last):
  File "utils/deploy-check.py", line 283, in <module>
    main()
  File "utils/deploy-check.py", line 275, in main
    deploy_check(args.deploy, args.skip_reload, args.no_gdnsd, Path(tdir), gdir)
  File "utils/deploy-check.py", line 221, in deploy_check
    safe_cmd([GDNSD_BIN, '-c', str(tdir), 'checkconf'])
  File "utils/deploy-check.py", line 87, in safe_cmd
    p_err.decode('utf-8')))
Exception: Command /usr/sbin/gdnsd -c /tmp/dns-check.6_bsr0e8 checkconf failed with exit code 42, stderr:
info: gdnsd version 3.8.0 @ pid 6244
info: DNS listener threads (8 UDP + 8 TCP) configured for 208.80.154.238:53
info: DNS listener threads (8 UDP + 8 TCP) configured for 208.80.153.231:53
info: DNS listener threads (8 UDP + 8 TCP) configured for 91.198.174.239:53
info: DNS listener threads (8 UDP + 8 TCP) configured for 198.35.27.27:53
info: DNS listener threads (8 TCP PROXY) configured for 127.0.0.1:535
info: DNS listener threads (1 UDP + 1 TCP) configured for 0.0.0.0:5353
info: DNS listener threads (1 UDP + 1 TCP) configured for [::]:5353
info: plugin_geoip: map 'generic-map': Loading GeoIP2 database '/tmp/dns-check.6_bsr0e8/geoip/GeoIP2-City.mmdb': Version: 2.0, Type: GeoIP2-City, IPVersion: 6, Timestamp: 2023-02-17 02:31:14 UTC
info: plugin_geoip: map 'generic-map' runtime db updated. nets: 1214920 dclists: 18
info: plugin_geoip: map 'discovery-map': Loading GeoIP2 database '/tmp/dns-check.6_bsr0e8/geoip/GeoIP2-City.mmdb': Version: 2.0, Type: GeoIP2-City, IPVersion: 6, Timestamp: 2023-02-17 02:31:14 UTC
info: plugin_geoip: map 'discovery-map' runtime db updated. nets: 512 dclists: 2
info: admin_state: checking state file '/tmp/dns-check.6_bsr0e8/state/admin_state'...
error: plugin_geoip: Invalid resource name 'disc-netbox' detected from zonefile lookup
error: Name 'netbox.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-netbox'
fatal: Initial load of zone data failed
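
The checkconf failure above is the geoip plugin rejecting a resource name that a DYNA record refers to. A minimal Python sketch of that kind of check follows. It is not part of `utils/deploy-check.py` or gdnsd; the function name `find_unknown_resources`, the regex, and the sample resource sets are assumptions made purely for illustration.

```python
import re

# Matches zone lines such as:
#   netbox   300/10 IN DYNA geoip!disc-netbox
# Groups: record label, plugin name, resource name.
DYNA_RE = re.compile(r"^(\S+)\s+\S+\s+IN\s+DYNA\s+(\w+)!(\S+)")


def find_unknown_resources(zone_text, resources):
    """Return (label, plugin, resource) for every DYNA record whose
    resource is not defined in that plugin's own namespace.

    `resources` maps a plugin name ("geoip", "metafo") to the set of
    resource names defined for it. Hypothetical helper, not a gdnsd API.
    """
    problems = []
    for line in zone_text.splitlines():
        match = DYNA_RE.match(line)
        if not match:
            continue  # comments, CNAMEs and other record types are skipped
        label, plugin, name = match.groups()
        if name not in resources.get(plugin, set()):
            problems.append((label, plugin, name))
    return problems


# Sample data modelled on the diffs in this task (assumed, simplified).
SAMPLE_ZONE = """\
k8s-ingress-staging  300/10 IN DYNA metafo!disc-k8s-ingress-staging
netbox                  300/10 IN DYNA geoip!disc-netbox
netbox-exports          300 IN CNAME netbox
"""

# State in which the geoip namespace only knows the old a/p helper name.
SAMPLE_RESOURCES = {
    "geoip": {"disc-geo-netbox"},
    "metafo": {"disc-k8s-ingress-staging", "disc-netbox"},
}

if __name__ == "__main__":
    for label, plugin, name in find_unknown_resources(SAMPLE_ZONE, SAMPLE_RESOURCES):
        print(f"{label}: plugin '{plugin}' has no resource '{name}'")
```

The lookup is keyed by plugin first, so a name that exists only under `metafo` does not satisfy a `geoip!` reference.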

I have reverted both changes but received a similar error when running puppet (likely due to the order in which the changes were applied). It would be useful for someone to check the state of DNS to ensure nothing is broken; the priority of this task can then be lowered.

I have recreated the original changes for [[ DNS | https://gerrit.wikimedia.org/r/c/operations/dns/+/890384 ]] and the service catalog.


Event Timeline

Change 890384 had a related patch set uploaded (by Jbond; author: Jbond):

[operations/dns@master] netbox: update netbox so that its active/active

https://gerrit.wikimedia.org/r/890384

Change 890385 had a related patch set uploaded (by Jbond; author: Jbond):

[operations/puppet@production] netbox: update netbox service to active/active

https://gerrit.wikimedia.org/r/890385

jbond triaged this task as High priority. (Feb 20 2023, 2:25 PM)
jbond edited projects, added Traffic; removed Traffic-Icebox.
jbond updated the task description.
jbond removed subscribers: crusnov, BBlack.
jbond lowered the priority of this task from High to Medium. (Feb 20 2023, 2:46 PM)
jbond added subscribers: Vgutierrez, BBlack.

Lowering priority: @Vgutierrez confirmed there are no immediate issues with DNS. They also suggested that we will likely need to perform two DNS changes to roll this out: one to remove the metafo resource, then another to add the geo resources. However, @BBlack would need to give the authoritative answer.

From @BBlack via IRC:

it *seems* like that error in the ticket would've only happened if the puppet agent hadn't run (for the related change) on all the DNS servers before the authdns-update? But even then, I'm not 100% sure. either way, I think the important details that help are:

  1. The namespaces for geoip and metafo (a/a vs a/p) are independent. You can have the same name existing in both places at the same time.
  2. It's probably simpler (and would work around anything I missed above) to add-then-remove, instead of doing it all in one go.

by that I mean:

  1. puppet change to add the new variant (without removing the old)
  2. DNS change to switch the record to point at the new one
  3. puppet change to remove the now-unused old one.

maybe the above even is a little simplistic, due to the DNS CI "mock" stuff. so the sequence is really more like 5 commits total:

  1. puppet change to add new a/a service
  2. DNS change to add matching mock_etc entry
  3. DNS change to switch the record for lookups
  4. DNS change to remove the old mock_etc entry
  5. Puppet change to remove the old a/p service

[and the puppet change from step 1, needs to be agent-applied in all authdns boxes before (2)]
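
The five-commit ordering above can be sanity-checked with a small simulation. This is only a sketch under stated assumptions: the state model (a puppet-rendered namespace, the DNS repo's mock_etc namespace, and the single `netbox` record), the starting contents, and every function name are hypothetical and not taken from the puppet or DNS repositories. The invariant checked after each commit is that the record's resource exists in the matching plugin namespace in both places.

```python
import copy

# Hypothetical starting state for an a/p service (assumed, simplified).
INITIAL = {
    "puppet": {"geoip": {"disc-geo-netbox"}, "metafo": {"disc-netbox"}},
    "mock": {"geoip": set(), "metafo": {"disc-netbox"}},
    "record": ("metafo", "disc-netbox"),
}


def record_resolves(state):
    """True if the record's resource exists in its plugin namespace
    in both the puppet-rendered config and the mock_etc config."""
    plugin, name = state["record"]
    return name in state["puppet"][plugin] and name in state["mock"][plugin]


def puppet_add_aa(state):        # step 1: add the new a/a service
    state["puppet"]["geoip"].add("disc-netbox")


def dns_add_mock(state):         # step 2: add the matching mock_etc entry
    state["mock"]["geoip"].add("disc-netbox")


def dns_switch_record(state):    # step 3: point the record at geoip
    state["record"] = ("geoip", "disc-netbox")


def dns_remove_old_mock(state):  # step 4: drop the old mock_etc entry
    state["mock"]["metafo"].discard("disc-netbox")


def puppet_remove_ap(state):     # step 5: drop the old a/p service
    state["puppet"]["metafo"].discard("disc-netbox")
    state["puppet"]["geoip"].discard("disc-geo-netbox")


FIVE_STEPS = [puppet_add_aa, dns_add_mock, dns_switch_record,
              dns_remove_old_mock, puppet_remove_ap]


def simulate(steps):
    """Apply steps in order; return (step name, invariant holds) pairs."""
    state = copy.deepcopy(INITIAL)
    results = []
    for step in steps:
        step(state)
        results.append((step.__name__, record_resolves(state)))
    return results


if __name__ == "__main__":
    for name, ok in simulate(FIVE_STEPS):
        print(f"{name}: {'ok' if ok else 'BROKEN'}")
```

In this model the invariant holds after every one of the five steps, whereas `simulate([dns_switch_record])` reports a broken state, because the record is switched before either namespace has the new entry. The model says nothing about puppet agent timing across hosts, which the note on step 1 covers.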

BCornwall removed a project: SRE.

This ticket could do with a little more clarity: I'm going to boldly assume this ticket is for identifying/fixing the error "Exception: Command /usr/sbin/gdnsd -c /tmp/dns-check.6_bsr0e8 checkconf failed with exit code 42, stderr:". DNS verification seems to have already happened. I'll update the description to more clearly reflect that.

BCornwall renamed this task from "Issues converting services from active/passive to active/active" to "gdnsd failures when converting services from active/passive to active/active". (May 1 2023, 4:45 PM)
BCornwall updated the task description.
jbond changed the task status from Open to Stalled. (May 3 2023, 9:30 AM)

Setting to stalled as I need to test the procedure in https://phabricator.wikimedia.org/T330084#8772353

Change 890385 abandoned by Jbond:

[operations/puppet@production] netbox: update netbox service to active/active

Reason:

more work required on netbox side, see tasks

https://gerrit.wikimedia.org/r/890385

Change 890384 abandoned by Jbond:

[operations/dns@master] netbox: update netbox so that its active/active

Reason:

more work required on netbox side see tasks

https://gerrit.wikimedia.org/r/890384