
cloud: failures resolving some `wikimedia.cloud` domains
Closed, Resolved · Public

Description

The wikimedia.cloud domain fails to resolve in some circumstances inside WMCS VM instances.

One example failure: We have a puppet manifest that runs dnsquery::a() that fails in PCC for the domain private.codfw.wikimedia.cloud.

Code:

user@laptop~/git/wmf/operations/puppet production $ git grep dnsquery | grep cloud
modules/cloudlb/manifests/haproxy/service.pp:                        dnsquery::a($host)[0]
modules/cloudlb/spec/defines/haproxy_service_spec.rb:    function dnsquery::a($fqdn) {
modules/cloudlb/templates/haproxy/conf.d/http-service.cfg.erb:    server <%= server %> <%= scope.call_function('dnsquery::a', [server])[0] %>:<%= @port_backend %> check inter 3s rise 2 fall 4
modules/profile/manifests/wmcs/cloud_private_subnet.pp:    $cloud_private_address = dnsquery::a($cloud_private_fqdn)[0]
modules/profile/manifests/wmcs/cloud_private_subnet.pp:    $gw_address = dnsquery::a($gw_fqdn)[0]
modules/profile/spec/classes/profile_wmcs_cloud_private_subnet_spec.rb:        "function dnsquery::a($fqdn) {

In modules/profile/manifests/wmcs/cloud_private_subnet.pp, for host cloudlb2001-dev.codfw.wmnet, the manifest contains a dnsquery::a call for cloudlb2001-dev.private.codfw.wikimedia.cloud. That name resolves just fine on the actual puppetmaster but fails in PCC.

Event Timeline

Restricted Application added a subscriber: Aklapper.
aborrero triaged this task as Medium priority. (May 12 2023, 11:35 AM)

We have a puppet manifest that runs dnsquery::a() that fails in PCC for the domain private.codfw.wikimedia.cloud.

The name puppet ends up trying to resolve is cloudsw.private.codfw.wikimedia.cloud, which normally resolves to 172.20.5.1 but returns SERVFAIL when queried from a cloud host:

$ dig cloudsw.private.codfw.wikimedia.cloud                                  

; <<>> DiG 9.11.5-P4-5.1+deb10u8-Debian <<>> cloudsw.private.codfw.wikimedia.cloud
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 3981
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;cloudsw.private.codfw.wikimedia.cloud. IN A

;; Query time: 4 msec
;; SERVER: 208.80.154.143#53(208.80.154.143)
;; WHEN: Fri May 12 11:37:42 UTC 2023
;; MSG SIZE  rcvd: 66

In fact, I get SERVFAIL for the wikimedia.cloud domain itself:

jbond@cloudinfra-internal-puppetmaster-01:~$ dig soa wikimedia.cloud

; <<>> DiG 9.11.5-P4-5.1+deb10u8-Debian <<>> soa wikimedia.cloud
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 17773
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;wikimedia.cloud.               IN      SOA

;; Query time: 3 msec
;; SERVER: 208.80.153.47#53(208.80.153.47)
;; WHEN: Fri May 12 11:40:36 UTC 2023
;; MSG SIZE  rcvd: 44
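
When comparing several recursors, it can help to pull only the response code out of each dig output. The following is a minimal sketch, assuming the output has been captured as text; the function name `dig_status` is hypothetical and the sample is an excerpt of the output above.

```python
import re
from typing import Optional

# Matches the status field on dig's ">>HEADER<<" line.
_STATUS_RE = re.compile(r"->>HEADER<<-.*?status:\s*([A-Z]+)")


def dig_status(output: str) -> Optional[str]:
    """Return the response code (e.g. NOERROR, SERVFAIL) from dig output, or None."""
    match = _STATUS_RE.search(output)
    return match.group(1) if match else None


# Excerpt of the SOA query output shown above.
sample = """\
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 17773
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
"""

print(dig_status(sample))  # SERVFAIL
```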

Yes, sorry for the noise. There is nothing wrong with PCC; the problem may be in the OpenStack Designate setup:

aborrero@tools-sgebastion-11:~$ dig +short cloudsw.private.codfw.wikimedia.cloud
aborrero@tools-sgebastion-11:~$ dig +trace +short cloudsw.private.codfw.wikimedia.cloud
NS k.root-servers.net. from server 208.80.154.143 in 0 ms.
NS c.root-servers.net. from server 208.80.154.143 in 0 ms.
NS g.root-servers.net. from server 208.80.154.143 in 0 ms.
NS i.root-servers.net. from server 208.80.154.143 in 0 ms.
NS l.root-servers.net. from server 208.80.154.143 in 0 ms.
NS a.root-servers.net. from server 208.80.154.143 in 0 ms.
NS m.root-servers.net. from server 208.80.154.143 in 0 ms.
NS b.root-servers.net. from server 208.80.154.143 in 0 ms.
NS j.root-servers.net. from server 208.80.154.143 in 0 ms.
NS e.root-servers.net. from server 208.80.154.143 in 0 ms.
NS h.root-servers.net. from server 208.80.154.143 in 0 ms.
NS f.root-servers.net. from server 208.80.154.143 in 0 ms.
NS d.root-servers.net. from server 208.80.154.143 in 0 ms.
A 172.20.5.1 from server 208.80.153.231 in 30 ms.
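
The contrast above is that the plain recursive query returns nothing, while the trace, which walks the delegation from the root, does get an A record. A small sketch for extracting the final answers and the servers that returned them from `dig +trace +short` output follows; the helper name `trace_answers` is hypothetical and the sample is shortened from the output above.

```python
import re
from typing import List, Tuple

# One line of "dig +trace +short": "<TYPE> <value> from server <ip> in <n> ms."
_LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+from server (\S+) in \d+ ms\.?$")


def trace_answers(output: str, rtype: str = "A") -> List[Tuple[str, str]]:
    """Return (value, answering_server) pairs for records of the given type."""
    answers = []
    for line in output.splitlines():
        match = _LINE_RE.match(line.strip())
        if match and match.group(1) == rtype:
            answers.append((match.group(2), match.group(3)))
    return answers


# Shortened copy of the trace output shown above.
sample = """\
NS k.root-servers.net. from server 208.80.154.143 in 0 ms.
NS c.root-servers.net. from server 208.80.154.143 in 0 ms.
A 172.20.5.1 from server 208.80.153.231 in 30 ms.
"""

print(trace_answers(sample))  # [('172.20.5.1', '208.80.153.231')]
```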
aborrero renamed this task from "PCC: unable to run dnsquery:a() for some domains" to "cloud: failures resolving some `wikimedia.cloud` domains". (May 12 2023, 11:49 AM)
aborrero updated the task description.

There is a mysterious log message:

aborrero@cloudservices1005:~ 3s $ sudo journalctl -u pdns-recursor.service  | grep \'wikimedia.cloud
May 12 10:50:55 cloudservices1005 pdns-recursor[1407057]: Redirecting queries for zone 'wikimedia.cloud' to: 208.80.154.148:53
May 12 11:50:54 cloudservices1005 pdns-recursor[1419986]: Redirecting queries for zone 'wikimedia.cloud' to: 208.80.154.148:53
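
Going by the patch that resolved this task, this message appears to correspond to a manual forwarding rule for the zone in the recursor configuration. As a rough sketch, the journal lines can be reduced to a map of zone to forwarding target to see which zones are affected; the function name `forwarded_zones` is hypothetical, and the regex assumes the exact message wording shown above.

```python
import re
from typing import Dict, Set

# Assumes the message format seen in the journal output above.
_REDIRECT_RE = re.compile(r"Redirecting queries for zone '([^']+)' to: (\S+)")


def forwarded_zones(journal: str) -> Dict[str, Set[str]]:
    """Map each zone to the set of targets its queries are redirected to."""
    zones: Dict[str, Set[str]] = {}
    for line in journal.splitlines():
        match = _REDIRECT_RE.search(line)
        if match:
            zones.setdefault(match.group(1), set()).add(match.group(2))
    return zones


# The two journal lines shown above.
sample = """\
May 12 10:50:55 cloudservices1005 pdns-recursor[1407057]: Redirecting queries for zone 'wikimedia.cloud' to: 208.80.154.148:53
May 12 11:50:54 cloudservices1005 pdns-recursor[1419986]: Redirecting queries for zone 'wikimedia.cloud' to: 208.80.154.148:53
"""

print(forwarded_zones(sample))  # {'wikimedia.cloud': {'208.80.154.148:53'}}
```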

Change 919341 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:opesntack::pdns: remove manual forwarding rules for wikimedia.cloud

https://gerrit.wikimedia.org/r/919341

Change 919341 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] P:opesntack::pdns: remove manual forwarding rules for wikimedia.cloud

https://gerrit.wikimedia.org/r/919341

taavi claimed this task.