
Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org
Closed, ResolvedPublic

Description

We've had reports that certain internet-connected systems are failing to resolve hostnames under toolforge.org. This follows changes WMCS made today that changed the IP address of ns1.openstack.eqiad1.wikimediacloud.org.

The wikimediacloud.org domain is hosted by ns[0-2].wikimedia.org, and this is working normally. It is returning the following two A records right now:

cathal@officepc:~$ dig +noall +answer A ns0.openstack.eqiad1.wikimediacloud.org. @ns0.wikimedia.org 
ns0.openstack.eqiad1.wikimediacloud.org. 300 IN	A 208.80.154.148
cathal@officepc:~$ dig +noall +answer A ns1.openstack.eqiad1.wikimediacloud.org. @ns0.wikimedia.org 
ns1.openstack.eqiad1.wikimediacloud.org. 3600 IN A 185.15.56.163

The first is a manual record directly in the zone file; the second is a Netbox-generated record included in it (the distinction is irrelevant here tbh).

I can see that if I query any of the .ORG TLD servers for the toolforge.org NS records, they return the two old A records for these hostnames in the 'additional' section:

cathal@officepc:~$ dig +nsid NS toolforge.org @b2.org.afilias-nst.org. 

; <<>> DiG 9.18.12-0ubuntu0.22.04.2-Ubuntu <<>> +nsid NS toolforge.org @b2.org.afilias-nst.org.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45215
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 2, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; NSID: 4c 48 52 35 ("LHR5")
;; QUESTION SECTION:
;toolforge.org.			IN	NS

;; AUTHORITY SECTION:
toolforge.org.		3600	IN	NS	ns0.openstack.eqiad1.wikimediacloud.org.
toolforge.org.		3600	IN	NS	ns1.openstack.eqiad1.wikimediacloud.org.

;; ADDITIONAL SECTION:
ns1.openstack.eqiad1.wikimediacloud.org. 3600 IN A 208.80.154.11
ns0.openstack.eqiad1.wikimediacloud.org. 3600 IN A 208.80.154.135

;; Query time: 40 msec
;; SERVER: 2001:500:48::1#53(b2.org.afilias-nst.org.) (UDP)
;; WHEN: Tue Sep 12 19:04:51 IST 2023
;; MSG SIZE  rcvd: 153

While there is no circular dependency here (the name servers for toolforge.org are not themselves under toolforge.org), it seems that the ORG TLD servers have these A records / IPs hard-coded as glue, and that may be what is causing the problems we are seeing.
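
For anyone retracing this, the mismatch can be seen by comparing the glue the .ORG servers include in the referral with what our own authoritative servers publish (a minimal sketch using the same hostnames as above; any of the .ORG TLD servers should behave the same):

# Glue the .ORG TLD server hands out alongside the toolforge.org delegation
dig +noall +additional NS toolforge.org @b2.org.afilias-nst.org.
# Authoritative answers from our own nameservers for the same hostnames
dig +noall +answer A ns0.openstack.eqiad1.wikimediacloud.org. @ns0.wikimedia.org
dig +noall +answer A ns1.openstack.eqiad1.wikimediacloud.org. @ns0.wikimedia.org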

Event Timeline

cmooney triaged this task as Medium priority. Sep 12 2023, 6:06 PM
cmooney created this task.

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

taavi raised the priority of this task from Medium to High. Sep 12 2023, 6:07 PM
taavi edited projects, added Cloud-VPS; removed Cloud-Services.
cmooney renamed this task from "Certain systems failing to resolve" to "Certain systems failing to resolve DNS entries under toolforge.org". Sep 12 2023, 6:08 PM
cmooney updated the task description.
cmooney updated the task description.

login.toolforge.org is not working either (even after flushing local DNS), so there is no way to SSH into TS.

taavi renamed this task from "Certain systems failing to resolve DNS entries under toolforge.org" to "Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org". Sep 12 2023, 6:18 PM
taavi renamed this task from "Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org" to "Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org".

@RobH if we can ask the registrar to change the IP they have for ns1.openstack.eqiad1.wikimediacloud.org to 185.15.56.163 I think it should solve it.

To confirm: I can email our MarkMonitor rep and start the process to change the IP assignment for this name server, but I want to double-check that this is the planned IP for future use and that it won't shift again in the near term?

That IP will not change. We will need a subsequent request when the ns0 IP changes, but we can deal with that later; right now it'd be premature to change anything with ns0, as it's still using the 208.80.154.135 IP.

So let's just ask them to change the IP they have for ns1.openstack.eqiad1.wikimediacloud.org to 185.15.56.163.

Naoya,

We're juggling around some nameservers in our cloud environment over here, and need to update one of them:

ns1.openstack.eqiad1.wikimediacloud.org to 185.15.56.163

Please let me know what authorizations are needed to make this happen, and implement it as soon as you can. Our sub-team over here already re-allocated that IP to the nameserver without realizing we had to ask you to update it on your end.

Thanks!

Update,

It turns out the other nameserver was migrated previously without an IP update either, so it's currently incorrect on the MarkMonitor/registrar side due to our lack of an update to you.
ns0.openstack.eqiad1.wikimediacloud.org to 208.80.154.148
ns1.openstack.eqiad1.wikimediacloud.org to 185.15.56.163

Please let us know when you receive this request, as it turns out this domain is effectively offline until we update this and it propagates out. Thank you!

I've not seen any change in what ORG is returning. I have made some routing and host-level iptables changes on cloudservices1006 to get the old IP to respond for now though:

cathal@officepc:~$ dig +nsid SOA toolforge.org @208.80.154.11 

; <<>> DiG 9.18.12-0ubuntu0.22.04.2-Ubuntu <<>> +nsid SOA toolforge.org @208.80.154.11
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31107
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; NSID: 63 6c 6f 75 64 73 65 72 76 69 63 65 73 31 30 30 36 ("cloudservices1006")
;; QUESTION SECTION:
;toolforge.org.			IN	SOA

;; ANSWER SECTION:
toolforge.org.		3600	IN	SOA	ns0.openstack.eqiad1.wikimediacloud.org. root.toolforge.org. 1694540525 3501 600 86400 3600

;; Query time: 104 msec
;; SERVER: 208.80.154.11#53(208.80.154.11) (UDP)
;; WHEN: Tue Sep 12 20:02:04 IST 2023
;; MSG SIZE  rcvd: 140

FWIW I took the following steps:

  1. Routed 208.80.154.11/32 to cloudservices1006 on cloudsw
  2. Updated cloudsw and cr1-eqiad routing policies to export/accept the announcement of this route from cloudsw to WMF core
  3. Updated iptables on cloudservices1006 to DNAT inbound traffic for the old ns1 IP to the new ns1 IP:
sudo iptables -t nat -A PREROUTING -i vlan1151 -d 208.80.154.11/32 -j DNAT --to 185.15.56.163

At this point it still wasn't working and I couldn't work out why; it wasn't the ACLs on the core routers connecting to the cloud VRF. It turned out there were NOTRACK rules in the RAW netfilter table preventing that last DNAT from working (the nat table is only consulted for packets that conntrack is tracking, so exempting UDP port 53 from connection tracking means the DNAT rule is never evaluated):

root@cloudservices1006:~# iptables -L -v --line -t raw 
Chain PREROUTING (policy ACCEPT 45M packets, 6524M bytes)
num   pkts bytes target     prot opt in     out     source               destination         
1        0     0 CT         tcp  --  any    any     anywhere             anywhere             tcp dpt:11213 NOTRACK
2    35701 3224K CT         tcp  --  any    any     anywhere             anywhere             tcp dpt:11000 NOTRACK
3      28M 2122M CT         udp  --  any    any     anywhere             anywhere             udp dpt:domain NOTRACK
4      28M 2122M CT         udp  --  any    any     anywhere             anywhere             udp dpt:domain NOTRACK
5        0     0 CT         tcp  --  any    any     anywhere             anywhere             tcp dpt:11213 NOTRACK

Chain OUTPUT (policy ACCEPT 45M packets, 8796M bytes)
num   pkts bytes target     prot opt in     out     source               destination         
1        0     0 CT         tcp  --  any    any     anywhere             anywhere             tcp spt:11213 NOTRACK
2    20213 7762K CT         tcp  --  any    any     anywhere             anywhere             tcp spt:11000 NOTRACK
3      27M 3626M CT         udp  --  any    any     anywhere             anywhere             udp spt:domain NOTRACK
4      27M 3626M CT         udp  --  any    any     anywhere             anywhere             udp spt:domain NOTRACK
5        0     0 CT         tcp  --  any    any     anywhere             anywhere             tcp spt:11213 NOTRACK

The rules at positions 3 and 4 in the PREROUTING and OUTPUT chains were the problem, so I removed them (deleting position 3 twice in each chain, since rule 4 moves up to position 3 after the first delete):

iptables -t raw -D PREROUTING 3 
iptables -t raw -D PREROUTING 3
iptables -t raw -D OUTPUT 3
iptables -t raw -D OUTPUT 3
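
For reference, a quick way to confirm the DNAT is now actually being applied (a sketch; the rule is the one added above, and positions/counters will differ):

# Packet counters on the nat PREROUTING chain should now increase for the DNAT rule,
# since removing the NOTRACK rules lets conntrack (and therefore the nat table) see UDP/53 again
sudo iptables -t nat -L PREROUTING -v -n --line-numbers
# Test query against the old ns1 IP, which should be answered via the DNAT to the new IP
dig +noall +answer SOA toolforge.org @208.80.154.11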

Once the ORG TLD starts returning the new records we can remove this.

I've temporarily reserved 208.80.154.11 in Netbox so it doesn't get used - we should remove that reservation once ORG has updated their records and we've reverted the above changes.

I'd not noticed in my initial comment above, but *neither* IP that ORG is returning was working earlier on.

Seems at some stage the IP for ns0 was changed from 208.80.154.135 to 208.80.154.148. So from that point on only ns1 (on 208.80.154.11) was actually working for the affected domains.

Today's change then caused ns1 to also fail, so both were failing, which explains the problems we've seen.

I investigated making the same routing change for 208.80.154.135/32, but that IP is now assigned to gerrit1003 so it's not possible.
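
As a quick check of which of the published glue addresses actually answer (a hedged sketch; +time/+tries just keep the failure case short):

# Old ns0 glue IP - now assigned to gerrit1003, so no DNS answer is expected here
dig +time=2 +tries=1 +noall +answer SOA toolforge.org @208.80.154.135
# Old ns1 glue IP - answers again for now via the temporary routing/NAT workaround
dig +time=2 +tries=1 +noall +answer SOA toolforge.org @208.80.154.11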

I'd like to understand where the requirement comes from for the "glue" A records that ORG is returning for these. As I understand it, a resolver looking up somename.toolforge.org would:

  1. Query the root DNS servers for the NS entries for org (or just the whole name)
    • Root zone returns NS entries for ORG, plus glue A/AAAA records for them
  2. Query the ORG TLD servers for the NS entries for toolforge.org
    • ORG TLD servers return the NS entries for toolforge.org
  3. NS entries for toolforge.org are under wikimediacloud.org, so query the ORG servers for NS entries for that
    • ORG TLD servers return ns[0-2].wikimedia.org, and 'glue' A records for these
  4. Query ns[0-2].wikimedia.org for A record for nsX.openstack.eqiad1.wikimediacloud.org

I'm not seeing any technical requirement here for the ORG TLD servers to have A records for the hostnames under wikimediacloud.org, only ns[0-2].wikimedia.org.
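
To make that concrete, the steps in the list above can be reproduced manually with dig (a sketch; the particular root and .ORG servers queried here are arbitrary choices):

# 1. Ask a root server about org. - returns the ORG NS set plus glue for them
dig +norecurse NS org. @a.root-servers.net.
# 2./3. Ask an ORG TLD server for the toolforge.org and wikimediacloud.org delegations
dig +norecurse NS toolforge.org @b2.org.afilias-nst.org.
dig +norecurse NS wikimediacloud.org @b2.org.afilias-nst.org.
# 4. Ask ns0.wikimedia.org (authoritative for wikimediacloud.org) for the nameserver's address
dig +norecurse A ns0.openstack.eqiad1.wikimediacloud.org. @ns0.wikimedia.org.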

RFC1034 describes glue records as being needed "if the name of the name server is itself in the subzone", adding that "these RRs are only necessary if the name server's name is "below" the cut".

Looking at RFC1912 I also don't see any requirement for this. It states that glue records "are required only in forward zone files for nameservers that are located in the subdomain of the current zone that is being delegated". So when delegating toolforge.org they shouldn't be needed, as the nameservers for it are not in toolforge.org?

I'm not sure if there are other opinions on this or something I'm missing. If these records are not needed in the ORG zone then surely it'd be easier to have them removed, making ns[0-2].wikimedia.org the only place we need to update them in future?

I had a chat with Brandon about this on irc. He confirmed that glue records were not strictly needed for toolforge.org / wikimediacloud.org, but that typically the registrar will include them in this kind of scenario.

I also checked the full org zone, to see what references were there for wikimediacloud.org:

cathal@officepc:~/Desktop/ORG$ grep wikimediacloud.org org_zone.txt 
ns0.openstack.eqiad1.wikimediacloud.org.	3600	in	a	208.80.154.135
ns1.openstack.eqiad1.wikimediacloud.org.	3600	in	a	208.80.154.11
toolforge.org.	3600	in	ns	ns0.openstack.eqiad1.wikimediacloud.org.
toolforge.org.	3600	in	ns	ns1.openstack.eqiad1.wikimediacloud.org.
toolserver.org.	3600	in	ns	ns0.openstack.eqiad1.wikimediacloud.org.
toolserver.org.	3600	in	ns	ns1.openstack.eqiad1.wikimediacloud.org.
wikimediacloud.org.	3600	in	ns	ns0.wikimedia.org.
wikimediacloud.org.	3600	in	ns	ns1.wikimedia.org.
wikimediacloud.org.	3600	in	ns	ns2.wikimedia.org.
wmcloud.org.	3600	in	ns	ns0.openstack.eqiad1.wikimediacloud.org.
wmcloud.org.	3600	in	ns	ns1.openstack.eqiad1.wikimediacloud.org.
wmflabs.org.	3600	in	ns	ns0.openstack.eqiad1.wikimediacloud.org.
wmflabs.org.	3600	in	ns	ns1.openstack.eqiad1.wikimediacloud.org.

It looks like there is nothing there that necessitates the presence of the first two A records. We should maybe inquire as to whether they can be removed, so this headache can be avoided in future.

That said, it's of limited benefit I think. For this particular domain we could probably get away without glue records, but for many other domains we can't. So the lesson is that we always need to be conscious of this kind of thing and, in general, take the following approach (a rough verification sketch follows the list):

  1. Create new server / IP and ensure it is working and answering requests as needed
  2. Modify the A records in our own zones to point to that new IP
  3. Modify all 'glue' records with the registrar / TLD operator to point to the new IP
  4. Wait for propagation time (and then some)
  5. Decom the old IP(s) and servers
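
As a rough sketch of the checks for steps 3 and 4 (the hostnames and the .ORG server below are the ones from this incident; substitute as appropriate):

# Glue published by the TLD for each affected delegation...
for zone in toolforge.org toolserver.org wmcloud.org wmflabs.org; do
    echo "== $zone =="
    dig +noall +additional NS "$zone" @b2.org.afilias-nst.org.
done
# ...should match the A records our own authoritative servers return
dig +noall +answer A ns0.openstack.eqiad1.wikimediacloud.org. @ns0.wikimedia.org
dig +noall +answer A ns1.openstack.eqiad1.wikimediacloud.org. @ns0.wikimedia.org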

Decommissioning the old servers/IPs before everything has updated / propagated is always going to be problematic so we need to avoid it.

This is my fault: I did T346033: cloudservices1004: decomission ahead of time. We had something similar in codfw1dev but forgot to keep it on the radar for eqiad1.

Thank you all for fixing the mess.

I've done a bit of research into the last few times we updated the cloudservices boxes, for example when cloudservices1005 was put into service ({T303415}), or before that when we reshuffled the DNS service names (T243766: Cloud DNS: proposal for new DNS service names).

I can confirm that sending updates to MarkMonitor has been a very inconsistent practice over the years, as you just discovered. The exception has been in cases where the need was obvious (T247971: Cloud DNS: update markmonitor entries).
Basically, for the most part, we have contacted them only when problems have happened, like in this particular ticket.

Other than refreshing our DNS docs to point to MarkMonitor more prominently, I don't immediately know how to make this whole thing less prone to human error.

How to continue with this DNS migration is described in T346042#9163506.

I will close this ticket once we see MarkMonitor's latest updates applied:

ns0.openstack.eqiad1.wikimediacloud.org to 208.80.154.148
ns1.openstack.eqiad1.wikimediacloud.org to 185.15.56.163

@aborrero I think it's probably relatively safe to close this in the morning.

Changes went through in the org zone earlier:

cathal@officepc:~$ dig +noall +additional NS toolforge.org @b2.org.afilias-nst.org.
ns0.openstack.eqiad1.wikimediacloud.org. 3600 IN A 208.80.154.148
ns1.openstack.eqiad1.wikimediacloud.org. 3600 IN A 185.15.56.163

Based on a tcpdump I've had running in a loop every 10 minutes, queries to 208.80.154.11 have now slowed to a trickle, so by tomorrow morning it should be fine to remove the routing and NAT rule for the old ns1 IP.
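
For anyone wanting to repeat that check, something along these lines would do (a sketch; the interface name matches the capture further down in this task, and the exact filter/interval are assumptions):

# Count DNS queries still arriving at the old ns1 IP in 10-minute windows;
# removing the workaround is safe once this stays near zero.
while true; do
    sudo timeout 600 tcpdump -n -i eno12399np0 'dst host 208.80.154.11 and udp port 53' | wc -l
done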

cmooney lowered the priority of this task from High to Low. Sep 13 2023, 6:30 PM

Requests are only coming in at a rate of about 5 every 10 minutes at this stage.

@aborrero I did notice some queries coming from the cloudgw IP, seems to be related to ACME / Let's Encrypt?

tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eno12399np0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:20:55.736564 IP 185.15.56.1.54505 > 208.80.154.11.53: 30725+ TXT? _acme-challenge.toolforge.org. (47)
10:20:55.739874 IP 208.80.154.11.53 > 185.15.56.1.54505: 30725*- 0/1/0 (124)
10:20:57.763594 IP 185.15.56.1.58026 > 208.80.154.11.53: 62244+ TXT? _acme-challenge.toolforge.org. (47)
10:20:57.763818 IP 208.80.154.11.53 > 185.15.56.1.58026: 62244*- 0/1/0 (124)
10:20:59.353645 IP 185.15.56.1.40842 > 208.80.154.11.53: 31332+ TXT? _acme-challenge.tools.wmflabs.org. (51)
10:20:59.358047 IP 208.80.154.11.53 > 185.15.56.1.40842: 31332*- 2/0/0 TXT "iAVdzPkNygFUxAoNYDne3dx49m7Kw02pbDFotyhZBeQ", TXT "mzjm6ugGJB-SRYQ8hmYpUlq47ygkvfP_uECZUGlcxbc" (163)
10:20:59.363674 IP 185.15.56.1.60214 > 208.80.154.11.53: 63040+ TXT? _acme-challenge.tools.wmflabs.org. (51)
10:20:59.363854 IP 208.80.154.11.53 > 185.15.56.1.60214: 63040*- 2/0/0 TXT "iAVdzPkNygFUxAoNYDne3dx49m7Kw02pbDFotyhZBeQ", TXT "mzjm6ugGJB-SRYQ8hmYpUlq47ygkvfP_uECZUGlcxbc" (163)
10:21:06.169391 IP 185.15.56.1.44458 > 208.80.154.11.53: 42585+ TXT? _acme-challenge.mail.tools.wmcloud.org. (56)
10:21:06.175814 IP 208.80.154.11.53 > 185.15.56.1.44458: 42585 NXDomain*- 0/1/0 (141)
10 packets captured
29 packets received by filter
0 packets dropped by kernel

Might the old IP still be hardcoded in some places? Possibly instances whose outbound traffic is NATed behind that cloudgw IP?

Yep, that's all of the acme-chief instances in Cloud VPS. I've updated all of them via Horizon and made https://gerrit.wikimedia.org/r/c/operations/puppet/+/957688/ to make that easier next time.

aborrero claimed this task.

Thanks to everyone involved in the debugging and fix.

Posting the recently published RFC below, as it provides a little more clarity:

https://www.rfc-editor.org/rfc/rfc9471.txt

Ultimately it does not change the conclusion we reached above, i.e. "A records are not strictly needed in the ORG zone for the wikimediacloud.org NS servers, as wikimedia.org is authoritative for that zone and there are glue records for ns[0-2].wikimedia.org. Although not technically required, they are nevertheless present, which is common practice and something we need to be mindful of."