
DNS lookups for nih.gov hosts failing from Cloud VPS/Toolforge, services (citoid)
Closed, Resolved (Public)

Description

Filing this based on https://lists.wikimedia.org/pipermail/cloud/2019-June/000720.html

Essentially, this works on just about any machine:

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12345'

but on Toolforge we get:

curl: (6) Could not resolve host: eutils.ncbi.nlm.nih.gov

This website (aka PubMed) is quite essential to some of our tools, e.g. citation-template-filling and sourcemd.
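
A tool that depends on this host could check for the failure itself before making API calls. The following is only a minimal sketch using Python's standard library; the function name `can_resolve` and the injectable `lookup` parameter are illustrative and not part of any tool mentioned here.

```python
import socket


def can_resolve(hostname, lookup=socket.getaddrinfo):
    """Return True if the system resolver can map hostname to an address.

    `lookup` is injectable so the check can be exercised without a network.
    """
    try:
        lookup(hostname, 443)
    except socket.gaierror:
        # Name resolution failed: the same class of error that curl
        # reports as "(6) Could not resolve host".
        return False
    return True
```

Calling `can_resolve("eutils.ncbi.nlm.nih.gov")` on an affected host would be expected to return False while the problem persists, because the lookup goes through the same system resolver that curl uses.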

Event Timeline

From outside of the Foundation's network (DNS resolve success):

$ dig eutils.ncbi.nlm.nih.gov +trace

; <<>> DiG 9.9.5-3ubuntu0.19-Ubuntu <<>> eutils.ncbi.nlm.nih.gov +trace
;; global options: +cmd
.                       231637  IN      NS      a.root-servers.net.
.                       231637  IN      NS      b.root-servers.net.
.                       231637  IN      NS      c.root-servers.net.
.                       231637  IN      NS      d.root-servers.net.
.                       231637  IN      NS      e.root-servers.net.
.                       231637  IN      NS      f.root-servers.net.
.                       231637  IN      NS      g.root-servers.net.
.                       231637  IN      NS      h.root-servers.net.
.                       231637  IN      NS      i.root-servers.net.
.                       231637  IN      NS      j.root-servers.net.
.                       231637  IN      NS      k.root-servers.net.
.                       231637  IN      NS      l.root-servers.net.
.                       231637  IN      NS      m.root-servers.net.
.                       231637  IN      RRSIG   NS 8 0 518400 20190702050000 20190619040000 25266 . 4jjoT5qvgVT+hlTYvCjffg5EnOUpN5ZjZNepaYrUPBKieIJ5g7GBa6dj Z5VxOnsXgU5eJDu6GdkYKTCvC/gSGNGoAOtPlGs5/DMKBGfLHQ3gTfV6 b0jmRLXdPUz+/1pjGWixpMoxwOpckbG7h2v0tlmRDKoy1J/yw63Bn9tn jn8xkr/bKPbY+rJA8oW5Qio/gmgI2LRLo8euKpIxh59Vw4KeBDQQiEsG DZr/2ff9A/VLDGdWW2RHSPCC3AUXgCWs135YkVQd+4CFuORuovv38r6R Ax4tD+JuW377MR7zLPPaHgOZWjsPQ4VFj9NLwHklvHUJlZDlv6hIzkn1 UcrLLw==
;; Received 525 bytes from 8.8.8.8#53(8.8.8.8) in 10 ms

gov.                    172800  IN      NS      b.gov-servers.net.
gov.                    172800  IN      NS      d.gov-servers.net.
gov.                    172800  IN      NS      a.gov-servers.net.
gov.                    172800  IN      NS      c.gov-servers.net.
gov.                    86400   IN      DS      7698 8 2 6BC949E638442EAD0BDAF0935763C8D003760384FF15EBBD5CE86BB5 559561F0
gov.                    86400   IN      DS      7698 8 1 6F109B46A80CEA9613DC86D5A3E065520505AAFE
gov.                    86400   IN      RRSIG   DS 8 1 86400 20190702170000 20190619160000 25266 . E8/qSVWJ+pEm079Nv1hGFkUdtGCD9PrLRK4NXX0LhwxcRlclx+rLX4YB 9FSE4b03R0BCaqX/h1eO1+x5GaHD4FwH4sIg1TSUM/wFbJBGNazNc/pr gcs3FPiPJ0wKcqQ8+QYL6p/VP0CscwjK30nkFPyh2Aihpbh0Pyo9YOky Z0HP9v0U/2JTlJxQyZ8BYXM8jiPJPIAso2OavG7IWcZh0Q9CJl5VDGWN lumdmR6Tlwme/BalITMc2H5xTUky01h2FBAQ5p1r25aS5lcWAg8Jhxun KfKIgGcvYSEHoHUFpFAFcpy0g7IXyM/ZeY9zr0KbD/MEhB7kCz9RgqLJ HWOE6g==
;; Received 678 bytes from 2001:500:200::b#53(b.root-servers.net) in 184 ms

nih.gov.                86400   IN      NS      ns.nih.gov.
nih.gov.                86400   IN      NS      ns3.nih.gov.
nih.gov.                86400   IN      NS      ns2.nih.gov.
nih.gov.                3600    IN      DS      34839 7 1 ABB7C3B8674986AC735410A34BA16768E653D4A6
nih.gov.                3600    IN      DS      34839 7 2 99F550CF4568E7A8FACB6EE8A1CBF32D94207265CFD53C5525E905ED E5FB76AC
nih.gov.                3600    IN      RRSIG   DS 8 2 3600 20190626161007 20190619161007 43583 gov. gmTq50Gxd/46ZL/sjzFxnWrB3TOixIrFeSke2FQPf1hGR+QswNATSkjN 0JyVLHvAXUqGHC3jTx+xRHFpmT+XOaHYH3Butgn0XEacHzt6m1zf3sR+ U4egoKZiV5tyz903hcVh/tyJmIaR++S1JkXueswHgoTAQQjVbCGBA5ZL sGc=
;; Received 484 bytes from 209.112.123.30#53(b.gov-servers.net) in 1763 ms

eutils.ncbi.nlm.nih.gov. 86400  IN      CNAME   eutils.wip.ncbi.nlm.nih.gov.
eutils.ncbi.nlm.nih.gov. 86400  IN      RRSIG   CNAME 7 5 86400 20191130185748 20190603185748 52670 ncbi.nlm.nih.gov. Dgo7Xc2drwmJoMzBD68sC+2flgpdURG+Ag5MtlybE3jM49evqSIsh1kR Kf/WAjom8U3oBGwarzkXLTwyJtC0/Nca3+v+3XH5OyahGZbBrPWxbLxY 0JSwAtefMl4SzZpNCEjICqtbYTfhJlW0fS2QPMABGFKXbD5Nhgcznvxn YzAIfDL693TytQl/OPgI5ij4UlZaP92wI232dGrSozrwvxlOk+UseYhF 85akowdcPuvIy6u/Dn5ZDAiEbIdlxXKvPBTPRatEIVLnY3PASi53p1zu 8WG49HhOaL1oNfLfR/eKbyfRYYcRBDujTy1HeghEChOl2p0qu/ufXZwv L1i6xg==
wip.ncbi.nlm.nih.gov.   7200    IN      NS      gslb02.nlm.nih.gov.
wip.ncbi.nlm.nih.gov.   7200    IN      NS      gslb01.nlm.nih.gov.
wip.ncbi.nlm.nih.gov.   7200    IN      NS      gslb03.nlm.nih.gov.
wip.ncbi.nlm.nih.gov.   86400   IN      DS      59154 7 1 54BA34660FCCB95981E4024C4D7E8C00295041E9
wip.ncbi.nlm.nih.gov.   86400   IN      RRSIG   DS 7 5 86400 20191130185748 20190603185748 52670 ncbi.nlm.nih.gov. gnn+LbzT2Gm7Sjhfxr1C43gW+01a1nDR5BzoibuqPOExfA8mIC1zZnt1 QJ7PYYS2qdhKiUE8YIGYZ7pAq8Y1sCJLjwhntog8BV5IESaTv56IUbav dM5ZQY86YwR8lcNdxarBmQRwgm2XFQLxVAmpjLUC5nDETuQQcKQjWP09 pWqYLnbGy0g9dyp+7Z3GmfPodRSMwkCeVend14/dbme96+QElxb2e3I0 oLrvoixxFhQ8qXBdrPO+Mgu6rZUu0yUD9m0s7ZYP+sHTa+VWSdjduYsC nZcW7DmGkfxc9aCkalrdRDzsh82u8B5MQYihGHh14+BmsMPvkCypojPs gpRJew==
;; Received 1942 bytes from 2607:f220:402:1801::a570:4e6#53(ns3.nih.gov) in 204 ms

From a Toolforge bastion (DNS resolve timeout contacting the last server in the chain):

$ dig eutils.ncbi.nlm.nih.gov +trace

; <<>> DiG 9.10.3-P4-Debian <<>> eutils.ncbi.nlm.nih.gov +trace
;; global options: +cmd
.                       85156   IN      NS      b.root-servers.net.
.                       85156   IN      NS      m.root-servers.net.
.                       85156   IN      NS      c.root-servers.net.
.                       85156   IN      NS      k.root-servers.net.
.                       85156   IN      NS      l.root-servers.net.
.                       85156   IN      NS      f.root-servers.net.
.                       85156   IN      NS      e.root-servers.net.
.                       85156   IN      NS      j.root-servers.net.
.                       85156   IN      NS      h.root-servers.net.
.                       85156   IN      NS      d.root-servers.net.
.                       85156   IN      NS      g.root-servers.net.
.                       85156   IN      NS      a.root-servers.net.
.                       85156   IN      NS      i.root-servers.net.
.                       85156   IN      RRSIG   NS 8 0 518400 20190702170000 20190619160000 25266 . pchuv38vtcbfIkCIXw60luD1hhigpsFbT5RWAnKq6RcYzyUDXvL15GHd RwmRObYumKMRaWmQZ9AJ9j7bpRZlQCNlsHCAwGPCI3nfXjmTJC4VJJ/t 622QXpfwKAP9gJMze2kAL68TiK6hJ+dzQPItrOMWGKbiEC9fFET8UnEd MneXhj0g0U2a2xx+cEjHWFjO0VfzLtq9tLsCNO9WVSpmNo0V9771Kgcb yQ7l/ZNxQZjM5zoFsLWxIP3BylcywjH+Onx/T9SksDvYRlwqNUh+4zNP yiMo/tpURnPXS9sDa7Q1KGdIj65sCOA21U3x6TdbrNm6/eH5EFhdimBJ rn1o4A==
;; Received 525 bytes from 208.80.154.143#53(208.80.154.143) in 5 ms

gov.                    172800  IN      NS      d.gov-servers.net.
gov.                    172800  IN      NS      b.gov-servers.net.
gov.                    172800  IN      NS      a.gov-servers.net.
gov.                    172800  IN      NS      c.gov-servers.net.
gov.                    86400   IN      DS      7698 8 1 6F109B46A80CEA9613DC86D5A3E065520505AAFE
gov.                    86400   IN      DS      7698 8 2 6BC949E638442EAD0BDAF0935763C8D003760384FF15EBBD5CE86BB5 559561F0
gov.                    86400   IN      RRSIG   DS 8 1 86400 20190702170000 20190619160000 25266 . E8/qSVWJ+pEm079Nv1hGFkUdtGCD9PrLRK4NXX0LhwxcRlclx+rLX4YB 9FSE4b03R0BCaqX/h1eO1+x5GaHD4FwH4sIg1TSUM/wFbJBGNazNc/pr gcs3FPiPJ0wKcqQ8+QYL6p/VP0CscwjK30nkFPyh2Aihpbh0Pyo9YOky Z0HP9v0U/2JTlJxQyZ8BYXM8jiPJPIAso2OavG7IWcZh0Q9CJl5VDGWN lumdmR6Tlwme/BalITMc2H5xTUky01h2FBAQ5p1r25aS5lcWAg8Jhxun KfKIgGcvYSEHoHUFpFAFcpy0g7IXyM/ZeY9zr0KbD/MEhB7kCz9RgqLJ HWOE6g==
;; Received 681 bytes from 192.36.148.17#53(i.root-servers.net) in 0 ms

nih.gov.                86400   IN      NS      ns.nih.gov.
nih.gov.                86400   IN      NS      ns3.nih.gov.
nih.gov.                86400   IN      NS      ns2.nih.gov.
nih.gov.                3600    IN      DS      34839 7 1 ABB7C3B8674986AC735410A34BA16768E653D4A6
nih.gov.                3600    IN      DS      34839 7 2 99F550CF4568E7A8FACB6EE8A1CBF32D94207265CFD53C5525E905ED E5FB76AC
nih.gov.                3600    IN      RRSIG   DS 8 2 3600 20190626161007 20190619161007 43583 gov. gmTq50Gxd/46ZL/sjzFxnWrB3TOixIrFeSke2FQPf1hGR+QswNATSkjN 0JyVLHvAXUqGHC3jTx+xRHFpmT+XOaHYH3Butgn0XEacHzt6m1zf3sR+ U4egoKZiV5tyz903hcVh/tyJmIaR++S1JkXueswHgoTAQQjVbCGBA5ZL sGc=
;; Received 484 bytes from 69.36.153.30#53(c.gov-servers.net) in 11 ms

;; connection timed out; no servers could be reached

From a random server inside the Foundation's prod network (DNS resolve timeout contacting the .gov nameserver pool):

$ dig eutils.ncbi.nlm.nih.gov +trace

; <<>> DiG 9.10.3-P4-Debian <<>> eutils.ncbi.nlm.nih.gov +trace
;; global options: +cmd
.                       82386   IN      NS      m.root-servers.net.
.                       82386   IN      NS      e.root-servers.net.
.                       82386   IN      NS      i.root-servers.net.
.                       82386   IN      NS      d.root-servers.net.
.                       82386   IN      NS      c.root-servers.net.
.                       82386   IN      NS      h.root-servers.net.
.                       82386   IN      NS      a.root-servers.net.
.                       82386   IN      NS      b.root-servers.net.
.                       82386   IN      NS      g.root-servers.net.
.                       82386   IN      NS      j.root-servers.net.
.                       82386   IN      NS      k.root-servers.net.
.                       82386   IN      NS      f.root-servers.net.
.                       82386   IN      NS      l.root-servers.net.
.                       82386   IN      RRSIG   NS 8 0 518400 20190702170000 20190619160000 25266 . pchuv38vtcbfIkCIXw60luD1hhigpsFbT5RWAnKq6RcYzyUDXvL15GHd RwmRObYumKMRaWmQZ9AJ9j7bpRZlQCNlsHCAwGPCI3nfXjmTJC4VJJ/t 622QXpfwKAP9gJMze2kAL68TiK6hJ+dzQPItrOMWGKbiEC9fFET8UnEd MneXhj0g0U2a2xx+cEjHWFjO0VfzLtq9tLsCNO9WVSpmNo0V9771Kgcb yQ7l/ZNxQZjM5zoFsLWxIP3BylcywjH+Onx/T9SksDvYRlwqNUh+4zNP yiMo/tpURnPXS9sDa7Q1KGdIj65sCOA21U3x6TdbrNm6/eH5EFhdimBJ rn1o4A==
;; Received 525 bytes from 208.80.154.254#53(208.80.154.254) in 0 ms

;; connection timed out; no servers could be reached

For the record: it works fine from servers on the Foundation's public network.

dig @8.8.8.8 eutils.ncbi.nlm.nih.gov works from inside Toolforge. This is looking like the default DNS resolvers for all of the Cloud VPS tenant space (208.80.154.143 and 208.80.154.24) being blocked by the ns*.nih.gov DNS primaries.
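
The comparison between resolvers can also be scripted. The sketch below hand-builds a minimal DNS query with the standard library and sends it to one specific resolver over UDP; the names `build_query` and `ask` are illustrative, and the code makes no attempt at retries, EDNS, or answer parsing.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query: one question, class IN, recursion desired."""
    # Header: id, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def ask(resolver_ip, name, timeout=3.0):
    """Send one UDP query to resolver_ip; return its RCODE, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver_ip, 53))
        try:
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return None
    if len(data) < 4:
        return None
    return data[3] & 0x0F  # RCODE: 0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN
```

Running `ask("8.8.8.8", "eutils.ncbi.nlm.nih.gov")` next to `ask("208.80.154.143", "eutils.ncbi.nlm.nih.gov")` from a Cloud VPS instance would show whether the two resolvers disagree for the same name.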

Case number is CAS-385915-W6L6G7 opened via the support form at support.nlm.nih.gov:

I manage the Operations team for Wikimedia Cloud Services. We are a department within the Wikimedia Foundation, the 501c3 supporting Wikipedia and related projects, who operate a shared computing environment used by Wikimedia movement volunteers to operate software tools that assist Wikipedia editors in maintaining the encyclopedia and related Open Knowledge projects.

Based on an error report by a volunteer, we are investigating DNS resolution issues when attempting to resolve `eutils.ncbi.nlm.nih.gov.` from inside our network. Various details have been collected at <https://phabricator.wikimedia.org/T226088>, but the summary seems to be that our DNS recursors (208.80.154.143 and 208.80.154.24) are being blocked from receiving responses from the ns.nih.gov, ns2.nih.gov, and ns3.nih.gov authoritative DNS servers.

It is quite possible that some user of our shared environment triggered this block by being overly aggressive in DNS lookups or web API usage of some kind. The first thing we would like to verify is if such a block is being actively maintained on the part of nih.gov or not. If so, the follow up question is what steps can we take to restore your trust and get the block lifted. If the block is not active, meaning it comes from some type of realtime blackhole list subscription being applied to the nih.gov authoritative servers, we would like to know if you can determine which RBL we have landed on.

Thanks for your time,
Bryan
bd808 renamed this task from DNS from toolforge not working for some host(s) to DNS lookups for nih.gov hosts failing from Cloud VPS/Toolforge. Jun 20 2019, 2:55 PM

I got a reply back from support.nlm.nih.gov. They apparently are not directly connected with the group that manages the nih.gov nameservers. They were kind enough to provide me with a URL to submit a trouble ticket for the nih.gov network operations folks (https://itservicedesk.nih.gov/support/). Sadly, that URL is one that I had found before and seems to be a hostname that is not resolvable outside of the nih.gov network. I have responded to them asking if there is any other contact point for their NOC.

Until we get a reply, maybe we could just add the NIH domains to /etc/hosts.

At least the most important domains so citoid works properly:

eutils.ncbi.nlm.nih.gov

www.ncbi.nlm.nih.gov

Thank you!

Until we get a reply, maybe we could just add the NIH domains to /etc/hosts.

There are 3 different environments that would need this: Toolforge bastions, grid engine exec nodes, and Kubernetes containers. We could use Puppet to manage /etc/hosts on the bastion and exec nodes, but that won't help us with the Kubernetes containers. This also creates the potential for random-seeming breakage from the tool maintainer's point of view, as the upstream resolution for the hosts could change at any time and we would have to notice, update Puppet, and apply the changes to restore functionality. None of this is impossible, but it's not trivial.
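
To make the "we would have to notice" part concrete, here is a rough sketch of a drift check for pinned addresses. Everything in it is hypothetical (the function names, the `pinned` mapping, the idea of running it on a schedule); it only illustrates the extra moving part that pinning would introduce.

```python
import socket


def current_ipv4(hostname):
    """Return the set of IPv4 addresses the resolver publishes for hostname now."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    return {info[4][0] for info in infos}


def stale_pins(pinned, resolve=current_ipv4):
    """Return the hostnames whose pinned address is no longer published.

    `pinned` maps hostname -> the single address written to /etc/hosts.
    `resolve` is injectable so the logic can be tested without a network.
    """
    return sorted(
        host for host, ip in pinned.items() if ip not in resolve(host)
    )
```

Such a check would have to run on a machine that can resolve the names and does not itself carry the pins, because `getaddrinfo` normally consults /etc/hosts first and would simply echo the pinned value back.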

I think in principle we should not be /etc/hostsing our way around other people's broken nameservers and restrictions.

I only know the basics of nameservers, but the other option I can see would be to use Google's 8.8.8.8 as a fallback resolver for the problematic domains, if there is a way to do that.

Is this related???
https://community.cloudflare.com/t/cannot-resolve-https-www-ncbi-nlm-nih-gov/15131

It is still true: http://dnsviz.net/d/www.ncbi.nlm.nih.gov/dnssec/ shows that they do not respond until the payload size is reduced.

Responses not being handled by our recursors actually makes a bit more sense to me than our recursors being blocked. The report from dnsviz.net shows several errors of "The server(s) were not responsive to queries over UDP." and warnings of "No response was received until the UDP payload size was decreased, indicating that the server might be attempting to send a payload that exceeds the path maximum transmission unit (PMTU) size."
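
For reference, the "UDP payload size" in those warnings is the value a client advertises in the EDNS0 OPT pseudo-record of its query. The sketch below builds that record by hand so the field is visible; the helper names `opt_record` and `with_edns` are illustrative, and `with_edns` assumes it is given a query that has no additional records yet.

```python
import struct


def opt_record(udp_payload_size, dnssec_ok=True):
    """Build an EDNS0 OPT pseudo-record (RFC 6891) with no options.

    The CLASS field carries the advertised UDP payload size; the DO bit
    (0x8000 in the flags part of the TTL field) asks for DNSSEC records.
    """
    flags = 0x8000 if dnssec_ok else 0
    # root name, TYPE=41 (OPT), CLASS=size, ext-rcode, version, flags, RDLEN=0
    return b"\x00" + struct.pack("!HHBBHH", 41, udp_payload_size, 0, 0, flags, 0)


def with_edns(query, udp_payload_size, dnssec_ok=True):
    """Append an OPT record to a query and set ARCOUNT to 1."""
    return (
        query[:10]
        + struct.pack("!H", 1)
        + query[12:]
        + opt_record(udp_payload_size, dnssec_ok)
    )
```

Advertising a smaller size, for example `opt_record(1232)` instead of `opt_record(4096)`, is the kind of reduction the dnsviz warning describes: it asks the server to keep the reply small enough to avoid fragmentation on the path.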

@Bstorm's finding from T226088#5269608 that the prod and cloud recursors handle this lookup differently is still significant. The production and Cloud VPS recursors both run pdns-recursor with only minimal differences in configuration. One of those differences, however, is that the Cloud VPS recursors install a Lua script that hooks postresolve (https://doc.powerdns.com/recursor/lua-scripting/hooks.html#postresolve).

To try and get more information on what is going on, I enabled tracing with rec_control trace-regex '.*\.nih.gov\.$' on cloudservices1003 (cloud-recursor0.wikimedia.org). Watching /var/log/syslog immediately after this shows this activity:

Jul 07 03:12:04 1 [83608429/1] question for 'eutils.ncbi.nlm.nih.gov|A' from 172.16.3.159
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Wants DNSSEC processing in query for A
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Looking for CNAME cache hit of 'eutils.ncbi.nlm.nih.gov|CNAME'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: No CNAME cache hit of 'eutils.ncbi.nlm.nih.gov|CNAME' found
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: No cache hit for 'eutils.ncbi.nlm.nih.gov|A', trying to find an appropriate NS record
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Checking if we have NS in cache for 'eutils.ncbi.nlm.nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: no valid/useful NS in cache for 'eutils.ncbi.nlm.nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Checking if we have NS in cache for 'ncbi.nlm.nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: no valid/useful NS in cache for 'ncbi.nlm.nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Checking if we have NS in cache for 'nlm.nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: no valid/useful NS in cache for 'nlm.nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Checking if we have NS in cache for 'nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: NS (with ip, or non-glue) in cache for 'nih.gov' -> 'ns.nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: within bailiwick: 1,  in cache, ttl=72994
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: NS (with ip, or non-glue) in cache for 'nih.gov' -> 'ns3.nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: within bailiwick: 1,  in cache, ttl=72994
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: NS (with ip, or non-glue) in cache for 'nih.gov' -> 'ns2.nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: within bailiwick: 1,  in cache, ttl=72994
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: We have NS in cache for 'nih.gov' (flawedNSSet=0)
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Cache consultations done, have 3 NS to contact
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov.: Nameservers: ns2.nih.gov.(301.08ms), ns.nih.gov.(301.08ms), ns3.nih.gov.(301.08ms)
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Trying to resolve NS 'ns2.nih.gov' (1/3)
Jul 07 03:12:08 [83608429]    ns2.nih.gov: Wants DNSSEC processing in query for A
Jul 07 03:12:08 [83608429]    ns2.nih.gov: Looking for CNAME cache hit of 'ns2.nih.gov|CNAME'
Jul 07 03:12:08 [83608429]    ns2.nih.gov: No CNAME cache hit of 'ns2.nih.gov|CNAME' found
Jul 07 03:12:08 [83608429]    ns2.nih.gov: Found cache hit for A: 128.231.64.1[ttl=72994]
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Resolved 'nih.gov' NS ns2.nih.gov to: 128.231.64.1
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Trying IP 128.231.64.1:53, asking 'eutils.ncbi.nlm.nih.gov|A'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: timeout resolving after 1500.14msec
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Max fails reached resolving on 128.231.64.1. Going full throttle for 60 seconds
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Trying to resolve NS 'ns.nih.gov' (2/3)
Jul 07 03:12:08 [83608429]    ns.nih.gov: Wants DNSSEC processing in query for A
Jul 07 03:12:08 [83608429]    ns.nih.gov: Looking for CNAME cache hit of 'ns.nih.gov|CNAME'
Jul 07 03:12:08 [83608429]    ns.nih.gov: No CNAME cache hit of 'ns.nih.gov|CNAME' found
Jul 07 03:12:08 [83608429]    ns.nih.gov: Found cache hit for A: 128.231.128.251[ttl=72993]
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Resolved 'nih.gov' NS ns.nih.gov to: 128.231.128.251
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Trying IP 128.231.128.251:53, asking 'eutils.ncbi.nlm.nih.gov|A'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: timeout resolving after 1514.35msec
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Max fails reached resolving on 128.231.128.251. Going full throttle for 60 seconds
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Trying to resolve NS 'ns3.nih.gov' (3/3)
Jul 07 03:12:08 [83608429]    ns3.nih.gov: Wants DNSSEC processing in query for A
Jul 07 03:12:08 [83608429]    ns3.nih.gov: Looking for CNAME cache hit of 'ns3.nih.gov|CNAME'
Jul 07 03:12:08 [83608429]    ns3.nih.gov: No CNAME cache hit of 'ns3.nih.gov|CNAME' found
Jul 07 03:12:08 [83608429]    ns3.nih.gov: Found cache hit for A: 165.112.4.230[ttl=72991]
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Resolved 'nih.gov' NS ns3.nih.gov to: 165.112.4.230
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Trying IP 165.112.4.230:53, asking 'eutils.ncbi.nlm.nih.gov|A'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: timeout resolving after 1500.39msec
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Failed to resolve via any of the 3 offered NS at level 'nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: failed (res=-1)

I am not seeing much new information here, but maybe somebody else can spot something interesting?
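
For anyone who wants to pull the per-server outcome out of a longer trace, a small parser is enough. This is only a sketch that matches the two message shapes visible in the excerpt above ("Trying IP ...:53, asking" followed by "timeout resolving after ...msec"); other recursor versions may word these lines differently.

```python
import re

TRY_RE = re.compile(r"Trying IP (\S+):53, asking")
TIMEOUT_RE = re.compile(r"timeout resolving after ([\d.]+)msec")


def timeouts_by_server(lines):
    """Return (ip, milliseconds) for each attempt that ended in a timeout."""
    results = []
    last_ip = None
    for line in lines:
        match = TRY_RE.search(line)
        if match:
            last_ip = match.group(1)
            continue
        match = TIMEOUT_RE.search(line)
        if match and last_ip is not None:
            results.append((last_ip, float(match.group(1))))
            last_ip = None
    return results
```

Fed the trace above, this would list each of the three nih.gov nameserver addresses with the roughly 1.5 second wait recorded for it.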

I seem to be able to get a response using dig if I explicitly disable DNSSEC processing for the lookup:

$ dig +trace +nodnssec @cloud-recursor1.wikimedia.org. eutils.wip.ncbi.nlm.nih.gov. A

; <<>> DiG 9.10.3-P4-Debian <<>> +trace +nodnssec @cloud-recursor1.wikimedia.org. eutils.wip.ncbi.nlm.nih.gov. A
; (1 server found)
;; global options: +cmd
.                       82836   IN      NS      m.root-servers.net.
.                       82836   IN      NS      b.root-servers.net.
.                       82836   IN      NS      c.root-servers.net.
.                       82836   IN      NS      e.root-servers.net.
.                       82836   IN      NS      i.root-servers.net.
.                       82836   IN      NS      f.root-servers.net.
.                       82836   IN      NS      g.root-servers.net.
.                       82836   IN      NS      d.root-servers.net.
.                       82836   IN      NS      k.root-servers.net.
.                       82836   IN      NS      a.root-servers.net.
.                       82836   IN      NS      j.root-servers.net.
.                       82836   IN      NS      l.root-servers.net.
.                       82836   IN      NS      h.root-servers.net.
;; Received 239 bytes from 208.80.154.24#53(cloud-recursor1.wikimedia.org.) in 0 ms

gov.                    172800  IN      NS      a.gov-servers.net.
gov.                    172800  IN      NS      b.gov-servers.net.
gov.                    172800  IN      NS      c.gov-servers.net.
gov.                    172800  IN      NS      d.gov-servers.net.
;; Received 311 bytes from 199.7.83.42#53(l.root-servers.net) in 27 ms

nih.gov.                86400   IN      NS      ns.nih.gov.
nih.gov.                86400   IN      NS      ns3.nih.gov.
nih.gov.                86400   IN      NS      ns2.nih.gov.
;; Received 241 bytes from 69.36.153.30#53(c.gov-servers.net) in 3 ms

wip.ncbi.nlm.nih.gov.   7200    IN      NS      gslb02.nlm.nih.gov.
wip.ncbi.nlm.nih.gov.   7200    IN      NS      gslb01.nlm.nih.gov.
wip.ncbi.nlm.nih.gov.   7200    IN      NS      gslb03.nlm.nih.gov.
;; Received 251 bytes from 165.112.4.230#53(ns3.nih.gov) in 50 ms

eutils.wip.ncbi.nlm.nih.gov. 30 IN      A       130.14.29.110
;; Received 72 bytes from 130.14.252.50#53(gslb01.nlm.nih.gov) in 26 ms

Another way to get dig +trace to work from inside Toolforge: dig +trace +tcp eutils.wip.ncbi.nlm.nih.gov. A.
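
The +tcp variant works because DNS over TCP is not bound by the UDP datagram size: each message is simply sent with a two-byte length prefix (RFC 1035, section 4.2.2). A minimal sketch of that framing, with illustrative function names:

```python
import struct


def frame_for_tcp(message):
    """Prefix a DNS message with its 16-bit big-endian length for TCP."""
    return struct.pack("!H", len(message)) + message


def unframe_from_tcp(stream_bytes):
    """Extract the first DNS message from bytes read off a TCP stream."""
    (length,) = struct.unpack("!H", stream_bytes[:2])
    return stream_bytes[2:2 + length]
```

The message bytes themselves are the same as over UDP; only the transport and this prefix differ.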

Mvolz renamed this task from DNS lookups for nih.gov hosts failing from Cloud VPS/Toolforge to DNS lookups for nih.gov hosts failing from Cloud VPS/Toolforge, services (citoid). Jul 10 2019, 11:15 AM
Mvolz added projects: Services, Citoid.

This is affecting Citoid as well, see T227415.

If the production Citoid instance is affected as well, I believe that means that the production DNS recursors are having issues resolving the hosts too. That contradicts @Bstorm's finding in T226088#5269608, but that may have been the dig +trace ... type of check anyway?

The detailed logs I gathered in T226088#5311277 show the resolve failures inside one of the Cloud Services PDNS recursor servers:

Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Resolved 'nih.gov' NS ns2.nih.gov to: 128.231.64.1
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Trying IP 128.231.64.1:53, asking 'eutils.ncbi.nlm.nih.gov|A'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: timeout resolving after 1500.14msec
...
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Resolved 'nih.gov' NS ns.nih.gov to: 128.231.128.251
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Trying IP 128.231.128.251:53, asking 'eutils.ncbi.nlm.nih.gov|A'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: timeout resolving after 1514.35msec
...
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Resolved 'nih.gov' NS ns3.nih.gov to: 165.112.4.230
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Trying IP 165.112.4.230:53, asking 'eutils.ncbi.nlm.nih.gov|A'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: timeout resolving after 1500.39msec
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: Failed to resolve via any of the 3 offered NS at level 'nih.gov'
Jul 07 03:12:08 [83608429] eutils.ncbi.nlm.nih.gov: failed (res=-1)

I am relatively confident at this point that the problem is related to the DNSSEC records in the nih.gov zone files. These make the response payloads too large to fit into a single UDP response packet. In theory their server should recognize this, emit a UDP response packet that includes the first 512 bytes of data and also sets the TC bit in the header to indicate that this response was truncated. Our recursor software should receive this response, see that it was truncated, and re-request the lookup from the upstream server via a TCP connection. The switch to TCP allows the full response to be returned by the upstream server. Something somewhere is messing up this process. Instead we are sending out the request, waiting 1.5 seconds for a response, and then giving up and moving on to query the next published NS server for the zone.
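
The expected truncate-then-retry flow described above can be sketched in a few lines. This is not how pdns-recursor is implemented; it is a simplified illustration with injectable transport functions (`send_udp` and `send_tcp` are hypothetical callables that return response bytes, or None on timeout).

```python
TC_BIT = 0x0200  # "truncated" flag in the 16-bit DNS header flags field


def is_truncated(response):
    """Return True if the TC bit is set in a raw DNS response."""
    flags = int.from_bytes(response[2:4], "big")
    return bool(flags & TC_BIT)


def resolve_with_fallback(query, send_udp, send_tcp):
    """Try UDP first; repeat the query over TCP if the answer was truncated."""
    response = send_udp(query)
    if response is None:
        # No UDP reply at all: there is nothing to signal that a TCP retry
        # is needed, so the caller just sees a timeout.
        return None
    if is_truncated(response):
        return send_tcp(query)
    return response
```

The `None` branch corresponds to what the trace shows: the query goes out, no reply arrives within about 1.5 seconds, and the TCP retry is never triggered.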

We are running pdns-recursor v4.0.4. The upstream documentation about DNSSEC support in the recursor says:

As of 4.0.0, the PowerDNS Recursor has support for DNSSEC processing and experimental support for DNSSEC validation.

Warning
The DNSSEC implementation in the PowerDNS Recursor 4.0.x is known to have deficiencies due to its original design. When doing DNSSEC validation, ensure you are running 4.1.0 or later which has a fully reworked (and correct) DNSSEC implementation.

I take this warning to mean that DNSSEC + pdns-recursor 4.0.4 is an unstable combination. Before T221769: Upgrade cloudservices1003/1004 to stretch/mitaka was completed in early June (a week or so before this bug was filed) we would have been using pdns-recursor 3.x which had no DNSSEC support. Now we are using a version with buggy DNSSEC support. I think that we should try turning off DNSSEC in our recursor.conf file until we upgrade to a newer version of pdns-recursor. This should make things work more like the dig +nodnssec experiments I did in T226088#5311280 by default.
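
pdns-recursor 4.x exposes this as the `dnssec` setting, and `off` is one of its documented values. The contents of the eventual patch are not reproduced in this task, so the following is just an illustrative helper for checking what a given recursor.conf text sets; the function name and the sample text in the note below are made up.

```python
def dnssec_setting(conf_text):
    """Return the last dnssec= value found in conf_text, or None if unset.

    Assumes plain key=value lines with '#' comments.
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == "dnssec":
            value = val.strip()
    return value
```

For example, `dnssec_setting("threads=4\ndnssec=off\n")` returns `"off"`, and a file with no `dnssec` line returns `None`, meaning the recursor's built-in default applies.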

Change 521910 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] cloud: Disable DNSSEC for pdns-recursor

https://gerrit.wikimedia.org/r/521910

Change 521910 merged by Andrew Bogott:
[operations/puppet@production] cloud: Disable DNSSEC for pdns-recursor

https://gerrit.wikimedia.org/r/521910

Change 521921 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Disable DNSSEC for pdns-recursor in prod as well

https://gerrit.wikimedia.org/r/521921

Change 521921 merged by BBlack:
[operations/puppet@production] Disable DNSSEC for pdns-recursor in prod as well

https://gerrit.wikimedia.org/r/521921

bd808 claimed this task.

Lookups for eutils.ncbi.nlm.nih.gov inside Cloud VPS projects (including Toolforge) seem to be fixed!
