
cannot resolve tools-static.wmflabs.org to the correct host from within toolforge
Closed, Resolved · Public

Description

I am running a web-facing tool at https://tools.wmflabs.org/autodesc/, written in Node.js. It fetches the translations it uses via web requests, like so:

var server = 'tools-static.wmflabs.org' ;

		request({
			url: 'https://'+server+'/tooltranslate/data/autodesc/toolinfo.json',
			headers: {'user-agent': 'Mozilla/5.0'},
			json: true
		}, function (error, response, d) { ...

This worked fine until a few (2?) days ago. Now I get an error message:

{ Error: connect EHOSTUNREACH 10.68.22.238:443
    at Object.exports._errnoException (util.js:1018:11)
    at exports._exceptionWithHostPort (util.js:1041:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1086:14)
  code: 'EHOSTUNREACH',
  errno: 'EHOSTUNREACH',
  syscall: 'connect',
  address: '10.68.22.238',
  port: 443 }

This looks to me like the Kubernetes instance can't see or connect to tools-static.wmflabs.org; I have also tried tools.wmflabs.org and 208.80.155.174, to no avail. Please restore the previous behaviour.
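For debugging failures like this, it helps to separate the DNS step from the TCP connect step, since EHOSTUNREACH means the name resolved but no route exists to the resolved address. A minimal Python sketch (the `check_tcp` helper is a name chosen here, not part of the tool):

```python
import errno
import socket

def check_tcp(host, port, timeout=3.0):
    """Resolve host, then attempt a TCP connect; return a status string."""
    try:
        # DNS step: fails separately from the connect step.
        addr = socket.gethostbyname(host)
    except socket.gaierror as e:
        return f"DNS failure: {e}"
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return f"connected to {addr}:{port}"
    except OSError as e:
        # EHOSTUNREACH here is a routing problem: the name resolved,
        # but the resolved address is unreachable.
        name = errno.errorcode.get(e.errno, str(e))
        return f"connect to {addr}:{port} failed: {name}"
```

In this task the name resolved (to a stale address), so the failure shows up at the connect step rather than the DNS step.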

Event Timeline

I can confirm this:

root@tools-bastion-05:~# become autodesc
tools.autodesc@tools-bastion-05:~$ kubectl get pod -o wide
NAME                        READY     STATUS    RESTARTS   AGE       IP               NODE
autodesc-3932480877-83r6n   1/1       Running   1          5d        192.168.206.10   tools-worker-1020.tools.eqiad.wmflabs
tools.autodesc@tools-bastion-05:~$ kubectl exec -it autodesc-3932480877-83r6n /bin/bash
tools.autodesc@autodesc-3932480877-83r6n:/data/project/autodesc$ ping 208.80.155.174
bash: ping: command not found
tools.autodesc@autodesc-3932480877-83r6n:/data/project/autodesc$ cat < /dev/null > /dev/tcp/tools-static.wmflabs.org/80
bash: connect: No route to host
bash: /dev/tcp/tools-static.wmflabs.org/80: No route to host

This is the same outside the container.

06:51:15 0 ✓ zhuyifei1999@tools-bastion-02: ~$ curl tools-static.wmflabs.org -v
* Rebuilt URL to: tools-static.wmflabs.org/
* Hostname was NOT found in DNS cache
*   Trying 10.68.22.238...
* connect to 10.68.22.238 port 80 failed: No route to host
* Failed to connect to tools-static.wmflabs.org port 80: No route to host
* Closing connection 0
curl: (7) Failed to connect to tools-static.wmflabs.org port 80: No route to host

@Bstorm Could this be resolving to the instances that are now deleted? (for future reference: T182604)

zhuyifei1999 renamed this task from node.js EHOSTUNREACH to cannot resolve tools-static.wmflabs.org to the correct host from within toolforge. Mar 6 2018, 6:54 PM
zhuyifei1999 triaged this task as High priority.
bd808 added a subscriber: bd808.
tools-bastion-02.tools:~
bd808$ host tools-static.wmflabs.org
tools-static.wmflabs.org has address 10.68.22.238
tools-bastion-02.tools:~
bd808$ ping tools-static.wmflabs.org
PING tools-static.wmflabs.org (10.68.22.238) 56(84) bytes of data.
From tools-bastion-02.tools.eqiad.wmflabs (10.68.16.44) icmp_seq=1 Destination Host Unreachable
From tools-bastion-02.tools.eqiad.wmflabs (10.68.16.44) icmp_seq=2 Destination Host Unreachable
From tools-bastion-02.tools.eqiad.wmflabs (10.68.16.44) icmp_seq=3 Destination Host Unreachable
From tools-bastion-02.tools.eqiad.wmflabs (10.68.16.44) icmp_seq=4 Destination Host Unreachable
From tools-bastion-02.tools.eqiad.wmflabs (10.68.16.44) icmp_seq=5 Destination Host Unreachable
From tools-bastion-02.tools.eqiad.wmflabs (10.68.16.44) icmp_seq=6 Destination Host Unreachable
From tools-bastion-02.tools.eqiad.wmflabs (10.68.16.44) icmp_seq=7 Destination Host Unreachable
From tools-bastion-02.tools.eqiad.wmflabs (10.68.16.44) icmp_seq=8 Destination Host Unreachable
From tools-bastion-02.tools.eqiad.wmflabs (10.68.16.44) icmp_seq=9 Destination Host Unreachable
^C
--- tools-static.wmflabs.org ping statistics ---
11 packets transmitted, 0 received, +9 errors, 100% packet loss, time 10054ms
pipe 3

I'm on the trail of this, but haven't quite figured out why the DNS is messed up. It is definitely related to the recent move of the public tools-static.wmflabs.org floating IP address from tools-static-10.tools.wmflabs.org to tools-static-12.tools.wmflabs.org. The reverse DNS on the public IP is not updated yet, and the public-IP-to-private-IP mapping that we do in the split-horizon resolver seems to still point to the instance IP of the now-deleted tools-static-10 instance. It should be returning 10.68.20.97 internally.
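Conceptually, the split-horizon mapping described above is a lookup table from public floating IPs to private instance IPs, consulted only for clients inside the cloud. A minimal Python sketch of the idea (not the actual Lua aliaser; the public IP below is a placeholder from the RFC 5737 documentation range, since the real floating IP is not shown in this task — only 10.68.20.97 is taken from it):

```python
# Split-horizon alias table: the same name resolves to a public floating
# IP externally, but is rewritten to the instance's private IP for
# internal clients. A stale entry here produces exactly this task's bug:
# internal clients get the private IP of a deleted instance.
ALIASES = {
    # placeholder public IP -> private IP of tools-static-12
    "203.0.113.10": "10.68.20.97",
}

def resolve_internal(public_ip: str) -> str:
    """Return the private IP for an aliased public IP, else pass through."""
    return ALIASES.get(public_ip, public_ip)
```

When the floating IP moved between instances, the right-hand side of this table had to change; the bug was that the resolver kept serving the old table.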

$ dig @labs-recursor0.wikimedia.org. tools-static.wmflabs.org

; <<>> DiG 9.9.5-3ubuntu0.17-Ubuntu <<>> @labs-recursor0.wikimedia.org. tools-static.wmflabs.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46692
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;tools-static.wmflabs.org.      IN      A

;; ANSWER SECTION:
tools-static.wmflabs.org. 10952 IN      A       10.68.22.238

;; Query time: 1 msec
;; SERVER: 208.80.155.118#53(208.80.155.118)
;; WHEN: Tue Mar 06 20:59:20 UTC 2018
;; MSG SIZE  rcvd: 58
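The dig answer above still has 10952 seconds of TTL remaining, so even after the alias mapping is fixed, the stale record can keep being served from cache until it expires or the cache is flushed (which a recursor restart does). A toy Python sketch of TTL-based caching, assuming nothing about pdns-recursor internals:

```python
import time

class TTLCache:
    """Toy DNS-style answer cache: entries are served until their TTL
    expires, or until the cache is flushed (as a recursor restart does)."""

    def __init__(self):
        self._store = {}  # name -> (answer, expiry timestamp)

    def put(self, name, answer, ttl):
        self._store[name] = (answer, time.monotonic() + ttl)

    def get(self, name):
        entry = self._store.get(name)
        if entry and time.monotonic() < entry[1]:
            return entry[0]
        return None  # expired or never cached

    def flush(self):
        self._store.clear()
```

This is why the restart below immediately fixed resolution: it both reloaded the aliaser script and dropped cached answers like the one above.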

@Andrew 'fixed' this by restarting the DNS recursor on labservices1001.

$ ping tools-static.wmflabs.org
PING tools-static.wmflabs.org (10.68.20.97) 56(84) bytes of data.
64 bytes from tools-static-12.tools.eqiad.wmflabs (10.68.20.97): icmp_seq=1 ttl=64 time=1.02 ms
64 bytes from tools-static-12.tools.eqiad.wmflabs (10.68.20.97): icmp_seq=2 ttl=64 time=0.462 ms
64 bytes from tools-static-12.tools.eqiad.wmflabs (10.68.20.97): icmp_seq=3 ttl=64 time=0.797 ms
64 bytes from tools-static-12.tools.eqiad.wmflabs (10.68.20.97): icmp_seq=4 ttl=64 time=0.439 ms
64 bytes from tools-static-12.tools.eqiad.wmflabs (10.68.20.97): icmp_seq=5 ttl=64 time=0.400 ms
64 bytes from tools-static-12.tools.eqiad.wmflabs (10.68.20.97): icmp_seq=6 ttl=64 time=0.346 ms
^C
--- tools-static.wmflabs.org ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4997ms
rtt min/avg/max/mdev = 0.346/0.578/1.024/0.246 ms

I'm going to dig a bit deeper and see if I can figure out why this change wasn't picked up automatically.

rOPUP6a307eacd4ac: openstack: labs-ip-alias-dump as a cron rather than exec changed how the Python script that generates the Lua script (yeah, I know) is run. The old method ran it inline during the Puppet run every 20 minutes; if the run changed the file, Puppet told the pdns-recursor service to restart. The move to a cron task did not include a similar notification mechanism, so lookup changes made by the script lie dormant until something else triggers a pdns-recursor restart.
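The missing piece was change detection: regenerate the file and trigger a service reload only when its contents actually changed. A minimal Python sketch of that pattern (the `reload_service` callback is hypothetical; this is the idea behind Puppet's notify relationship, not the actual Puppet code):

```python
from pathlib import Path

def write_if_changed(path: Path, new_content: str, reload_service) -> bool:
    """Write new_content to path and call reload_service() only if the
    contents changed. Returns True when a write (and reload) happened.

    This mirrors what the inline Puppet exec provided via its notify,
    and what the cron-based generator was missing.
    """
    old = path.read_text() if path.exists() else None
    if old == new_content:
        return False  # no change: skip the write and the reload
    path.write_text(new_content)
    reload_service()
    return True
```

Without the `reload_service()` step, the file on disk is correct but the running daemon keeps using the version it loaded at startup, which is exactly the dormant-change failure described above.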

Change 416852 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] dns labsaliaser: reload lua script whenever it's updated.

https://gerrit.wikimedia.org/r/416852

Change 416852 merged by Andrew Bogott:
[operations/puppet@production] dns labsaliaser: reload lua script whenever it's updated.

https://gerrit.wikimedia.org/r/416852