
cannot resolve to the correct host from within toolforge
Closed, ResolvedPublic


I am running a web-facing tool written in node.js. It fetches its translations via web request, like so:

var request = require('request'); // assumed: the 'request' npm module, based on the callback shape
var server = '';

request({
			url: 'https://' + server + '/tooltranslate/data/autodesc/toolinfo.json',
			headers: {'user-agent': 'Mozilla/5.0'},
			json: true
		}, function (error, response, d) { ...

This worked fine until a few (2?) days ago. Now I get an error message:

{ Error: connect EHOSTUNREACH
    at Object.exports._errnoException (util.js:1018:11)
    at exports._exceptionWithHostPort (util.js:1041:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1086:14)
  errno: 'EHOSTUNREACH',
  syscall: 'connect',
  address: '',
  port: 443 }

This looks to me like the Kubernetes instance can't see/connect to the host; everything I have tried has been to no avail. Please restore the previous behaviour.

Event Timeline

I can confirm this:

root@tools-bastion-05:~# become autodesc
tools.autodesc@tools-bastion-05:~$ kubectl get pod -o wide
NAME                        READY     STATUS    RESTARTS   AGE       IP               NODE
autodesc-3932480877-83r6n   1/1       Running   1          5d
tools.autodesc@tools-bastion-05:~$ kubectl exec -it autodesc-3932480877-83r6n /bin/bash
<32480877-83r6n:/data/project/autodesc$ ping                  
bash: ping: command not found
<ct/autodesc$ cat < /dev/null > /dev/tcp/           
bash: connect: No route to host
bash: /dev/tcp/ No route to host

This is the same outside the container.

06:51:15 0 ✓ zhuyifei1999@tools-bastion-02: ~$ curl -v
* Rebuilt URL to:
* Hostname was NOT found in DNS cache
*   Trying
* connect to port 80 failed: No route to host
* Failed to connect to port 80: No route to host
* Closing connection 0
curl: (7) Failed to connect to port 80: No route to host

@Bstorm Could this be resolving to the instances that are now deleted? (for future reference: T182604)

zhuyifei1999 renamed this task from "node.js EHOSTUNREACH" to "cannot resolve to the correct host from within toolforge". Mar 6 2018, 6:54 PM
zhuyifei1999 triaged this task as High priority.
bd808 added a subscriber: bd808.
bd808$ host has address
bd808$ ping
PING ( 56(84) bytes of data.
From ( icmp_seq=1 Destination Host Unreachable
From ( icmp_seq=2 Destination Host Unreachable
From ( icmp_seq=3 Destination Host Unreachable
From ( icmp_seq=4 Destination Host Unreachable
From ( icmp_seq=5 Destination Host Unreachable
From ( icmp_seq=6 Destination Host Unreachable
From ( icmp_seq=7 Destination Host Unreachable
From ( icmp_seq=8 Destination Host Unreachable
From ( icmp_seq=9 Destination Host Unreachable
--- ping statistics ---
11 packets transmitted, 0 received, +9 errors, 100% packet loss, time 10054ms
pipe 3

I'm on the trail of this, but haven't quite figured out why the DNS is messed up. It is definitely related to the recent move of the public floating IP address. The reverse DNS on the public IP is not updated yet, and the public-IP-to-private-IP mapping that we do in the split-horizon resolver seems to still be pointing to the instance IP of the now-deleted tools-static-10 instance. It should be returning the new instance's private IP internally.
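Conceptually, the split-horizon resolver keeps a small alias table mapping public floating IPs to instance-private IPs and rewrites answers for internal clients from that table. A simplified sketch of that lookup (the addresses below are documentation-range examples, not the real mapping, which lives in a generated Lua script loaded by pdns-recursor):

```javascript
// Hypothetical alias table: public floating IP -> private instance IP.
// The real table is generated by labs-ip-alias-dump; the bug here was
// that it went stale after the floating IP moved to a new instance.
var aliases = {
  '203.0.113.10': '10.68.16.30' // example values only
};

// Rewrite an upstream A-record answer for internal clients; answers
// with no alias entry pass through unchanged.
function splitHorizon(publicIp) {
  return aliases[publicIp] || publicIp;
}

console.log(splitHorizon('203.0.113.10'));  // -> '10.68.16.30'
console.log(splitHorizon('198.51.100.1'));  // -> '198.51.100.1'
```

With a stale table, internal clients keep being handed the private IP of the deleted instance, matching the "No route to host" results above.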

$ dig

; <<>> DiG 9.9.5-3ubuntu0.17-Ubuntu <<>>
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46692
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;      IN      A

;; ANSWER SECTION: 10952 IN      A

;; Query time: 1 msec
;; WHEN: Tue Mar 06 20:59:20 UTC 2018
;; MSG SIZE  rcvd: 58

@Andrew 'fixed' this by restarting the DNS recursor on labservices1001.

$ ping
PING ( 56(84) bytes of data.
64 bytes from ( icmp_seq=1 ttl=64 time=1.02 ms
64 bytes from ( icmp_seq=2 ttl=64 time=0.462 ms
64 bytes from ( icmp_seq=3 ttl=64 time=0.797 ms
64 bytes from ( icmp_seq=4 ttl=64 time=0.439 ms
64 bytes from ( icmp_seq=5 ttl=64 time=0.400 ms
64 bytes from ( icmp_seq=6 ttl=64 time=0.346 ms
--- ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4997ms
rtt min/avg/max/mdev = 0.346/0.578/1.024/0.246 ms

I'm going to dig a bit deeper and see if I can figure out why this change wasn't picked up automatically.

rOPUP6a307eacd4ac: openstack: labs-ip-alias-dump as a cron rather than exec changed how the Python script that generates the Lua script (yeah, I know) is run. The old method ran it inline in the Puppet run every 20 minutes; if the run changed the file, Puppet told the pdns-recursor service to restart. The move to a cron task did not include a similar notification mechanism, so lookup changes made by the script lie dormant until something else triggers a pdns-recursor reload.

Change 416852 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] dns labsaliaser: reload lua script whenever it's updated.

Change 416852 merged by Andrew Bogott:
[operations/puppet@production] dns labsaliaser: reload lua script whenever it's updated.