labs DHCP server gives only a single DNS resolver (was: CI jobs failing with DNS resolution errors such as "Could not resolve host: gerrit.wikimedia.org")
Closed, Resolved (Public)
Description

The labs DHCP server only hands out a single DNS resolver:

A DHCP lease:

lease {
  interface "eth0";
  fixed-address 10.68.16.187;
  filename "jessie-installer/pxelinux.0";
  server-name "carbon.wikimedia.org";
  option subnet-mask 255.255.248.0;
  option dhcp-lease-time 86400;
  option routers 10.68.16.1;
  option dhcp-message-type 5;
  option dhcp-server-identifier 10.68.16.1;
  option unknown-209 "pxelinux.cfg/ttyS1-115200";
  option domain-name-servers 208.80.155.118;
  option dhcp-renewal-time 43200;
  option dhcp-rebinding-time 75600;
  option broadcast-address 10.68.23.255;
  option host-name "ci-jessie-wikimedia-147750";
  option domain-name "eqiad.wmflabs";
  renew 4 2016/06/16 07:55:36;
  rebind 4 2016/06/16 17:04:15;
  expire 4 2016/06/16 20:04:15;
}

Whatever DHCP server is answering would probably need to hand out the second labs resolver as well:

- option domain-name-servers 208.80.155.118;
+ option domain-name-servers 208.80.155.118, 208.80.154.20;

In puppet we have modules/openstack/templates/liberty/nova/dnsmasq-nova.conf.erb :

#Clients should use the designate-backed dns server rather than dnsmasq
dhcp-option=option:dns-server,<%= @recursor_ip %>

If @recursor_ip turns out to be a list/array, we could sort the entries and join them with a comma.
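
The join idea can be sketched outside the template. This is a minimal Python sketch, not the actual ERB change: the function name render_dns_option is hypothetical, and it assumes the recursor value may arrive either as a single string or as a list. It keeps the given order rather than sorting, on the assumption that the first entry should stay the primary resolver.

```python
def render_dns_option(recursors):
    """Build a dnsmasq dhcp-option line for one or more DNS recursors."""
    # Accept a single address as well as a list of addresses.
    if isinstance(recursors, str):
        recursors = [recursors]
    # Drop empty entries and duplicates while keeping the original order.
    unique = []
    for ip in recursors:
        ip = ip.strip()
        if ip and ip not in unique:
            unique.append(ip)
    if not unique:
        raise ValueError("at least one recursor address is required")
    return "dhcp-option=option:dns-server," + ",".join(unique)


print(render_dns_option(["208.80.155.118", "208.80.154.20"]))
# dhcp-option=option:dns-server,208.80.155.118,208.80.154.20
```

With a single string the same function produces the current one-resolver line, so a template written this way would not need to know which form the value takes.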

Old description

Been happening repeatedly over the last 24 hours.

https://gerrit.wikimedia.org/r/#/c/293538/
https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit-jessie/345/console

18:03:07 Traceback (most recent call last): [..]
18:03:07 git.exc.GitCommandError: 'git remote update origin' returned with exit code 1
18:03:07 stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/p/mediawiki/vendor/': Could not resolve host: gerrit.wikimedia.org
18:03:07 error: Could not fetch origin'
18:03:07 Build step 'Execute shell' marked build as failure
18:03:07 [PostBuildScript] - Execution post build scripts.
18:03:07 [PostBuildScript] Build is not success : do not execute script
18:03:07 Archiving artifacts
18:03:07 Finished: FAILURE

Details

	Subject	Repo	Branch	Lines +/-
	nova-network: Refresh service if config files change.	operations/puppet	production	+2 -0
	Send secondary DNS recursor IP from labs DHCP	operations/puppet	production	+3 -2

Related Objects

Mentioned In: T152370: Do contintcloud and other CI boxes know about labs-ns1?
T139011: puppet function ipresolve unable to look up instance on labs-puppetmaster

Event Timeline

Krinkle created this task. Jun 9 2016, 6:33 PM

Restricted Application added subscribers: Zppix, Aklapper. Jun 9 2016, 6:33 PM

The Nodepool instances receive their DNS configuration from DHCP. On a Jessie instance, /etc/resolv.conf is:

domain eqiad.wmflabs
search eqiad.wmflabs
nameserver 208.80.155.118

That IP is labs-recursor0.wikimedia.org. I guess it is having, or had, some trouble.

Restricted Application added a project: Cloud-Services. Jun 9 2016, 7:40 PM

hashar moved this task from Untriaged to Externally Blocked on the Continuous-Integration-Infrastructure board. Jun 9 2016, 7:40 PM

Paladox subscribed. Jun 9 2016, 7:43 PM

I'm seeing this pretty often now...

Occurred again at https://integration.wikimedia.org/ci/job/operations-puppet-tox-py27-jessie/10238/console, once more on a Jessie Nodepool instance:

00:00:28.375 fatal: unable to access 'https://gerrit.wikimedia.org/r/operations/puppet/mariadb/': Could not resolve host: gerrit.wikimedia.org

Is there a way to re-trigger the CI jobs other than rebasing the entire patch in Gerrit?

In T137460#2376125, @MoritzMuehlenhoff wrote:

Is there a way to re-trigger the CI jobs other than rebasing the entire patch in Gerrit?

Yup, on the Gerrit change you can comment "recheck", which will enqueue the patchset again as if it had just been created.

Worth adding to Scrum of Scrums?

I'm wondering if this is only happening on nodepool.

The resolv.conf for different instances:

Nodepool Trusty:

nameserver 208.80.155.118
search eqiad.wmflabs

Nodepool Jessie

domain eqiad.wmflabs
search eqiad.wmflabs
nameserver 208.80.155.118

Permanent slaves using the whole puppet.git provisioning have:

domain integration.eqiad.wmflabs
search integration.eqiad.wmflabs eqiad.wmflabs
nameserver 208.80.155.118
nameserver 208.80.154.20
options timeout:2 ndots:2
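
The difference matters because the resolver tries the listed nameservers in order, so an instance with a single entry has no fallback when that recursor is unreachable. A minimal Python sketch for comparing such files follows; the function name nameservers is hypothetical and the sample contents are the ones quoted above.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf content."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Only "nameserver <address>" lines are of interest here.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


nodepool_jessie = """\
domain eqiad.wmflabs
search eqiad.wmflabs
nameserver 208.80.155.118
"""

permanent_slave = """\
domain integration.eqiad.wmflabs
search integration.eqiad.wmflabs eqiad.wmflabs
nameserver 208.80.155.118
nameserver 208.80.154.20
options timeout:2 ndots:2
"""

print(nameservers(nodepool_jessie))  # ['208.80.155.118']
print(nameservers(permanent_slave))  # ['208.80.155.118', '208.80.154.20']
```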

I have looked at a Jessie Nodepool instance. It has two dhclient processes running:

dhclient -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases eth0

The lease:

lease {
  interface "eth0";
  fixed-address 10.68.16.187;
  filename "jessie-installer/pxelinux.0";
  server-name "carbon.wikimedia.org";
  option subnet-mask 255.255.248.0;
  option dhcp-lease-time 86400;
  option routers 10.68.16.1;
  option dhcp-message-type 5;
  option dhcp-server-identifier 10.68.16.1;
  option unknown-209 "pxelinux.cfg/ttyS1-115200";
  option domain-name-servers 208.80.155.118;
  option dhcp-renewal-time 43200;
  option dhcp-rebinding-time 75600;
  option broadcast-address 10.68.23.255;
  option host-name "ci-jessie-wikimedia-147750";
  option domain-name "eqiad.wmflabs";
  renew 4 2016/06/16 07:55:36;
  rebind 4 2016/06/16 17:04:15;
  expire 4 2016/06/16 20:04:15;
}

Whatever DHCP server is answering would probably need to hand out the second labs resolver as well:

- option domain-name-servers 208.80.155.118;
+ option domain-name-servers 208.80.155.118, 208.80.154.20;

In puppet we have modules/openstack/templates/liberty/nova/dnsmasq-nova.conf.erb :

#Clients should use the designate-backed dns server rather than dnsmasq
dhcp-option=option:dns-server,<%= @recursor_ip %>

If @recursor_ip turns out to be a list/array, we could sort the entries and join them with a comma.

I forgot to mention: there is a fully commented dnsmasq.conf at http://www.thekelleys.org.uk/dnsmasq/docs/dnsmasq.conf.example

Another thought: maybe the base Jessie image does not come with any sort of DNS caching, which would put needless load on the labs recursor.

hashar mentioned this in T139011: puppet function ipresolve unable to look up instance on labs-puppetmaster. Jul 4 2016, 12:52 PM

Krenair subscribed. Sep 7 2016, 3:52 PM

hashar mentioned this in T152370: Do contintcloud and other CI boxes know about labs-ns1?. Dec 5 2016, 9:00 AM

hashar merged a task: T152370: Do contintcloud and other CI boxes know about labs-ns1?.

hashar added a subscriber: Andrew.

hashar renamed this task from CI jobs failing with DNS resolution errors such as "Could not resolve host: gerrit.wikimedia.org" to labs DHCP server gives only a single DNS resolver (was: CI jobs failing with DNS resolution errors such as "Could not resolve host: gerrit.wikimedia.org"). Dec 5 2016, 9:02 AM

hashar updated the task description.

Change 325278 had a related patch set uploaded (by Alex Monk):
Send secondary DNS recursor IP from labs DHCP

https://gerrit.wikimedia.org/r/325278

gerritbot added a project: Patch-For-Review. Dec 5 2016, 10:57 AM

Change 325278 merged by Andrew Bogott:
Send secondary DNS recursor IP from labs DHCP

https://gerrit.wikimedia.org/r/325278

This should be resolved for new CI instances. Hashar, can you verify that new VMs now have labs-ns0 and labs-ns1 IPs in resolv.conf?

fgiunchedi added a project: Wikimedia-Incident. Dec 5 2016, 7:50 PM

greg moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. Dec 5 2016, 10:12 PM

Krinkle unsubscribed. Dec 17 2016, 7:50 AM

Sorry, I missed the updates on this task. https://gerrit.wikimedia.org/r/#/c/325278/ updated dnsmasq-nova.conf.erb to:

dhcp-option=option:dns-server,<%= @recursor_ip %>,<%= @recursor_secondary_ip %>

The change is not reflected on the instances (checked Trusty/Jessie instances on the contintcloud, deployment-prep and integration tenants).

@Andrew, can you double-check the content of /etc/dnsmasq-nova.conf? Maybe OpenStack or dnsmasq needs to be reloaded or restarted to pick up the change.

That change will definitely only affect new instances, not anything that existed before I made the puppet change. Since we're talking about instances that aren't puppetized (is that right?) I'm not sure how to make them reload.

In /etc/dnsmasq-nova.conf:

dhcp-option=option:dns-server,208.80.155.118,208.80.154.20

I have created instances after that change was merged, and they only have one DNS server listed. The leases expire daily, so all instances should have picked up the second nameserver soon after the merge. IMHO nova-network was not restarted after the merge, so dnsmasq is still using the old configuration.
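
The timers in the lease quoted earlier in this task (dhcp-lease-time 86400, dhcp-renewal-time 43200, dhcp-rebinding-time 75600) match the DHCP default fractions of the lease time, 1/2 for renewal and 7/8 for rebinding. A minimal Python sketch of that arithmetic follows; the function name default_dhcp_timers is hypothetical, and it assumes the server does not override the defaults.

```python
def default_dhcp_timers(lease_seconds):
    """Return (renewal, rebinding) times using the default DHCP fractions."""
    renewal = lease_seconds * 1 // 2    # T1: half of the lease time
    rebinding = lease_seconds * 7 // 8  # T2: 87.5% of the lease time
    return renewal, rebinding


renewal, rebinding = default_dhcp_timers(86400)
print(renewal, rebinding)  # 43200 75600
print(renewal / 3600)      # 12.0 hours until the first renewal attempt
```

Under those assumptions an existing instance would contact the server again roughly 12 hours into each lease, which is why a running dnsmasq with the new configuration should have reached every instance within a day.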

Good point. I have just restarted nova-network (and killed the dnsmasq procs so they're forced to respawn). Any better now?

krenair@deployment-salt02:~$ sudo salt '*' cmd.run --out=yaml 'grep 208.80.155.118,208.80.154.20 /var/lib/dhcp/dhclient.*' | grep -v "''"
deployment-redis02.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'
deployment-puppetmaster02.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'
deployment-pdf01.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'
deployment-changeprop.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'
deployment-conf03.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'
deployment-elastic06.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'
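
The same check can be done by parsing a lease file instead of grepping for a fixed string. This is a minimal Python sketch under stated assumptions: the function name lease_dns_servers is hypothetical, and it assumes the most recent lease is the last one in the file, which is how dhclient appends to its lease files.

```python
import re


def lease_dns_servers(lease_text):
    """Return the DNS servers from the last lease in dhclient lease text."""
    # Whitespace between "option" and the option name may include a newline.
    matches = re.findall(r"option\s+domain-name-servers\s+([^;]+);", lease_text)
    if not matches:
        return []
    return [ip.strip() for ip in matches[-1].split(",")]


old_lease = 'lease {\n  option domain-name-servers 208.80.155.118;\n}\n'
new_lease = 'lease {\n  option domain-name-servers 208.80.155.118,208.80.154.20;\n}\n'

print(lease_dns_servers(old_lease))              # ['208.80.155.118']
print(lease_dns_servers(old_lease + new_lease))  # ['208.80.155.118', '208.80.154.20']
```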

# salt --output=text  '*' cmd.run 'grep -c nameserver /etc/resolv.conf'|sort -n
buildlog.integration.eqiad.wmflabs: 2
castor.integration.eqiad.wmflabs: 2
integration-publisher.integration.eqiad.wmflabs: 2
integration-puppetmaster01.integration.eqiad.wmflabs: 2
integration-saltmaster.integration.eqiad.wmflabs: 2
integration-slave-docker-1000.integration.eqiad.wmflabs: 2
integration-slave-jessie-1001.integration.eqiad.wmflabs: 2
integration-slave-jessie-1002.integration.eqiad.wmflabs: 2
integration-slave-jessie-android.integration.eqiad.wmflabs: 2
integration-slave-precise-1002.integration.eqiad.wmflabs: 2
integration-slave-precise-1011.integration.eqiad.wmflabs: 2
integration-slave-precise-1012.integration.eqiad.wmflabs: 2
integration-slave-trusty-1001.integration.eqiad.wmflabs: 2
integration-slave-trusty-1003.integration.eqiad.wmflabs: 2
integration-slave-trusty-1004.integration.eqiad.wmflabs: 2
integration-slave-trusty-1006.integration.eqiad.wmflabs: 2
integration-slave-trusty-1011.integration.eqiad.wmflabs: 2
repository.integration.eqiad.wmflabs: 2

Confirmed on Jessie/Trusty Nodepool instances (contintcloud tenant).

So nova-network / dnsmasq had to be restarted after @Andrew's configuration change :]

Thank you!

Change 330308 had a related patch set uploaded (by Andrew Bogott):
nova-network: Refresh service if some dependency files change.

https://gerrit.wikimedia.org/r/330308

Change 330308 merged by Andrew Bogott:
nova-network: Refresh service if config files change.