
labs DHCP server gives only a single DNS resolver (was: CI jobs failing with DNS resolution errors such as "Could not resolve host: gerrit.wikimedia.org")
Closed, Resolved (Public)

Description

The labs DHCP server only hands out a single DNS resolver:

A DHCP lease:

lease {
  interface "eth0";
  fixed-address 10.68.16.187;
  filename "jessie-installer/pxelinux.0";
  server-name "carbon.wikimedia.org";
  option subnet-mask 255.255.248.0;
  option dhcp-lease-time 86400;
  option routers 10.68.16.1;
  option dhcp-message-type 5;
  option dhcp-server-identifier 10.68.16.1;
  option unknown-209 "pxelinux.cfg/ttyS1-115200";
  option domain-name-servers 208.80.155.118;
  option dhcp-renewal-time 43200;
  option dhcp-rebinding-time 75600;
  option broadcast-address 10.68.23.255;
  option host-name "ci-jessie-wikimedia-147750";
  option domain-name "eqiad.wmflabs";
  renew 4 2016/06/16 07:55:36;
  rebind 4 2016/06/16 17:04:15;
  expire 4 2016/06/16 20:04:15;
}

Whatever DHCP server is answering probably needs to be configured to send the second labs resolver as well:

- option domain-name-servers 208.80.155.118;
+ option domain-name-servers 208.80.155.118, 208.80.154.20;

In puppet we have modules/openstack/templates/liberty/nova/dnsmasq-nova.conf.erb :

#Clients should use the designate-backed dns server rather than dnsmasq
dhcp-option=option:dns-server,<%= @recursor_ip %>

If @recursor_ip turns out to be a list/array, we could potentially sort the entries and join them with commas.

Old description

Been happening repeatedly over the last 24 hours.

https://gerrit.wikimedia.org/r/#/c/293538/
https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit-jessie/345/console

18:03:07 Traceback (most recent call last): [..]
18:03:07 git.exc.GitCommandError: 'git remote update origin' returned with exit code 1
18:03:07 stderr: 'fatal: unable to access 'https://gerrit.wikimedia.org/r/p/mediawiki/vendor/': Could not resolve host: gerrit.wikimedia.org
18:03:07 error: Could not fetch origin'
18:03:07 Build step 'Execute shell' marked build as failure
18:03:07 [PostBuildScript] - Execution post build scripts.
18:03:07 [PostBuildScript] Build is not success : do not execute script
18:03:07 Archiving artifacts
18:03:07 Finished: FAILURE

Event Timeline

hashar subscribed.

The Nodepool instances receive their DNS configuration from DHCP. On a Jessie instance, /etc/resolv.conf is:

domain eqiad.wmflabs
search eqiad.wmflabs
nameserver 208.80.155.118

That IP is labs-recursor0.wikimedia.org. I guess it has, or had, some trouble.

Legoktm triaged this task as High priority. Jun 10 2016, 8:17 PM
Legoktm subscribed.

I'm seeing this pretty often now...

hashar renamed this task from mediawiki-extensions-qunit failing "Could not resolve host: gerrit.wikimedia.org" to CI jobs failing with DNS resolution errors such as "Could not resolve host: gerrit.wikimedia.org". Jun 13 2016, 1:25 PM

Occurred again in https://integration.wikimedia.org/ci/job/operations-puppet-tox-py27-jessie/10238/console, once more on a Jessie Nodepool instance:

00:00:28.375 fatal: unable to access 'https://gerrit.wikimedia.org/r/operations/puppet/mariadb/': Could not resolve host: gerrit.wikimedia.org

Is there a way to re-trigger the CI jobs other than rebasing the entire patch in Gerrit?

Is there a way to re-trigger the CI jobs other than rebasing the entire patch in Gerrit?

Yup, on the Gerrit change you can comment "recheck", which will enqueue the patchset again as if it had just been created.

Worth adding to Scrum of Scrums?

I'm wondering if this is only happening on nodepool.

The resolv.conf for different instances:

Nodepool Trusty:

nameserver 208.80.155.118
search eqiad.wmflabs

Nodepool Jessie

domain eqiad.wmflabs
search eqiad.wmflabs
nameserver 208.80.155.118

Permanent slaves using the whole puppet.git provisioning have:

domain integration.eqiad.wmflabs
search integration.eqiad.wmflabs eqiad.wmflabs
nameserver 208.80.155.118
nameserver 208.80.154.20
options timeout:2 ndots:2
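
To compare instances like the ones above mechanically, the nameserver entries can be pulled out of the resolv.conf text. This is only a minimal sketch: the function name parse_nameservers is made up for illustration, and the two sample strings are copied from the Nodepool Jessie and permanent slave listings above.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses found in resolv.conf-style text, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines (resolv.conf comments start with '#' or ';').
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


# Sample content copied from the listings in this task.
NODEPOOL_JESSIE = """\
domain eqiad.wmflabs
search eqiad.wmflabs
nameserver 208.80.155.118
"""

PERMANENT_SLAVE = """\
domain integration.eqiad.wmflabs
search integration.eqiad.wmflabs eqiad.wmflabs
nameserver 208.80.155.118
nameserver 208.80.154.20
options timeout:2 ndots:2
"""

if __name__ == "__main__":
    for name, text in (("nodepool jessie", NODEPOOL_JESSIE),
                       ("permanent slave", PERMANENT_SLAVE)):
        servers = parse_nameservers(text)
        print(name, len(servers), servers)
```

An instance whose list has only one entry has no fallback when that resolver misbehaves, which matches the failures seen on the Nodepool instances.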

I have looked at a Jessie Nodepool instance. It has two dhclient processes running:

dhclient -v -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases eth0

The lease:

lease {
  interface "eth0";
  fixed-address 10.68.16.187;
  filename "jessie-installer/pxelinux.0";
  server-name "carbon.wikimedia.org";
  option subnet-mask 255.255.248.0;
  option dhcp-lease-time 86400;
  option routers 10.68.16.1;
  option dhcp-message-type 5;
  option dhcp-server-identifier 10.68.16.1;
  option unknown-209 "pxelinux.cfg/ttyS1-115200";
  option domain-name-servers 208.80.155.118;
  option dhcp-renewal-time 43200;
  option dhcp-rebinding-time 75600;
  option broadcast-address 10.68.23.255;
  option host-name "ci-jessie-wikimedia-147750";
  option domain-name "eqiad.wmflabs";
  renew 4 2016/06/16 07:55:36;
  rebind 4 2016/06/16 17:04:15;
  expire 4 2016/06/16 20:04:15;
}

Whatever DHCP server is answering probably needs to be configured to send the second labs resolver as well:

- option domain-name-servers 208.80.155.118;
+ option domain-name-servers 208.80.155.118, 208.80.154.20;
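
A quick way to check what a lease actually carries is to extract the domain-name-servers option from the lease file text. The following is a rough sketch, not an official tool: lease_dns_servers is a hypothetical helper, and it assumes the most recent lease is the last one in the file.

```python
import re


def lease_dns_servers(lease_text):
    """Return the resolvers from the last 'option domain-name-servers' entry, or []."""
    # The option value runs up to the terminating semicolon and may hold
    # several comma-separated addresses.
    matches = re.findall(r"option\s+domain-name-servers\s+([^;]+);", lease_text)
    if not matches:
        return []
    # Assumption: later entries in the file belong to more recent leases.
    return [addr.strip() for addr in matches[-1].split(",") if addr.strip()]


if __name__ == "__main__":
    before = "lease {\n  option domain-name-servers 208.80.155.118;\n}\n"
    after = "lease {\n  option domain-name-servers 208.80.155.118, 208.80.154.20;\n}\n"
    print(lease_dns_servers(before))
    print(lease_dns_servers(after))
```

In practice the text would be read from a file such as /var/lib/dhcp/dhclient.eth0.leases, the path passed to dhclient above.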

In puppet we have modules/openstack/templates/liberty/nova/dnsmasq-nova.conf.erb :

#Clients should use the designate-backed dns server rather than dnsmasq
dhcp-option=option:dns-server,<%= @recursor_ip %>

If @recursor_ip turns out to be a list/array, we could potentially sort the entries and join them with commas.
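
The sort-and-join idea can be sketched outside of puppet. The snippet below is Python rather than ERB and only illustrates the logic; dns_server_option is a hypothetical name, and whether @recursor_ip is a string or a list is exactly the open question, so both cases are handled.

```python
def dns_server_option(recursors):
    """Build a dnsmasq dhcp-option line from one address or a list of addresses."""
    # Accept a single address as a plain string, as the current template does.
    if isinstance(recursors, str):
        recursors = [recursors]
    # Sort (plain string order) and join with commas, as suggested above.
    return "dhcp-option=option:dns-server," + ",".join(sorted(recursors))


if __name__ == "__main__":
    print(dns_server_option("208.80.155.118"))
    print(dns_server_option(["208.80.155.118", "208.80.154.20"]))
```

One caveat with sorting: it is a plain string sort, so 208.80.154.20 ends up ahead of 208.80.155.118. If the order of the resolvers matters to clients, joining the list without sorting may be the safer choice.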

Something else I thought about: maybe the base Jessie image does not come with any sort of DNS caching, which would put needless load on the labs recursor.

hashar renamed this task from CI jobs failing with DNS resolution errors such as "Could not resolve host: gerrit.wikimedia.org" to labs DHCP server gives only a single DNS resolver (was: CI jobs failing with DNS resolution errors such as "Could not resolve host: gerrit.wikimedia.org"). Dec 5 2016, 9:02 AM
hashar updated the task description.

Change 325278 had a related patch set uploaded (by Alex Monk):
Send secondary DNS recursor IP from labs DHCP

https://gerrit.wikimedia.org/r/325278

Change 325278 merged by Andrew Bogott:
Send secondary DNS recursor IP from labs DHCP

https://gerrit.wikimedia.org/r/325278

This should be resolved for new CI instances. Hashar, can you verify that new VMs now have labs-ns0 and labs-ns1 IPs in resolv.conf?

Sorry, I missed the updates on this task. https://gerrit.wikimedia.org/r/#/c/325278/ updated dnsmasq-nova.conf.erb to:

dhcp-option=option:dns-server,<%= @recursor_ip %>,<%= @recursor_secondary_ip %>

The change is not reflected on the instances (checked Trusty/Jessie instances on the contintcloud, deployment-prep and integration tenants).

@Andrew, can you double-check the content of /etc/dnsmasq-nova.conf? Maybe openstack or dnsmasq needs to be reloaded/restarted to pick up the change.

That change will definitely only affect new instances, not anything that existed before I made the puppet change. Since we're talking about instances that aren't puppetized (is that right?) I'm not sure how to make them reload.

In /etc/dnsmasq-nova.conf:

dhcp-option=option:dns-server,208.80.155.118,208.80.154.20

I have created instances after that change was merged, and they only have one DNS server listed. The leases expire daily, so all instances should have picked up the second nameserver soon after merging. IMHO nova-network has not been restarted after the merge, thus dnsmasq is still using the old configuration.

Good point. I have just restarted nova-network (and killed the dnsmasq procs so they're forced to respawn). Any better now?

krenair@deployment-salt02:~$ sudo salt '*' cmd.run --out=yaml 'grep 208.80.155.118,208.80.154.20 /var/lib/dhcp/dhclient.*' | grep -v "''"
deployment-redis02.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'
deployment-puppetmaster02.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'
deployment-pdf01.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'
deployment-changeprop.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'
deployment-conf03.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'
deployment-elastic06.deployment-prep.eqiad.wmflabs: '/var/lib/dhcp/dhclient.eth0.leases:  option
  domain-name-servers 208.80.155.118,208.80.154.20;'

hashar assigned this task to Andrew.

# salt --output=text  '*' cmd.run 'grep -c nameserver /etc/resolv.conf'|sort -n
buildlog.integration.eqiad.wmflabs: 2
castor.integration.eqiad.wmflabs: 2
integration-publisher.integration.eqiad.wmflabs: 2
integration-puppetmaster01.integration.eqiad.wmflabs: 2
integration-saltmaster.integration.eqiad.wmflabs: 2
integration-slave-docker-1000.integration.eqiad.wmflabs: 2
integration-slave-jessie-1001.integration.eqiad.wmflabs: 2
integration-slave-jessie-1002.integration.eqiad.wmflabs: 2
integration-slave-jessie-android.integration.eqiad.wmflabs: 2
integration-slave-precise-1002.integration.eqiad.wmflabs: 2
integration-slave-precise-1011.integration.eqiad.wmflabs: 2
integration-slave-precise-1012.integration.eqiad.wmflabs: 2
integration-slave-trusty-1001.integration.eqiad.wmflabs: 2
integration-slave-trusty-1003.integration.eqiad.wmflabs: 2
integration-slave-trusty-1004.integration.eqiad.wmflabs: 2
integration-slave-trusty-1006.integration.eqiad.wmflabs: 2
integration-slave-trusty-1011.integration.eqiad.wmflabs: 2
repository.integration.eqiad.wmflabs: 2
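
Output in this "host: count" form can be scanned for stragglers with a few lines of Python. This is just a sketch under the assumption that every line looks like the ones above; hosts_below is a hypothetical helper, and the host reporting a count of 1 in the sample is invented for illustration.

```python
def hosts_below(salt_text_output, minimum=2):
    """Return hosts whose reported nameserver count is lower than `minimum`."""
    hosts = []
    for line in salt_text_output.splitlines():
        # Each line is expected to look like "<hostname>: <count>".
        host, sep, count = line.rpartition(": ")
        count = count.strip()
        if sep and count.isdigit() and int(count) < minimum:
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    sample = (
        "castor.integration.eqiad.wmflabs: 2\n"
        "made-up-host.example.wmflabs: 1\n"  # invented entry, not from the real output
        "repository.integration.eqiad.wmflabs: 2\n"
    )
    print(hosts_below(sample))
```

For the real output above the result would be an empty list, since every instance reports two nameservers.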

Confirmed on Jessie/Trusty Nodepool instances (contintcloud tenant).

So nova-network / dnsmasq had to be restarted after @Andrew's configuration change :]

Thank you!

Change 330308 had a related patch set uploaded (by Andrew Bogott):
nova-network: Refresh service if some dependency files change.

https://gerrit.wikimedia.org/r/330308

Change 330308 merged by Andrew Bogott:
nova-network: Refresh service if config files change.

https://gerrit.wikimedia.org/r/330308