Do contintcloud and other CI boxes know about labs-ns1?
Closed, Duplicate (Public)

Description

According to a comment in T152340, the labs-ns0 outage caused CI tests to fail. That suggests that wherever that test was running, resolv.conf had labs-ns0 but not labs-ns1 in it, hence no failover.

If that's true, it should be an easy fix. If that's /not/ true then... this needs investigation.

Event Timeline

I looked into this and filed a task about it ages ago, but I cannot find it anymore. The issue is that the DHCP server on labs only yields a single DNS resolver, because our OpenStack configuration does not use the DHCP option to set a second, alternative resolver. I will need to dig a bit more to find what I wrote about it.

On boot, the Nodepool instances emit a DHCP request and only get labs-recursor0.wikimedia.org:

Trusty:

/etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 208.80.155.118
search eqiad.wmflabs

Jessie:

/etc/resolv.conf
domain eqiad.wmflabs
search eqiad.wmflabs
nameserver 208.80.155.118

For reference, the permanent slaves (Precise, Trusty, Jessie) all share the same configuration, which is provided by Puppet:

/etc/resolv.conf
## THIS FILE IS MANAGED BY PUPPET
##
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

domain integration.eqiad.wmflabs
search integration.eqiad.wmflabs eqiad.wmflabs 
nameserver 208.80.155.118
nameserver 208.80.154.20
options timeout:2 ndots:2
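
To check a given instance for the same problem, something like the following could be used. It is only a rough sketch: the function names `nameservers` and `has_failover` are made up for illustration, and the two sample strings are copied from the Jessie Nodepool and Puppet-managed files quoted above. On a real host one would read /etc/resolv.conf instead.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses in the order they appear."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments ('#' or ';' per resolv.conf(5)).
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def has_failover(resolv_conf_text):
    """True if at least two distinct resolvers are configured."""
    return len(set(nameservers(resolv_conf_text))) >= 2


# Sample contents taken from the files quoted in this task.
NODEPOOL_JESSIE = """\
domain eqiad.wmflabs
search eqiad.wmflabs
nameserver 208.80.155.118
"""

PERMANENT_SLAVE = """\
domain integration.eqiad.wmflabs
search integration.eqiad.wmflabs eqiad.wmflabs
nameserver 208.80.155.118
nameserver 208.80.154.20
options timeout:2 ndots:2
"""

if __name__ == "__main__":
    for name, text in (("nodepool", NODEPOOL_JESSIE),
                       ("permanent", PERMANENT_SLAVE)):
        print(name, nameservers(text), "failover:", has_failover(text))
```

With the samples above, only the Puppet-managed file reports a second resolver, which matches the observation that the Nodepool instances have nothing to fall back to.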

Found it. T137460#2383979 and others have all the details. Namely, the DHCP lease has:

option domain-name-servers 208.80.155.118;

And in dnsmasq-nova.conf.erb we only have a single IP:

#Clients should use the designate-backed dns server rather than dnsmasq
dhcp-option=option:dns-server,<%= @recursor_ip %>
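
dnsmasq accepts a comma-separated list of addresses for this option, so a possible fix is to pass both recursors. The sketch below only illustrates what the rendered line would look like. It is not how the ERB template is actually rendered: `dns_server_option` is a hypothetical helper, and using 208.80.154.20 as the second recursor is an assumption based on the Puppet-managed resolv.conf quoted earlier.

```python
import ipaddress


def dns_server_option(recursor_ips):
    """Build a dnsmasq dhcp-option line listing one or more DNS servers."""
    if not recursor_ips:
        raise ValueError("at least one recursor IP is required")
    # Validate each address; raises ValueError on malformed input.
    validated = [str(ipaddress.IPv4Address(ip)) for ip in recursor_ips]
    return "dhcp-option=option:dns-server," + ",".join(validated)


if __name__ == "__main__":
    # Current behaviour: a single resolver is handed out.
    print(dns_server_option(["208.80.155.118"]))
    # Possible fix (assumed second IP, taken from the Puppet-managed file).
    print(dns_server_option(["208.80.155.118", "208.80.154.20"]))
```

If the template took a list of recursor IPs rather than the single `@recursor_ip`, the DHCP lease would carry both servers and the generated resolv.conf on the Nodepool instances should end up with two nameserver lines.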