Page MenuHomePhabricator

CloudVPS DNS: weird behavior in integration-puppetmaster01
Closed, ResolvedPublic

Description

I was reallocating integration-puppetmaster01 to other cloudvirt (T223148: Cloud Services: reallocate workload from rack B5-eqiad) and the VM won't boot up normally.

There are some logs that indicate a DNS configuration issue:

2019-05-15T09:31:28.352482+00:00 integration-puppetmaster01 rc.local[507]: + /sbin/vgdisplay -c vd
2019-05-15T09:31:28.402556+00:00 integration-puppetmaster01 rc.local[507]: vd:r/w:772:-1:0:0:0:-1:0:1:1:520192:4096:127:0:127:iTVghg-HD0y-sjIk-jAqJ-BA9C-Zxr1-Mgx9JI
2019-05-15T09:31:28.413154+00:00 integration-puppetmaster01 rc.local[507]: ++ sed -r 's/\".*$//g'
2019-05-15T09:31:28.417130+00:00 integration-puppetmaster01 rc.local[507]: ++ sed -r 's/^.*project_id\": \"//'
2019-05-15T09:31:28.418297+00:00 integration-puppetmaster01 rc.local[507]: ++ curl http://169.254.169.254/openstack/latest/meta_data.json/
2019-05-15T09:31:28.535396+00:00 integration-puppetmaster01 rc.local[507]: % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2019-05-15T09:31:28.536187+00:00 integration-puppetmaster01 rc.local[507]: Dload  Upload   Total   Spent    Left  Speed
2019-05-15T09:31:29.069992+00:00 integration-puppetmaster01 rc.local[507]: 0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0#015  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0#015100   908  100   908    0     0   1699      0 --:--:-- --:--:-- --:--:--  1697
2019-05-15T09:31:29.089705+00:00 integration-puppetmaster01 rc.local[507]: + project=integration
2019-05-15T09:31:29.093252+00:00 integration-puppetmaster01 rc.local[507]: ++ curl http://169.254.169.254/1.0/meta-data/local-ipv4
2019-05-15T09:31:29.409027+00:00 integration-puppetmaster01 rc.local[507]: + ip=172.16.3.17
2019-05-15T09:31:29.417222+00:00 integration-puppetmaster01 rc.local[507]: ++ hostname
2019-05-15T09:31:29.421218+00:00 integration-puppetmaster01 rc.local[507]: + hostname=integration-puppetmaster01
2019-05-15T09:31:29.433134+00:00 integration-puppetmaster01 rc.local[507]: ++ sed -r 's/.*\.([^.]+\.[^.]+)$/\1/'
2019-05-15T09:31:29.435177+00:00 integration-puppetmaster01 rc.local[507]: ++ hostname -d
2019-05-15T09:31:29.442799+00:00 integration-puppetmaster01 rc.local[507]: + domain=eqiad.wmflabs
2019-05-15T09:31:29.443006+00:00 integration-puppetmaster01 rc.local[507]: + '[' -z eqiad.wmflabs ']'
2019-05-15T09:31:29.443190+00:00 integration-puppetmaster01 rc.local[507]: + '[' -z eqiad.wmflabs ']'
2019-05-15T09:31:29.443365+00:00 integration-puppetmaster01 rc.local[507]: + fqdn=integration-puppetmaster01.integration.eqiad.wmflabs
2019-05-15T09:31:29.443544+00:00 integration-puppetmaster01 rc.local[507]: + saltfinger=c5:b1:35:45:3e:0a:19:70:aa:5f:3a:cf:bf:a0:61:dd
2019-05-15T09:31:29.443722+00:00 integration-puppetmaster01 rc.local[507]: + '[' eqiad.wmflabs == eqiad.wmflabs ']'
2019-05-15T09:31:29.443893+00:00 integration-puppetmaster01 rc.local[507]: + master=labs-puppetmaster-eqiad.wikimedia.org
2019-05-15T09:31:29.444065+00:00 integration-puppetmaster01 rc.local[507]: + ldaphosts='ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.codfw.wikimedia.org:389'
2019-05-15T09:31:29.444239+00:00 integration-puppetmaster01 rc.local[507]: + '[' eqiad.wmflabs == codfw.wmflabs ']'
2019-05-15T09:31:29.444790+00:00 integration-puppetmaster01 rc.local[507]: + '[' eqiad.wmflabs == codfw.labtest ']'
2019-05-15T09:31:29.449247+00:00 integration-puppetmaster01 rc.local[507]: + sed -i s/_PROJECT_/integration/g /etc/security/access.conf
2019-05-15T09:31:29.463958+00:00 integration-puppetmaster01 rc.local[507]: + sed -i s/_PROJECT_/integration/g /etc/ldap/ldap.conf
2019-05-15T09:31:29.481721+00:00 integration-puppetmaster01 rc.local[507]: + sed -i s/_PROJECT_/integration/g /etc/sudo-ldap.conf
2019-05-15T09:31:29.499691+00:00 integration-puppetmaster01 rc.local[507]: + sed -i s/_PROJECT_/integration/g /etc/nslcd.conf
2019-05-15T09:31:29.506568+00:00 integration-puppetmaster01 rc.local[507]: + sed -i 's%^uri.*%uri ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.codfw.wikimedia.org:389%g' /etc/nslcd.conf
2019-05-15T09:31:29.520165+00:00 integration-puppetmaster01 rc.local[507]: + sed -i s/_FQDN_/integration-puppetmaster01.integration.eqiad.wmflabs/g /etc/puppet/puppet.conf
2019-05-15T09:31:29.532805+00:00 integration-puppetmaster01 rc.local[507]: + sed -i s/_MASTER_/labs-puppetmaster-eqiad.wikimedia.org/g /etc/puppet/puppet.conf
2019-05-15T09:31:29.542240+00:00 integration-puppetmaster01 rc.local[507]: + sed -i 's%^uri.*%uri             ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.codfw.wikimedia.org:389%g' /etc/ldap.conf
2019-05-15T09:31:29.550357+00:00 integration-puppetmaster01 rc.local[507]: + sed -i 's%^URI.*%URI             ldap://ldap-labs.eqiad.wikimedia.org:389 ldap://ldap-labs.codfw.wikimedia.org:389%g' /etc/ldap/ldap.conf
2019-05-15T09:31:29.558665+00:00 integration-puppetmaster01 rc.local[507]: + echo ''
2019-05-15T09:31:29.559267+00:00 integration-puppetmaster01 rc.local[507]: + mkdir /etc/dhcp/dhclient-enter-hooks.d
2019-05-15T09:31:29.569065+00:00 integration-puppetmaster01 rc.local[507]: mkdir: cannot create directory ?/etc/dhcp/dhclient-enter-hooks.d?: File exists
2019-05-15T09:31:29.570009+00:00 integration-puppetmaster01 rc.local[507]: + cat
2019-05-15T09:31:29.584408+00:00 integration-puppetmaster01 rc.local[507]: + '[' eqiad.wmflabs == codfw.labtest ']'
2019-05-15T09:31:29.593489+00:00 integration-puppetmaster01 rc.local[507]: ++ /usr/bin/dig +short labs-recursor0.wikimedia.org
2019-05-15T09:31:29.699955+00:00 integration-puppetmaster01 rc.local[507]: + nameserver=208.80.155.118
2019-05-15T09:31:29.700230+00:00 integration-puppetmaster01 rc.local[507]: + cat
2019-05-15T09:31:29.714119+00:00 integration-puppetmaster01 rc.local[507]: + mkdir /var/lib/puppet/client
2019-05-15T09:31:29.723456+00:00 integration-puppetmaster01 rc.local[507]: mkdir: cannot create directory ?/var/lib/puppet/client?: File exists
2019-05-15T09:31:29.727131+00:00 integration-puppetmaster01 rc.local[507]: + systemctl restart nslcd.service

It seems to me that other VMs are using:

nameserver 208.80.154.143
nameserver 208.80.154.24

Not sure what's going on here.

Event Timeline

saucelabs-01.integration.eqiad.wmflabs seems to have the same issue.

This is likely a result of puppet going out of sync. The old recursors were 208.80.155.118 and 208.80.154.20, replaced with 208.80.154.143 and 208.80.154.24 as per T221183

Andrew subscribed.

...and it looks like puppet caught up?

root@integration-puppetmaster01:~# cat /etc/resolv.conf 
## THIS FILE IS MANAGED BY PUPPET
##
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

domain integration.eqiad.wmflabs
search integration.eqiad.wmflabs eqiad.wmflabs 
nameserver 208.80.154.143
nameserver 208.80.154.24
options timeout:2 ndots:2
root@saucelabs-01:~# cat /etc/resolv.conf 
## THIS FILE IS MANAGED BY PUPPET
##
## source: modules/base/resolv.conf.labs.erb
## from:   base::resolving

domain integration.eqiad.wmflabs
search integration.eqiad.wmflabs eqiad.wmflabs 
nameserver 208.80.154.143
nameserver 208.80.154.24
options timeout:2 ndots:2

I added those by hand to allow normal operation of the VMs. The original nameserver was being set apparently by the DHCP server or something else triggered by the rc.local script, see trace in the task description.

Ah so I see. There are some VMs which were built from a bad base image and for some reason re-run the firstboot script on every boot. It should be fixed now and I thought I'd caught them all... I'll do a bit more research.

So it looks like any failure at all in the firstboot script means that firstboot will be run again on subsequent boots. I'm pretty sure that this is wrong and we should just only ever run it once even if it throws an error but I'm going to look at more failure cases first.

Meanwhile, cleaning up the VMs that fell through the wholes (and are in danger of re-running firstboot on the next run) should be pretty simple.

Mentioned in SAL (#wikimedia-cloud) [2019-05-15T17:00:57Z] <andrewbogott> touching /root/firstboot_done on all VMs that cumin can reach. This will prevent firstboot.sh from running a second time if/when any of these are rebooted. T223370

Change 510576 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] firstboot: Only run once, even if the firstboot script throws an error

https://gerrit.wikimedia.org/r/510576

integration and deployment-prep have their own cumin setup.

integration-cumin:~$ sudo cumin --force '*' '[ -f /root/firstboot_done ]'
100.0% (25/25) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
hashar@deployment-cumin02:~$ sudo cumin --force '*' '[ -f /root/firstboot_done ]'
ssh: Could not resolve hostname deployment-logstash02.deployment-prep.eqiad.wmflabs: Name or service not known                                
98.6% (73/74) success ratio (< 100.0% threshold) for command: '[ -f /root/firstboot_done ]'.

So they seem covered (with the delta of deployment-logstash02 not being reachable at this point but that is irrelevant to this task anyway.

I'm still puzzled about this. A second look suggests that the only way the firstboot script would fail is if it's unable to resolve the name of the host. But if /that/ happens then I can't imagine things ever working.

So... I think I'm going to throw out https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/510576/ and try another approach: I'll have puppet itself mark the firstboot as finished.

Change 510758 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] profile::wmcs::instance: Ensure that /root/firstboot_done exists

https://gerrit.wikimedia.org/r/510758

Change 510576 abandoned by Andrew Bogott:
firstboot: Only run once, even if the firstboot script throws an error

Reason:
Dropping in favor of https://gerrit.wikimedia.org/r/#/c/operations/puppet/ /510758/ -- I'm happy with equating 'puppet catalog compiled' as 'system is now up'

https://gerrit.wikimedia.org/r/510576

Change 510758 merged by Andrew Bogott:
[operations/puppet@production] profile::wmcs::instance: Ensure that /root/firstboot_done exists

https://gerrit.wikimedia.org/r/510758

This particular thing shouldn't happen any more.

hashar@deployment-cumin02:~$ sudo cumin --force '*' '[ -f /root/firstboot_done ]'
ssh: Could not resolve hostname deployment-logstash02.deployment-prep.eqiad.wmflabs: Name or service not known                                
98.6% (73/74) success ratio (< 100.0% threshold) for command: '[ -f /root/firstboot_done ]'.

So they seem covered (with the delta of deployment-logstash02 not being reachable at this point but that is irrelevant to this task anyway.

(see T222480)