Page MenuHomePhabricator

Instance creation fails before first puppet run around 1% of the time
Closed, ResolvedPublic

Description

This leaves nodes unreachable. It seems like historically they can be fixed by rebooting the instance which retriggers the rc.local logic (and indicates the initial run of this does not reach a place where it would not-rerun). This may not always be the case however as the latest batch does not present in this manner.

We leaked 4 instances again over the weekend:

| c4ba92cc-f05b-4c1e-9862-a576846e4574 | fullstackd-1489919156 | REBOOT | public=10.68.23.230 |
| 02edf4d2-d7b1-4acb-894c-0005322b2bcb | fullstackd-1489833813 | ACTIVE | public=10.68.22.232 |
| f472b816-c35b-4795-a886-ebf4c6e7e365 | fullstackd-1489828105 | ACTIVE | public=10.68.18.226 |
| 69924d23-263b-461d-962f-8776e58cec89 | fullstackd-1489807288 | ACTIVE | public=10.68.17.181 |
root@labnet1001:~# zgrep -i exception /var/log/upstart/nova-fullstack.log*
/var/log/upstart/nova-fullstack.log.1:    raise Exception("SSH for {} timed out".format(address))
/var/log/upstart/nova-fullstack.log.1:Exception: SSH for 10.68.23.230 timed out
/var/log/upstart/nova-fullstack.log.2.gz:    raise Exception("SSH for {} timed out".format(address))
/var/log/upstart/nova-fullstack.log.2.gz:Exception: SSH for 10.68.18.226 timed out
/var/log/upstart/nova-fullstack.log.2.gz:    raise Exception("SSH for {} timed out".format(address))
/var/log/upstart/nova-fullstack.log.2.gz:Exception: SSH for 10.68.22.232 timed out
/var/log/upstart/nova-fullstack.log.3.gz:    raise Exception("creation of {} timed out".format(cserver.id))
/var/log/upstart/nova-fullstack.log.3.gz:Exception: creation of af44fbd0-d4bd-43c2-80f5-c4ef93e33e1a timed out
/var/log/upstart/nova-fullstack.log.3.gz:    raise Exception("SSH for {} timed out".format(address))
/var/log/upstart/nova-fullstack.log.3.gz:Exception: SSH for 10.68.17.181 timed out

This has been happening for awhile and surfaced as we addressed T159459: openstack instance creation sometimes takes >480s

I rebooted fullstackd-1489919156.admin-monitoring.eqiad.wmflabs

nova reboot c4ba92cc-f05b-4c1e-9862-a576846e457462-a576846e4574

Which did not come back and has some very wrong console output:

2017-03-20T12:53:32.142146+00:00 localhost systemd[1]: Startup finished in 22.193s (kernel) + 46.685s (userspace) = 1min 8.878s.

Debian GNU/Linux 8 localhost ttyS0

localhost login:

This could be related to nova-fullstack using a different image https://gerrit.wikimedia.org/r/#/c/343207/

Event Timeline

chasemp renamed this task from Instance creation stalls before first puppet run around 1% of the time to Instance creation fails before first puppet run around 1% of the time.Mar 20 2017, 12:58 PM
chasemp assigned this task to Andrew.
chasemp triaged this task as High priority.
chasemp updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2017-03-21T19:22:03Z] <chasemp> clean out admin-monitoring for nova-fullstack T160908

Change 344656 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet@production] bootstrap_vz: Handle initial hostname a bit better

https://gerrit.wikimedia.org/r/344656

Change 344656 merged by Andrew Bogott:
[operations/puppet@production] bootstrap_vz: Handle initial hostname a bit better

https://gerrit.wikimedia.org/r/344656

Oddly, labs instances seem to be getting their dhcp leases from install1001:

lease {

interface "eth0";
fixed-address 10.68.21.59;
filename "jessie-installer/pxelinux.0";
server-name "install1001.wikimedia.org";
option subnet-mask 255.255.248.0;
option routers 10.68.16.1;
option dhcp-lease-time 86400;
option dhcp-message-type 5;
option domain-name-servers 208.80.155.118,208.80.154.20;
option unknown-209 "pxelinux.cfg/ttyS1-115200";
option dhcp-server-identifier 10.68.16.1;
option dhcp-renewal-time 43200;
option broadcast-address 10.68.23.255;
option dhcp-rebinding-time 75600;
option host-name "fullstackd-1490899441";
option domain-name "eqiad.wmflabs";
renew 2 2017/04/04 05:25:32;
rebind 2 2017/04/04 17:09:00;
expire 2 2017/04/04 20:09:00;

}

It's hard to know where they're getting their original leases from, though -- it should be dnsmasq on labnet.

I shutdown a failed instance and mounted the drive.

andrew@labvirt1008:/tmp/mnt/var/lib/dhcp$ ls
dhclient.eth0.leases  dhclient.leases
andrew@labvirt1008:/tmp/mnt/var/lib/dhcp$ cat dhclient.leases 
andrew@labvirt1008:/tmp/mnt/var/lib/dhcp$ cat dhclient.eth0.leases 
lease {
  interface "eth0";
  fixed-address 10.68.17.31;
  filename "jessie-installer/pxelinux.0";
  server-name "install1001.wikimedia.org";
  option subnet-mask 255.255.248.0;
  option dhcp-lease-time 86400;
  option routers 10.68.16.1;
  option dhcp-message-type 5;
  option dhcp-server-identifier 10.68.16.1;
  option unknown-209 "pxelinux.cfg/ttyS1-115200";
  option domain-name-servers 208.80.155.118,208.80.154.20;
  option dhcp-renewal-time 43200;
  option dhcp-rebinding-time 75600;
  option broadcast-address 10.68.23.255;
  option host-name "fullstackd-1491265537";
  option domain-name "eqiad.wmflabs";
  renew 2 2017/04/04 18:53:10;
  rebind 3 2017/04/05 05:21:03;
  expire 3 2017/04/05 08:21:03;
}
lease {
  interface "eth0";
  fixed-address 10.68.17.31;
  filename "jessie-installer/pxelinux.0";
  server-name "install1001.wikimedia.org";
  option subnet-mask 255.255.248.0;
  option dhcp-lease-time 86400;
  option routers 10.68.16.1;
  option dhcp-message-type 5;
  option dhcp-server-identifier 10.68.16.1;
  option unknown-209 "pxelinux.cfg/ttyS1-115200";
  option domain-name-servers 208.80.155.118,208.80.154.20;
  option dhcp-renewal-time 43200;
  option dhcp-rebinding-time 75600;
  option broadcast-address 10.68.23.255;
  option host-name "fullstackd-1491265537";
  option domain-name "eqiad.wmflabs";
  renew 2 2017/04/04 14:20:34;
  rebind 2 2017/04/04 14:20:34;
  expire 2 2017/04/04 14:20:34;
}

A healthy instance has similar content in dhclient.eth0.leases. So perhaps our retry loop is refreshing the wrong thing? Or, more likely, this is a perfectly good lease, and we just didn't get it until after the init script gave up hope.

Until I can get install1001 to STOP responding to labs dhcp requests, I'm going to assume that that's the problem.

Ok, I no longer thing that install1001 is involved. Instead, it's something to do with an IP getting assigned to two instances at once. The dnsmasq logs are full of this:

not using configured address 10.68.17.31 because it is leased to fa:16:3e:93:57:43

Which is the IP of the troubled instance. fa:16:3e:93:57:43 is a contintcloud instance which has since been deleted.

Change 346318 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet@production] Nova dnsmasq: Reduce lease times and ttls by a lot

https://gerrit.wikimedia.org/r/346318

Change 346318 merged by Andrew Bogott:
[operations/puppet@production] Nova dnsmasq: Reduce lease times and ttls by a lot

https://gerrit.wikimedia.org/r/346318

@Andrew (as discussed on irc) I worry about the impact of https://gerrit.wikimedia.org/r/#/c/346318/2/modules/openstack/templates/liberty/nova/nova.conf.erb in the situation that dnsmasq is offline for a period of time (and a short amount of time at 3 minutes). Currently, our failover logic for labnets could easily exceed 3m and historically that was fine. If there are adverse reactions between lease expiration client side and dnsmasq unavailability due to this change we may be setting ourselves up for trouble here. Can you work through a few test cases to see?

Small note though that we have not dropped an instance since we changed dnsmasq params and this is almost certainly /an/ issue historically.

https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1486132582.09&target=servers.labnet1001.nova.verify.*&lineMode=connected&areaMode=stacked&from=-21d

Screen Shot 2017-04-06 at 9.46.20 AM.png (501×797 px, 137 KB)

Change 347053 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet@production] nova.conf: Reduce fixed_ip_disassociate_timeout to three minutes.

https://gerrit.wikimedia.org/r/347053

Change 347054 had a related patch set uploaded (by Andrew Bogott):
[operations/puppet@production] nova.conf: change dhcp lease times to 12 hours.

https://gerrit.wikimedia.org/r/347054

Change 347053 merged by Andrew Bogott:
[operations/puppet@production] nova.conf: Reduce fixed_ip_disassociate_timeout to three minutes.

https://gerrit.wikimedia.org/r/347053

Change 347054 merged by Andrew Bogott:
[operations/puppet@production] nova.conf: change dhcp lease times to 12 hours.

https://gerrit.wikimedia.org/r/347054

Since you mention DHCP leases. For the Nodepool images I have hunt slowness in the boot process and I got rid of a few ones related to networking setup.

A) A start job is running for LSB: Raise network interf...25s / no limit)

That was due to trying to configure eth1 (since there was a /etc/network/interfaces.d/eth1) when labs instance only have eth0. Deleting the file in the image fixed it. T113342#2838431

B) DHCP delay because a lease was in the image

systemd-analyze blame still reported 20 seconds for networking.service. That was because the image had some DHCP leases left over. Thus on boot it would try to renew the lease a few time before bailing out. The fix was to delete all leases at the end of the image generation:
/var/lib/dhclient/dhclient.*.leases && sync

T113342#2850156

At the end of that sprint I managed to get:

Trusty to boot in 16 seconds.
Jessie to boot in ~ 10 seconds.

(all that based on Nodepool images generated by diskimage-builder, so that is very different from bootstrapvz)

Five days without leaks or failures, so I think this is resolved.