Speed up the time to get a Nodepool instances to achieve READY state
Open, LowPublic

Description

When Nodepool spawn an instances, it iterates till it can ssh to it as jenkins user and then executes a ready.sh script. From T112750#1645247 that takes roughly a minute:

Instance with id 47 creation log shows it took 31 seconds for labs to boot the instance and 30 seconds for it to be ssh reachable:

14:59:02,818 DEBUG nodepool.NodeLauncher: Launching node id: 47
14:59:02,823 INFO nodepool.NodeLauncher: Creating server with hostname ci-jessie-wikimedia-47.contintcloud.eqiad.wmflabs \
                                         in wmflabs-eqiad from image ci-jessie-wikimedia for node id: 47
14:59:33,801 DEBUG nodepool.ProviderManager: Status of server in wmflabs-eqiad 541abb35-062e-4637-b46e-2fd08f4961b7: ACTIVE
14:59:33,801 DEBUG nodepool.NodeLauncher: Node id: 47 testing ssh at ip: 10.68.19.205
15:00:04,271 INFO nodepool.NodeLauncher: Node id: 47 is ready

The ssh delay might be because the OpenSSH daemon starts late in the boot process.

Instance launch time on Grafana: https://grafana.wikimedia.org/dashboard/db/nodepool

hashar created this task.Sep 22 2015, 9:07 AM
hashar added a subscriber: hashar.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 22 2015, 9:07 AM
hashar edited the task description. (Show Details)Nov 4 2016, 9:05 AM
hashar set Security to None.
hashar added a comment.EditedDec 1 2016, 1:20 PM
13:10:02,299 DEBUG nodepool.ProviderManager: Status of server in wmflabs-eqiad ce709ec4-a85c-4359-be5f-7ee2fb09a6d1: ACTIVE
13:10:02,299 DEBUG nodepool.NodeLauncher: Node id: 435953 is running, ip: 10.68.17.179, ipv6: None
13:10:02,299 DEBUG nodepool.NodeLauncher: Node id: 435953 testing ssh at ip: 10.68.17.179

33 seconds later...

13:10:35,810 INFO nodepool.NodeLauncher: Node id: 435953 is ready
13:10:35,814 DEBUG nodepool.NodeLauncher: Adding node id: 435953 to jenkins

The relevant code is in nodepool/nodepool.py method launchNode().

Paladox added a subscriber: Paladox.Dec 1 2016, 1:21 PM
hashar added a comment.Dec 1 2016, 3:36 PM

I manually spawned an instance at 15:21:49. Did a ping on it and network went up roughly a minute later at 15:22:52. ssh client was apparently available immediately. The boot log P4552, while watching it I noticed a long delay at raising the network, apparently of 25 seconds though the IP was apparently already assigned in OpenStack

[    7.679858] intel_rapl: no valid rapl domains found in package 0
[    7.820581] intel_rapl: no valid rapl domains found in package 0
 A start job is running for LSB: Raise network interf...(5s / no limit)
 A start job is running for LSB: Raise network interf...(5s / no limit)
...
 A start job is running for LSB: Raise network interf...23s / no limit)
 A start job is running for LSB: Raise network interf...23s / no limit)
 A start job is running for LSB: Raise network interf...24s / no limit)
 A start job is running for LSB: Raise network interf...24s / no limit)
 A start job is running for LSB: Raise network interf...25s / no limit)
OK Started LSB: Raise network interfaces..
OK Reached target Network.
...
[   34.838627] random: nonblocking pool is initialized

Or apparently some while loop iterate for 25 seconds

hashar added a comment.Dec 1 2016, 4:18 PM

I noticed we have two interfaces configured:

/etc/network/interfaces.d/eth0
/etc/network/interfaces.d/eth1

That is done by diskimage-builder via:

elements/debootstrap/install.d/10-debian-networking
# It would be eversogreat if we didn't need to do crap like this
echo $DISTRO_NAME > /etc/hostname

# cloud images expect eth0 and eth1 to use dhcp.
mkdir -p /etc/network/interfaces.d
echo "source /etc/network/interfaces.d/*" >> /etc/network/interfaces
for interface in eth0 eth1; do
    cat << EOF | tee /etc/network/interfaces.d/$interface
auto $interface
iface $interface inet dhcp
EOF
done

Maybe there is some timeout / retry occuring while trying to bring up eth1 which is not used.

Change 324744 had a related patch set uploaded (by Hashar):
dib: remove eth1 configuration

https://gerrit.wikimedia.org/r/324744

Change 325051 had a related patch set uploaded (by Hashar):
dib: set grub timeout to 0

https://gerrit.wikimedia.org/r/325051

Change 325052 had a related patch set uploaded (by Hashar):
dib: do not start varnish service on boot

https://gerrit.wikimedia.org/r/325052

I have created a new Jessie image with the patch that drops eth1 and the one setting grub timeout to 0. Boot log is P4560 and shows it is going apparently faster.

I have confirmed locally with qemu that the 5 seconds grub timeout is gone.

dhcp is no more attempted on eth1 and thus A start job is running for LSB: Raise network interf...25s / no limit) is gone:

[[32m  OK  [0m] Started LSB: Raise network interfaces..
[[32m  OK  [0m] Reached target Network.
[[32m  OK  [0m] Reached target Network is Online.

The message random: nonblocking pool is initialized occurs after 11 seconds (was 34 seconds).

Success!

The image for Jessie is on labnodepool1001.eqiad.wmnet in /home/hashar/image-jessie-T113342.qcow2 will deploy it next week.

Will have to build and test a new Trusty image as well.

Change 324744 merged by jenkins-bot:
dib: remove eth1 configuration

https://gerrit.wikimedia.org/r/324744

Change 325051 merged by jenkins-bot:
dib: set grub timeout to 0

https://gerrit.wikimedia.org/r/325051

Mentioned in SAL (#wikimedia-releng) [2016-12-05T08:35:52Z] <hashar> Pushing new Jessie image to Nodepool that is supposedly boot 3x times faster T113342

Mentioned in SAL (#wikimedia-releng) [2016-12-05T08:44:30Z] <hashar> Image ci-jessie-wikimedia-1480926961 in wmflabs-eqiad is ready T113342

hashar added a comment.Dec 5 2016, 9:58 AM

I have deployed the new image. We will see on https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=11&fullscreen whether that has an effect on ci-jessie-wikimedia.

hashar added a comment.EditedDec 5 2016, 11:09 AM

We should also disable the puppet-agent service while at it.

I have booted a new instance and the eth1 file is no more around:

$ find /etc/network/interfaces.d/ -type f
/etc/network/interfaces.d/eth0

However the console log shows that the LSB: Raise network interface is still running for a while, albeit with apparently a 18 seconds timeout, down from 25 seconds or so. Maybe the DHCP server does not honor requests fast enough?

system log for an instance that booted at Dec 06 10:51:00 (according to uptime -since).

From the Nova log, nova.network.manager allocated a network at 10:50:36.

Dec 06 10:51:08 ci-jessie- networking[189]: Configuring network interfaces...Internet Systems Consortium DHCP Client 4.3.1
Dec 06 10:51:08 ci-jessie- networking[189]: Listening on LPF/eth0/fa:16:3e:e7:a6:bb
Dec 06 10:51:08 ci-jessie- networking[189]: Sending on   LPF/eth0/fa:16:3e:e7:a6:bb
Dec 06 10:51:08 ci-jessie- networking[189]: Sending on   Socket/fallback
Dec 06 10:51:08 ci-jessie- networking[189]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 06 10:51:15 ci-jessie- networking[189]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 06 10:51:26 ci-jessie- networking[189]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
Dec 06 10:51:26 ci-jessie- networking[189]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 06 10:51:26 ci-jessie- networking[189]: DHCPOFFER from 10.68.16.1
Dec 06 10:51:26 ci-jessie- networking[189]: DHCPACK from 10.68.16.1
Dec 06 10:51:26 hasharpair networking[189]: bound to 10.68.17.220 -- renewal in 42518 seconds.
Dec 06 10:51:26 hasharpair networking[189]: Listening on LPF/eth0/fa:16:3e:e7:a6:bb
Dec 06 10:51:26 hasharpair networking[189]: Sending on   LPF/eth0/fa:16:3e:e7:a6:bb
Dec 06 10:51:26 hasharpair networking[189]: Sending on   Socket/fallback
Dec 06 10:51:26 hasharpair networking[189]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 06 10:51:26 hasharpair networking[189]: DHCPACK from 10.68.16.1
Dec 06 10:51:26 hasharpair networking[189]: RTNETLINK answers: File exists
Dec 06 10:51:26 hasharpair networking[189]: bound to 10.68.17.220 -- renewal in 38809 seconds.
Dec 06 10:51:26 hasharpair networking[189]: done.

I thought cloud-init would setup the network before dhcp but the instance does not seem to hit the API before 10:51:30.

systemd-analyze blame|head -n10
         19.535s networking.service
          4.883s puppet.service
          4.559s cloud-init.service
          2.703s mysql.service
          2.333s cloud-config.service
          1.917s apache2.service
          1.727s cloud-init-local.service
          1.610s varnishncsa.service
           478ms etcd.service
           441ms cloud-final.service

Boot chart:

Basically networking.service takes a while due to the dhcp requests, that delay the time for systemd to reach network-target including the ssh service.

Change 325561 had a related patch set uploaded (by Hashar):
dib: fix dupe sourcing in interfaces

https://gerrit.wikimedia.org/r/325561

Mentioned in SAL (#wikimedia-releng) [2016-12-06T16:33:33Z] <hashar> Nodepool imported a new Jessie image 'jessie-T113342' with some network configuration hotfix. Will use for debugging. T113342

Mentioned in SAL (#wikimedia-releng) [2016-12-06T16:33:33Z] <hashar> Nodepool imported a new Jessie image 'jessie-T113342' with some network configuration hotfix. Will use for debugging. T113342

hashar added a comment.Dec 6 2016, 4:47 PM

Tried with the new image (that has https://gerrit.wikimedia.org/r/325561) and the image boot in 16 seconds :]

# uptime -s
2016-12-06 16:39:13

# journalctl |grep networking
Dec 06 16:39:20 debian networking[238]: Configuring network interfaces...Internet Systems Consortium DHCP Client 4.3.1
Dec 06 16:39:20 debian networking[238]: Copyright 2004-2014 Internet Systems Consortium.
Dec 06 16:39:20 debian networking[238]: All rights reserved.
Dec 06 16:39:20 debian networking[238]: For info, please visit https://www.isc.org/software/dhcp/
Dec 06 16:39:20 debian networking[238]: Listening on LPF/eth0/fa:16:3e:de:e6:7e
Dec 06 16:39:20 debian networking[238]: Sending on   LPF/eth0/fa:16:3e:de:e6:7e
Dec 06 16:39:20 debian networking[238]: Sending on   Socket/fallback
Dec 06 16:39:20 debian networking[238]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 5
Dec 06 16:39:20 debian networking[238]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 06 16:39:20 debian networking[238]: DHCPOFFER from 10.68.16.1
Dec 06 16:39:20 debian networking[238]: DHCPACK from 10.68.16.1
Dec 06 16:39:20 debian networking[238]: bound to 10.68.23.153 -- renewal in 40201 seconds.
Dec 06 16:39:20 debian networking[238]: done.

Dec 06 16:39:28 hashar-network systemd[1]: Startup finished in 6.543s (kernel) + 9.657s (userspace) = 16.201s.

Mentioned in SAL (#wikimedia-releng) [2016-12-06T21:13:39Z] <hashar> Refresh Nodepool Jessie snapshot which boot 3 times faster. Will help get nodes available faster T113342

Mentioned in SAL (#wikimedia-releng) [2016-12-06T21:20:48Z] <hashar> Image ci-jessie-wikimedia-1481058839 in wmflabs-eqiad is ready T113342

Mentioned in SAL (#wikimedia-operations) [2016-12-06T21:21:22Z] <hashar> CI: pushed a new Jessie image that is faster to boot, should slightly help the current load. T113342

hashar added a comment.Dec 6 2016, 9:48 PM

Deployment of the new image seems all fine. Can't tell whether it has much impact yet.

hashar added a comment.Dec 7 2016, 3:35 PM

Looking at a Jessie instance:

# journalctl |grep 'Startup fini'
Dec 07 15:01:56 hashar-boot systemd[1]: Startup finished in 7.132s (kernel) + 29.603s (userspace) = 36.736s.

# uptime -s
2016-12-07 15:01:19

# journalctl |grep networking
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: Configuring network interfaces...Internet Systems Consortium DHCP Client 4.3.1
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: Copyright 2004-2014 Internet Systems Consortium.
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: All rights reserved.
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: For info, please visit https://www.isc.org/software/dhcp/
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: Listening on LPF/eth0/fa:16:3e:35:f7:c7
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: Sending on   LPF/eth0/fa:16:3e:35:f7:c7
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: Sending on   Socket/fallback
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 07 15:01:31 ci-jessie-wikimedia-1481120040 networking[185]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 07 15:01:42 ci-jessie-wikimedia-1481120040 networking[185]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Dec 07 15:01:42 ci-jessie-wikimedia-1481120040 networking[185]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 07 15:01:42 ci-jessie-wikimedia-1481120040 networking[185]: DHCPOFFER from 10.68.16.1
Dec 07 15:01:42 ci-jessie-wikimedia-1481120040 networking[185]: DHCPACK from 10.68.16.1
Dec 07 15:01:42 hashar-boot networking[185]: bound to 10.68.22.131 -- renewal in 38706 seconds.

That boot took just 15 seconds for the DHCP part.

I noticed it fails to bring up eth1 and there is still a /etc/network/interfaces.d/eth1 file :(

# systemd-analyze blame
         15.259s networking.service
          8.080s cloud-init.service
          5.832s puppet.service
          3.029s cloud-config.service
          2.774s mysql.service
          1.831s apache2.service
          1.824s cloud-init-local.service
          1.581s varnishncsa.service
hashar added a comment.Dec 7 2016, 4:31 PM

In the instance (T113342#2850982) dhclient starts straight with a DHCPDISCOVER. That instance has been booted directly from the reference image.

In my previous comment the instance got booted from the snapshot and it does a couple DHCPREQUEST. I believe that is because the snapshot still has the DHCP lease and the dhclient attempts to refresh the lease. That would be rejected and thus it goes back to a DHCPDISCOVER.

The fix is that at the end of the snapshot, we have to delete the lease files in /var/lib/dhcp/ just before the snapshot/sync and maybe explicitly release the lease. That might fix it.

hashar added a comment.Dec 7 2016, 8:46 PM

I have looked at the jessie snapshot.

/etc/network/interfaces.d/eth1 magically appeared bah :(

For the leases:

# ls -l
total 4
-rw-r--r-- 1 root root 1406 Dec  7 14:16 dhclient.eth0.leases
-rw-r--r-- 1 root root    0 Dec  7 14:16 dhclient.eth1.leases
-rw-r--r-- 1 root root    0 Nov 10 09:01 dhclient.leases

There are actually two leases saved there:

# cat dhclient.eth0.leases
lease {
  interface "eth0";
  fixed-address 10.68.20.151;
  filename "jessie-installer/pxelinux.0";
  server-name "carbon.wikimedia.org";
  option subnet-mask 255.255.248.0;
  option dhcp-lease-time 86400;
  option routers 10.68.16.1;
  option dhcp-message-type 5;
  option dhcp-server-identifier 10.68.16.1;
  option unknown-209 "pxelinux.cfg/ttyS1-115200";
  option domain-name-servers 208.80.155.118;
  option dhcp-renewal-time 43200;
  option dhcp-rebinding-time 75600;
  option broadcast-address 10.68.23.255;
  option host-name "ci-jessie-wikimedia-1481120040";
  option domain-name "eqiad.wmflabs";
  renew 4 2016/12/08 01:04:55;
  rebind 4 2016/12/08 11:16:29;
  expire 4 2016/12/08 14:16:29;
}
lease {
  interface "eth0";
  fixed-address 10.68.20.151;
  filename "jessie-installer/pxelinux.0";
  server-name "carbon.wikimedia.org";
  option subnet-mask 255.255.248.0;
  option routers 10.68.16.1;
  option dhcp-lease-time 86400;
  option dhcp-message-type 5;
  option domain-name-servers 208.80.155.118;
  option unknown-209 "pxelinux.cfg/ttyS1-115200";
  option dhcp-server-identifier 10.68.16.1;
  option dhcp-renewal-time 43200;
  option broadcast-address 10.68.23.255;
  option dhcp-rebinding-time 75600;
  option host-name "ci-jessie-wikimedia-1481120040";
  option domain-name "eqiad.wmflabs";
  renew 3 2016/12/07 23:17:22;
  rebind 4 2016/12/08 11:16:29;
  expire 4 2016/12/08 14:16:29;
}

Diff

--- one 2016-12-07 21:40:20.351200355 +0100
+++ two 2016-12-07 21:40:09.287300624 +0100
@@ -15,7 +15,7 @@
   option routers 10.68.16.1;
   option subnet-mask 255.255.248.0;
   option unknown-209 "pxelinux.cfg/ttyS1-115200";
-  renew 4 2016/12/08 01:04:55;
+  renew 3 2016/12/07 23:17:22;
   rebind 4 2016/12/08 11:16:29;
   expire 4 2016/12/08 14:16:29;
 }

I create a new image and booted an instance out of it. Deleted the lease and synced the filesystem: rm /var/lib/dhclient/dhclient.*.leases && sync. Took a snapshot and booted an instance out of it. It is no more affected by the DHCP long idle time.

Change 325852 had a related patch set uploaded (by Hashar):
nodepool: delete DHCP leases from snapshot

https://gerrit.wikimedia.org/r/325852

Change 325561 merged by jenkins-bot:
dib: fix dupe sourcing in interfaces

https://gerrit.wikimedia.org/r/325561

Change 325854 had a related patch set uploaded (by Hashar):
nodepool: delete DHCP leases from snapshot

https://gerrit.wikimedia.org/r/325854

Change 325854 merged by jenkins-bot:
nodepool: delete DHCP leases from snapshot

https://gerrit.wikimedia.org/r/325854

Mentioned in SAL (#wikimedia-releng) [2016-12-08T08:48:56Z] <hashar> Image ci-jessie-wikimedia-1481186016 in wmflabs-eqiad is ready T113342

Mentioned in SAL (#wikimedia-releng) [2016-12-08T10:21:25Z] <hashar> Refreshing Nodepool base image for Trusty. Was blocked on a mariadb upgrade, should also acquire network faster T113342

For Jessie:

$ grep 445011 /var/log/nodepool/debug.log
08:59:06,747 DEBUG nodepool.NodeLauncher: Launching node id: 445011 
08:59:06,755 INFO  nodepool.NodeLauncher: Creating server with hostname ci-jessie-wikimedia-445011 in wmflabs-eqiad from image ci-jessie-wikimedia for node id: 445011
08:59:07,903 DEBUG nodepool.NodeLauncher: Waiting for server 06a57c69-dbda-43b5-a654-daba735f95b4 for node id: 445011 
08:59:49,534 DEBUG nodepool.NodeLauncher: Node id: 445011 is running, ip: 10.68.22.237, ipv6: None 
08:59:49,534 DEBUG nodepool.NodeLauncher: Node id: 445011 testing ssh at ip: 10.68.22.237
09:00:02,810 INFO  nodepool.NodeLauncher: Node id: 445011 is ready
09:00:02,816 DEBUG nodepool.NodeLauncher: Adding node id: 445011 to jenkins 
09:00:03,096 INFO  nodepool.NodeLauncher: Node id: 445011 added to jenkins

Or 53 seconds all included.

For Trusty, cloud-init reports at the end Up 16.53 seconds!

10:41:09,721 DEBUG nodepool.NodeLauncher: Launching node id: 445085 
10:41:09,734 INFO  nodepool.NodeLauncher: Creating server with hostname ci-trusty-wikimedia-445085 in wmflabs-eqiad from image ci-trusty-wikimedia for node id: 445085
10:41:17,640 DEBUG nodepool.NodeLauncher: Waiting for server eaa9ccc0-2e41-4497-b95b-be411ef16b2c for node id: 445085 
10:41:45,670 DEBUG nodepool.NodeLauncher: Node id: 445085 is running, ip: 10.68.19.88, ipv6: None 
10:41:45,671 DEBUG nodepool.NodeLauncher: Node id: 445085 testing ssh at ip: 10.68.19.88
10:42:00,377 INFO  nodepool.NodeLauncher: Node id: 445085 is ready
10:42:00,382 DEBUG nodepool.NodeLauncher: Adding node id: 445085 to jenkins 
10:42:05,954 INFO  nodepool.NodeLauncher: Node id: 445085 added to jenkins

Or 54 seconds.

Bulk of the work complete!

Change 325852 abandoned by Hashar:
nodepool: delete DHCP leases from snapshot

Reason:
Glitch in Gerrit ( T152642 ). Sent and merged via https://gerrit.wikimedia.org/r/#/c/325854/

https://gerrit.wikimedia.org/r/325852

Change 325052 merged by jenkins-bot:
dib: do not start varnish service on boot

https://gerrit.wikimedia.org/r/325052

After bunch of optimizations:

Trusty images boot in 16 seconds.
Jessie ones boot in ~ 10 seconds.

It still takes ~ 50-70 seconds to get them added though. Partly because Nodepool queries to OpenStack API are rate limited, partly due to OpenStack overhead itself (noticable when spawning instances in Horizon).