Speed up the time to get a Nodepool instances to achieve READY state
Closed, ResolvedPublic

Description

When Nodepool spawn an instances, it iterates till it can ssh to it as jenkins user and then executes a ready.sh script. From T112750#1645247 that takes roughly a minute:

Instance with id 47 creation log shows it took 31 seconds for labs to boot the instance and 30 seconds for it to be ssh reachable:

14:59:02,818 DEBUG nodepool.NodeLauncher: Launching node id: 47
14:59:02,823 INFO nodepool.NodeLauncher: Creating server with hostname ci-jessie-wikimedia-47.contintcloud.eqiad.wmflabs \
                                         in wmflabs-eqiad from image ci-jessie-wikimedia for node id: 47
14:59:33,801 DEBUG nodepool.ProviderManager: Status of server in wmflabs-eqiad 541abb35-062e-4637-b46e-2fd08f4961b7: ACTIVE
14:59:33,801 DEBUG nodepool.NodeLauncher: Node id: 47 testing ssh at ip: 10.68.19.205
15:00:04,271 INFO nodepool.NodeLauncher: Node id: 47 is ready

The ssh delay might be because the OpenSSH daemon starts late in the boot process.

Instance launch time on Grafana: https://grafana.wikimedia.org/dashboard/db/nodepool

hashar created this task.Sep 22 2015, 9:07 AM
hashar added a subscriber: hashar.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 22 2015, 9:07 AM
hashar edited the task description. (Show Details)Nov 4 2016, 9:05 AM
hashar set Security to None.
hashar added a comment.EditedDec 1 2016, 1:20 PM
13:10:02,299 DEBUG nodepool.ProviderManager: Status of server in wmflabs-eqiad ce709ec4-a85c-4359-be5f-7ee2fb09a6d1: ACTIVE
13:10:02,299 DEBUG nodepool.NodeLauncher: Node id: 435953 is running, ip: 10.68.17.179, ipv6: None
13:10:02,299 DEBUG nodepool.NodeLauncher: Node id: 435953 testing ssh at ip: 10.68.17.179

33 seconds later...

13:10:35,810 INFO nodepool.NodeLauncher: Node id: 435953 is ready
13:10:35,814 DEBUG nodepool.NodeLauncher: Adding node id: 435953 to jenkins

The relevant code is in nodepool/nodepool.py method launchNode().

Paladox added a subscriber: Paladox.Dec 1 2016, 1:21 PM
hashar added a comment.Dec 1 2016, 3:36 PM

I manually spawned an instance at 15:21:49. Did a ping on it and network went up roughly a minute later at 15:22:52. ssh client was apparently available immediately. The boot log P4552, while watching it I noticed a long delay at raising the network, apparently of 25 seconds though the IP was apparently already assigned in OpenStack

[    7.679858] intel_rapl: no valid rapl domains found in package 0
[    7.820581] intel_rapl: no valid rapl domains found in package 0
 A start job is running for LSB: Raise network interf...(5s / no limit)
 A start job is running for LSB: Raise network interf...(5s / no limit)
...
 A start job is running for LSB: Raise network interf...23s / no limit)
 A start job is running for LSB: Raise network interf...23s / no limit)
 A start job is running for LSB: Raise network interf...24s / no limit)
 A start job is running for LSB: Raise network interf...24s / no limit)
 A start job is running for LSB: Raise network interf...25s / no limit)
OK Started LSB: Raise network interfaces..
OK Reached target Network.
...
[   34.838627] random: nonblocking pool is initialized

Or apparently some while loop iterate for 25 seconds

hashar added a comment.Dec 1 2016, 4:18 PM

I noticed we have two interfaces configured:

/etc/network/interfaces.d/eth0
/etc/network/interfaces.d/eth1

That is done by diskimage-builder via:

elements/debootstrap/install.d/10-debian-networking
# It would be eversogreat if we didn't need to do crap like this
echo $DISTRO_NAME > /etc/hostname

# cloud images expect eth0 and eth1 to use dhcp.
mkdir -p /etc/network/interfaces.d
echo "source /etc/network/interfaces.d/*" >> /etc/network/interfaces
for interface in eth0 eth1; do
    cat << EOF | tee /etc/network/interfaces.d/$interface
auto $interface
iface $interface inet dhcp
EOF
done

Maybe there is some timeout / retry occuring while trying to bring up eth1 which is not used.

Change 324744 had a related patch set uploaded (by Hashar):
dib: remove eth1 configuration

https://gerrit.wikimedia.org/r/324744

Change 325051 had a related patch set uploaded (by Hashar):
dib: set grub timeout to 0

https://gerrit.wikimedia.org/r/325051

Change 325052 had a related patch set uploaded (by Hashar):
dib: do not start varnish service on boot

https://gerrit.wikimedia.org/r/325052

I have created a new Jessie image with the patch that drops eth1 and the one setting grub timeout to 0. Boot log is P4560 and shows it is going apparently faster.

I have confirmed locally with qemu that the 5 seconds grub timeout is gone.

dhcp is no more attempted on eth1 and thus A start job is running for LSB: Raise network interf...25s / no limit) is gone:

[[32m  OK  [0m] Started LSB: Raise network interfaces..
[[32m  OK  [0m] Reached target Network.
[[32m  OK  [0m] Reached target Network is Online.

The message random: nonblocking pool is initialized occurs after 11 seconds (was 34 seconds).

Success!

The image for Jessie is on labnodepool1001.eqiad.wmnet in /home/hashar/image-jessie-T113342.qcow2 will deploy it next week.

Will have to build and test a new Trusty image as well.

Change 324744 merged by jenkins-bot:
dib: remove eth1 configuration

https://gerrit.wikimedia.org/r/324744

Change 325051 merged by jenkins-bot:
dib: set grub timeout to 0

https://gerrit.wikimedia.org/r/325051

Mentioned in SAL (#wikimedia-releng) [2016-12-05T08:35:52Z] <hashar> Pushing new Jessie image to Nodepool that is supposedly boot 3x times faster T113342

Mentioned in SAL (#wikimedia-releng) [2016-12-05T08:44:30Z] <hashar> Image ci-jessie-wikimedia-1480926961 in wmflabs-eqiad is ready T113342

hashar added a comment.Dec 5 2016, 9:58 AM

I have deployed the new image. We will see on https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=11&fullscreen whether that has an effect on ci-jessie-wikimedia.

hashar added a comment.EditedDec 5 2016, 11:09 AM

We should also disable the puppet-agent service while at it.

I have booted a new instance and the eth1 file is no more around:

$ find /etc/network/interfaces.d/ -type f
/etc/network/interfaces.d/eth0

However the console log shows that the LSB: Raise network interface is still running for a while, albeit with apparently a 18 seconds timeout, down from 25 seconds or so. Maybe the DHCP server does not honor requests fast enough?

system log for an instance that booted at Dec 06 10:51:00 (according to uptime -since).

From the Nova log, nova.network.manager allocated a network at 10:50:36.

Dec 06 10:51:08 ci-jessie- networking[189]: Configuring network interfaces...Internet Systems Consortium DHCP Client 4.3.1
Dec 06 10:51:08 ci-jessie- networking[189]: Listening on LPF/eth0/fa:16:3e:e7:a6:bb
Dec 06 10:51:08 ci-jessie- networking[189]: Sending on   LPF/eth0/fa:16:3e:e7:a6:bb
Dec 06 10:51:08 ci-jessie- networking[189]: Sending on   Socket/fallback
Dec 06 10:51:08 ci-jessie- networking[189]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 06 10:51:15 ci-jessie- networking[189]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 06 10:51:26 ci-jessie- networking[189]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
Dec 06 10:51:26 ci-jessie- networking[189]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 06 10:51:26 ci-jessie- networking[189]: DHCPOFFER from 10.68.16.1
Dec 06 10:51:26 ci-jessie- networking[189]: DHCPACK from 10.68.16.1
Dec 06 10:51:26 hasharpair networking[189]: bound to 10.68.17.220 -- renewal in 42518 seconds.
Dec 06 10:51:26 hasharpair networking[189]: Listening on LPF/eth0/fa:16:3e:e7:a6:bb
Dec 06 10:51:26 hasharpair networking[189]: Sending on   LPF/eth0/fa:16:3e:e7:a6:bb
Dec 06 10:51:26 hasharpair networking[189]: Sending on   Socket/fallback
Dec 06 10:51:26 hasharpair networking[189]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 06 10:51:26 hasharpair networking[189]: DHCPACK from 10.68.16.1
Dec 06 10:51:26 hasharpair networking[189]: RTNETLINK answers: File exists
Dec 06 10:51:26 hasharpair networking[189]: bound to 10.68.17.220 -- renewal in 38809 seconds.
Dec 06 10:51:26 hasharpair networking[189]: done.

I thought cloud-init would setup the network before dhcp but the instance does not seem to hit the API before 10:51:30.

systemd-analyze blame|head -n10
         19.535s networking.service
          4.883s puppet.service
          4.559s cloud-init.service
          2.703s mysql.service
          2.333s cloud-config.service
          1.917s apache2.service
          1.727s cloud-init-local.service
          1.610s varnishncsa.service
           478ms etcd.service
           441ms cloud-final.service

Boot chart:

Basically networking.service takes a while due to the dhcp requests, that delay the time for systemd to reach network-target including the ssh service.

Change 325561 had a related patch set uploaded (by Hashar):
dib: fix dupe sourcing in interfaces

https://gerrit.wikimedia.org/r/325561

Mentioned in SAL (#wikimedia-releng) [2016-12-06T16:33:33Z] <hashar> Nodepool imported a new Jessie image 'jessie-T113342' with some network configuration hotfix. Will use for debugging. T113342

Mentioned in SAL (#wikimedia-releng) [2016-12-06T16:33:33Z] <hashar> Nodepool imported a new Jessie image 'jessie-T113342' with some network configuration hotfix. Will use for debugging. T113342

hashar added a comment.Dec 6 2016, 4:47 PM

Tried with the new image (that has https://gerrit.wikimedia.org/r/325561) and the image boot in 16 seconds :]

# uptime -s
2016-12-06 16:39:13

# journalctl |grep networking
Dec 06 16:39:20 debian networking[238]: Configuring network interfaces...Internet Systems Consortium DHCP Client 4.3.1
Dec 06 16:39:20 debian networking[238]: Copyright 2004-2014 Internet Systems Consortium.
Dec 06 16:39:20 debian networking[238]: All rights reserved.
Dec 06 16:39:20 debian networking[238]: For info, please visit https://www.isc.org/software/dhcp/
Dec 06 16:39:20 debian networking[238]: Listening on LPF/eth0/fa:16:3e:de:e6:7e
Dec 06 16:39:20 debian networking[238]: Sending on   LPF/eth0/fa:16:3e:de:e6:7e
Dec 06 16:39:20 debian networking[238]: Sending on   Socket/fallback
Dec 06 16:39:20 debian networking[238]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 5
Dec 06 16:39:20 debian networking[238]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 06 16:39:20 debian networking[238]: DHCPOFFER from 10.68.16.1
Dec 06 16:39:20 debian networking[238]: DHCPACK from 10.68.16.1
Dec 06 16:39:20 debian networking[238]: bound to 10.68.23.153 -- renewal in 40201 seconds.
Dec 06 16:39:20 debian networking[238]: done.

Dec 06 16:39:28 hashar-network systemd[1]: Startup finished in 6.543s (kernel) + 9.657s (userspace) = 16.201s.

Mentioned in SAL (#wikimedia-releng) [2016-12-06T21:13:39Z] <hashar> Refresh Nodepool Jessie snapshot which boot 3 times faster. Will help get nodes available faster T113342

Mentioned in SAL (#wikimedia-releng) [2016-12-06T21:20:48Z] <hashar> Image ci-jessie-wikimedia-1481058839 in wmflabs-eqiad is ready T113342

Mentioned in SAL (#wikimedia-operations) [2016-12-06T21:21:22Z] <hashar> CI: pushed a new Jessie image that is faster to boot, should slightly help the current load. T113342

hashar added a comment.Dec 6 2016, 9:48 PM

Deployment of the new image seems all fine. Can't tell whether it has much impact yet.

hashar added a comment.Dec 7 2016, 3:35 PM

Looking at a Jessie instance:

# journalctl |grep 'Startup fini'
Dec 07 15:01:56 hashar-boot systemd[1]: Startup finished in 7.132s (kernel) + 29.603s (userspace) = 36.736s.

# uptime -s
2016-12-07 15:01:19

# journalctl |grep networking
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: Configuring network interfaces...Internet Systems Consortium DHCP Client 4.3.1
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: Copyright 2004-2014 Internet Systems Consortium.
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: All rights reserved.
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: For info, please visit https://www.isc.org/software/dhcp/
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: Listening on LPF/eth0/fa:16:3e:35:f7:c7
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: Sending on   LPF/eth0/fa:16:3e:35:f7:c7
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: Sending on   Socket/fallback
Dec 07 15:01:27 ci-jessie-wikimedia-1481120040 networking[185]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 07 15:01:31 ci-jessie-wikimedia-1481120040 networking[185]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 07 15:01:42 ci-jessie-wikimedia-1481120040 networking[185]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8
Dec 07 15:01:42 ci-jessie-wikimedia-1481120040 networking[185]: DHCPREQUEST on eth0 to 255.255.255.255 port 67
Dec 07 15:01:42 ci-jessie-wikimedia-1481120040 networking[185]: DHCPOFFER from 10.68.16.1
Dec 07 15:01:42 ci-jessie-wikimedia-1481120040 networking[185]: DHCPACK from 10.68.16.1
Dec 07 15:01:42 hashar-boot networking[185]: bound to 10.68.22.131 -- renewal in 38706 seconds.

That boot took just 15 seconds for the DHCP part.

I noticed it fails to bring up eth1 and there is still a /etc/network/interfaces.d/eth1 file :(

# systemd-analyze blame
         15.259s networking.service
          8.080s cloud-init.service
          5.832s puppet.service
          3.029s cloud-config.service
          2.774s mysql.service
          1.831s apache2.service
          1.824s cloud-init-local.service
          1.581s varnishncsa.service
hashar added a comment.Dec 7 2016, 4:31 PM

In the instance (T113342#2850982) dhclient starts straight with a DHCPDISCOVER. That instance has been booted directly from the reference image.

In my previous comment the instance got booted from the snapshot and it does a couple DHCPREQUEST. I believe that is because the snapshot still has the DHCP lease and the dhclient attempts to refresh the lease. That would be rejected and thus it goes back to a DHCPDISCOVER.

The fix is that at the end of the snapshot, we have to delete the lease files in /var/lib/dhcp/ just before the snapshot/sync and maybe explicitly release the lease. That might fix it.

hashar added a comment.Dec 7 2016, 8:46 PM

I have looked at the jessie snapshot.

/etc/network/interfaces.d/eth1 magically appeared bah :(

For the leases:

# ls -l
total 4
-rw-r--r-- 1 root root 1406 Dec  7 14:16 dhclient.eth0.leases
-rw-r--r-- 1 root root    0 Dec  7 14:16 dhclient.eth1.leases
-rw-r--r-- 1 root root    0 Nov 10 09:01 dhclient.leases

There are actually two leases saved there:

# cat dhclient.eth0.leases
lease {
  interface "eth0";
  fixed-address 10.68.20.151;
  filename "jessie-installer/pxelinux.0";
  server-name "carbon.wikimedia.org";
  option subnet-mask 255.255.248.0;
  option dhcp-lease-time 86400;
  option routers 10.68.16.1;
  option dhcp-message-type 5;
  option dhcp-server-identifier 10.68.16.1;
  option unknown-209 "pxelinux.cfg/ttyS1-115200";
  option domain-name-servers 208.80.155.118;
  option dhcp-renewal-time 43200;
  option dhcp-rebinding-time 75600;
  option broadcast-address 10.68.23.255;
  option host-name "ci-jessie-wikimedia-1481120040";
  option domain-name "eqiad.wmflabs";
  renew 4 2016/12/08 01:04:55;
  rebind 4 2016/12/08 11:16:29;
  expire 4 2016/12/08 14:16:29;
}
lease {
  interface "eth0";
  fixed-address 10.68.20.151;
  filename "jessie-installer/pxelinux.0";
  server-name "carbon.wikimedia.org";
  option subnet-mask 255.255.248.0;
  option routers 10.68.16.1;
  option dhcp-lease-time 86400;
  option dhcp-message-type 5;
  option domain-name-servers 208.80.155.118;
  option unknown-209 "pxelinux.cfg/ttyS1-115200";
  option dhcp-server-identifier 10.68.16.1;
  option dhcp-renewal-time 43200;
  option broadcast-address 10.68.23.255;
  option dhcp-rebinding-time 75600;
  option host-name "ci-jessie-wikimedia-1481120040";
  option domain-name "eqiad.wmflabs";
  renew 3 2016/12/07 23:17:22;
  rebind 4 2016/12/08 11:16:29;
  expire 4 2016/12/08 14:16:29;
}

Diff

--- one 2016-12-07 21:40:20.351200355 +0100
+++ two 2016-12-07 21:40:09.287300624 +0100
@@ -15,7 +15,7 @@
   option routers 10.68.16.1;
   option subnet-mask 255.255.248.0;
   option unknown-209 "pxelinux.cfg/ttyS1-115200";
-  renew 4 2016/12/08 01:04:55;
+  renew 3 2016/12/07 23:17:22;
   rebind 4 2016/12/08 11:16:29;
   expire 4 2016/12/08 14:16:29;
 }

I create a new image and booted an instance out of it. Deleted the lease and synced the filesystem: rm /var/lib/dhclient/dhclient.*.leases && sync. Took a snapshot and booted an instance out of it. It is no more affected by the DHCP long idle time.

Change 325852 had a related patch set uploaded (by Hashar):
nodepool: delete DHCP leases from snapshot

https://gerrit.wikimedia.org/r/325852

Change 325561 merged by jenkins-bot:
dib: fix dupe sourcing in interfaces

https://gerrit.wikimedia.org/r/325561

Change 325854 had a related patch set uploaded (by Hashar):
nodepool: delete DHCP leases from snapshot

https://gerrit.wikimedia.org/r/325854

Change 325854 merged by jenkins-bot:
nodepool: delete DHCP leases from snapshot

https://gerrit.wikimedia.org/r/325854

Mentioned in SAL (#wikimedia-releng) [2016-12-08T08:48:56Z] <hashar> Image ci-jessie-wikimedia-1481186016 in wmflabs-eqiad is ready T113342

Mentioned in SAL (#wikimedia-releng) [2016-12-08T10:21:25Z] <hashar> Refreshing Nodepool base image for Trusty. Was blocked on a mariadb upgrade, should also acquire network faster T113342

For Jessie:

$ grep 445011 /var/log/nodepool/debug.log
08:59:06,747 DEBUG nodepool.NodeLauncher: Launching node id: 445011 
08:59:06,755 INFO  nodepool.NodeLauncher: Creating server with hostname ci-jessie-wikimedia-445011 in wmflabs-eqiad from image ci-jessie-wikimedia for node id: 445011
08:59:07,903 DEBUG nodepool.NodeLauncher: Waiting for server 06a57c69-dbda-43b5-a654-daba735f95b4 for node id: 445011 
08:59:49,534 DEBUG nodepool.NodeLauncher: Node id: 445011 is running, ip: 10.68.22.237, ipv6: None 
08:59:49,534 DEBUG nodepool.NodeLauncher: Node id: 445011 testing ssh at ip: 10.68.22.237
09:00:02,810 INFO  nodepool.NodeLauncher: Node id: 445011 is ready
09:00:02,816 DEBUG nodepool.NodeLauncher: Adding node id: 445011 to jenkins 
09:00:03,096 INFO  nodepool.NodeLauncher: Node id: 445011 added to jenkins

Or 53 seconds all included.

For Trusty, cloud-init reports at the end Up 16.53 seconds!

10:41:09,721 DEBUG nodepool.NodeLauncher: Launching node id: 445085 
10:41:09,734 INFO  nodepool.NodeLauncher: Creating server with hostname ci-trusty-wikimedia-445085 in wmflabs-eqiad from image ci-trusty-wikimedia for node id: 445085
10:41:17,640 DEBUG nodepool.NodeLauncher: Waiting for server eaa9ccc0-2e41-4497-b95b-be411ef16b2c for node id: 445085 
10:41:45,670 DEBUG nodepool.NodeLauncher: Node id: 445085 is running, ip: 10.68.19.88, ipv6: None 
10:41:45,671 DEBUG nodepool.NodeLauncher: Node id: 445085 testing ssh at ip: 10.68.19.88
10:42:00,377 INFO  nodepool.NodeLauncher: Node id: 445085 is ready
10:42:00,382 DEBUG nodepool.NodeLauncher: Adding node id: 445085 to jenkins 
10:42:05,954 INFO  nodepool.NodeLauncher: Node id: 445085 added to jenkins

Or 54 seconds.

Bulk of the work complete!

Change 325852 abandoned by Hashar:
nodepool: delete DHCP leases from snapshot

Reason:
Glitch in Gerrit ( T152642 ). Sent and merged via https://gerrit.wikimedia.org/r/#/c/325854/

https://gerrit.wikimedia.org/r/325852

Change 325052 merged by jenkins-bot:
dib: do not start varnish service on boot

https://gerrit.wikimedia.org/r/325052

After bunch of optimizations:

Trusty images boot in 16 seconds.
Jessie ones boot in ~ 10 seconds.

It still takes ~ 50-70 seconds to get them added though. Partly because Nodepool queries to OpenStack API are rate limited, partly due to OpenStack overhead itself (noticable when spawning instances in Horizon).

hashar closed this task as "Resolved".Mar 13 2017, 3:03 PM

On the images themselves there is nothing else we can do. Part of the slowness was due to OpenStack taking a while to boot some instances but that has been fixed.

For the record, the timing can be seen at https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=11&fullscreen