
openstack instance creation sometimes takes >480s
Closed, Resolved · Public

Description

We finally got a continual fullstack instance lifecycle test going in https://gerrit.wikimedia.org/r/#/c/339651/ but it has turned up an oddity: instance creation sometimes takes a long time. Originally the allowed threshold was 240s, and I upped that to 480s as we were leaking 3 instances in half a day. We are still violating the 480s threshold over the course of 2 days. Either we need to agree that it's OK for this to take more than 8 minutes periodically, or figure out why it does and mitigate. At the moment I am under the impression that frequently taking 8 minutes to create an instance (where creation is defined as the instance being returned as ACTIVE via the API) is probably a sign of deeper issues.
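
For context, the creation stage being measured is roughly "call the API, then poll until the server is reported ACTIVE". A minimal sketch of that idea, assuming openstacksdk and a hypothetical clouds.yaml entry named "eqiad"; the actual nova-fullstack script differs (network selection, cleanup, and metric emission are elided here):

import time
import openstack

CREATION_THRESHOLD = 480  # seconds, the raised limit discussed above

conn = openstack.connect(cloud="eqiad")  # hypothetical cloud entry

start = time.time()
server = conn.compute.create_server(
    name="fullstackd-%d" % int(start),
    image_id="ea1ee8b4-c4cf-4d68-9ea6-cfaeb710b9b0",  # debian-8.7-jessie (from the pastes below)
    flavor_id="2",                                    # m1.small (from the pastes below)
)
# "Creation" here means the API reports the server ACTIVE.
while server.status not in ("ACTIVE", "ERROR"):
    if time.time() - start > CREATION_THRESHOLD:
        raise RuntimeError("creation exceeded %ds" % CREATION_THRESHOLD)
    time.sleep(10)
    server = conn.compute.get_server(server.id)
print("creation took %.2fs" % (time.time() - start))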

@hashar created a Grafana dashboard for this service: https://grafana.wikimedia.org/dashboard/db/labs-nova-fullstack

A direct Graphite comparison of creation time vs. total fullstack test time, where creation is clearly the consistent problem child:

https://graphite.wikimedia.org/render/?width=973&height=488&_salt=1488408597.388&lineMode=connected&target=servers.labnet1001.nova.verify.creation&target=servers.labnet1001.nova.verify.fullstack&lineWidth=3&title=Fullstack%20test%20and%20Creation%20times&fontBold=true&bgcolor=FFFFFF&graphOnly=false&hideAxes=false
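
The same two series can be pulled numerically rather than as a PNG; a hedged sketch using the Graphite render API's standard format=json output (the requests library is assumed, URL as above):

import requests

resp = requests.get(
    "https://graphite.wikimedia.org/render/",
    params={
        "target": [
            "servers.labnet1001.nova.verify.creation",
            "servers.labnet1001.nova.verify.fullstack",
        ],
        "format": "json",
        "from": "-7d",
    },
)
# Graphite JSON: a list of {"target": ..., "datapoints": [[value, epoch], ...]}.
for series in resp.json():
    points = [v for v, _ in series["datapoints"] if v is not None]
    if points:
        print("%s max=%.1fs" % (series["target"], max(points)))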

This morning I am cleaning up 3 instances leaked since yesterday afternoon that took >480s, which exhausted the max_pool param of 3 and caused the service itself to cease and alert. When I spot-checked them now, hours later, they are fine.

+--------------------------------------+-----------------------+--------+---------------------+
| ID                                   | Name                  | Status | Networks            |
+--------------------------------------+-----------------------+--------+---------------------+
| b051e079-a8b8-4191-832e-39072c3a83fa | fullstackd-1488405544 | ACTIVE | public=10.68.16.43  |
| 3d937088-1d8f-4e5b-ae34-d87535fc4c2b | fullstackd-1488372409 | ACTIVE | public=10.68.18.191 |
| 210422be-69ee-4a06-90d6-9af162e93773 | fullstackd-1488327296 | ACTIVE | public=10.68.19.129 |
+--------------------------------------+-----------------------+--------+---------------------+
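
The cleanup itself is scriptable, since the epoch timestamp is embedded in the instance name; a hedged sketch assuming openstacksdk and the same hypothetical cloud entry (the age cutoff is illustrative, and the real cleanup was presumably done by hand or by the service's own tooling):

import time
import openstack

conn = openstack.connect(cloud="eqiad")  # hypothetical cloud entry
now = int(time.time())
MAX_AGE = 3600  # illustrative: anything this old was leaked by a failed run

for server in conn.compute.servers():
    if not server.name.startswith("fullstackd-"):
        continue
    # The creation epoch is the suffix of the name, e.g. fullstackd-1488405544.
    created = int(server.name.split("-", 1)[1])
    if now - created > MAX_AGE:
        print("deleting leaked instance %s (%s)" % (server.name, server.id))
        conn.compute.delete_server(server.id)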

Event Timeline

Nodepool emits a metric similar to the fullstack one: it measures from the time an instance is created internally by Nodepool until the time it has been added as a Jenkins slave. So that is roughly fullstack plus all of Nodepool's overhead.

The metrics have data going back to the beginning of August 2016, so maybe that can help. There is one metric per snapshot, but that does not seem to be relevant.

7 days : https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=11&fullscreen&from=now-7d&to=now

30 days : https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=11&fullscreen&from=now-30d&to=now

Over 30 days one can see the slowdown has been noticeable since mid-February.

The 30-day maximum: https://grafana.wikimedia.org/dashboard/db/nodepool?panelId=13&fullscreen&from=now-30d&to=now (data between Feb 14 and Feb 21 are invalid, due to a Graphite host switch :-( )

Some labvirt* hosts show a bump in guest CPU / load average. labvirt1001 especially is concerning (load rising to 45+). Graphs over 30 days:

labvirt1001
labvirt1006
labvirt1008

Leaked 3 more now; leaving them to debug tomorrow.

PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack

So, this seems to be partially SSH timeouts. I have no problem upping that timeout for now while we are still figuring out baselines. The Puppet run and setup variance is the most understandable of the stages.

https://gerrit.wikimedia.org/r/#/c/340979/
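
For reference, the SSH stage amounts to polling until the instance accepts TCP connections on port 22; a minimal sketch with an adjustable deadline (names are illustrative, not from the actual script):

import socket
import time

def wait_for_ssh(host, deadline=180, port=22):
    """Poll until something accepts TCP connections on the SSH port."""
    start = time.time()
    while time.time() - start < deadline:
        try:
            # socket.timeout is an OSError subclass, so one except clause
            # covers both connection refusal and per-attempt timeouts.
            with socket.create_connection((host, port), timeout=5):
                return time.time() - start
        except OSError:
            time.sleep(5)
    raise TimeoutError("no SSH on %s:%d after %ss" % (host, port, deadline))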

So I just caught an instance that had initial issues.

2017-03-03 14:38:46,717 INFO Creating fullstackd-1488551924
2017-03-03 14:44:32,354 INFO servers.labnet1001.nova.verify.creation => 345.63 1488552272
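
(The second log line is a datapoint being recorded: metric name, value in seconds, epoch timestamp. That is exactly the shape of Graphite's plaintext protocol, which accepts one such line per datapoint on TCP port 2003; a hedged sketch of submitting one, with a hypothetical carbon host:)

import socket
import time

def submit_timing(metric, value, host="graphite-in.example.org", port=2003):
    # Graphite plaintext protocol: "<metric> <value> <epoch>\n" per datapoint.
    line = "%s %s %d\n" % (metric, value, int(time.time()))
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

submit_timing("servers.labnet1001.nova.verify.creation", 345.63)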

It was initially attempted on labvirt1001:

root@labcontrol1001:~# nova show 4ee7e5ef-502c-4959-8492-f7e5ae15b434
+--------------------------------------+----------------------------------------------------------+
| Property                             | Value                                                    |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                   |
| OS-EXT-AZ:availability_zone          | nova                                                     |
| OS-EXT-SRV-ATTR:host                 | labvirt1001                                              |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | labvirt1001.eqiad.wmnet                                  |
| OS-EXT-SRV-ATTR:instance_name        | i-00064626                                               |
| OS-EXT-STS:power_state               | 0                                                        |
| OS-EXT-STS:task_state                | spawning                                                 |
| OS-EXT-STS:vm_state                  | building                                                 |
| OS-SRV-USG:launched_at               | -                                                        |
| OS-SRV-USG:terminated_at             | -                                                        |
| accessIPv4                           |                                                          |
| accessIPv6                           |                                                          |
| config_drive                         |                                                          |
| created                              | 2017-03-03T14:38:47Z                                     |
| flavor                               | m1.small (2)                                             |
| hostId                               | 4d85e25cf9e51abcda0e591fa1c2612ccbf6b64d4b7d1bba6a4e761f |
| id                                   | 4ee7e5ef-502c-4959-8492-f7e5ae15b434                     |
| image                                | debian-8.7-jessie (ea1ee8b4-c4cf-4d68-9ea6-cfaeb710b9b0) |
| key_name                             | -                                                        |
| metadata                             | {}                                                       |
| name                                 | fullstackd-1488551924                                    |
| os-extended-volumes:volumes_attached | []                                                       |
| progress                             | 0                                                        |
| public network                       | 10.68.17.158                                             |
| security_groups                      | default                                                  |
| status                               | BUILD                                                    |
| tenant_id                            | admin-monitoring                                         |
| updated                              | 2017-03-03T14:38:52Z                                     |
| user_id                              | osstackcanary                                            |
+--------------------------------------+----------------------------------------------------------+

This /failed/ to work out on labvirt1001, with what looks like a libvirt error.

Full grep: https://phabricator.wikimedia.org/P5014

 upstart/nova-compute.log:2017-03-03 14:44:13.988 40678 ERROR nova.compute.manager [instance: 4ee7e5ef-502c-4959-8492-f7e5ae15b434]   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1059, in createWithFlags
upstart/nova-compute.log:2017-03-03 14:44:13.988 40678 ERROR nova.compute.manager [instance: 4ee7e5ef-502c-4959-8492-f7e5ae15b434]     if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
upstart/nova-compute.log:2017-03-03 14:44:13.988 40678 ERROR nova.compute.manager [instance: 4ee7e5ef-502c-4959-8492-f7e5ae15b434] libvirtError: error from service: ListActivatableNames: Connection is closed

At that point the scheduler compensated and reallocated:

+--------------------------------------+----------------------------------------------------------+
| Property                             | Value                                                    |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                   |
| OS-EXT-AZ:availability_zone          | nova                                                     |
| OS-EXT-SRV-ATTR:host                 | labvirt1008                                              |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | labvirt1008.eqiad.wmnet                                  |
| OS-EXT-SRV-ATTR:instance_name        | i-00064626                                               |
| OS-EXT-STS:power_state               | 1                                                        |
| OS-EXT-STS:task_state                | deleting                                                 |
| OS-EXT-STS:vm_state                  | active                                                   |
| OS-SRV-USG:launched_at               | 2017-03-03T14:44:29.000000                               |
| OS-SRV-USG:terminated_at             | -                                                        |
| accessIPv4                           |                                                          |
| accessIPv6                           |                                                          |
| config_drive                         |                                                          |
| created                              | 2017-03-03T14:38:47Z                                     |
| flavor                               | m1.small (2)                                             |
| hostId                               | 788f3e33f7f3aef72a1ef874afd11ac46e4dd16413f425a7a2678eb4 |
| id                                   | 4ee7e5ef-502c-4959-8492-f7e5ae15b434                     |
| image                                | debian-8.7-jessie (ea1ee8b4-c4cf-4d68-9ea6-cfaeb710b9b0) |
| key_name                             | -                                                        |
| metadata                             | {}                                                       |
| name                                 | fullstackd-1488551924                                    |
| os-extended-volumes:volumes_attached | []                                                       |
| progress                             | 0                                                        |
| public network                       | 10.68.17.158                                             |
| security_groups                      | default                                                  |
| status                               | ACTIVE                                                   |
| tenant_id                            | admin-monitoring                                         |
| updated                              | 2017-03-03T14:48:00Z                                     |
| user_id                              | osstackcanary                                            |
+--------------------------------------+----------------------------------------------------------+

After the reallocation it basically returned in <30s.

Load across the active labvirts:

labvirt1001
 15:01:30 up 128 days, 20:30,  1 user,  load average: 52.25, 47.77, 48.21
labvirt1004
 15:01:32 up 121 days, 19:37,  0 users,  load average: 42.00, 39.35, 40.04
labvirt1002
 15:01:34 up 121 days, 20:47,  0 users,  load average: 22.36, 24.42, 24.90
labvirt1006
 15:01:36 up 121 days, 18:39,  0 users,  load average: 26.48, 32.49, 37.48
labvirt1005
 15:01:38 up 121 days, 18:10,  0 users,  load average: 17.05, 19.53, 21.22
labvirt1003
 15:01:40 up 121 days, 20:08,  0 users,  load average: 19.26, 23.03, 24.77
labvirt1007
 15:01:41 up 121 days, 19:05,  0 users,  load average: 18.73, 20.72, 22.05
labvirt1008
 15:01:43 up 121 days, 18:36,  0 users,  load average: 27.04, 29.85, 32.81
labvirt1009
 15:01:45 up 121 days, 18:12,  0 users,  load average: 27.06, 17.00, 14.77
labvirt1010
 15:01:46 up 121 days, 17:53,  0 users,  load average: 33.51, 31.68, 31.98
labvirt1011
 15:01:48 up 121 days, 17:35,  0 users,  load average: 11.58, 11.21, 11.63
labvirt1012
 15:01:50 up 120 days, 20:51,  0 users,  load average: 23.20, 20.63, 22.07
labvirt1013
 15:01:52 up 120 days, 20:28,  0 users,  load average: 21.20, 20.80, 21.31
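
A quick way to flag outliers in a listing like the one above: parse each host's 1-minute load average and report hosts well above the fleet median. A hedged sketch (the parsing assumes exactly the uptime format shown; the 1.5x factor is arbitrary):

import re
import statistics

def flag_outliers(listing, factor=1.5):
    loads, host = {}, None
    for line in listing.splitlines():
        line = line.strip()
        if line.startswith("labvirt"):
            host = line
        elif "load average:" in line and host:
            # First number after "load average:" is the 1-minute load.
            loads[host] = float(re.search(r"load average: ([\d.]+)", line).group(1))
    median = statistics.median(loads.values())
    return {h: l for h, l in loads.items() if l > factor * median}

Run against the numbers above (median 1-minute load 23.20), this flags labvirt1001 (52.25) and labvirt1004 (42.00), consistent with the observation below about labvirt1001 carrying more than its share.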

labvirt1001 is handling more than its share of load here, and I'm wondering whether the scheduler is fairly weighted across these nodes, which are (at this moment) unevenly loaded. I.e., should we take labvirt1001 out of the pool (maybe even 1002?) and reshuffle a bit?

https://gerrit.wikimedia.org/r/#/c/340986/

Cleaned up and restarted leaked instances after the small fiasco of rolling out https://gerrit.wikimedia.org/r/#/c/340986/, where a nova bug that required 20m for the call to return caused headaches. This eventually sorted itself out and, IIUC, it is a bug fixed in Mitaka, but I did not have time to circle back on the nova-fullstack service at the time.

hashar lowered the priority of this task from High to Medium. Mar 17 2017, 12:54 PM

Since labvirt1001 and labvirt1002 have been removed from the scheduler pool, the time to get SSH access has dropped significantly and seems to be rather stable. The graph below is the 1-day moving average of the max time to reach an instance:

Nodepool time to ready average.png (574×832 px, 62 KB)

What is potentially left to figure out is why labvirt1001/labvirt1002 were being scheduled more than the other hosts.

This is still super important, but the immediate issue of this task seems resolved; follow-up continues in T159721: labvirt1001 and 1002 cannot launch new VMs.