
nova-fullstack: alert due to leaked instances
Closed, Resolved · Public

Description

Today we got a page because nova-fullstack leaked too many VM instances in the admin-monitoring project.

aborrero@cloudcontrol1003:~ $ sudo wmcs-openstack server list --project admin-monitoring
+--------------------------------------+-----------------------+--------+-------------------------------------------------------------------+
| ID                                   | Name                  | Status | Networks                                                          |
+--------------------------------------+-----------------------+--------+-------------------------------------------------------------------+
| 2799085e-8409-4aec-911d-2c26739f6c91 | fullstackd-1562052291 | ACTIVE | lan-flat-cloudinstances2b=172.16.7.182                            |
| 9551a639-6d8c-4619-8c16-1f652457478f | fullstackd-1562049720 | ACTIVE | lan-flat-cloudinstances2b=172.16.7.176, 172.16.7.177, 172.16.7.18 |
| 3653dacc-568a-4b9c-a126-3db668ea4972 | fullstackd-1562014758 | ACTIVE | lan-flat-cloudinstances2b=172.16.7.109, 172.16.7.11               |
| 8f4a7f1e-7329-4a97-ab79-79f3955a2880 | fullstackd-1561887686 | ACTIVE | lan-flat-cloudinstances2b=172.16.6.109, 172.16.6.110              |
| 21a45c21-476f-4278-b3a2-ca0d134e0fed | fullstackd-1561861047 | ACTIVE | lan-flat-cloudinstances2b=172.16.5.44, 172.16.5.45                |
+--------------------------------------+-----------------------+--------+-------------------------------------------------------------------+

All VMs were created between 2019-06-30 and 2019-07-02. Apparently they are legitimate VM leaks.

I couldn't find error logs for every VM in that list, only for this one:

Jul 02 06:42:02 cloudcontrol1003 nova-fullstack[40607]: 2019-07-02 06:42:02,023 INFO Creating fullstackd-1562049720
Jul 02 06:42:54 cloudcontrol1003 nova-fullstack[40607]: 2019-07-02 06:42:54,058 INFO servers.cloudcontrol1003.nova.verify.creation => 52.04 1562049774
Jul 02 06:42:54 cloudcontrol1003 nova-fullstack[40607]: 2019-07-02 06:42:54,058 INFO Resolving fullstackd-1562049720.admin-monitoring.eqiad.wmnet from ['208.80.154.143', '208.80.154.24']
Jul 02 06:42:54 cloudcontrol1003 nova-fullstack[40607]: 2019-07-02 06:42:54,093 INFO servers.cloudcontrol1003.nova.verify.dns => 0.03 1562049774
Jul 02 06:42:54 cloudcontrol1003 nova-fullstack[40607]: 2019-07-02 06:42:54,093 INFO SSH to 172.16.7.176
Jul 02 06:57:57 cloudcontrol1003 nova-fullstack[40607]: 2019-07-02 06:57:57,211 ERROR fullstackd-1562049720 failed, leaking
Jul 02 06:57:57 cloudcontrol1003 nova-fullstack[40607]: Traceback (most recent call last):
Jul 02 06:57:57 cloudcontrol1003 nova-fullstack[40607]:   File "/usr/local/sbin/nova-fullstack", line 551, in main
Jul 02 06:57:57 cloudcontrol1003 nova-fullstack[40607]:     args.ssh_timeout)
Jul 02 06:57:57 cloudcontrol1003 nova-fullstack[40607]:   File "/usr/local/sbin/nova-fullstack", line 161, in verify_ssh
Jul 02 06:57:57 cloudcontrol1003 nova-fullstack[40607]:     raise Exception("SSH for {} timed out".format(address))
Jul 02 06:57:57 cloudcontrol1003 nova-fullstack[40607]: Exception: SSH for 172.16.7.176 timed out
Jul 02 06:57:57 cloudcontrol1003 nova-fullstack[40607]: 2019-07-02 06:57:57,212 INFO servers.cloudcontrol1003.nova.verify.success => 0 1562050677
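
For context, the exception above is nova-fullstack's SSH verification step: the new VM never answered on port 22 within the timeout, so the instance was left in place ("leaking") instead of being deleted. A rough sketch of such a verify-with-timeout loop follows; only the function name and the final exception text match the traceback, everything else (probe method, polling interval) is an assumption, not the actual nova-fullstack code.

import socket
import time


def verify_ssh(address, timeout):
    """Poll the instance until SSH answers, or raise after `timeout` seconds."""
    # Hypothetical sketch; the real nova-fullstack implementation differs.
    start = time.time()
    while time.time() - start < timeout:
        try:
            with socket.create_connection((address, 22), timeout=5):
                return time.time() - start  # seconds until SSH came up
        except OSError:
            time.sleep(10)  # assumed polling interval
    raise Exception("SSH for {} timed out".format(address))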

This suggests this one is the only VM actually failing. Perhaps nova-fullstack is just busy and 6 is too low a threshold?
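
For reference, the paging check is essentially a count of instances left behind in the admin-monitoring project compared against a threshold (the alert below fired CRITICAL at 6). A minimal sketch of that kind of check, assuming wmcs-openstack passes the standard openstack CLI JSON output option through; the real Icinga plugin may differ.

import json
import subprocess

LEAK_THRESHOLD = 6  # assumed value in effect when this page fired; raised later in this task

out = subprocess.check_output(
    ["wmcs-openstack", "server", "list", "--project", "admin-monitoring", "-f", "json"]
)
leaked = [s for s in json.loads(out) if s["Name"].startswith("fullstackd-")]

if len(leaked) >= LEAK_THRESHOLD:
    print("CRITICAL: {} instances in the admin-monitoring project".format(len(leaked)))
else:
    print("OK: {} instances in the admin-monitoring project".format(len(leaked)))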

It recovered on its own:

aborrero@cloudcontrol1003:~ $ sudo wmcs-openstack server list --project admin-monitoring
+--------------------------------------+-----------------------+--------+-------------------------------------------------------------------+
| ID                                   | Name                  | Status | Networks                                                          |
+--------------------------------------+-----------------------+--------+-------------------------------------------------------------------+
| 95884e97-21d9-4fce-9344-6d87a7bf4ca8 | fullstackd-1562052939 | ACTIVE | lan-flat-cloudinstances2b=172.16.7.183                            |
| 9551a639-6d8c-4619-8c16-1f652457478f | fullstackd-1562049720 | ACTIVE | lan-flat-cloudinstances2b=172.16.7.176, 172.16.7.177, 172.16.7.18 |
+--------------------------------------+-----------------------+--------+-------------------------------------------------------------------+

(again, fullstackd-1562049720 seems like a legitimate issue, but that alone shouldn't be enough to page, should it?)

Timeline of the icinga alerts:

  • 07:03 UTC: PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 6 instances in the admin-monitoring project
  • 07:30 UTC: RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 2 instances in the admin-monitoring project

Event Timeline

aborrero created this task. Jul 2 2019, 7:40 AM
Restricted Application added a subscriber: Aklapper. Jul 2 2019, 7:40 AM

I got the page and was also working on this event, cleaning up VMs that I found to be active and online. T227057 https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin-monitoring#nova-fullstack

Ok, so I wasn't aware you were working on this. I saw the failing VMs vanishing and thought it was nova-fullstack deleting them. Now I see you deleted them by hand. So I'm not sure what was going on in the first place.

I didn't find anything in the OpenStack or daemon logs on cloudcontrol1003, or on a few cloudvirts with fullstack instances that I spot-checked. I'll check more in the morning.

Full list with cloudvirt location:

+--------------------------------------+-----------------------+--------+------------+-------------+-------------------------------------------------------------------+-------------------+---------------+------------+
| ID                                   | Name                  | Status | Task State | Power State | Networks                                                          | Availability Zone | Host          | Properties |
+--------------------------------------+-----------------------+--------+------------+-------------+-------------------------------------------------------------------+-------------------+---------------+------------+
| e83bf2eb-0256-4ac7-8b36-e8d3c6e3c519 | fullstackd-1562051634 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.181                            | nova              | cloudvirt1004 |            |
| 9551a639-6d8c-4619-8c16-1f652457478f | fullstackd-1562049720 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.176, 172.16.7.177, 172.16.7.18 | nova              | cloudvirt1007 |            |
| 3653dacc-568a-4b9c-a126-3db668ea4972 | fullstackd-1562014758 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.7.109, 172.16.7.11               | nova              | cloudvirt1002 |            |
| 8f4a7f1e-7329-4a97-ab79-79f3955a2880 | fullstackd-1561887686 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.6.109, 172.16.6.110              | nova              | cloudvirt1005 |            |
| 21a45c21-476f-4278-b3a2-ca0d134e0fed | fullstackd-1561861047 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.5.44, 172.16.5.45                | nova              | cloudvirt1006 |            |
| 32e92dfc-290f-4c05-8df8-b43ec48033bb | fullstackd-1561740112 | ACTIVE | None       | Running     | lan-flat-cloudinstances2b=172.16.4.250, 172.16.4.251              | nova              | cloudvirt1002 |            |
+--------------------------------------+-----------------------+--------+------------+-------------+-------------------------------------------------------------------+-------------------+---------------+------------+

I left fullstackd-1562049720 running; it has a strange list of IP addresses and I'd like to figure out whether that's intended or not.

With fresh eyes I found that the VMs are intermittently failing with the following sequence.

The request comes in to create the VM:

Jul  2 06:42:03 cloudvirt1004 nova-compute[57236]: 2019-07-02 06:42:03.900 57236 INFO nova.virt.libvirt.driver [req-059bec0d-f587-4d57-beee-be24f6359bcd osstackcanary admin-monitoring - - -] [instance: 9551a639-6d8c-4619-8c16-1f652457478f] Creating image

Puppet updates libvirtd.conf and restarts libvirtd before the create completes:

Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content) --- /etc/libvirt/libvirtd.conf#0112019-07-02 06:12:06.367890757 +0000
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content) +++ /tmp/puppet-file20190702-44580-vqqq4w#0112019-07-02 06:42:08.415045307 +0000
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content) @@ -26,14 +26,14 @@
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirtan1001.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirt1002.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirt1028.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content) -                       "CN=cloudvirt1012.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirt1027.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirtan1004.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content) +                       "CN=cloudvirt1012.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirt1025.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirt1009.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content) -                       "CN=cloudvirt1001.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirt1023.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirt1003.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content) +                       "CN=cloudvirt1001.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirt1014.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirt1016.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content)                         "CN=cloudvirt1013.eqiad.wmnet",
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: Computing checksum on file /etc/libvirt/libvirtd.conf
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]) Filebucketed /etc/libvirt/libvirtd.conf to puppet with sum 481929f97e52b1a2f25340bf6aa6d2a6
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]/content) content changed '{md5}481929f97e52b1a2f25340bf6aa6d2a6' to '{md5}a7c1005d4c563b97ec12e1568ed43353'
Jul  2 06:42:08 cloudvirt1004 puppet-agent[44580]: (/Stage[main]/Openstack::Nova::Compute::Service/File[/etc/libvirt/libvirtd.conf]) Scheduling refresh of Service[libvirtd]

The create fails:

2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [req-059bec0d-f587-4d57-beee-be24f6359bcd osstackcanary admin-monitoring - - -] [instance: 9551a639-6d8c-4619-8c16-1f652457478f] Instance failed to spawn
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f] Traceback (most recent call last):
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2218, in _build_resources
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]     yield resources
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2064, in _build_and_run_instance
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]     block_device_info=block_device_info)
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2780, in spawn
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]     block_device_info=block_device_info)
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 4950, in _create_domain_and_network
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]     guest.resume()
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/guest.py", line 154, in resume
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]     self._domain.resume()
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]     result = proxy_call(self._autowrap, f, *args, **kwargs)
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]     rv = execute(f, *args, **kwargs)
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]     six.reraise(c, e, tb)
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]     rv = meth(*args, **kwargs)
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1905, in resume
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f]     if ret == -1: raise libvirtError ('virDomainResume() failed', dom=self)
2019-07-02 06:42:16.770 57236 ERROR nova.compute.manager [instance: 9551a639-6d8c-4619-8c16-1f652457478f] libvirtError: internal error: client socket is closed

Libvirt destroys the VM:

2019-07-02T06:42:16.788278Z qemu-system-x86_64: terminating on signal 15 from pid 45129 (/usr/sbin/libvirtd)
2019-07-02 06:42:16.989+0000: shutting down, reason=destroyed

Note that after this failure nova rescheduled the VM onto cloudvirt1007, where the same thing happened. It tried a third time on cloudvirt1007 and succeeded. Each time nova-compute attempts to start the VM it checks out a new IP address, which explains the strange list of addresses on fullstackd-1562049720.
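
One way to confirm that the extra addresses are leftovers from the failed attempts rather than something intentional is to list the Neutron ports attached to the instance; if the theory holds, there should be one port (and fixed IP) per spawn attempt. A sketch, assuming python-openstackclient's standard port list --server filter and JSON output, with wmcs-openstack (the wrapper used above) passing them through.

import json
import subprocess

# fullstackd-1562049720, the instance with three addresses above
INSTANCE_ID = "9551a639-6d8c-4619-8c16-1f652457478f"

ports = json.loads(subprocess.check_output(
    ["wmcs-openstack", "port", "list", "--server", INSTANCE_ID, "-f", "json"]
))
for port in ports:
    # Expect one port and fixed IP per spawn attempt if they were leaked.
    print(port["ID"], port["Fixed IP Addresses"])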

It looks like the list used for all_cloudvirts is not maintaining any order, so libvirtd.conf gets rewritten (and libvirtd restarted) on every Puppet run; see the timestamps below and the sketch after them:

sudo awk '/Filebucketed .*libvirtd.conf/{print$1" "$2" "$3}' /var/log/syslog
Jul 2 06:42:24
Jul 2 07:12:23
Jul 2 07:41:39
Jul 2 08:12:05
Jul 2 08:42:35
Jul 2 09:11:52
Jul 2 09:41:53
Jul 2 10:11:52
Jul 2 10:42:13
Jul 2 11:12:35
Jul 2 11:41:50
Jul 2 12:12:37
Jul 2 12:42:08
Jul 2 13:12:09
Jul 2 13:41:40
Jul 2 14:12:24
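
The fix that followed (change 520261 below) sorts the DN list before it is rendered, so libvirtd.conf stops changing from run to run and Puppet stops refreshing Service[libvirtd] mid-spawn. A minimal illustration of why sorting stabilizes the rendered content (and its md5, which is what the "content changed" lines above key on); the hostnames and the render format are made up, and random.shuffle stands in for whatever unordered query feeds the template.

import hashlib
import random


def render(dns):
    """Render a toy tls_allowed_dn_list block from a list of hostnames."""
    lines = ["tls_allowed_dn_list = ["]
    lines += ['    "CN={}",'.format(dn) for dn in dns]
    lines.append("]")
    return "\n".join(lines) + "\n"


cloudvirts = ["cloudvirt1001.eqiad.wmnet", "cloudvirt1002.eqiad.wmnet",
              "cloudvirt1012.eqiad.wmnet", "cloudvirt1028.eqiad.wmnet"]

for run in range(3):
    random.shuffle(cloudvirts)  # stands in for an unordered query result
    unsorted_md5 = hashlib.md5(render(cloudvirts).encode()).hexdigest()
    sorted_md5 = hashlib.md5(render(sorted(cloudvirts)).encode()).hexdigest()
    # unsorted_md5 churns between runs; sorted_md5 stays constant
    print("run {}: unsorted {}  sorted {}".format(run, unsorted_md5, sorted_md5))
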
Paladox added a subscriber: Paladox. Jul 2 2019, 2:26 PM

Change 520261 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] livirt: sort tls_allowed_dn_list

https://gerrit.wikimedia.org/r/520261

Change 520262 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova-fullstack monitoring: allow for more leaks before we alert

https://gerrit.wikimedia.org/r/520262

Change 520261 merged by Andrew Bogott:
[operations/puppet@production] livirt: sort tls_allowed_dn_list

https://gerrit.wikimedia.org/r/520261

Change 520262 merged by Andrew Bogott:
[operations/puppet@production] nova-fullstack monitoring: allow for more leaks before we alert

https://gerrit.wikimedia.org/r/520262

JHedden closed this task as Resolved.Jul 9 2019, 1:27 PM