
nova-fullstack crashed from a keystone timeout
Closed, Resolved · Public

Description

A seemingly transient keystone issue caused the nova-fullstack service to die entirely. This failure should probably be caught as an exception and retried, becoming fatal only if it happens repeatedly (see the sketch after the log excerpt below).

Jun 26 15:08:37 cloudcontrol1003 puppet-agent[36948]: (/Stage[main]/Openstack::Nova::Fullstack::Service/Systemd::Service[nova-fullstack]/Service[nova-fullstack]/ensure) ensure changed 'stopped' to 'running'
Jun 26 15:08:37 cloudcontrol1003 puppet-agent[36948]: (/Stage[main]/Openstack::Nova::Fullstack::Service/Systemd::Service[nova-fullstack]/Service[nova-fullstack]) Unscheduling refresh on Service[nova-fullstack]
Jun 26 15:08:37 cloudcontrol1003 neutron-server[30160]: 2019-06-26 15:08:37.076 30574 INFO neutron.wsgi [req-a501e719-50c5-475e-b7b4-e79fe5e1fb48 novaadmin admin - - -] 10.64.20.28 - - [26/Jun/2019 15:08:37] "GET /v2.0/subnets.json?id=a69bdfad-d7d2-4cfa-8231-3d6d3e0074c9 HTTP/1.1" 200 774 0.264693
Jun 26 15:08:37 cloudcontrol1003 neutron-server[30160]: 2019-06-26 15:08:37.107 30587 INFO neutron.db.agents_db [req-53653daf-9b3e-4509-a2e6-4eb887e1202b - - - - -] Heartbeat received from L3 agent agent on host cloudnet1004, uuid 059d28ed-f858-486b-9fdb-215baa32d3c0 after 0:00:06.107591
Jun 26 15:08:37 cloudcontrol1003 neutron-server[30160]: 2019-06-26 15:08:37.122 30574 INFO neutron.wsgi [req-0d89f06a-bb4a-4ea3-8510-6b9a2ab04e25 novaadmin admin - - -] 10.64.20.28 - - [26/Jun/2019 15:08:37] "GET /v2.0/ports.json?network_id=7425e328-560c-4f00-8e99-706f3fb90bb4&device_owner=network%3Adhcp HTTP/1.1" 200 2682 0.035319
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]: 2019-06-26 15:08:37,691 INFO Starting new HTTP connection (1): cloudcontrol1003.wikimedia.org
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]: 2019-06-26 15:08:37,901 INFO Starting new HTTP connection (1): cloudcontrol1003.wikimedia.org
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]: Traceback (most recent call last):
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:   File "/usr/local/sbin/nova-fullstack", line 606, in <module>
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:     main()
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:   File "/usr/local/sbin/nova-fullstack", line 495, in main
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:     exist = nova_conn.servers.list()
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:   File "/usr/lib/python2.7/dist-packages/novaclient/v2/servers.py", line 749, in list
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:     "servers")
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:   File "/usr/lib/python2.7/dist-packages/novaclient/base.py", line 242, in _list
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:     resp, body = self.api.client.get(url)
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:   File "/usr/lib/python2.7/dist-packages/keystoneauth1/adapter.py", line 173, in get
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:     return self.request(url, 'GET', **kwargs)
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:   File "/usr/lib/python2.7/dist-packages/novaclient/client.py", line 89, in request
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:     **kwargs)
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:   File "/usr/lib/python2.7/dist-packages/keystoneauth1/adapter.py", line 331, in request
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:     resp = super(LegacyJsonAdapter, self).request(*args, **kwargs)
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:   File "/usr/lib/python2.7/dist-packages/keystoneauth1/adapter.py", line 98, in request
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:     return self.session.request(url, method, **kwargs)
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:   File "/usr/lib/python2.7/dist-packages/positional/__init__.py", line 94, in inner
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:     return func(*args, **kwargs)
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:   File "/usr/lib/python2.7/dist-packages/keystoneclient/session.py", line 405, in request
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:     resp = send(**kwargs)
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:   File "/usr/lib/python2.7/dist-packages/keystoneclient/session.py", line 449, in _send_request
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]:     raise exceptions.ConnectionRefused(msg)
Jun 26 15:08:37 cloudcontrol1003 nova-fullstack[38999]: keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://cloudcontrol1003.wikimedia.org:8774/v2.1/servers/detail
Jun 26 15:08:37 cloudcontrol1003 systemd[1]: nova-fullstack.service: Main process exited, code=exited, status=1/FAILURE
Jun 26 15:08:37 cloudcontrol1003 systemd[1]: nova-fullstack.service: Unit entered failed state.
Jun 26 15:08:37 cloudcontrol1003 systemd[1]: nova-fullstack.service: Failed with result 'exit-code'.
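
The traceback above ends in keystoneauth1's ConnectFailure, so the fix suggested in the description amounts to wrapping the API call in a retry loop. A minimal sketch, assuming a hypothetical retry budget and sleep interval (neither value is taken from nova-fullstack):

```python
# Minimal sketch: tolerate transient keystone/nova-api failures instead of
# letting the first ConnectFailure kill the whole service. The retry budget
# and sleep interval below are illustrative, not nova-fullstack's values.
import logging
import time

from keystoneauth1.exceptions.connection import ConnectFailure

MAX_CONSECUTIVE_FAILURES = 5   # assumed retry budget
RETRY_SLEEP_SECONDS = 30       # assumed backoff interval


def list_servers_with_retry(nova_conn):
    """Return nova_conn.servers.list(), retrying transient API outages.

    Raises ConnectFailure only after MAX_CONSECUTIVE_FAILURES consecutive
    attempts, so a brief keystone/nova-api restart does not crash the caller.
    """
    failures = 0
    while True:
        try:
            return nova_conn.servers.list()
        except ConnectFailure:
            failures += 1
            if failures >= MAX_CONSECUTIVE_FAILURES:
                # Persistent outage: propagate so systemd (and Icinga)
                # still see the failure.
                raise
            logging.warning(
                "nova API unreachable (%d/%d); retrying in %ds",
                failures, MAX_CONSECUTIVE_FAILURES, RETRY_SLEEP_SECONDS)
            time.sleep(RETRY_SLEEP_SECONDS)
```

Keeping a consecutive-failure counter rather than swallowing every error means a genuinely down API still fails the unit, so the systemd-state Icinga check retains its signal.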

Event Timeline

This is related to T226632: nova-api was stopped and started for the wmf_scheduler_hosts_pool config update.

Looks like the transient error auto-resolved after the deployment completed. The appropriate action may be to suppress the watchdog during a service restart or upgrade (a sketch follows). Thoughts? Should this issue be closed in favor of a task addressing the watchdog? Also worth looking at: somewhere between 60 and 120 seconds seems like an eternity for a soft restart of a critical service.
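
One hedged way to implement that suppression is a maintenance flag file that deploy tooling touches before restarting nova-api. The path and helper below are hypothetical, not anything nova-fullstack currently ships:

```python
# Hypothetical sketch: suppress the fullstack watchdog during a planned
# nova-api restart by checking a maintenance flag file before treating an
# API failure as fatal. The flag path is an assumption, not a real path
# used by nova-fullstack.
import os

MAINTENANCE_FLAG = "/var/run/nova-fullstack.maintenance"  # hypothetical


def in_maintenance():
    """True if an operator has touched the maintenance flag file."""
    return os.path.exists(MAINTENANCE_FLAG)
```

An operator (or the deploy script) would touch the flag before the restart and remove it afterwards; the main loop would then log and skip a cycle instead of exiting non-zero.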

17:10 <#wikimedia-operations> icinga-wm: PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
17:11 <#wikimedia-operations> icinga-wm: PROBLEM - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
17:12 <#wikimedia-operations> icinga-wm: RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
17:12 <#wikimedia-operations> icinga-wm: RECOVERY - Check systemd state on cloudcontrol1003 is OK: OK - running: The system is fully operational

Change 519478 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova-fullstack: add a watch on the number of leaked VMs.

https://gerrit.wikimedia.org/r/519478

Change 519478 merged by Andrew Bogott:
[operations/puppet@production] nova-fullstack: add a watch on the number of leaked VMs.

https://gerrit.wikimedia.org/r/519478

Change 519513 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova-fullstack monitoring: fix a misnamed file

https://gerrit.wikimedia.org/r/519513

Change 519513 merged by Andrew Bogott:
[operations/puppet@production] nova-fullstack monitoring: fix a misnamed file

https://gerrit.wikimedia.org/r/519513

Change 519531 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] nova-fullstack: fix name of nrpe check

https://gerrit.wikimedia.org/r/519531

Change 519531 merged by Andrew Bogott:
[operations/puppet@production] nova-fullstack: fix name of nrpe check

https://gerrit.wikimedia.org/r/519531

Change 519653 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] fullstack monitoring: adjust behavior of the leak counter

https://gerrit.wikimedia.org/r/519653

Change 519653 merged by Andrew Bogott:
[operations/puppet@production] fullstack monitoring: adjust behavior of the leak counter

https://gerrit.wikimedia.org/r/519653
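
For context on what the leak counter in the patches above is measuring, here is an illustrative sketch of counting leaked test VMs. The name prefix, age cutoff, and alert threshold are assumptions, not the values in operations/puppet:

```python
# Illustrative sketch in the spirit of the leak-counter patches above:
# count fullstack test instances that survived past their expected
# lifetime. All constants here are assumptions for illustration.
import calendar
import time

LEAK_NAME_PREFIX = "fullstackd-"   # assumed test-instance name prefix
MAX_AGE_SECONDS = 3600             # assumed: older than this counts as leaked
LEAK_THRESHOLD = 3                 # assumed alert threshold


def count_leaked(nova_conn, now=None):
    """Count test VMs older than MAX_AGE_SECONDS."""
    now = now or time.time()
    leaked = 0
    for server in nova_conn.servers.list():
        if not server.name.startswith(LEAK_NAME_PREFIX):
            continue
        # Nova reports creation time as ISO 8601 UTC, e.g.
        # "2019-06-26T15:08:37Z".
        created = calendar.timegm(
            time.strptime(server.created, "%Y-%m-%dT%H:%M:%SZ"))
        if now - created > MAX_AGE_SECONDS:
            leaked += 1
    return leaked
```

A check along these lines can be exposed via NRPE and alert when the count exceeds the threshold, which matches the intent of the nrpe-check patches above.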

Bstorm triaged this task as Medium priority. Feb 11 2020, 4:30 PM
Andrew claimed this task.

I don't think this is happening anymore; closing in favor of T280514.