Monitor nova-scheduler log for lost contact with compute nodes
Open, Medium, Public

Description

I just now restarted nova-compute on virt1002... it seems to have just quietly stopped working :(

Event Timeline

Andrew claimed this task.
Andrew raised the priority of this task from to Medium.
Andrew updated the task description.
Andrew subscribed.

Change 198249 had a related patch set uploaded (by Andrew Bogott):
Icinga monitoring for nova-compute process.

https://gerrit.wikimedia.org/r/198249

Change 198249 merged by Andrew Bogott:
Icinga monitoring for nova-compute process.

https://gerrit.wikimedia.org/r/198249

In addition to process monitoring, something should probably be running 'nova service list' on virt1000 and checking the status there. In theory that status is updated via queue messages, so the check would verify that the services are actually responding rather than just locked up and occupying process space.
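A minimal sketch of what such a check could look like, written as an Icinga-style plugin in Python. It assumes the legacy novaclient command `nova service-list`, that credentials are already present in the environment, and that the output is the usual ASCII table with `Binary`, `Host`, `Status` and `State` columns. The function names are hypothetical and none of this is taken from the merged change.

```python
import subprocess

# Nagios/Icinga plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_service_table(text):
    """Turn an ASCII table ("| a | b |" rows) into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip "+----+" border lines and blank lines
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def evaluate(rows):
    """Return (exit_code, message); enabled services whose State is not 'up' are critical."""
    if not rows:
        return UNKNOWN, "UNKNOWN: no services found in output"
    down = [r for r in rows
            if r.get("Status") == "enabled" and r.get("State") != "up"]
    if down:
        names = ", ".join("%s@%s" % (r.get("Binary"), r.get("Host")) for r in down)
        return CRITICAL, "CRITICAL: services down: " + names
    return OK, "OK: %d services checked" % len(rows)


def run_check(command=("nova", "service-list"), timeout=30):
    """Run the CLI (assumed command and environment) and evaluate its output."""
    try:
        result = subprocess.run(list(command), capture_output=True,
                                text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError) as exc:
        return UNKNOWN, "UNKNOWN: could not run %s: %s" % (command[0], exc)
    if result.returncode != 0:
        return UNKNOWN, "UNKNOWN: %s exited with %d" % (command[0], result.returncode)
    return evaluate(parse_service_table(result.stdout))
```

A wrapper script would call `run_check()`, print the message, and exit with the returned code so that Icinga can pick up the state.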

Today, the nova-api process was running but api calls were timing out. So that's another thing to watch for.
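A process check cannot catch this case, so a request with a hard timeout is one option. The following is only a sketch using the Python standard library. The URL shown in the usage note is a placeholder, `check_api` is a hypothetical name, and an unauthenticated request is assumed to be enough to tell "answering" apart from "hung".

```python
import socket
import time
import urllib.error
import urllib.request

# Nagios/Icinga plugin exit codes
OK, WARNING, CRITICAL = 0, 1, 2


def check_api(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (exit_code, message) depending on whether the API answers in time."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read()
    except urllib.error.HTTPError as exc:
        # The API did answer, just with an error status (for example 401
        # when no credentials are sent), so it is not hung.
        elapsed = time.monotonic() - start
        return WARNING, "WARNING: HTTP %d after %.2fs" % (exc.code, elapsed)
    except (socket.timeout, TimeoutError):
        # Connection was accepted but no response arrived in time.
        return CRITICAL, "CRITICAL: no response within %.1fs" % timeout
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, connect timeout, and similar.
        return CRITICAL, "CRITICAL: request failed: %s" % exc
    elapsed = time.monotonic() - start
    return OK, "OK: responded in %.2fs" % elapsed
```

For example, `check_api("http://nova-api.example:8774/", timeout=10)` would report CRITICAL when the process accepts the connection but never replies, which is the failure described above.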

Once we have a read-only nova account, the monitoring can do proper queries.

Hi @Andrew, is this a duplicate of T42022? It seems you already added some monitoring in the past, but this is still open. Are your comments above still current, and what is blocking this (read-only nova account, check for API calls timing out, run nova service list)?

This is modestly different, but needs to be retitled. T42022 is about public HTTP APIs; this is about internal services, which can break despite the public APIs functioning.

Andrew renamed this task from Monitor nova services to Monitor nova-scheduler log for lost contact with compute nodes. Sep 20 2017, 9:23 PM