Monitor nova-scheduler log for lost contact with compute nodes
Open, Medium, Public

Description

I just now restarted nova-compute on virt1002... it seems to have just quietly stopped working :(

Event Timeline

Andrew created this task. Feb 25 2015, 8:51 PM
Andrew claimed this task.
Andrew raised the priority of this task from to Medium.
Andrew updated the task description.
Andrew added a subscriber: Andrew.
Restricted Application added a subscriber: Aklapper. Feb 25 2015, 8:51 PM

Change 198249 had a related patch set uploaded (by Andrew Bogott):
Icinga monitoring for nova-compute process.

https://gerrit.wikimedia.org/r/198249

Change 198249 merged by Andrew Bogott:
Icinga monitoring for nova-compute process.

https://gerrit.wikimedia.org/r/198249

In addition to process monitoring, something should probably be running 'nova service list' on virt1000 and checking the status there. In theory that status is updated via queue messages, so the check will verify that the services are actually responding rather than just locked up and occupying process space.
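A rough sketch of what such a check could look like, as an Icinga/Nagios-style plugin. Assumptions, none of which come from this task: the command is spelled `nova service-list` (the legacy novaclient form), credentials come from the usual `OS_*` environment variables, and the output is the standard ASCII table with `Binary`, `Host`, `Status` and `State` columns. The function names are made up for illustration.

```python
import subprocess

# Conventional Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def parse_table(text):
    """Parse an ASCII table made of '| a | b |' rows into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip '+----+' border lines and blanks
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # first '|' row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def down_services(rows):
    """Return services that are enabled but not reporting state 'up'."""
    return [r for r in rows
            if r.get("Status") == "enabled" and r.get("State") != "up"]


def main():
    try:
        # Assumed command spelling; adjust to whatever client is installed.
        result = subprocess.run(["nova", "service-list"], capture_output=True,
                                text=True, timeout=30, check=True)
    except (OSError, subprocess.SubprocessError) as exc:
        print("UNKNOWN: could not run nova service-list: %s" % exc)
        return UNKNOWN
    rows = parse_table(result.stdout)
    if not rows:
        print("UNKNOWN: no services found in output")
        return UNKNOWN
    down = down_services(rows)
    if down:
        names = ", ".join("%s@%s" % (r.get("Binary"), r.get("Host"))
                          for r in down)
        print("CRITICAL: services down: %s" % names)
        return CRITICAL
    print("OK: %d services checked" % len(rows))
    return OK
```

To use it as a plugin, end the script with `sys.exit(main())`. Disabled services are skipped on purpose so that a host taken out for maintenance does not page anyone.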

Today, the nova-api process was running but api calls were timing out. So that's another thing to watch for.
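For the hung-API case, a check needs a hard timeout so that "process is up but never answers" shows up as a failure. A minimal sketch, assuming a plain HTTP GET against the API root is enough to tell; the URL in the usage note is hypothetical, as are the function name and the thresholds.

```python
import time
import urllib.error
import urllib.request

# Conventional Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (exit_code, message) for one GET with a hard timeout.

    `opener` is injectable only so the function can be exercised
    without a live API.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read(1)  # force at least one byte of the body to arrive
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The API answered, even if with an error code, so it is not hung.
        status = exc.code
    except OSError as exc:
        # Covers URLError, socket timeouts and refused connections.
        return CRITICAL, "CRITICAL: no answer from %s within %.0fs (%s)" % (
            url, timeout, exc)
    elapsed = time.monotonic() - start
    if status >= 500:
        return CRITICAL, "CRITICAL: %s returned HTTP %d" % (url, status)
    return OK, "OK: %s returned HTTP %d in %.2fs" % (url, status, elapsed)
```

Something like `probe("http://localhost:8774/", timeout=10)` would be the call, with the real endpoint substituted. A 401 from an unauthenticated request still counts as OK here, since it shows the API is answering.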

Once we have a read-only nova account, the monitoring can do proper queries.

Andrew moved this task from To do to Code Review/Blocked on the labs-sprint-117 board.
Andrew removed Andrew as the assignee of this task. Nov 12 2015, 11:44 PM
Luke081515 moved this task from Triage to Backlog on the Cloud-Services board. Mar 25 2016, 4:11 PM

Basic API HTTP code check added in T42022

Paladox added a subscriber: Paladox. Apr 8 2017, 6:37 PM
Dzahn added a subscriber: Dzahn. Sep 18 2017, 4:00 PM

Hi @Andrew, is this a duplicate of T42022? It seems you already added some monitoring in the past, but this is still open. Are your comments above still current, and what is blocking this (read-only nova account, check for API calls timing out, run nova service list)?

This is modestly different, but needs to be retitled. T42022 is about public HTTP APIs; this is about internal services, which can break even while the public APIs are functioning.

Andrew renamed this task from Monitor nova services to Monitor nova-scheduler log for lost contact with compute nodes. Sep 20 2017, 9:23 PM