Openstack instability - main task
Closed, Resolved (Public)

Description

This task tracks the investigation into the current instability of the eqiad OpenStack setup, and is the place to
sync on findings and on the actions taken to debug it.

Current issues:

  • Novafullstack VMs unable to set up:
    • VMs stuck trying to get metadata from the metadata servers (fixed)
    • apt-get upgrade fails to downgrade (fixed in the vendordata apt conf section; should not happen again, but untested)
  • Unable to remove VMs

Errors/leads/suspects:

  • RabbitMQ timeouts
  • HAProxy instability (already solved)

Event Timeline

dcaro triaged this task as High priority. (Jun 21 2022, 9:48 PM)
dcaro created this task.
dcaro added a subscriber: Andrew.

When I first looked at this, the nova-api logs were incredibly busy with requests (mostly getting flavor details). I added a rate-limiter to haproxy to address this, although that might now be causing additional issues.
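
For readers unfamiliar with the mechanism, here is a minimal Python sketch of per-client rate limiting over a sliding time window. It is not the haproxy configuration that was applied here (that is not shown in this task), and the class name, parameters and client identifiers are all hypothetical. It only illustrates why a limiter can cause additional issues: once a client hits the limit, its legitimate requests are rejected as well.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window` seconds.

    Hypothetical illustration only; not how haproxy implements rate limiting.
    """

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        # client id -> timestamps of the requests accepted so far
        self.hits = defaultdict(deque)

    def allow(self, client):
        now = self.clock()
        accepted = self.hits[client]
        # Forget accepted requests that have fallen out of the window.
        while accepted and now - accepted[0] >= self.window:
            accepted.popleft()
        if len(accepted) >= self.limit:
            # A proxy would typically answer with an error such as HTTP 429 here.
            return False
        accepted.append(now)
        return True
```

With `limit=2` and `window=10`, a third request from the same client inside ten seconds is refused, while other clients are unaffected. A service that legitimately issues many API calls in a burst would see failures under such a limit.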

Right now I am going to

  • remove the rate-limiter
  • restart each RabbitMQ node
  • restart all nova services

And then we'll see how things are doing.

Actually, what I'm going to do is:

  • remove the rate-limiter
  • reboot each cloudcontrol in turn. Might as well!

Taavi thinks we may be flooded by monitoring checks. I just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/807541/ which should reduce API calls considerably.

It seems this has been stable for the last couple of weeks, so I will close this task.