
Labs instances failing with "internal error: No PCI buses available"
Closed, Resolved (Public)

Description

I've seen this twice now -- this is a tracking bug.

It seems to happen mostly on labvirt1001 -- an instance will spontaneously drop into an 'ERROR' state and attempts to revive it produce the No PCI buses error. Migrating to another host resolves the issue.

Event Timeline

Andrew created this task. Jun 15 2016, 12:27 AM
Restricted Application added subscribers: Zppix, Aklapper. Jun 15 2016, 12:27 AM
chasemp triaged this task as Medium priority. Jun 21 2016, 1:33 PM
chasemp added a subscriber: chasemp.

This seems to be intermittent, and I can't reproduce it at the moment.

Andrew updated the task description. Jun 21 2016, 5:30 PM
bd808 added a subscriber: bd808. Jun 28 2016, 5:40 PM

The phlogiston-1 instance has managed to reproduce the issue a second time after being moved from labvirt1001 to labvirt1009.

Change 297189 had a related patch set uploaded (by Yuvipanda):
labs: Take out all hosts other than labvirt1011 out of pool

https://gerrit.wikimedia.org/r/297189

I think all hosts except labvirt1011 are full now, judging mostly from the fact that all reboots seem to be failing. I'm trying to validate this by checking which instances are in ERROR state and which hosts they're on. I'm migrating tools-exec-1209 to labvirt1011 now; let's see how that goes.
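The check described above (finding instances in ERROR state and seeing which hosts they are on) could be sketched roughly as follows. This is only an illustration, not the tooling actually used: the record layout (`name`, `host`, `status` keys), the function name `error_instances_by_host`, and the sample data are all hypothetical, and in practice the records would come from whatever instance listing is available.

```python
from collections import Counter


def error_instances_by_host(instances):
    """Count instances in ERROR state per host.

    `instances` is an iterable of dicts with (hypothetical) keys
    'name', 'host', and 'status'.
    """
    return Counter(
        inst["host"] for inst in instances if inst["status"] == "ERROR"
    )


# Invented sample records, for illustration only.
sample = [
    {"name": "instance-a", "host": "virt-a", "status": "ERROR"},
    {"name": "instance-b", "host": "virt-a", "status": "ACTIVE"},
    {"name": "instance-c", "host": "virt-b", "status": "ERROR"},
    {"name": "instance-d", "host": "virt-c", "status": "ERROR"},
    {"name": "instance-e", "host": "virt-a", "status": "ERROR"},
]

counts = error_instances_by_host(sample)

# Hosts with the most ERROR instances are listed first.
for host, count in counts.most_common():
    print(host, count)
```

If the ERROR instances are spread over several hosts rather than concentrated on one, that would point away from a single bad host and toward something shared, such as resource pressure or package versions.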

If this theory gets validated, I'm going to turn off scheduling on all hosts other than labvirt1011, and maybe turn off instance creation.

This comment was removed by Paladox.

Since the ERROR'd instances were on at least three different hosts and I don't know exactly what's going on, I'm going to take all hosts other than labvirt1011 out of the pool. The assumption is that the resource crunch is at least partially responsible, and that new instances landing on hosts other than labvirt1011 would cause issues for existing instances.

Change 297189 merged by Yuvipanda:
labs: Take out all hosts other than labvirt1011 out of pool

https://gerrit.wikimedia.org/r/297189

Change 297389 had a related patch set uploaded (by Andrew Bogott):
Update our custom libvirt driver for 2015.1.4-0ubuntu2

https://gerrit.wikimedia.org/r/297389

Change 297389 merged by Andrew Bogott:
Update our custom libvirt driver for 2015.1.4-0ubuntu2

https://gerrit.wikimedia.org/r/297389

Andrew closed this task as Resolved. Jul 5 2016, 2:03 PM

I upgraded the libvirt packages on all of the labvirts and updated our hacked driver to match. I suspect that this will resolve the issue, as labvirt1011 was already running the latest packages and wasn't encountering this problem.