Investigate kernel issues on labvirt* hosts
Closed, ResolvedPublic

Description

A couple of outages have apparently been caused by kernel issues. We should investigate and decide whether to move to a different kernel or put some other safeguard in place.

Kernel logs from affected time are in ~yuvipanda/kernlog-20150519-outage on labvirt1006

yuvipanda updated the task description.
yuvipanda raised the priority of this task to Needs Triage.
yuvipanda added subscribers: yuvipanda, Andrew.
Restricted Application added a subscriber: Aklapper. May 19 2015, 10:13 PM
yuvipanda updated the task description. May 19 2015, 10:15 PM
yuvipanda set Security to None.

From my reading, these crashes are all related to the networking interface between the host and the virtual machines (vhost_net on the virtualisation server and virtio_net inside the virtual machine). Some googling did not reveal an obvious isolated fix (and it could also be an issue in how vhost_net interacts with other bits of the network stack).

Disabling virtio is not an option due to the loss of performance and features, so I suggest moving to a newer kernel:

Ubuntu provides backports of the kernels from later releases for LTS (these are covered by the Canonical security support for as long as the underlying Ubuntu release is supported). Currently only a backport of the 3.16 kernel (from Ubuntu 14.10) is available. 3.19 from 15.04 will likely follow in a later Trusty point release.

I suggest we give 3.16 a try by installing linux-image-3.16.0-38-generic. Running a series of suspend/resume cycles should reveal pretty quickly whether the problem is fixed in 3.16. If all works fine, we can move the remaining labvirt* hosts to 3.16.

So I guess this would need us to test by:

  1. Upgrading kernel on one host and rebooting (and appropriate housekeeping for instances)
  2. Bring back hosts
  3. Suspend and resume a *lot* of instances
  4. See what happens.
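Step 3 could be scripted roughly like this. It is only a sketch: it assumes the `nova` command-line client with admin credentials already loaded, and the function name, cycle count, pause, and instance names are all made up for illustration.

```shell
#!/bin/sh
# Sketch only. NOVA defaults to the nova CLI; override it (e.g. NOVA=echo)
# for a dry run. Nothing here is taken from the actual labs tooling.
NOVA="${NOVA:-nova}"

# cycle_instances CYCLES PAUSE_SECONDS INSTANCE...
# Suspends every listed instance, waits, resumes them all, waits again,
# and repeats CYCLES times. Stops at the first failing nova call.
cycle_instances() {
    cycles="$1"; pause="$2"; shift 2
    i=1
    while [ "$i" -le "$cycles" ]; do
        for instance in "$@"; do
            $NOVA suspend "$instance" || return 1
        done
        sleep "$pause"
        for instance in "$@"; do
            $NOVA resume "$instance" || return 1
        done
        sleep "$pause"
        echo "cycle $i of $cycles done"
        i=$((i + 1))
    done
}
```

Something like `cycle_instances 50 30 test-instance-1 test-instance-2` (hypothetical instance names) would then run while someone watches the kernel log on the host for a lockup.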

Need to find an appropriate host for this...

Moritz pointed out that labvirt1005 doesn't have any instances running so we could do this there.

(labvirt1005 is empty because of T97521 - it hasn't been repooled yet)

yep, we can do this on labvirt1005 at any time. We can cold-migrate some test instances there to test the suspend/resume issue.

Well, ok, actually, let's make sure we can make it crash /before/ we upgrade the kernel. For science.

And, btw, there's no short-term plan to repool labvirt1005 -- we've always planned to keep a server empty as a backup, and since labvirt1005 was down during the migration it gets to be the lifeboat.

I feel like an idiot - I upgraded the kernel on labvirt100*6* - thankfully caught myself before doing a reboot. I'll let it be, and just upgrade 1005 as well.

Whoops - 1005 isn't coming back up. It drops to a shell with: `ALERT! /dev/disk/by-uuid/861a4750-9243-4da7-b566-8c3cebfd6114 does not exist. Dropping to a shell!`

Glad we caught that before we moved stuff to it >_>

For the record: I've seen this once before. During an earlier labvirt crash I tried updating to a non 3.13 kernel (.15 I think?) and that box couldn't see its filesystem either. This was in the middle of an outage so I just rolled back and didn't investigate though.

See T100030; you should install the meta package linux-image-generic-lts-utopic, which pulls in the latest linux-image-3.16.0-foo-generic and linux-image-extra-3.16.0-foo-generic.

In the meantime there's now also a trusty backport of the 3.19 kernel from Ubuntu 15.04; this one can be installed with linux-image-generic-lts-vivid.

Installed linux-image-generic-lts-vivid, seems to have brought in the extra package too.

A bunch of

update-initramfs: Generating /boot/initrd.img-3.19.0-20-generic
W: Possible missing firmware /lib/firmware/bnx2x/bnx2x-e2-7.10.51.0.fw for module bnx2x
W: Possible missing firmware /lib/firmware/bnx2x/bnx2x-e1h-7.10.51.0.fw for module bnx2x
W: Possible missing firmware /lib/firmware/bnx2x/bnx2x-e1-7.10.51.0.fw for module bnx2

errors though.

These firmware files are only present in the linux-firmware package starting with utopic, and it seems Ubuntu doesn't provide a backport of linux-firmware for their backported HWE kernels. Do any of the labvirt* systems have a bnx2x NIC? It's probably possible to simply install the linux-firmware package from Ubuntu vivid.
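Whether a given host actually has a device bound to bnx2x can be checked through sysfs. A minimal sketch, assuming the standard /sys/bus/pci/drivers layout; the function name and the optional sysfs-root argument are my own additions for illustration.

```shell
# uses_driver DRIVER [SYSFS_ROOT]
# Succeeds if at least one PCI device is currently bound to DRIVER.
# Bound devices appear as entries named by PCI address (e.g. 0000:01:00.0)
# under /sys/bus/pci/drivers/DRIVER, hence the "*:*" glob.
uses_driver() {
    driver="$1"; sysfs="${2:-/sys}"
    for dev in "$sysfs/bus/pci/drivers/$driver"/*:*; do
        [ -e "$dev" ] && return 0
    done
    return 1
}
```

Running `uses_driver bnx2x && echo "bnx2x in use"` on each labvirt* host would show where the missing firmware matters. A host whose NIC is present but not yet bound to the driver would be missed by this check.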

Back up with Linux labvirt1005 3.19.0-20-generic #20~14.04.1-Ubuntu SMP Sat May 30 00:15:44 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux w00t!

What next for this? We put some hosts on it and run a suspend resume loop?

I think so. That should show fairly reliably whether the problem still exists (the previous crashes were caused by the restarts after the VENOM security vulnerability in qemu).

I just ran a simple test on labvirt1005 (with 3.13), and was able to make it lock up on the first try. So now I'm ready to try a different kernel.

Restricted Application added a subscriber: Matanya. Jul 15 2015, 4:31 PM

Install linux-generic-lts-vivid package and reboot?

Starting with a 3.13 system...

  1. apt-get install linux-generic-lts-vivid
  2. apt-get install linux-image-3.19 linux-headers-3.19
  3. apt-get dist-upgrade
  4. puppet agent -tv
  5. reboot

labvirt1005 with 3.19.0-22-lowlatency has survived quite a few cycles of suspend/resume. So I'm convinced that it does not exhibit that particular bug, at least.

Oh, except on 3.19.0, resuming an instance doesn't work. It says it's resuming but actually never works again.

well, hm, or /something/ about it breaks. Still investigating.

So, here's what I'm seeing:

  • 3.19 doesn't crash with suspend/resume. That's good!
  • Suspend/resume doesn't work reliably... instances seem to lose some amount of network access after resuming. (existing sessions work, dns works, but I can't really ping anywhere or start a new ssh session.)

The latter seems like it's probably an OpenStack issue and not a kernel issue -- probably good to upgrade nova before we update the kernel on the compute nodes in case it helps.

Let's schedule this for one of the live labvirts next week.

Use labvirt1009; it has only 3 tools instances and they can all be failed over or sustain downtime.

@Andrew can you send out a scheduling email? If labvirt1009 is ok with you I can provide exact toollabs failover mechanisms.

I think I've found a new issue with the .19 kernel, so investigating further today.

Update:

3.19 kernels don't crash when I suspend/resume, but the VMs don't come up properly; their clocks are seriously broken such that a simple 'sleep' call hangs forever.
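A hanging 'sleep' can be detected from outside the VM by putting a time limit on the remote command. This is a rough sketch, assuming GNU coreutils `timeout` on the machine running the check; the function name and the instance name in the usage line are hypothetical.

```shell
# completes_within SECONDS COMMAND [ARGS...]
# Runs COMMAND under a time limit and succeeds only if it exits
# successfully before the limit. A command that hangs, or that fails on
# its own (e.g. ssh cannot connect), counts as a failure.
completes_within() {
    limit="$1"; shift
    timeout "$limit" "$@" >/dev/null 2>&1
}
```

For example, `completes_within 15 ssh test-instance sleep 1 || echo "test-instance unhealthy after resume"`. The time limit runs on the calling machine, so it does not depend on the guest's clock being sane.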

But... I may have found a sweet spot.

uname -a

Linux labvirt1005 3.16.0-45-generic #60~14.04.1-Ubuntu SMP Fri Jul 24 21:16:23 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

With this kernel, I can suspend and resume to my heart's content, and the resumed VMs actually work.

So, I propose we use 3.16 on the virt nodes, upgraded via:

apt-get install linux-image-generic-lts-utopic

I also need to file a bunch of bugs :(

Andrew added a comment. Aug 1 2015, 6:25 PM

Reboot of labvirt1009 is now scheduled and announced for Wednesday.

Andrew added a comment. Aug 5 2015, 5:08 PM

labvirt1009 is now running 3.16.0-45-generic.

A few tentative suspend/resumes suggest that all is well. If labvirt1009 is still happy at the end of the week I will schedule more updates.

coren moved this task from To Do to Doing on the Labs-Sprint-109 board.

labvirt1001, 1002, 1009 done.

Andrew moved this task from To do to Code Review/Blocked on the labs-sprint-110 board.
Andrew closed this task as Resolved. Aug 24 2015, 2:53 PM
Andrew claimed this task.

All labvirt hosts are now running 3.16 kernels, and puppet now actively excludes the known-buggy kernel versions.
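For a quick manual check on a host, the findings from this task can be turned into a small classifier. This is not the puppet code, and the thread does not say exactly which versions puppet excludes; the sketch only encodes what was observed here (3.13 locks up, 3.19 breaks resumed VMs, 3.16 works), and the function name is made up.

```shell
# kernel_verdict RELEASE
# Classifies a kernel release string (as printed by `uname -r`) according
# to the suspend/resume results reported in this task.
kernel_verdict() {
    case "$1" in
        3.13.*) echo "buggy: host lockups on suspend/resume" ;;
        3.19.*) echo "buggy: resumed VMs have broken clocks" ;;
        3.16.*) echo "ok" ;;
        *)      echo "untested" ;;
    esac
}
```

Usage would be `kernel_verdict "$(uname -r)"`.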

Andrew moved this task from Doing to Done on the labs-sprint-110 board. Aug 24 2015, 2:53 PM