
kvm on ganeti instances getting stuck
Closed, ResolvedPublic



We have a number of VMs that showed a weird behavior at times. Symptoms were:

* Neon saying services on the host are down, but not the host
* Indeed the host would ping and most networking would still work, but no SSH
* Ganglia would show huge I/O wait
* Connecting to the console via sudo gnt-instance console <vm_hostname> and hitting a single enter would fix it
* Sometimes just connecting via ssh would fix it
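The two wake-up workarounds above can be sketched as a small shell snippet. This is illustrative only: the instance name is a placeholder, and `DRY_RUN` defaults to `echo` so nothing is executed against a real cluster unless you clear it on a Ganeti master node.

```shell
# Hypothetical sketch of the wake-up workarounds described above.
# DRY_RUN defaults to "echo" so the commands are only printed; set
# DRY_RUN="" on a real Ganeti master node to actually run them.
DRY_RUN="${DRY_RUN:-echo}"
INSTANCE="${INSTANCE:-alsafi.codfw.wmnet}"   # placeholder instance

# Option 1: attach to the serial console (a single Enter keypress
# was usually enough to unstick the guest).
$DRY_RUN sudo gnt-instance console "$INSTANCE"

# Option 2: sometimes just opening an SSH connection woke the VM up.
$DRY_RUN ssh -o ConnectTimeout=10 "$INSTANCE" true
```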

Probable cause

KVM/QEMU bug. Very rare and seemingly dependent on VM load. No way to reproduce it has been found yet. It would usually trigger in newly created VMs. VMs with any kind of load would not show this symptom, hence the relatively low priority; idling VMs were the most likely to display the problem. Talking to other people who had experienced the bug (I know of exactly 2) seemed to yield a workaround. The bug has NOT been filed upstream to my knowledge, mostly due to the difficulty of reproducing it.

An effort for a workaround

Setting disk_aio to native on a couple of VMs yielded promising results. The issue had not been reproduced on them, and there were no other side effects either. Unfortunately, after migrating all the hosts to use the setting, the issue is still present.
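For reference, the disk_aio hypervisor parameter is set per instance. A hedged sketch of applying the attempted workaround (instance name is a placeholder; `DRY_RUN` defaults to `echo`, so this is safe to run anywhere):

```shell
# Sketch: switch the Ganeti KVM hypervisor parameter disk_aio from
# the default "threads" to "native" on one instance.
# DRY_RUN defaults to "echo"; clear it to run for real on a master node.
DRY_RUN="${DRY_RUN:-echo}"
INSTANCE="${INSTANCE:-alsafi.codfw.wmnet}"   # placeholder instance

# Change the hypervisor parameter, then reboot so it takes effect.
$DRY_RUN sudo gnt-instance modify -H disk_aio=native "$INSTANCE"
$DRY_RUN sudo gnt-instance reboot "$INSTANCE"
```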

Other stuff

There is one old bug dating back to 2012 that seemed bad but does NOT apply to our case: it involved sparse files used on ext4 or xfs volumes, and it has been fixed since then.

Event Timeline

furud.codfw.wmnet also experienced high load average; a reboot fixed it (T134098)

Another occurrence was serpens yesterday morning (which is also listed as having the workaround applied in the etherpad, so that doesn't seem to be effective)

Yes, familiar with this issue. Next time you see it, when Icinga reports a bunch of services as down and they have in common that they are all on ganeti VMs and one of them is alsafi, just do "ssh alsafi" and nothing else, and watch the magic happen: Icinga recoveries will appear without you doing anything else. It's like ganeti networking goes into hibernation or something, and when there is activity things wake up again.
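The "just ssh to it" trick is easy to script against a list of suspect VMs. A minimal sketch (the host list is illustrative, and `SSH_CMD` defaults to `echo ssh` so the loop can be dry-run without network access):

```shell
# Try a short-lived SSH connection to each suspect Ganeti VM; the
# connection attempt alone was enough to wake a stuck guest.
# SSH_CMD defaults to "echo ssh" so the loop is safe to dry-run;
# set SSH_CMD=ssh to actually connect.
SSH_CMD="${SSH_CMD:-echo ssh}"
for host in alsafi.codfw.wmnet; do    # hypothetical host list
    $SSH_CMD -o ConnectTimeout=10 "$host" true \
        || echo "$host did not wake up"
done
```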

Also, try grepping SAL for "alsafi" — it's a fun history.

* 09:10 moritzm: powercycled alsafi (stuck in KVM)
* 02:21 mutante: ssh alsafi
* 23:56 mutante: ssh alsafi
* 17:57 mutante: ssh alsafi fixes ganeti VM timeouts once again
* 18:39 mutante: alsafi deleted service template file remnant
* 09:08 paravoid: hard-resetting alsafi, I/O-stuck (qemu bug?)
* 19:00 mutante: alsafi back up with 4.4 kernel
* 18:57 mutante: alsafi - url-downloader codfw - reboot
* 01:46 mutante: alsafi was hanging, and the second I connected to the Ganeti console it was back like nothing happened
* 10:45 _joe_: rebooting alsafi
* 08:08 _joe_: gnt-console reboot of alsafi
* 08:19 akosiaris: gnt-instance reboot
* 15:56 paravoid: "power"cycling alsafi
* 21:45 mutante: alsafi - was reported down in icinga, is ganeti VM - fixed by just logging in, as if it went to hibernate
* 07:34 _joe_: rebooting alsafi, unresponsive to ssh
* 13:51 paravoid: powercycling alsafi

^ for me it almost always works with just "ssh alsafi" without even resetting or rebooting anything

Unfortunately disk_aio=native did not solve the problem. It is, however, not yet possible to reproduce it reliably.

Perhaps a qemu upgrade to a newer version (jessie-backports has 2.5; we are on 2.1) could help. I'll empty a host in codfw and move alsafi over there.

pollux had this issue 2 days ago; an email was sent to ops, but it clearly seems related (I/O load).

It happened again a few minutes ago on alsafi, fixed on doing ssh again.

I've emptied ganeti2006 and drained it. It cannot accept new VMs (either primary or secondary). I've left alsafi on it and upgraded qemu to 2.5+dfsg-4~bpo8+1.
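The drain-and-upgrade steps roughly correspond to the following commands. This is a sketch, not the exact procedure used: the node name comes from the comment above, and `DRY_RUN` defaults to `echo` so the script only prints what it would do.

```shell
# Sketch of emptying and upgrading one Ganeti node. DRY_RUN defaults
# to "echo"; clear it to run for real (gnt-node on the master node,
# apt-get on the node being upgraded).
DRY_RUN="${DRY_RUN:-echo}"
NODE="${NODE:-ganeti2006.codfw.wmnet}"   # node from the comment above

# Mark the node drained: no new primary or secondary instances land on it.
$DRY_RUN sudo gnt-node modify --drained=yes "$NODE"
# Live-migrate existing primary instances to their secondary nodes
# (alsafi was deliberately left behind as the test instance).
$DRY_RUN sudo gnt-node migrate -f "$NODE"
# On the node itself: install the newer qemu from jessie-backports.
$DRY_RUN sudo apt-get install -t jessie-backports qemu-system-x86
```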

Now the waiting part begins. Let's see if the bug is still there.

Today I restarted planet2001 and kraz. The symptoms were different this time: they did not wake up on connection, and SSH and the console were down/overloaded. I restarted them and the services came back to life, with an error, though:

Sun May 22 14:41:05 2016  - WARNING: Could not shutdown block device disk/0 on node ganeti2005.codfw.wmnet: drbd4: can't shutdown drbd device: resource4: State change failed: (-12) Device is held open by someone\nadditional info from kernel:\nfailed to demote\n

Despite all services being green again, the service doesn't seem to be reestablished (T135948), and the instructions are not up to date:

Actually, I added them too, in a more complete form: IRCD#Services

alsafi, on the other hand, has been up for 10 consecutive days without displaying any of the symptoms. I think we've probably got a decent fix at last. I'll wait a couple more days and, if all goes well, schedule a cross-cluster upgrade.

Mentioned in SAL [2016-06-14T11:13:40Z] <akosiaris> T134242 install qemu-system-common, qemu-system-x86 1:2.5+dfsg-4~bpo8+1 from jessie-backports on ganeti200{1,2,3,4,5,6}

Mentioned in SAL [2016-06-14T11:15:32Z] <akosiaris> T134242 rebooting hassaleh.codfw.wmnet planet2001.codfw.wmnet pybal-test2001.codfw.wmnet pybal-test2002.codfw.wmnet pybal-test2003.codfw.wmnet for qemu-kvm upgrade

All codfw VMs have been upgraded to qemu 2.5. I'll wait a few more days for any problems to manifest and then do eqiad.

Mentioned in SAL [2016-07-07T10:07:27Z] <akosiaris> reboot etherpad1001.eqiad.wmnet, kernel upgrade and qemu upgrade, T134242

Mentioned in SAL [2016-07-07T10:13:47Z] <akosiaris> reboot bohrium T134242

Mentioned in SAL [2016-07-07T10:22:09Z] <akosiaris> reboot bromine T134242

Mentioned in SAL [2016-07-07T10:22:15Z] <akosiaris> reboot dubnium T134242

Mentioned in SAL [2016-07-07T10:25:55Z] <akosiaris> reboot mx1001, planet1001, rutherfordium, seaborgium, ununpentium T134242

Mentioned in SAL [2016-07-07T10:29:40Z] <akosiaris> reboot hassium.eqiad.wmnet krypton.eqiad.wmnet mendelevium.eqiad.wmnet T134242

All eqiad VMs have been upgraded to qemu 2.5 as well. I'll leave this open just in case some bug manifests, but otherwise I consider it resolved.

Enough days have passed and there seem to have been no issues. Resolving.