Instances locking up randomly
Closed, Resolved, Public

Description

This has been happening at least once a day to the Kubernetes workers. They respond to ping but not to ssh, and processes on them are unreachable. tools-worker-08 is in this state right now.

yuvipanda updated the task description.
yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda added a project: Cloud-Services.
yuvipanda added a subscriber: yuvipanda.
Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 20 2015, 9:12 AM
chasemp triaged this task as High priority. Dec 21 2015, 6:26 PM
chasemp set Security to None.
coren added a subscriber: coren. Dec 21 2015, 6:28 PM

May or may not be significant, but tools-proxy-01 has a (partially) live userland, at least insofar as nginx continues accepting connections and proxying.

ssh, on the other hand, times out.

Andrew added a subscriber: Andrew. Dec 21 2015, 6:30 PM

2015-12-16 16:24 andrewbogott: rebooting tools-exec-1221 as it was in kernel lockup
2015-12-16 23:14 andrewbogott: rebooting tools-exec-1407, unresponsive
2015-12-18 15:16 andrewbogott: rebooting locked up host tools-exec-1409

faidon added a subscriber: faidon. Dec 23 2015, 10:35 AM

So, tools-worker-07 is currently "stuck". Kernel and userland both seem to be responsive, but logins over SSH time out. I logged in over VNC and Yuvi ran "passwd -d root" over salt (which still works), so I could log in over the VGA console and investigate.

I investigated it extensively. All of the symptoms are explained by NFS being unresponsive and processes that try to access NFS getting stuck in the D state (since we don't mount with "soft" for some reason).
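For anyone checking another instance for the same symptom, here is a minimal sketch that lists processes in the D (uninterruptible sleep) state by reading /proc directly. The function names are mine and purely illustrative; this is not a tool we ship, and it assumes a Linux /proc layout.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the entry was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling find_uninterruptible() from a console on an affected worker should show the processes that are blocked; whether they are blocked on NFS specifically still has to be confirmed separately.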

This in turn appears to be caused by a weird NFSv4 failure: the client seems to be sending a RENEW op and the server responds with NFS4ERR_EXPIRED; the client then seems to be sending a RENEW again, and this goes on and on in a loop, with no other operations visible in the packet capture.

The response from the server appears to be a valid one, at least according to my naive interpretation of RFC 3530 (paragraphs 14.2.28 and 8.6.3).
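To make the pattern concrete, here is a rough sketch of how one could flag this loop once a capture has been decoded. It assumes the capture has already been reduced to a list of (operation, status) pairs by some other tool; that input format, the function names, and the threshold of 5 are all hypothetical choices of mine.

```python
def renew_expired_streak(exchanges):
    """Return the longest run of consecutive RENEW -> NFS4ERR_EXPIRED exchanges."""
    longest = current = 0
    for operation, status in exchanges:
        if operation == "RENEW" and status == "NFS4ERR_EXPIRED":
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # any other exchange breaks the streak
    return longest


def looks_like_renew_loop(exchanges, threshold=5):
    """Heuristic: treat a long enough streak as the stuck-client pattern."""
    return renew_expired_streak(exchanges) >= threshold
```

A healthy client would be expected to show RENEW interleaved with other operations, so the streak stays short; on the stuck worker the capture contained nothing but this exchange.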

Moreover, this seems to be happening only on instances running jessie with the 4.2 kernel (but not precise/3.2, trusty/3.13 or jessie/3.16). My theory so far is that it's a Linux kernel bug in the NFSv4 client, one that unfortunately I have not been able to pinpoint despite spending quite some time reading kernel git logs and commits.

As for next steps, it might be worth it to test with older (and newer!) kernels. Yuvi is going to try with 3.19.

Ok, I'm going to move tools-worker-01 through -05 to 3.19 and -06 through -09 to 4.2.

@valhallasw raised the good point that root SSH logins should be working even with NFS failed/stuck. I ran strace and found out that ssh-key-ldap-lookup was trying to access /home/ssh-key-ldap-lookup/.local/lib/python2.7/site-packages. Yuvi pushed d373c7ca80664fe2a1e02006d038cbcca2217868 (cf. T104327) and this is now fixed, so we should be able to ssh as root normally on instances locked up because of this bug.
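The path in the strace output looks like Python's per-user site-packages directory, which the interpreter adds to its search path at startup by default. I am assuming that is why the lookup script touched /home at all; I am not describing what the commit above actually changes. As a generic illustration only, the sketch below checks that the interpreter's -s flag disables the user site directory (the helper name is made up).

```python
import subprocess
import sys


def child_skips_user_site(interpreter_args):
    """Start a child interpreter and report whether its user site dir is disabled."""
    probe = "import sys; print(sys.flags.no_user_site)"
    result = subprocess.run(
        [sys.executable, *interpreter_args, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "1"
```

With ["-s"] the child should report the user site directory as disabled, so a script started that way would not need to look under the user's home directory for packages.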

yuvipanda closed this task as Resolved. Jul 29 2016, 11:57 PM
yuvipanda claimed this task.

I'm going to close this, since we are pretty sure this particular type of hang was caused by NFS.