
Instances locking up randomly
Closed, ResolvedPublic

Description

This has been happening at least once a day to the Kubernetes workers. They respond to ping but not to SSH, and processes on them are unreachable. tools-worker-08 is in this state right now.

Event Timeline

yuvipanda raised the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a project: Cloud-Services.
yuvipanda added a subscriber: yuvipanda.
chasemp set Security to None.

May or may not be significant, but tools-proxy-01 has a (partially) live userland, at least insofar as nginx continues accepting connections and proxying.

ssh, on the other hand, times out.

2015-12-16 16:24 andrewbogott: rebooting tools-exec-1221 as it was in kernel lockup
2015-12-16 23:14 andrewbogott: rebooting tools-exec-1407, unresponsive
2015-12-18 15:16 andrewbogott: rebooting locked up host tools-exec-1409

So, tools-worker-07 is currently "stuck". Kernel and userland both seem to be responsive, but logins over SSH time out. I logged in over VNC and Yuvi ran "passwd -d root" over salt (which still works), so I could log in over the VGA console and investigate.
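For reference, a minimal sketch of that kind of salt invocation; the target glob is an assumption, not necessarily the exact one used:

```
# Clear the root password on the stuck workers so the VGA/VNC console
# accepts a passwordless root login. Target glob is an assumption.
salt 'tools-worker-*' cmd.run 'passwd -d root'
```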

I investigated it extensively. All of the symptoms are explained by NFS being unresponsive and processes trying to access NFS getting stuck in the D state (since we don't mount with "soft", for some reason).
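A quick sketch of how to confirm this from the console, assuming standard ps/procfs tooling on the instance:

```
# List processes in uninterruptible sleep (state D) plus the kernel symbol
# they are blocked in; NFS hangs typically show rpc/nfs wait channels.
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

# Confirm the NFS mounts use the default "hard" behaviour (no "soft" option),
# which is why blocked processes never return with an error.
grep -E ' nfs4? ' /proc/mounts
```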

This in turn appears to be caused by a weird NFSv4 failure: the client seems to be sending a RENEW op and the server responds with NFS4ERR_EXPIRED; the client then seems to be sending a RENEW again and this goes on and on in a loop, with no other operations visible in the packet capture.
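For anyone who wants to reproduce the capture on a stuck client, a sketch of the tcpdump invocation (interface name and output path are assumptions):

```
# Capture NFS traffic from the client side; in the bad state the capture
# shows only RENEW calls (NFSv4 op 30) answered with NFS4ERR_EXPIRED (10011).
sudo tcpdump -i eth0 -s 0 -w /tmp/nfs4-renew-loop.pcap port 2049
```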

The response from the server appears to be a valid one, at least according to my naive interpretation of RFC 3530 (paragraphs 14.2.28 and 8.6.3).

Moreover, this seems to be happening only with instances running jessie with the 4.2 kernel (but not precise/3.2, trusty/3.13, or jessie/3.16). My theory so far is that it's a Linux kernel bug in the NFSv4 client, one that unfortunately I have not been able to pinpoint despite spending quite some time reading kernel git logs and commits.
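A quick way to double-check that correlation across the fleet, sketched with salt (the target glob is an assumption):

```
# Report the running kernel and distro codename for each instance, to confirm
# that only jessie + 4.2 instances hit the RENEW loop.
salt 'tools-*' cmd.run 'uname -r; lsb_release -sc'
```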

As for next steps, it might be worth it to test with older (and newer!) kernels. Yuvi is going to try with 3.19.

OK, I'm going to move tools-worker-01 through -05 to 3.19 and keep -06 through -09 on 4.2.

@valhallasw raised the good point that root SSH logins should work even with NFS failed/stuck. I ran strace and found that ssh-key-ldap-lookup was trying to access /home/ssh-key-ldap-lookup/.local/lib/python2.7/site-packages. Yuvi pushed d373c7ca80664fe2a1e02006d038cbcca2217868 (cf. T104327) and this is now fixed, so we should be able to SSH in as root normally on instances locked up because of this bug.
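A sketch of the strace check, in case someone needs to re-verify this on a stuck instance; the lookup script's path and the username argument are assumptions:

```
# Trace the AuthorizedKeysCommand directly; if it blocks in open()/stat()
# under /home, the root login path still depends on NFS.
strace -f -e trace=open,openat,stat /usr/sbin/ssh-key-ldap-lookup root
```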

yuvipanda claimed this task.

I'm going to close this, since we are pretty sure this particular type of hang was caused by NFS.