Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | valhallasw | T100554 Grid engine masters down
Declined | | scfc | T100660 Test if grid engine master non-failure depends on the lengths of /etc/hosts lines
Event Timeline
I could consistently reproduce the failures by adding the lines back. The question is whether the problem is the length of individual lines or the total size of the file.
IIRC the aliases were in that instance's /etc/hosts previously, so the overall size should only have decreased, but (cf. source/libs/uti/sge_hostname.c):
    {
       struct hostent re;
       char buffer[4096];

       /* No need to malloc he because it will end up pointing to re. */
       gethostbyname_r(name, &re, buffer, 4096, &he, &l_errno);

       /* Since re contains pointers into buffer, and both re and the buffer go
        * away when we exit this code block, we make a deep copy to return. */
       /* Yes, I do mean to check if he is NULL and then copy re!  No, he
        * doesn't need to be freed first. */
       if (he != NULL) {
          he = sge_copy_hostent (&re);
       }
    }
From gethostbyname_r(3):
Glibc2 also has reentrant versions gethostent_r(), gethostbyaddr_r(), gethostbyname_r() and gethostbyname2_r(). The caller supplies a hostent structure ret which will be filled in on success, and a temporary work buffer buf of size buflen. […] In addition to the errors returned by the nonreentrant versions of these functions, if buf is too small, the functions will return ERANGE, and the call should be retried with a larger buffer. […]
So if gethostbyname_r() uses buffer to parse lines, lines with a length of more than 4096 - 1 characters would cause an error. I will test this later.
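To check an instance against the line-length hypothesis, something like the following could be used. This is only a sketch: `longest_line` is a hypothetical helper, and the 4096 - 1 threshold is the assumption from the comment above, not a confirmed limit.

```c
#include <stdio.h>

/*
 * Hypothetical helper: return the length (excluding the newline) of the
 * longest line in the file at `path`, or -1 if the file cannot be opened.
 */
static long longest_line(const char *path)
{
    FILE *fp = fopen(path, "r");
    long longest = 0;
    long current = 0;
    int c;

    if (fp == NULL)
        return -1;

    while ((c = getc(fp)) != EOF) {
        if (c == '\n') {
            if (current > longest)
                longest = current;
            current = 0;
        } else {
            current++;
        }
    }
    /* Account for a final line that has no trailing newline. */
    if (current > longest)
        longest = current;

    fclose(fp);
    return longest;
}
```

Calling `longest_line("/etc/hosts")` and comparing the result with 4095 would show whether any single line could exceed the fixed buffer under that assumption.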
If so, one solution would be to fix SGE, the second would be to fix Puppet's host resource. I think the latter would be more useful.
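For the SGE side, the man page's advice (retry with a larger buffer on ERANGE) could be sketched roughly as below. This is not SGE code: `resolve_with_retry`, the doubling strategy, and the `max_len` cap are all assumptions made for illustration, and the sketch relies on glibc's `gethostbyname_r()` signature.

```c
#define _DEFAULT_SOURCE /* exposes gethostbyname_r() in glibc */
#include <errno.h>
#include <netdb.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of SGE: resolve `name`, doubling the work
 * buffer whenever gethostbyname_r() reports ERANGE.
 *
 * On success returns 0, fills *out, stores the malloc'ed work buffer in
 * *buf_out and its final size in *len_out. The pointers inside *out point
 * into that buffer, so the caller must free it only after it is done with
 * *out (or after making a deep copy). On failure returns a nonzero value
 * and leaves *buf_out set to NULL.
 */
static int resolve_with_retry(const char *name, struct hostent *out,
                              char **buf_out, size_t *len_out,
                              size_t initial_len, size_t max_len)
{
    size_t len = initial_len ? initial_len : 1;
    char *buf = NULL;
    struct hostent *result = NULL;
    int h_err = 0;
    int rc;

    *buf_out = NULL;
    *len_out = 0;

    for (;;) {
        char *grown = realloc(buf, len);
        if (grown == NULL) {
            free(buf);
            return ENOMEM;
        }
        buf = grown;

        rc = gethostbyname_r(name, out, buf, len, &result, &h_err);
        if (rc != ERANGE)
            break;              /* success, or a failure unrelated to size */
        if (len > max_len / 2) {
            free(buf);          /* give up instead of growing without bound */
            return ERANGE;
        }
        len *= 2;
    }

    if (rc != 0 || result == NULL) {
        free(buf);
        return rc != 0 ? rc : -1; /* -1: the lookup ran but found no entry */
    }

    *buf_out = buf;
    *len_out = len;
    return 0;
}
```

A caller would start with something like `resolve_with_retry(name, &re, &buf, &len, 4096, 1 << 20)`, deep-copy the result as the existing code does, and then `free(buf)`.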
My current fix is to get rid of the labsdb aliases on the master - they aren't being used from there at all.
I agree that fixing Puppet's host resource might be generally useful. I'm still confused about how this sometimes worked, though. A buffer overflow is a buffer overflow...
It could be that if the host name is already cached by nscd (for example by debugging on the command line), gethostbyname_r() doesn't even try to parse /etc/hosts. I will test that as well.
We've gotten rid of most /etc/hosts customizations, and this is a year old now, so I am going to close this as resolved.