
Test if grid engine master non-failure depends on the lengths of /etc/hosts lines
Closed, Declined · Public

Event Timeline

scfc created this task. May 28 2015, 2:44 PM
scfc raised the priority of this task to Needs Triage.
scfc updated the task description.
scfc added a project: Toolforge.

I could consistently reproduce the failures by adding the lines back. The question is whether the problem was the length of individual lines or the total size of the file.
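To tell the two cases apart, it helps to measure both numbers for the /etc/hosts in question. The following is only a sketch: `hosts_file_stats` is a hypothetical helper written for this illustration and is not part of SGE or Puppet. It reports the total size in bytes and the length of the longest line, not counting the newline.

```c
#include <stdio.h>

/* Hypothetical helper (not from SGE): scan a hosts-style file and report
 * its total size in bytes and the length of its longest line, excluding
 * the trailing newline. Returns 0 on success, -1 if the file cannot be
 * opened. */
int hosts_file_stats(const char *path, size_t *total_bytes, size_t *longest_line)
{
    FILE *fp = fopen(path, "r");
    size_t total = 0, current = 0, longest = 0;
    int c;

    if (fp == NULL) {
        return -1;
    }

    while ((c = getc(fp)) != EOF) {
        total++;
        if (c == '\n') {
            /* End of a line: remember it if it is the longest so far. */
            if (current > longest) {
                longest = current;
            }
            current = 0;
        } else {
            current++;
        }
    }
    /* The last line may not end with a newline. */
    if (current > longest) {
        longest = current;
    }

    fclose(fp);
    *total_bytes = total;
    *longest_line = longest;
    return 0;
}
```

Running this against the file before and after re-adding the alias lines would show whether only the longest line, only the total size, or both changed between the working and the failing state.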

scfc added a comment. May 28 2015, 3:26 PM

IIRC the aliases were in that instance's /etc/hosts previously, so the overall size should only have decreased, but (cf. source/libs/uti/sge_hostname.c):

{
   struct hostent re;
   char buffer[4096];

   /* No need to malloc he because it will end up pointing to re. */
   gethostbyname_r(name, &re, buffer, 4096, &he, &l_errno);

   /* Since re contains pointers into buffer, and both re and the buffer go
    * away when we exit this code block, we make a deep copy to return. */
   /* Yes, I do mean to check if he is NULL and then copy re!  No, he
    * doesn't need to be freed first. */
   if (he != NULL) {
      he = sge_copy_hostent (&re);
   }
}

From gethostbyname_r(3):

Glibc2 also has reentrant versions gethostent_r(), gethostbyaddr_r(), gethostbyname_r() and gethostbyname2_r(). The caller supplies a hostent structure ret which will be filled in on success, and a temporary work buffer buf of size buflen. […] In addition to the errors returned by the nonreentrant versions of these functions, if buf is too small, the functions will return ERANGE, and the call should be retried with a larger buffer. […]

So if gethostbyname_r() uses buffer to parse lines, any line longer than 4095 characters (4096 minus the terminating NUL) would cause an error, and the code above never retries with a larger buffer. I will test this later.
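For illustration, the retry that the man page describes could look roughly like the sketch below. This is not SGE's code and not a proposed patch: `resolve_with_retry` and `MAX_SCRATCH_LEN` are hypothetical names, and the 1 MiB cap is an arbitrary assumption. The sketch assumes the glibc signature of gethostbyname_r() and doubles the scratch buffer each time the call returns ERANGE.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <netdb.h>
#include <stdlib.h>

/* Arbitrary upper bound for the scratch buffer (assumption). */
#define MAX_SCRATCH_LEN ((size_t)1 << 20)

/* Hypothetical helper: resolve `name` with gethostbyname_r(), doubling the
 * scratch buffer whenever the call reports ERANGE.
 *
 * On success returns 0, fills *ret, and stores the scratch buffer in
 * *buf_out. *ret contains pointers into that buffer, so the caller must
 * keep it alive while using *ret and free() it afterwards.
 * On failure returns a non-zero errno-style code and sets *buf_out to NULL.
 * *attempts receives the number of gethostbyname_r() calls made. */
int resolve_with_retry(const char *name, struct hostent *ret, char **buf_out,
                       size_t initial_len, int *attempts)
{
    size_t len = initial_len > 0 ? initial_len : 1;
    char *buf = NULL;
    struct hostent *result = NULL;
    int h_err = 0;
    int rc;

    *attempts = 0;
    *buf_out = NULL;

    for (;;) {
        char *tmp = realloc(buf, len);
        if (tmp == NULL) {
            free(buf);
            return ENOMEM;
        }
        buf = tmp;

        (*attempts)++;
        rc = gethostbyname_r(name, ret, buf, len, &result, &h_err);
        if (rc != ERANGE) {
            break;              /* success or a non-size-related failure */
        }
        if (len > MAX_SCRATCH_LEN / 2) {
            free(buf);
            return ERANGE;      /* give up instead of growing forever */
        }
        len *= 2;               /* buffer too small: retry with more room */
    }

    if (rc != 0 || result == NULL) {
        free(buf);
        return rc != 0 ? rc : ENOENT;   /* ENOENT stands in for "not found" */
    }

    *buf_out = buf;
    return 0;
}
```

Calling it with a deliberately tiny `initial_len` is a simple way to exercise the retry path, since the first call cannot succeed and `*attempts` ends up greater than one.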

If so, one solution would be to fix SGE, the second would be to fix Puppet's host resource. I think the latter would be more useful.

My current fix is to get rid of the labsdb aliases on the master - they aren't being used from there at all.

I agree that fixing Puppet's host resource might be generally useful. I'm still confused about how this sometimes worked, though. A buffer overflow is a buffer overflow...

scfc added a comment. May 28 2015, 3:53 PM

It could be that if the host name is already cached by nscd (for example by debugging on the command line), gethostbyname_r() doesn't even try to parse /etc/hosts. I will test that as well.

valhallasw assigned this task to scfc. Jul 2 2015, 7:51 PM
valhallasw triaged this task as Low priority.
valhallasw set Security to None.
Restricted Application added a project: Cloud-Services. Jul 2 2015, 7:51 PM
valhallasw moved this task from Triage to Backlog on the Toolforge board. Jul 2 2015, 7:54 PM
yuvipanda closed this task as Resolved. Jul 5 2016, 12:29 PM

We've gotten rid of most /etc/hosts customizations, and this is a year old now, so I am going to close this as resolved.

scfc changed the task status from Resolved to Declined. Jan 17 2017, 11:41 PM