I've been looking at the situation surrounding NFSv4 idmapping in Labs, because of the effects of T78076 and explanations that did not make sense to me.
It looks like the current situation is severely buggy and very messy, for several reasons:
- idmapping in the kernel is enabled by default in kernels < 3.3 and disabled by default in kernels >= 3.3 (cf. Linux upstream commit). This means that the behavior on precise differs from that on trusty (and jessie), which in turn means that half of the Labs fleet does idmapping and the other half does not(!)
- Moreover, there used to be an upstart script in our repository that explicitly disabled idmapping, but it is no longer applied. Confusingly enough, the upstart script was left over in our repository, although referenced from nowhere (I killed this). It seems to have mostly been a pmtpa zone thing according to git log; however, a salt run across Labs revealed a single eqiad instance that still has this upstart script in place.
- Even if that upstart script were applied, it would be broken anyway, as that sysfs setting needs to be applied before any mounts happen; a better strategy would have been to echo "options nfs nfs4_disable_idmapping=1" > /etc/modprobe.d/nfs-disable-idmap.conf (modprobe.d files need the .conf suffix), which is simpler to code anyway.
- Even though the intention seems to be to have idmapping disabled, labstore1001 includes ldap::role::client::labs, which additionally breaks prod logins to the server, making it one of a kind across our (prod) fleet. Moreover, it has the incredibly ugly User['apache'] definition. Both look like workarounds for idmapping issues, although the intention was probably the opposite.
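To see which side of the split a given instance is on, something like the following could be run across the fleet (e.g. via salt). This is a minimal sketch, assuming the nfs module is loaded and exposes its parameter at the usual sysfs location; the function name idmap_state is made up for illustration.

```shell
#!/bin/sh
# Sketch: report whether NFSv4 idmapping is disabled on this host.
# Assumption: the nfs module exposes the setting as a boolean parameter
# at the path below; the path can be overridden via PARAM.
PARAM="${PARAM:-/sys/module/nfs/parameters/nfs4_disable_idmapping}"

# idmap_state FILE: prints "disabled", "enabled" or "unknown".
# (Hypothetical helper; not part of any existing tooling.)
idmap_state() {
    if [ ! -r "$1" ]; then
        # Module not loaded, or parameter not present on this kernel.
        echo "unknown"
        return
    fi
    case "$(cat "$1")" in
        Y|1) echo "disabled" ;;   # nfs4_disable_idmapping is set
        N|0) echo "enabled" ;;    # idmapping is in effect
        *)   echo "unknown" ;;
    esac
}

echo "$(uname -r): idmapping $(idmap_state "$PARAM")"
```

Note that this only reports the current value of the module parameter; as mentioned above, changing it after the mounts have happened does not help.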
This is obviously a very messy situation. I propose the following (and I'd like the Labs team, probably @coren, to implement it, if there's agreement):
- Disable idmapping everywhere. It's fully supported by RFCs nowadays. Not disabling it would mean that we'd have to instantiate all potential users across our puppet tree on labstore, which would be... gross. I suggest doing this by instantiating a modprobe.d file (as above) under an os_version('ubuntu < trusty') guard; note that this will need a reboot of the affected instances, both for the kernel option to be set and for the mounts to be redone.
- Clean up all possible remnants of that upstart script.
- Remove all the Labs LDAP stanzas (and User/Group apache) from labstore1001. This is currently a liability, as labstore1001, despite being a production host, has a different security model than the rest of production.
- Enforce uid/gids in our User/Group stanzas across all of our puppet tree. This is a good idea for production anyway, so it's not a Labs-only requirement.
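For the first item, what the modprobe.d file boils down to can be sketched in shell as below. This is an illustration only, not the puppet implementation: the function name and the file name nfs-disable-idmap.conf are assumptions, and the real change should be a puppet-managed file under the guard mentioned above.

```shell
#!/bin/sh
# Sketch: write a modprobe.d fragment that disables NFSv4 idmapping.
# Assumptions: the function name and the file name are hypothetical;
# the target directory defaults to /etc/modprobe.d.
write_nfs_idmap_conf() {
    conf_dir="${1:-/etc/modprobe.d}"
    # modprobe only picks up files ending in .conf
    conf_file="$conf_dir/nfs-disable-idmap.conf"
    printf 'options nfs nfs4_disable_idmapping=1\n' > "$conf_file"
    echo "$conf_file"
}

# Usage (as root), followed by a reboot so that the option is set
# before any NFS mount happens:
#   write_nfs_idmap_conf
```

The option only takes effect when the nfs module is loaded, which is why a reboot (or unmounting everything and reloading the module) is needed on the affected instances.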