This is mainly to document what happened: after restarting tools-webgrid-generic-1404, qmaster kept dying, with /var/lib/gridengine/spool/qmaster/messages saying:
09/24/2015 14:56:01|worker|tools-master|Efirstname.lastname@example.org reports running job (11474.1/master) in queue "email@example.com" that was not supposed to be there - killing
09/24/2015 14:56:09|worker|tools-master|E|writing job finish information: can't locate queue "firstname.lastname@example.org"
09/24/2015 14:56:09|worker|tools-master|W|job 1766173.1 failed on host <unknown host> before writing exit_status because: shepherd exited with exit status 19: before writing exit_status
09/24/2015 14:56:09|worker|tools-master|C|!!!!!!!!!! got NULL element for QU_rerun !!!!!!!!!!
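For skimming the messages file for the entries that matter (the E and C lines), something like the following might help. It is only a sketch: the field layout (timestamp|thread|host|level|message) is inferred from the lines quoted above rather than from any gridengine documentation, and the names parse_line and filter_levels are made up for illustration.

```python
import re

# Assumed layout, inferred from the quoted log lines:
#   MM/DD/YYYY HH:MM:SS|thread|host|level|message
# The level is a single capital letter (e.g. W, E, C in the lines above).
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|([^|]*)\|([^|]*)\|([A-Z])\|(.*)$"
)


def parse_line(line):
    """Return a dict for one log line, or None if it does not match the assumed layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp, thread, host, level, message = match.groups()
    return {
        "timestamp": timestamp,
        "thread": thread,
        "host": host,
        "level": level,
        "message": message,
    }


def filter_levels(lines, levels=("E", "C")):
    """Keep only the parsed entries whose level is in `levels`."""
    entries = (parse_line(line) for line in lines)
    return [entry for entry in entries if entry and entry["level"] in levels]
```

Calling filter_levels on the lines of the messages file (for example via open(path) as an iterable) returns the matching entries in order; lines that do not fit the assumed layout are silently skipped.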
So I added the host back to host_aliases, restarted gridengine-master and gridengine-exec on the host and everything seems to be fine so far.
I'll try again (restarting gridengine-master and gridengine-exec with host_aliases not containing the host) in a few hours to see if the old jobs just needed to be purged from some list.
More importantly, I'm interested in how to avoid this :-). I had looked at qhost -h $hostname and it was empty, and so I never would have assumed that a reference to that host was anywhere. I rechecked that it didn't appear in any queue execution host lists, so the only reference indeed seems to have been the host itself.