This is mostly to document what happened: after restarting `tools-webgrid-generic-1404`, `qmaster` kept dying, with `/var/lib/gridengine/spool/qmaster/messages` saying:
09/24/2015 14:56:01|worker|tools-master|Efirstname.lastname@example.org reports running job (11474.1/master) in queue "email@example.com" that was not supposed to be there - killing
09/24/2015 14:56:09|worker|tools-master|E|writing job finish information: can't locate queue "firstname.lastname@example.org"
09/24/2015 14:56:09|worker|tools-master|W|job 1766173.1 failed on host <unknown host> before writing exit_status because: shepherd exited with exit status 19: before writing exit_status
09/24/2015 14:56:09|worker|tools-master|C|!!!!!!!!!! got NULL element for QU_rerun !!!!!!!!!!
So I added the host back to `host_aliases`, restarted `gridengine-master` and `gridengine-exec` on the host and everything seems to be fine so far.
I'll try again (restarting `gridengine-master` and `gridengine-exec` with `host_aliases` //not// containing the host) in a few hours to see if the old jobs just needed to be purged from some list.
More importantly, I'm interested in how to avoid this :-). I had looked at `qhost -h $hostname` and its output was empty, so I never would have assumed that a reference to that host still existed anywhere. I rechecked that it didn't appear in any queue's execution host list, so the only remaining reference indeed seems to have been on the host itself.
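
For next time, a broader check before removing a host from `host_aliases` might look like the following. This is only a minimal sketch: the choice of queries (`qconf -sel`, `qconf -sh`, `qconf -ss`, `qstat -u '*' -s r`) is my assumption about where a stale reference could show up, not a complete list, and the helper names `find_host_references` and `collect_outputs` are made up for illustration. It also would not have caught a job that only the execd on the host itself still knew about.

```python
import subprocess

# Queries whose output might still mention a host. The selection is an
# assumption for illustration, not an exhaustive list of places to look.
QUERIES = {
    "exec hosts": ["qconf", "-sel"],
    "admin hosts": ["qconf", "-sh"],
    "submit hosts": ["qconf", "-ss"],
    "running jobs": ["qstat", "-u", "*", "-s", "r"],
}


def find_host_references(outputs, hostname):
    """Return {label: [matching lines]} for every output that mentions hostname."""
    hits = {}
    for label, text in outputs.items():
        lines = [line.strip() for line in text.splitlines() if hostname in line]
        if lines:
            hits[label] = lines
    return hits


def collect_outputs(queries=QUERIES):
    """Run each query and capture its stdout; needs the gridengine tools on PATH."""
    return {
        label: subprocess.run(cmd, capture_output=True, text=True).stdout
        for label, cmd in queries.items()
    }
```

On the master this would be called as `find_host_references(collect_outputs(), "tools-webgrid-generic-1404")`; an empty result only means that none of these particular queries mention the host. Matching on the short hostname is deliberate, since `qstat` may truncate long queue instance names in its default output.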