host_aliases was created as an emergency fix when the switch to the new DNS structure (*.eqiad.wmflabs => *.tools.eqiad.wmflabs) failed. Because it makes SGE treat several host names as one and the same host, its effects can be confusing. In the long term, we should move to using the "true" host names instead and remove host_aliases.
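According to the SGE host_aliases file format (sge_h_aliases(5)), each line starts with the name SGE uses for a host, followed by the other names that should be treated as that same host. The minimal Python sketch below shows this resolution under that assumption. The function names and the sample host names are hypothetical, and the sample does not claim to reproduce the actual Tool Labs file or which name comes first there.

```python
from typing import Dict


def parse_host_aliases(text: str) -> Dict[str, str]:
    """Map every name on a line (including the first) to the first name on that line."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        names = line.split()
        if not names:
            continue  # skip blank lines
        canonical = names[0]  # assumption: first column is the name SGE uses
        for name in names:
            mapping[name] = canonical
    return mapping


def resolve(name: str, mapping: Dict[str, str]) -> str:
    """Return the name SGE would use for `name`; names not listed pass through unchanged."""
    return mapping.get(name, name)


# Hypothetical file contents, for illustration only.
SAMPLE = """\
tools-exec-01.eqiad.wmflabs tools-exec-01.tools.eqiad.wmflabs
tools-exec-02.eqiad.wmflabs tools-exec-02.tools.eqiad.wmflabs
"""

if __name__ == "__main__":
    aliases = parse_host_aliases(SAMPLE)
    print(resolve("tools-exec-01.tools.eqiad.wmflabs", aliases))
```

This is where the confusion comes from: a host that DNS reports under one name is presumably shown and managed by SGE under another.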
As SGE probably caches host names, the modus operandi to avoid catastrophic failures should be:
- For one of the submit hosts, execution hosts, etc., remove the old host name from its function, i.e. for an execution host, disable it and first drain all jobs running on it.
- Remove the alias from host_aliases.
- Restart the grid master service.
- Add the new host name to its function, i.e. add the execution host as usual and enable it.
- Increase "one" in the first step to a comfortable number and repeat the process.
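The batching in the steps above can be sketched in Python as follows. This is a planning aid only, under stated assumptions. The growth factor of 2 is an arbitrary stand-in for "a comfortable number". The helper names and host names are hypothetical. The helper drops every host_aliases line that mentions a host in the current batch. The actual disabling, draining, restarting and re-adding are done with the usual SGE tools and appear only as comments.

```python
from typing import Iterator, List


def plan_batches(hosts: List[str], first: int = 1, growth: int = 2) -> Iterator[List[str]]:
    """Yield successive batches: `first` host(s), then `growth` times larger each round."""
    if first < 1 or growth < 1:
        raise ValueError("first and growth must be at least 1")
    size, start = first, 0
    while start < len(hosts):
        yield hosts[start:start + size]
        start += size
        size *= growth


def drop_alias_lines(text: str, batch: List[str]) -> str:
    """Return host_aliases text without the lines that mention any host in `batch`."""
    targets = set(batch)
    kept = [line for line in text.splitlines() if not targets & set(line.split())]
    return "\n".join(kept) + ("\n" if kept else "")


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    hosts = [f"tools-exec-{n:02d}.eqiad.wmflabs" for n in range(1, 8)]
    for number, batch in enumerate(plan_batches(hosts), start=1):
        print(f"round {number}: {len(batch)} host(s)")
        # 1. disable the hosts in `batch` and drain their jobs
        # 2. write drop_alias_lines(current_aliases, batch) back to host_aliases
        # 3. restart the grid master service
        # 4. add the hosts under their new names and enable them
```

With seven hosts this plans rounds of 1, 2 and 4 hosts. Keeping the first rounds small means that a failure caused by cached names affects as few hosts as possible.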