A few things I could think of:
- disable as exec host: qmod -d '*@node_name'
- restart continuous jobs (qhost -j -h node_name | sed -e 's/^\s*//' | grep 'continuous' | cut -d ' ' -f 1 | xargs qmod -rj). For webgrid nodes you can just xargs to qdel instead; webservicemonitor will start them back up from service manifests shortly.
- wait for other jobs to drain
- unregister host as SGE exec host: qconf -de $HOSTNAME
- unregister host from host group: qconf -mhgrp @default or qconf -mhgrp @webgrid
- mark as planned down in shinken: http://shinken.wmflabs.org/host/$HOSTNAME (note: not all hosts are in there?!)
- check for running non-SGE processes: ps hax -o user | sort | uniq -c | sort -n
- delete host: https://wikitech.wikimedia.org/wiki/Special:NovaInstance
- clear graphite metrics (needs access to graphite server, see http://geek.michaelgrace.org/2011/09/delete-data-from-graphite/ )
- remove host from /data/project/.system/store
- remove external hostname/IP from Special:NovaAddress
- (in some cases) remove rDNS registration in ops/dns
- check if documentation still refers to this host and update
There should be no firewall rules to update, but it doesn't hurt to check.
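
For the "restart continuous jobs" step, it may help to see what would be restarted before piping anything into qmod. A rough sketch, assuming the qhost -j output has the job ID as the first field of each job line (as the pipeline in the list relies on). The function names continuous_job_ids and preview_restarts and the NODE variable are made up for illustration and are not part of SGE:

```shell
#!/bin/sh
# Sketch only: the same filter as in the checklist, wrapped in functions so the
# job list can be inspected before anything is actually restarted.

# Read `qhost -j` output on stdin and print the first field (the job ID)
# of every line that mentions 'continuous'.
continuous_job_ids() {
    sed -e 's/^[[:space:]]*//' | grep 'continuous' | cut -d ' ' -f 1
}

# Hypothetical dry-run helper: print the qmod commands instead of running them.
preview_restarts() {
    continuous_job_ids | while read -r job_id; do
        echo "qmod -rj $job_id"
    done
}

# Usage on a grid host (NODE is an assumed variable holding the node name):
#   qhost -j -h "$NODE" | preview_restarts
```

Once the preview looks right, the original one-liner with xargs qmod -rj (or xargs qdel on webgrid nodes) does the actual work.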