The HDFS and YARN masters (analytics100[12]) need to be replaced as part of the regular hardware refresh cycle. The new hardware has already been ordered and is currently being racked (T201939).
As with most delicate/risky Hadoop procedures, there is not much upstream documentation about the safest way to proceed, apart from the occasional brave user reporting their experience:
https://stackoverflow.com/questions/40216709/moving-hadoop-master-node-in-another-box-how-to-handle-hdfs?rq=1
After a chat with Andrew and Joseph, we reached the same conclusion as the user in the above thread, namely that it is far safer and less error-prone to completely shut down the cluster for this maintenance.
The (high-level) idea is the following (a command-level sketch follows the list):
- Stop all regular Analytics processing jobs, alerting users well in advance about the maintenance.
- Stop Hive, Hue, etc.
- Enter HDFS safe mode (reads only, no writes allowed).
- Shut down the cluster.
- Replace the master node hostnames in Puppet and make sure the updated configuration reaches every node.
- Copy the HDFS state (the NameNode metadata) from the current masters to the new ones (an rsync is sufficient).
- Start the cluster.
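A minimal sketch of the safe mode / shutdown / copy / restart steps. The systemd unit names (hadoop-hdfs-namenode, hadoop-yarn-resourcemanager), the metadata directory (/var/lib/hadoop/name, i.e. whatever dfs.namenode.name.dir points to) and the new master hostname (an-master1001) are placeholders, not confirmed values for our cluster:

```
# On the current active master (analytics1001): enter safe mode and
# force a checkpoint so the on-disk fsimage is fully up to date.
sudo -u hdfs hdfs dfsadmin -safemode enter
sudo -u hdfs hdfs dfsadmin -saveNamespace

# Stop the master daemons (unit names assume CDH-style packaging).
sudo systemctl stop hadoop-yarn-resourcemanager
sudo systemctl stop hadoop-hdfs-namenode

# (The Puppet change swapping the master hostnames gets merged and
# rolled out to all nodes at this point.)

# Copy the NameNode metadata to the new master; the path is an
# assumption, use the actual dfs.namenode.name.dir value.
sudo rsync -av /var/lib/hadoop/name/ an-master1001:/var/lib/hadoop/name/

# On the new master: start the daemons, then leave safe mode once
# everything looks healthy.
sudo systemctl start hadoop-hdfs-namenode
sudo systemctl start hadoop-yarn-resourcemanager
sudo -u hdfs hdfs dfsadmin -safemode leave
```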
This of course needs to be tested in labs, but the procedure seems sound from a quick review.
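For the labs test runs (and after the production restart), a quick health check along these lines should confirm that the new masters picked up the state correctly; these are standard HDFS/YARN CLI commands, offered as a sketch rather than a full checklist:

```
# Confirm the NameNode is out of safe mode and sees all DataNodes.
sudo -u hdfs hdfs dfsadmin -safemode get
sudo -u hdfs hdfs dfsadmin -report | head -n 20

# Check filesystem integrity (this can be slow on a big cluster;
# consider restricting it to a subtree first).
sudo -u hdfs hdfs fsck / | tail -n 20

# Verify that the ResourceManager sees the NodeManagers.
yarn node -list
```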
Useful doc to reference: https://etherpad.wikimedia.org/p/analytics-hadoop-java8 (the last Java upgrade, which involved shutting down the cluster).
The tentative date for the production switch (if testing goes well) is Sept 25th (to be announced/scheduled). We expect it to require only a couple of hours of Hadoop cluster downtime.