This ticket will track the replacement of the two current Hadoop masters, an-master100[1-2], with the new servers, an-master100[3-4].
There is an outline of a procedure here, but it will need careful validation:
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Migrating_to_new_HA_NameNodes
Procedure
This will need a full cluster restart, I believe.
The documented procedure looks sound to me.
The rough outline of the process is:
- Bring up an-master1003 in its correct role, but with an hdfs-site.xml that only tells it about an-master1001 and an-master1003
- Bring up an-master1004 in its correct role, but with an hdfs-site.xml that only tells it about an-master1001 and an-master1004
- Bootstrap the namenodes, so that they are receiving updates about edits from the journalnodes and are thus populated.
- Prepare the patches to replace an-master1001 with an-master1003 and an-master1002 with an-master1004 globally.
- Announce downtime for the cluster and wait patiently.
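As a rough illustration of the "only tell it about two namenodes" idea in the outline above, here is a minimal Python sketch that builds the HA-related hdfs-site.xml properties for a temporary pair. The property names (`dfs.nameservices`, `dfs.ha.namenodes.<nameservice>`, `dfs.namenode.rpc-address.<nameservice>.<id>`) are standard HDFS HA settings and 8020 is the default NameNode RPC port. The nameservice name `analytics-hadoop`, the NameNode IDs, and the `example.org` hostnames are placeholders, not values taken from our Puppet configuration, and the helper itself is hypothetical.

```python
def ha_namenode_properties(nameservice, namenodes, rpc_port=8020):
    """Build HA-related hdfs-site.xml properties.

    namenodes maps a NameNode ID to its hostname. All IDs and hostnames
    passed in below are placeholders, not the real cluster values.
    """
    props = {
        "dfs.nameservices": nameservice,
        # Comma-separated list of NameNode IDs in this nameservice.
        f"dfs.ha.namenodes.{nameservice}": ",".join(namenodes),
    }
    for nn_id, host in namenodes.items():
        key = f"dfs.namenode.rpc-address.{nameservice}.{nn_id}"
        props[key] = f"{host}:{rpc_port}"
    return props


# Temporary view for an-master1003: it only knows about 1001 and itself.
step1 = ha_namenode_properties(
    "analytics-hadoop",  # assumed nameservice name
    {
        "an-master1001": "an-master1001.example.org",  # placeholder FQDN
        "an-master1003": "an-master1003.example.org",  # placeholder FQDN
    },
)

for name, value in sorted(step1.items()):
    print(f"{name} = {value}")
```

The same helper, called with an-master1001 and an-master1004, would give the temporary view for the second new host. In practice these values are rendered by Puppet rather than by hand.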
At the appointed time:
- Stop gobblin, which will pause ingestion from Kafka: https://gerrit.wikimedia.org/r/c/operations/puppet/+/990605
- Disable systemd jobs, e.g. refine: https://gerrit.wikimedia.org/r/c/operations/puppet/+/990629
- Pause airflow tasks
- Put the cluster into safe mode
- Shut down the cluster
- Apply the patches to update the namenode location everywhere: https://gerrit.wikimedia.org/r/c/operations/puppet/+/990600
- Start the journalnodes first
- Start the namenode services
- Start the rest of the Hadoop services.
- Re-enable gobblin, which will restart ingestion from Kafka
- Re-enable airflow tasks
- Re-enable systemd jobs, e.g. refine
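The ordering of the steps above matters, so it may help to write them down as data and print a numbered runbook. The sketch below does only that; it does not execute anything. `hdfs dfsadmin -safemode enter` is the standard HDFS command for entering safe mode. The systemd unit names `hadoop-hdfs-journalnode` and `hadoop-hdfs-namenode` are an assumption based on the usual Hadoop packaging and should be checked against the hosts. Steps without a command are left as manual or Puppet-driven, since the exact commands are not stated in this ticket.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    description: str
    # None means a manual or Puppet-driven step with no command given here.
    command: Optional[List[str]] = None


PLAN = [
    Step("Stop gobblin (puppet change 990605)"),
    Step("Disable systemd jobs, e.g. refine (puppet change 990629)"),
    Step("Pause airflow tasks"),
    Step("Put the cluster into safe mode",
         ["hdfs", "dfsadmin", "-safemode", "enter"]),
    Step("Shut down the cluster"),
    Step("Apply the namenode location patches (puppet change 990600)"),
    # Unit names below are assumed, not confirmed from the hosts.
    Step("Start the journalnodes",
         ["systemctl", "start", "hadoop-hdfs-journalnode"]),
    Step("Start the namenode services",
         ["systemctl", "start", "hadoop-hdfs-namenode"]),
    Step("Start the rest of the Hadoop services"),
    Step("Re-enable gobblin"),
    Step("Re-enable airflow tasks"),
    Step("Re-enable systemd jobs, e.g. refine"),
]


def render(plan):
    """Return one numbered line per step, in execution order."""
    lines = []
    for number, step in enumerate(plan, start=1):
        action = " ".join(step.command) if step.command else "(manual step)"
        lines.append(f"{number:2d}. {step.description}: {action}")
    return lines


for line in render(PLAN):
    print(line)
```

Keeping the plan as a list makes it easy to review the order in one place, for example that the journalnodes come up before the namenodes.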
Acceptance Criteria
- an-master1003 is running in the role analytics_cluster::hadoop::master
- an-master1004 is running in the role analytics_cluster::hadoop::standby
- an-master1001 is ready for decommissioning
- an-master1002 is ready for decommissioning
- Wikitech documentation has been updated