Decommission old Hadoop worker nodes and add newer ones
Open, Normal, Public

Description

In T207192, an-worker1078 through an-worker1095 were configured as Hadoop worker nodes (datanode partitions set up). We need to remove analytics1028 through analytics1042 from service and add the newer nodes.

Notes:

A safe procedure to move journal nodes could be the following:

  • Put HDFS in safe mode (no writes accepted)
  • Stop and mask the two journal node daemons (on analytics1028 and analytics1035)
  • Merge a puppet config change that moves the journal config to two other nodes
  • Copy the journal node partition to the two new nodes
  • Roll restart all the journal nodes
  • Take HDFS out of safe mode
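
The procedure above can be sketched as a dry-run plan that only prints the steps in order, so the sequence can be reviewed before anything is run. This is a sketch, not the commands actually used: the systemd unit name, the journal directory path and the second target hostname are assumptions, and only the `hdfs dfsadmin -safemode` subcommands are standard Hadoop CLI.

```python
from typing import List

# Assumptions (not from the task): the systemd unit name and the journal
# directory path below are placeholders and must be checked on the hosts.
JOURNAL_UNIT = "hadoop-hdfs-journalnode"   # assumed unit name
JOURNAL_DIR = "/var/lib/hadoop/journal"    # hypothetical path


def journalnode_move_plan(old_hosts: List[str], new_hosts: List[str]) -> List[str]:
    """Return the ordered steps as text; nothing is executed."""
    if len(old_hosts) != len(new_hosts):
        raise ValueError("each old journal node needs exactly one replacement")
    # Enter safe mode first, then confirm it is on.
    steps = ["hdfs dfsadmin -safemode enter", "hdfs dfsadmin -safemode get"]
    for host in old_hosts:
        steps.append(f"[{host}] systemctl stop {JOURNAL_UNIT}")
        steps.append(f"[{host}] systemctl mask {JOURNAL_UNIT}")
    steps.append("merge the puppet change that points the journal config at the new hosts")
    for old, new in zip(old_hosts, new_hosts):
        steps.append(f"[{old}] rsync -a {JOURNAL_DIR}/ {new}:{JOURNAL_DIR}/")
    steps.append("roll-restart every journal node, one at a time")
    steps.append("hdfs dfsadmin -safemode leave")
    return steps


if __name__ == "__main__":
    for step in journalnode_move_plan(
        ["analytics1028", "analytics1035"],
        ["an-workerXXXX", "an-workerYYYY"],  # placeholders, targets not yet chosen
    ):
        print(step)
```

Keeping the plan as data rather than running commands directly makes it easy to paste into the task for review.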
elukey created this task. Tue, Nov 20, 10:54 AM
elukey triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. Tue, Nov 20, 10:54 AM

Change 474904 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add the Hadoop worker nodes' racking awareness config

https://gerrit.wikimedia.org/r/474904

elukey moved this task from Backlog to In Progress on the User-Elukey board. Wed, Dec 5, 2:10 PM

Change 474904 merged by Elukey:
[operations/puppet@production] Add the Hadoop worker nodes' racking awareness config

https://gerrit.wikimedia.org/r/474904

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Wed, Dec 5, 2:14 PM

Change 477779 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assign analytics_cluster::hadoop::worker to an-worker*

https://gerrit.wikimedia.org/r/477779

elukey added a comment. Wed, Dec 5, 2:25 PM

The plan is:

  1. Add the new nodes to the rack awareness config and site.pp, then let the cluster rebalance for a few days.
  2. In the meantime, move the two journal nodes from analytics1028 and analytics1035 to an-workerXXXX hosts.
  3. Remove the old nodes from HDFS/Yarn in batches, then remove all their puppet configs.
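
As an illustration of step 3, the old host range can be expanded and split into batches with a few lines of Python. The batch size of 5 is an arbitrary assumption for the example; the task does not say how large each batch should be.

```python
from typing import Iterator, List


def host_range(prefix: str, first: int, last: int) -> List[str]:
    """Expand e.g. ("analytics", 1028, 1042) into individual hostnames."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]


def batches(hosts: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


if __name__ == "__main__":
    old_nodes = host_range("analytics", 1028, 1042)
    # Batch size 5 is an arbitrary illustration, not a value from the task.
    for i, group in enumerate(batches(old_nodes, 5), start=1):
        print(f"batch {i}: {', '.join(group)}")
```

Smaller batches leave HDFS more time to re-replicate blocks between removals, at the cost of a longer overall decommission.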

Mentioned in SAL (#wikimedia-operations) [2018-12-05T14:54:30Z] <elukey> restart HDFS namenode and Yarn resource manager on an-master100[1,2] to update rack topology config - T209929
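
For context on the rack topology config: Hadoop can resolve racks by invoking a script (set through `net.topology.script.file.name`) with hostnames or IPs as arguments and reading one rack path per argument from its output. The NameNode caches these lookups, which is presumably why the restart was needed. Below is a minimal sketch of such a script. The rack names are hypothetical, and this is not the script deployed by puppet.

```python
import sys
from typing import Dict, List

DEFAULT_RACK = "/default-rack"

# Hypothetical mapping; the real host-to-rack assignments live in puppet.
RACKS: Dict[str, str] = {
    "an-worker1078": "/example-row-a/rack-1",
    "an-worker1079": "/example-row-a/rack-2",
}


def resolve(hosts: List[str], racks: Dict[str, str] = RACKS) -> List[str]:
    """Return one rack path per input, falling back to the default rack."""
    return [racks.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes hostnames or IPs as arguments and reads rack paths from stdout.
    print(" ".join(resolve(sys.argv[1:])))
```

Falling back to a default rack for unknown hosts keeps the lookup from failing when a node is added before its mapping is deployed.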

Change 477779 merged by Elukey:
[operations/puppet@production] Assign analytics_cluster::hadoop::worker to an-worker*

https://gerrit.wikimedia.org/r/477779

Change 478623 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add two new HDFS journalnodes to the Analytics Hadoop cluster

https://gerrit.wikimedia.org/r/478623