
Decommission old Hadoop worker nodes and add newer ones
Closed, Resolved · Public · 13 Estimated Story Points

Description

In T207192, an-worker1078->95 were configured as Hadoop worker nodes (datanode partitions configured). We need to remove analytics1028->1041 from service and add the newer ones.

Notes:

A safe procedure to move journal nodes could be the following:

  • Put HDFS in safe mode (no writes accepted).
  • Stop and mask the two journal node daemons (on analytics1028 and analytics1035).
  • Merge a puppet config change that moves the journal config to two other nodes, copy the journal node partition to them, and roll restart all the journal nodes.
  • Take HDFS out of safe mode.
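
The ordering of that procedure can be sketched as a dry-run plan. This is only an illustration, not the procedure that was actually run: nothing is executed, the systemd unit name `hadoop-hdfs-journalnode` is an assumption about how the daemon is packaged, and the target hostnames are placeholders because the notes do not name them. The `hdfs dfsadmin -safemode enter` and `-safemode leave` commands are the standard HDFS ones.

```python
# Dry-run plan for moving HDFS journal nodes; nothing is executed here.
# Assumption: the journal node systemd unit is called "hadoop-hdfs-journalnode".
JOURNAL_UNIT = "hadoop-hdfs-journalnode"


def journalnode_move_plan(old_hosts, new_hosts):
    """Return the ordered (where, action) steps of the procedure."""
    if len(old_hosts) != len(new_hosts):
        raise ValueError("each old journal node needs exactly one replacement")

    # 1. Block writes before touching any journal node.
    steps = [("master", "hdfs dfsadmin -safemode enter")]

    # 2. Stop and mask the daemons on the hosts being retired.
    for host in old_hosts:
        steps.append((host, f"systemctl stop {JOURNAL_UNIT}"))
        steps.append((host, f"systemctl mask {JOURNAL_UNIT}"))

    # 3. Move the config, copy the data, restart every journal node.
    steps.append(("puppet", "merge the change that moves the journal node config"))
    for old, new in zip(old_hosts, new_hosts):
        steps.append((new, f"copy the journal node partition from {old}"))
    steps.append(("all journal nodes", f"roll restart {JOURNAL_UNIT}"))

    # 4. Accept writes again.
    steps.append(("master", "hdfs dfsadmin -safemode leave"))
    return steps


plan = journalnode_move_plan(
    ["analytics1028", "analytics1035"],
    ["new-journal-a", "new-journal-b"],  # placeholders, not real hostnames
)
for where, action in plan:
    print(f"[{where}] {action}")
```

Keeping the safe mode steps first and last means the edit log cannot change while the journal node partitions are being copied.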

Hosts decommed from HDFS/Yarn:

  • analytics1028
  • analytics1029
  • analytics1030
  • analytics1031
  • analytics1032
  • analytics1033
  • analytics1034
  • analytics1035
  • analytics1036
  • analytics1037
  • analytics1038
  • analytics1039
  • analytics1040
  • analytics1041

Details

Repo               Branch      Lines +/-
operations/puppet  production  +1 -0
operations/puppet  production  +0 -60
operations/puppet  production  +11 -2
operations/puppet  production  +0 -14
operations/puppet  production  +12 -0
operations/puppet  production  +8 -0
operations/puppet  production  +8 -0
operations/puppet  production  +8 -0
operations/puppet  production  +8 -0
operations/puppet  production  +8 -0
operations/puppet  production  +7 -0
operations/puppet  production  +14 -4
operations/puppet  production  +2 -2
operations/puppet  production  +0 -2
operations/puppet  production  +2 -0
operations/puppet  production  +1 -1
operations/puppet  production  +22 -5

Event Timeline

elukey triaged this task as Medium priority. (Nov 20 2018, 10:54 AM)
elukey created this task.

Change 474904 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add the Hadoop worker nodes' racking awareness config

https://gerrit.wikimedia.org/r/474904

Change 474904 merged by Elukey:
[operations/puppet@production] Add the Hadoop worker nodes' racking awareness config

https://gerrit.wikimedia.org/r/474904

Change 477779 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assign analytics_cluster::hadoop::worker to an-worker*

https://gerrit.wikimedia.org/r/477779

The plan is:

  1. Add the new nodes to the rack awareness config and site.pp, and let the cluster rebalance for a few days.
  2. In the meantime, move the two journal nodes on analytics1028/35 to an-workerXXXX.
  3. Remove the old nodes from HDFS/Yarn in batches, then remove all their puppet configs.
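
Step 3 can be sketched as follows. This is a minimal illustration and not the puppet code used for this task: the batch size of two is an assumption, and whether the exclude file lists short names or FQDNs depends on how the cluster identifies its nodes. The file format itself (one host per line) is what HDFS expects for the file named by `dfs.hosts.exclude`.

```python
def expand_hosts(prefix, first, last):
    """Expand a range such as analytics1028->1041 into hostnames."""
    return [f"{prefix}{n}" for n in range(first, last + 1)]


def decom_batches(hosts, batch_size=2):
    """Yield the cumulative exclude list after each batch is added.

    The list is cumulative because already-decommissioned hosts stay
    excluded until they are removed from the cluster config entirely.
    """
    excluded = []
    for start in range(0, len(hosts), batch_size):
        excluded.extend(hosts[start:start + batch_size])
        yield list(excluded)


def render_exclude_file(hosts):
    """Render the exclude file body: one hostname per line."""
    return "".join(f"{host}\n" for host in hosts)


old_workers = expand_hosts("analytics", 1028, 1041)
batches = list(decom_batches(old_workers, batch_size=2))  # batch size is an assumption

# Exclude file contents for the first batch.
print(render_exclude_file(batches[0]), end="")
```

After each change to the exclude file, `hdfs dfsadmin -refreshNodes` and `yarn rmadmin -refreshNodes` make the NameNode and ResourceManager re-read it. Waiting for HDFS to finish re-replicating the blocks of one batch before excluding the next keeps the number of under-replicated blocks bounded.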

Mentioned in SAL (#wikimedia-operations) [2018-12-05T14:54:30Z] <elukey> restart HDFS namenode and Yarn resource manager on an-master100[1,2] to update rack topology config - T209929

Change 477779 merged by Elukey:
[operations/puppet@production] Assign analytics_cluster::hadoop::worker to an-worker*

https://gerrit.wikimedia.org/r/477779

Change 478623 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add two new HDFS journalnodes to the Analytics Hadoop cluster

https://gerrit.wikimedia.org/r/478623

Change 478623 merged by Elukey:
[operations/puppet@production] Add two new HDFS journalnodes to the Analytics Hadoop cluster

https://gerrit.wikimedia.org/r/478623

Mentioned in SAL (#wikimedia-operations) [2018-12-20T14:39:57Z] <elukey> add two journal nodes to the Analytics Hadoop cluster - T209929

Change 480965 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove two journal nodes from the Analytics Hadoop config

https://gerrit.wikimedia.org/r/480965

Change 480965 merged by Elukey:
[operations/puppet@production] Remove two journal nodes from the Analytics Hadoop config

https://gerrit.wikimedia.org/r/480965

Mentioned in SAL (#wikimedia-operations) [2018-12-20T16:31:02Z] <elukey> remove two journal nodes from the Analytics hadoop cluster - T209929

Change 481011 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: bump hdfs heap to 12G

https://gerrit.wikimedia.org/r/481011

Change 481011 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: bump hdfs heap to 12G

https://gerrit.wikimedia.org/r/481011

Mentioned in SAL (#wikimedia-operations) [2018-12-20T18:30:02Z] <elukey> remove hdfs journalnode config+packages from analytics10(28|35) - not used anymore - T209929

Change 481882 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop:master/standby: support a custom hosts.exclude file

https://gerrit.wikimedia.org/r/481882

Change 481882 merged by Elukey:
[operations/puppet@production] profile::hadoop:master/standby: support a custom hosts.exclude file

https://gerrit.wikimedia.org/r/481882

Change 481885 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop:master/standby: remove an1028

https://gerrit.wikimedia.org/r/481885

Change 481885 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop:master/standby: remove an1028

https://gerrit.wikimedia.org/r/481885

Change 481899 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Exclude two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/481899

Change 481899 merged by Elukey:
[operations/puppet@production] Exclude two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/481899

Change 482016 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/482016

Change 482016 merged by Elukey:
[operations/puppet@production] Remove two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/482016

Change 482401 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decomission two Hadoop worker nodes from the Analtytics cluster

https://gerrit.wikimedia.org/r/482401

Change 482401 merged by Elukey:
[operations/puppet@production] Decomission two Hadoop worker nodes from the Analtytics cluster

https://gerrit.wikimedia.org/r/482401

Change 482489 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission an1034/5 from Hadoop Analytics

https://gerrit.wikimedia.org/r/482489

Change 482489 merged by Elukey:
[operations/puppet@production] Decommission an1034/5 from Hadoop Analytics

https://gerrit.wikimedia.org/r/482489

Change 482582 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission analytics103[6,7] from Analytics Hadoop

https://gerrit.wikimedia.org/r/482582

Change 482582 merged by Elukey:
[operations/puppet@production] Decommission analytics103[6,7] from Analytics Hadoop

https://gerrit.wikimedia.org/r/482582

Change 482649 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission analytics10[39-41] from Hadoop Analytics

https://gerrit.wikimedia.org/r/482649

Change 482649 merged by Elukey:
[operations/puppet@production] Decommission analytics10[39-41] from Hadoop Analytics

https://gerrit.wikimedia.org/r/482649

elukey updated the task description.

Change 482767 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove decommed nodes from Analytics Hadoop's net topology

https://gerrit.wikimedia.org/r/482767

Change 483136 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assign role::spare::system to analytics1028->41

https://gerrit.wikimedia.org/r/483136

Change 482767 merged by Elukey:
[operations/puppet@production] Remove decommed nodes from Analytics Hadoop's net topology

https://gerrit.wikimedia.org/r/482767

Change 483136 merged by Elukey:
[operations/puppet@production] Assign role::spare::system to analytics1028->41

https://gerrit.wikimedia.org/r/483136

Change 483141 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove decomissioned nodes from Analytics Hadoop

https://gerrit.wikimedia.org/r/483141

Change 483141 merged by Elukey:
[operations/puppet@production] Remove decomissioned nodes from Analytics Hadoop

https://gerrit.wikimedia.org/r/483141

Nodes completely removed:

  • removed them from the network topology and restarted the namenodes
  • assigned role::spare::system and removed the datanode/nodemanager packages from the hosts, so that a node cannot accidentally rejoin the cluster
  • cleaned up the hosts.exclude list

And finally updated https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Decommissioning with some notes.

As mentioned before, these nodes will become a new testing cluster; more info in T212256.

elukey set the point value for this task to 13. (Jan 9 2019, 3:32 PM)
elukey moved this task from In Progress to Done on the Analytics-Kanban board.

Change 540149 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add missing an-worker1088 to hadoop net_topology

https://gerrit.wikimedia.org/r/540149

Change 540149 merged by Ottomata:
[operations/puppet@production] Add missing an-worker1088 to hadoop net_topology

https://gerrit.wikimedia.org/r/540149