
Decommission old Hadoop worker nodes and add newer ones
Closed, Resolved · Public · 13 Story Points

Description

In T207192, an-worker1078 through an-worker1095 were configured as Hadoop worker nodes (datanode partitions configured). We need to remove analytics1028 through analytics1041 from service and add the newer nodes.

Notes:

A safe procedure to move journal nodes could be the following:

  • Put HDFS in safe mode (no writes accepted)
  • Stop and mask the two journal node daemons (on analytics1028 and analytics1035)
  • Merge a puppet config change that moves the journal config to two other nodes, copy the journal node partition to them, and roll-restart all the journal nodes
  • Take HDFS out of safe mode
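
The ordering of the steps above can be sketched as a small dry-run planner. This is only an illustration, not the procedure that was actually run: the systemd unit name `hadoop-hdfs-journalnode`, the use of `ssh`/`sudo`, and the target host names in the usage example are assumptions. The puppet change and the partition copy are left as comments because the task does not say how they were done. The only Hadoop commands used are `hdfs dfsadmin -safemode enter` and `hdfs dfsadmin -safemode leave`.

```python
from typing import List

# Assumption: the journal node daemon runs as this systemd unit.
JOURNALNODE_UNIT = "hadoop-hdfs-journalnode"


def journalnode_move_plan(old_hosts: List[str], new_hosts: List[str]) -> List[str]:
    """Return the ordered steps for moving journal nodes, without running anything."""
    if len(old_hosts) != len(new_hosts):
        raise ValueError("each old journal node needs exactly one replacement")

    # 1. Block writes so the edit log cannot change while journals are moved.
    steps = ["hdfs dfsadmin -safemode enter"]

    # 2. Stop the old daemons, then mask them so nothing restarts them.
    for host in old_hosts:
        steps.append(f"ssh {host} sudo systemctl stop {JOURNALNODE_UNIT}")
        steps.append(f"ssh {host} sudo systemctl mask {JOURNALNODE_UNIT}")

    # 3. Manual steps, kept as comments because the task does not detail them.
    steps.append("# merge the puppet change moving the journal config to: "
                 + ", ".join(new_hosts))
    for old, new in zip(old_hosts, new_hosts):
        steps.append(f"# copy the journal node partition from {old} to {new}")
    steps.append("# roll-restart all journal nodes, one at a time")

    # 4. Accept writes again.
    steps.append("hdfs dfsadmin -safemode leave")
    return steps


if __name__ == "__main__":
    # The target names are hypothetical placeholders; the real hosts are not given here.
    for step in journalnode_move_plan(["analytics1028", "analytics1035"],
                                      ["new-journal-a", "new-journal-b"]):
        print(step)
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere and makes the safe-mode bracket around the whole operation easy to review.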

Hosts decommed from HDFS/Yarn:

  • analytics1028
  • analytics1029
  • analytics1030
  • analytics1031
  • analytics1032
  • analytics1033
  • analytics1034
  • analytics1035
  • analytics1036
  • analytics1037
  • analytics1038
  • analytics1039
  • analytics1040
  • analytics1041

Event Timeline

elukey created this task.Nov 20 2018, 10:54 AM
elukey triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. Nov 20 2018, 10:54 AM

Change 474904 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add the Hadoop worker nodes' racking awareness config

https://gerrit.wikimedia.org/r/474904

elukey moved this task from Backlog to In Progress on the User-Elukey board.Dec 5 2018, 2:10 PM

Change 474904 merged by Elukey:
[operations/puppet@production] Add the Hadoop worker nodes' racking awareness config

https://gerrit.wikimedia.org/r/474904

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.Dec 5 2018, 2:14 PM

Change 477779 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assign analytics_cluster::hadoop::worker to an-worker*

https://gerrit.wikimedia.org/r/477779

elukey added a comment.Dec 5 2018, 2:25 PM

The plan is:

  1. Add the new nodes to the rack awareness config and site.pp, and let the cluster rebalance for a few days.
  2. In the meantime, move the two journal nodes on analytics1028/35 to an-workerXXXX.
  3. Remove the old nodes from HDFS/Yarn in batches, then remove all their puppet configs.
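
A minimal sketch of the batching in step 3, under stated assumptions: the helper names, the batch size of two, and the `example.org` domain are hypothetical, and the exclude-file layout is simply one fully qualified host name per line. After the exclude file is updated, the namenode and resource manager re-read it with `hdfs dfsadmin -refreshNodes` and `yarn rmadmin -refreshNodes`.

```python
from typing import Iterator, List

# Commands that make HDFS and Yarn re-read their include/exclude host lists.
REFRESH_COMMANDS = ["hdfs dfsadmin -refreshNodes", "yarn rmadmin -refreshNodes"]


def expand_hosts(prefix: str, first: int, last: int) -> List[str]:
    """Expand an inclusive numeric range such as analytics1028..analytics1041."""
    return [f"{prefix}{number}" for number in range(first, last + 1)]


def batches(hosts: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def exclude_file(hosts: List[str], domain: str) -> str:
    """Render hosts.exclude content: one fully qualified host name per line."""
    return "".join(f"{host}.{domain}\n" for host in hosts)


if __name__ == "__main__":
    old_workers = expand_hosts("analytics", 1028, 1041)
    # Assumption: two hosts per batch and a placeholder domain.
    for batch in batches(old_workers, 2):
        print(exclude_file(batch, "example.org"), end="")
        for command in REFRESH_COMMANDS:
            print(command)
        print("# wait until the datanodes report as decommissioned before the next batch")
```

Small batches limit how many block replicas HDFS has to re-replicate at once, which is the reason to decommission gradually rather than excluding every old node in one change.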

Mentioned in SAL (#wikimedia-operations) [2018-12-05T14:54:30Z] <elukey> restart HDFS namenode and Yarn resource manager on an-master100[1,2] to update rack topology config - T209929

Change 477779 merged by Elukey:
[operations/puppet@production] Assign analytics_cluster::hadoop::worker to an-worker*

https://gerrit.wikimedia.org/r/477779

Change 478623 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add two new HDFS journalnodes to the Analytics Hadoop cluster

https://gerrit.wikimedia.org/r/478623

Change 478623 merged by Elukey:
[operations/puppet@production] Add two new HDFS journalnodes to the Analytics Hadoop cluster

https://gerrit.wikimedia.org/r/478623

Mentioned in SAL (#wikimedia-operations) [2018-12-20T14:39:57Z] <elukey> add two journal nodes to the Analytics Hadoop cluster - T209929

Change 480965 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove two journal nodes from the Analytics Hadoop config

https://gerrit.wikimedia.org/r/480965

Change 480965 merged by Elukey:
[operations/puppet@production] Remove two journal nodes from the Analytics Hadoop config

https://gerrit.wikimedia.org/r/480965

Mentioned in SAL (#wikimedia-operations) [2018-12-20T16:31:02Z] <elukey> remove two journal nodes from the Analytics hadoop cluster - T209929

Change 481011 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: bump hdfs heap to 12G

https://gerrit.wikimedia.org/r/481011

Change 481011 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: bump hdfs heap to 12G

https://gerrit.wikimedia.org/r/481011

Mentioned in SAL (#wikimedia-operations) [2018-12-20T18:30:02Z] <elukey> remove hdfs journalnode config+packages from analytics10(28|35) - not used anymore - T209929

Change 481882 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop:master/standby: support a custom hosts.exclude file

https://gerrit.wikimedia.org/r/481882

Change 481882 merged by Elukey:
[operations/puppet@production] profile::hadoop:master/standby: support a custom hosts.exclude file

https://gerrit.wikimedia.org/r/481882

Change 481885 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop:master/standby: remove an1028

https://gerrit.wikimedia.org/r/481885

Change 481885 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop:master/standby: remove an1028

https://gerrit.wikimedia.org/r/481885

Change 481899 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Exclude two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/481899

Change 481899 merged by Elukey:
[operations/puppet@production] Exclude two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/481899

Change 482016 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/482016

Change 482016 merged by Elukey:
[operations/puppet@production] Remove two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/482016

elukey updated the task description. Jan 4 2019, 9:18 AM

Change 482401 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decomission two Hadoop worker nodes from the Analtytics cluster

https://gerrit.wikimedia.org/r/482401

Change 482401 merged by Elukey:
[operations/puppet@production] Decomission two Hadoop worker nodes from the Analtytics cluster

https://gerrit.wikimedia.org/r/482401

elukey updated the task description. Jan 5 2019, 7:44 AM

Change 482489 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission an1034/5 from Hadoop Analytics

https://gerrit.wikimedia.org/r/482489

Change 482489 merged by Elukey:
[operations/puppet@production] Decommission an1034/5 from Hadoop Analytics

https://gerrit.wikimedia.org/r/482489

elukey updated the task description. Jan 6 2019, 8:00 AM

Change 482582 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission analytics103[6,7] from Analytics Hadoop

https://gerrit.wikimedia.org/r/482582

Change 482582 merged by Elukey:
[operations/puppet@production] Decommission analytics103[6,7] from Analytics Hadoop

https://gerrit.wikimedia.org/r/482582

elukey updated the task description. Jan 7 2019, 7:45 AM

Change 482649 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission analytics10[39-41] from Hadoop Analytics

https://gerrit.wikimedia.org/r/482649

Change 482649 merged by Elukey:
[operations/puppet@production] Decommission analytics10[39-41] from Hadoop Analytics

https://gerrit.wikimedia.org/r/482649

elukey updated the task description. Jan 8 2019, 7:31 AM
elukey updated the task description.

Change 482767 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove decommed nodes from Analytics Hadoop's net topology

https://gerrit.wikimedia.org/r/482767

Change 483136 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assign role::spare::system to analytics1028->41

https://gerrit.wikimedia.org/r/483136

Change 482767 merged by Elukey:
[operations/puppet@production] Remove decommed nodes from Analytics Hadoop's net topology

https://gerrit.wikimedia.org/r/482767

Change 483136 merged by Elukey:
[operations/puppet@production] Assign role::spare::system to analytics1028->41

https://gerrit.wikimedia.org/r/483136

Change 483141 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove decomissioned nodes from Analytics Hadoop

https://gerrit.wikimedia.org/r/483141

Change 483141 merged by Elukey:
[operations/puppet@production] Remove decomissioned nodes from Analytics Hadoop

https://gerrit.wikimedia.org/r/483141

elukey added a comment.Jan 9 2019, 3:31 PM

Nodes completely removed:

  • removed from the network topology, then restarted the namenodes
  • assigned role::spare::system and removed the datanode/nodemanager packages from the hosts, so that a node cannot accidentally rejoin the cluster
  • cleaned up the hosts.exclude list

Finally, I updated https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Decommissioning with some notes.

elukey added a comment.Jan 9 2019, 3:32 PM

As mentioned before, these nodes will become a new testing cluster; more info in T212256.

elukey set the point value for this task to 13.Jan 9 2019, 3:32 PM
elukey moved this task from In Progress to Done on the Analytics-Kanban board.
elukey moved this task from In Progress to Done on the User-Elukey board.Jan 10 2019, 9:05 AM
Nuria closed this task as Resolved.Jan 14 2019, 5:54 PM