⚓ T209929 Decommission old Hadoop worker nodes and add newer ones

Subject	Repo	Branch	Lines +/-
Add missing an-worker1088 to hadoop net_topology	operations/puppet	production	+1 -0
Remove decomissioned nodes from Analytics Hadoop	operations/puppet	production	+0 -60
Assign role::spare::system to analytics1028->41	operations/puppet	production	+11 -2
Remove decommed nodes from Analytics Hadoop's net topology	operations/puppet	production	+0 -14
Decommission analytics10[39-41] from Hadoop Analytics	operations/puppet	production	+12 -0
Decommission analytics103[6,7] from Analytics Hadoop	operations/puppet	production	+8 -0
Decommission an1034/5 from Hadoop Analytics	operations/puppet	production	+8 -0
Decomission two Hadoop worker nodes from the Analtytics cluster	operations/puppet	production	+8 -0
Remove two Analytics Hadoop worker nodes for decom	operations/puppet	production	+8 -0
Exclude two Analytics Hadoop worker nodes for decom	operations/puppet	production	+8 -0
role::analytics_cluster::hadoop:master/standby: remove an1028	operations/puppet	production	+7 -0
profile::hadoop:master/standby: support a custom hosts.exclude file	operations/puppet	production	+14 -4
role::analytics_cluster::hadoop::master|standby: bump hdfs heap to 12G	operations/puppet	production	+2 -2
Remove two journal nodes from the Analytics Hadoop config	operations/puppet	production	+0 -2
Add two new HDFS journalnodes to the Analytics Hadoop cluster	operations/puppet	production	+2 -0
Assign analytics_cluster::hadoop::worker to an-worker*	operations/puppet	production	+1 -1
Add the Hadoop worker nodes' racking awareness config	operations/puppet	production	+22 -5

elukey triaged this task as Medium priority. Nov 20 2018, 10:54 AM

elukey created this task.

Restricted Application added a subscriber: Aklapper. Nov 20 2018, 10:54 AM

elukey added a project: User-Elukey. Nov 20 2018, 12:59 PM

Change 474904 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add the Hadoop worker nodes' racking awareness config

https://gerrit.wikimedia.org/r/474904

gerritbot added a project: Patch-For-Review. Nov 20 2018, 1:00 PM

elukey moved this task from Backlog to In Progress on the User-Elukey board. Dec 5 2018, 2:10 PM

Change 474904 merged by Elukey:
[operations/puppet@production] Add the Hadoop worker nodes' racking awareness config

https://gerrit.wikimedia.org/r/474904
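
For context on what a rack awareness config feeds: Hadoop can resolve racks through a script named by `net.topology.script.file.name`, which receives host names or IPs as arguments and prints one rack path per argument. Below is a minimal sketch of such a script, not the one from the patch. The host names and rack paths are hypothetical, and falling back to `/default-rack` for unknown hosts is an assumption of this sketch.

```python
import sys
from typing import Dict, List

# Rack used when a host is missing from the mapping.
DEFAULT_RACK = "/default-rack"

# Hypothetical host -> rack mapping; the real one is managed in puppet.
RACKS: Dict[str, str] = {
    "an-worker-a.example.org": "/example/row-a/rack-1",
    "an-worker-b.example.org": "/example/row-b/rack-2",
}


def resolve(hosts: List[str], racks: Dict[str, str] = RACKS) -> List[str]:
    """Return one rack path per input host, in the same order."""
    return [racks.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes the hosts as arguments and reads the racks from stdout.
    print(" ".join(resolve(sys.argv[1:])))
```

A host that is missing from the mapping silently lands in the default rack here, which is why new workers need to be added to the topology before they join the cluster.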

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Dec 5 2018, 2:14 PM

Change 477779 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assign analytics_cluster::hadoop::worker to an-worker*

https://gerrit.wikimedia.org/r/477779

The plan is:

Add the new nodes to the rack awareness config and site.pp, then let the cluster rebalance for a few days.
In the meantime, move the two journal nodes from analytics1028/35 to an-workerXXXX.
Remove the old nodes from HDFS/Yarn in batches, then remove all their puppet configs.
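
The batching in the last step can be sketched as follows. This is only an illustration: the host range analytics1028 to analytics1041 comes from this task, while the fixed batch size of two and the helper names are assumptions (the patches listed above used batches of varying size).

```python
from typing import Iterator, List


def old_workers(first: int = 1028, last: int = 1041) -> List[str]:
    """Host names analytics1028..analytics1041, the range named in this task."""
    return [f"analytics{n}" for n in range(first, last + 1)]


def batches(hosts: List[str], size: int = 2) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


if __name__ == "__main__":
    for step, group in enumerate(batches(old_workers()), start=1):
        print(f"batch {step}: {', '.join(group)}")
```

Small batches keep the amount of block re-replication triggered at any one time low, at the cost of more patches.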

Mentioned in SAL (#wikimedia-operations) [2018-12-05T14:54:30Z] <elukey> restart HDFS namenode and Yarn resource manager on an-master100[1,2] to update rack topology config - T209929

Change 477779 merged by Elukey:
[operations/puppet@production] Assign analytics_cluster::hadoop::worker to an-worker*

https://gerrit.wikimedia.org/r/477779

Change 478623 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add two new HDFS journalnodes to the Analytics Hadoop cluster

https://gerrit.wikimedia.org/r/478623

fdans moved this task from Incoming to Operational Excellence on the Analytics board. Dec 10 2018, 5:08 PM

Change 478623 merged by Elukey:
[operations/puppet@production] Add two new HDFS journalnodes to the Analytics Hadoop cluster

https://gerrit.wikimedia.org/r/478623

Mentioned in SAL (#wikimedia-operations) [2018-12-20T14:39:57Z] <elukey> add two journal nodes to the Analytics Hadoop cluster - T209929

Change 480965 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove two journal nodes from the Analytics Hadoop config

https://gerrit.wikimedia.org/r/480965

Change 480965 merged by Elukey:
[operations/puppet@production] Remove two journal nodes from the Analytics Hadoop config

https://gerrit.wikimedia.org/r/480965

Mentioned in SAL (#wikimedia-operations) [2018-12-20T16:31:02Z] <elukey> remove two journal nodes from the Analytics hadoop cluster - T209929
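
The add-then-remove order used for the journal nodes can be illustrated with a small sketch. With HDFS's Quorum Journal Manager, edits must be acknowledged by a majority of JournalNodes, and the set of nodes is given as a `qjournal://` URI in `dfs.namenode.shared.edits.dir`. The node names, the starting count of three, and the journal id below are hypothetical; 8485 is the default JournalNode RPC port.

```python
from typing import List


def qjournal_uri(nodes: List[str], journal_id: str, port: int = 8485) -> str:
    """Build a qjournal:// URI listing every JournalNode as host:port."""
    if not nodes:
        raise ValueError("at least one JournalNode is required")
    hosts = ";".join(f"{node}:{port}" for node in nodes)
    return f"qjournal://{hosts}/{journal_id}"


def majority(count: int) -> int:
    """Smallest number of JournalNodes that forms a majority of `count`."""
    return count // 2 + 1


if __name__ == "__main__":
    # Hypothetical starting set: two nodes to retire and one to keep.
    nodes = ["jn-old-1", "jn-old-2", "jn-keep"]
    nodes += ["jn-new-1", "jn-new-2"]  # step 1: add the two new nodes
    print(len(nodes), "nodes, majority is", majority(len(nodes)))
    nodes = [n for n in nodes if not n.startswith("jn-old-")]  # step 2: remove the old ones
    print(qjournal_uri(nodes, "example-cluster"))
```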

Change 481011 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: bump hdfs heap to 12G

https://gerrit.wikimedia.org/r/481011

Change 481011 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop::master|standby: bump hdfs heap to 12G

https://gerrit.wikimedia.org/r/481011

Mentioned in SAL (#wikimedia-operations) [2018-12-20T18:30:02Z] <elukey> remove hdfs journalnode config+packages from analytics10(28|35) - not used anymore - T209929

The last step is to follow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Decommissioning and decommission analytics1028->1041 in batches.
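
Before moving on to the next batch, it helps to confirm that every datanode in the current one has finished decommissioning. The sketch below parses text shaped like the per-datanode sections of `hdfs dfsadmin -report`, which include a `Hostname:` line and a `Decommission Status :` line. The sample report is hand-written and abbreviated, with hypothetical hosts and addresses; it is not output captured from this cluster, and the linked wikitech procedure remains the reference.

```python
from typing import Dict, Iterable

# Hand-written, abbreviated sample; hosts and addresses are hypothetical.
SAMPLE_REPORT = """\
Name: 10.0.0.1:50010 (worker-a.example.org)
Hostname: worker-a.example.org
Decommission Status : Decommission in progress

Name: 10.0.0.2:50010 (worker-b.example.org)
Hostname: worker-b.example.org
Decommission Status : Decommissioned
"""


def decommission_status(report: str) -> Dict[str, str]:
    """Map each Hostname in the report to its decommission status."""
    statuses: Dict[str, str] = {}
    host = None
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without a key
        key, value = key.strip(), value.strip()
        if key == "Hostname":
            host = value
        elif key == "Decommission Status" and host is not None:
            statuses[host] = value
            host = None
    return statuses


def all_decommissioned(report: str, hosts: Iterable[str]) -> bool:
    """True only if every listed host is reported as Decommissioned."""
    statuses = decommission_status(report)
    return all(statuses.get(h, "").lower() == "decommissioned" for h in hosts)
```

For example, `all_decommissioned(SAMPLE_REPORT, ["worker-a.example.org"])` is false because that host is still in progress.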

Change 481882 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop:master/standby: support a custom hosts.exclude file

https://gerrit.wikimedia.org/r/481882

Change 481882 merged by Elukey:
[operations/puppet@production] profile::hadoop:master/standby: support a custom hosts.exclude file

https://gerrit.wikimedia.org/r/481882
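
As background on the exclude file: HDFS reads the path set in `dfs.hosts.exclude` and YARN the one set in `yarn.resourcemanager.nodes.exclude-path`, both listing one host per line, and the lists are reloaded with `hdfs dfsadmin -refreshNodes` and `yarn rmadmin -refreshNodes`. The sketch below shows one way to render such a file from a host list. It does not reproduce how the puppet profile does it; the function name and the sorting and de-duplication are assumptions.

```python
from typing import Iterable


def render_exclude(hosts: Iterable[str]) -> str:
    """Render a hosts.exclude body: unique, sorted, one host per line."""
    unique = sorted({host.strip() for host in hosts if host.strip()})
    return "".join(f"{host}\n" for host in unique)


if __name__ == "__main__":
    # Hypothetical FQDNs for the batch being decommissioned.
    print(render_exclude(["worker-b.example.org", "worker-a.example.org"]), end="")
```

Sorting keeps the rendered file stable between runs, so a config management tool only reports a change when the set of hosts changes.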

Change 481885 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::hadoop:master/standby: remove an1028

https://gerrit.wikimedia.org/r/481885

Change 481885 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::hadoop:master/standby: remove an1028

https://gerrit.wikimedia.org/r/481885

Change 481899 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Exclude two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/481899

Change 481899 merged by Elukey:
[operations/puppet@production] Exclude two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/481899

Change 482016 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/482016

Change 482016 merged by Elukey:
[operations/puppet@production] Remove two Analytics Hadoop worker nodes for decom

https://gerrit.wikimedia.org/r/482016

elukey updated the task description. Jan 4 2019, 9:18 AM

Change 482401 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decomission two Hadoop worker nodes from the Analtytics cluster

https://gerrit.wikimedia.org/r/482401

Change 482401 merged by Elukey:
[operations/puppet@production] Decomission two Hadoop worker nodes from the Analtytics cluster

https://gerrit.wikimedia.org/r/482401

elukey updated the task description. Jan 5 2019, 7:44 AM

Change 482489 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission an1034/5 from Hadoop Analytics

https://gerrit.wikimedia.org/r/482489

Change 482489 merged by Elukey:
[operations/puppet@production] Decommission an1034/5 from Hadoop Analytics

https://gerrit.wikimedia.org/r/482489

elukey updated the task description. Jan 6 2019, 8:00 AM

Change 482582 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission analytics103[6,7] from Analytics Hadoop

https://gerrit.wikimedia.org/r/482582

Change 482582 merged by Elukey:
[operations/puppet@production] Decommission analytics103[6,7] from Analytics Hadoop

https://gerrit.wikimedia.org/r/482582

elukey updated the task description. Jan 7 2019, 7:45 AM

Change 482649 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission analytics10[39-41] from Hadoop Analytics

https://gerrit.wikimedia.org/r/482649

Change 482649 merged by Elukey:
[operations/puppet@production] Decommission analytics10[39-41] from Hadoop Analytics

https://gerrit.wikimedia.org/r/482649

elukey updated the task description. Jan 8 2019, 7:31 AM

elukey updated the task description.

Change 482767 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove decommed nodes from Analytics Hadoop's net topology

https://gerrit.wikimedia.org/r/482767

Change 483136 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Assign role::spare::system to analytics1028->41

https://gerrit.wikimedia.org/r/483136

Change 482767 merged by Elukey:
[operations/puppet@production] Remove decommed nodes from Analytics Hadoop's net topology

https://gerrit.wikimedia.org/r/482767

Change 483136 merged by Elukey:
[operations/puppet@production] Assign role::spare::system to analytics1028->41

https://gerrit.wikimedia.org/r/483136

Change 483141 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove decomissioned nodes from Analytics Hadoop

https://gerrit.wikimedia.org/r/483141

Change 483141 merged by Elukey:
[operations/puppet@production] Remove decomissioned nodes from Analytics Hadoop

https://gerrit.wikimedia.org/r/483141

Nodes completely removed:

removed from the network topology and restarted namenodes
assigned role::spare::system and removed the datanode/nodemanager packages from the hosts, to prevent a node from coming back alive.
cleaned up the hosts.exclude list

And finally updated https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Decommissioning with some notes.
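
The cleanup steps listed above can be double-checked with a small script. This is a sketch under stated assumptions: the function name is hypothetical, and the inputs are taken to be plain collections already loaded from the topology config and the hosts.exclude file.

```python
from typing import Dict, Iterable, List


def leftovers(
    decommed: Iterable[str],
    topology: Dict[str, str],
    excluded: Iterable[str],
) -> Dict[str, List[str]]:
    """List decommissioned hosts still present in each config."""
    gone = set(decommed)
    return {
        "net_topology": sorted(gone & set(topology)),
        "hosts.exclude": sorted(gone & set(excluded)),
    }


if __name__ == "__main__":
    # Hypothetical data: one old host was left behind in the exclude list.
    result = leftovers(
        decommed=["old-1", "old-2"],
        topology={"new-1": "/example/rack-1"},
        excluded=["old-2"],
    )
    print(result)
```

An empty list under both keys means that no decommissioned host is still referenced in either place.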

As mentioned before, these nodes will become a new testing cluster; more info in T212256.

elukey set the point value for this task to 13. Jan 9 2019, 3:32 PM

elukey moved this task from In Progress to Done on the Analytics-Kanban board.

elukey moved this task from In Progress to Done on the User-Elukey board. Jan 10 2019, 9:05 AM

Nuria closed this task as Resolved. Jan 14 2019, 5:54 PM

Change 540149 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add missing an-worker1088 to hadoop net_topology

https://gerrit.wikimedia.org/r/540149

Change 540149 merged by Ottomata:
[operations/puppet@production] Add missing an-worker1088 to hadoop net_topology

https://gerrit.wikimedia.org/r/540149

Maintenance_bot removed a project: Patch-For-Review. Oct 2 2019, 7:10 PM

Decommission old Hadoop worker nodes and add newer ones
Closed, Resolved · Public · 13 Estimated Story Points