
Decommission analytics10[58-69]
Closed, Resolved · Public · 3 Estimated Story Points

Authored By: BTullis, Sep 15 2022, 10:28 AM

Description

We have now completed T311210: Add an-worker11[42-48] to the Hadoop cluster, so the hosts that they replace may now be decommissioned from the cluster.

These are: analytics10[58-69]

They should be decommissioned according to the procedure outlined here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Decommissioning
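
At a high level, the flow (as it plays out later in this task) is roughly the following; this is only a sketch, and the wikitech page above is the authoritative reference:

# 1. Exclude the hosts from HDFS and YARN via puppet (hosts.exclude), then refresh
#    the NameNode's view of its datanodes:
sudo -u hdfs hdfs dfsadmin -refreshNodes
# 2. Wait for the under-replicated blocks count to drain and for the hosts to show
#    as Decommissioned in the NameNode status UI.
# 3. Remove the hosts from the HDFS net_topology in puppet and roll-restart the
#    masters (sre.hadoop.roll-restart-masters cookbook).
# 4. Physically decommission each host with the sre.hosts.decommission cookbook.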

Event Timeline


We're now watching the under-replicated blocks value decrease slowly as the data is copied to other hosts.

image.png (68 KB)

Interestingly, the capacity value hasn't decreased; it's still at 42.97 TB. That means the data isn't being deleted from the host as it's being decommissioned; it's merely left in place and becomes redundant once it has been copied away.
Once the under-replicated blocks value reaches zero, the host will be in a decommissioned state and we can proceed.
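
For reference, the same progress can also be checked from the command line on one of the masters; a minimal sketch, assuming the standard dfsadmin report flags are available on this Hadoop version:

# Cluster-wide summary, including the under-replicated blocks counter:
sudo -u hdfs hdfs dfsadmin -report | grep -i 'under replicated'

# Only the datanodes currently in the "Decommission In Progress" state:
sudo -u hdfs hdfs dfsadmin -report -decommissioning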

We can also check the value on this graph.

image.png (49 KB)

https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=41

The overall impact on the cluster of migrating data from one host seems fine. I think what I would do, @Stevemunene, is to start excluding more of the remaining hosts on this ticket as well, to speed up the process.
Perhaps you could start by excluding another 4 or 5 hosts and check the impact on cluster I/O; if it's still OK, then you could add all 11 simultaneously.

An alternative could be to group the nodes in the same rack and decom them in little batches:

analytics1058.eqiad.wmnet:  /eqiad/A/1   -> already started

analytics1059.eqiad.wmnet:  /eqiad/A/3  -> second batch
analytics1060.eqiad.wmnet:  /eqiad/A/3  -> second batch

analytics1061.eqiad.wmnet:  /eqiad/B/8  -> third batch
analytics1062.eqiad.wmnet:  /eqiad/B/8  -> etc..
analytics1063.eqiad.wmnet:  /eqiad/B/8

analytics1064.eqiad.wmnet:  /eqiad/C/3
analytics1065.eqiad.wmnet:  /eqiad/C/3
analytics1066.eqiad.wmnet:  /eqiad/C/3

analytics1067.eqiad.wmnet:  /eqiad/D/2
analytics1068.eqiad.wmnet:  /eqiad/D/2
analytics1076.eqiad.wmnet:  /eqiad/D/2

analytics1069.eqiad.wmnet:  /eqiad/D/8

The idea is to avoid decommissioning nodes in different racks at the same time, because we may impact the replication of some blocks. The worst-case scenario is decommissioning 3 nodes in different racks that hold the 3 replicas of certain blocks, ending up with some HDFS errors (no blocks available, etc.).
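
To double-check how the replicas of a given path are spread across racks before picking a batch, something like the following could be used; just a sketch, with an illustrative path:

# Print files, blocks and the racks holding each replica for a sample path:
sudo -u hdfs hdfs fsck /wmf/data -files -blocks -racks | less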


Thanks @elukey - I like the idea for a bit of extra safety, even though I suspect that the scenario you've described would probably not happen.
It seems to me that the datanodes can still serve data when they're in a decommissioning state, so it's like having an additional replica of these blocks until the node enters the decommissioned state.
Still, that's only my hypothesis, so your idea seems good in terms of additional safety and we're not really in a hurry, so 👍 from me.

yes yes good point, it should be safe, but I'd be cautious on the batch size just to be sure (HDFS is battle tested but we had some horror stories in the past :D)

Change 928465 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] analytics: add analytics19[59-60] to excluded_hosts

https://gerrit.wikimedia.org/r/928465

Change 928466 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] analytics: remove analytics10[58-60] from net_topology

https://gerrit.wikimedia.org/r/928466

Change 928466 abandoned by Gehel:

[operations/puppet@production] analytics: remove analytics10[58-60] from net_topology

Reason:

this was just for testing

https://gerrit.wikimedia.org/r/928466

Change 928465 abandoned by Gehel:

[operations/puppet@production] analytics: add analytics19[59-60] to excluded_hosts

Reason:

this was just for testing

https://gerrit.wikimedia.org/r/928465

Mentioned in SAL (#wikimedia-analytics) [2023-06-14T13:15:30Z] <stevemunene> running the puppet on an-master100[1-2] Remove analytics58_60 from the HDFS topology T317861

Change 930580 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Decommission analytics106[1-3] from hadoop cluster

https://gerrit.wikimedia.org/r/930580

Change 930581 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Remove analytics106[1-3] from the HDFS topology

https://gerrit.wikimedia.org/r/930581

Change 930582 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Decommission analytics106[4-6] from hadoop cluster

https://gerrit.wikimedia.org/r/930582

Change 930583 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Remove analytics106[4-6] from the HDFS topology

https://gerrit.wikimedia.org/r/930583

Change 930584 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Decommission analytics106[7-8] from hadoop cluster

https://gerrit.wikimedia.org/r/930584

Change 930585 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Remove analytics106[7-8] from the HDFS topology

https://gerrit.wikimedia.org/r/930585

Change 930606 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Decommission analytics1069 from hadoop cluster

https://gerrit.wikimedia.org/r/930606

Change 930607 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Remove analytics1069 from the HDFS topology

https://gerrit.wikimedia.org/r/930607

Mentioned in SAL (#wikimedia-analytics) [2023-06-15T12:47:20Z] <stevemunene> roll running sre.hadoop.roll-restart-masters to completely remove any reference of analytics1058-1060 for T317861

Done decommissioning the hosts in the first batch in /eqiad/A. Next is to delete the Kerberos principals and keytabs for the decommissioned hosts and to remove any reference to them in puppet, as I move on to the next batch in /eqiad/B.
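
For the Kerberos part, a minimal sketch of what the cleanup could look like with plain MIT Kerberos tooling (the principal names are assumptions for illustration; the actual procedure is on the wiki page linked in the description):

# On the KDC/kadmin host, delete the per-host principals, e.g. for analytics1058:
sudo kadmin.local -q 'delete_principal -force hdfs/analytics1058.eqiad.wmnet'
sudo kadmin.local -q 'delete_principal -force HTTP/analytics1058.eqiad.wmnet'
# Then remove the corresponding keytab files from the keytab store.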

Change 930580 merged by Stevemunene:

[operations/puppet@production] analytics: Decommission analytics106[1-3] from hadoop cluster

https://gerrit.wikimedia.org/r/930580

Change 930581 merged by Stevemunene:

[operations/puppet@production] analytics: Remove analytics106[1-3] from the HDFS topology

https://gerrit.wikimedia.org/r/930581

Mentioned in SAL (#wikimedia-analytics) [2023-06-22T14:02:44Z] <stevemunene> running sre.hadoop.roll-restart-masters restart the Namenodes to completely remove any reference of analytics106[1-3] T317861

The /eqiad/B/ hosts to be decommissioned have been successfully excluded from HDFS and YARN and removed from the HDFS topology; moving on to decommissioning them with the sre.hosts.decommission cookbook.
Also adding the /eqiad/C/ hosts to be excluded from HDFS and YARN.
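
For reference, the physical decommission step is driven from a cumin host; a sketch of the cookbook invocation (the exact flags may differ):

sudo cookbook sre.hosts.decommission -t T317861 'analytics[1061-1063].eqiad.wmnet'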

@Stevemunene I still see the following from the hdfs topology:

Rack: /eqiad/default/rack
   10.64.21.113:50010 (analytics1061.eqiad.wmnet)
   10.64.21.114:50010 (analytics1062.eqiad.wmnet)
   10.64.21.115:50010 (analytics1063.eqiad.wmnet)

Are they fully decommed?
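
(For reference, the topology listing above can be dumped with the dfsadmin topology command; a sketch, assuming it is run as the hdfs user on a master:)

sudo -u hdfs hdfs dfsadmin -printTopology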


The change to remove them from the hdfs net_topology (930581) was merged and the services were restarted with the sre.hadoop.roll-restart-masters cookbook. It probably requires another restart. Running sre.hosts.decommission on the hosts today.

Change 930582 merged by Stevemunene:

[operations/puppet@production] analytics: Decommission analytics106[4-6] from hadoop cluster

https://gerrit.wikimedia.org/r/930582

During the decommissioning of analytics106[1-3], we noticed that even after excluding the hosts from YARN and HDFS and then moving on to the next step of removing them from the HDFS topology and restarting, the hosts were still part of the cluster, as shown below.
HDFS_NameNode_Status_Interface

image.png (120×900 px, 41 KB)

And also present and available for yarn jobs

image.png (128×900 px, 56 KB)

This sparked a discussion on the possible cause, with @elukey attributing it to the fact that HDFS only takes short names, whereas we had only provided the FQDNs, and not the short names, in the hosts.exclude file. To test this, we decided to manually add the short names alongside the FQDNs on an-master1001.
To do this, first disable puppet on an-master1001, then add the hosts to the file /etc/hadoop/conf.analytics-hadoop/hosts.exclude as below:

stevemunene@an-master1001:~$ cat /etc/hadoop/conf.analytics-hadoop/hosts.exclude
analytics1064.eqiad.wmnet
analytics1065.eqiad.wmnet
analytics1066.eqiad.wmnet
analytics1061.eqiad.wmnet
analytics1061
analytics1062.eqiad.wmnet
analytics1062
analytics1063.eqiad.wmnet
analytics1063

Lastly, refresh the HDFS nodes list with:

stevemunene@an-master1001:~$ sudo -u hdfs hdfs dfsadmin -refreshNodes
Refresh nodes successful for an-master1001.eqiad.wmnet/10.64.5.26:8020
Refresh nodes successful for an-master1002.eqiad.wmnet/10.64.21.110:8020

The three hosts analytics106[1-3] reappeared on the HDFS NameNode status interface as decommissioning:

image.png (83 KB)

The HDFS under-replicated blocks graph also shows a spike after this:

image.png (105 KB)

Change 933386 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Exclude analytics1061_1069 from HDFS and YARN

https://gerrit.wikimedia.org/r/933386

Change 933387 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Remove analytics1064_1069 from hdfs net_topology

https://gerrit.wikimedia.org/r/933387


After further investigation, the rejoining of analytics106[1-3] described above was determined to have been caused by the sequence of events below:

  • analytics10[61-63] were indeed in a decommissioned state on the HDFS NameNode status interface by the 19th of June 2023, when they were removed from the hdfs net_topology.
  • The hosts were, however, not physically decommissioned using the sre.hosts.decommission cookbook and thus remained online.
  • When [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/930582 | analytics: Decommission analytics106[4-6] from hadoop cluster ]] was merged, the hosts analytics10[61-63] were removed from the hosts.exclude file and, since they were still online, they rejoined the cluster, which is what led to the situation above (a verification sketch for this follows the next-steps list below).

Next steps:

  • Excluding all the remaining hosts analytics1061_1069 from HDFS and YARN
  • Removing them from the hdfs net_topology
  • Decommissioning the nodes
  • Removing the nodes from the master and standby exclude files
  • Removing all the hosts from site.pp and setting their role to spare::system
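
As mentioned above, a simple guard against the same rejoin happening again is to confirm that the Hadoop daemons are actually stopped before a host is dropped from hosts.exclude; a sketch, assuming the systemd unit names used on these workers:

# Expect ActiveState=inactive for both units on every host before removing them
# from hosts.exclude:
sudo cumin 'analytics[1061-1069]*' 'systemctl show -p ActiveState hadoop-hdfs-datanode hadoop-yarn-nodemanager'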

Change 930583 abandoned by Stevemunene:

[operations/puppet@production] analytics: Remove analytics106[4-6] from the HDFS topology

Reason:

We opted to decommission the remaining hosts at a go as opposed to the original idea of doing it in batches

https://gerrit.wikimedia.org/r/930583

Change 930584 abandoned by Stevemunene:

[operations/puppet@production] analytics: Decommission analytics106[7-8] from hadoop cluster

Reason:

We opted to decommission the remaining hosts at a go as opposed to the original idea of doing it in batches

https://gerrit.wikimedia.org/r/930584

Change 930585 abandoned by Stevemunene:

[operations/puppet@production] analytics: Remove analytics106[7-8] from the HDFS topology

Reason:

We opted to decommission the remaining hosts at a go as opposed to the original idea of doing it in batches

https://gerrit.wikimedia.org/r/930585

Change 930606 abandoned by Stevemunene:

[operations/puppet@production] analytics: Decommission analytics1069 from hadoop cluster

Reason:

We opted to decommission the remaining hosts at a go as opposed to the original idea of doing it in batches

https://gerrit.wikimedia.org/r/930606

Change 930607 abandoned by Stevemunene:

[operations/puppet@production] analytics: Remove analytics1069 from the HDFS topology

Reason:

We opted to decommission the remaining hosts at a go as opposed to the original idea of doing it in batches

https://gerrit.wikimedia.org/r/930607

Change 933386 merged by Stevemunene:

[operations/puppet@production] analytics: Exclude analytics1061_1069 from HDFS and YARN

https://gerrit.wikimedia.org/r/933386

Change 933432 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the hadoop-worker-canary cumin alias

https://gerrit.wikimedia.org/r/933432

Icinga downtime and Alertmanager silence (ID=0453fd24-8db4-4ba7-9753-ae2833e9b5fb) set by btullis@cumin1001 for 7 days, 0:00:00 on 8 host(s) and their services with reason: Decommissioning

analytics[1061-1068].eqiad.wmnet

Change 933432 merged by Btullis:

[operations/puppet@production] Update the hadoop-worker-canary cumin alias

https://gerrit.wikimedia.org/r/933432

What's the current status of analytics1069? It's not present anymore in PuppetDB but it's still Active in Netbox, hence it's reported as an error in the physical hosts Netbox report.


analytics1069 was removed as a journal node and is currently in a decommissioned state as per the HDFS NameNode web interface. Puppet has been disabled on it for a while now; however, we are moving on with the host's decommissioning.

@Stevemunene Puppet should never be disabled for more than a couple of days, as documented in https://wikitech.wikimedia.org/wiki/Puppet#Maintenance

Ack, thanks @Volans. Are there any extra steps to take to remedy this before beginning the decommission?

While on the one hand the decommission cookbook can perfectly well be run on an unreachable/down host, it will skip some steps if it can't SSH into the host. As the host is now gone from PuppetDB, it is also gone from the SSH known hosts, and cumin will fail to execute commands there.
If it's safe for you to re-enable puppet on the host and let puppet run, both on the host and after that on the cumin hosts, then the decommission cookbook will run all the steps.
Be aware that re-enabling puppet also means that monitoring will be re-created on the Icinga/Alertmanager side, and hence if the host is not in an optimal state it will create alerts/pages accordingly.
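
A sketch of what that could look like for analytics1069 (enable-puppet and run-puppet-agent are assumed here to mirror the disable-puppet wrapper used elsewhere in this task, with the same reason string):

# On analytics1069: re-enable puppet with the same reason used to disable it, then run it once:
sudo enable-puppet 'Preparing to decommission the hosts - T317861 - stevemunene'
sudo run-puppet-agent
# Afterwards, let puppet run on the cumin hosts so the ssh known_hosts entries are regenerated.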


Thanks. It would be safer to run the decommission cookbook, since we are going to disable puppet on the hosts before the decommission as per https://gerrit.wikimedia.org/r/c/operations/puppet/+/933387/comments/71353c2b_33db9f28. I'll keep a keen eye on the host.

analytics106[1-9] are in a decommissioned state on the HDFS NameNode interface, so we are ready to begin the decommissioning.

image.png (429 KB)

The steps are as below:

  1. disable puppet on the nodes
  2. stop the datanodes with systemctl stop hadoop-hdfs-datanode
  3. merge 933387
  4. verify the topology (those nodes should go away in theory)
  5. decommission the nodes


What I meant is to re-enable puppet just for the time of one run, to get the host back into PuppetDB and hence allow the decommission cookbook, when run later, to connect to the host. Puppet can totally be disabled at the time of running the decommission cookbook.
That is, of course, if it's OK for you to temporarily re-enable it without creating issues (also be aware of the potential alerts mentioned above).

Icinga downtime and Alertmanager silence (ID=2de210bb-f6e5-4b71-81d4-c9d978f2bed5) set by stevemunene@cumin1001 for 7 days, 0:00:00 on 9 host(s) and their services with reason: Stopping puppet and hadoop-hdfs-datanode services then decommissioning the hosts

analytics[1061-1069].eqiad.wmnet

Mentioned in SAL (#wikimedia-analytics) [2023-07-06T07:11:52Z] <stevemunene> disable-puppet on analytics[1061-1069] Preparing to decommission the hosts - T317861

Mentioned in SAL (#wikimedia-analytics) [2023-07-06T07:17:59Z] <stevemunene> stop hadoop-hdfs-datanode service on analytics[1061-1069] Preparing to decommission the hosts - T317861

Change 933387 merged by Stevemunene:

[operations/puppet@production] analytics: Remove analytics1064_1069 from hdfs net_topology

https://gerrit.wikimedia.org/r/933387

Re-enabled puppet on analytics1069 to get it back into puppetdb and allow cumin commands to run properly. Then, as per the steps discussed above:

  1. disabled puppet on the nodes
stevemunene@cumin1001:~$ sudo cumin 'analytics[1061-1069]*' "disable-puppet 'Preparing to decommission the hosts - T317861 - ${USER}'"
9 hosts will be targeted:
analytics[1061-1069].eqiad.wmnet
OK to proceed on 9 hosts? Enter the number of affected hosts to confirm or "q" to quit: 9
===== NO OUTPUT =====                                                                                                                                              
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (9/9) [00:42<00:00,  4.70s/hosts]
FAIL |                                                                                                                             |   0% (0/9) [00:42<?, ?hosts/s]
100.0% (9/9) success ratio (>= 100.0% threshold) for command: 'disable-puppet '...1 - stevemunene''.
100.0% (9/9) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

2. stopped the datanodes

stevemunene@cumin1001:~$ sudo cumin 'analytics[1061-1069]*' "systemctl stop hadoop-hdfs-datanode"
9 hosts will be targeted:
analytics[1061-1069].eqiad.wmnet
OK to proceed on 9 hosts? Enter the number of affected hosts to confirm or "q" to quit: 9
===== NO OUTPUT =====                                                                                                                                              
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (9/9) [00:05<00:00,  1.57hosts/s]
FAIL |                                                                                                                             |   0% (0/9) [00:05<?, ?hosts/s]
100.0% (9/9) success ratio (>= 100.0% threshold) for command: 'systemctl stop hadoop-hdfs-datanode'.
100.0% (9/9) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

3. Merged 933387 to remove analytics1064_1069 from the hdfs net_topology

4. Verified the topology (those nodes should go away in theory). From the HDFS NameNode interface, the nodes are marked as decommissioned and dead:

image.png (28 KB)

image.png (336 KB)

Moving on to decommissioning the nodes

Change 936051 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: remove puppet references for analytics[1058-1069]

https://gerrit.wikimedia.org/r/936051

Change 936051 merged by Stevemunene:

[operations/puppet@production] analytics: remove puppet references for analytics[1058-1069]

https://gerrit.wikimedia.org/r/936051

Mentioned in SAL (#wikimedia-analytics) [2023-07-07T09:28:56Z] <stevemunene> running sre.hadoop.roll-restart-masters restart the maters to completely remove any reference of analytics[1058-1069] T317861

Gehel triaged this task as Medium priority. Jul 21 2023, 12:50 PM
Gehel moved this task from Needs Reporting to Done on the Data-Platform-SRE board.