
Decommission analytics10[58-69]
Closed, Resolved · Public · 3 Estimated Story Points

Authored By: BTullis, Sep 15 2022, 10:28 AM

Description

We have now completed T311210: Add an-worker11[42-48] to the Hadoop cluster, so the hosts that they replace may now be decommissioned from the cluster.

These are: analytics10[58-69]

They should be decommissioned according to the procedure outlined here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Decommissioning
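
At a high level, the flow (as it plays out later in this task) is roughly the following; this is only a sketch, and the wikitech page above is the authoritative reference:

# 1. Exclude the hosts from HDFS and YARN via puppet (hosts.exclude), then refresh
#    the NameNode's view of its datanodes:
sudo -u hdfs hdfs dfsadmin -refreshNodes
# 2. Wait for the under-replicated blocks count to drain and for the hosts to show
#    as Decommissioned in the NameNode status UI.
# 3. Remove the hosts from the HDFS net_topology in puppet and roll-restart the
#    masters (sre.hadoop.roll-restart-masters cookbook).
# 4. Physically decommission each host with the sre.hosts.decommission cookbook.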

Event Timeline


We're now watching the under-replicated blocks value decrease slowly as the data is copied to other hosts.

image.png (68 KB)

Interestingly, the capacity value hasn't decreased; it's still at 42.97 TB. That means the data isn't being deleted from the host as it's being decommissioned; it's merely left in place and becomes redundant once it has been copied away.
Once the under-replicated blocks value reaches zero, the host will be in a decommissioned state and we can proceed.
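
For reference, the same progress can also be checked from the command line on one of the masters; a minimal sketch, assuming the standard dfsadmin report flags are available on this Hadoop version:

# Cluster-wide summary, including the under-replicated blocks counter:
sudo -u hdfs hdfs dfsadmin -report | grep -i 'under replicated'

# Only the datanodes currently in the "Decommission In Progress" state:
sudo -u hdfs hdfs dfsadmin -report -decommissioning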

We can also check the value on this graph.

image.png (49 KB)

https://grafana.wikimedia.org/d/000000585/hadoop?orgId=1&viewPanel=41

The overall impact on the cluster of migrating data from one host seems fine. I think what I would do, @Stevemunene, is to start excluding more of the remaining hosts on this ticket as well, to speed up the process.
Perhaps you could start by excluding another 4 or 5 hosts and check the impact on cluster I/O; if it's still OK, then you could add all 11 simultaneously.

An alternative could be to group the nodes in the same rack and decom them in little batches:

analytics1058.eqiad.wmnet:  /eqiad/A/1   -> already started

analytics1059.eqiad.wmnet:  /eqiad/A/3  -> second batch
analytics1060.eqiad.wmnet:  /eqiad/A/3  -> second batch

analytics1061.eqiad.wmnet:  /eqiad/B/8  -> third batch
analytics1062.eqiad.wmnet:  /eqiad/B/8  -> etc..
analytics1063.eqiad.wmnet:  /eqiad/B/8

analytics1064.eqiad.wmnet:  /eqiad/C/3
analytics1065.eqiad.wmnet:  /eqiad/C/3
analytics1066.eqiad.wmnet:  /eqiad/C/3

analytics1067.eqiad.wmnet:  /eqiad/D/2
analytics1068.eqiad.wmnet:  /eqiad/D/2
analytics1076.eqiad.wmnet:  /eqiad/D/2

analytics1069.eqiad.wmnet:  /eqiad/D/8

The idea is to avoid decommissioning nodes in different racks at the same time, because we may impact the replication of some blocks. The worst-case scenario is decommissioning 3 nodes in different racks that hold the 3 replicas of certain blocks, ending up with some HDFS errors (no blocks available, etc.).
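
To double-check how the replicas of a given path are spread across racks before picking a batch, something like the following could be used; just a sketch, with an illustrative path:

# Print files, blocks and the racks holding each replica for a sample path:
sudo -u hdfs hdfs fsck /wmf/data -files -blocks -racks | less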


Thanks @elukey - I like the idea for a bit of extra safety, even though I suspect that the scenario you've described would probably not happen.
It seems to me that the datanodes can still serve data when they're in a decommissioning state, so it's like having an additional replica of these blocks until the node enters the decommissioned state.
Still, that's only my hypothesis, so your idea seems good in terms of additional safety and we're not really in a hurry, so 👍 from me.

yes yes good point, it should be safe, but I'd be cautious on the batch size just to be sure (HDFS is battle tested but we had some horror stories in the past :D)

Change 928465 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] analytics: add analytics19[59-60] to excluded_hosts

https://gerrit.wikimedia.org/r/928465

Change 928466 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] analytics: remove analytics10[58-60] from net_topology

https://gerrit.wikimedia.org/r/928466

Change 928466 abandoned by Gehel:

[operations/puppet@production] analytics: remove analytics10[58-60] from net_topology

Reason:

this was just for testing

https://gerrit.wikimedia.org/r/928466

Change 928465 abandoned by Gehel:

[operations/puppet@production] analytics: add analytics19[59-60] to excluded_hosts

Reason:

this was just for testing

https://gerrit.wikimedia.org/r/928465

Mentioned in SAL (#wikimedia-analytics) [2023-06-14T13:15:30Z] <stevemunene> running the puppet on an-master100[1-2] Remove analytics58_60 from the HDFS topology T317861

Change 930580 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Decommission analytics106[1-3] from hadoop cluster

https://gerrit.wikimedia.org/r/930580

Change 930581 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Remove analytics106[1-3] from the HDFS topology

https://gerrit.wikimedia.org/r/930581

Change 930582 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Decommission analytics106[4-6] from hadoop cluster

https://gerrit.wikimedia.org/r/930582

Change 930583 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Remove analytics106[4-6] from the HDFS topology

https://gerrit.wikimedia.org/r/930583

Change 930584 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Decommission analytics106[7-8] from hadoop cluster

https://gerrit.wikimedia.org/r/930584

Change 930585 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Remove analytics106[7-8] from the HDFS topology

https://gerrit.wikimedia.org/r/930585

Change 930606 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Decommission analytics1069 from hadoop cluster

https://gerrit.wikimedia.org/r/930606

Change 930607 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Remove analytics1069 from the HDFS topology

https://gerrit.wikimedia.org/r/930607

Mentioned in SAL (#wikimedia-analytics) [2023-06-15T12:47:20Z] <stevemunene> roll running sre.hadoop.roll-restart-masters to completely remove any reference of analytics1058-1060 for T317861

Done decommissioning the hosts in the first batch in /eqiad/A. Next is to delete the Kerberos principals and keytabs for the decommissioned hosts and to remove any reference to them in puppet, as I move on to the next batch in /eqiad/B.
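
For the Kerberos part, a minimal sketch of what the cleanup could look like with plain MIT Kerberos tooling (the principal names are assumptions for illustration; the actual procedure is on the wiki page linked in the description):

# On the KDC/kadmin host, delete the per-host principals, e.g. for analytics1058:
sudo kadmin.local -q 'delete_principal -force hdfs/analytics1058.eqiad.wmnet'
sudo kadmin.local -q 'delete_principal -force HTTP/analytics1058.eqiad.wmnet'
# Then remove the corresponding keytab files from the keytab store.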

Change 930580 merged by Stevemunene:

[operations/puppet@production] analytics: Decommission analytics106[1-3] from hadoop cluster

https://gerrit.wikimedia.org/r/930580

Change 930581 merged by Stevemunene:

[operations/puppet@production] analytics: Remove analytics106[1-3] from the HDFS topology

https://gerrit.wikimedia.org/r/930581

Mentioned in SAL (#wikimedia-analytics) [2023-06-22T14:02:44Z] <stevemunene> running sre.hadoop.roll-restart-masters restart the Namenodes to completely remove any reference of analytics106[1-3] T317861

The /eqiad/B/ hosts to be decommissioned have been successfully excluded from HDFS and YARN and removed from the HDFS topology; moving on to decommissioning them with the sre.hosts.decommission cookbook.
Also adding the /eqiad/C/ hosts to be excluded from HDFS and YARN.
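
For reference, the physical decommission step is driven from a cumin host; a sketch of the cookbook invocation (the exact flags may differ):

sudo cookbook sre.hosts.decommission -t T317861 'analytics[1061-1063].eqiad.wmnet'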

@Stevemunene I still see the following from the hdfs topology:

Rack: /eqiad/default/rack
   10.64.21.113:50010 (analytics1061.eqiad.wmnet)
   10.64.21.114:50010 (analytics1062.eqiad.wmnet)
   10.64.21.115:50010 (analytics1063.eqiad.wmnet)

Are they fully decommed?
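
(For reference, the topology listing above can be dumped with the dfsadmin topology command; a sketch, assuming it is run as the hdfs user on a master:)

sudo -u hdfs hdfs dfsadmin -printTopology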


The change to remove them from the hdfs net_topology (930581) was merged and the services were restarted with the sre.hadoop.roll-restart-masters cookbook. It probably requires another restart. Running sre.hosts.decommission on the hosts today.

Change 930582 merged by Stevemunene:

[operations/puppet@production] analytics: Decommission analytics106[4-6] from hadoop cluster

https://gerrit.wikimedia.org/r/930582

During the decommissioning of analytics106[1-3], we noticed that even after excluding the hosts from YARN and HDFS and then moving on to the next step of removing them from the HDFS topology and restarting, the hosts were still part of the cluster, as shown below.
HDFS_NameNode_Status_Interface

image.png (120×900 px, 41 KB)

And also present and available for yarn jobs

image.png (128×900 px, 56 KB)

This sparked a discussion on the possible cause, with @elukey attributing it to the fact that HDFS only takes short names, whereas we had only provided the FQDNs, and not the short names, in the hosts.exclude file. To test this, we decided to manually add the short names alongside the FQDNs on an-master1001.
To do this, first disable puppet on an-master1001, then add the hosts to the file /etc/hadoop/conf.analytics-hadoop/hosts.exclude as below:

stevemunene@an-master1001:~$ cat /etc/hadoop/conf.analytics-hadoop/hosts.exclude
analytics1064.eqiad.wmnet
analytics1065.eqiad.wmnet
analytics1066.eqiad.wmnet
analytics1061.eqiad.wmnet
analytics1061
analytics1062.eqiad.wmnet
analytics1062
analytics1063.eqiad.wmnet
analytics1063

Lastly, refresh the HDFS nodes list with:

stevemunene@an-master1001:~$ sudo -u hdfs hdfs dfsadmin -refreshNodes
Refresh nodes successful for an-master1001.eqiad.wmnet/10.64.5.26:8020
Refresh nodes successful for an-master1002.eqiad.wmnet/10.64.21.110:8020

The three hosts analytics106[1-3] reappeared on the HDFS NameNode status interface as decommissioning:

image.png (83 KB)

The HDFS under-replicated blocks graph also shows a spike after this:

image.png (105 KB)

Change 933386 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Exclude analytics1061_1069 from HDFS and YARN

https://gerrit.wikimedia.org/r/933386

Change 933387 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: Remove analytics1064_1069 from hdfs net_topology

https://gerrit.wikimedia.org/r/933387


After further investigation, the rejoining of analytics106[1-3] described above was determined to have been caused by the sequence of events below:

  • analytics10[61-63] were indeed in a decommissioned state on the HDFS NameNode status interface by the 19th of June 2023, when they were removed from the hdfs net_topology.
  • The hosts were, however, not physically decommissioned using the sre.hosts.decommission cookbook and thus remained online.
  • When [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/930582 | analytics: Decommission analytics106[4-6] from hadoop cluster ]] was merged, the hosts analytics10[61-63] were removed from the hosts.exclude file and, since they were still online, they rejoined the cluster, which is what led to the situation above (a verification sketch for this follows the next-steps list below).

Next steps:

  • Excluding all the remaining hosts analytics1061_1069 from HDFS and YARN
  • Removing them from the hdfs net_topology
  • Decommissioning the nodes
  • Removing the nodes from the master and standby exclude files
  • Removing all the hosts from site.pp and setting their role to spare::system
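
As mentioned above, a simple guard against the same rejoin happening again is to confirm that the Hadoop daemons are actually stopped before a host is dropped from hosts.exclude; a sketch, assuming the systemd unit names used on these workers:

# Expect ActiveState=inactive for both units on every host before removing them
# from hosts.exclude:
sudo cumin 'analytics[1061-1069]*' 'systemctl show -p ActiveState hadoop-hdfs-datanode hadoop-yarn-nodemanager'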

Change 930583 abandoned by Stevemunene:

[operations/puppet@production] analytics: Remove analytics106[4-6] from the HDFS topology

Reason:

We opted to decommission the remaining hosts at a go as opposed to the original idea of doing it in batches

https://gerrit.wikimedia.org/r/930583

Change 930584 abandoned by Stevemunene:

[operations/puppet@production] analytics: Decommission analytics106[7-8] from hadoop cluster

Reason:

We opted to decommission the remaining hosts at a go as opposed to the original idea of doing it in batches

https://gerrit.wikimedia.org/r/930584

Change 930585 abandoned by Stevemunene:

[operations/puppet@production] analytics: Remove analytics106[7-8] from the HDFS topology

Reason:

We opted to decommission the remaining hosts at a go as opposed to the original idea of doing it in batches

https://gerrit.wikimedia.org/r/930585

Change 930606 abandoned by Stevemunene:

[operations/puppet@production] analytics: Decommission analytics1069 from hadoop cluster

Reason:

We opted to decommission the remaining hosts at a go as opposed to the original idea of doing it in batches

https://gerrit.wikimedia.org/r/930606

Change 930607 abandoned by Stevemunene:

[operations/puppet@production] analytics: Remove analytics1069 from the HDFS topology

Reason:

We opted to decommission the remaining hosts at a go as opposed to the original idea of doing it in batches

https://gerrit.wikimedia.org/r/930607

Change 933386 merged by Stevemunene:

[operations/puppet@production] analytics: Exclude analytics1061_1069 from HDFS and YARN

https://gerrit.wikimedia.org/r/933386

Change 933432 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the hadoop-worker-canary cumin alias

https://gerrit.wikimedia.org/r/933432

Icinga downtime and Alertmanager silence (ID=0453fd24-8db4-4ba7-9753-ae2833e9b5fb) set by btullis@cumin1001 for 7 days, 0:00:00 on 8 host(s) and their services with reason: Decommissioning

analytics[1061-1068].eqiad.wmnet

Change 933432 merged by Btullis:

[operations/puppet@production] Update the hadoop-worker-canary cumin alias

https://gerrit.wikimedia.org/r/933432

What's the current status of analytics1069? It's not present anymore in PuppetDB but it's still Active in Netbox, hence it's reported as an error in the physical hosts Netbox report.


analytics1069 was removed as a journal node and is currently in a decommissioned state as per the HDFS NameNode web interface. Puppet has been disabled on it for a while now; however, we are moving on with the host's decommissioning.

@Stevemunene Puppet should never be disabled for more than a couple of days, as documented in https://wikitech.wikimedia.org/wiki/Puppet#Maintenance

Ack, thanks @Volans. Are there any extra steps to take to remedy this before beginning the decommission?

While on the one hand the decommission cookbook can perfectly well be run on an unreachable/down host, it will skip some steps if it can't SSH into the host. As the host is now gone from PuppetDB, it is also gone from the SSH known hosts, and cumin will fail to execute commands there.
If it's safe for you to re-enable puppet on the host and let puppet run, both on the host and after that on the cumin hosts, then the decommission cookbook will run all the steps.
Be aware that re-enabling puppet also means that monitoring will be re-created on the Icinga/Alertmanager side, and hence if the host is not in an optimal state it will create alerts/pages accordingly.
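
A sketch of what that could look like for analytics1069 (enable-puppet and run-puppet-agent are assumed here to mirror the disable-puppet wrapper used elsewhere in this task, with the same reason string):

# On analytics1069: re-enable puppet with the same reason used to disable it, then run it once:
sudo enable-puppet 'Preparing to decommission the hosts - T317861 - stevemunene'
sudo run-puppet-agent
# Afterwards, let puppet run on the cumin hosts so the ssh known_hosts entries are regenerated.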


Thanks. It would be safer to run the decommission cookbook, since we are going to disable puppet on the hosts before the decommission as per https://gerrit.wikimedia.org/r/c/operations/puppet/+/933387/comments/71353c2b_33db9f28. I'll keep a keen eye on the host.

analytics106[1-9] are in a decommissioned state on the HDFS NameNode interface, so we are ready to begin the decommissioning.

image.png (429 KB)

The steps are as below:

  1. disable puppet on the nodes
  2. stop the datanodes with systemctl stop hadoop-hdfs-datanode
  3. merge 933387
  4. verify the topology (those nodes should go away in theory)
  5. decommission the nodes


What I meant is to re-enable puppet just for the time of one run, to get the host back into PuppetDB and hence allow the decommission cookbook, when run later, to connect to the host. Puppet can totally be disabled at the time of running the decommission cookbook.
That is, of course, if it's OK for you to temporarily re-enable it without creating issues (also be aware of the potential alerts mentioned above).

Icinga downtime and Alertmanager silence (ID=2de210bb-f6e5-4b71-81d4-c9d978f2bed5) set by stevemunene@cumin1001 for 7 days, 0:00:00 on 9 host(s) and their services with reason: Stopping puppet and hadoop-hdfs-datanode services then decommissioning the hosts

analytics[1061-1069].eqiad.wmnet

Mentioned in SAL (#wikimedia-analytics) [2023-07-06T07:11:52Z] <stevemunene> disable-puppet on analytics[1061-1069] Preparing to decommission the hosts - T317861

Mentioned in SAL (#wikimedia-analytics) [2023-07-06T07:17:59Z] <stevemunene> stop hadoop-hdfs-datanode service on analytics[1061-1069] Preparing to decommission the hosts - T317861

Change 933387 merged by Stevemunene:

[operations/puppet@production] analytics: Remove analytics1064_1069 from hdfs net_topology

https://gerrit.wikimedia.org/r/933387

Re-enabled puppet on analytics1069 to get it back into puppetdb and allow cumin commands to run properly. Then, as per the steps discussed above:

  1. disabled puppet on the nodes
stevemunene@cumin1001:~$ sudo cumin 'analytics[1061-1069]*' "disable-puppet 'Preparing to decommission the hosts - T317861 - ${USER}'"
9 hosts will be targeted:
analytics[1061-1069].eqiad.wmnet
OK to proceed on 9 hosts? Enter the number of affected hosts to confirm or "q" to quit: 9
===== NO OUTPUT =====                                                                                                                                              
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (9/9) [00:42<00:00,  4.70s/hosts]
FAIL |                                                                                                                             |   0% (0/9) [00:42<?, ?hosts/s]
100.0% (9/9) success ratio (>= 100.0% threshold) for command: 'disable-puppet '...1 - stevemunene''.
100.0% (9/9) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

2. stopped the datanodes

stevemunene@cumin1001:~$ sudo cumin 'analytics[1061-1069]*' "systemctl stop hadoop-hdfs-datanode"
9 hosts will be targeted:
analytics[1061-1069].eqiad.wmnet
OK to proceed on 9 hosts? Enter the number of affected hosts to confirm or "q" to quit: 9
===== NO OUTPUT =====                                                                                                                                              
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (9/9) [00:05<00:00,  1.57hosts/s]
FAIL |                                                                                                                             |   0% (0/9) [00:05<?, ?hosts/s]
100.0% (9/9) success ratio (>= 100.0% threshold) for command: 'systemctl stop hadoop-hdfs-datanode'.
100.0% (9/9) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

3. Merged 933387 to remove analytics1064_1069 from the hdfs net_topology

4. Verified the topology (those nodes should go away in theory). From the HDFS NameNode interface, the nodes are marked as decommissioned and dead:

image.png (28 KB)

image.png (336 KB)

Moving on to decommissioning the nodes

Change 936051 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] analytics: remove puppet references for analytics[1058-1069]

https://gerrit.wikimedia.org/r/936051

Change 936051 merged by Stevemunene:

[operations/puppet@production] analytics: remove puppet references for analytics[1058-1069]

https://gerrit.wikimedia.org/r/936051

Mentioned in SAL (#wikimedia-analytics) [2023-07-07T09:28:56Z] <stevemunene> running sre.hadoop.roll-restart-masters restart the maters to completely remove any reference of analytics[1058-1069] T317861

Gehel triaged this task as Medium priority. Jul 21 2023, 12:50 PM
Gehel moved this task from Needs Reporting to Done on the Data-Platform-SRE board.