Decommission the Hadoop backup cluster and add the worker nodes to the main Hadoop cluster
Closed, Resolved · Public

Description

Since we successfully deployed Bigtop and a few days have passed without major regressions, I think we can start planning to wipe the Backup cluster and add its nodes to the main one.

High level steps:

  • Add the puppet configuration for the new worker nodes to the main cluster's HDFS rack awareness config, and roll restart the NameNodes (so we'll be able to add new nodes without any risk of them landing in the default rack; see the verification sketch after this list).
  • Stop the backup cluster daemons, remove all puppet config, and set role(insetup) on all new workers.
  • Add a couple of nodes to the cluster and check that the Buster packages work fine, etc.
  • Reimage all the workers with Buster and use the init worker cookbook to wipe them clean (we don't want any data from the previous datanode dirs).
  • Add the rest of the nodes in a couple of big batches (10-11 nodes each). This should ease the work of the HDFS balancer when spreading blocks over the new nodes.
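
A quick way to verify that no new worker ends up in the default rack after the NameNode restart is the standard HDFS topology report. This is the generic Hadoop command, not anything specific to our puppetization; depending on the cluster's auth setup it may need to be run as the hdfs user or via the Kerberos wrapper:

# Print the rack -> datanode mapping as currently seen by the NameNode.
# Any host listed under /default-rack means its rack awareness entry is missing.
sudo -u hdfs hdfs dfsadmin -printTopology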

Event Timeline

elukey triaged this task as High priority. Feb 15 2021, 3:28 PM
elukey created this task.

Change 664302 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: update the HDFS Namenode rack configuration

https://gerrit.wikimedia.org/r/664302

Change 664302 merged by Elukey:
[operations/puppet@production] hadoop: update the HDFS Namenode rack configuration

https://gerrit.wikimedia.org/r/664302

Change 665304 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Decommission the Hadoop backup cluster

https://gerrit.wikimedia.org/r/665304

Change 665304 merged by Elukey:
[operations/puppet@production] Decommission the Hadoop backup cluster

https://gerrit.wikimedia.org/r/665304

  • Stop the backup cluster daemons, remove all puppet config, and set role(insetup) on all new workers.

@razzi adding some notes about what I did, so you have context.

I used the hadoop stop cluster cookbook to gracefully stop the Backup cluster; it also takes care of adding 2h of downtime for the hosts. At that point I waited a while to make sure that no side effect showed up on the main cluster (a paranoid step, none was expected, but still). Then I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/665304 to remove all puppet config, especially monitoring, for the Backup cluster. The idea was to use the 2h of downtime to let Icinga update itself (after the puppet merge + run) and stop monitoring/alerting on any node in the Backup cluster.

Before running puppet with the new config, I executed some systemctl mask commands, like systemctl mask hadoop-hdfs-datanode, across all Backup cluster nodes, to be sure that the daemons wouldn't accidentally come up again (say, after a reboot). If you don't know systemctl mask, check what it does, it is really handy :)

Finally I ran some checks like systemctl --failed across all nodes to see whether any unit was left in a failed state (and hence might have alerted later on), and used systemctl reset-failed to clear them (there were a few).
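
A minimal sketch of the masking/cleanup commands described above, run via cumin. The 'A:hadoop-worker-backup' alias and the exact daemon list are assumptions here; adjust them to the real host selection and units:

# Mask the Hadoop daemons on the stopped Backup cluster workers so that
# nothing (e.g. a reboot) can bring them up again by accident.
sudo cumin 'A:hadoop-worker-backup' 'systemctl mask hadoop-hdfs-datanode hadoop-yarn-nodemanager'

# List units left in a failed state after the shutdown, then clear them
# so they don't trigger alerts later on.
sudo cumin 'A:hadoop-worker-backup' 'systemctl --failed'
sudo cumin 'A:hadoop-worker-backup' 'systemctl reset-failed'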

At this point we have the an-worker1117+ nodes that need to be repurposed into the main Hadoop cluster. We want them on Buster and with the HDFS datanode partitions cleaned up, so we'll do the following for each of these nodes:

  • reimage to Buster
  • run (for each of them) the sre.hadoop.init-hadoop-workers cookbook with the --wipe option, which takes care of creating the /var/lib/hadoop/data/$letter partitions and tweaking some settings. Our partman recipes for the Debian install don't handle the datanode dirs, which is why we have the cookbook.

Once done, we'll be able to add the nodes to the Hadoop Analytics cluster (the main one), starting with only a couple of nodes to make sure that the Buster Bigtop packages work as expected, etc.
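
For reference, the per-node flow is roughly the following (an-worker1119 is just an example host, and the exact flags/arguments for both commands are from memory, so double check before running):

# 1) Reimage the host to Buster (run from a cumin host).
sudo -i wmf-auto-reimage an-worker1119.eqiad.wmnet

# 2) Once the host is back up, recreate and wipe the HDFS datanode
#    partitions (/var/lib/hadoop/data/$letter).
sudo cookbook sre.hadoop.init-hadoop-workers --wipe an-worker1119.eqiad.wmnet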

Today I'll try to do a couple of nodes, so we have some data on Monday about how they behave, and then we'll split the wiping/repurposing work. How does that sound?

@razzi to add more confusion: you'll notice that among the new workers (which were part of the Backup cluster and now have to be repurposed) there are some "gaps" in the an-worker11XX numbers, because 6 nodes are still in the rack/setup/deploy phase, see T260445. Once those are finalized we'll bootstrap them as described above.

elukey changed the task status from Open to Stalled. Feb 19 2021, 2:26 PM

Before starting we'll need to solve the uid/gid issue, see T231067 :(

elukey changed the task status from Stalled to Open. Feb 25 2021, 1:42 PM

Current status:

  • an-worker111[7,8] already reimaged/wiped and added to the Analytics Hadoop cluster
  • the newest/last 6 nodes need to be configured in https://phabricator.wikimedia.org/T275767 before proceeding
  • remaining nodes still to reimage/wipe/add:
# New hadoop worker nodes to add to the main Analytics cluster.
# Previously part of the Backup cluster.
node /^an-worker11(19|2[0-8]|3[0125678])\.eqiad\.wmnet$/ {
    role(insetup)
}

Blocker before proceeding:

We reserve a partition for the HDFS JournalNodes, /var/lib/hadoop/journal, basically a 10G LVM volume. It is needed only on the worker nodes that run the HDFS JournalNode (5 in total in the cluster, configured in hiera), but we should have it configured everywhere so that we can move the JournalNode daemons to any node if needed (say a node is completely down for hardware failure, etc.). Currently we don't have good consistency:

elukey@cumin1001:~$ sudo cumin 'A:hadoop-worker' 'df -h | grep -q /var/lib/hadoop/journal && echo "present" || echo "absent"'
61 hosts will be targeted:
an-worker[1078-1118].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                           
(21) an-worker[1078-1096].eqiad.wmnet,analytics[1069,1072].eqiad.wmnet                                                                           
----- OUTPUT of 'df -h | grep -q ...|| echo "absent"' -----                                                                                      
present                                                                                                                                          
===== NODE GROUP =====                                                                                                                           
(40) an-worker[1097-1118].eqiad.wmnet,analytics[1058-1068,1070-1071,1073-1077].eqiad.wmnet                                                       
----- OUTPUT of 'df -h | grep -q ...|| echo "absent"' -----                                                                                      
absent

Two use cases:

  • for existing nodes we should create the partition
  • for new nodes, we should have it added either by partman automatically or by the init-workers cookbook.

Procedure to add the LVM volume (script to be tested):

#!/bin/bash
# Create the 10G LVM volume used by the HDFS JournalNode and mount it
# under /var/lib/hadoop/journal. Safe to re-run: it skips hosts that
# already have the journal volume.

set -ex

# Drop the placeholder "unused" volume, if present, to free space in the
# volume group.
if compgen -G '/dev/mapper/*--vg-unused' > /dev/null
then
    echo "Dropping unused volume"
    lvremove -y /dev/mapper/*--vg-unused
fi

# Create, format and mount the journalnode volume if it does not exist yet.
if ! compgen -G '/dev/mapper/*journal*' > /dev/null
then
    echo "Creating journalnode volume"
    VGNAME=$(vgs --noheadings -o vg_name | tr -d ' ')
    lvcreate -L 10G -n journalnode "$VGNAME"

    echo "Creating the ext4 filesystem"
    mkfs.ext4 "/dev/$VGNAME/journalnode"

    echo "Adding mountpoint to fstab and mounting it"
    mkdir -p /var/lib/hadoop/journal
    {
        echo "# Hadoop JournalNode partition"
        printf '/dev/%s/journalnode\t/var/lib/hadoop/journal\text4\tdefaults,noatime\t0\t2\n' "$VGNAME"
    } >> /etc/fstab
    mount -a
fi
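
Once the script has run on a host, re-running the same cumin check as above should report the volume as present:

sudo cumin 'A:hadoop-worker' 'df -h | grep -q /var/lib/hadoop/journal && echo "present" || echo "absent"'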

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1120.eqiad.wmnet', 'an-worker1121.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103020709_elukey_4434.log.

Completed auto-reimage of hosts:

['an-worker1120.eqiad.wmnet', 'an-worker1121.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1122.eqiad.wmnet', 'an-worker1123.eqiad.wmnet', 'an-worker1124.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103020739_elukey_12875.log.

Completed auto-reimage of hosts:

['an-worker1122.eqiad.wmnet', 'an-worker1123.eqiad.wmnet', 'an-worker1124.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1125.eqiad.wmnet', 'an-worker1126.eqiad.wmnet', 'an-worker1127.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103020809_elukey_22991.log.

Completed auto-reimage of hosts:

['an-worker1125.eqiad.wmnet', 'an-worker1126.eqiad.wmnet', 'an-worker1127.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1128.eqiad.wmnet', 'an-worker1130.eqiad.wmnet', 'an-worker1131.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103020838_elukey_32282.log.

Completed auto-reimage of hosts:

['an-worker1128.eqiad.wmnet', 'an-worker1130.eqiad.wmnet', 'an-worker1131.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1119.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103020955_elukey_26605.log.

Completed auto-reimage of hosts:

['an-worker1119.eqiad.wmnet']

and were ALL successful.

Change 667860 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-worker11[19,20-28,30,31] to Analytics Hadoop

https://gerrit.wikimedia.org/r/667860

Change 667860 merged by Elukey:
[operations/puppet@production] Add an-worker11[19,20-28,30,31] to Analytics Hadoop

https://gerrit.wikimedia.org/r/667860

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1132.eqiad.wmnet', 'an-worker1135.eqiad.wmnet', 'an-worker1136.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103030822_elukey_30308.log.

Completed auto-reimage of hosts:

['an-worker1132.eqiad.wmnet', 'an-worker1135.eqiad.wmnet', 'an-worker1136.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1137.eqiad.wmnet', 'an-worker1138.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103030911_elukey_14092.log.

Completed auto-reimage of hosts:

['an-worker1137.eqiad.wmnet', 'an-worker1138.eqiad.wmnet']

and were ALL successful.

Change 668014 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::worker: add explicit require for profile::java

https://gerrit.wikimedia.org/r/668014

Change 668014 merged by Elukey:
[operations/puppet@production] profile::hadoop::worker: add explicit require for profile::java

https://gerrit.wikimedia.org/r/668014

Change 668018 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-worker113[2,5-8] to the Analytics Hadoop cluster

https://gerrit.wikimedia.org/r/668018

Change 668018 merged by Elukey:
[operations/puppet@production] Add an-worker113[2,5-8] to the Analytics Hadoop cluster

https://gerrit.wikimedia.org/r/668018

elukey changed the task status from Open to Stalled. Mar 3 2021, 11:21 AM

The last 6 nodes still need to be racked; stalled until T276239 is solved :)

elukey removed elukey as the assignee of this task. Jun 1 2021, 8:08 AM
Ottomata claimed this task.