
Add 6 worker nodes to the HDFS Namenode config of the Analytics Hadoop cluster
Closed, Resolved · Public

Description

In T260445 DC-ops racked the last 6 worker nodes that were pending, part of the original batch of 24 that we scheduled for the cluster expansion. We used 18 of them for the temporary backup cluster, and all those 18 are already in the HDFS Namenode config of the Analytics cluster (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/664302/3/hieradata/common.yaml).

We need to add the following nodes as well (from puppet's site.pp):

#staged an-workers via T260445
node /^an-worker11(29|33|34|39|40|41)\.eqiad\.wmnet$/ {
    role(insetup)
}
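For clarity, the regex in the site.pp stanza above matches exactly the six staged workers. A quick sketch verifying this with Python's re module:

```python
import re

# Regex copied from the site.pp stanza above.
pattern = re.compile(r"^an-worker11(29|33|34|39|40|41)\.eqiad\.wmnet$")

staged = [f"an-worker11{n}.eqiad.wmnet" for n in (29, 33, 34, 39, 40, 41)]
assert all(pattern.match(host) for host in staged)

# Hosts outside the staged batch do not match.
assert pattern.match("an-worker1128.eqiad.wmnet") is None
```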

Things to do:

  1. Come up with a change like https://gerrit.wikimedia.org/r/c/operations/puppet/+/664302/3/hieradata/common.yaml for the above nodes (netbox.wikimedia.org will be useful to check rack locations)
  2. Merge and run puppet on an-master100[1,2]
  3. Roll restart the Namenodes (usual failover, restart 1001, failback, restart 1002 procedure - or using the cookbook)
  4. Check if the above nodes need to get kerberos keytabs in puppet private
  5. Check https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration and see if documentation is missing/outdated

The end result will be that we can move the 6 new nodes into the Analytics Hadoop cluster at any time :)

Event Timeline

Change 667032 had a related patch set uploaded (by Razzi; owner: Razzi):
[operations/puppet@production] hadoop: Add new worker nodes to hadoop_clusters

https://gerrit.wikimedia.org/r/667032

elukey changed the task status from Open to Stalled. Mar 2 2021, 4:19 PM

Blocked until T276239 is solved

elukey changed the task status from Stalled to Open. Jun 1 2021, 8:20 AM

@razzi @Ottomata this task can move on in theory: the DC-ops team moved an-worker1129 to a different rack, and an-worker1139 is scheduled to move as soon as possible (it is pending some rack clean-up in A7).

So in theory we could move on with 5 out of 6 nodes (all except 1139 basically), and then add the last one when DCops moves it over to A7.

BTullis triaged this task as Medium priority. Aug 11 2021, 4:58 PM
BTullis moved this task from Paused to In Progress on the Analytics-Kanban board.

Before proceeding with what is indicated in the task description, there are some extra steps to do:

  1. Run the init-hadoop-workers.py cookbook for every host (it also accepts multiple node selections, IIRC) so that the 12 × 4 TB disks are initialized correctly. The end result should be /etc/fstab listing 12 partitions (already mounted).
  2. Do some final checks on the nodes (partitions look ok, nothing weird in the dmesg, etc.). They may need an apt-get upgrade + reboot, but that is not strictly mandatory.

After the above, we should be able to finally add them to prod :)

Beginning work on this now.
For my reference, the full names of the new hosts are:

  • an-worker1129.eqiad.wmnet
  • an-worker1133.eqiad.wmnet
  • an-worker1134.eqiad.wmnet
  • an-worker1139.eqiad.wmnet
  • an-worker1140.eqiad.wmnet
  • an-worker1141.eqiad.wmnet

I can confirm that the motd shows them all as being in setup.
findmnt --real shows only / and /boot (plus bpf and tracefs).
lsblk shows the 12 x 4TB HDD drives.

btullis@an-worker1141:~$ findmnt --real
TARGET                      SOURCE                              FSTYPE  OPTIONS
/                           /dev/mapper/an--worker1141--vg-root ext4    rw,relatime,errors=remount-ro
├─/sys/fs/bpf               bpf                                 bpf     rw,nosuid,nodev,noexec,relatime,mode=700
├─/sys/kernel/debug/tracing tracefs                             tracefs rw,relatime
└─/boot                     /dev/sda1                           ext4    rw,relatime
btullis@an-worker1141:~$ lsblk
NAME                          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                             8:0    0 446.6G  0 disk 
├─sda1                          8:1    0   953M  0 part /boot
├─sda2                          8:2    0     1K  0 part 
└─sda5                          8:5    0 445.7G  0 part 
  ├─an--worker1141--vg-root   254:0    0  55.9G  0 lvm  /
  ├─an--worker1141--vg-swap   254:1    0   9.3G  0 lvm  [SWAP]
  └─an--worker1141--vg-unused 254:2    0 291.4G  0 lvm  
sdb                             8:16   0   3.7T  0 disk 
sdc                             8:32   0   3.7T  0 disk 
sdd                             8:48   0   3.7T  0 disk 
sde                             8:64   0   3.7T  0 disk 
sdf                             8:80   0   3.7T  0 disk 
sdg                             8:96   0   3.7T  0 disk 
sdh                             8:112  0   3.7T  0 disk 
sdi                             8:128  0   3.7T  0 disk 
sdj                             8:144  0   3.7T  0 disk 
sdk                             8:160  0   3.7T  0 disk 
sdl                             8:176  0   3.7T  0 disk 
sdm                             8:192  0   3.7T  0 disk

sudo megacli -LDInfo -Lall -Aall shows 13 logical drives.
0 is the O/S mirror.
1-12 are single-disk RAID0 volumes, each backed by one physical 4 TB SATA disk.

I've checked the above on all six of the new servers.
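As a back-of-the-envelope check of what these six workers add (a sketch, assuming the nominal 4 TB disk size and HDFS's default 3x replication; actual usable space will be somewhat lower after filesystem overhead):

```python
# Raw capacity added by the six workers: 12 data disks per node,
# 4 TB nominal each (the O/S mirror on sda is excluded).
nodes, disks_per_node, disk_tb = 6, 12, 4
raw_tb = nodes * disks_per_node * disk_tb
assert raw_tb == 288

# With HDFS default 3x replication, usable space is roughly a third.
assert raw_tb // 3 == 96
```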

I don't think that I need to add any command-line options to the cookbook, because the defaults work.
We have 12 disks to initialize, we want to skip /dev/sda, there is no pre-existing partition table, and we want them all to succeed.

--disks-number DISKS_NUMBER
                      The number of datanode disks/partitions to initialize.
                      (default: 12)
--skip-disks SKIP_DISKS
                      The number of devices, starting from a, to skip because
                      already hosting other partitions (like root). For
                      example: 1 means skipping /dev/sda, 2 means skipping
                      /dev/sd[a,b], etc.. (default: 1)
--partitions-basedir PARTITIONS_BASEDIR
                      The base directory of the partitions to initialize.
                      (default: /var/lib/hadoop/data)
--wipe-partitions     Use wipefs to remove any pre-existing partition table
                      on the disks. (default: False)
--success-percent [1-100]
                      Expected success percent when executing cumin commands
                      on multiple hosts. Useful to init old nodes with
                      potentially broken disks. (default: 100)
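To make the --disks-number / --skip-disks semantics concrete, here is an illustrative sketch (not the cookbook's actual code) of which devices the defaults select on these workers:

```python
import string

def datanode_devices(disks_number=12, skip_disks=1):
    """Return the device names the cookbook would initialize,
    mirroring the --disks-number / --skip-disks help text:
    skip_disks=1 skips /dev/sda, 2 skips /dev/sd[a,b], etc."""
    letters = string.ascii_lowercase
    start = skip_disks  # devices before this index host other partitions
    return [f"/dev/sd{letters[i]}" for i in range(start, start + disks_number)]

devs = datanode_devices()
assert len(devs) == 12
# Matches the lsblk output above: sdb through sdm are the data disks.
assert devs[0] == "/dev/sdb" and devs[-1] == "/dev/sdm"
```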

I can use the host pattern with clustershell nodeset syntax, but I think I'm going to try a single node first.
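For reference, a minimal sketch of how a single-bracket nodeset pattern expands into the six hostnames (the real ClusterShell NodeSet class handles far more: padding, multiple brackets, set operations):

```python
import re

def expand_nodeset(pattern):
    """Expand a single-bracket ClusterShell-style nodeset such as
    'an-worker11[29,33-34,39-41].eqiad.wmnet'. Illustrative only."""
    m = re.match(r"^(.*)\[([^\]]+)\](.*)$", pattern)
    if not m:
        return [pattern]
    prefix, body, suffix = m.groups()
    out = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            out.extend(f"{prefix}{i}{suffix}" for i in range(int(lo), int(hi) + 1))
        else:
            out.append(f"{prefix}{part}{suffix}")
    return out

hosts = expand_nodeset("an-worker11[29,33-34,39-41].eqiad.wmnet")
assert hosts == [
    "an-worker1129.eqiad.wmnet",
    "an-worker1133.eqiad.wmnet",
    "an-worker1134.eqiad.wmnet",
    "an-worker1139.eqiad.wmnet",
    "an-worker1140.eqiad.wmnet",
    "an-worker1141.eqiad.wmnet",
]
```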

I have completed the hard drive initialization on all 6 nodes.
I'm just checking whether I should make the change to common.yaml before the change in the servers' roles, or afterwards.

I am making the change to common.yaml prior to changing the role of the servers. Patch here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/714331

I notice that there was an additional step in T274795#6861064 to create and mount /var/lib/hadoop/journal on the new worker nodes.
At the moment I can't see any reason why this shouldn't be run on these six nodes as well.

Three of the six nodes (29,33,34) already had principals and keytabs.
The remaining three (39,40,41) did not, so I created these and added them to the puppet secrets repo.

I have merged the change to the net topology and applied the change to an-master100[1-2], so now I need to carry out a rolling restart of the namenodes.
The cookbook restarts the Resourcemanager, Namenode, and Mapreduce history daemons, but in this case only the Namenode daemon requires a restart.

I'll investigate whether there are likely to be any benefits to doing a manual rolling restart of the Namenode daemons, as opposed to running the cookbook.

I am planning to run the cookbook at 14:00 UTC - I have reached out to product-analytics to let them know about the planned work, although I do not expect any interruption to service.

Change 714369 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add six worker nodes that were staged into service

https://gerrit.wikimedia.org/r/714369

The cookbook ran successfully and the restart of the daemons appears to have been without incident. I have now prepared a patch to add the six servers into service:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/714369

Previous changes have added these new workers in bulk, in order to reduce the work of the HDFS balancer components, so I have done the same by adding all six servers in one patch.

Great work! Going to add some notes to previous posts inline:

> I am making the change to common.yaml prior to changing the role of the servers. Patch here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/714331

> I notice that there was an additional step in T274795#6861064 to create and mount /var/lib/hadoop/journal on the new worker nodes.
> At the moment I can't see any reason why this shouldn't be run on these six nodes as well.

The journalnode dir is needed only on the journal nodes (5 in total), but we keep it on every worker for consistency and to ease maintenance if we need to move a journalnode elsewhere. We should add it as a mandatory step from now on, since the partman reuse recipe for the 12-disk workers expects it (see modules/install_server/files/autoinstall/partman/custom/reuse-analytics-hadoop-worker-12dev.cfg). If it is not stated on Wikitech let's add it, and we may also want to add a step to the cookbook if not already present.

> Previous changes have added these new workers in bulk, in order to reduce the work of the HDFS balancer components, so I have done the same by adding all six servers in one patch.

+1, I'd also keep an eye on the HDFS metrics and UI (available via ssh tunnel; it should be documented on Wikitech) during the days after the new workers are added, to make sure that the HDFS balancer does its work moving blocks to the new workers (so that they do not remain nearly empty).

> I have completed the hard drive initialization on all 6 nodes.
> I'm just checking whether I should make the change to common.yaml before the change in the servers' roles, or afterwards.

It needs to be done before (including the HDFS Namenode restarts) to avoid having the Hadoop workers assigned to the default rack (which is not great for our config, etc.). We have an alarm that fires if any worker is placed in the default rack by mistake; we got bitten by this in the past :(

OK, so I think I'm ready to go with the addition of these six nodes.
My only outstanding concern is what happens if we get a permissions problem upon installation, as we saw with Druid recently (e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/711103), which was an unforeseen consequence of the recent uid/gid changes.
In order to mitigate this I'll do the following:

  • Disable puppet on the new nodes
  • Merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/714369
  • Run puppet manually on an-worker1129
  • Verify that it installs cleanly and joins the cluster without error
  • Run sudo -u hdfs hdfs dfsadmin -printTopology and verify that an-worker1129 joins rack /eqiad/A/2
  • Enable and run puppet on all of the other nodes.
  • Verify that they all join their correct racks.
  • Start a balancing run with sudo -u hdfs hdfs balancer in a screen session.
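The printTopology check in the plan above can be scripted. A sketch of parsing the command's output into a host-to-rack map, so the default-rack alarm condition can be checked programmatically; the exact output format shown in the sample (and the IP/port values) is an assumption here:

```python
def racks_by_host(topology_output):
    """Parse `hdfs dfsadmin -printTopology`-style output into
    {hostname: rack}. Assumes 'Rack: <path>' header lines followed
    by indented datanode lines with the hostname in parentheses."""
    rack, mapping = None, {}
    for line in topology_output.splitlines():
        if line.startswith("Rack:"):
            rack = line.split(":", 1)[1].strip()
        elif "(" in line and ")" in line:
            host = line[line.index("(") + 1 : line.index(")")]
            mapping[host] = rack
    return mapping

# Hypothetical sample output for two of the new workers.
sample = """Rack: /eqiad/A/2
   10.64.5.28:50010 (an-worker1129.eqiad.wmnet)
Rack: /eqiad/B/2
   10.64.16.12:50010 (an-worker1133.eqiad.wmnet)"""

racks = racks_by_host(sample)
assert racks["an-worker1129.eqiad.wmnet"] == "/eqiad/A/2"
# Nothing should land in the default rack (this fires an alarm).
assert "/default-rack" not in racks.values()
```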

Some notes :)

  • it may happen that the first puppet run doesn't finish cleanly (namely, reporting some errors). Adding code to puppet without bootstrapping nodes for a while leads to race conditions that happen only during the first run, but usually after the first run or two everything runs smoothly.
  • when using the dfsadmin command you may need to use sudo -u hdfs kerberos-run-command hdfs etc.. to refresh credentials.
  • the hdfs balancer runs in a systemd timer on an-launcher1002, so it should do its work during the next hours (you can tail logs to see what it does to confirm if you want).

The rest looks good!

Change 714369 merged by Btullis:

[operations/puppet@production] Add six worker nodes that were staged into service

https://gerrit.wikimedia.org/r/714369

The only issue observed during the first puppet run was that the Hadoop services tried to start before the Java packages were installed.
I don't think that this warrants fixing right now, since they start cleanly on the second run. We can come back to it.

The datanode and nodemanager services are fine. I can see the datanode in the HDFS web UI and the number of nodes in the yarn web ui is correct. Running puppet on the other five nodes now.

84 nodes in service.


I will start the balancer via the systemd service file and tail the log. I'd rather have it running now, than wait 19 hours for the next scheduled run.

The rebalancing operation is proceeding nicely. Each node is copying data from the network at around 190 MB/s. I'll mark this ticket as done, although I'm still monitoring for stability.

Change 667032 abandoned by Btullis:

[operations/puppet@production] hadoop: Add new worker nodes to hadoop_clusters

Reason:

Implemented in another CR

https://gerrit.wikimedia.org/r/667032