During the last batch of Hadoop worker deployments (which happened about a year ago, IIRC) I ran the following scripts (please don't judge) to automate the creation of the Hadoop DataNode partitions:
```
elukey@an-worker1080:~$ tail -n+1 step*
==> step1 <==
#!/bin/bash

set -e
set -x

# Create a logical volume for JournalNode data.
# There should only be one VG, look up its name:
vgname=$(vgdisplay -C --noheadings -o vg_name | head -n 1 | tr -d ' ')
lvcreate -n journalnode -L 10G $vgname

# Make an ext4 filesystem.
mkfs.ext4 /dev/$vgname/journalnode

# Don't reserve any blocks for OS on this partition.
tune2fs -m 0 /dev/$vgname/journalnode

mount_point=/var/lib/hadoop/journal
mkdir -pv $mount_point

grep -q $mount_point /etc/fstab || echo -e "# Hadoop JournalNode partition\n/dev/$vgname/journalnode\t${mount_point}\text4\tdefaults,noatime\t0\t2" | tee -a /etc/fstab

mount -v $mount_point

==> step2 <==
#!/bin/bash

set -e
set -x

for disk_letter in b c d e f g h i j k l m; do
    disk=/dev/sd${disk_letter}
    parted ${disk} --script mklabel gpt
    parted ${disk} --script mkpart primary ext4 0% 100%
    partition=${disk}1
    mkfs.ext4 $partition
done

==> step3 <==
#!/bin/bash

set -e
set -x

data_directory=/var/lib/hadoop/data

for disk_letter in b c d e f g h i j k l m; do
    partition_number=1
    partition="/dev/sd${disk_letter}${partition_number}"
    mount_point="${data_directory}/${disk_letter}"

    # Don't reserve any blocks for OS on these partitions.
    tune2fs -m 0 $partition

    # Make the mount point.
    mkdir -pv $mount_point

    # Add it to fstab unless it is already there.
    grep -q $mount_point /etc/fstab || (
        uuid=$(blkid | grep primary | grep ${partition} | awk '{print $2}' | sed -e 's/[:"]//g')
        echo -e "# Hadoop DataNode partition ${disk_letter}\n${uuid}\t${mount_point}\text4\tdefaults,noatime\t0\t2" | tee -a /etc/fstab
    )

    mount -v $mount_point
done

==> step4 <==
# ReadAhead Adaptive
megacli -LDSetProp ADRA -LALL -aALL

# Direct (No cache)
megacli -LDSetProp -Direct -LALL -aALL

# No write cache if bad BBU
megacli -LDSetProp NoCachedBadBBU -LALL -aALL

# Disable BBU auto-learn
echo "autoLearnMode=1" > /tmp/disable_learn && megacli -AdpBbuCmd -SetBbuProperties -f /tmp/disable_learn -a0
```
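A quick sanity check after running the steps (not part of the original scripts, just what I'd look at to confirm everything was partitioned and mounted as expected):

```
# All sd[b-m]1 partitions should show up as ext4 and mounted under /var/lib/hadoop/data/<letter>,
# plus the journalnode LV under /var/lib/hadoop/journal.
lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT
findmnt -t ext4 | grep /var/lib/hadoop
df -h /var/lib/hadoop/journal /var/lib/hadoop/data/*
```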
Some explanation of the "why" behind the above horror:
- every worker node has 2x SSD disks in a flex bay with hardware RAID 1. The OS therefore usually sees a single /dev/sda device, which we use for the OS.
- every worker node also has 12x 4TB disks with a "special" config. They need to act as JBOD, but due to how the hardware RAID controller works (this may have changed in recent versions) they have to be set up as single-disk RAID0 virtual drives so that they appear to the OS as individual JBOD disks (see the megacli sketch after this list). These disks are not configured in partman, so they are not formatted or otherwise touched during the Debian install (which is also a plus when we upgrade, since we don't have to worry about the data being wiped, etc.).
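For reference, this is roughly how the controller side can be inspected with megacli before running the partitioning scripts. This is only a sketch: the exact create-VD invocation depends on the controller/firmware, so the commented-out line should be verified rather than trusted.

```
# Physical disks seen by the controller (the data disks should all be listed):
megacli -PDList -aALL | grep -E 'Slot Number|Raw Size'

# Virtual drives: each data disk should appear as its own single-disk RAID0 VD:
megacli -LDInfo -Lall -aALL | grep -E 'Virtual Drive|RAID Level|Size'

# If the per-disk RAID0 VDs don't exist yet, something along these lines creates one VD
# per unconfigured disk (check the syntax against the controller's MegaCli version first):
# megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -aALL
```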
On top of the above, we have a new config for the 6 nodes with GPUs:
- no flex bay, and 24x 2TB disks (same single-disk RAID0 caveat as above; see the sketch below for how the hard-coded disk-letter loop could be generalized to this layout).
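The step2/step3 scripts above hard-code the disk letters b..m, which only covers the 12-disk layout. For the 24-disk GPU nodes the loop would have to be generalized; a minimal sketch, assuming the OS lives on /dev/sda and every other whole disk is a DataNode disk:

```
#!/bin/bash
set -e

os_disk=sda
# Enumerate all whole disks except the OS one, instead of hard-coding "b c d e f g h i j k l m".
data_disks=$(lsblk -d -n -o NAME,TYPE | awk -v os="$os_disk" '$2 == "disk" && $1 != os {print $1}')

for disk in $data_disks; do
    disk_letter=${disk#sd}
    # The partition/format/mount logic from step2/step3 would go here; just print for now.
    echo "would set up /dev/${disk} -> /var/lib/hadoop/data/${disk_letter}"
done
```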
We should write a cookbook to automate and document this procedure (and to improve it if needed).