During the last batch of Hadoop worker deployments (which happened about a year ago, IIRC) I ran the following scripts (please don't judge) to automate the creation of the Hadoop DataNode partitions:
```
elukey@an-worker1080:~$ tail -n+1 step*
==> step1 <==
#!/bin/bash

set -e
set -x

# Create a logical volume for JournalNode data.
# There should only be one VG, look up its name:
vgname=$(vgdisplay -C --noheadings -o vg_name | head -n 1 | tr -d ' ')
lvcreate -n journalnode -L 10G $vgname

# Make an ext4 filesystem.
mkfs.ext4 /dev/$vgname/journalnode

# Don't reserve any blocks for OS on this partition.
tune2fs -m 0 /dev/$vgname/journalnode

mount_point=/var/lib/hadoop/journal
mkdir -pv $mount_point

grep -q $mount_point /etc/fstab || echo -e "# Hadoop JournalNode partition\n/dev/$vgname/journalnode\t${mount_point}\text4\tdefaults,noatime\t0\t2" | tee -a /etc/fstab

mount -v $mount_point

==> step2 <==
#!/bin/bash

set -e
set -x

for disk_letter in b c d e f g h i j k l m; do
    disk=/dev/sd${disk_letter}
    parted ${disk} --script mklabel gpt
    parted ${disk} --script mkpart primary ext4 0% 100%
    partition=${disk}1
    mkfs.ext4 $partition
done

==> step3 <==
#!/bin/bash

set -e
set -x

data_directory=/var/lib/hadoop/data

for disk_letter in b c d e f g h i j k l m; do
    partition_number=1
    partition="/dev/sd${disk_letter}${partition_number}"
    mount_point="${data_directory}/${disk_letter}"

    # Don't reserve any blocks for OS on these partitions.
    tune2fs -m 0 $partition

    # Make the mount point.
    mkdir -pv $mount_point

    # Add it to fstab unless it is already there.
    grep -q $mount_point /etc/fstab || (
        uuid=$(blkid | grep primary | grep ${partition} | awk '{print $2}' | sed -e 's/[:"]//g')
        echo -e "# Hadoop DataNode partition ${disk_letter}\n${uuid}\t${mount_point}\text4\tdefaults,noatime\t0\t2" | tee -a /etc/fstab
    )

    mount -v $mount_point
done

==> step4 <==
# ReadAhead Adaptive
megacli -LDSetProp ADRA -LALL -aALL

# Direct (No cache)
megacli -LDSetProp -Direct -LALL -aALL

# No write cache if bad BBU
megacli -LDSetProp NoCachedBadBBU -LALL -aALL

# Disable BBU auto-learn
echo "autoLearnMode=1" > /tmp/disable_learn && megacli -AdpBbuCmd -SetBbuProperties -f /tmp/disable_learn -a0
```
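A quick sanity check after running the steps (not part of the original scripts, just what I'd look at to confirm everything was partitioned and mounted as expected):

```
# All sd[b-m]1 partitions should show up as ext4 and mounted under /var/lib/hadoop/data/<letter>,
# plus the journalnode LV under /var/lib/hadoop/journal.
lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT
findmnt -t ext4 | grep /var/lib/hadoop
df -h /var/lib/hadoop/journal /var/lib/hadoop/data/*
```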
Some explanation of the "why" behind the above horror:
- every worker node has 2x SSD disks in a flex bay with hardware RAID 1. The OS therefore usually sees a single /dev/sda device, which we use for the OS.
- every worker node also has 12x 4TB disks with a "special" config. They need to act as JBOD, but due to how the hardware RAID controller works (this may have changed in recent versions) they have to be set up as single-disk RAID0 virtual drives so that they appear to the OS as individual JBOD disks (see the megacli sketch after this list). These disks are not configured in partman, so they are not formatted or otherwise touched during the Debian install (which is also a plus when we upgrade, since we don't have to worry about the data being wiped, etc.).
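For reference, this is roughly how the controller side can be inspected with megacli before running the partitioning scripts. This is only a sketch: the exact create-VD invocation depends on the controller/firmware, so the commented-out line should be verified rather than trusted.

```
# Physical disks seen by the controller (the data disks should all be listed):
megacli -PDList -aALL | grep -E 'Slot Number|Raw Size'

# Virtual drives: each data disk should appear as its own single-disk RAID0 VD:
megacli -LDInfo -Lall -aALL | grep -E 'Virtual Drive|RAID Level|Size'

# If the per-disk RAID0 VDs don't exist yet, something along these lines creates one VD
# per unconfigured disk (check the syntax against the controller's MegaCli version first):
# megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -aALL
```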
On top of the above, we have a new config for the 6 nodes with GPUs:
- no flex bay, and 24x 2TB disks (same single-disk RAID0 caveat as above; see the sketch below for how the hard-coded disk-letter loop could be generalized to this layout).
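The step2/step3 scripts above hard-code the disk letters b..m, which only covers the 12-disk layout. For the 24-disk GPU nodes the loop would have to be generalized; a minimal sketch, assuming the OS lives on /dev/sda and every other whole disk is a DataNode disk:

```
#!/bin/bash
set -e

os_disk=sda
# Enumerate all whole disks except the OS one, instead of hard-coding "b c d e f g h i j k l m".
data_disks=$(lsblk -d -n -o NAME,TYPE | awk -v os="$os_disk" '$2 == "disk" && $1 != os {print $1}')

for disk in $data_disks; do
    disk_letter=${disk#sd}
    # The partition/format/mount logic from step2/step3 would go here; just print for now.
    echo "would set up /dev/${disk} -> /var/lib/hadoop/data/${disk_letter}"
done
```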
We should write a cookbook to automate and document this procedure (and to improve it if needed).