The analytics1028-1069 worker nodes are still running Debian Jessie, while analytics1070-1079 have been running Stretch without any issue. Managing two OS versions at the same time is cumbersome due to different package versions etc., and sooner or later the SRE team will ask us to upgrade anyway :)
This is not part of any goal so far (April 2018), so we'll handle this task as best effort: no deadlines set in stone for when it will be finished, and priority goes to the actual goals.
High level procedure:
More details:
- Schedule downtime and drain Hadoop jobs
  # On einsteinium
  sudo icinga-downtime -d 7200 -r "prep for reimage" -h analytics1068

  # On the host
  sudo puppet agent --disable "$USER - drain & reimage"
  sudo systemctl stop hadoop-yarn-nodemanager
  watch 'ps aux | grep yarn | grep -c -v grep'  # ...until no jvms/containers running
  sudo systemctl stop hadoop-hdfs-datanode
Be EXTREMELY careful with analytics10[28,35,52], since they are the HDFS journal nodes!
Ensure that all three of them are up and running before any reimage, otherwise the NameNode will shut down.
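The manual watch in the drain step can also be scripted. This is just a sketch wrapping the same ps|grep pipeline in a function (the function name is ours, the 'yarn' pattern is the one used above):

```shell
# Count surviving YARN JVMs/containers with the same pipeline used in the
# watch command above; grep -v grep excludes the grep process itself.
count_yarn() { ps aux | grep yarn | grep -c -v grep; }

echo "$(count_yarn) yarn processes still running"
# Once this reaches 0 it is safe to stop hadoop-hdfs-datanode.
```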
- Launch wmf-auto-reimage
  # On neodymium
  sudo -i wmf-auto-reimage -p T192557 analytics1068.eqiad.wmnet
When the script reboots the host, it starts another puppet run, which usually doesn't complete since it fails due to the /var/lib/hadoop dirs not being present yet. Don't close the script; skip to the next point.
- On the host, check with fdisk that all the disks have Linux fs partitions (a simple fdisk -l is fine).
- On the host:
  sudo blkid | grep primary | awk '{print $2" "$1}' | sed -e 's/[:"]//g' | \
    while read uuid partition; do
      letter=$(echo $partition | awk -F 'sd|1' '{print $2}')
      echo -e "# Hadoop datanode $letter partition\n$uuid\t/var/lib/hadoop/data/${letter}\text4\tdefaults,noatime\t0\t2"
    done >> /etc/fstab

  root@analytics1068:/# grep -c hadoop /etc/fstab
  12

  for el in b c d e f g h i j k l m; do mkdir -p /var/lib/hadoop/data/${el}; done
  mount -a
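The awk field-separator trick above is terse, so here is a dry run of the line generation for a single partition, showing how the drive letter falls out (the UUID is made up; blkid supplies the real ones):

```shell
# awk -F 'sd|1' splits '/dev/sdb1' at 'sd' and at '1', leaving the drive
# letter as field 2 (note: this only works for sdX1-style device names).
partition="/dev/sdb1"
uuid="UUID=0affe2da-fake"   # made-up value for the dry run
letter=$(echo $partition | awk -F 'sd|1' '{print $2}')
echo "$letter"    # b
echo -e "# Hadoop datanode $letter partition\n$uuid\t/var/lib/hadoop/data/${letter}\text4\tdefaults,noatime\t0\t2"
```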
- If the hdfs and yarn users and groups were assigned different uids/gids, we need to chown all the hadoop lib files to the correct user:group.
Check to see if files are owned properly:
  ls -ld /var/lib/hadoop/data/b/{yarn,hdfs}
  drwx------ 3 hdfs hdfs 4096 Jul 31  2015 /var/lib/hadoop/data/b/hdfs
  drwxr-xr-x 4 yarn yarn 4096 Jul 31  2015 /var/lib/hadoop/data/b/yarn
If these are owned as 'hdfs:hdfs' and 'yarn:yarn' respectively, you can skip straight to re-enabling puppet (last step). Otherwise:
  puppet agent --disable "$USER - chown step"
  for letter in $(ls /var/lib/hadoop/data); do
    sudo chown -Rv yarn:yarn /var/lib/hadoop/data/${letter}/yarn &
    sudo chown -Rv hdfs:hdfs /var/lib/hadoop/data/${letter}/hdfs &
  done
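After the chowns finish (they are backgrounded, so wait for them), ownership can be re-checked without eyeballing ls output: stat -c '%U:%G' prints owner:group directly. Demonstrated here on a throwaway directory owned by the current user, since the real /var/lib/hadoop/data paths only exist on the worker:

```shell
# stat -c '%U:%G' gives a script-friendly owner:group string.
d=$(mktemp -d)                      # stand-in for /var/lib/hadoop/data/b/hdfs
owner_group=$(stat -c '%U:%G' "$d")
echo "$owner_group"
# mktemp dirs belong to the invoking user, so this check passes here.
[ "${owner_group%%:*}" = "$(id -un)" ] && echo "ownership OK"
rmdir "$d"
```

On the reimaged worker the same check across all twelve data dirs is just: for d in /var/lib/hadoop/data/*/{hdfs,yarn}; do echo "$(stat -c '%U:%G' $d) $d"; done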
- Enable puppet and run it, verifying that all the daemons come back up as expected. The wmf-auto-reimage script will then complete and report success in the task.