The analytics1028->1069 worker nodes are still running Debian Jessie, while analytics1070->1079 have been running Stretch without any issue. Managing two OS releases at the same time can be cumbersome due to different package versions etc., and sooner or later the SRE team will ask us to complete the upgrade anyway :)
This is not part of any goal so far (April 2018), so we'll handle this task as best effort: no deadlines set in stone for when it will be finished, and priority goes to the actual goals.
High level procedure:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Worker_Reimage_(12_disk,_2_flex_bay_drives_-_analytics1028-analytics1057)
More details:
1) Drain traffic first (if needed)
```
elukey@einsteinium:~$ sudo icinga-downtime -h analytics1068 -d 3600 -r "prep for reimage"
elukey@analytics1068:~$ sudo puppet agent --disable "$USER - drain & reimage"
elukey@analytics1068:~$ sudo systemctl stop hadoop-*
```
Or, even more gracefully:
```
elukey@analytics1068:~$ sudo puppet agent --disable "$USER - drain & reimage"
elukey@analytics1068:~$ sudo systemctl stop hadoop-yarn-nodemanager
[..wait until no JVMs/containers are running..]
elukey@analytics1068:~$ sudo systemctl stop hadoop-hdfs-datanode
```
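Before the final `systemctl stop hadoop-hdfs-datanode`, a quick way to confirm that no container JVMs are left is to look for processes still owned by the `yarn` user (a minimal sketch, assuming YARN containers run as that user):

```shell
# Sketch: any process still owned by the 'yarn' user means containers
# are still draining (trailing '=' in -o suppresses the ps header)
if ps -u yarn -o pid=,cmd= 2>/dev/null | grep -q .; then
  echo "yarn JVMs still running - wait before stopping the datanode:"
  ps -u yarn -o pid=,cmd=
else
  echo "no yarn JVMs left, safe to stop the datanode"
fi
```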
**Be EXTREMELY careful with analytics10[28,35,52], since they are the journal nodes!**
**Ensure that all three of them are up and running before reimage, otherwise the namenode will shut down**
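The journal-node check above could be sketched as follows (not the official procedure; the `hadoop-hdfs-journalnode` unit name and a working ssh setup from your bastion are assumptions):

```shell
# Sketch: confirm the journalnode unit is active on all three journal
# node hosts before starting the reimage (unit name assumed)
for host in analytics1028 analytics1035 analytics1052; do
  if ssh -o ConnectTimeout=5 "${host}.eqiad.wmnet" \
       systemctl is-active --quiet hadoop-hdfs-journalnode 2>/dev/null; then
    echo "${host}: journalnode active"
  else
    echo "${host}: journalnode NOT active - do NOT reimage yet"
  fi
done
```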
2) Launch wmf-auto-reimage
```
sudo -i wmf-auto-reimage -p T192557 analytics1068.eqiad.wmnet
```
When the script reboots the host it will start another puppet run, which usually doesn't complete since it fails because the /var/lib/hadoop dirs are not present yet. Don't close the script; skip to the next point.
3) On the host, check with fdisk that all the disks have Linux filesystem partitions (a simple `fdisk -l` is fine).
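For example (a sketch; it assumes the 12 data disks show up as sdb..sdm):

```shell
# Sketch: count partitions on the data disks; a 12-disk worker should
# report 12 (device naming sdb..sdm is an assumption)
count=$(sudo fdisk -l 2>/dev/null | grep -c '^/dev/sd[b-m]1' || true)
echo "found ${count} data partitions (expected 12)"
```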
4) On the host:
```
# Run as root, since /etc/fstab is modified
sudo blkid | grep primary | awk '{print $2" "$1}' | sed -e 's/[:"]//g' | while read uuid partition;
do
  letter=$(echo $partition | awk -F 'sd|1' '{print $2}')
  echo -e "# Hadoop datanode $letter partition\n$uuid\t/var/lib/hadoop/data/${letter}\text4\tdefaults,noatime\t0\t2"
done >> /etc/fstab

# Sanity check: all 12 data partitions should now be listed in fstab
grep -c hadoop /etc/fstab   # expected output: 12

# Create the mountpoints and mount everything
for el in b c d e f g h i j k l m; do mkdir -p /var/lib/hadoop/data/${el}; done
mount -a
```
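To see what the pipeline above generates, here is a dry run on a single made-up `blkid` line (the device and UUID are purely illustrative; the real run iterates over `sudo blkid`):

```shell
# Dry run of the fstab-generation pipeline on one fake blkid line
echo '/dev/sdb1: UUID="d34db33f-0000-4242-aaaa-cafecafecafe" TYPE="ext4" PARTLABEL="primary"' \
  | grep primary | awk '{print $2" "$1}' | sed -e 's/[:"]//g' \
  | while read uuid partition; do
      # extract the drive letter, e.g. /dev/sdb1 -> b
      letter=$(echo $partition | awk -F 'sd|1' '{print $2}')
      echo -e "# Hadoop datanode $letter partition\n$uuid\t/var/lib/hadoop/data/${letter}\text4\tdefaults,noatime\t0\t2"
    done
```

This prints a comment line plus a tab-separated fstab entry mounting the partition at /var/lib/hadoop/data/b; in the real run these lines are appended to /etc/fstab.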
5) Chown all the Hadoop lib files to the correct user:group, but disable puppet first to avoid any execution of the Hadoop daemons in the meantime.
```
sudo puppet agent --disable "$USER - chown step"
for letter in $(ls /var/lib/hadoop/data); do
sudo chown -Rv yarn:yarn /var/lib/hadoop/data/${letter}/yarn &
sudo chown -Rv hdfs:hdfs /var/lib/hadoop/data/${letter}/hdfs &
done
# Wait for all the background chowns to complete before re-enabling puppet
wait
```
6) Re-enable puppet and run it, verifying that all the daemons come up as expected. The wmf-auto-reimage script will then complete and report success on this task.
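A final sanity check could look like this (a sketch; unit names taken from the stop commands above, and `sudo -u hdfs hdfs dfsadmin -report` can additionally confirm the datanode re-registered with the namenode):

```shell
# Sketch: confirm both Hadoop daemons are back after the puppet run
for unit in hadoop-hdfs-datanode hadoop-yarn-nodemanager; do
  if systemctl is-active --quiet "$unit" 2>/dev/null; then
    echo "${unit}: running"
  else
    echo "${unit}: NOT running - check the puppet run and daemon logs"
  fi
done
```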