Audit fstabs on Kafka and Hadoop nodes to use UUIDs instead of /dev paths
Closed, Resolved · Public · 8 Story Points

Description

Once done, also edit the Administration pages for Hadoop and Kafka that contain scripts for creating new nodes and partitions, so that fstab gets populated with UUIDs.

Ottomata created this task. Oct 11 2016, 6:54 PM
Restricted Application added a subscriber: Aklapper. Oct 11 2016, 6:54 PM
elukey triaged this task as High priority. Oct 18 2016, 4:19 PM

Did some Kafka reboots today and ran into the issue where /dev device names get reassigned after a reboot. I had to manually switch fstab to UUIDs and reboot again to fix it. I then edited fstab to add UUIDs on every other broker as I rebooted it. The ones I did first still need their fstabs edited.

I did:

Ran blkid, then copied the UUIDs into fstab for the matching partitions. Then umount /var/spool/kafka/* and mount -a, then check mount -v | grep kafka | sort to see if things look right.
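
The edit step above can also be expressed as a filter, which makes it easier to review the result before touching /etc/fstab. This is a minimal sketch rather than what was actually run on the brokers: the helper name fstab_dev_to_uuid and the map file format (one "<device> <uuid>" pair per line, e.g. derived from blkid output) are assumptions made up for illustration.

```shell
# Hypothetical helper: rewrite the first field of fstab lines from a /dev path
# to UUID=<uuid>, using a map file with one "<device> <uuid>" pair per line.
# Reads an fstab on stdin and writes the rewritten fstab on stdout.
# Caveat: awk rebuilds a changed line with single spaces between fields.
fstab_dev_to_uuid() {
    map_file=$1
    awk -v map="$map_file" '
        BEGIN {
            # Load the device -> UUID mapping.
            while ((getline line < map) > 0) {
                split(line, kv, " ")
                uuid[kv[1]] = kv[2]
            }
        }
        # Leave comments and blank lines untouched.
        /^[ \t]*#/ || NF == 0 { print; next }
        # Swap the device path for its UUID when we know it.
        ($1 in uuid) { $1 = "UUID=" uuid[$1] }
        { print }
    '
}
```

Usage would be something like fstab_dev_to_uuid map.txt < /etc/fstab > fstab.new, followed by a manual diff against /etc/fstab before copying it into place and running the umount / mount -a check described above.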

kafka1013, kafka1018, kafka1020 and kafka1022 have had fstabs edited with UUIDs. kafka1012 and kafka1014 still need to be done.

elukey added a comment. Edited Nov 9 2016, 10:24 AM

Quick script to match UUID and mountpoint:

elukey@kafka1012:~$ cat /proc/mounts | grep spool | cut -d " " -f 1,2 | sort | while read partition mountpoint; do uuid=$(sudo blkid | grep $partition | cut -d " " -f 2| sed -e 's/"//g'); echo -e "$uuid\t$mountpoint" ; done

UUID=296c194d-2724-4ac1-a355-70e63d1580d7	/var/spool/kafka/a
UUID=25239f66-549f-4bee-9c25-adf9dc8d5ff4	/var/spool/kafka/b
UUID=efb2fbd4-37da-434f-82a5-4b78350b8466	/var/spool/kafka/c
UUID=b7de45ba-25b0-49de-80bf-e5484ed10542	/var/spool/kafka/d
UUID=47f6fe0c-2cf6-4df3-932d-9881571feaae	/var/spool/kafka/e
UUID=24a0117e-e24b-488c-8786-11b39d36cca6	/var/spool/kafka/f
UUID=c65f08a6-5bda-42b8-9436-0a15ca138426	/var/spool/kafka/g
UUID=139db99e-886d-41e3-bac4-bdefc4fdab7b	/var/spool/kafka/h
UUID=2ddaa0b6-715f-4f17-9340-4b2f94c996b7	/var/spool/kafka/i
UUID=457cdef8-ece1-42f5-984e-367b45224153	/var/spool/kafka/j
UUID=377e7539-ac72-4d46-be61-77c9d6e3aacd	/var/spool/kafka/k
UUID=23313468-69e2-4e55-a1e2-0cb472e97cd2	/var/spool/kafka/l
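
One caveat with the one-liner above: grep $partition is a substring match, so /dev/sdb1 would also match a /dev/sdb10 line, and cut -f 2 assumes UUID is the second field of the blkid line, which does not hold when a LABEL is printed first. Neither seems to bite on these brokers, but a stricter lookup could look like the sketch below. The function name uuid_for_device is hypothetical, and it only parses blkid's default output format fed on stdin. (blkid -s UUID -o value <device> should give the bare UUID for a single device directly, if the installed util-linux supports it.)

```shell
# Hypothetical helper: print the UUID=... token for exactly one device,
# reading `blkid` output (default format) on stdin.
uuid_for_device() {
    awk -v dev="$1" '
        # Exact match on the device field, so /dev/sdb1 does not match /dev/sdb10.
        # The leading space in the pattern keeps PARTUUID="..." from matching.
        $1 == dev ":" && match($0, / UUID="[^"]*"/) {
            # Skip the 7 characters of [space]UUID=" and drop the closing quote.
            print "UUID=" substr($0, RSTART + 7, RLENGTH - 8)
        }
    '
}
```

For example: sudo blkid | uuid_for_device /dev/sdb1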

Validated on kafka1018:

elukey@kafka1018:~$ comm -12 <(cat /proc/mounts | grep spool | cut -d " " -f 1,2 | sort | while read partition mountpoint; do uuid=$(sudo blkid | grep $partition | cut -d " " -f 2| sed -e 's/"//g'); echo -e "$uuid\t$mountpoint" ; done | sort) <(cat /etc/fstab | grep spool |sort| awk '{print $1"\t"$2}')
UUID=0e4d589d-be06-4782-a292-a36de30117c6	/var/spool/kafka/k
UUID=1ea3a865-c906-4551-ab1e-70ae0ae27155	/var/spool/kafka/i
UUID=46bfcaa4-5746-4809-8075-209a20ec26d8	/var/spool/kafka/c
UUID=5cf6b088-d57e-48c8-a65a-a78cced66f6e	/var/spool/kafka/a
UUID=67a74799-0422-49b1-b79b-fd73ce0d1186	/var/spool/kafka/b
UUID=6b11de43-0a83-425e-9c22-eddc46530e71	/var/spool/kafka/l
UUID=7badbc40-0af6-491a-a366-495c19b6dd6c	/var/spool/kafka/h
UUID=b19afeae-6304-4547-aeeb-5bae57cc7a71	/var/spool/kafka/j
UUID=c63e2368-6896-4113-a315-83e6a5f7bb64	/var/spool/kafka/g
UUID=d301b1a1-c355-4b6c-bbf4-fd1a510ab598	/var/spool/kafka/f
UUID=ddd21113-abfa-4481-847d-489d1402edbb	/var/spool/kafka/d
UUID=e8c9e786-3c16-4430-94ff-73ff2c217ec2	/var/spool/kafka/e
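
Note that comm -12 prints the lines common to both inputs, so the check above passes when all 12 partitions are listed, and a missing entry only shows up as a shorter list. A sketch of the inverse check, printing only what is missing, could look like this. The function name missing_from_fstab and the expected-file format (UUID and mountpoint separated by whitespace, as emitted by the quick script) are assumptions for illustration.

```shell
# Hypothetical helper: print "<UUID>\t<mountpoint>" pairs from the expected
# list that have no identical entry in the given fstab file.
# No output means every expected pair is present in fstab.
missing_from_fstab() {
    expected_file=$1
    fstab_file=$2
    awk '
        # First file (fstab): remember device/mountpoint pairs, skipping comments.
        FILENAME == ARGV[1] { if ($1 !~ /^#/) seen[$1 "\t" $2] = 1; next }
        # Second file (expected): report pairs fstab does not contain.
        NF >= 2 && !(($1 "\t" $2) in seen) { print $1 "\t" $2 }
    ' "$fstab_file" "$expected_file"
}
```

With the quick script's output saved to expected.txt, missing_from_fstab expected.txt /etc/fstab should print nothing on a correctly edited host.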

elukey added a comment. Nov 9 2016, 1:33 PM

cat /proc/mounts | grep spool | cut -d " " -f 1,2 | sort | while read partition mountpoint; do uuid=$(sudo blkid | grep $partition | cut -d " " -f 2| sed -e 's/"//g'); letter=$(echo $mountpoint | cut -d "/" -f 5); echo -e "# Kafka log partition $letter\n$uuid\t$mountpoint\text4\tdefaults,noatime,data=writeback,nobh,delalloc\t0\t2" ; done

This one emits the fstab chunk of text to insert, like:

# Kafka log partition a
UUID=296c194d-2724-4ac1-a355-70e63d1580d7	/var/spool/kafka/a	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2
# Kafka log partition b
UUID=25239f66-549f-4bee-9c25-adf9dc8d5ff4	/var/spool/kafka/b	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2
# Kafka log partition c
UUID=efb2fbd4-37da-434f-82a5-4b78350b8466	/var/spool/kafka/c	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2
# Kafka log partition d
UUID=b7de45ba-25b0-49de-80bf-e5484ed10542	/var/spool/kafka/d	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2
# Kafka log partition e
UUID=47f6fe0c-2cf6-4df3-932d-9881571feaae	/var/spool/kafka/e	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2
# Kafka log partition f
UUID=24a0117e-e24b-488c-8786-11b39d36cca6	/var/spool/kafka/f	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2
# Kafka log partition g
UUID=c65f08a6-5bda-42b8-9436-0a15ca138426	/var/spool/kafka/g	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2
# Kafka log partition h
UUID=139db99e-886d-41e3-bac4-bdefc4fdab7b	/var/spool/kafka/h	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2
# Kafka log partition i
UUID=2ddaa0b6-715f-4f17-9340-4b2f94c996b7	/var/spool/kafka/i	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2
# Kafka log partition j
UUID=457cdef8-ece1-42f5-984e-367b45224153	/var/spool/kafka/j	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2
# Kafka log partition k
UUID=377e7539-ac72-4d46-be61-77c9d6e3aacd	/var/spool/kafka/k	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2
# Kafka log partition l
UUID=23313468-69e2-4e55-a1e2-0cb472e97cd2	/var/spool/kafka/l	ext4	defaults,noatime,data=writeback,nobh,delalloc	0	2

Mentioned in SAL (#wikimedia-operations) [2016-11-09T13:35:29Z] <elukey> stopping kafka* daemons on kafka1012 to upgrade its fstab with UUID (T147879)

elukey added a comment. Edited Nov 9 2016, 1:53 PM

kafka1012 is up and running. To sanity check the Kafka spool mount points I did the following:

  1. for el in $(ls); do sudo find $el; done > /home/elukey/spools.txt BEFORE changing fstab
  2. for el in $(ls); do sudo find $el; done > /home/elukey/spools2.txt right after changing fstab and umount/mount -a.
  3. for el in $(ls); do sudo find $el; done > /home/elukey/spools3.txt after the reboot.

diff shows no change!
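
The three snapshots above can be sketched as one small helper. This is an illustrative sketch, not the exact commands used: snapshot_tree is a hypothetical name, it assumes the loops were run from /var/spool/kafka, and it sorts the listing so that diff does not depend on traversal order. It also assumes the Kafka daemons are stopped while comparing, otherwise new log segments would show up as differences.

```shell
# Hypothetical helper: list every path under a directory, sorted, so that two
# snapshots can be compared with diff regardless of traversal order.
snapshot_tree() {
    (cd "$1" && find . | LC_ALL=C sort)
}

# Example (run with sudo if the spool directories are not readable):
#   snapshot_tree /var/spool/kafka > spools.txt    # before changing fstab
#   snapshot_tree /var/spool/kafka > spools2.txt   # after umount / mount -a
#   diff spools.txt spools2.txt && echo "no change"
```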

Mentioned in SAL (#wikimedia-operations) [2016-11-09T13:58:38Z] <elukey> stopping kafka* daemons on kafka1014 to upgrade its fstab with UUID (T147879)

elukey added a comment. Edited Nov 9 2016, 2:51 PM

Next steps:

  1. Fix the fstabs of the Hadoop nodes.
  2. Update the documentation for new hosts to use UUIDs.

elukey lowered the priority of this task from High to Normal. Nov 9 2016, 2:51 PM
Nuria moved this task from Q1 (July 2017) to Q4 (April 2017) on the Analytics board.
elukey added a comment. Edited Jan 27 2017, 12:39 PM

Script adapted to a Hadoop node:

#!/bin/bash

set -e

# Root partition
cat /proc/mounts | grep "mapper" | grep "root" | cut -d " " -f 1,2 | sort | while read partition mountpoint; do uuid=$(sudo blkid | grep $partition | cut -d " " -f 2| sed -e 's/"//g'); echo -e "#Root partition\n$uuid\t$mountpoint\text4\terrors=remount-ro\t0\t1" ; done

# Journalnode partition
cat /proc/mounts | grep "journalnode" | cut -d " " -f 1,2 | sort | while read partition mountpoint; do uuid=$(sudo blkid | grep $partition | cut -d " " -f 2| sed -e 's/"//g'); echo -e "#Hadoop JournalNode partition\n$uuid\t$mountpoint\text4\tdefaults,noatime\t0\t2" ; done

# Swap partition
cat /etc/fstab | grep "mapper" | grep "swap" | cut -d " " -f 1,2 | sort | while read partition mountpoint; do uuid=$(sudo blkid | grep $partition | cut -d " " -f 2| sed -e 's/"//g'); echo -e "#Swap\n$uuid\tnone\tswap\tsw\t0\t0" ; done

# Hadoop partitions
cat /proc/mounts | grep "/var/lib/hadoop/data" | cut -d " " -f 1,2 | sort | while read partition mountpoint; do uuid=$(sudo blkid | grep $partition | cut -d " " -f 2| sed -e 's/"//g'); letter=$(echo $mountpoint | cut -d "/" -f 6); echo -e "# Hadoop data partition $letter\n$uuid\t$mountpoint\text4\tdefaults,noatime\t0\t2" ; done

elukey claimed this task. Jan 27 2017, 12:52 PM
elukey moved this task from Backlog to Analytics Backlog on the User-Elukey board.

Mentioned in SAL (#wikimedia-operations) [2017-01-31T13:58:59Z] <elukey> rebooted analytics1039 to pick up uuids in fstab - T147879

Results look good!

elukey@analytics1039:~$ cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=32953052k,nr_inodes=8238263,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=6592852k,mode=755 0 0
/dev/mapper/analytics1039--vg-root / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
none /sys/fs/cgroup tmpfs rw,relatime,size=4k,mode=755 0 0
none /sys/fs/fuse/connections fusectl rw,relatime 0 0
none /sys/kernel/debug debugfs rw,relatime 0 0
none /sys/kernel/security securityfs rw,relatime 0 0
none /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
none /run/shm tmpfs rw,nosuid,nodev,relatime 0 0
none /run/user tmpfs rw,nosuid,nodev,noexec,relatime,size=102400k,mode=755 0 0
none /sys/fs/pstore pstore rw,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /sys/fs/cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,relatime,hugetlb 0 0
/dev/sda1 /boot ext4 rw,relatime,data=ordered 0 0
/dev/sdb1 /var/lib/hadoop/data/b ext4 rw,noatime,data=ordered 0 0
/dev/sdk1 /var/lib/hadoop/data/k ext4 rw,noatime,data=ordered 0 0
/dev/sdl1 /var/lib/hadoop/data/l ext4 rw,noatime,data=ordered 0 0
/dev/sdm1 /var/lib/hadoop/data/m ext4 rw,noatime,data=ordered 0 0
/dev/sdc1 /var/lib/hadoop/data/c ext4 rw,noatime,data=ordered 0 0
/dev/sdd1 /var/lib/hadoop/data/d ext4 rw,noatime,data=ordered 0 0
/dev/sde1 /var/lib/hadoop/data/e ext4 rw,noatime,data=ordered 0 0
/dev/sdf1 /var/lib/hadoop/data/f ext4 rw,noatime,data=ordered 0 0
/dev/sdg1 /var/lib/hadoop/data/g ext4 rw,noatime,data=ordered 0 0
/dev/sdh1 /var/lib/hadoop/data/h ext4 rw,noatime,data=ordered 0 0
/dev/sdi1 /var/lib/hadoop/data/i ext4 rw,noatime,data=ordered 0 0
/dev/sdj1 /var/lib/hadoop/data/j ext4 rw,noatime,data=ordered 0 0
/dev/mapper/analytics1039--vg-journalnode /var/lib/hadoop/journal ext4 rw,noatime,data=ordered 0 0
systemd /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,name=systemd 0 0
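
/proc/mounts keeps showing kernel device names such as /dev/sdb1 even when fstab refers to UUIDs, so the thing to check in the output above is that every data directory is mounted. A minimal sketch of that check follows; unmounted_points is a hypothetical helper name, and it assumes mountpoints without spaces.

```shell
# Hypothetical helper: given a /proc/mounts-style listing on stdin, print every
# expected mountpoint (passed as arguments) that is not mounted.
# No output means all of them are mounted.
unmounted_points() {
    awk -v wanted="$*" '
        BEGIN { n = split(wanted, want, " ") }
        # Field 2 of /proc/mounts is the mountpoint.
        { mounted[$2] = 1 }
        END {
            for (i = 1; i <= n; i++)
                if (!(want[i] in mounted)) print want[i]
        }
    '
}
```

For the host above this could be run as: unmounted_points $(printf '/var/lib/hadoop/data/%s ' b c d e f g h i j k l m) /var/lib/hadoop/journal < /proc/mounts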

elukey edited projects, added Analytics-Kanban; removed Analytics. Jan 31 2017, 2:07 PM

@Ottomata - sanity check before applying it to other nodes? :)

Milimetric set the point value for this task to 8. Feb 6 2017, 4:45 PM
elukey moved this task from In Progress to Stalled on the User-Elukey board. Feb 10 2017, 4:32 PM

Fixed some little mistakes and added documentation to https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Worker_Nodes_.28DataNode_.26_NodeManager.29

We are going to reimage all the analytics hosts to Debian, so this task can be closed now.

elukey moved this task from Next Up to Done on the Analytics-Kanban board. Mar 8 2017, 10:15 AM
Nuria closed this task as Resolved. Mar 8 2017, 7:59 PM