I'd like to reduce the number of moving parts for the upcoming main Kafka cluster upgrade. It will be easier to manage this upgrade if we switch as many things as we can before the actual Kafka version upgrade.
This task is about upgrading to Debian Stretch and Java 8.
# Procedure
Go to https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=Kafka+Broker+Under+Replicated+Partitions and schedule downtime for the other brokers in the cluster you are working on; they will likely alert on under replicated partitions while this broker is offline.
```
# On einsteinium
sudo icinga-downtime -d 3600 -r "prep for reimage" -h kafka2001
# On the host
sudo puppet agent --disable "$USER - reimage"
sudo depool
sudo service kafka stop
sudo service eventlogging-service-eventbus stop
# On neodymium
sudo -i wmf-auto-reimage -p T192832 kafka2001.codfw.wmnet
...
```
Log into the host's mgmt interface, attach to console com2, and wait for the installer prompt to do manual partitioning.
/ should be a 50GB ext4 RAID10 across sd[abcd]1, and /srv should be left alone. Choose Manual Partitioning, then set the first RAID10 device to be used as ext4, mounted as root (/), and formatted. Do not format the /srv device.
The first puppet run will likely fail. We need to re-mount /srv and chown /srv/kafka files.
```
# On the host
sudo puppet agent --disable "$USER - /srv fix step"
# Puppet will have ensured files and directories in unmounted /srv directory, we can delete these.
sudo rm -rf /srv/*
# Put /srv in fstab. Append via 'sudo tee -a': a plain '>> /etc/fstab' redirect
# would be opened by the calling user's shell, not by root, and fail.
sudo blkid | grep md1 | awk '{print $2" "$1}' | sed -e 's/[:"]//g' | while read uuid partition; do
    echo -e "$uuid\t/srv\text4\tdefaults,noatime,data=writeback,nobh,delalloc\t0\t2";
done | sudo tee -a /etc/fstab
# mount md1 as srv
sudo mount /srv
# remove possibly poorly chowned log files from /srv/log/eventlogging
sudo rm /srv/log/eventlogging/*.log
# Puppet usually ensures that kafka user is created, but puppet hasn't run successfully yet.
# Create the user manually so we can chown /srv/kafka to the new kafka uid.
sudo adduser --system --home /nonexistent --shell /bin/false --no-create-home --gecos 'Apache Kafka' --group kafka
# Make sure files are owned by kafka uid.
ls -ld /srv/kafka/data
# If this is owned by 'kafka:kafka', then the user added above was given the same uid
# it had before the reinstall. You can skip the next step.
# If /srv/kafka/data is owned by a numeric uid, then you need to run:
sudo chown -R kafka:kafka /srv/kafka/*
# Manually download and install kafka 0.9.0.1. We'll be upgrading this to 1.x soon.
wget https://apt.wikimedia.org/wikimedia/pool/thirdparty/c/confluent-kafka/confluent-kafka-2.11.7_0.9.0.1-1_all.deb
sudo dpkg -i confluent-kafka-2.11.7_0.9.0.1-1_all.deb
```
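Before running puppet again, it may be worth confirming that the /srv fix steps above took effect. The following is a minimal sketch, not part of the original procedure: `fstab_entries` and `owned_by` are hypothetical helper names, and the check assumes GNU `stat` and `mountpoint` are available on the host.

```shell
# Hypothetical helpers for sanity-checking the /srv fix step.

# Print the number of non-comment fstab entries whose mount point
# (second field) equals $1. $2 is the fstab path, default /etc/fstab.
fstab_entries() {
    awk -v mp="$1" '$1 !~ /^#/ && $2 == mp { n++ } END { print n + 0 }' "${2:-/etc/fstab}"
}

# Succeed if path $1 is owned by user $2.
# GNU stat prints UNKNOWN for a uid that has no passwd entry, so a
# numeric-only owner never matches a user name.
owned_by() {
    [ "$(stat -c '%U' "$1")" = "$2" ]
}

# Example use on the reimaged host:
#   [ "$(fstab_entries /srv)" -eq 1 ] && mountpoint -q /srv \
#       && owned_by /srv/kafka/data kafka && echo "ok to run puppet"
```

A count other than 1 from `fstab_entries /srv` would suggest the fstab append either did not happen or was run more than once.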
Re-enable puppet (it was disabled for the /srv fix step) and run it, then make sure Kafka and eventbus come back online. Once everything has settled, repool the host:
```
sudo pool
```
Wait for all Kafka topic partitions to have full ISRs, then run `kafka preferred-replica-election`.
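One way to tell when the ISRs are full is to compare each partition's `Isr:` list against its `Replicas:` list in the topic description output. The sketch below assumes the usual `kafka-topics --describe` line layout (`Topic: … Partition: … Replicas: … Isr: …`); `under_replicated` is a hypothetical helper name, and invoking the describe command as `kafka topics --describe` through the local wrapper is also an assumption.

```shell
# Hypothetical helper: read topic description lines on stdin and print
# "<topic> <partition>" for every partition whose ISR list has a
# different number of brokers than its replica list.
under_replicated() {
    awk '
    /Partition:/ {
        topic = ""; part = ""; replicas = ""; isr = ""
        for (i = 1; i < NF; i++) {
            if ($i == "Topic:")     topic    = $(i + 1)
            if ($i == "Partition:") part     = $(i + 1)
            if ($i == "Replicas:")  replicas = $(i + 1)
            if ($i == "Isr:")       isr      = $(i + 1)
        }
        # Compare the number of comma-separated broker ids in each list.
        if (split(replicas, r, ",") != split(isr, s, ",")) print topic, part
    }'
}

# Example use (command name is an assumption):
#   kafka topics --describe | under_replicated
# Empty output means every partition has a full ISR.
```

Upstream `kafka-topics` also accepts `--under-replicated-partitions` together with `--describe`, which filters on the broker side; the helper above is only useful when post-processing saved output.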
Once all nodes have been reimaged, revert https://gerrit.wikimedia.org/r/#/c/429218/.