
Upgrade to Stretch and Java 8 for Kafka main cluster
Closed, Resolved · Public · 8 Story Points

Description

I'd like to reduce the number of moving parts for the upcoming main Kafka cluster upgrade. It will be easier to manage this upgrade if we switch as many things as we can before the actual Kafka version upgrade.

This task is about upgrading to Debian Stretch and Java 8.

Procedure

Go to https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=Kafka+Broker+Under+Replicated+Partitions and schedule downtime for other brokers in the cluster you are working on.

# On einsteinium
sudo icinga-downtime -d 3600 -r "prep for reimage" -h kafka2001 


# On the host
sudo puppet agent --disable "$USER - reimage"
sudo depool
sudo service eventlogging-service-eventbus stop
sudo service kafka-mirror-main-codfw_to_main-eqiad@0 stop
sudo service kafka stop

# On neodymium
sudo -i wmf-auto-reimage -p T192832 kafka2001.codfw.wmnet
...

Log into the host's mgmt interface console (com2) and wait for the installer prompt to do manual partitioning.
/ should be a 50GB ext4 RAID10 across sd[abcd]1, and /srv should be left alone. Choose Manual Partitioning, make the first RAID10 device ext4, use it as root (/), and format the partition.

The first puppet run will likely fail. We need to re-mount /srv and chown the /srv/kafka files.

# On the host
sudo puppet agent --disable "$USER - /srv fix step"

# Puppet will have created files and directories in the unmounted /srv directory; we can delete these.
sudo rm -rf /srv/*

# Put /srv back in fstab. Note: the append must run as root, so pipe
# through `sudo tee -a` rather than using an unprivileged `>>` redirect.
sudo blkid | grep md1 | awk '{print $2" "$1}' | sed -e 's/[:"]//g' | while read -r uuid partition; do
    echo -e "$uuid\t/srv\text4\tdefaults,noatime,data=writeback,nobh,delalloc\t0\t2"
done | sudo tee -a /etc/fstab
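The fstab-line pipeline above can be dry-run against canned blkid output before anything is appended to the real /etc/fstab. The device name and UUID below are made up for illustration:

```shell
# Dry run of the fstab-line generation: feed a fake blkid line through the
# same grep/awk/sed/while pipeline and print the resulting fstab entry.
sample_blkid='/dev/md1: UUID="abc-123" TYPE="ext4"'
echo "$sample_blkid" | grep md1 | awk '{print $2" "$1}' | sed -e 's/[:"]//g' \
  | while read -r uuid partition; do
      printf '%s\t/srv\text4\tdefaults,noatime,data=writeback,nobh,delalloc\t0\t2\n' "$uuid"
    done
```

If the printed line looks right (UUID first, then /srv and the mount options), the same pipeline is safe to run against the real blkid output.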

# mount md1 as srv
sudo mount /srv

# Chown /srv/log/eventlogging
sudo chown -R eventlogging:eventlogging /srv/log/eventlogging

# Puppet usually ensures that kafka user is created, but puppet hasn't run successfully yet.
# Create the user manually so we can chown /srv/kafka to the new kafka uid.
sudo adduser --system --home /nonexistent --shell /bin/false --no-create-home --gecos 'Apache Kafka' --group kafka


# Make sure files are owned by kafka uid.
ls -ld /srv/kafka/data

# If this is owned by 'kafka:kafka', then the user added above was given the same uid
# it had before the reinstall.  You can skip the next step.

# If /srv/kafka/data is owned by a numeric uid, then you need to run:
sudo chown -R kafka:kafka /srv/kafka/*
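The ownership check above can be scripted. A minimal sketch, with an illustrative default path (on the broker you would point it at /srv/kafka/data); note that some GNU stat versions print UNKNOWN rather than a numeric uid for an unmapped owner, so treat this as a sketch:

```shell
# Decide whether the recursive chown is needed: if the owner of the path
# resolves to a user name (e.g. 'kafka'), the new kafka user got the old
# uid back; if it is a bare numeric uid, the chown is required.
path=${1:-.}                  # on the broker: /srv/kafka/data
owner=$(stat -c %U "$path")   # GNU stat: owner name, if resolvable
case "$owner" in
    *[!0-9]*) echo "owner is '$owner' (a name); chown not needed" ;;
    *)        echo "owner is numeric uid $owner; run the chown" ;;
esac
```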

# Manually download and install Kafka 0.9.0.1.  We'll be upgrading this to 1.x soon.
wget https://apt.wikimedia.org/wikimedia/pool/thirdparty/c/confluent-kafka/confluent-kafka-2.11.7_0.9.0.1-1_all.deb
sudo dpkg -i confluent-kafka-2.11.7_0.9.0.1-1_all.deb

Run puppet and make sure Kafka and eventbus come back online. Once everything has settled, repool the host:

sudo pool

Wait for all Kafka topic partitions to have full ISRs, then run kafka preferred-replica-election.
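A hedged sketch of that wait-then-elect logic. The parsing helper is hypothetical; on the brokers the real check would feed it the output of the WMF `kafka topics --describe --under-replicated-partitions` wrapper instead of canned input:

```shell
# Counts under-replicated partition lines from a
# `--describe --under-replicated-partitions` style report on stdin;
# empty output means every partition has a full ISR.
under_replicated_count() {
    grep -c . || true
}

# Demonstrated on canned (empty) output rather than a live cluster:
count=$(printf '' | under_replicated_count)
echo "under-replicated partitions: $count"
if [ "$count" -eq 0 ]; then
    echo "ISRs full; safe to run: kafka preferred-replica-election"
fi
```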

(once all nodes have been reimaged, revert https://gerrit.wikimedia.org/r/#/c/429218/)

Event Timeline

Ottomata triaged this task as Normal priority. Apr 23 2018, 6:50 PM
Ottomata created this task.

Change 428514 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Only require specific version of python-tornado in < stretch

https://gerrit.wikimedia.org/r/428514

Change 428514 merged by Ottomata:
[operations/puppet@production] Only require specific version of python-tornado in < stretch

https://gerrit.wikimedia.org/r/428514

Change 428517 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Move eventbus in deployment-prep to new stretch server

https://gerrit.wikimedia.org/r/428517

Change 428519 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Move eventbus in deployment-prep to new stretch server

https://gerrit.wikimedia.org/r/428519

Change 428517 merged by Ottomata:
[operations/mediawiki-config@master] Move eventbus in deployment-prep to new stretch server

https://gerrit.wikimedia.org/r/428517

Change 428519 merged by Ottomata:
[operations/puppet@production] Move eventbus in deployment-prep to new stretch server

https://gerrit.wikimedia.org/r/428519

Note: in netboot.cfg, the kafka[12]00[123] hosts are set to use raid10-gpt-srv-ext4.cfg, which as-is formats the /srv partition. If we want to keep the data there, we should either temporarily remove the rule (so a reimage should lead to the d-i prompt for manual settings) or add a more conservative partman recipe that keeps /srv.
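The selection logic in netboot.cfg is essentially a hostname-to-recipe case statement. A simplified stand-in (the real file lives in operations/puppet and differs in detail; only the recipe name is taken from the task):

```shell
# Simplified stand-in for the netboot.cfg mapping: hosts matching the
# pattern get the partman recipe that (re)formats /srv; removing the
# rule falls through to manual d-i partitioning.
host=kafka2001
case "$host" in
    kafka[12]00[123]) echo "partman/raid10-gpt-srv-ext4.cfg" ;;
    *)                echo "no recipe; d-i prompts for manual partitioning" ;;
esac
```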

Change 428575 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set PXE boot to Debian Stretch for kafka[12]00[123]

https://gerrit.wikimedia.org/r/428575

Since we are doing this work, I'd also add the interface::add_ip6_mapped { 'main': } puppet config to site.pp and the related AAAA records to the DNS repo.
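In site.pp that would look roughly like this; the node selector is hypothetical, so check the real role blocks before copying:

```puppet
# Hypothetical site.pp fragment; the node regex is illustrative only.
node /^kafka[12]00[123]\.(eqiad|codfw)\.wmnet$/ {
    interface::add_ip6_mapped { 'main': }
}
```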

Change 428575 merged by Ottomata:
[operations/puppet@production] Set PXE boot to Debian Stretch for kafka[12]00[123]

https://gerrit.wikimedia.org/r/428575

Change 428926 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/dns@master] Add IPv6 entries for kafka[12]00[123]

https://gerrit.wikimedia.org/r/428926

Change 428928 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add add_ip6_mapped to main-codfw hosts.

https://gerrit.wikimedia.org/r/428928

Change 428928 merged by Ottomata:
[operations/puppet@production] Add add_ip6_mapped to main-codfw hosts.

https://gerrit.wikimedia.org/r/428928

Change 428963 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add add_ip6_mapped to kafka100*

https://gerrit.wikimedia.org/r/428963

Change 428963 merged by Ottomata:
[operations/puppet@production] Add add_ip6_mapped to kafka100*

https://gerrit.wikimedia.org/r/428963

Change 428926 merged by Ottomata:
[operations/dns@master] Add IPv6 entries for kafka[12]00[123]

https://gerrit.wikimedia.org/r/428926

Ok, I think we are ready to do this! If there are no objections, I'll start on codfw tomorrow.

Change 429218 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Temporarly remove partman recipe for kafka main hosts

https://gerrit.wikimedia.org/r/429218

Change 429218 merged by Ottomata:
[operations/puppet@production] Temporarly remove partman recipe for kafka main hosts

https://gerrit.wikimedia.org/r/429218

fdans moved this task from Incoming to Kafka Work on the Analytics board. Apr 26 2018, 4:35 PM
Ottomata updated the task description. Apr 26 2018, 6:10 PM

Mentioned in SAL (#wikimedia-operations) [2018-04-26T18:12:02Z] <ottomata> reimaging (some?) kafka200* codfw main kafka nodes to stretch T192832

Ottomata updated the task description. Apr 26 2018, 6:15 PM
Ottomata updated the task description.
Ottomata updated the task description.
Ottomata updated the task description. Apr 26 2018, 6:17 PM

Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['kafka2001.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201804261822_otto_23749.log.

Change 429262 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use /etc/prometheus as config_dir for kafka broker jmx exporter

https://gerrit.wikimedia.org/r/429262

Change 429262 merged by Ottomata:
[operations/puppet@production] Use /etc/prometheus as config_dir for kafka broker jmx exporter

https://gerrit.wikimedia.org/r/429262

Ottomata updated the task description. Apr 26 2018, 7:23 PM
Ottomata updated the task description. Apr 26 2018, 7:30 PM

Completed auto-reimage of hosts:

['kafka2001.codfw.wmnet']

and were ALL successful.

Alright, kafka2001 is now Stretch. Waiting until Monday to proceed with more.

Mentioned in SAL (#wikimedia-operations) [2018-04-30T13:21:08Z] <ottomata> beginning rolling reimage of kafka200[23] to stretch T192832

Ottomata updated the task description. Apr 30 2018, 1:23 PM

Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['kafka2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201804301326_otto_32348.log.

Ottomata updated the task description. Apr 30 2018, 1:34 PM

Completed auto-reimage of hosts:

['kafka2002.codfw.wmnet']

and were ALL successful.

Ottomata updated the task description. Apr 30 2018, 2:02 PM

Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['kafka2003.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201804301419_otto_14444.log.

Completed auto-reimage of hosts:

['kafka2003.codfw.wmnet']

and were ALL successful.

Ottomata updated the task description. Apr 30 2018, 3:00 PM

Done for main-codfw. Will proceed with main-eqiad this afternoon.

Mentioned in SAL (#wikimedia-operations) [2018-04-30T18:16:09Z] <ottomata> starting rolling reimage of kafka main-eqiad brokers kafka100[123] - T192832

Ottomata updated the task description. Apr 30 2018, 6:19 PM

Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['kafka1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201804301821_otto_1853.log.

Completed auto-reimage of hosts:

['kafka1001.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['kafka1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201804301910_otto_13228.log.

Completed auto-reimage of hosts:

['kafka1002.eqiad.wmnet']

and were ALL successful.

only kafka1003 remains...will do tomorrow.

Script wmf-auto-reimage was launched by otto on neodymium.eqiad.wmnet for hosts:

['kafka1003.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201805011328_otto_19208.log.

Completed auto-reimage of hosts:

['kafka1003.eqiad.wmnet']

and were ALL successful.

Ottomata moved this task from In Progress to Done on the Analytics-Kanban board. May 1 2018, 2:18 PM
Nuria closed this task as Resolved. May 8 2018, 10:44 PM