
Reimage a Trusty Hadoop worker to Debian jessie
Closed, Resolved (Public)

Description

Test one Hadoop worker in labs with Debian Jessie to make sure that everything works as expected, then reimage a single Hadoop worker (not a journal node). This should give us a good indication of whether the new nodes that we are buying will be able to run Debian, and whether everything will work fine or some work is needed (e.g. with systemd units).

Event Timeline

Change 340980 had a related patch set uploaded (by elukey):
[operations/puppet] Allow analytics1040 to be reimaged with Debian Jessie

https://gerrit.wikimedia.org/r/340980

I just checked the labs instance cdh3-5.eqiad.wmlabs (analytics project) and everything seems to be working fine: no degraded state in systemctl, all the Hadoop units seem to work, the daemons are up and running, and there are no weird errors in their log files.
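The "no degraded state" check can be scripted. Below is a minimal sketch, assuming `systemctl is-system-running` is available on the instance (it prints a single state word such as `running` or `degraded`). The helper names `classify_state` and `system_state` are illustrative only and are not part of any existing tooling mentioned in this task.

```python
import shutil
import subprocess

# The only state treated as healthy here: boot finished and no failed units.
HEALTHY_STATES = {"running"}


def classify_state(raw: str) -> bool:
    """Return True if the reported system state counts as healthy."""
    return raw.strip().lower() in HEALTHY_STATES


def system_state() -> str:
    """Ask systemd for the overall state; return 'unknown' if systemctl is missing."""
    if shutil.which("systemctl") is None:
        return "unknown"
    # is-system-running exits non-zero for any state other than "running",
    # so the exit code is deliberately not checked here.
    result = subprocess.run(
        ["systemctl", "is-system-running"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return result.stdout.strip() or "unknown"
```

Calling `classify_state(system_state())` on the host gives a single boolean. The per-unit and log checks described above would still need to be done separately.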

analytics1040 seems to be a good candidate for the reimage.

Change 340980 merged by Elukey:
[operations/puppet] Allow analytics1040 to be reimaged with Debian Jessie

https://gerrit.wikimedia.org/r/340980

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1040.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703061103_elukey_13967.log.

Completed auto-reimage of hosts:

['analytics1040.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1040.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703061204_elukey_15938.log.

Completed auto-reimage of hosts:

['analytics1040.eqiad.wmnet']

and were ALL successful.

Change 341318 had a related patch set uploaded (by elukey):
[operations/puppet] Fix partman recipe for Analytics Hadoop worker nodes

https://gerrit.wikimedia.org/r/341318

Change 341318 merged by Elukey:
[operations/puppet] Fix partman recipe for Analytics Hadoop worker nodes

https://gerrit.wikimedia.org/r/341318

Change 341337 had a related patch set uploaded (by elukey):
[operations/puppet] Rework analytics-flex partman recipe

https://gerrit.wikimedia.org/r/341337

Change 341337 merged by Elukey:
[operations/puppet] Rework analytics-flex partman recipe

https://gerrit.wikimedia.org/r/341337

Change 341553 had a related patch set uploaded (by elukey):
[operations/puppet] Replace the journal volume name with unused in analytics-flex.cfg

https://gerrit.wikimedia.org/r/341553

Change 341553 merged by Ottomata:
[operations/puppet] Replace the journal volume name with unused in analytics-flex.cfg

https://gerrit.wikimedia.org/r/341553

Change 341765 had a related patch set uploaded (by Elukey):
[operations/puppet] Allow analytics1041 to be reimaged with Debian Jessie

https://gerrit.wikimedia.org/r/341765

Change 341765 merged by Elukey:
[operations/puppet] Allow analytics1041 to be reimaged with Debian Jessie

https://gerrit.wikimedia.org/r/341765

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1041.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703081012_elukey_30503.log.

Completed auto-reimage of hosts:

['analytics1041.eqiad.wmnet']

and were ALL successful.

Summary of today: I reimaged analytics1041 with the analytics-flex.cfg partman recipe, which does not mention any of the disks from /dev/sdb to /dev/sdm (it only modifies the root, adding LVM volumes). Once booted into Jessie, fdisk showed each partition from /dev/sdb1 to /dev/sdm1 as "Basic MS partition", so I ran gdisk /dev/sdXY on all the partitions and changed the partition type to Linux (blkid /dev/sdXY was showing ext4 as the type). This allowed me to mount the partitions correctly, but only after I manually created the /var/lib/hadoop/data directory (maybe it needs to be ensured via puppet).
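The per-disk work described above could be planned with a short script before running anything by hand. This is only a sketch under stated assumptions. It swaps the interactive gdisk session for the non-interactive `sgdisk --typecode=1:8300 <disk>` call (type code 8300 is "Linux filesystem"). It also assumes that each disk holds a single data partition and that /dev/sdX1 maps to a mount point named after the disk letter under /var/lib/hadoop/data. The `DiskStep` and `build_plan` names are hypothetical. The script only prints the plan; it does not modify any disk.

```python
import string
from typing import List, NamedTuple


class DiskStep(NamedTuple):
    disk: str              # whole-disk device, e.g. /dev/sdb
    partition: str         # data partition, assumed to be partition 1
    mount_point: str       # assumed layout: <data_root>/<disk letter>
    retype_cmd: List[str]  # command that would set the GPT type to Linux


def build_plan(first: str = "b", last: str = "m",
               data_root: str = "/var/lib/hadoop/data") -> List[DiskStep]:
    """Build one step per disk from /dev/sd<first> to /dev/sd<last>, inclusive."""
    letters = string.ascii_lowercase
    selected = letters[letters.index(first):letters.index(last) + 1]
    plan = []
    for letter in selected:
        disk = "/dev/sd" + letter
        plan.append(DiskStep(
            disk=disk,
            partition=disk + "1",
            mount_point=data_root + "/" + letter,
            # 8300 is the gdisk/sgdisk code for "Linux filesystem".
            retype_cmd=["sgdisk", "--typecode=1:8300", disk],
        ))
    return plan


# Dry run: show what would be done, one line per disk.
for step in build_plan():
    print(" ".join(step.retype_cmd), "# then mount", step.partition,
          "on", step.mount_point)
```

Keeping the plan as data makes it easy to review the device list before touching any disk, which matters here because the goal is to preserve the datanode partitions.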

I ran puppet again and hit another weird issue: the ownership of the /var/lib/hadoop/data/{b,c...}/hdfs/etc.. files was set to prometheus:jmxtrans. I did not find proof, but the uid/gid stored in the inodes probably got "assigned" to a new user/group combination after reinstalling the OS. I chowned all the hdfs files to hdfs:hdfs and ran puppet again, finally getting the datanode up and running.
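Inodes store numeric uid/gid values rather than names, so a fresh install that allocates those numbers to different accounts makes old files appear to belong to someone else. A minimal sketch of how the affected files could be listed and then re-owned is shown below. The function names are hypothetical, the `hdfs:hdfs` default comes from the comment above, and `dry_run=True` is the default so that nothing is changed until the list has been reviewed.

```python
import grp
import os
import pwd
from typing import Iterator, List, Tuple


def find_foreign_owned(root: str, uid: int, gid: int) -> Iterator[Tuple[str, int, int]]:
    """Yield (path, st_uid, st_gid) for root and everything below it
    whose numeric owner or group differs from uid:gid."""
    def check(path: str) -> Iterator[Tuple[str, int, int]]:
        st = os.lstat(path)  # lstat: inspect symlinks themselves, not their targets
        if st.st_uid != uid or st.st_gid != gid:
            yield (path, st.st_uid, st.st_gid)

    yield from check(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield from check(os.path.join(dirpath, name))


def fix_ownership(root: str, user: str = "hdfs", group: str = "hdfs",
                  dry_run: bool = True) -> List[str]:
    """Re-own mismatched entries to user:group; return the affected paths."""
    uid = pwd.getpwnam(user).pw_uid   # raises KeyError if the user does not exist
    gid = grp.getgrnam(group).gr_gid  # raises KeyError if the group does not exist
    affected = []
    for path, _old_uid, _old_gid in find_foreign_owned(root, uid, gid):
        if not dry_run:
            os.lchown(path, uid, gid)  # requires root privileges
        affected.append(path)
    return affected
```

For example, `fix_ownership("/var/lib/hadoop/data/b/hdfs")` would only report the paths that need changing. Passing `dry_run=False` as root would apply the change, roughly equivalent to a recursive chown.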

The procedure can be automated a bit more, but it is still hackish. I think that we can instruct partman to "keep" the /dev/sdX partitions, avoiding the gdisk step (but probably not the permission part).

Overall I think it was a success: analytics1041 was moved to Debian without wiping the datanode partitions.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1045.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703291303_elukey_10791.log.

Completed auto-reimage of hosts:

['analytics1045.eqiad.wmnet']

and were ALL successful.