Reimage a Trusty Hadoop worker to Debian jessie
Closed, ResolvedPublic

Assigned To

Authored By

	elukey
	Mar 3 2017, 2:14 PM

Description

Test one Hadoop worker in labs with Debian Jessie to make sure that everything works as expected, then reimage a single Hadoop worker (not a journal node). This should give us a good indication of whether the new nodes that we are buying will be able to run Debian, and whether everything works out of the box or some work is needed (e.g. with systemd units).

Details

Subject	Repo	Branch	Lines +/-
Allow analytics1041 to be reimaged with Debian Jessie	operations/puppet	production	+0 -2
Replace the journal volume name with unused in analytics-flex.cfg	operations/puppet	production	+5 -5
Rework analytics-flex partman recipe	operations/puppet	production	+25 -21
Fix partman recipe for Analytics Hadoop worker nodes	operations/puppet	production	+4 -3
Allow analytics1040 to be reimaged with Debian Jessie	operations/puppet	production	+0 -2

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		None	T157807 Reinstall Analytics Hadoop Cluster with Debian Jessie
		Resolved		elukey	T159530 Reimage a Trusty Hadoop worker to Debian jessie

Event Timeline

elukey created this task. Mar 3 2017, 2:14 PM

Change 340980 had a related patch set uploaded (by elukey):
[operations/puppet] Allow analytics1040 to be reimaged with Debian Jessie

https://gerrit.wikimedia.org/r/340980

gerritbot added a project: Patch-For-Review. Mar 3 2017, 2:20 PM

Just checked the labs instance (analytics project) cdh3-5.eqiad.wmlabs and everything seems to be working fine: systemctl does not report a degraded state, all the Hadoop units seem to work, the daemons are up and running, and there are no unusual errors in their log files.
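
A check like the one described above could be scripted roughly as follows. This is only a sketch: it relies on `systemctl is-system-running`, which prints states such as "running" or "degraded", and the helper names `system_state` and `is_healthy` are hypothetical. It does not inspect the Hadoop log files.

```python
import subprocess


def system_state(run=subprocess.run):
    """Return the overall systemd state, e.g. 'running' or 'degraded'.

    `run` is injectable so the function can be exercised without systemd.
    The exit code is deliberately ignored: systemctl exits non-zero
    whenever the state is anything other than 'running'.
    """
    result = run(
        ["systemctl", "is-system-running"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip()


def is_healthy(state):
    """Treat only the 'running' state as healthy (an assumption of this sketch)."""
    return state == "running"
```

When the state is "degraded", `systemctl --failed` lists the units that need attention.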

analytics1040 seems a good candidate for the reimage.

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Mar 3 2017, 2:38 PM

Change 340980 merged by Elukey:
[operations/puppet] Allow analytics1040 to be reimaged with Debian Jessie

https://gerrit.wikimedia.org/r/340980

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1040.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703061103_elukey_13967.log.

Completed auto-reimage of hosts:

['analytics1040.eqiad.wmnet']

and were ALL successful.

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1040.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703061204_elukey_15938.log.

Completed auto-reimage of hosts:

['analytics1040.eqiad.wmnet']

and were ALL successful.

Change 341318 had a related patch set uploaded (by elukey):
[operations/puppet] Fix partman recipe for Analytics Hadoop worker nodes

https://gerrit.wikimedia.org/r/341318

Change 341318 merged by Elukey:
[operations/puppet] Fix partman recipe for Analytics Hadoop worker nodes

https://gerrit.wikimedia.org/r/341318

Change 341337 had a related patch set uploaded (by elukey):
[operations/puppet] Rework analytics-flex partman recipe

https://gerrit.wikimedia.org/r/341337

Change 341337 merged by Elukey:
[operations/puppet] Rework analytics-flex partman recipe

https://gerrit.wikimedia.org/r/341337

Change 341553 had a related patch set uploaded (by elukey):
[operations/puppet] Replace the journal volume name with unused in analytics-flex.cfg

https://gerrit.wikimedia.org/r/341553

Change 341553 merged by Ottomata:
[operations/puppet] Replace the journal volume name with unused in analytics-flex.cfg

https://gerrit.wikimedia.org/r/341553

Ottomata moved this task from Backlog to Q4 2019/2020 on the Analytics-Clusters board. Mar 7 2017, 3:52 PM

elukey edited projects, added User-Elukey; removed Patch-For-Review. Mar 8 2017, 9:22 AM

Change 341765 had a related patch set uploaded (by Elukey):
[operations/puppet] Allow analytics1041 to be reimaged with Debian Jessie

https://gerrit.wikimedia.org/r/341765

gerritbot added a project: Patch-For-Review. Mar 8 2017, 9:22 AM

Change 341765 merged by Elukey:
[operations/puppet] Allow analytics1041 to be reimaged with Debian Jessie

https://gerrit.wikimedia.org/r/341765

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1041.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703081012_elukey_30503.log.

Completed auto-reimage of hosts:

['analytics1041.eqiad.wmnet']

and were ALL successful.

elukey moved this task from Backlog to In Progress on the User-Elukey board. Mar 8 2017, 1:14 PM

Summary of today. I reimaged analytics1041 with the analytics-flex.cfg partman recipe, which does not touch any of the disks from /dev/sdb to /dev/sdm (it only modifies the root disk, adding LVM volumes). Once the host booted into Jessie, fdisk showed each of /dev/sdb1 to /dev/sdm1 as "Basic MS partition", so I ran gdisk /dev/sdXY on all the partitions and changed the partition type to Linux (blkid /dev/sdXY was already showing ext4 as the type). This allowed me to mount the partitions correctly, but only after I manually created the /var/lib/hadoop/data directory (maybe it needs to be ensured via puppet).
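
The manual steps above could be scripted roughly as follows. This is a sketch, not what was actually run: it swaps the interactive gdisk session for sgdisk (its non-interactive companion) with type code 8300 (Linux filesystem), and it assumes that /dev/sdX1 is mounted on /var/lib/hadoop/data/X, which is inferred from the paths mentioned in this task. The function name `plan_disk_recovery` is hypothetical; it only builds the commands and executes nothing.

```python
import string


def plan_disk_recovery(first="b", last="m", data_root="/var/lib/hadoop/data"):
    """Build the per-disk commands as argv lists. Nothing is executed here."""
    letters = string.ascii_lowercase
    selected = letters[letters.index(first):letters.index(last) + 1]
    plan = []
    for letter in selected:
        disk = "/dev/sd" + letter
        part = disk + "1"
        mountpoint = data_root + "/" + letter  # assumed mount layout
        # Confirm the filesystem is still reported as ext4 before touching anything.
        plan.append(["blkid", part])
        # Set the type of partition 1 to 8300 (Linux filesystem).
        plan.append(["sgdisk", "--typecode=1:8300", disk])
        # The mount point has to exist before mounting.
        plan.append(["mkdir", "-p", mountpoint])
        plan.append(["mount", part, mountpoint])
    return plan


for argv in plan_disk_recovery():
    print(" ".join(argv))
```

Printing the plan first makes it possible to review every command before running anything against disks that still hold HDFS blocks.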

I ran puppet again and hit another weird issue: the files under /var/lib/hadoop/data/{b,c...}/hdfs/etc.. were owned by prometheus:jmxtrans. I did not find proof, but the uid/gid stored in the inodes was probably "assigned" to a new user/group combination after the OS was reinstalled. I chowned all the hdfs files to hdfs:hdfs and ran puppet again, finally getting the datanode up and running.
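
The ownership fix could be sketched as below. It follows the author's guess that the inodes kept their old numeric uid/gid while the reinstalled OS mapped those numbers to different names. The helper names `find_misowned` and `fix_ownership` are hypothetical, the hdfs:hdfs target comes from the comment above, and the root directory itself is not checked.

```python
import grp
import os
import pwd


def find_misowned(root, uid, gid):
    """Return paths under root whose numeric owner or group differs from uid/gid."""
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if st.st_uid != uid or st.st_gid != gid:
                mismatched.append(path)
    return mismatched


def fix_ownership(root, user="hdfs", group="hdfs", dry_run=True):
    """List (and, unless dry_run, chown) every path not owned by user:group."""
    uid = pwd.getpwnam(user).pw_uid
    gid = grp.getgrnam(group).gr_gid
    paths = find_misowned(root, uid, gid)
    if not dry_run:
        for path in paths:
            os.lchown(path, uid, gid)
    return paths
```

Calling `fix_ownership("/var/lib/hadoop/data/b/hdfs")` with the default `dry_run=True` only reports what would change, which is a reasonable first step on a datanode that still holds data.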

The procedure can be automated a bit more, but it is still hackish. I think that we can instruct partman to "keep" the /dev/sdX partitions, avoiding the gdisk step (but probably not the permission fix).

Overall I think it was a success: analytics1041 was moved to Debian without wiping the datanode partitions.

Just completed https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Worker_Nodes_.28DataNode_.26_NodeManager.29 and automated all the steps.

Nuria closed this task as Resolved. Mar 17 2017, 8:27 PM

Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts:

['analytics1045.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703291303_elukey_10791.log.

Completed auto-reimage of hosts:

['analytics1045.eqiad.wmnet']

and were ALL successful.
